
Blameless Postmortems: A Practical Process and Template for SRE Teams

Riku Tanaka · 8 min read

Incidents Are Data, Not Failures

Every incident is a free lesson in how your system actually behaves under stress. The postmortem is where you extract that lesson. But only if you do it right.

The Google SRE book is clear on this: a postmortem must be blameless. The moment you assign blame to a person, everyone in the room starts protecting themselves instead of sharing what they know. The quality of your root cause analysis drops to zero.

I've facilitated over a hundred postmortems across different organizations. The ones that produce real improvements all share the same structure. Let me walk through it.

When to Write a Postmortem

Not every incident needs a postmortem. Writing one for every blip creates fatigue, and the documents pile up unread. Write a postmortem when any of these conditions are met:

  • User-visible downtime exceeded your SLO error budget for the period
  • The incident required an escalation beyond the primary on-call
  • Data was lost or corrupted
  • A manual intervention was needed to restore service
  • The detection time exceeded 10 minutes for a user-facing issue
  • The same root cause class has appeared more than once

The SLO connection is important. Your error budget is the objective measure of whether an incident mattered. If you're within budget, a brief incident note suffices. If budget was consumed, a full postmortem is warranted.
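As a sketch of how that decision can be made mechanical, the checklist above maps to a simple predicate (the type and field names here are illustrative, not from any standard tooling):

```python
from dataclasses import dataclass

@dataclass
class IncidentFacts:
    """Facts gathered about an incident. Field names are illustrative."""
    exceeded_error_budget: bool      # user-visible downtime blew the period's budget
    escalated_beyond_primary: bool   # more than the primary on-call was pulled in
    data_lost_or_corrupted: bool
    manual_intervention: bool        # a human had to act to restore service
    detection_minutes: float         # time from impact start to detection
    user_facing: bool
    repeat_root_cause: bool          # same root cause class seen before

def needs_postmortem(f: IncidentFacts) -> bool:
    """True when any trigger condition from the checklist is met."""
    return any([
        f.exceeded_error_budget,
        f.escalated_beyond_primary,
        f.data_lost_or_corrupted,
        f.manual_intervention,
        f.user_facing and f.detection_minutes > 10,
        f.repeat_root_cause,
    ])
```

Encoding the policy this way also makes it easy to wire into incident tooling, so the "do we write one?" question never becomes a debate.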

Measuring the Incident: Key PromQL Queries

Before you sit down to write, gather the data. Opinions vary. Metrics don't.

Impact Duration and Magnitude

# Error rate during the incident window
sum(rate(http_requests_total{service="affected-service", status_class="5xx"}[1m]))
/
sum(rate(http_requests_total{service="affected-service"}[1m]))

Plot this over the incident window. The area above your SLO threshold is your actual impact — the number of excess errors served to users.
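A minimal sketch of that area calculation, assuming you have exported the per-step error rate and request rate from the queries above (the function and parameter names are my own):

```python
def excess_errors(samples, slo_error_rate=0.001, step_seconds=60):
    """Integrate errors served above the SLO threshold over the window.

    samples: list of (error_rate, requests_per_second) tuples, one per
    scrape step. Returns the count of excess errors — the area between
    the observed error rate and the SLO line, scaled by traffic.
    """
    total = 0.0
    for error_rate, rps in samples:
        if error_rate > slo_error_rate:
            total += (error_rate - slo_error_rate) * rps * step_seconds
    return total
```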

# Total failed requests during incident (instant query at end of window)
sum(increase(http_requests_total{service="affected-service", status_class="5xx"}[2h]))

Latency Impact

# P99 latency during incident window
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="affected-service"}[1m])) by (le)
)

Error Budget Consumed

# Fraction of monthly error budget consumed by this incident
# Assumes 30-day window and 99.9% SLO
(
  sum(increase(http_requests_total{service="affected-service", status_class="5xx"}[2h]))
  /
  sum(increase(http_requests_total{service="affected-service"}[30d]))
) / 0.001 * 100

This last query tells you what percentage of your monthly error budget this single incident consumed. That number belongs in the postmortem header. It grounds the entire conversation in objective impact.
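The same arithmetic in plain Python, useful for sanity-checking the query result before it goes in the header (a sketch; the example numbers are made up):

```python
def budget_consumed_pct(incident_errors, monthly_requests, slo=0.999):
    """Percentage of the monthly error budget consumed by one incident.

    Mirrors the PromQL above: (bad / total) / (1 - slo) * 100.
    """
    return (incident_errors / monthly_requests) / (1 - slo) * 100

# e.g. 45,000 failed requests out of 300M monthly requests at a 99.9% SLO:
# (45_000 / 300_000_000) / 0.001 * 100 = 15% of the monthly budget
```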

The Postmortem Template

Here is the template I use. Every section has a purpose.

# Postmortem: [Incident Title]
# Date: YYYY-MM-DD
# Severity: SEV-1 | SEV-2 | SEV-3
# Author: [On-call engineer]
# Status: Draft | In Review | Complete

## Summary
# 2-3 sentences. What happened, how long, who was affected.
# Include: error budget consumed as a percentage.

## Impact
# Quantified impact. Not "some users were affected" — use numbers.
#   - Duration: XX minutes
#   - Requests affected: XX,XXX
#   - Error rate peak: XX%
#   - Error budget consumed: XX% of monthly budget
#   - Revenue impact (if measurable): $X,XXX

## Timeline
# Use UTC. Every entry should be a fact, not an interpretation.
#   HH:MM - Event description
#   HH:MM - Event description
# Key timestamps to always include:
#   - When the triggering change was deployed
#   - When monitoring first detected the issue
#   - When a human was alerted
#   - When mitigation began
#   - When service was restored
#   - When root cause was confirmed

## Detection
# How was the incident detected?
# Was it detected by monitoring or reported by a user?
# How long between the start of impact and detection?
# Detection gap = time_human_alerted - time_impact_started

## Root Cause
# Technical explanation. Go deep. What specifically broke and why.
# Use the "five whys" method, but write the final result in narrative form.
# This section should be understandable by any engineer on the team.

## Contributing Factors
# What conditions made this incident possible or worse?
# Examples: missing monitoring, test gap, config drift,
# knowledge concentration in one person.

## Mitigation
# What was done to stop the bleeding?
# Was it a rollback, config change, scaling event, or manual fix?

## Lessons Learned
# What went well:
#   - (things that worked: detection, response, tooling)
# What went poorly:
#   - (things that failed or slowed response)
# Where we got lucky:
#   - (things that could have been worse)

## Action Items
# Every action item must have:
#   - Owner (a person, not a team)
#   - Priority (P1-P3)
#   - Due date
#   - Tracking ticket link
# | Action | Owner | Priority | Due | Ticket |
# |--------|-------|----------|-----|--------|
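The template's key timestamps can be turned into the standard durations with a short script (a sketch; the dictionary keys are illustrative and mirror the list in the Timeline section):

```python
from datetime import datetime

def timeline_metrics(ts):
    """Derive postmortem durations from key timestamps.

    ts: dict of ISO-8601 UTC timestamp strings. Keys used here:
    impact_started, human_alerted, mitigation_began, service_restored.
    """
    t = {k: datetime.fromisoformat(v) for k, v in ts.items()}
    minutes = lambda a, b: (t[b] - t[a]).total_seconds() / 60
    return {
        "detection_gap_min": minutes("impact_started", "human_alerted"),
        "time_to_mitigate_min": minutes("impact_started", "mitigation_began"),
        "time_to_restore_min": minutes("impact_started", "service_restored"),
    }
```

Computing these from the timeline rather than estimating them keeps the Detection and Impact sections honest.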

Running the Postmortem Meeting

The meeting is as important as the document. Here's the structure I follow.

Before the Meeting

  • The on-call engineer fills in the timeline and impact sections from data
  • Share the draft 24 hours before the meeting so people can review
  • Invite everyone involved in the response, plus the service owner

The Facilitator's Role

The facilitator is not the author. Their job is to keep the discussion blameless and productive. When someone says "they should have caught that in review," the facilitator redirects: "What about our process made it possible for this to reach production?"

Redirect phrases that work:

  • "What about the system allowed this?"
  • "What information was missing at the time?"
  • "If we assume everyone made reasonable decisions, what was the gap?"

Meeting Structure (60 minutes)

00:00 - 05:00  Read the timeline silently (everyone on the same page)
05:00 - 15:00  Timeline corrections and additions
15:00 - 30:00  Root cause discussion (five whys, facilitated)
30:00 - 40:00  Contributing factors and lessons learned
40:00 - 55:00  Action items (assign owners and due dates)
55:00 - 60:00  Review action items, confirm next steps

The Five Whys in Practice

This is the core of root cause analysis. Here's an example from a real incident:

  1. Why did users see 500 errors? The payment service returned errors for all requests.
  2. Why was the payment service returning errors? It couldn't connect to the database.
  3. Why couldn't it connect to the database? The connection pool was exhausted.
  4. Why was the connection pool exhausted? A new query pattern held connections open 10x longer than expected.
  5. Why did the new query pattern reach production? Load testing doesn't cover this code path, and there's no connection pool saturation alert.

The root cause isn't "someone wrote a bad query." It's "we lack load testing coverage for this code path and have no alerting on connection pool saturation." Those are systemic issues with systemic fixes.

Monitoring Gaps: The Postmortem's Best Output

The most valuable action items from postmortems are almost always monitoring improvements. After every postmortem, ask: what query would have detected this faster?

# Connection pool saturation — the alert we should have had
(
  sum(db_pool_active_connections{service="payment-service"}) by (pod)
  /
  sum(db_pool_max_connections{service="payment-service"}) by (pod)
) > 0.8

# Alert rule born from the postmortem
groups:
  - name: postmortem-action-items
    rules:
      - alert: ConnectionPoolNearSaturation
        expr: |
          (
            sum(db_pool_active_connections{service="payment-service"}) by (pod)
            /
            sum(db_pool_max_connections{service="payment-service"}) by (pod)
          ) > 0.8
        for: 5m
        labels:
          severity: warning
          source: postmortem-2026-03-15
        annotations:
          summary: "Connection pool at {{ $value | humanizePercentage }} on {{ $labels.pod }}"
          runbook_url: "https://wiki.internal/runbooks/connection-pool-saturation"

Tagging the alert with its postmortem source creates traceability. Six months from now, someone can see why this alert exists.

Tracking Action Item Completion

A postmortem without completed action items is just a document. Track completion rates:

  • Review open action items in weekly SRE team meetings
  • Set a target: 90% of P1 action items closed within 14 days
  • If an action item is repeatedly deferred, escalate or remove it — stale items erode trust in the process
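A sketch of the completion-rate check against the 14-day target, assuming action items are exported as simple records from your tracker (field names are my own):

```python
from datetime import date

def p1_on_time_rate(items, deadline_days=14):
    """Fraction of P1 action items closed within the deadline.

    items: list of dicts with 'priority', 'created' (date), and
    'closed' (date or None for still-open items).
    """
    p1 = [i for i in items if i["priority"] == "P1"]
    if not p1:
        return 1.0  # nothing to track counts as on target
    on_time = [
        i for i in p1
        if i["closed"] is not None
        and (i["closed"] - i["created"]).days <= deadline_days
    ]
    return len(on_time) / len(p1)
```

Reviewing this number weekly makes deferred items visible before they go stale.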

Common Anti-Patterns

The blame postmortem. "The engineer deployed without testing." This kills psychological safety. Reframe: "Our deployment pipeline allows changes to reach production without automated validation."

The shallow root cause. "The server ran out of memory." Why did it run out of memory? What made that possible? Keep asking why until you reach a systemic issue.

No action items. A postmortem that ends with "we'll be more careful" has produced nothing. Every postmortem should produce at least one concrete, measurable change.

Never following up. The same root cause appearing in two postmortems is a process failure. If your action items aren't being completed, the postmortem process is providing an illusion of improvement.

Building the Culture

Blameless postmortems are a practice, not a policy. You build the culture by consistently modeling it. Celebrate thorough postmortems publicly. Share them across teams. When someone identifies a systemic gap that affects other services, that's a win for the entire organization.

The Google SRE book puts it simply: "We aim to understand what happened and how to prevent it, not who is to blame." Every meeting, every document, every conversation should reflect that principle.

Incidents will happen. The only question is whether you learn from them systematically or repeat them indefinitely.
