
Blameless Postmortems: A Practical Process and Template for SRE Teams

Riku Tanaka · 8 min read

Incidents Are Data, Not Failures

Every incident is a free lesson in how your system actually behaves under stress. The postmortem is where you extract that lesson. But only if you do it right.

The Google SRE book is clear on this: a postmortem must be blameless. The moment you assign blame to a person, everyone in the room starts protecting themselves instead of sharing what they know. The quality of your root cause analysis drops to zero.

I've facilitated over a hundred postmortems across different organizations. The ones that produce real improvements all share the same structure. Let me walk through it.

When to Write a Postmortem

Not every incident needs a postmortem. Writing one for every blip creates fatigue, and the documents pile up unread. Write a postmortem when any of these conditions are met:

  • User-visible downtime exceeded your SLO error budget for the period
  • The incident required an escalation beyond the primary on-call
  • Data was lost or corrupted
  • A manual intervention was needed to restore service
  • The detection time exceeded 10 minutes for a user-facing issue
  • The same root cause class has appeared more than once

The SLO connection is important. Your error budget is the objective measure of whether an incident mattered. If you're within budget, a brief incident note suffices. If budget was consumed, a full postmortem is warranted.
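As a sketch of how that decision can be made mechanical, the checklist above maps to a simple predicate (the type and field names here are illustrative, not from any standard tooling):

```python
from dataclasses import dataclass

@dataclass
class IncidentFacts:
    """Facts gathered about an incident. Field names are illustrative."""
    exceeded_error_budget: bool      # user-visible downtime blew the period's budget
    escalated_beyond_primary: bool   # more than the primary on-call was pulled in
    data_lost_or_corrupted: bool
    manual_intervention: bool        # a human had to act to restore service
    detection_minutes: float         # time from impact start to detection
    user_facing: bool
    repeat_root_cause: bool          # same root cause class seen before

def needs_postmortem(f: IncidentFacts) -> bool:
    """True when any trigger condition from the checklist is met."""
    return any([
        f.exceeded_error_budget,
        f.escalated_beyond_primary,
        f.data_lost_or_corrupted,
        f.manual_intervention,
        f.user_facing and f.detection_minutes > 10,
        f.repeat_root_cause,
    ])
```

Encoding the policy this way also makes it easy to wire into incident tooling, so the "do we write one?" question never becomes a debate.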

Measuring the Incident: Key PromQL Queries

Before you sit down to write, gather the data. Opinions vary. Metrics don't.

Impact Duration and Magnitude

# Error rate during the incident window
sum(rate(http_requests_total{service="affected-service", status_class="5xx"}[1m]))
/
sum(rate(http_requests_total{service="affected-service"}[1m]))

Plot this over the incident window. The area above your SLO threshold is your actual impact — the number of excess errors served to users.
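A minimal sketch of that area calculation, assuming you have exported the per-step error rate and request rate from the queries above (the function and parameter names are my own):

```python
def excess_errors(samples, slo_error_rate=0.001, step_seconds=60):
    """Integrate errors served above the SLO threshold over the window.

    samples: list of (error_rate, requests_per_second) tuples, one per
    scrape step. Returns the count of excess errors — the area between
    the observed error rate and the SLO line, scaled by traffic.
    """
    total = 0.0
    for error_rate, rps in samples:
        if error_rate > slo_error_rate:
            total += (error_rate - slo_error_rate) * rps * step_seconds
    return total
```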

# Total failed requests during incident (instant query at end of window)
sum(increase(http_requests_total{service="affected-service", status_class="5xx"}[2h]))

Latency Impact

# P99 latency during incident window
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="affected-service"}[1m])) by (le)
)

Error Budget Consumed

# Fraction of monthly error budget consumed by this incident
# Assumes 30-day window and 99.9% SLO
(
  sum(increase(http_requests_total{service="affected-service", status_class="5xx"}[2h]))
  /
  sum(increase(http_requests_total{service="affected-service"}[30d]))
) / 0.001 * 100

This last query tells you what percentage of your monthly error budget this single incident consumed. That number belongs in the postmortem header. It grounds the entire conversation in objective impact.
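The same arithmetic in plain Python, useful for sanity-checking the query result before it goes in the header (a sketch; the example numbers are made up):

```python
def budget_consumed_pct(incident_errors, monthly_requests, slo=0.999):
    """Percentage of the monthly error budget consumed by one incident.

    Mirrors the PromQL above: (bad / total) / (1 - slo) * 100.
    """
    return (incident_errors / monthly_requests) / (1 - slo) * 100

# e.g. 45,000 failed requests out of 300M monthly requests at a 99.9% SLO:
# (45_000 / 300_000_000) / 0.001 * 100 = 15% of the monthly budget
```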

The Postmortem Template

Here is the template I use. Every section has a purpose.

# Postmortem: [Incident Title]
# Date: YYYY-MM-DD
# Severity: SEV-1 | SEV-2 | SEV-3
# Author: [On-call engineer]
# Status: Draft | In Review | Complete

## Summary
# 2-3 sentences. What happened, how long, who was affected.
# Include: error budget consumed as a percentage.

## Impact
# Quantified impact. Not "some users were affected" — use numbers.
#   - Duration: XX minutes
#   - Requests affected: XX,XXX
#   - Error rate peak: XX%
#   - Error budget consumed: XX% of monthly budget
#   - Revenue impact (if measurable): $X,XXX

## Timeline
# Use UTC. Every entry should be a fact, not an interpretation.
#   HH:MM - Event description
#   HH:MM - Event description
# Key timestamps to always include:
#   - When the triggering change was deployed
#   - When monitoring first detected the issue
#   - When a human was alerted
#   - When mitigation began
#   - When service was restored
#   - When root cause was confirmed

## Detection
# How was the incident detected?
# Was it detected by monitoring or reported by a user?
# How long between the start of impact and detection?
# Detection gap = time_human_alerted - time_impact_started

## Root Cause
# Technical explanation. Go deep. What specifically broke and why.
# Use the "five whys" method, but write the final result in narrative form.
# This section should be understandable by any engineer on the team.

## Contributing Factors
# What conditions made this incident possible or worse?
# Examples: missing monitoring, test gap, config drift,
# knowledge concentration in one person.

## Mitigation
# What was done to stop the bleeding?
# Was it a rollback, config change, scaling event, or manual fix?

## Lessons Learned
# What went well:
#   - (things that worked: detection, response, tooling)
# What went poorly:
#   - (things that failed or slowed response)
# Where we got lucky:
#   - (things that could have been worse)

## Action Items
# Every action item must have:
#   - Owner (a person, not a team)
#   - Priority (P1-P3)
#   - Due date
#   - Tracking ticket link
# | Action | Owner | Priority | Due | Ticket |
# |--------|-------|----------|-----|--------|
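The template's key timestamps can be turned into the standard durations with a short script (a sketch; the dictionary keys are illustrative and mirror the list in the Timeline section):

```python
from datetime import datetime

def timeline_metrics(ts):
    """Derive postmortem durations from key timestamps.

    ts: dict of ISO-8601 UTC timestamp strings. Keys used here:
    impact_started, human_alerted, mitigation_began, service_restored.
    """
    t = {k: datetime.fromisoformat(v) for k, v in ts.items()}
    minutes = lambda a, b: (t[b] - t[a]).total_seconds() / 60
    return {
        "detection_gap_min": minutes("impact_started", "human_alerted"),
        "time_to_mitigate_min": minutes("impact_started", "mitigation_began"),
        "time_to_restore_min": minutes("impact_started", "service_restored"),
    }
```

Computing these from the timeline rather than estimating them keeps the Detection and Impact sections honest.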

Running the Postmortem Meeting

The meeting is as important as the document. Here's the structure I follow.

Before the Meeting

  • The on-call engineer fills in the timeline and impact sections from data
  • Share the draft 24 hours before the meeting so people can review
  • Invite everyone involved in the response, plus the service owner

The Facilitator's Role

The facilitator is not the author. Their job is to keep the discussion blameless and productive. When someone says "they should have caught that in review," the facilitator redirects: "What about our process made it possible for this to reach production?"

Redirect phrases that work:

  • "What about the system allowed this?"
  • "What information was missing at the time?"
  • "If we assume everyone made reasonable decisions, what was the gap?"

Meeting Structure (60 minutes)

00:00 - 05:00  Read the timeline silently (everyone on the same page)
05:00 - 15:00  Timeline corrections and additions
15:00 - 30:00  Root cause discussion (five whys, facilitated)
30:00 - 40:00  Contributing factors and lessons learned
40:00 - 55:00  Action items (assign owners and due dates)
55:00 - 60:00  Review action items, confirm next steps

The Five Whys in Practice

This is the core of root cause analysis. Here's an example from a real incident:

  1. Why did users see 500 errors? The payment service returned errors for all requests.
  2. Why was the payment service returning errors? It couldn't connect to the database.
  3. Why couldn't it connect to the database? The connection pool was exhausted.
  4. Why was the connection pool exhausted? A new query pattern held connections open 10x longer than expected.
  5. Why did the new query pattern reach production? Load testing doesn't cover this code path, and there's no connection pool saturation alert.

The root cause isn't "someone wrote a bad query." It's "we lack load testing coverage for this code path and have no alerting on connection pool saturation." Those are systemic issues with systemic fixes.

Monitoring Gaps: The Postmortem's Best Output

The most valuable action items from postmortems are almost always monitoring improvements. After every postmortem, ask: what query would have detected this faster?

# Connection pool saturation — the alert we should have had
(
  sum(db_pool_active_connections{service="payment-service"}) by (pod)
  /
  sum(db_pool_max_connections{service="payment-service"}) by (pod)
) > 0.8

# Alert rule born from the postmortem
groups:
  - name: postmortem-action-items
    rules:
      - alert: ConnectionPoolNearSaturation
        expr: |
          (
            sum(db_pool_active_connections{service="payment-service"}) by (pod)
            /
            sum(db_pool_max_connections{service="payment-service"}) by (pod)
          ) > 0.8
        for: 5m
        labels:
          severity: warning
          source: postmortem-2026-03-15
        annotations:
          summary: "Connection pool at {{ $value | humanizePercentage }} on {{ $labels.pod }}"
          runbook_url: "https://wiki.internal/runbooks/connection-pool-saturation"

Tagging the alert with its postmortem source creates traceability. Six months from now, someone can see why this alert exists.

Tracking Action Item Completion

A postmortem without completed action items is just a document. Track completion rates:

  • Review open action items in weekly SRE team meetings
  • Set a target: 90% of P1 action items closed within 14 days
  • If an action item is repeatedly deferred, escalate or remove it — stale items erode trust in the process
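A sketch of the completion-rate check against the 14-day target, assuming action items are exported as simple records from your tracker (field names are my own):

```python
from datetime import date

def p1_on_time_rate(items, deadline_days=14):
    """Fraction of P1 action items closed within the deadline.

    items: list of dicts with 'priority', 'created' (date), and
    'closed' (date or None for still-open items).
    """
    p1 = [i for i in items if i["priority"] == "P1"]
    if not p1:
        return 1.0  # nothing to track counts as on target
    on_time = [
        i for i in p1
        if i["closed"] is not None
        and (i["closed"] - i["created"]).days <= deadline_days
    ]
    return len(on_time) / len(p1)
```

Reviewing this number weekly makes deferred items visible before they go stale.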

Common Anti-Patterns

The blame postmortem. "The engineer deployed without testing." This kills psychological safety. Reframe: "Our deployment pipeline allows changes to reach production without automated validation."

The shallow root cause. "The server ran out of memory." Why did it run out of memory? What made that possible? Keep asking why until you reach a systemic issue.

No action items. A postmortem that ends with "we'll be more careful" has produced nothing. Every postmortem should produce at least one concrete, measurable change.

Never following up. The same root cause appearing in two postmortems is a process failure. If your action items aren't being completed, the postmortem process is providing an illusion of improvement.

Building the Culture

Blameless postmortems are a practice, not a policy. You build the culture by consistently modeling it. Celebrate thorough postmortems publicly. Share them across teams. When someone identifies a systemic gap that affects other services, that's a win for the entire organization.

The Google SRE book puts it simply: "We aim to understand what happened and how to prevent it, not who is to blame." Every meeting, every document, every conversation should reflect that principle.

Incidents will happen. The only question is whether you learn from them systematically or repeat them indefinitely.
