DevOpsil

On-Call Rotation Practices That Actually Prevent Burnout

Riku Tanaka · 10 min read

On-Call Shouldn't Mean On-Edge

I've watched good engineers leave teams — not because of the work, but because of the on-call. Three pages at 2 AM for a noisy alert nobody fixed. A rotation of two people covering a system that should have five. A culture where "just deal with it" replaced actual improvement.

The Google SRE book puts it clearly: if a human operator needs to touch the system during normal operations, your system is broken. On-call should be boring most of the time. When it isn't, that's a signal to fix the system, not the person.

Let me walk through the practices that actually keep on-call sustainable.

The Minimum Viable Rotation

Team Size

You need at least eight engineers in a rotation to make it humane. Here's why:

  • Each engineer is on-call one week out of eight — roughly 12% of their time.
  • With fewer people, the rotation becomes too frequent. Five people means 20% on-call time, and in my experience that's around the threshold where the cognitive load of constant interruptions starts degrading code quality.
  • The Google SRE book recommends a maximum of two events per on-call shift (12 hours). If you're exceeding that regularly, the system needs work, not more on-call engineers.
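The arithmetic behind those team-size numbers is trivial but worth making explicit; a throwaway sketch (the function name is illustrative):

```python
# Fraction of time each engineer spends on-call in a weekly rotation.
def on_call_fraction(team_size: int) -> float:
    return 1 / team_size

for n in (5, 8):
    print(f"{n} engineers -> {on_call_fraction(n):.0%} on-call")
```

Five engineers puts everyone on-call a fifth of the time; eight brings it down to roughly an eighth.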

Shift Structure

rotation:
  name: platform-primary
  type: weekly
  handoff_time: "10:00 UTC"  # During business hours, never midnight
  handoff_day: wednesday      # Mid-week, not Monday or Friday
  participants: 8
  escalation:
    - level: 1
      target: primary-on-call
      timeout: 15m
    - level: 2
      target: secondary-on-call
      timeout: 15m
    - level: 3
      target: engineering-manager
      timeout: 30m

Wednesday handoffs are deliberate. Monday handoffs mean the new on-call inherits weekend context they don't have. Friday handoffs mean the outgoing on-call can't debrief before the weekend. Mid-week gives overlap time for knowledge transfer.

Primary and Secondary

Always have a secondary. The primary handles the page. If they can't resolve it in 15 minutes, the secondary joins. If neither can resolve it, it escalates to management — not to decide the technical fix, but to authorize broader response (wake more people, declare an incident, pull in specialists).

The secondary is not "on standby." They live their life normally but keep their phone on. The mental model is: primary is actively reachable, secondary is a safety net.
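A minimal sketch of how this escalation chain plays out, assuming the timeouts from the rotation config above (the `EscalationLevel` type and `page_with_escalation` helper are illustrative, not a real paging API):

```python
from dataclasses import dataclass

@dataclass
class EscalationLevel:
    target: str
    timeout_minutes: int

# Mirrors the escalation section of the rotation config:
# primary (15m) -> secondary (15m) -> engineering manager (30m).
CHAIN = [
    EscalationLevel("primary-on-call", 15),
    EscalationLevel("secondary-on-call", 15),
    EscalationLevel("engineering-manager", 30),
]

def page_with_escalation(acknowledged_after_minutes: int) -> str:
    """Return which target ends up owning the page, given how long
    it takes anyone in the chain to acknowledge it."""
    elapsed = 0
    for level in CHAIN:
        elapsed += level.timeout_minutes
        if acknowledged_after_minutes <= elapsed:
            return level.target
    return CHAIN[-1].target  # past all timeouts, the manager owns it

print(page_with_escalation(10))  # primary-on-call
print(page_with_escalation(20))  # secondary-on-call
```

A page acknowledged within 15 minutes never leaves the primary; anything past 30 minutes is already a management problem, which is exactly the point.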

Measuring On-Call Health

You can't improve what you don't measure. Track these metrics monthly.

Pages Per Shift

# Total pages per on-call shift (7-day window per engineer)
sum(increase(alertmanager_alerts_received_total{severity="critical"}[7d])) by (on_call_engineer)

# Pages that occurred outside business hours (the ones that hurt).
# Caveat: hour() reflects query-evaluation time, not when each page
# fired — for exact attribution, record an out-of-hours counter at
# page time instead.
sum(increase(alertmanager_alerts_received_total{severity="critical"}[7d])) by (on_call_engineer)
  and on() (hour() < 8 or hour() >= 18)

Target: fewer than two pages per on-call shift. If you consistently exceed this, your alerts need tuning or your system needs investment.
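Because PromQL's `hour()` reflects evaluation time rather than when a page actually fired, it can help to classify page timestamps offline; a hedged sketch (the 08:00–18:00 business-hours window is an assumption — adjust to your rotation's definition):

```python
from datetime import datetime, timezone

def is_out_of_hours(ts: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True if the page fired outside business hours (by clock hour)."""
    return ts.hour < start_hour or ts.hour >= end_hour

pages = [
    datetime(2026, 3, 18, 2, 13, tzinfo=timezone.utc),   # 02:13 — the one that hurts
    datetime(2026, 3, 18, 14, 40, tzinfo=timezone.utc),  # 14:40 — business hours
]
print(sum(is_out_of_hours(p) for p in pages))  # 1
```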

Time to Acknowledge (TTA)

# P50 time from page to acknowledgment
histogram_quantile(0.50,
  sum(rate(oncall_page_acknowledge_seconds_bucket[30d])) by (le)
)

A healthy TTA is under 5 minutes. If it's consistently over 10, your engineers may be ignoring pages — a sign of alert fatigue.

Mean Time to Resolve (MTTR)

# P50 resolution time for incidents
histogram_quantile(0.50,
  sum(rate(incident_resolution_seconds_bucket[30d])) by (le, severity)
)

Track MTTR by severity. Critical incidents should resolve in under an hour. If they don't, your runbooks or tooling need improvement.

Toil Ratio

The Google SRE book defines toil as manual, repetitive, automatable work that scales with service growth. Track what percentage of on-call time is spent on toil versus incident response versus proactive work.

# Weekly on-call report template
on_call_report:
  engineer: riku
  week: 2026-W12
  pages_total: 3
  pages_outside_hours: 1
  false_positives: 1
  incidents_declared: 0
  time_breakdown:
    incident_response: 2h
    toil: 4h
    proactive_improvement: 2h
    idle: 32h
  toil_items:
    - "Manually restarted cache pods after OOM (3 times)"
    - "Rotated expired TLS certificate for internal API"
  improvement_suggestions:
    - "Automate cache pod memory limit adjustment"
    - "Implement cert-manager for automatic rotation"

If toil is more than 50% of operational work, your team is a human cron job. That's the path to burnout.
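A quick sanity check on the weekly report above. "Operational work" is read here as incident response plus toil — one reasonable definition; adjust to your team's. Field names mirror the YAML report:

```python
def toil_ratio(time_breakdown: dict[str, float]) -> float:
    """Toil as a fraction of operational work (incidents + toil)."""
    operational = time_breakdown["incident_response"] + time_breakdown["toil"]
    return time_breakdown["toil"] / operational if operational else 0.0

report = {"incident_response": 2, "toil": 4, "proactive_improvement": 2, "idle": 32}
print(f"{toil_ratio(report):.0%}")  # 67% — over the 50% line: human cron job territory
```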

Policies That Protect People

The Follow-the-Sun Exception

If your team spans time zones, use follow-the-sun rotations so nobody handles pages at night. This is the single highest-impact change for on-call quality of life.

rotation:
  name: platform-global
  type: follow-the-sun
  shifts:
    - name: apac
      hours: "00:00-08:00 UTC"
      participants: [tokyo-team]
    - name: emea
      hours: "08:00-16:00 UTC"
      participants: [berlin-team]
    - name: americas
      hours: "16:00-00:00 UTC"
      participants: [nyc-team]

Compensation

On-call without compensation is exploitation dressed as team spirit. At minimum:

  • Time in lieu: One day off for every week of on-call.
  • Interrupt pay: Extra compensation for out-of-hours pages.
  • No on-call during PTO: This sounds obvious, but I've seen it violated.

The 50% Rule

Engineers on-call should spend no more than 50% of their business-hours time on operational work. The other 50% should be project work. If operational load consistently exceeds this, the team needs more people or fewer services.

Reducing Page Volume Systematically

Every page is a bug — either in your system or in your alerts. Track and categorize them.

Post-Page Review

After every on-call shift, the outgoing engineer files a brief report. For each page:

  1. Was this a real incident or a false positive?
  2. Was the runbook helpful?
  3. Could this have been prevented by automation?
  4. Should this alert be removed, adjusted, or kept as-is?

The Noise Budget

Just as you have an error budget for SLOs, create a noise budget for alerts.

noise_budget:
  target_false_positive_rate: 5%   # Max 5% of pages should be false positives
  review_period: monthly
  action_threshold: 10%            # If false positives exceed 10%, freeze new alerts

When false positives exceed your noise budget, stop adding new alerts and fix the existing ones. This creates the same kind of accountability that error budgets create for reliability.

# False positive rate over 30 days
sum(increase(oncall_page_false_positive_total[30d]))
/
sum(increase(oncall_page_total[30d]))
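Wiring the budget to a decision can be as simple as a threshold check; a sketch using the 5% / 10% figures from the noise_budget config (the function name and return values are illustrative):

```python
def noise_budget_status(false_positives: int, total_pages: int,
                        target: float = 0.05, freeze_at: float = 0.10) -> str:
    """Decide what the false-positive rate implies for the alert backlog."""
    rate = false_positives / total_pages if total_pages else 0.0
    if rate > freeze_at:
        return "freeze-new-alerts"   # past the action threshold
    if rate > target:
        return "over-budget"         # fix noisy alerts before adding more
    return "within-budget"

# 3 false positives out of 25 pages is a 12% rate — freeze new alerts.
print(noise_budget_status(3, 25))
```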

Automate the Repeat Offenders

Look at your top 5 most frequent pages over the last quarter. I guarantee at least three of them are automatable.

Page                         Frequency   Automation
Cache pod OOM restart        12/month    Increase memory limit, add VPA
TLS cert expiry              4/month     Deploy cert-manager
Disk usage > 90%             8/month     Add log rotation CronJob
Deployment rollback needed   3/month     Automated canary with Argo Rollouts
DNS resolution failures      6/month     Fix ndots config in pod spec

Every automated page is a future 3 AM interruption that never happens.
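A back-of-the-envelope on the table above makes the payoff concrete (frequencies are per month; the dict keys are my shorthand for the table rows):

```python
# Interruptions avoided per year if each repeat offender is automated.
pages_per_month = {
    "cache-pod-oom": 12,
    "tls-cert-expiry": 4,
    "disk-usage": 8,
    "deploy-rollback": 3,
    "dns-failures": 6,
}
saved_per_year = {page: freq * 12 for page, freq in pages_per_month.items()}
print(sum(saved_per_year.values()))  # 396 pages a year that never fire
```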

SLO-Driven On-Call Improvement

Connect your on-call metrics to your SLOs. If your service meets its SLOs consistently, on-call should be quiet. If it doesn't, the error budget policy should trigger engineering investment — not just more heroic on-call effort.

error_budget_policy:
  budget_remaining_gt_50pct:
    action: "Normal development velocity"
    on_call_expectation: "Quiet shifts, focus on proactive work"

  budget_remaining_25_to_50pct:
    action: "Prioritize reliability work over features"
    on_call_expectation: "Increasing pages expected, review alerts weekly"

  budget_remaining_lt_25pct:
    action: "Freeze feature work, all hands on reliability"
    on_call_expectation: "Active incident response, daily toil review"

  budget_exhausted:
    action: "Halt all changes, postmortem required for every incident"
    on_call_expectation: "System is unreliable, escalate to leadership"

This framework makes it explicit: on-call pain is not a people problem, it's a system problem. When the error budget burns, the team shifts focus. When on-call is quiet, the team builds.
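The policy is easy to encode next to whatever consumes it; a minimal sketch mirroring the thresholds in the YAML above (`policy_for_budget` is an illustrative name, not a real SDK call):

```python
def policy_for_budget(remaining_pct: float) -> str:
    """Map remaining error budget (%) to the policy tier's action."""
    if remaining_pct <= 0:
        return "Halt all changes, postmortem required for every incident"
    if remaining_pct < 25:
        return "Freeze feature work, all hands on reliability"
    if remaining_pct < 50:
        return "Prioritize reliability work over features"
    return "Normal development velocity"

print(policy_for_budget(80))  # Normal development velocity
print(policy_for_budget(10))  # Freeze feature work, all hands on reliability
```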

What Sustainable On-Call Looks Like

A healthy on-call rotation has these properties:

  • Fewer than two pages per week-long shift
  • Less than 5% false positive rate
  • Every critical alert has a runbook
  • Handoffs happen mid-week during business hours
  • Engineers are compensated for their time
  • The top 5 repeat pages are tracked and actively being automated
  • Monthly review of on-call health metrics

Building Effective Runbooks

A page without a runbook is a guessing game at 3 AM. Every critical alert needs a corresponding runbook, and that runbook needs to be actionable — not a wall of text that starts with "this alert fires when..."

Structure every runbook the same way:

# Alert: CacheClusterOOMKills

## Impact
Cache pods are being OOM-killed, causing increased latency on all API endpoints
that depend on cached data. Users may see 2-5s response times instead of <200ms.

## Diagnosis
1. Check which pods were OOM-killed:
   kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp'

2. Check current memory usage vs limits:
   kubectl top pods -n production -l app=cache-cluster

3. Check if traffic spike is the cause:
   Open Grafana: https://grafana.internal/d/cache-overview
   Look at "Requests per Second" panel for the last 2 hours.

## Mitigation (stop the bleeding)
1. If fewer than 2 cache pods are healthy:
   kubectl scale deployment cache-cluster -n production --replicas=5

2. If memory is consistently above 90% of limit:
   kubectl set resources deployment cache-cluster -n production \
     --limits=memory=2Gi --requests=memory=1.5Gi

## Root Cause Resolution (after the incident)
- File a ticket to tune cache eviction policy
- Review cache key TTLs — stale keys may be accumulating
- Consider switching to a VPA to auto-adjust limits

## Escalation
If the cache cluster is fully down and cannot be recovered within 15 minutes,
escalate to the Platform team lead via PagerDuty.

The key principles: diagnosis steps use exact commands someone can copy-paste. Mitigation comes before root cause. And escalation criteria are explicit — no ambiguity about when to wake someone else up.

Every alert should link directly to its runbook. If the on-call engineer has to search a wiki to find the right page, you've already failed:

groups:
  - name: cache-alerts
    rules:
      - alert: CacheClusterOOMKills
        # kube_pod_container_status_restarts_total has no `reason` label,
        # so join against the last-terminated-reason metric instead
        expr: |
          increase(kube_pod_container_status_restarts_total{container="redis"}[15m]) > 0
          and on(namespace, pod)
          kube_pod_container_status_last_terminated_reason{container="redis", reason="OOMKilled"} == 1
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Cache pod OOM kills detected in {{ $labels.namespace }}"
          runbook_url: "https://wiki.internal/runbooks/cache-oom-kills"
          dashboard_url: "https://grafana.internal/d/cache-overview?var-namespace={{ $labels.namespace }}"

Configure Alertmanager to include runbook_url in Slack and PagerDuty notifications. The on-call engineer should go from page to runbook in one click.

The On-Call Handoff Ritual

A sloppy handoff is a recipe for missed context. Formalize it with a 15-minute sync between the outgoing and incoming on-call engineer:

handoff_agenda:
  duration: 15m
  timing: "Wednesday 10:00 UTC (30 min after rotation starts)"
  template:
    - item: "Active incidents"
      description: "Any ongoing incidents or degraded services"
    - item: "Recent changes"
      description: "Deployments, config changes, or infra work in the last 7 days"
    - item: "Known risks"
      description: "Upcoming maintenance windows, expiring certs, planned rollouts"
    - item: "Noisy alerts"
      description: "Any alerts that fired but weren't real incidents"
    - item: "Toil to watch for"
      description: "Recurring manual tasks that might need attention"

Document this in a shared channel so the rest of the team has visibility. If the incoming engineer gets paged at midnight, they should be able to pull up the handoff notes and have full context without waking anyone.

On-call is a necessary part of running reliable systems. But it should be a manageable part of the job, not the thing that makes people update their resumes. Build the structure, measure the outcomes, and fix the systems — not the people.

Riku Tanaka

SRE & Observability Engineer

If it's not measured, it doesn't exist. SLO-driven, metrics-obsessed, and the person who gets paged at 3 AM so you don't have to. Observability isn't optional.
