
Prometheus Alerting Rules That Don't Wake You Up for Nothing

Riku Tanaka · 9 min read

The Dashboard Doesn't Lie, But Your Alerts Can Mislead

At 2:47 AM, your phone buzzes. CPU at 85%. You drag yourself to the laptop, check Grafana, and discover it was a 30-second deployment spike that resolved itself. You go back to bed. This happens three times a week.

This is alert fatigue, and it's the number one reason on-call engineers burn out. The fix isn't fewer alerts — it's better alerts. Let me show you the patterns that actually work.

The Golden Rule: Alert on Symptoms, Not Causes

# BAD: Alerting on a cause
- alert: HighCPU
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU on {{ $labels.instance }}"

High CPU is a cause. Users don't care about your CPU. They care about whether the service works.

# GOOD: Alerting on a symptom
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1.0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "p99 latency above 1s for {{ $labels.service }}"
    runbook: "https://wiki.internal/runbooks/high-latency"

p99 latency above 1 second means users are experiencing slow responses. THAT is worth waking someone up for.

SLO-Based Alerting

The most reliable alerting pattern I've used ties directly to your SLOs.

Define Your SLOs

# Service: API Gateway
# SLO: 99.9% of requests complete successfully within 500ms
# Error budget: 0.1% = ~43 minutes of downtime per month
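The "~43 minutes" figure is just arithmetic on the SLO target. A quick sketch (plain Python, 30-day month assumed):

```python
# Monthly error budget implied by a 99.9% SLO, expressed as downtime minutes.
SLO_TARGET = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

error_budget_fraction = 1 - SLO_TARGET            # 0.1% of requests may fail
budget_minutes = error_budget_fraction * MINUTES_PER_MONTH

print(f"Error budget: {budget_minutes:.1f} minutes per month")  # 43.2 minutes
```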

Burn Rate Alerts

Instead of alerting when something crosses a static threshold, alert when you're burning through your error budget too fast.

# Fast burn: consumes ~2% of the monthly error budget per hour
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 14.4 * 0.001
  for: 2m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Error budget burning 14.4x faster than allowed"
    description: "At this rate, the monthly error budget exhausts in ~2 days"
    runbook: "https://wiki.internal/runbooks/error-budget-burn"

# Slow burn: will exhaust error budget in ~10 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 3 * 0.001
  for: 15m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Error budget burning 3x faster than allowed"
    description: "At this rate, monthly error budget exhausts in ~10 days"

Fast burn (14.4x) = page immediately, something is very wrong. Slow burn (3x) = warn during business hours, investigate before it becomes critical.
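The multipliers translate directly into budget math. A minimal sketch of where 14.4x and 3x lead (plain Python, 30-day month assumed):

```python
# How fast a given burn rate eats a 30-day error budget.
HOURS_PER_MONTH = 30 * 24  # 720

def hours_to_exhaustion(burn_rate: float) -> float:
    """At `burn_rate` times the sustainable error rate, the monthly
    budget lasts 720 / burn_rate hours."""
    return HOURS_PER_MONTH / burn_rate

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the monthly budget consumed over a given window."""
    return burn_rate * window_hours / HOURS_PER_MONTH

print(hours_to_exhaustion(14.4))   # 50.0 hours, roughly 2 days
print(budget_consumed(14.4, 1))    # 0.02: 2% of the budget per hour
print(hours_to_exhaustion(3))      # 240.0 hours, 10 days
```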

Essential Alert Patterns

Application Health

groups:
  - name: application
    rules:
      # Service is down
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

      # Error rate spike
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for {{ $labels.service }}"

      # Request rate drop (potential outage indicator)
      - alert: TrafficDrop
        expr: |
          sum(rate(http_requests_total[5m])) by (service)
          <
          sum(rate(http_requests_total[5m] offset 1h)) by (service) * 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic dropped 50%+ for {{ $labels.service }}"

Infrastructure

      # Disk filling up (predictive)
      - alert: DiskWillFill
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{mountpoint="/"}[6h],
            24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill within 24h"

      # Memory pressure
      - alert: HighMemoryPressure
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

Notice the DiskWillFill alert uses predict_linear — it alerts BEFORE the disk fills, giving you hours to respond instead of minutes. If you're getting paged for full disks, your alerts need work.
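Under the hood, predict_linear fits a line through the samples in the range and extrapolates it forward. A toy version of the idea, not Prometheus's exact implementation, with synthetic disk samples invented for illustration:

```python
# Sketch of what predict_linear does: least-squares fit over recent
# samples, extrapolated ahead. Synthetic data: disk losing 2 GiB/hour.
def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp_s, value). Returns the fitted line's
    value `seconds_ahead` past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

GiB = 2**30
# 6 hours of hourly samples, starting at 40 GiB free, losing 2 GiB/hour
samples = [(h * 3600, (40 - 2 * h) * GiB) for h in range(7)]
projected = predict_linear(samples, 24 * 3600)  # 24 hours ahead
print(projected < 0)  # True: 28 GiB left, shrinking 48 GiB/day
```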

The for Clause: Your Anti-Noise Shield

Every alert should have a for duration. This prevents flapping.

  • Critical (page): minimum for of 2-5 minutes. Confirm it's real, not a blip.
  • Warning (ticket): minimum for of 10-30 minutes. Avoid noise during transient states.
  • Info (dashboard): don't alert. Display on a dashboard only.

Runbooks Are Mandatory

Every alert with severity: critical needs a runbook URL in its annotations:

annotations:
  runbook: "https://wiki.internal/runbooks/{{ $labels.alertname }}"

If you can't write a runbook for an alert, you probably shouldn't have the alert. An alert without a runbook is just anxiety with a timestamp.

Alertmanager Routing: Get It to the Right Person

Prometheus fires alerts. Alertmanager routes them. A well-structured routing tree ensures the right team gets paged for the right problems.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/HERE"

route:
  group_by: ['alertname', 'service', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-slack

  routes:
    # Critical alerts: PagerDuty immediately
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts: Slack channel
    - match:
        severity: warning
      receiver: team-slack
      group_wait: 1m
      repeat_interval: 8h

    # Team-specific routing
    - match_re:
        team: "platform|infra"
      receiver: platform-slack
    - match_re:
        team: "payments|billing"
      receiver: payments-pagerduty

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-general'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.severity | toUpper }}* - {{ .Annotations.summary }}
          {{ .Annotations.runbook }}
          {{ end }}

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: YOUR_PAGERDUTY_KEY
        severity: critical

  - name: team-slack
    slack_configs:
      - channel: '#alerts-warning'

  - name: platform-slack
    slack_configs:
      - channel: '#platform-alerts'

  - name: payments-pagerduty
    pagerduty_configs:
      - routing_key: PAYMENTS_TEAM_KEY

Key routing decisions:

  • group_wait: 10s for critical, 1m for warnings. Critical alerts need speed. Warnings can wait.
  • repeat_interval: 4h prevents alert storms. If the problem hasn't been fixed in 4 hours, remind again.
  • group_by: ['alertname', 'service'] groups related alerts. Five pods down on the same service sends one notification, not five.
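Grouping is easiest to reason about as a key function. A minimal sketch of what group_by does to a burst of alerts (plain Python, labels invented for illustration):

```python
# Toy model of Alertmanager's group_by: alerts sharing the same values
# for the grouping labels collapse into a single notification group.
from collections import defaultdict

GROUP_BY = ("alertname", "service")

# Five pods of the same service down at once
alerts = [
    {"alertname": "ServiceDown", "service": "api", "instance": f"api-{i}"}
    for i in range(5)
]

groups = defaultdict(list)
for alert in alerts:
    key = tuple(alert[label] for label in GROUP_BY)
    groups[key].append(alert)

print(len(alerts))  # 5 firing alerts...
print(len(groups))  # ...but only 1 notification group
```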

Inhibition Rules: Prevent Alert Cascading

When a service is down, you don't need latency and error rate alerts too. Inhibition suppresses child alerts when a parent fires.

# alertmanager.yml — inhibit_rules section
inhibit_rules:
  # If the service is down, suppress latency and error alerts
  - source_matchers:
      - alertname="ServiceDown"
    target_matchers:
      - alertname=~"HighLatency|HighErrorRate|TrafficDrop"
    equal: ['service']

  # If the cluster node is down, suppress pod-level alerts
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ['instance']

Without inhibition, a single node failure generates 30+ alerts — one for every pod, plus latency, plus error rate, plus traffic drop. With inhibition, you get one alert: "NodeDown."
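The rule itself is simple: drop a target alert whenever a matching source alert is firing and the `equal` labels line up. A minimal sketch of that check (plain Python, alert dicts invented for illustration):

```python
# Toy model of the first inhibit rule above: suppress latency/error
# alerts for any service whose ServiceDown alert is already firing.
import re

SOURCE_ALERTNAME = "ServiceDown"
TARGET_RE = re.compile(r"HighLatency|HighErrorRate|TrafficDrop")
EQUAL = ["service"]

firing = [
    {"alertname": "ServiceDown", "service": "api"},
    {"alertname": "HighLatency", "service": "api"},      # suppressed
    {"alertname": "HighErrorRate", "service": "api"},    # suppressed
    {"alertname": "HighLatency", "service": "billing"},  # different service
]

def inhibited(alert, firing):
    return any(
        src["alertname"] == SOURCE_ALERTNAME
        and TARGET_RE.fullmatch(alert["alertname"])
        and all(src.get(lbl) == alert.get(lbl) for lbl in EQUAL)
        for src in firing
    )

delivered = [a for a in firing if not inhibited(a, firing)]
print([a["alertname"] for a in delivered])
# ['ServiceDown', 'HighLatency']: the page plus the unrelated billing alert
```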

Testing Alerting Rules Before Deploying

Deploying broken alerting rules blinds your monitoring. Test them first.

Unit Testing with promtool

# Check syntax
promtool check rules alert-rules.yml

# Run unit tests
promtool test rules alert-tests.yml

The test file:

# alert-tests.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", code="500"}'
        values: '0+10x15'  # 10 errors per minute for 15 minutes
      - series: 'http_requests_total{service="api", code="200"}'
        values: '0+100x15' # 100 successes per minute

    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: critical
            exp_annotations:
              summary: "5xx error rate above 5% for api"

This creates synthetic metrics, runs them through your rules, and verifies the right alerts fire at the right time. Run it in CI on every change to your alerting rules.
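The values notation is compact: '0+10x15' means start at 0 and add 10 fifteen times. A small sketch expanding the '+' form of the notation and sanity-checking the ratio the test relies on (plain Python):

```python
# Expand promtool's 'start+stepxcount' series notation (the '+' form
# only) and confirm the synthetic error ratio crosses the 5% threshold.
def expand(notation: str) -> list:
    start_step, count = notation.split("x")
    start, step = start_step.split("+")
    return [float(start) + float(step) * i for i in range(int(count) + 1)]

errors = expand("0+10x15")     # [0.0, 10.0, 20.0, ..., 150.0]
successes = expand("0+100x15")

# The counters grow 10/min and 100/min, so rate() sees an error
# ratio of 10 / (10 + 100), comfortably above the 5% threshold.
ratio = 10 / (10 + 100)
print(f"{ratio:.1%}")  # 9.1%
```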

Staging Alerts

Before deploying a new critical alert, run it in "dry-run" mode:

- alert: HighLatencyNew
  expr: |
    histogram_quantile(0.99,
      sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 0.8
  for: 5m
  labels:
    severity: info  # Start as info — monitor before upgrading to critical
    experimental: "true"
  annotations:
    summary: "STAGING: p99 latency above 800ms for {{ $labels.service }}"

Run it as severity: info for a week. Check Alertmanager history for false positives. If it fires only on real issues, promote it to warning or critical.

Alerting Anti-Patterns

The Copy-Paste Alert. Someone finds a blog post with 50 alerting rules and dumps them all in. Half don't match your metric names. A quarter fire constantly. The rest never fire because the thresholds are wrong. Start with 5 alerts. Tune them. Add more only when you have a gap.

The Stale Alert. An alert that hasn't fired in 6 months is either perfectly tuned or completely broken. Check it. If the PromQL returns no data, the alert is dead weight.

# Which alerts fired or went pending at any point in the last 30 days;
# diff this list against your configured rules to find the dead weight
count by (alertname) (count_over_time(ALERTS[30d]))

The "FYI" Page. A critical alert that the on-call engineer looks at, shrugs, and goes back to sleep is a severity miscategorization. If it doesn't require immediate action, it's not critical.

No Deduplication. Five instances of the same service each firing the same alert generates five pages. Use group_by in Alertmanager to deduplicate.

Recording Rules for Alert Performance

Complex PromQL in alerting rules puts load on Prometheus at evaluation time. Pre-compute expensive queries with recording rules.

groups:
  - name: slo-recording
    interval: 30s
    rules:
      # Pre-compute error ratio — used by multiple alerts
      - record: service:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Pre-compute burn rate
      - record: service:error_budget_burn_rate:5m
        expr: |
          service:http_error_ratio:rate5m / (1 - 0.999)

Then simplify your alert expressions:

- alert: ErrorBudgetFastBurn
  expr: service:error_budget_burn_rate:5m > 14.4
  for: 2m
  labels:
    severity: critical

Cleaner, faster, and easier to maintain. Recording rules evaluate once. Multiple alerts referencing the same recording rule don't multiply the query cost.

Conclusion

Measure what matters. Everything else is noise. Start with SLO-based burn rate alerts, add symptom-based checks for critical user-facing metrics, and use predict_linear for capacity alerts. Every alert should have a for clause, a severity, and a runbook. Route alerts to the right team with Alertmanager. Suppress cascading noise with inhibition rules. Test your rules in CI before they go live.

The goal isn't zero alerts. The goal is zero false positives at 3 AM.

Riku Tanaka

SRE & Observability Engineer

If it's not measured, it doesn't exist. SLO-driven, metrics-obsessed, and the person who gets paged at 3 AM so you don't have to. Observability isn't optional.
