Prometheus Alerting Rules That Don't Wake You Up for Nothing
The Dashboard Doesn't Lie, But Your Alerts Can Mislead
At 2:47 AM, your phone buzzes. CPU at 85%. You drag yourself to the laptop, check Grafana, and discover it was a 30-second deployment spike that resolved itself. You go back to bed. This happens three times a week.
This is alert fatigue, and it's the number one reason on-call engineers burn out. The fix isn't fewer alerts — it's better alerts. Let me show you the patterns that actually work.
The Golden Rule: Alert on Symptoms, Not Causes
# BAD: Alerting on a cause
- alert: HighCPU
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU on {{ $labels.instance }}"
High CPU is a cause. Users don't care about your CPU. They care about whether the service works.
# GOOD: Alerting on a symptom
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1.0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "p99 latency above 1s for {{ $labels.service }}"
    runbook: "https://wiki.internal/runbooks/high-latency"
p99 latency above 1 second means users are experiencing slow responses. THAT is worth waking someone up for.
SLO-Based Alerting
The most reliable alerting pattern I've used ties directly to your SLOs.
Define Your SLOs
# Service: API Gateway
# SLO: 99.9% of requests complete successfully within 500ms
# Error budget: 0.1% = ~43 minutes of downtime per month
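The alerts below count requests with http_requests_total and a code label; under that assumption, the availability side of this SLO can be captured as a recording rule for later reuse. A minimal sketch (the group and rule names are illustrative):

groups:
  - name: slo-sli
    rules:
      # Ratio of non-5xx requests over the last 5 minutes (the availability SLI)
      - record: service:sli_availability:rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)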
Burn Rate Alerts
Instead of alerting when something crosses a static threshold, alert when you're burning through your error budget too fast.
# Fast burn: at 14.4x, 2% of the monthly error budget burns every hour
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 14.4 * 0.001
  for: 2m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Error budget burning 14.4x faster than allowed"
    description: "At this rate, the monthly error budget exhausts in ~2 days"
    runbook: "https://wiki.internal/runbooks/error-budget-burn"

# Slow burn: at 3x, the monthly error budget exhausts in ~10 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 3 * 0.001
  for: 15m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Error budget burning 3x faster than allowed"
    description: "At this rate, the monthly error budget exhausts in ~10 days"
Fast burn (14.4x) = page immediately, something is very wrong. Slow burn (3x) = warn during business hours, investigate before it becomes critical.
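If the fast-burn alert still catches the occasional blip, the multi-window variant from the Google SRE Workbook pages only when a short and a long window both exceed the threshold. A sketch building on the rule above (the alert name is illustrative):

- alert: ErrorBudgetFastBurnMultiWindow
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 14.4 * 0.001
    and
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * 0.001
  for: 2m
  labels:
    severity: critical

The 5m window keeps detection fast; the 1h window refuses to page on spikes that have already resolved.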
Essential Alert Patterns
Application Health
groups:
  - name: application
    rules:
      # Service is down
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

      # Error rate spike
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for {{ $labels.service }}"

      # Request rate drop (potential outage indicator)
      - alert: TrafficDrop
        expr: |
          sum(rate(http_requests_total[5m])) by (service)
          <
          sum(rate(http_requests_total[5m] offset 1h)) by (service) * 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic dropped 50%+ for {{ $labels.service }}"
Infrastructure
# Disk filling up (predictive)
- alert: DiskWillFill
  expr: |
    predict_linear(
      node_filesystem_avail_bytes{mountpoint="/"}[6h],
      24 * 3600
    ) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} will fill within 24h"

# Memory pressure
- alert: HighMemoryPressure
  expr: |
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage above 90% on {{ $labels.instance }}"
Notice the DiskWillFill alert uses predict_linear — it alerts BEFORE the disk fills, giving you hours to respond instead of minutes. If you're getting paged for full disks, your alerts need work.
The for Clause: Your Anti-Noise Shield
Every alert should have a for duration. This prevents flapping.
| Severity | Minimum for Duration | Rationale |
|---|---|---|
| Critical (page) | 2-5 minutes | Confirm it's real, not a blip |
| Warning (ticket) | 10-30 minutes | Avoid noise during transient states |
| Info (dashboard) | Don't alert | Display on dashboard only |
Runbooks Are Mandatory
Every alert with severity: critical needs a runbook URL in its annotations:
annotations:
  runbook: "https://wiki.internal/runbooks/{{ $labels.alertname }}"
If you can't write a runbook for an alert, you probably shouldn't have the alert. An alert without a runbook is just anxiety with a timestamp.
Alertmanager Routing: Get It to the Right Person
Prometheus fires alerts. Alertmanager routes them. A well-structured routing tree ensures the right team gets paged for the right problems.
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/HERE"

route:
  group_by: ['alertname', 'service', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-slack
  routes:
    # Critical alerts: PagerDuty immediately
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts: Slack channel
    - match:
        severity: warning
      receiver: team-slack
      group_wait: 1m
      repeat_interval: 8h
    # Team-specific routing
    - match_re:
        team: "platform|infra"
      receiver: platform-slack
    - match_re:
        team: "payments|billing"
      receiver: payments-pagerduty

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-general'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.severity | toUpper }}* - {{ .Annotations.summary }}
          {{ .Annotations.runbook }}
          {{ end }}
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: YOUR_PAGERDUTY_KEY
        severity: critical
  - name: team-slack
    slack_configs:
      - channel: '#alerts-warning'
  - name: platform-slack
    slack_configs:
      - channel: '#platform-alerts'
  - name: payments-pagerduty
    pagerduty_configs:
      - routing_key: PAYMENTS_TEAM_KEY
Key routing decisions:
- group_wait: 10s for critical, 1m for warnings. Critical alerts need speed; warnings can wait.
- repeat_interval: 4h on the default route prevents alert storms. If the problem hasn't been fixed in 4 hours, remind again.
- group_by: ['alertname', 'service', 'severity'] groups related alerts. Five pods down on the same service send one notification, not five.
Inhibition Rules: Prevent Alert Cascading
When a service is down, you don't need latency and error rate alerts too. Inhibition suppresses child alerts when a parent fires.
# alertmanager.yml — inhibit_rules section
inhibit_rules:
  # If the service is down, suppress latency and error alerts
  - source_matchers:
      - alertname="ServiceDown"
    target_matchers:
      - alertname=~"HighLatency|HighErrorRate|TrafficDrop"
    equal: ['service']
  # If the cluster node is down, suppress pod-level alerts
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ['instance']
Without inhibition, a single node failure generates 30+ alerts — one for every pod, plus latency, plus error rate, plus traffic drop. With inhibition, you get one alert: "NodeDown."
Testing Alerting Rules Before Deploying
Deploying broken alerting rules blinds your monitoring. Test them first.
Unit Testing with promtool
# Check syntax
promtool check rules alert-rules.yml
# Run unit tests
promtool test rules alert-tests.yml
The test file:
# alert-tests.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", code="500"}'
        values: '0+10x15'   # 10 errors per minute for 15 minutes
      - series: 'http_requests_total{service="api", code="200"}'
        values: '0+100x15'  # 100 successes per minute
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: critical
            exp_annotations:
              summary: "5xx error rate above 5% for api"
This creates synthetic metrics, runs them through your rules, and verifies the right alerts fire at the right time. Run it in CI on every change to your alerting rules.
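Wiring that into CI can be as small as the two promtool commands above. As an illustration, a GitHub Actions job might look like this (the workflow layout, file paths, and Prometheus version are assumptions; any CI system works the same way):

# .github/workflows/alerting-rules.yml (hypothetical layout)
name: alerting-rules
on:
  pull_request:
    paths: ['alerting/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Grab promtool from the official Prometheus release tarball
      - name: Install promtool
        run: |
          curl -sL https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz \
            | tar xz --strip-components=1 --wildcards '*/promtool'
      - name: Check rule syntax
        run: ./promtool check rules alerting/alert-rules.yml
      - name: Run rule unit tests
        run: ./promtool test rules alerting/alert-tests.yml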
Staging Alerts
Before deploying a new critical alert, run it in "dry-run" mode:
- alert: HighLatencyNew
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 0.8
  for: 5m
  labels:
    severity: info  # Start as info — monitor before upgrading to critical
    experimental: "true"
  annotations:
    summary: "STAGING: p99 latency above 800ms for {{ $labels.service }}"
Run it as severity: info for a week. Check Alertmanager history for false positives. If it fires only on real issues, promote it to warning or critical.
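To keep those experimental alerts out of the paging path while you watch them, a dedicated Alertmanager route can match on the experimental label. A sketch (receiver and channel names are assumptions):

route:
  routes:
    # Put this route ahead of the severity-based routes: first match wins
    - match:
        experimental: "true"
      receiver: staging-slack
      repeat_interval: 24h

receivers:
  - name: staging-slack
    slack_configs:
      - channel: '#alerts-staging'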
Alerting Anti-Patterns
The Copy-Paste Alert. Someone finds a blog post with 50 alerting rules and dumps them all in. Half don't match your metric names. A quarter fire constantly. The rest never fire because the thresholds are wrong. Start with 5 alerts. Tune them. Add more only when you have a gap.
The Stale Alert. An alert that hasn't fired in 6 months is either perfectly tuned or completely broken. Check it. If the PromQL returns no data, the alert is dead weight.
# Alerts that have fired at least once in the last 30 days; any configured
# rule missing from this list has been silent and deserves a look
count by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
The "FYI" Page. A critical alert that the on-call engineer looks at, shrugs, and goes back to sleep is a severity miscategorization. If it doesn't require immediate action, it's not critical.
No Deduplication. Five instances of the same service each firing the same alert generates five pages. Use group_by in Alertmanager to deduplicate.
Recording Rules for Alert Performance
Complex PromQL in alerting rules puts load on Prometheus at evaluation time. Pre-compute expensive queries with recording rules.
groups:
  - name: slo-recording
    interval: 30s
    rules:
      # Pre-compute error ratio — used by multiple alerts
      - record: service:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Pre-compute burn rate
      - record: service:error_budget_burn_rate:rate5m
        expr: |
          service:http_error_ratio:rate5m / (1 - 0.999)
Then simplify your alert expressions:
- alert: ErrorBudgetFastBurn
  expr: service:error_budget_burn_rate:rate5m > 14.4
  for: 2m
  labels:
    severity: critical
Cleaner, faster, and easier to maintain. Recording rules evaluate once. Multiple alerts referencing the same recording rule don't multiply the query cost.
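The same recording rule can back more than one alert; for instance, the HighErrorRate alert from earlier could reference it directly instead of repeating the raw query (a sketch, same 5% threshold):

- alert: HighErrorRate
  expr: service:http_error_ratio:rate5m > 0.05   # reuses the pre-computed ratio
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "5xx error rate above 5% for {{ $labels.service }}"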
Conclusion
Measure what matters. Everything else is noise. Start with SLO-based burn rate alerts, add symptom-based checks for critical user-facing metrics, and use predict_linear for capacity alerts. Every alert should have a for clause, a severity, and a runbook. Route alerts to the right team with Alertmanager. Suppress cascading noise with inhibition rules. Test your rules in CI before they go live.
The goal isn't zero alerts. The goal is zero false positives at 3 AM.