Prometheus Recording Rules: Fix Your Query Performance Before It Breaks Grafana
When Your Dashboard Takes 30 Seconds to Load
You open your Grafana dashboard. The panels spin. Ten seconds. Twenty seconds. Thirty seconds. Some panels time out entirely. You reload, same thing. The Prometheus server's CPU is pinned at 100%, grinding through a query that aggregates millions of time series across a 24-hour window.
This is the point where most teams either throw more hardware at Prometheus or reduce their retention period. Both miss the actual fix: recording rules.
Recording rules pre-compute expensive PromQL expressions at regular intervals and store the results as new time series. Instead of recalculating a complex aggregation every time someone loads a dashboard, Prometheus evaluates it once and serves the pre-computed result. The Google SRE book's guidance on monitoring applies here — your monitoring system should be fast and reliable. If querying it is slow, it fails at its job during the moments that matter most.
How Recording Rules Work
A recording rule evaluates a PromQL expression on a schedule (typically every evaluation interval, default 1 minute) and writes the result as a new time series with a name you define.
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: example
    interval: 1m
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
After this rule is loaded, querying job:http_requests_total:rate5m is a simple series lookup instead of an aggregation across potentially thousands of series. The original high-cardinality data still exists — the recording rule just creates a pre-computed summary alongside it.
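For the rule to load at all, prometheus.yml must point at the rules directory and Prometheus must be reloaded. A minimal sketch, assuming the rule files live under /etc/prometheus/rules/:

```yaml
# prometheus.yml — load every rule file in the directory
rule_files:
  - /etc/prometheus/rules/*.yml
```

Reload with a SIGHUP to the Prometheus process, or with an HTTP POST to the /-/reload endpoint if the server was started with --web.enable-lifecycle.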
Naming Convention
Prometheus has an official naming convention for recorded metrics. Follow it — your future self and your teammates will thank you.
level:metric:operations
- level: The labels preserved in the output (job, instance, namespace)
- metric: The original metric name
- operations: The operations applied (rate5m, sum, ratio)
# Examples following the convention
- record: job:http_requests_total:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)

- record: instance:node_cpu_seconds:ratio
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

- record: namespace:container_memory_usage_bytes:sum
  expr: sum(container_memory_working_set_bytes) by (namespace)
The Queries That Need Recording Rules
Not every query needs a recording rule. Target these patterns.
High-Cardinality Aggregations
When you aggregate across many series, the query cost grows with cardinality.
# BEFORE: This scans every HTTP metric across every pod, endpoint, method, status
# On a cluster with 500 pods and 50 endpoints, that's 25,000+ series per evaluation
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# AFTER: Pre-compute the aggregation
groups:
  - name: latency-recording
    interval: 30s
    rules:
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, service:http_request_duration_seconds_bucket:rate5m)
      - record: service:http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, service:http_request_duration_seconds_bucket:rate5m)
Now your Grafana panel queries service:http_request_duration_seconds:p99 — a direct series lookup instead of a multi-thousand-series aggregation.
SLO Calculations
SLO queries are evaluated constantly — by dashboards, alerts, and error budget reports. They should always be recording rules.
groups:
  - name: slo-recording
    interval: 1m
    rules:
      # Total requests per service (denominator for the SLI)
      - record: service:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Failed requests per service (numerator for the SLI)
      - record: service:http_requests_errors:rate5m
        expr: sum(rate(http_requests_total{status_class="5xx"}[5m])) by (service)

      # Availability SLI (success ratio)
      - record: service:http_requests:availability
        expr: |
          1 - (
            service:http_requests_errors:rate5m
            /
            service:http_requests_total:rate5m
          )

      # Error budget remaining, based on the 5m SLI (assuming a 99.9% SLO)
      - record: service:error_budget:remaining
        expr: |
          1 - (
            (1 - service:http_requests:availability)
            /
            (1 - 0.999)
          )

      # Burn rate — 1-hour fast burn
      - record: service:error_budget:burn_rate_1h
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[1h])) by (service)
            /
            sum(rate(http_requests_total[1h])) by (service)
          ) / (1 - 0.999)

      # Burn rate — 6-hour slow burn
      - record: service:error_budget:burn_rate_6h
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[6h])) by (service)
            /
            sum(rate(http_requests_total[6h])) by (service)
          ) / (1 - 0.999)
Your alerting rules can now reference these pre-computed series instead of re-evaluating the full expressions:
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetFastBurn
        expr: service:error_budget:burn_rate_1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} burning error budget at {{ $value }}x rate"
          runbook: "https://wiki.internal/runbooks/error-budget-burn"

      - alert: ErrorBudgetSlowBurn
        expr: service:error_budget:burn_rate_6h > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} slow-burning error budget at {{ $value }}x rate"
Node and Cluster Resource Summaries
Infrastructure dashboards that show cluster-wide resource usage hit every node exporter series.
groups:
  - name: infrastructure-recording
    interval: 1m
    rules:
      # CPU usage per node as a ratio
      - record: instance:node_cpu:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Memory usage per node as a ratio
      - record: instance:node_memory:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

      # Disk usage per node as a ratio
      - record: instance:node_filesystem:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          )

      # Cluster-wide CPU usage
      - record: cluster:node_cpu:ratio
        expr: avg(instance:node_cpu:ratio)

      # Namespace CPU usage in cores
      - record: namespace:container_cpu_usage:sum
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

      # Namespace memory usage in bytes
      - record: namespace:container_memory_usage:sum
        expr: |
          sum(container_memory_working_set_bytes) by (namespace)
Notice the layering. instance:node_cpu:ratio is used by cluster:node_cpu:ratio. Recording rules can reference other recording rules, creating an efficient computation graph.
Validating Your Rules
Before deploying, always validate with promtool:
promtool check rules /etc/prometheus/rules/recording_rules.yml

Checking /etc/prometheus/rules/recording_rules.yml
  SUCCESS: 12 rules found
You can also test the output of recording rules against sample data:
promtool test rules test_recording_rules.yml
# test_recording_rules.yml
rule_files:
  - recording_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status_class="2xx"}'
        values: "0+100x10"
      - series: 'http_requests_total{service="api", status_class="5xx"}'
        values: "0+1x10"
    promql_expr_test:
      - expr: service:http_requests:availability
        eval_time: 10m
        exp_samples:
          - labels: 'service:http_requests:availability{service="api"}'
            value: 0.9900990099
Test your recording rules like you test your code. A broken recording rule silently produces wrong data — your dashboards show green while the service is on fire.
Performance Impact and Monitoring
Recording rules consume resources on your Prometheus server. Monitor their evaluation.
# Rule evaluation duration (a summary exposed with quantile labels) —
# the p99 should sit well under your evaluation interval
prometheus_rule_evaluation_duration_seconds{quantile="0.99"}

# Rule groups that are taking too long
prometheus_rule_group_last_duration_seconds > 30

# Failed rule evaluations — should always be zero
sum(rate(prometheus_rule_evaluation_failures_total[5m])) by (rule_group)

# Total in-memory series — watch for growth after adding recording rules
prometheus_tsdb_head_series
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: RecordingRuleEvaluationSlow
        expr: prometheus_rule_group_last_duration_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} taking {{ $value }}s to evaluate"

      - alert: RecordingRuleEvaluationFailure
        expr: rate(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Recording rule evaluation failing for {{ $labels.rule_group }}"
          runbook: "https://wiki.internal/runbooks/prometheus-rule-failure"
When Not to Use Recording Rules
Recording rules are not free. Each one creates a new time series that Prometheus stores. Don't create recording rules for:
- Queries that run rarely — if only one person runs a query once a week, the storage cost of a recording rule outweighs the compute savings.
- Low-cardinality queries — aggregating 10 series is fast. Recording rules shine when aggregating thousands.
- Queries with variable time ranges — recording rules evaluate at a fixed interval. If your dashboard uses variable time ranges extensively, the recording rule might not match.
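The variable-time-range caveat is easiest to see with Grafana's $__rate_interval variable (a Grafana templating feature, not something defined in these rule files):

```promql
# The recording rule bakes a fixed 5m window into the stored series:
#   record: service:http_requests_total:rate5m
#   expr:   sum(rate(http_requests_total[5m])) by (service)

# A dashboard panel using a dynamic window has no recorded equivalent —
# the range stretches as the user zooms, so it must stay a live query:
sum(rate(http_requests_total[$__rate_interval])) by (service)
```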
The Practical Path
Start by identifying your slowest Grafana panels. Check the Prometheus query log or use Grafana's built-in query inspector to find queries that take more than a few seconds.
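If the query log is not already on, it is a single global setting in prometheus.yml (available in Prometheus 2.16+; the log path here is an assumption, put it wherever your log collection expects):

```yaml
# prometheus.yml — enable the query log to find expensive queries
global:
  query_log_file: /var/log/prometheus/query.log
```

Each entry records the query and timing information, so you can rank expressions by cost.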
For each slow query:
- Create a recording rule with proper naming convention.
- Validate with
promtool. - Deploy and wait for one evaluation interval.
- Update the Grafana panel to use the new recorded metric.
- Verify the panel loads in under a second.
Recording rules are one of the simplest, highest-impact optimizations in the Prometheus ecosystem. They turn your monitoring from something that struggles under load into something that stays fast when you need it most — during an incident, when everyone is staring at the same dashboard, and every second of query latency is a second of delayed response.
Organizing Recording Rules at Scale
As your recording rule library grows past 20-30 rules, organization becomes critical. A single monolithic rules file is unmaintainable. Structure your rules by domain:
/etc/prometheus/rules/
  recording/
    slo.yml             # SLO calculations
    infrastructure.yml  # Node and cluster resource summaries
    latency.yml         # Latency percentiles by service
    application.yml     # Application-specific business metrics
  alerting/
    slo-alerts.yml      # SLO burn-rate alerts
    infra-alerts.yml    # Infrastructure alerts
Each file should contain one groups block with a descriptive name. This makes it easy to identify which rule group is slow or failing in Prometheus metrics:
# recording/latency.yml
groups:
  - name: recording:latency
    interval: 30s
    rules:
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      # ... more rules
Choose one interval per group deliberately. The evaluation interval is a group-level setting, not a per-rule one, so you cannot mix 15s and 60s rules inside a single group. If some rules need a tighter cadence than others, split them into separate groups rather than forcing everything to evaluate at the fastest rate.
Validating All Rules in CI
Add promtool validation to your CI pipeline so broken rules never reach Prometheus:
# .github/workflows/prometheus-rules.yml
name: Validate Prometheus Rules

on:
  pull_request:
    paths: ["prometheus/rules/**"]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install promtool
        run: |
          VERSION="2.53.0"
          curl -sL "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz" | \
            tar xz --strip-components=1 -C /usr/local/bin "prometheus-${VERSION}.linux-amd64/promtool"

      - name: Check rule syntax
        run: promtool check rules prometheus/rules/**/*.yml

      - name: Run unit tests
        run: promtool test rules prometheus/tests/**/*.yml
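promtool validates syntax but not naming discipline: a rule named my_fast_rate passes check rules while ignoring the level:metric:operations convention. A small stand-alone lint can run in the same pipeline (a sketch in Python purely for illustration; the regex encodes one reasonable reading of the convention, adjust it to your team's rules):

```python
import re

# A recorded metric name should have exactly three colon-separated parts:
# level:metric:operations, e.g. job:http_requests_total:rate5m.
CONVENTION = re.compile(
    r"[a-zA-Z_][a-zA-Z0-9_]*"    # level: labels preserved in the output
    r":[a-zA-Z_][a-zA-Z0-9_]*"   # metric: the original metric name
    r":[a-zA-Z0-9_]+"            # operations: rate5m, sum, ratio, ...
)

def follows_convention(record_name: str) -> bool:
    """True if record_name matches level:metric:operations."""
    return CONVENTION.fullmatch(record_name) is not None

if __name__ == "__main__":
    print(follows_convention("job:http_requests_total:rate5m"))  # True
    print(follows_convention("my_fast_rate"))                    # False
```

Feed it every record: name extracted from your rule files and fail the build on any False.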
Migrating Grafana Dashboards to Use Recording Rules
The most tedious part of adopting recording rules is updating every Grafana panel that used the original expressions. Here's a systematic approach.
First, identify which panels need updating. Export your dashboard JSON and search for the original expressions:
# Find all Grafana panels using the expensive original query
curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "https://grafana.internal/api/search?type=dash-db" | \
  jq -r '.[].uid' | while read -r uid; do
  curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
    "https://grafana.internal/api/dashboards/uid/$uid" | \
    jq -r --arg uid "$uid" \
      '.dashboard.panels[]?
       | select(any(.targets[]?; (.expr // "") | test("http_request_duration_seconds_bucket")))
       | "Dashboard: \($uid) Panel: \(.title)"'
done
Then update panels one at a time. Replace the raw expression with the recorded metric:
# Before (in Grafana panel query)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# After
service:http_request_duration_seconds:p99
The panel should load in milliseconds instead of seconds. If the values don't match exactly, check whether the recording rule's interval is producing slightly different aggregation windows than what the dashboard was using. A 30-second evaluation interval produces smoother curves than the dashboard's default 15-second step.
Verifying Recording Rule Accuracy
After creating a recording rule, verify its output matches the original expression:
# Compare recorded metric vs live calculation
# These should produce nearly identical results
# (small differences are expected due to evaluation timing)
# Recorded value
service:http_request_duration_seconds:p99{service="api"}
# Live calculation
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
)
Run both side by side in Grafana for a few hours. If the delta between them is consistently greater than 1-2%, the recording rule may have different label matching or aggregation than the original. Fix the rule expression before switching dashboards over.
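One way to watch the delta directly rather than eyeballing two panels (sketched against the metric names used above; note the by (le, service) grouping so the label sets match for vector arithmetic, and the ~1-2% threshold is a judgment call):

```promql
# Relative difference between the recorded series and the live calculation.
# Sustained values above ~0.01-0.02 suggest the rule's labels or
# aggregation differ from the original expression.
abs(
  service:http_request_duration_seconds:p99{service="api"}
  -
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le, service)
  )
)
/
service:http_request_duration_seconds:p99{service="api"}
```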
Build your recording rules like you build your SLOs: methodically, with clear ownership, and with monitoring to confirm they're working correctly.