DevOpsil
Monitoring

Prometheus Recording Rules: Fix Your Query Performance Before It Breaks Grafana

Riku Tanaka · 10 min read

When Your Dashboard Takes 30 Seconds to Load

You open your Grafana dashboard. The panels spin. Ten seconds. Twenty seconds. Thirty seconds. Some panels time out entirely. You reload, same thing. The Prometheus server's CPU is pinned at 100%, grinding through a query that aggregates millions of time series across a 24-hour window.

This is the point where most teams either throw more hardware at Prometheus or reduce their retention period. Both miss the actual fix: recording rules.

Recording rules pre-compute expensive PromQL expressions at regular intervals and store the results as new time series. Instead of recalculating a complex aggregation every time someone loads a dashboard, Prometheus evaluates it once and serves the pre-computed result. The Google SRE book's guidance on monitoring applies here — your monitoring system should be fast and reliable. If querying it is slow, it fails at its job during the moments that matter most.

How Recording Rules Work

A recording rule evaluates a PromQL expression on a schedule (typically every evaluation interval, default 1 minute) and writes the result as a new time series with a name you define.

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: example
    interval: 1m
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

After this rule is loaded, querying job:http_requests_total:rate5m is a simple series lookup instead of an aggregation across potentially thousands of series. The original high-cardinality data still exists — the recording rule just creates a pre-computed summary alongside it.
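To make the pre-computation concrete, here is a toy Python model of what this rule evaluates once per interval (the sample values are invented, and the rate calculation is simplified — real PromQL rate() also extrapolates to the window boundaries):

```python
# Per-series counter samples over a 5-minute window, keyed by (job, pod)
samples = {
    ("api", "pod-1"): (1000, 1300),  # (value 5m ago, value now)
    ("api", "pod-2"): (2000, 2600),
    ("web", "pod-3"): (500, 650),
}
window_seconds = 300

# rate(http_requests_total[5m]): per-second increase for each series
# (simplified: plain delta / window, no extrapolation)
rates = {k: (end - start) / window_seconds for k, (start, end) in samples.items()}

# sum(...) by (job): drop the pod label, keep job
by_job = {}
for (job, _pod), per_second in rates.items():
    by_job[job] = by_job.get(job, 0.0) + per_second

print(by_job)  # {'api': 3.0, 'web': 0.5}
```

The rule does this work once per evaluation; every dashboard load afterwards reads the handful of result series instead of re-scanning all the inputs.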

Naming Convention

Prometheus has an official naming convention for recorded metrics. Follow it — your future self and your teammates will thank you.

level:metric:operations
  • level: The labels preserved in the output (job, instance, namespace)
  • metric: The original metric name
  • operations: The operations applied (rate5m, sum, ratio)

# Examples following the convention
- record: job:http_requests_total:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)

- record: instance:node_cpu_seconds:ratio
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

- record: namespace:container_memory_usage_bytes:sum
  expr: sum(container_memory_working_set_bytes) by (namespace)

The Queries That Need Recording Rules

Not every query needs a recording rule. Target these patterns.

High-Cardinality Aggregations

When you aggregate across many series, the query cost grows with cardinality.
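As a back-of-envelope illustration of why (all label cardinalities here are hypothetical, not measured):

```python
# Each label dimension multiplies the number of series a raw query must scan
pods, endpoints = 500, 50
methods, status_classes = 4, 3   # hypothetical extra label dimensions
le_buckets = 12                  # one series per histogram bucket boundary

base = pods * endpoints                              # 25,000 combinations already
full = base * methods * status_classes * le_buckets  # every dimension multiplies
print(f"{base:,} -> {full:,} bucket series per evaluation")
```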

# BEFORE: This scans every HTTP metric across every pod, endpoint, method, status
# On a cluster with 500 pods and 50 endpoints, that's 25,000+ series per evaluation
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# AFTER: Pre-compute the aggregation
groups:
  - name: latency-recording
    interval: 30s
    rules:
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)

      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, service:http_request_duration_seconds_bucket:rate5m)

      - record: service:http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, service:http_request_duration_seconds_bucket:rate5m)

Now your Grafana panel queries service:http_request_duration_seconds:p99 — a direct series lookup instead of a multi-thousand-series aggregation.
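For intuition on the second layer, here is a toy version of what histogram_quantile does with the pre-aggregated buckets (simplified linear interpolation; the bucket boundaries and counts are invented, and the real function also handles the +Inf bucket and NaN edge cases):

```python
# Cumulative (le, rate) pairs for one service, as produced by the :rate5m rule
buckets = [(0.1, 500.0), (0.5, 900.0), (1.0, 990.0), (2.5, 1000.0)]

def toy_histogram_quantile(q, buckets):
    """Linear interpolation inside the bucket containing the target rank."""
    target = q * buckets[-1][1]
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= target:
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(toy_histogram_quantile(0.99, buckets))  # 1.0 — p99 lands at the 1.0s boundary
```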

SLO Calculations

SLO queries are evaluated constantly — by dashboards, alerts, and error budget reports. They should always be recording rules.

groups:
  - name: slo-recording
    interval: 1m
    rules:
      # Total requests per service (denominator for SLI)
      - record: service:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Failed requests per service (numerator for SLI)
      - record: service:http_requests_errors:rate5m
        expr: sum(rate(http_requests_total{status_class="5xx"}[5m])) by (service)

      # Availability SLI (success ratio)
      - record: service:http_requests:availability
        expr: |
          1 - (
            service:http_requests_errors:rate5m
            /
            service:http_requests_total:rate5m
          )

      # Error budget remaining at the current 5m error rate (99.9% SLO;
      # a true 30-day budget calculation would use [30d] windows)
      - record: service:error_budget:remaining
        expr: |
          1 - (
            (1 - service:http_requests:availability)
            /
            (1 - 0.999)
          )

      # Burn rate — 1-hour fast burn
      - record: service:error_budget:burn_rate_1h
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[1h])) by (service)
            /
            sum(rate(http_requests_total[1h])) by (service)
          ) / (1 - 0.999)

      # Burn rate — 6-hour slow burn
      - record: service:error_budget:burn_rate_6h
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[6h])) by (service)
            /
            sum(rate(http_requests_total[6h])) by (service)
          ) / (1 - 0.999)

Your alerting rules can now reference these pre-computed series instead of re-evaluating the full expressions:

groups:
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetFastBurn
        expr: service:error_budget:burn_rate_1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} burning error budget at {{ $value }}x rate"
          runbook: "https://wiki.internal/runbooks/error-budget-burn"

      - alert: ErrorBudgetSlowBurn
        expr: service:error_budget:burn_rate_6h > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} slow-burning error budget at {{ $value }}x rate"
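The thresholds 14.4 and 6 are not arbitrary: they follow the multiwindow burn-rate pattern from the Google SRE Workbook, where each alert fires when a chosen fraction of the 30-day budget would be spent within its window. The arithmetic:

```python
# Burn rate N = consuming error budget N times faster than the SLO allows.
# Threshold = (fraction of budget spent) * (SLO window) / (alert window).
def burn_rate_threshold(budget_fraction_spent, window_hours, slo_window_hours=30 * 24):
    return budget_fraction_spent * slo_window_hours / window_hours

fast = burn_rate_threshold(0.02, 1)  # 2% of the monthly budget gone in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of the monthly budget gone in 6 hours
print(fast, slow)  # 14.4 6.0
```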

Node and Cluster Resource Summaries

Infrastructure dashboards that show cluster-wide resource usage hit every node exporter series.

groups:
  - name: infrastructure-recording
    interval: 1m
    rules:
      # CPU usage per node as a ratio
      - record: instance:node_cpu:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Memory usage per node as a ratio
      - record: instance:node_memory:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

      # Disk usage per node as a ratio
      - record: instance:node_filesystem:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          )

      # Cluster-wide CPU usage
      - record: cluster:node_cpu:ratio
        expr: avg(instance:node_cpu:ratio)

      # Namespace CPU usage in cores
      - record: namespace:container_cpu_usage:sum
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

      # Namespace memory usage in bytes
      - record: namespace:container_memory_usage:sum
        expr: |
          sum(container_memory_working_set_bytes) by (namespace)

Notice the layering. instance:node_cpu:ratio is used by cluster:node_cpu:ratio. Recording rules can reference other recording rules, creating an efficient computation graph.
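The layering can be pictured as a two-stage pipeline; a toy sketch with invented idle-CPU rates:

```python
# Stage 1 output — what instance:node_cpu:ratio records per node
idle_rate = {"node-a": 0.75, "node-b": 0.55}  # avg rate of idle CPU seconds
instance_cpu_ratio = {node: 1 - idle for node, idle in idle_rate.items()}

# Stage 2 — cluster:node_cpu:ratio reads stage 1's output, never the raw
# node_cpu_seconds_total series, so its cost is O(nodes), not O(nodes x cpus x modes)
cluster_cpu_ratio = sum(instance_cpu_ratio.values()) / len(instance_cpu_ratio)
print(cluster_cpu_ratio)  # 0.35
```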

Validating Your Rules

Before deploying, always validate with promtool:

promtool check rules /etc/prometheus/rules/recording_rules.yml
Checking /etc/prometheus/rules/recording_rules.yml
  SUCCESS: 12 rules found

You can also test the output of recording rules against sample data:

promtool test rules test_recording_rules.yml

# test_recording_rules.yml
rule_files:
  - recording_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status_class="2xx"}'
        values: "0+100x10"
      - series: 'http_requests_total{service="api", status_class="5xx"}'
        values: "0+1x10"

    promql_expr_test:
      - expr: service:http_requests:availability
        eval_time: 10m
        exp_samples:
          - labels: 'service:http_requests:availability{service="api"}'
            value: 0.9900990099

Test your recording rules like you test your code. A broken recording rule silently produces wrong data — your dashboards show green while the service is on fire.
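It's worth sanity-checking the expected value by hand. With the synthetic series above, the 2xx counter climbs 100 per minute and the 5xx counter 1 per minute, so any rate window sees 1 error per 101 requests:

```python
error_rate = 1 / 60    # 5xx: +1 per minute, expressed per second
total_rate = 101 / 60  # 2xx + 5xx combined

availability = 1 - error_rate / total_rate  # = 1 - 1/101 = 100/101
print(round(availability, 10))  # 0.9900990099 — the exp_samples value above
```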

Performance Impact and Monitoring

Recording rules consume resources on your Prometheus server. Monitor their evaluation.

# Rule evaluation duration — should be well under your evaluation interval
# (this metric is a summary, not a histogram, so query its quantile series directly)
prometheus_rule_evaluation_duration_seconds{quantile="0.99"}

# Rules that are taking too long
prometheus_rule_group_last_duration_seconds > 30

# Failed rule evaluations — should always be zero
sum(rate(prometheus_rule_evaluation_failures_total[5m])) by (rule_group)

# Series created by recording rules
prometheus_tsdb_head_series

Alert when evaluations run slow or fail:

groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: RecordingRuleEvaluationSlow
        expr: prometheus_rule_group_last_duration_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} taking {{ $value }}s to evaluate"

      - alert: RecordingRuleEvaluationFailure
        expr: rate(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Recording rule evaluation failing for {{ $labels.rule_group }}"
          runbook: "https://wiki.internal/runbooks/prometheus-rule-failure"

When Not to Use Recording Rules

Recording rules are not free. Each one creates a new time series that Prometheus stores. Don't create recording rules for:

  • Queries that run rarely — if only one person runs a query once a week, the storage cost of a recording rule outweighs the compute savings.
  • Low-cardinality queries — aggregating 10 series is fast. Recording rules shine when aggregating thousands.
  • Queries with variable time ranges — a recording rule bakes its range selector (e.g. [5m]) into the stored series. If your dashboards vary the range via template variables or $__rate_interval, the recorded metric can't follow along.

The Practical Path

Start by identifying your slowest Grafana panels. Check the Prometheus query log or use Grafana's built-in query inspector to find queries that take more than a few seconds.
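If the query log is enabled (the query_log_file setting in the global config), each entry is a JSON line. A sketch that ranks entries by evaluation time — the field names (params.query, stats.timings.evalTotalTime) match the query-log format at the time of writing, but verify against your Prometheus version:

```python
import json

# Two sample query-log lines (abbreviated); real files hold one JSON object per line
log_lines = [
    '{"params":{"query":"up"},"stats":{"timings":{"evalTotalTime":0.002}}}',
    '{"params":{"query":"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",'
    '"step":15},"stats":{"timings":{"evalTotalTime":8.4}}}',
]

entries = [json.loads(line) for line in log_lines]
entries.sort(key=lambda e: e["stats"]["timings"]["evalTotalTime"], reverse=True)

# The slowest queries are the best recording-rule candidates
for e in entries[:10]:
    t = e["stats"]["timings"]["evalTotalTime"]
    print(f"{t:8.3f}s  {e['params']['query'][:70]}")
```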

For each slow query:

  1. Create a recording rule with proper naming convention.
  2. Validate with promtool.
  3. Deploy and wait for one evaluation interval.
  4. Update the Grafana panel to use the new recorded metric.
  5. Verify the panel loads in under a second.

Recording rules are one of the simplest, highest-impact optimizations in the Prometheus ecosystem. They turn your monitoring from something that struggles under load into something that stays fast when you need it most — during an incident, when everyone is staring at the same dashboard, and every second of query latency is a second of delayed response.

Organizing Recording Rules at Scale

As your recording rule library grows past 20-30 rules, organization becomes critical. A single monolithic rules file is unmaintainable. Structure your rules by domain:

/etc/prometheus/rules/
  recording/
    slo.yml              # SLO calculations
    infrastructure.yml   # Node and cluster resource summaries
    latency.yml          # Latency percentiles by service
    application.yml      # Application-specific business metrics
  alerting/
    slo-alerts.yml       # SLO burn-rate alerts
    infra-alerts.yml     # Infrastructure alerts

Each file should contain one groups block with a descriptive name. This makes it easy to identify which rule group is slow or failing in Prometheus metrics:

# recording/latency.yml
groups:
  - name: recording:latency
    interval: 30s
    rules:
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      # ... more rules

Use one interval per group. In Prometheus, interval is a group-level setting — individual rules cannot override it — so if some rules need 15s freshness while others are fine at 60s, split them into separate groups rather than forcing everything to the faster cadence.

Validating All Rules in CI

Add promtool validation to your CI pipeline so broken rules never reach Prometheus:

# .github/workflows/prometheus-rules.yml
name: Validate Prometheus Rules
on:
  pull_request:
    paths: ["prometheus/rules/**"]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install promtool
        run: |
          VERSION="2.53.0"
          curl -sL "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz" | \
            tar xz --strip-components=1 -C /usr/local/bin "prometheus-${VERSION}.linux-amd64/promtool"

      - name: Check rule syntax
        run: find prometheus/rules -name '*.yml' -print0 | xargs -0 promtool check rules

      - name: Run unit tests
        run: find prometheus/tests -name '*.yml' -print0 | xargs -0 promtool test rules

Migrating Grafana Dashboards to Use Recording Rules

The most tedious part of adopting recording rules is updating every Grafana panel that used the original expressions. Here's a systematic approach.

First, identify which panels need updating. Export your dashboard JSON and search for the original expressions:

# Find all Grafana panels using the expensive original query
curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "https://grafana.internal/api/search?type=dash-db" | \
  jq -r '.[].uid' | while read uid; do
    curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
      "https://grafana.internal/api/dashboards/uid/$uid" | \
      jq -r --arg uid "$uid" \
      '.dashboard.panels[]?
       | select(any(.targets[]?; (.expr // "") | test("http_request_duration_seconds_bucket")))
       | "Dashboard: \($uid) Panel: \(.title)"'
done

Then update panels one at a time. Replace the raw expression with the recorded metric:

# Before (in Grafana panel query)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# After
service:http_request_duration_seconds:p99

The panel should load in milliseconds instead of seconds. If the values don't match exactly, check whether the recording rule's evaluation timing lines up with the panel's query step: a rule evaluated every 30 seconds yields at most one new point per 30 seconds, so a panel stepping every 15 seconds will render slightly different curves than the live query did. Small, consistent offsets are expected; large divergence means the rule expression differs from the original.

Verifying Recording Rule Accuracy

After creating a recording rule, verify its output matches the original expression:

# Compare recorded metric vs live calculation
# These should produce nearly identical results
# (small differences are expected due to evaluation timing)

# Recorded value
service:http_request_duration_seconds:p99{service="api"}

# Live calculation
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
)

Run both side by side in Grafana for a few hours. If the delta between them is consistently greater than 1-2%, the recording rule may have different label matching or aggregation than the original. Fix the rule expression before switching dashboards over.
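A small helper makes that comparison mechanical (the sample numbers are invented; in practice, pull both series from the Prometheus HTTP API at identical timestamps):

```python
def relative_delta(recorded, live):
    return abs(recorded - live) / live

# (recorded p99, live p99) pairs sampled at the same evaluation timestamps
pairs = [(0.01201, 0.01200), (0.01152, 0.01148)]

worst = max(relative_delta(rec, live) for rec, live in pairs)
print(f"worst delta: {worst:.2%}")  # keep this under the 1-2% band before cutting over
```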

Build your recording rules like you build your SLOs: methodically, with clear ownership, and with monitoring to confirm they're working correctly.

Riku Tanaka

SRE & Observability Engineer

If it's not measured, it doesn't exist. SLO-driven, metrics-obsessed, and the person who gets paged at 3 AM so you don't have to. Observability isn't optional.
