
Designing Grafana Dashboards That SREs Actually Use

Riku Tanaka · 9 min read

Most Dashboards Are Just Expensive Wallpaper

Walk into any engineering office and you'll see a wall of Grafana dashboards. Dozens of panels, rainbow gradients, numbers ticking. It looks impressive. But ask someone what any of those panels mean for the user, and the room goes quiet.

The Google SRE book frames it well: monitoring should answer two questions. What's broken? And why? If your dashboard doesn't clearly address one of those, it's decoration.

Let me walk through a systematic approach to dashboard design that I've refined over years of on-call rotations. It starts with structure, not aesthetics.

The Four Golden Signals as Your Foundation

Every service dashboard should begin with the four golden signals: latency, traffic, errors, and saturation. This isn't optional — it's the minimum viable dashboard.

Here's the PromQL for each, assuming a standard HTTP service instrumented with Prometheus:

Latency — The User's Experience

# P50 latency over 5 minutes
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)

# P99 latency over 5 minutes
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)

Display P50 and P99 on the same panel. The gap between them tells a story. A tight gap means consistent performance. A wide gap means tail latency is punishing a subset of users — the kind of thing averages hide.

Traffic — The Demand Signal

# Requests per second by status class
sum(rate(http_requests_total{service="$service"}[5m])) by (status_class)

Break traffic down by status class (2xx, 4xx, 5xx), not individual codes. You want to see shape, not noise. A sudden drop in 2xx traffic is often a more reliable incident signal than a spike in 5xx.
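The queries in this article assume a status_class label on http_requests_total. If your instrumentation exposes only a raw status code label, you can derive the class with a recording rule. A sketch, assuming the raw label is named code (adjust to whatever your client library emits):

```yaml
# Recording rule sketch: collapses raw HTTP codes (200, 404, 503, ...)
# into status classes (2xx, 4xx, 5xx). "code" is an assumed label name.
groups:
  - name: status-class
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, status_class) (
            label_replace(
              rate(http_requests_total[5m]),
              "status_class", "${1}xx", "code", "([0-9]).."
            )
          )
```

label_replace copies the first captured digit of the code into a new status_class label, so a 503 becomes 5xx before aggregation.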

Errors — What's Actually Broken

# Error rate as a percentage of total traffic
sum(rate(http_requests_total{service="$service", status_class="5xx"}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
* 100

Always express errors as a ratio. Raw error counts are meaningless without traffic context. Ten errors per second at 100 RPS is a 10% failure rate. Ten errors per second at 100,000 RPS is a rounding error.

Saturation — How Close to the Cliff

# Container CPU usage as a fraction of limit
sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", pod=~"$service.*"}) by (pod)

# Container memory usage as a fraction of limit
sum(container_memory_working_set_bytes{pod=~"$service.*"}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory", pod=~"$service.*"}) by (pod)

Saturation panels should show resources relative to their limits, not absolute values. A container using 2 GiB of memory means nothing without knowing its limit is 2.5 GiB.

Dashboard Hierarchy: Three Layers

One dashboard for everything is a failure mode. Structure your dashboards in three layers.

Layer 1: The Service Overview

This is the dashboard that lives on the wall. It answers one question: is the service healthy right now?

  • Four golden signals, one row
  • SLO burn rate indicator (green/yellow/red)
  • Active alerts count
  • Time range: last 1 hour

No drill-down complexity. No per-pod breakdowns. If someone glances at this for five seconds, they should know if something needs attention.

Layer 2: The Triage Dashboard

This is where the on-call engineer goes when Layer 1 shows a problem. It answers: where is the problem?

  • Golden signals broken down by endpoint, pod, or region
  • Dependency health (upstream/downstream latency and error rates)
  • Recent deployments overlay
  • Time range: last 6 hours

# Latency by endpoint to isolate the slow path
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le, endpoint)
)

# Dependency latency to check if the problem is upstream
histogram_quantile(0.99,
  sum(rate(grpc_client_handling_seconds_bucket{grpc_service=~".*"}[5m])) by (le, grpc_service)
)

Layer 3: The Debug Dashboard

This is the deep dive. Per-pod resource usage, goroutine counts, connection pool states, garbage collection metrics. This dashboard is dense and that's fine — it's for focused investigation, not scanning.

# Go GC pause duration for debugging memory pressure
sum(rate(go_gc_duration_seconds_sum{service="$service"}[5m])) by (pod)
/
sum(rate(go_gc_duration_seconds_count{service="$service"}[5m])) by (pod)

# Open file descriptors approaching limit
process_open_fds{service="$service"}
/
process_max_fds{service="$service"}

Practical Grafana Configuration

Use Variables, Not Hardcoded Labels

Every dashboard should start with a service variable at the top. This makes dashboards reusable across teams.

{
  "name": "service",
  "type": "query",
  "query": "label_values(http_requests_total, service)",
  "refresh": 2,
  "sort": 1
}

Set Sensible Thresholds with Overrides

Color thresholds should reflect your SLOs, not arbitrary values. If your latency SLO is P99 under 300ms, set your threshold there:

# Grafana panel threshold configuration
thresholds:
  mode: absolute
  steps:
    - color: green
      value: null
    - color: yellow
      value: 200    # 66% of SLO — early warning
    - color: red
      value: 300    # SLO boundary

Annotations for Deploy Markers

Overlaying deployment events on your dashboards is one of the highest-value, lowest-effort improvements you can make. Most incidents correlate with changes.

# Use Grafana annotations API from your CI pipeline
# POST /api/annotations
# {
#   "dashboardUID": "svc-overview",
#   "time": 1711000000000,
#   "tags": ["deploy", "v2.14.3"],
#   "text": "Deployed v2.14.3 by ci-pipeline"
# }

In your CI/CD pipeline, add a step after deployment:

# GitHub Actions step to annotate Grafana
- name: Annotate Grafana
  run: |
    curl -s -X POST "$GRAFANA_URL/api/annotations" \
      -H "Authorization: Bearer $GRAFANA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "tags": ["deploy", "${{ github.sha }}"],
        "text": "Deploy ${{ github.ref_name }} by ${{ github.actor }}"
      }'

Anti-Patterns to Avoid

The God Dashboard. One dashboard with 40 panels covering everything from CPU to business metrics. Nobody knows where to look. Split by the three-layer model.

Gauges and single-stats everywhere. A gauge showing current CPU percentage tells you nothing about trend. Use time-series graphs as the default. Reserve single-stats for binary states (up/down) and SLO compliance percentages.

No time-range discipline. A dashboard with one panel showing 5-minute data and another showing 24-hour data on the same row creates confusion. Keep panels in the same row at the same time range.

Pretty but meaningless. Pie charts showing request distribution by status code look nice in demos. They tell an on-call engineer nothing about whether to page someone. Every panel should have an answer to: what do I do when this looks wrong?

The SLO Burn Rate Panel

This is the single most important panel on your overview dashboard. It tells you how fast you're consuming your error budget.

# SLO burn rate over 1 hour (fast burn)
(
  sum(rate(http_requests_total{service="$service", status_class="5xx"}[1h]))
  /
  sum(rate(http_requests_total{service="$service"}[1h]))
) / (1 - 0.999)

# SLO burn rate over 6 hours (slow burn)
(
  sum(rate(http_requests_total{service="$service", status_class="5xx"}[6h]))
  /
  sum(rate(http_requests_total{service="$service"}[6h]))
) / (1 - 0.999)

A burn rate of 1 means you're consuming error budget exactly at the pace that would exhaust it by month's end. A burn rate above 14.4 over a 1-hour window is a fast burn — that warrants a page. A burn rate above 6 over a 6-hour window is a slow burn — that warrants a ticket.

These thresholds come directly from the Google SRE Workbook's multi-window, multi-burn-rate alerting approach. They're not arbitrary.
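The same expressions can drive alerting directly. A sketch of the multi-window, multi-burn-rate rules for a 99.9% SLO over a 30-day window; the service label value "api" is illustrative, since alert rules can't use dashboard variables, and the short confirmation windows (5m, 30m) follow the workbook's pairing:

```yaml
# Prometheus alerting rule sketch. Each alert requires both a long and a
# short window to exceed the threshold, so alerts fire quickly but also
# resolve quickly once the error rate recovers.
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[1h]))
            / sum(rate(http_requests_total{service="api"}[1h]))
          ) / (1 - 0.999) > 14.4
          and
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m]))
          ) / (1 - 0.999) > 14.4
        labels:
          severity: page
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[6h]))
            / sum(rate(http_requests_total{service="api"}[6h]))
          ) / (1 - 0.999) > 6
          and
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[30m]))
            / sum(rate(http_requests_total{service="api"}[30m]))
          ) / (1 - 0.999) > 6
        labels:
          severity: ticket
```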

Dashboard as Code: Grafana Provisioning

Stop building dashboards by clicking around in the UI. Provision them from code so they're versioned, reviewable, and reproducible.

Provisioning Configuration

# grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Service Dashboards"
    type: file
    disableDeletion: false
    editable: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

JSON Dashboard Model

{
  "dashboard": {
    "title": "API Service Overview",
    "uid": "api-svc-overview",
    "tags": ["sre", "api", "golden-signals"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"$service\"}[5m])) by (status_class)",
            "legendFormat": "{{status_class}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "drawStyle": "line",
              "fillOpacity": 10
            },
            "unit": "reqps"
          }
        }
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"$service\",status_class=\"5xx\"}[5m])) / sum(rate(http_requests_total{service=\"$service\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      }
    ]
  }
}

Store these JSON files in Git. Deploy them via a ConfigMap mounted into Grafana's provisioning directory. Every dashboard change goes through a PR.
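For the Kubernetes deployment path, the ConfigMap looks roughly like this; names and namespace are illustrative, and the dashboard JSON body is trimmed:

```yaml
# ConfigMap sketch: one key per dashboard JSON file. Mount it at the
# path the provisioning config points to (/var/lib/grafana/dashboards).
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: observability
data:
  api-overview.json: |
    {
      "title": "API Service Overview",
      "uid": "api-svc-overview"
    }

# In the Grafana Deployment spec:
#   volumes:
#     - name: dashboards
#       configMap:
#         name: grafana-dashboards
#   volumeMounts:
#     - name: dashboards
#       mountPath: /var/lib/grafana/dashboards
```

With updateIntervalSeconds set, Grafana re-reads the directory periodically, so dashboard changes roll out when the ConfigMap updates.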

Terraform for Grafana Dashboards

resource "grafana_dashboard" "api_overview" {
  config_json = file("${path.module}/dashboards/api-overview.json")
  folder      = grafana_folder.sre.id
  overwrite   = true
}

resource "grafana_folder" "sre" {
  title = "SRE Dashboards"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus.observability:9090"

  json_data_encoded = jsonencode({
    httpMethod = "POST"
    timeInterval = "15s"
  })
}

Infrastructure as code for your dashboards, data sources, and folder structure. Reproducible across environments.

Units and Legends Matter

A panel showing "12,453" with no unit label is useless during an incident. Always configure units.

Metric Type            Grafana Unit    Example Display
Requests per second    reqps           1.2k req/s
Latency                s or ms         247ms
Error rate             percentunit     0.3%
Bytes                  bytes           2.1 GiB
CPU usage              percentunit     73%
Duration               dtdurations     2h 15m

Legend format matters too. {{pod}} gives you unreadable Kubernetes pod names. {{service}} - {{endpoint}} gives you context that matters during triage.

What Good Looks Like

A well-designed dashboard is quiet most of the time. Green panels, steady lines, predictable patterns. When something goes wrong, the signal is obvious and the drill-down path is clear: overview to triage to debug.

Treat your dashboards like code. Version them in JSON, review changes in pull requests, and delete panels that nobody looks at. A dashboard with ten focused panels is worth more than one with fifty that generate noise.

Reliability is a practice of measurement, and your dashboards are the instrument panel. Build them with the same care you'd build the systems they monitor.
