Designing Grafana Dashboards That SREs Actually Use
Most Dashboards Are Just Expensive Wallpaper
Walk into any engineering office and you'll see a wall of Grafana dashboards. Dozens of panels, rainbow gradients, numbers ticking. It looks impressive. But ask someone what any of those panels mean for the user, and the room goes quiet.
The Google SRE book makes a distinction that matters here: monitoring should answer two questions. Is the service working? Why isn't it working? If your dashboard doesn't clearly address one of those, it's decoration.
Let me walk through a systematic approach to dashboard design that I've refined over years of on-call rotations. It starts with structure, not aesthetics.
The Four Golden Signals as Your Foundation
Every service dashboard should begin with the four golden signals: latency, traffic, errors, and saturation. This isn't optional — it's the minimum viable dashboard.
Here's the PromQL for each, assuming a standard HTTP service instrumented with Prometheus:
Latency — The User's Experience
# P50 latency over 5 minutes
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)
# P99 latency over 5 minutes
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)
Display P50 and P99 on the same panel. The gap between them tells a story. A tight gap means consistent performance. A wide gap means tail latency is punishing a subset of users — the kind of thing averages hide.
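If you want that gap as a first-class signal rather than something eyeballed from two lines, a dedicated panel can plot the ratio directly. This is a sketch reusing the same histogram metric as above; the interpretation thresholds are rules of thumb, not standards:

```promql
# Tail amplification: how many times slower the P99 request is than the median.
# Near 1-2: consistent performance. Above ~10: the tail is badly behaved.
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)
/
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)
```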
Traffic — The Demand Signal
# Requests per second by status class
sum(rate(http_requests_total{service="$service"}[5m])) by (status_class)
Break traffic down by status class (2xx, 4xx, 5xx), not individual codes. You want to see shape, not noise. A sudden drop in 2xx traffic is often a more reliable incident signal than a spike in 5xx.
Errors — What's Actually Broken
# Error rate as a percentage of total traffic
sum(rate(http_requests_total{service="$service", status_class="5xx"}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
* 100
Always express errors as a ratio. Raw error counts are meaningless without traffic context. Ten errors per second during 100 RPS is a 10% failure rate. Ten errors per second during 100,000 RPS is rounding error.
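The arithmetic behind that claim is trivial, which is exactly the point: the ratio is the only number that carries meaning on its own. A minimal sketch (the helper name is illustrative, not from any library):

```python
# Hypothetical helper illustrating why the error *ratio*, not the raw count, matters.
def error_rate_percent(errors_per_sec: float, total_rps: float) -> float:
    """Return errors as a percentage of total traffic."""
    if total_rps == 0:
        return 0.0
    return errors_per_sec * 100.0 / total_rps

# Same absolute error count, wildly different severity:
print(error_rate_percent(10, 100))      # 10 err/s at 100 RPS     -> 10.0 (page someone)
print(error_rate_percent(10, 100_000))  # 10 err/s at 100,000 RPS -> 0.01 (rounding error)
```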
Saturation — How Close to the Cliff
# Container CPU usage as a fraction of limit
sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", pod=~"$service.*"}) by (pod)
# Container memory usage as a fraction of limit
sum(container_memory_working_set_bytes{pod=~"$service.*"}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory", pod=~"$service.*"}) by (pod)
Saturation panels should show resources relative to their limits, not absolute values. A container using 2 GiB of memory means nothing without knowing its limit is 2.5 GiB.
Dashboard Hierarchy: Three Layers
One dashboard for everything is a failure mode. Structure your dashboards in three layers.
Layer 1: The Service Overview
This is the dashboard that lives on the wall. It answers one question: is the service healthy right now?
- Four golden signals, one row
- SLO burn rate indicator (green/yellow/red)
- Active alerts count
- Time range: last 1 hour
No drill-down complexity. No per-pod breakdowns. If someone glances at this for five seconds, they should know if something needs attention.
Layer 2: The Triage Dashboard
This is where the on-call engineer goes when Layer 1 shows a problem. It answers: where is the problem?
- Golden signals broken down by endpoint, pod, or region
- Dependency health (upstream/downstream latency and error rates)
- Recent deployments overlay
- Time range: last 6 hours
# Latency by endpoint to isolate the slow path
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le, endpoint)
)
# Dependency latency to check if the problem is upstream
histogram_quantile(0.99,
sum(rate(grpc_client_handled_seconds_bucket{grpc_service=~".*"}[5m])) by (le, grpc_service)
)
Layer 3: The Debug Dashboard
This is the deep dive. Per-pod resource usage, goroutine counts, connection pool states, garbage collection metrics. This dashboard is dense and that's fine — it's for focused investigation, not scanning.
# Go GC pause duration for debugging memory pressure
sum(rate(go_gc_duration_seconds_sum{service="$service"}[5m])) by (pod)
/
sum(rate(go_gc_duration_seconds_count{service="$service"}[5m])) by (pod)
# Open file descriptors approaching limit
process_open_fds{service="$service"}
/
process_max_fds{service="$service"}
Practical Grafana Configuration
Use Variables, Not Hardcoded Labels
Every dashboard should start with a service variable at the top. This makes dashboards reusable across teams.
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 2,
"sort": 1
}
Set Sensible Thresholds with Overrides
Color thresholds should reflect your SLOs, not arbitrary values. If your latency SLO is P99 under 300ms, set your threshold there:
# Grafana panel threshold configuration
thresholds:
  mode: absolute
  steps:
    - color: green
      value: null
    - color: yellow
      value: 200  # two-thirds of SLO — early warning
    - color: red
      value: 300  # SLO boundary
Annotations for Deploy Markers
Overlaying deployment events on your dashboards is one of the highest-value, lowest-effort improvements you can make. Most incidents correlate with changes.
# Use Grafana annotations API from your CI pipeline
# POST /api/annotations
# {
# "dashboardUID": "svc-overview",
# "time": 1711000000000,
# "tags": ["deploy", "v2.14.3"],
# "text": "Deployed v2.14.3 by ci-pipeline"
# }
In your CI/CD pipeline, add a step after deployment:
# GitHub Actions step to annotate Grafana
- name: Annotate Grafana
  run: |
    curl -s -X POST "$GRAFANA_URL/api/annotations" \
      -H "Authorization: Bearer $GRAFANA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "tags": ["deploy", "${{ github.sha }}"],
        "text": "Deploy ${{ github.ref_name }} by ${{ github.actor }}"
      }'
Anti-Patterns to Avoid
The God Dashboard. One dashboard with 40 panels covering everything from CPU to business metrics. Nobody knows where to look. Split by the three-layer model.
Gauges and single-stats everywhere. A gauge showing current CPU percentage tells you nothing about trend. Use time-series graphs as the default. Reserve single-stats for binary states (up/down) and SLO compliance percentages.
No time-range discipline. A dashboard with one panel showing 5-minute data and another showing 24-hour data on the same row creates confusion. Keep panels in the same row at the same time range.
Pretty but meaningless. Pie charts showing request distribution by status code look nice in demos. They tell an on-call engineer nothing about whether to page someone. Every panel should have an answer to: what do I do when this looks wrong?
The SLO Burn Rate Panel
This is the single most important panel on your overview dashboard. It tells you how fast you're consuming your error budget.
# SLO burn rate over 1 hour (fast burn), assuming a 99.9% availability SLO
(
sum(rate(http_requests_total{service="$service", status_class="5xx"}[1h]))
/
sum(rate(http_requests_total{service="$service"}[1h]))
) / (1 - 0.999)
# SLO burn rate over 6 hours (slow burn), assuming a 99.9% availability SLO
(
sum(rate(http_requests_total{service="$service", status_class="5xx"}[6h]))
/
sum(rate(http_requests_total{service="$service"}[6h]))
) / (1 - 0.999)
A burn rate of 1 means you're consuming error budget at exactly the pace that would exhaust it at the end of the SLO window (typically 30 days). A burn rate above 14.4 over a 1-hour window is a fast burn — that warrants a page. A burn rate above 6 over a 6-hour window is a slow burn — that warrants a ticket.
These thresholds come directly from the Google SRE Workbook's multi-window, multi-burn-rate alerting approach. They're not arbitrary.
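As a sketch, those two thresholds translate into Prometheus alerting rules like the following. The rule names and the concrete `service="api"` label are illustrative (alerting rules can't use Grafana's `$service` variable), and a full Workbook-style setup would also pair each long window with a short confirmation window to cut alert reset time:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[1h]))
            /
            sum(rate(http_requests_total{service="api"}[1h]))
          ) / (1 - 0.999) > 14.4
        labels:
          severity: page
        annotations:
          summary: "Fast error-budget burn (>14.4x over 1h)"
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[6h]))
            /
            sum(rate(http_requests_total{service="api"}[6h]))
          ) / (1 - 0.999) > 6
        labels:
          severity: ticket
        annotations:
          summary: "Slow error-budget burn (>6x over 6h)"
```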
Dashboard as Code: Grafana Provisioning
Stop building dashboards by clicking around in the UI. Provision them from code so they're versioned, reviewable, and reproducible.
Provisioning Configuration
# grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Service Dashboards"
    type: file
    disableDeletion: false
    editable: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
JSON Dashboard Model
{
"dashboard": {
"title": "API Service Overview",
"uid": "api-svc-overview",
"tags": ["sre", "api", "golden-signals"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"$service\"}[5m])) by (status_class)",
"legendFormat": "{{status_class}}"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"drawStyle": "line",
"fillOpacity": 10
},
"unit": "reqps"
}
}
},
{
"title": "Error Rate",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"$service\",status_class=\"5xx\"}[5m])) / sum(rate(http_requests_total{service=\"$service\"}[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
}
}
}
}
]
}
}
Store these JSON files in Git. Deploy them via a ConfigMap mounted into Grafana's provisioning directory. Every dashboard change goes through a PR.
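As a sketch of that wiring (the names and namespace are illustrative), the ConfigMap carries the dashboard JSON and gets mounted at the path the provisioning provider reads from:

```yaml
# Hypothetical ConfigMap holding provisioned dashboard JSON.
# Mount it into the Grafana pod at /var/lib/grafana/dashboards,
# the path configured in the provisioning file above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: observability
data:
  api-overview.json: |
    { "dashboard": { "title": "API Service Overview", "uid": "api-svc-overview" } }
```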
Terraform for Grafana Dashboards
resource "grafana_dashboard" "api_overview" {
config_json = file("${path.module}/dashboards/api-overview.json")
folder = grafana_folder.sre.id
overwrite = true
}
resource "grafana_folder" "sre" {
title = "SRE Dashboards"
}
resource "grafana_data_source" "prometheus" {
type = "prometheus"
name = "Prometheus"
url = "http://prometheus.observability:9090"
json_data_encoded = jsonencode({
httpMethod = "POST"
timeInterval = "15s"
})
}
Infrastructure as code for your dashboards, data sources, and folder structure. Reproducible across environments.
Units and Legends Matter
A panel showing "12,453" with no unit label is useless during an incident. Always configure units.
| Metric Type | Grafana Unit | Example Display |
|---|---|---|
| Requests per second | reqps | 1.2k req/s |
| Latency | s or ms | 247ms |
| Error rate (query already × 100, i.e. 0–100) | percent | 0.3% |
| Bytes | bytes | 2.1 GiB |
| CPU usage (0–1 ratio of limit) | percentunit | 73% |
| Duration | dtdurations | 2h 15m |
Legend format matters too. {{pod}} gives you unreadable Kubernetes pod names. {{service}} - {{endpoint}} gives you context that matters during triage.
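In the panel JSON, that comes down to the legendFormat field on each target. A sketch, assuming the metric carries service and endpoint labels:

```json
{
  "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint))",
  "legendFormat": "{{service}} - {{endpoint}}"
}
```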
What Good Looks Like
A well-designed dashboard is quiet most of the time. Green panels, steady lines, predictable patterns. When something goes wrong, the signal is obvious and the drill-down path is clear: overview to triage to debug.
Treat your dashboards like code. Version them in JSON, review changes in pull requests, and delete panels that nobody looks at. A dashboard with ten focused panels is worth more than one with fifty that generate noise.
Reliability is a practice of measurement, and your dashboards are the instrument panel. Build them with the same care you'd build the systems they monitor.
Related Articles
Building a Complete Prometheus + Grafana Monitoring Stack from Scratch
Build a production Prometheus and Grafana monitoring stack from scratch — service discovery, recording rules, alerting, and dashboards.
Prometheus Recording Rules: Fix Your Query Performance Before It Breaks Grafana
Use Prometheus recording rules to pre-compute expensive queries, speed up dashboards, and make SLO calculations reliable at scale.
Prometheus Alerting Rules That Don't Wake You Up for Nothing
Design Prometheus alerting rules that catch real incidents and ignore noise — practical patterns from years of on-call experience.