
Designing Grafana Dashboards That SREs Actually Use

Riku Tanaka · 9 min read

Most Dashboards Are Just Expensive Wallpaper

Walk into any engineering office and you'll see a wall of Grafana dashboards. Dozens of panels, rainbow gradients, numbers ticking. It looks impressive. But ask someone what any of those panels mean for the user, and the room goes quiet.

The Google SRE book frames it well: monitoring should answer two questions. What's broken? And why? If your dashboard doesn't clearly address one of those, it's decoration.

Let me walk through a systematic approach to dashboard design that I've refined over years of on-call rotations. It starts with structure, not aesthetics.

The Four Golden Signals as Your Foundation

Every service dashboard should begin with the four golden signals: latency, traffic, errors, and saturation. This isn't optional — it's the minimum viable dashboard.

Here's the PromQL for each, assuming a standard HTTP service instrumented with Prometheus:

Latency — The User's Experience

# P50 latency over 5 minutes
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)

# P99 latency over 5 minutes
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
)

Display P50 and P99 on the same panel. The gap between them tells a story. A tight gap means consistent performance. A wide gap means tail latency is punishing a subset of users — the kind of thing averages hide.

Traffic — The Demand Signal

# Requests per second by status class
sum(rate(http_requests_total{service="$service"}[5m])) by (status_class)

Break traffic down by status class (2xx, 4xx, 5xx), not individual codes. You want to see shape, not noise. A sudden drop in 2xx traffic is often a more reliable incident signal than a spike in 5xx.
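The queries in this article assume a status_class label on http_requests_total. If your instrumentation exposes only a raw status code label, you can derive the class with a recording rule. A sketch, assuming the raw label is named code (adjust to whatever your client library emits):

```yaml
# Recording rule sketch: collapses raw HTTP codes (200, 404, 503, ...)
# into status classes (2xx, 4xx, 5xx). "code" is an assumed label name.
groups:
  - name: status-class
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, status_class) (
            label_replace(
              rate(http_requests_total[5m]),
              "status_class", "${1}xx", "code", "([0-9]).."
            )
          )
```

label_replace copies the first captured digit of the code into a new status_class label, so a 503 becomes 5xx before aggregation.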

Errors — What's Actually Broken

# Error rate as a percentage of total traffic
sum(rate(http_requests_total{service="$service", status_class="5xx"}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
* 100

Always express errors as a ratio. Raw error counts are meaningless without traffic context. Ten errors per second at 100 RPS is a 10% failure rate. Ten errors per second at 100,000 RPS is a rounding error.

Saturation — How Close to the Cliff

# Container CPU usage as a fraction of limit
sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", pod=~"$service.*"}) by (pod)

# Container memory usage as a fraction of limit
sum(container_memory_working_set_bytes{pod=~"$service.*"}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory", pod=~"$service.*"}) by (pod)

Saturation panels should show resources relative to their limits, not absolute values. A container using 2 GiB of memory means nothing without knowing its limit is 2.5 GiB.

Dashboard Hierarchy: Three Layers

One dashboard for everything is a failure mode. Structure your dashboards in three layers.

Layer 1: The Service Overview

This is the dashboard that lives on the wall. It answers one question: is the service healthy right now?

  • Four golden signals, one row
  • SLO burn rate indicator (green/yellow/red)
  • Active alerts count
  • Time range: last 1 hour

No drill-down complexity. No per-pod breakdowns. If someone glances at this for five seconds, they should know if something needs attention.

Layer 2: The Triage Dashboard

This is where the on-call engineer goes when Layer 1 shows a problem. It answers: where is the problem?

  • Golden signals broken down by endpoint, pod, or region
  • Dependency health (upstream/downstream latency and error rates)
  • Recent deployments overlay
  • Time range: last 6 hours

# Latency by endpoint to isolate the slow path
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le, endpoint)
)

# Dependency latency to check if the problem is upstream
histogram_quantile(0.99,
  sum(rate(grpc_client_handling_seconds_bucket{grpc_service=~".*"}[5m])) by (le, grpc_service)
)

Layer 3: The Debug Dashboard

This is the deep dive. Per-pod resource usage, goroutine counts, connection pool states, garbage collection metrics. This dashboard is dense and that's fine — it's for focused investigation, not scanning.

# Go GC pause duration for debugging memory pressure
sum(rate(go_gc_duration_seconds_sum{service="$service"}[5m])) by (pod)
/
sum(rate(go_gc_duration_seconds_count{service="$service"}[5m])) by (pod)

# Open file descriptors approaching limit
process_open_fds{service="$service"}
/
process_max_fds{service="$service"}

Practical Grafana Configuration

Use Variables, Not Hardcoded Labels

Every dashboard should start with a service variable at the top. This makes dashboards reusable across teams.

{
  "name": "service",
  "type": "query",
  "query": "label_values(http_requests_total, service)",
  "refresh": 2,
  "sort": 1
}

Set Sensible Thresholds with Overrides

Color thresholds should reflect your SLOs, not arbitrary values. If your latency SLO is P99 under 300ms, set your threshold there:

# Grafana panel threshold configuration
thresholds:
  mode: absolute
  steps:
    - color: green
      value: null
    - color: yellow
      value: 200    # 66% of SLO — early warning
    - color: red
      value: 300    # SLO boundary

Annotations for Deploy Markers

Overlaying deployment events on your dashboards is one of the highest-value, lowest-effort improvements you can make. Most incidents correlate with changes.

# Use Grafana annotations API from your CI pipeline
# POST /api/annotations
# {
#   "dashboardUID": "svc-overview",
#   "time": 1711000000000,
#   "tags": ["deploy", "v2.14.3"],
#   "text": "Deployed v2.14.3 by ci-pipeline"
# }

In your CI/CD pipeline, add a step after deployment:

# GitHub Actions step to annotate Grafana
- name: Annotate Grafana
  run: |
    curl -s -X POST "$GRAFANA_URL/api/annotations" \
      -H "Authorization: Bearer $GRAFANA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "tags": ["deploy", "${{ github.sha }}"],
        "text": "Deploy ${{ github.ref_name }} by ${{ github.actor }}"
      }'

Anti-Patterns to Avoid

The God Dashboard. One dashboard with 40 panels covering everything from CPU to business metrics. Nobody knows where to look. Split by the three-layer model.

Gauges and single-stats everywhere. A gauge showing current CPU percentage tells you nothing about trend. Use time-series graphs as the default. Reserve single-stats for binary states (up/down) and SLO compliance percentages.

No time-range discipline. A dashboard with one panel showing 5-minute data and another showing 24-hour data on the same row creates confusion. Keep panels in the same row at the same time range.

Pretty but meaningless. Pie charts showing request distribution by status code look nice in demos. They tell an on-call engineer nothing about whether to page someone. Every panel should have an answer to: what do I do when this looks wrong?

The SLO Burn Rate Panel

This is the single most important panel on your overview dashboard. It tells you how fast you're consuming your error budget.

# SLO burn rate over 1 hour (fast burn)
(
  sum(rate(http_requests_total{service="$service", status_class="5xx"}[1h]))
  /
  sum(rate(http_requests_total{service="$service"}[1h]))
) / (1 - 0.999)

# SLO burn rate over 6 hours (slow burn)
(
  sum(rate(http_requests_total{service="$service", status_class="5xx"}[6h]))
  /
  sum(rate(http_requests_total{service="$service"}[6h]))
) / (1 - 0.999)

A burn rate of 1 means you're consuming error budget exactly at the pace that would exhaust it by month's end. A burn rate above 14.4 over a 1-hour window is a fast burn — that warrants a page. A burn rate above 6 over a 6-hour window is a slow burn — that warrants a ticket.

These thresholds come directly from the Google SRE Workbook's multi-window, multi-burn-rate alerting approach. They're not arbitrary.
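The same expressions can drive alerting directly. A sketch of the multi-window, multi-burn-rate rules for a 99.9% SLO over a 30-day window; the service label value "api" is illustrative, since alert rules can't use dashboard variables, and the short confirmation windows (5m, 30m) follow the workbook's pairing:

```yaml
# Prometheus alerting rule sketch. Each alert requires both a long and a
# short window to exceed the threshold, so alerts fire quickly but also
# resolve quickly once the error rate recovers.
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[1h]))
            / sum(rate(http_requests_total{service="api"}[1h]))
          ) / (1 - 0.999) > 14.4
          and
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m]))
          ) / (1 - 0.999) > 14.4
        labels:
          severity: page
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[6h]))
            / sum(rate(http_requests_total{service="api"}[6h]))
          ) / (1 - 0.999) > 6
          and
          (
            sum(rate(http_requests_total{service="api", status_class="5xx"}[30m]))
            / sum(rate(http_requests_total{service="api"}[30m]))
          ) / (1 - 0.999) > 6
        labels:
          severity: ticket
```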

Dashboard as Code: Grafana Provisioning

Stop building dashboards by clicking around in the UI. Provision them from code so they're versioned, reviewable, and reproducible.

Provisioning Configuration

# grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Service Dashboards"
    type: file
    disableDeletion: false
    editable: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

JSON Dashboard Model

{
  "dashboard": {
    "title": "API Service Overview",
    "uid": "api-svc-overview",
    "tags": ["sre", "api", "golden-signals"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"$service\"}[5m])) by (status_class)",
            "legendFormat": "{{status_class}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "drawStyle": "line",
              "fillOpacity": 10
            },
            "unit": "reqps"
          }
        }
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"$service\",status_class=\"5xx\"}[5m])) / sum(rate(http_requests_total{service=\"$service\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      }
    ]
  }
}

Store these JSON files in Git. Deploy them via a ConfigMap mounted into Grafana's provisioning directory. Every dashboard change goes through a PR.
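For the Kubernetes deployment path, the ConfigMap looks roughly like this; names and namespace are illustrative, and the dashboard JSON body is trimmed:

```yaml
# ConfigMap sketch: one key per dashboard JSON file. Mount it at the
# path the provisioning config points to (/var/lib/grafana/dashboards).
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: observability
data:
  api-overview.json: |
    {
      "title": "API Service Overview",
      "uid": "api-svc-overview"
    }

# In the Grafana Deployment spec:
#   volumes:
#     - name: dashboards
#       configMap:
#         name: grafana-dashboards
#   volumeMounts:
#     - name: dashboards
#       mountPath: /var/lib/grafana/dashboards
```

With updateIntervalSeconds set, Grafana re-reads the directory periodically, so dashboard changes roll out when the ConfigMap updates.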

Terraform for Grafana Dashboards

resource "grafana_dashboard" "api_overview" {
  config_json = file("${path.module}/dashboards/api-overview.json")
  folder      = grafana_folder.sre.id
  overwrite   = true
}

resource "grafana_folder" "sre" {
  title = "SRE Dashboards"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus.observability:9090"

  json_data_encoded = jsonencode({
    httpMethod = "POST"
    timeInterval = "15s"
  })
}

Infrastructure as code for your dashboards, data sources, and folder structure. Reproducible across environments.

Units and Legends Matter

A panel showing "12,453" with no unit label is useless during an incident. Always configure units.

Metric Type            Grafana Unit    Example Display
Requests per second    reqps           1.2k req/s
Latency                s or ms         247ms
Error rate             percentunit     0.3%
Bytes                  bytes           2.1 GiB
CPU usage              percentunit     73%
Duration               dtdurations     2h 15m

Legend format matters too. {{pod}} gives you unreadable Kubernetes pod names. {{service}} - {{endpoint}} gives you context that matters during triage.

What Good Looks Like

A well-designed dashboard is quiet most of the time. Green panels, steady lines, predictable patterns. When something goes wrong, the signal is obvious and the drill-down path is clear: overview to triage to debug.

Treat your dashboards like code. Version them in JSON, review changes in pull requests, and delete panels that nobody looks at. A dashboard with ten focused panels is worth more than one with fifty that generate noise.

Reliability is a practice of measurement, and your dashboards are the instrument panel. Build them with the same care you'd build the systems they monitor.
