Building a Complete Prometheus + Grafana Monitoring Stack from Scratch
If It's Not Measured, It Doesn't Exist
I've been paged at every hour of the night. The difference between a 5-minute incident and a 5-hour one is almost always the same thing: observability. Teams with good monitoring detect issues before users do, diagnose root causes from dashboards instead of guesswork, and resolve incidents in minutes instead of hours.
This guide builds a complete monitoring stack from zero. Not a toy setup — a production-grade system with service discovery, recording rules, meaningful alerts, and dashboards that actually help during incidents. By the end, you'll have the same monitoring infrastructure I deploy for production Kubernetes clusters.
Architecture Overview
```
┌──────────────────────────────────────────────┐
│                   Grafana                    │
│          (Dashboards, Exploration)           │
└──────────┬──────────────────────┬────────────┘
           │                      │
  ┌────────▼────────┐    ┌────────▼────────┐
  │   Prometheus    │    │  Alertmanager   │
  │ (Metrics Store) │    │ (Notification)  │
  └────────┬────────┘    └─────────────────┘
           │
  ┌────────▼────────────────────────────┐
  │            Scrape Targets           │
  │ ┌─────────┐ ┌───────┐ ┌──────────┐  │
  │ │node-exp.│ │ kube- │ │   app    │  │
  │ │         │ │ state │ │ metrics  │  │
  │ └─────────┘ └───────┘ └──────────┘  │
  └─────────────────────────────────────┘
```
Part 1: Installing the Stack with Helm
kube-prometheus-stack
The community Helm chart gives you Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in one deployment. This is the right starting point.
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```
Create a comprehensive values file:
```yaml
# values-monitoring.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 40GB
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # Scrape interval and evaluation
    scrapeInterval: 30s
    evaluationInterval: 30s

    # Enable remote write for long-term storage
    remoteWrite:
      - url: "http://thanos-receive.monitoring:19291/api/v1/receive"
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: "go_.*"
            action: drop  # Don't send Go runtime metrics to long-term storage

    # Service discovery for PodMonitors and ServiceMonitors
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

    # Additional scrape configs for non-k8s targets
    additionalScrapeConfigs:
      - job_name: 'external-node-exporter'
        static_configs:
          - targets:
              - 'bastion-host:9100'
              - 'build-server:9100'
            labels:
              environment: infrastructure

grafana:
  adminPassword: ""  # Use an external secret
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: gp3
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
      folderAnnotation: grafana_folder
      provider:
        foldersFromFilesStructure: true
    datasources:
      enabled: true
  grafana.ini:
    server:
      root_url: https://grafana.example.com
    auth.generic_oauth:
      enabled: true
      name: SSO
      allow_sign_up: true
      scopes: openid profile email
    security:
      cookie_secure: true
      strict_transport_security: true

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 128Mi
        cpu: 50m
      limits:
        memory: 256Mi
        cpu: 200m
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi

nodeExporter:
  resources:
    requests:
      memory: 64Mi
      cpu: 50m
    limits:
      memory: 128Mi
      cpu: 200m

kubeStateMetrics:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
      cpu: 200m
```
Deploy it:
```bash
kubectl create namespace monitoring

helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-monitoring.yaml \
  --version 67.4.0 \
  --wait
```
Part 2: Instrumenting Your Applications
ServiceMonitor for Kubernetes Services
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      scrapeTimeout: 10s
      metricRelabelings:
        # Drop high-cardinality metrics you don't need.
        # (Be careful with action: keep — it drops everything
        # that does NOT match, which is rarely what you want.)
        - sourceLabels: [__name__]
          regex: "go_gc_.*"
          action: drop
  namespaceSelector:
    matchNames:
      - production
```
PodMonitor for Pods Without Services
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      monitoring: enabled
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
```
Application Instrumentation (Go Example)
```go
package main

import (
	"log"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests by method, path, and status",
		},
		[]string{"method", "path", "status"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "path"},
	)

	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

// responseWriter wraps http.ResponseWriter to capture the status code.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		activeConnections.Inc()
		defer activeConnections.Dec()

		timer := prometheus.NewTimer(
			httpRequestDuration.WithLabelValues(r.Method, r.URL.Path),
		)
		defer timer.ObserveDuration()

		rw := &responseWriter{ResponseWriter: w, statusCode: 200}
		next.ServeHTTP(rw, r)

		// Record the numeric code ("500"), not http.StatusText's prose,
		// so the status label matches status=~"5.." queries.
		httpRequestsTotal.WithLabelValues(
			r.Method, r.URL.Path, strconv.Itoa(rw.statusCode),
		).Inc()
	})
}

func handleRoot(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	mux.Handle("/", instrumentHandler(http.HandlerFunc(handleRoot)))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```
Part 3: Recording Rules for Performance
Recording rules pre-compute expensive queries. Without them, your dashboards are slow and Prometheus burns CPU on repeated aggregations.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: http.rules
      interval: 30s
      rules:
        # Request rate by service
        - record: service:http_requests:rate5m
          expr: |
            sum by (service, namespace) (
              rate(http_requests_total[5m])
            )

        # Error rate by service
        - record: service:http_errors:rate5m
          expr: |
            sum by (service, namespace) (
              rate(http_requests_total{status=~"5.."}[5m])
            )

        # Error ratio (for SLO dashboards)
        - record: service:http_error_ratio:rate5m
          expr: |
            service:http_errors:rate5m
            /
            service:http_requests:rate5m

        # P50, P90, P99 latency by service
        - record: service:http_request_duration_seconds:p50
          expr: |
            histogram_quantile(0.50,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )
        - record: service:http_request_duration_seconds:p90
          expr: |
            histogram_quantile(0.90,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )
        - record: service:http_request_duration_seconds:p99
          expr: |
            histogram_quantile(0.99,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

    - name: kubernetes.rules
      interval: 30s
      rules:
        # CPU utilization by namespace
        - record: namespace:container_cpu_usage:sum
          expr: |
            sum by (namespace) (
              rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
            )

        # Memory utilization by namespace
        - record: namespace:container_memory_working_set_bytes:sum
          expr: |
            sum by (namespace) (
              container_memory_working_set_bytes{container!="", image!=""}
            )

        # Pod restart rate
        - record: namespace:kube_pod_container_restarts:rate1h
          expr: |
            sum by (namespace, pod) (
              increase(kube_pod_container_status_restarts_total[1h])
            )
```
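To demystify what those latency rules compute: `histogram_quantile` estimates a quantile from cumulative bucket counts by linearly interpolating within the bucket where the target rank falls. Here's a simplified Go sketch of that logic (edge cases like the `+Inf` bucket are omitted, and the bucket values are made up for illustration):

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: count of observations <= le.
type bucket struct {
	le    float64 // upper bound in seconds
	count float64 // cumulative count of observations at or below le
}

// quantile sketches what histogram_quantile() does: find the bucket the
// target rank lands in, then linearly interpolate between that bucket's
// lower and upper bounds.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// interpolate within (lowerBound, b.le]
			return lowerBound + (b.le-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.le, b.count
	}
	return lowerBound
}

func main() {
	// Cumulative counts for buckets le=0.1, 0.25, 0.5, 1.0
	buckets := []bucket{{0.1, 800}, {0.25, 950}, {0.5, 990}, {1.0, 1000}}
	fmt.Printf("p99 ≈ %.3fs\n", quantile(0.99, buckets))
}
```

This is also why the recording rule must `sum by (..., le)`: the interpolation only works if the `le` label survives the aggregation.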
Part 4: Alerting Rules That Don't Page You for Nothing
This is where most monitoring setups fail. Alert on symptoms, not causes. Page on user impact, not internal metrics.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.alerts
      rules:
        # High error rate (user-facing)
        - alert: HighErrorRate
          expr: |
            service:http_error_ratio:rate5m > 0.01
          for: 5m
          labels:
            severity: critical
            team: "{{ $labels.namespace }}"
          annotations:
            summary: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
            description: "Error rate exceeds 1% SLO for 5 minutes."
            runbook: "https://wiki.example.com/runbooks/high-error-rate"
            dashboard: "https://grafana.example.com/d/slo-overview"

        # High latency (user-facing)
        - alert: HighLatencyP99
          expr: |
            service:http_request_duration_seconds:p99 > 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} p99 latency is {{ $value }}s"
            runbook: "https://wiki.example.com/runbooks/high-latency"

    - name: infrastructure.alerts
      rules:
        # Node is running out of disk
        - alert: NodeDiskPressure
          expr: |
            (
              node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"}
            ) < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has < 10% disk space"

        # Pod CrashLoopBackOff
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

        # Persistent volume filling up
        - alert: PersistentVolumeFillingUp
          expr: |
            (
              kubelet_volume_stats_available_bytes
              / kubelet_volume_stats_capacity_bytes
            ) < 0.15
            and
            predict_linear(kubelet_volume_stats_available_bytes[6h], 24 * 3600) < 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} will fill within 24h"

    - name: prometheus.alerts
      rules:
        # Prometheus itself is having issues
        - alert: PrometheusTargetDown
          expr: up == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"

        # Too many scrape errors
        - alert: PrometheusScrapeErrors
          expr: |
            increase(prometheus_target_scrapes_exceeded_sample_limit_total[1h]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Scrape target hitting sample limit"
```
Part 5: Alertmanager Configuration
Route alerts to the right people through the right channels:
```yaml
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'namespace', 'service']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: default-slack
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pagerduty-critical
        repeatInterval: 1h
        continue: true  # Also send to Slack
      - matchers:
          - name: severity
            value: critical
        receiver: critical-slack
      - matchers:
          - name: severity
            value: warning
        receiver: warning-slack
        repeatInterval: 12h
  receivers:
    - name: default-slack
      slackConfigs:
        - channel: '#alerts-default'
          apiURL:
            name: slack-webhook
            key: url
          title: '{{ .GroupLabels.alertname }}'
          text: >-
            {{ range .Alerts }}
            *{{ .Labels.severity | toUpper }}*: {{ .Annotations.summary }}
            {{ .Annotations.description }}
            {{ if .Annotations.runbook }}Runbook: {{ .Annotations.runbook }}{{ end }}
            {{ end }}
          sendResolved: true
    - name: critical-slack
      slackConfigs:
        - channel: '#alerts-critical'
          apiURL:
            name: slack-webhook
            key: url
          sendResolved: true
    - name: warning-slack
      slackConfigs:
        - channel: '#alerts-warning'
          apiURL:
            name: slack-webhook
            key: url
          sendResolved: true
    - name: pagerduty-critical
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-key
            key: routing-key
          severity: critical
          description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
```
Part 6: Grafana Dashboards as Code
Store dashboards in ConfigMaps so they're version-controlled and survive Grafana restarts:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
  annotations:
    grafana_folder: "Service Dashboards"
data:
  service-overview.json: |
    {
      "dashboard": {
        "title": "Service Overview",
        "uid": "service-overview",
        "tags": ["services", "sre"],
        "timezone": "browser",
        "refresh": "30s",
        "panels": [
          {
            "title": "Request Rate",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
            "targets": [
              {
                "expr": "sum by (service) (service:http_requests:rate5m)",
                "legendFormat": "{{ service }}"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
            "targets": [
              {
                "expr": "service:http_error_ratio:rate5m * 100",
                "legendFormat": "{{ service }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    { "color": "green", "value": null },
                    { "color": "yellow", "value": 0.5 },
                    { "color": "red", "value": 1 }
                  ]
                }
              }
            }
          },
          {
            "title": "P99 Latency",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
            "targets": [
              {
                "expr": "service:http_request_duration_seconds:p99",
                "legendFormat": "{{ service }}"
              }
            ],
            "fieldConfig": {
              "defaults": { "unit": "s" }
            }
          },
          {
            "title": "Active Pods",
            "type": "stat",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
            "targets": [
              {
                "expr": "sum by (namespace) (kube_pod_status_phase{phase='Running'})",
                "legendFormat": "{{ namespace }}"
              }
            ]
          }
        ]
      }
    }
```
Part 7: Long-Term Storage with Thanos
Keep local Prometheus retention at 15-30 days. For anything longer, add the Thanos sidecar, which ships TSDB blocks to object storage.
```yaml
# Add to kube-prometheus-stack values
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: config.yaml
    # Keep 15 days locally
    retention: 15d
```
Thanos object storage config:
```yaml
# thanos-objstore-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  config.yaml: |
    type: S3
    config:
      bucket: monitoring-thanos-store
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
```
Deploy Thanos components:
```bash
helm install thanos bitnami/thanos \
  --namespace monitoring \
  --set 'query.stores[0]=prometheus-kube-prometheus-stack-thanos-discovery.monitoring:10901' \
  --set compactor.enabled=true \
  --set compactor.retentionResolutionRaw=30d \
  --set compactor.retentionResolution5m=180d \
  --set compactor.retentionResolution1h=365d \
  --set storegateway.enabled=true \
  --set existingObjstoreSecret=thanos-objstore
```

(Note the quoted `query.stores[0]=...` form: unquoted brackets are mangled by the shell, and Helm expects array elements to be set by index.)
This gives you 30 days of raw resolution, 6 months at 5-minute resolution, and a year at 1-hour resolution. Enough to spot trends, do capacity planning, and satisfy auditors.
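When sizing the PVC and the object-store bucket, a back-of-envelope calculation goes a long way. The ~1.5 bytes per compressed sample used below is a commonly quoted rule of thumb for Prometheus's TSDB, not a guarantee; real usage varies with series churn and label sizes:

```go
package main

import "fmt"

// estimateGB gives a rough on-disk size for a given series count,
// scrape interval, and retention window, assuming ~1.5 bytes per
// compressed sample (an assumption, not a Prometheus spec).
func estimateGB(activeSeries, scrapeIntervalSec, retentionDays int) float64 {
	samplesPerSeries := (24 * 3600 / scrapeIntervalSec) * retentionDays
	const bytesPerSample = 1.5
	return float64(activeSeries*samplesPerSeries) * bytesPerSample / 1e9
}

func main() {
	// 1M active series at 30s scrapes, 15 days local retention
	fmt.Printf("local TSDB: ~%.0f GB\n", estimateGB(1_000_000, 30, 15))
	// the same series shipped raw to Thanos for 30 days
	fmt.Printf("thanos raw: ~%.0f GB\n", estimateGB(1_000_000, 30, 30))
}
```

That is roughly 65 GB locally for a million series, which is why the values file above requests a 50Gi volume with `retentionSize: 40GB` as a backstop.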
The Monitoring Stack Checklist
| Component | Purpose | Without It |
|---|---|---|
| Prometheus | Metrics collection and short-term storage | No metrics at all |
| node-exporter | Host-level metrics (CPU, memory, disk, network) | Blind to infrastructure issues |
| kube-state-metrics | Kubernetes object metrics (pods, deployments) | Can't see K8s state |
| Recording rules | Pre-computed aggregations | Slow dashboards, high CPU |
| Alerting rules | Automated incident detection | Manual monitoring only |
| Alertmanager | Alert routing and deduplication | Alert storms, no routing |
| Grafana | Visualization and exploration | Raw PromQL only |
| Thanos/Cortex | Long-term storage | Lose metrics after retention |
Part 8: Troubleshooting Common Issues
Prometheus Running Out of Memory
This is the most common operational issue. Prometheus memory usage is proportional to the number of active time series.
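A back-of-envelope sketch of that relationship (the ~4 KiB per active head series is a rough rule of thumb I'm assuming here, not a Prometheus guarantee; actual usage also depends on churn and query load):

```go
package main

import "fmt"

// estimateRAMGiB approximates Prometheus head-block memory from the
// active series count, using an assumed ~4 KiB per series.
func estimateRAMGiB(activeSeries int) float64 {
	const bytesPerSeries = 4096
	return float64(activeSeries) * bytesPerSeries / (1 << 30)
}

func main() {
	for _, n := range []int{500_000, 2_000_000, 5_000_000} {
		fmt.Printf("%d series → ~%.1f GiB\n", n, estimateRAMGiB(n))
	}
}
```

Two million series lands around 7.6 GiB by this estimate, already past the 4Gi limit in the values file above, which is why cardinality control matters more than tuning flags.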
```bash
# Check current time series count
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'

# Find the highest-cardinality metrics
curl -s http://localhost:9090/api/v1/status/tsdb | jq '
  .data.seriesCountByMetricName
  | sort_by(-.value)
  | .[0:20]
  | .[] | "\(.name): \(.value) series"'
```
Common culprits and fixes:
| Metric | Typical Cause | Fix |
|---|---|---|
| `apiserver_request_duration_seconds_bucket` | Too many `le` buckets | Drop with relabeling |
| `container_*` | Monitoring paused/stopped containers | Filter `container!=""` |
| `http_request_duration_seconds_bucket` | High-cardinality path labels | Normalize URL paths |
| `go_gc_*` | Every Go service exports these | Drop with relabeling |
Drop high-cardinality metrics you don't need:
```yaml
# In your ServiceMonitor or scrape config
metricRelabelings:
  # Drop Go garbage collector metrics (rarely needed)
  - sourceLabels: [__name__]
    regex: "go_(gc|memstats|threads|info)_.*"
    action: drop
  # Drop unused histogram buckets
  - sourceLabels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop
  # Normalize high-cardinality URL paths
  - sourceLabels: [path]
    regex: "/api/v1/users/[0-9]+"
    targetLabel: path
    replacement: "/api/v1/users/:id"
```
Grafana Dashboards Loading Slowly
Slow dashboards are almost always caused by unoptimized PromQL queries hitting raw metrics instead of recording rules.
Before (slow — computed from raw series on every dashboard load):

```promql
sum by (service) (rate(http_requests_total{namespace="production"}[5m]))
```

After (fast — reads the pre-computed recording rule):

```promql
service:http_requests:rate5m{namespace="production"}
```
Other optimization tips:

- Set the default dashboard time range to 6 hours or less. Longer ranges query more data.
- Use `$__rate_interval` instead of hardcoded intervals like `[5m]`.
- Add template variables for namespace and service to filter queries instead of aggregating everything.
Alertmanager Not Sending Notifications
```bash
# Check Alertmanager status
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# View active alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[0:5]'

# Check alert routing (shows which receiver an alert group hits)
curl -s http://localhost:9093/api/v2/alerts/groups | jq '.[] | {receiver, alerts: [.alerts[].labels.alertname]}'

# Test webhook connectivity
kubectl exec -n monitoring deploy/alertmanager -- \
  wget -qO- --timeout=5 https://hooks.slack.com/services/... 2>&1
```
Common issues:

- Slack webhook URL changed — re-create the secret with the new URL.
- Alert is in `pending` state — it hasn't fired long enough to meet the `for` duration.
- Inhibition rules — a lower-severity alert may be suppressed by a higher-severity one.
- `groupWait` too long — set `groupWait: 30s` for critical alerts.
Part 9: SLO-Based Monitoring
The most mature monitoring setup I deploy uses SLOs (Service Level Objectives) as the foundation for all alerting.
Defining SLOs
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.rules
      interval: 30s
      rules:
        # Availability SLO: 99.9% of requests succeed
        - record: slo:api_availability:ratio
          expr: |
            1 - (
              sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{namespace="production"}[5m]))
            )

        # Latency SLO: 99% of requests complete in < 500ms
        - record: slo:api_latency:ratio
          expr: |
            sum(rate(http_request_duration_seconds_bucket{namespace="production", le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count{namespace="production"}[5m]))

        # Error budget remaining (30-day window)
        - record: slo:api_availability:error_budget_remaining
          expr: |
            1 - (
              (1 - slo:api_availability:ratio)
              /
              (1 - 0.999)
            )

    - name: slo.alerts
      rules:
        # Burn-rate alert: a 14.4x burn consumes ~2% of the monthly
        # error budget per hour
        - alert: SLOHighBurnRate
          expr: |
            slo:api_availability:error_budget_remaining < 0.98
            and
            (1 - slo:api_availability:ratio) > (14.4 * (1 - 0.999))
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API availability SLO burn rate is critical"
            description: "At the current error rate, the monthly error budget will be exhausted in roughly two days."
            runbook: "https://wiki.example.com/runbooks/slo-burn-rate"
```
The burn-rate approach avoids two problems: alerting too early on minor blips, and alerting too late on sustained degradation. A 14.4x burn rate means you'll exhaust your monthly error budget in roughly 50 hours (about two days) if it continues — that's worth paging someone.
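The arithmetic behind that claim, as a quick sketch (the 6x slower-burn threshold in the second example is the conventional companion alert from multi-window burn-rate practice, not something configured above):

```go
package main

import "fmt"

// hoursToExhaustion: how long the error budget lasts at a given burn
// rate. A burn rate of 1.0 means consuming exactly the whole budget
// over the full SLO window.
func hoursToExhaustion(windowDays, burnRate float64) float64 {
	return windowDays * 24 / burnRate
}

func main() {
	// 14.4x burn on a 30-day window: 720h / 14.4 = 50h (~2 days)
	fmt.Printf("14.4x → %.0f hours\n", hoursToExhaustion(30, 14.4))
	// a 6x slower-burn alert would give you five days of runway
	fmt.Printf("6x    → %.0f hours\n", hoursToExhaustion(30, 6))
}
```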
SLO Grafana Dashboard
Add an error budget panel to your service overview dashboard:
```json
{
  "title": "Error Budget Remaining (30d)",
  "type": "gauge",
  "targets": [
    {
      "expr": "slo:api_availability:error_budget_remaining * 100",
      "legendFormat": "Budget Remaining"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100,
      "thresholds": {
        "steps": [
          { "color": "red", "value": 0 },
          { "color": "yellow", "value": 25 },
          { "color": "green", "value": 50 }
        ]
      }
    }
  }
}
```
Part 10: Scaling Prometheus
When a Single Prometheus Isn't Enough
For clusters with more than 500 nodes or 5 million active series, a single Prometheus instance runs into memory and storage limits. Options:
Functional sharding — run multiple Prometheus instances, each scraping different workloads:
```yaml
# prometheus-apps.yaml - scrapes application metrics
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        monitoring-target: applications
    externalLabels:
      shard: apps
```

```yaml
# prometheus-infra.yaml - scrapes infrastructure metrics
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        monitoring-target: infrastructure
    externalLabels:
      shard: infra
```
Use Thanos Query to provide a unified view across shards:
```bash
helm install thanos-query bitnami/thanos \
  --namespace monitoring \
  --set 'query.stores[0]=prometheus-apps-thanos:10901' \
  --set 'query.stores[1]=prometheus-infra-thanos:10901'
```
Point Grafana at Thanos Query instead of individual Prometheus instances. Your dashboards work exactly the same — Thanos handles the fan-out and deduplication.
What I Wish Someone Told Me
- Start with USE and RED methods. For infrastructure: Utilization, Saturation, Errors. For services: Rate, Errors, Duration. These cover 90% of your monitoring needs.
- Recording rules are not optional. A dashboard that takes 30 seconds to load won't get used during an incident.
- High-cardinality labels will destroy Prometheus. Never use user IDs, request IDs, or timestamps as label values. Each unique combination creates a new time series.
- Alert fatigue kills on-call. Every alert should have a runbook. Every page should require human action. If it can be automated, it shouldn't page you.
- Monitor the monitoring. If Prometheus goes down and you don't notice, you have no monitoring at all. Set up external checks on your monitoring stack.
- SLOs before dashboards. Define what "healthy" means for each service before building dashboards. Without SLOs, you're just looking at graphs — you're not making decisions.
- Label standardization matters early. Agree on label names (`service` vs `app`, `environment` vs `env`) before you have 50 ServiceMonitors. Renaming labels later is painful.
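The cardinality warning is worth making concrete. Every unique combination of label values is its own time series, and a histogram multiplies that again by its bucket count. A hypothetical service's numbers:

```go
package main

import "fmt"

// seriesCount: total time series produced by one histogram metric
// given the distinct values of each label and the bucket count.
func seriesCount(methods, paths, statuses, buckets int) int {
	return methods * paths * statuses * buckets
}

func main() {
	// 5 methods × 50 normalized paths × 10 statuses × 12 buckets
	fmt.Println(seriesCount(5, 50, 10, 12)) // → 30000, perfectly manageable
	// same metric with raw user-ID paths for 100k users:
	fmt.Println(seriesCount(5, 100_000, 10, 12)) // → 60 million series
}
```

Thirty thousand series is routine; sixty million is a dead Prometheus. Path normalization (the `/api/v1/users/:id` relabeling shown in Part 8) is the difference.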
The goal isn't to collect every possible metric. The goal is to answer two questions at any time: "Is the system healthy?" and "If not, where is it broken?" Build toward that, and you'll have a monitoring stack that earns its keep.
Invest the time upfront to build this stack properly. A well-configured Prometheus with meaningful recording rules, SLO-based alerts, and Grafana dashboards that tell a story will save your team hundreds of hours in incident response over its lifetime. The alternative — waking up to an outage with no metrics, no alerts, and no dashboards — is a pain I've felt too many times to recommend to anyone. Build the stack. Trust the stack. Then make it better, one recording rule at a time.