
Building a Complete Prometheus + Grafana Monitoring Stack from Scratch

Riku Tanaka · 15 min read

If It's Not Measured, It Doesn't Exist

I've been paged at every hour of the night. The difference between a 5-minute incident and a 5-hour one is almost always the same thing: observability. Teams with good monitoring detect issues before users do, diagnose root causes from dashboards instead of guesswork, and resolve incidents in minutes instead of hours.

This guide builds a complete monitoring stack from zero. Not a toy setup — a production-grade system with service discovery, recording rules, meaningful alerts, and dashboards that actually help during incidents. By the end, you'll have the same monitoring infrastructure I deploy for production Kubernetes clusters.

Architecture Overview

┌──────────────────────────────────────────────────┐
│                    Grafana                        │
│           (Dashboards, Exploration)               │
└────────────┬───────────────────┬─────────────────┘
             │                   │
    ┌────────▼────────┐  ┌──────▼──────────┐
    │   Prometheus     │  │   Alertmanager   │
    │  (Metrics Store) │  │  (Notification)  │
    └────────┬────────┘  └─────────────────┘
    ┌────────▼────────────────────────────┐
    │         Scrape Targets              │
    │  ┌─────────┐ ┌──────┐ ┌─────────┐  │
    │  │node-exp.│ │kube-  │ │app      │  │
    │  │         │ │state  │ │metrics  │  │
    │  └─────────┘ └──────┘ └─────────┘  │
    └─────────────────────────────────────┘

Part 1: Installing the Stack with Helm

kube-prometheus-stack

The community Helm chart gives you Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in one deployment. This is the right starting point.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create a comprehensive values file:

# values-monitoring.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 40GB

    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m

    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # Scrape interval and evaluation
    scrapeInterval: 30s
    evaluationInterval: 30s

    # Enable remote write for long-term storage
    remoteWrite:
      - url: "http://thanos-receive.monitoring:19291/api/v1/receive"
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: "go_.*"
            action: drop  # Don't send Go runtime metrics to long-term

    # Service discovery for PodMonitors and ServiceMonitors
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

    # Additional scrape configs for non-k8s targets
    additionalScrapeConfigs:
      - job_name: 'external-node-exporter'
        static_configs:
          - targets:
              - 'bastion-host:9100'
              - 'build-server:9100'
            labels:
              environment: infrastructure

grafana:
  adminPassword: ""  # Use external secret
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: gp3

  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m

  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
      folderAnnotation: grafana_folder
      provider:
        foldersFromFilesStructure: true
    datasources:
      enabled: true

  grafana.ini:
    server:
      root_url: https://grafana.example.com
    auth.generic_oauth:
      enabled: true
      name: SSO
      allow_sign_up: true
      scopes: openid profile email
      # client_id, client_secret, auth_url, and token_url come from your
      # identity provider; inject them via env vars or an external secret
    security:
      cookie_secure: true
      strict_transport_security: true

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 128Mi
        cpu: 50m
      limits:
        memory: 256Mi
        cpu: 200m

    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi

nodeExporter:
  resources:
    requests:
      memory: 64Mi
      cpu: 50m
    limits:
      memory: 128Mi
      cpu: 200m

kubeStateMetrics:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
      cpu: 200m

Deploy it:

kubectl create namespace monitoring

helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-monitoring.yaml \
  --version 67.4.0 \
  --wait

Part 2: Instrumenting Your Applications

ServiceMonitor for Kubernetes Services

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      scrapeTimeout: 10s
      metricRelabelings:
        # Drop high-cardinality metrics you don't need. (Beware action: keep
        # here: a keep rule discards every metric that does NOT match its
        # regex, so a stray keep would silently drop everything else.)
        - sourceLabels: [__name__]
          regex: "go_gc_.*"
          action: drop
  namespaceSelector:
    matchNames:
      - production

PodMonitor for Pods Without Services

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      monitoring: enabled
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

Application Instrumentation (Go Example)

package main

import (
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests by method, path, and status",
        },
        []string{"method", "path", "status"},
    )

    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency in seconds",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )

    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
)

// responseWriter wraps http.ResponseWriter to capture the status code
// so it can be used as a label value.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        activeConnections.Inc()
        defer activeConnections.Dec()

        timer := prometheus.NewTimer(
            httpRequestDuration.WithLabelValues(r.Method, r.URL.Path),
        )
        defer timer.ObserveDuration()

        rw := &responseWriter{ResponseWriter: w, statusCode: 200}
        next.ServeHTTP(rw, r)

        // Caution: raw r.URL.Path is a cardinality risk if paths embed IDs.
        // Normalize them (see the troubleshooting section).
        httpRequestsTotal.WithLabelValues(
            r.Method, r.URL.Path, strconv.Itoa(rw.statusCode),
        ).Inc()
    })
}

func handleRoot(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok"))
}

func main() {
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    mux.Handle("/", instrumentHandler(http.HandlerFunc(handleRoot)))
    http.ListenAndServe(":8080", mux)
}

Part 3: Recording Rules for Performance

Recording rules pre-compute expensive queries. Without them, your dashboards are slow and Prometheus burns CPU on repeated aggregations.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: http.rules
      interval: 30s
      rules:
        # Request rate by service
        - record: service:http_requests:rate5m
          expr: |
            sum by (service, namespace) (
              rate(http_requests_total[5m])
            )

        # Error rate by service
        - record: service:http_errors:rate5m
          expr: |
            sum by (service, namespace) (
              rate(http_requests_total{status=~"5.."}[5m])
            )

        # Error ratio (for SLO dashboards)
        - record: service:http_error_ratio:rate5m
          expr: |
            service:http_errors:rate5m
            /
            service:http_requests:rate5m

        # P50, P90, P99 latency by service
        - record: service:http_request_duration_seconds:p50
          expr: |
            histogram_quantile(0.50,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

        - record: service:http_request_duration_seconds:p90
          expr: |
            histogram_quantile(0.90,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

        - record: service:http_request_duration_seconds:p99
          expr: |
            histogram_quantile(0.99,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

    - name: kubernetes.rules
      interval: 30s
      rules:
        # CPU utilization by namespace
        - record: namespace:container_cpu_usage:sum
          expr: |
            sum by (namespace) (
              rate(container_cpu_usage_seconds_total{
                container!="",
                image!=""
              }[5m])
            )

        # Memory utilization by namespace
        - record: namespace:container_memory_working_set_bytes:sum
          expr: |
            sum by (namespace) (
              container_memory_working_set_bytes{
                container!="",
                image!=""
              }
            )

        # Pod restart rate
        - record: namespace:kube_pod_container_restarts:rate1h
          expr: |
            sum by (namespace, pod) (
              increase(kube_pod_container_status_restarts_total[1h])
            )
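Those histogram_quantile() rules are easier to trust once you see what the function does: find the bucket where the target rank lands, then linearly interpolate inside it. Here is a toy Go version of that logic. It is deliberately simplified (no +Inf bucket handling, no NaN edge cases like the real implementation) and the bucket values are made up:

```go
package main

import "fmt"

// bucket mirrors one Prometheus histogram bucket: cumulative count of
// observations with value <= le.
type bucket struct {
	le    float64 // upper bound (the "le" label)
	count float64 // cumulative count
}

// quantile approximates PromQL's histogram_quantile(): locate the bucket
// where q*total falls, then linearly interpolate within that bucket.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			frac := (rank - prevCount) / (b.count - prevCount)
			return lower + (b.le-lower)*frac
		}
		lower = b.le
		prevCount = b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 requests: 90 under 100ms, 99 under 500ms, all under 1s.
	buckets := []bucket{{0.1, 90}, {0.5, 99}, {1, 100}}
	fmt.Printf("p50 ≈ %.4fs\n", quantile(0.50, buckets)) // ≈ 0.056s
	fmt.Printf("p99 ≈ %.4fs\n", quantile(0.99, buckets)) // ≈ 0.5s
}
```

The interpolation is why bucket boundaries matter: a quantile can never be more precise than the bucket it lands in.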

Part 4: Alerting Rules That Don't Page You for Nothing

This is where most monitoring setups fail. Alert on symptoms, not causes. Page on user impact, not internal metrics.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.alerts
      rules:
        # High error rate (user-facing)
        - alert: HighErrorRate
          expr: |
            service:http_error_ratio:rate5m > 0.01
          for: 5m
          labels:
            severity: critical
            team: "{{ $labels.namespace }}"
          annotations:
            summary: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
            description: "Error rate exceeds 1% SLO for 5 minutes."
            runbook: "https://wiki.example.com/runbooks/high-error-rate"
            dashboard: "https://grafana.example.com/d/slo-overview"

        # High latency (user-facing)
        - alert: HighLatencyP99
          expr: |
            service:http_request_duration_seconds:p99 > 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} p99 latency is {{ $value }}s"
            runbook: "https://wiki.example.com/runbooks/high-latency"

    - name: infrastructure.alerts
      rules:
        # Node is running out of disk
        - alert: NodeDiskPressure
          expr: |
            (
              node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"}
            ) < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has < 10% disk space"

        # Pod CrashLoopBackOff
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

        # Persistent volume filling up
        - alert: PersistentVolumeFillingUp
          expr: |
            (
              kubelet_volume_stats_available_bytes
              / kubelet_volume_stats_capacity_bytes
            ) < 0.15
            and
            predict_linear(kubelet_volume_stats_available_bytes[6h], 24 * 3600) < 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} will fill within 24h"

    - name: prometheus.alerts
      rules:
        # Prometheus itself is having issues
        - alert: PrometheusTargetDown
          expr: up == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"

        # Too many scrape errors
        - alert: PrometheusScrapeErrors
          expr: |
            increase(prometheus_target_scrapes_exceeded_sample_limit_total[1h]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Scrape target hitting sample limit"
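The predict_linear() call in the PVC alert deserves a note: it fits a least-squares line to the samples in the range and extrapolates forward. A toy Go version of the same idea, with illustrative data (simplified relative to the real implementation):

```go
package main

import "fmt"

// predictLinear fits y = a + b*t by least squares and returns the value
// extrapolated secondsAhead past the last sample, the same idea as
// PromQL's predict_linear().
func predictLinear(ts, vals []float64, secondsAhead float64) float64 {
	n := float64(len(ts))
	var sumT, sumV, sumTV, sumTT float64
	for i := range ts {
		sumT += ts[i]
		sumV += vals[i]
		sumTV += ts[i] * vals[i]
		sumTT += ts[i] * ts[i]
	}
	b := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
	a := (sumV - b*sumT) / n
	return a + b*(ts[len(ts)-1]+secondsAhead)
}

func main() {
	// Free bytes shrinking roughly 1 GiB per hour over 6 hourly samples:
	ts := []float64{0, 3600, 7200, 10800, 14400, 18000}
	free := []float64{20e9, 19e9, 18e9, 17e9, 16e9, 15e9}
	// Negative means the volume fills up within the horizon.
	fmt.Printf("%.2e bytes free in 24h\n", predictLinear(ts, free, 24*3600))
}
```

That is exactly the alert's condition: extrapolated available bytes below zero within 24 hours.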

Part 5: Alertmanager Configuration

Route alerts to the right people through the right channels:

# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'namespace', 'service']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: default-slack
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pagerduty-critical
        repeatInterval: 1h
        continue: true  # Also send to Slack
      - matchers:
          - name: severity
            value: critical
        receiver: critical-slack
      - matchers:
          - name: severity
            value: warning
        receiver: warning-slack
        repeatInterval: 12h

  receivers:
    - name: default-slack
      slackConfigs:
        - channel: '#alerts-default'
          apiURL:
            name: slack-webhook
            key: url
          title: '{{ .GroupLabels.alertname }}'
          text: >-
            {{ range .Alerts }}
            *{{ .Labels.severity | toUpper }}*: {{ .Annotations.summary }}
            {{ .Annotations.description }}
            {{ if .Annotations.runbook }}Runbook: {{ .Annotations.runbook }}{{ end }}
            {{ end }}
          sendResolved: true

    - name: critical-slack
      slackConfigs:
        - channel: '#alerts-critical'
          apiURL:
            name: slack-webhook
            key: url
          sendResolved: true

    - name: warning-slack
      slackConfigs:
        - channel: '#alerts-warning'
          apiURL:
            name: slack-webhook
            key: url
          sendResolved: true

    - name: pagerduty-critical
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-key
            key: routing-key
          severity: critical
          description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'

Part 6: Grafana Dashboards as Code

Store dashboards in ConfigMaps so they're version-controlled and survive Grafana restarts:

apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
  annotations:
    grafana_folder: "Service Dashboards"
data:
  service-overview.json: |
    {
      "dashboard": {
        "title": "Service Overview",
        "uid": "service-overview",
        "tags": ["services", "sre"],
        "timezone": "browser",
        "refresh": "30s",
        "panels": [
          {
            "title": "Request Rate",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
            "targets": [
              {
                "expr": "sum by (service) (service:http_requests:rate5m)",
                "legendFormat": "{{ service }}"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
            "targets": [
              {
                "expr": "service:http_error_ratio:rate5m * 100",
                "legendFormat": "{{ service }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    { "color": "green", "value": null },
                    { "color": "yellow", "value": 0.5 },
                    { "color": "red", "value": 1 }
                  ]
                }
              }
            }
          },
          {
            "title": "P99 Latency",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
            "targets": [
              {
                "expr": "service:http_request_duration_seconds:p99",
                "legendFormat": "{{ service }}"
              }
            ],
            "fieldConfig": {
              "defaults": { "unit": "s" }
            }
          },
          {
            "title": "Active Pods",
            "type": "stat",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
            "targets": [
              {
                "expr": "sum by (namespace) (kube_pod_status_phase{phase='Running'})",
                "legendFormat": "{{ namespace }}"
              }
            ]
          }
        ]
      }
    }

Part 7: Long-Term Storage with Thanos

Prometheus retention should be 15-30 days. For long-term metrics, add Thanos sidecar.

# Add to kube-prometheus-stack values
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: config.yaml

    # Keep 15 days locally
    retention: 15d

Thanos object storage config:

# thanos-objstore-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  config.yaml: |
    type: S3
    config:
      bucket: monitoring-thanos-store
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1

Deploy Thanos components:

helm repo add bitnami https://charts.bitnami.com/bitnami

helm install thanos bitnami/thanos \
  --namespace monitoring \
  --set 'query.stores[0]=prometheus-kube-prometheus-stack-thanos-discovery.monitoring:10901' \
  --set compactor.enabled=true \
  --set compactor.retentionResolutionRaw=30d \
  --set compactor.retentionResolution5m=180d \
  --set compactor.retentionResolution1h=365d \
  --set storegateway.enabled=true \
  --set existingObjstoreSecret=thanos-objstore

This gives you 30 days of raw resolution, 6 months at 5-minute resolution, and a year at 1-hour resolution. Enough to spot trends, do capacity planning, and satisfy auditors.

The Monitoring Stack Checklist

| Component | Purpose | Without It |
|---|---|---|
| Prometheus | Metrics collection and short-term storage | No metrics at all |
| node-exporter | Host-level metrics (CPU, memory, disk, network) | Blind to infrastructure issues |
| kube-state-metrics | Kubernetes object metrics (pods, deployments) | Can't see K8s state |
| Recording rules | Pre-computed aggregations | Slow dashboards, high CPU |
| Alerting rules | Automated incident detection | Manual monitoring only |
| Alertmanager | Alert routing and deduplication | Alert storms, no routing |
| Grafana | Visualization and exploration | Raw PromQL only |
| Thanos/Cortex | Long-term storage | Lose metrics after retention |

Part 8: Troubleshooting Common Issues

Prometheus Running Out of Memory

This is the most common operational issue. Prometheus memory usage is proportional to the number of active time series.
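Why series count rather than scrape volume? Each unique label combination is its own time series, so the worst case for one metric is the product of its per-label cardinalities. A quick Go illustration (the label counts are hypothetical):

```go
package main

import "fmt"

// worstCaseSeries returns the upper bound on time series one metric can
// create: the product of the number of distinct values per label.
func worstCaseSeries(labelValues map[string]int) int {
	n := 1
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	// A histogram with 5 methods, 40 paths, and 12 buckets looks harmless:
	fmt.Println(worstCaseSeries(map[string]int{"method": 5, "path": 40, "le": 12})) // 2400
	// ...until someone labels by user ID:
	fmt.Println(worstCaseSeries(map[string]int{"method": 5, "user_id": 100000, "le": 12})) // 6000000
}
```

One careless label turns a 2,400-series metric into six million series, which is why the fixes below all attack cardinality.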

# Port-forward Prometheus first:
#   kubectl port-forward -n monitoring svc/prometheus-operated 9090

# Check current time series count (top metrics by series count)
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'

# Find the highest cardinality metrics
curl -s http://localhost:9090/api/v1/status/tsdb | jq '
  .data.seriesCountByMetricName |
  sort_by(-.value) |
  .[0:20] |
  .[] | "\(.name): \(.value) series"'

Common culprits and fixes:

| Metric | Typical Cause | Fix |
|---|---|---|
| apiserver_request_duration_seconds_bucket | Too many le buckets | Drop with relabeling |
| container_* | Monitoring paused/stopped containers | Filter container!="" |
| http_request_duration_seconds_bucket | High-cardinality path labels | Normalize URL paths |
| go_gc_* | Every Go service exports these | Drop with relabeling |

Drop high-cardinality metrics you don't need:

# In your ServiceMonitor or scrape config
metricRelabelings:
  # Drop Go garbage collector metrics (rarely needed)
  - sourceLabels: [__name__]
    regex: "go_(gc|memstats|threads|info)_.*"
    action: drop

  # Drop unused histogram buckets
  - sourceLabels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop

  # Normalize high-cardinality URL paths
  - sourceLabels: [path]
    regex: "/api/v1/users/[0-9]+"
    targetLabel: path
    replacement: "/api/v1/users/:id"
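Relabeling fixes this at scrape time, but you can also normalize in the application itself so raw paths never become label values at all. A minimal Go sketch; the regex and placeholder are illustrative, not a complete router-aware solution:

```go
package main

import (
	"fmt"
	"regexp"
)

// Numeric path segments are the usual cardinality bomb. Collapsing them
// to a placeholder mirrors the relabeling rule above, done client-side.
var idSegment = regexp.MustCompile(`/[0-9]+(/|$)`)

// normalizePath replaces purely numeric segments with ":id".
func normalizePath(path string) string {
	return idSegment.ReplaceAllString(path, "/:id$1")
}

func main() {
	fmt.Println(normalizePath("/api/v1/users/42"))          // /api/v1/users/:id
	fmt.Println(normalizePath("/api/v1/users/42/orders/7")) // /api/v1/users/:id/orders/:id
}
```

If you use a router like chi or gorilla/mux, prefer the route template it already knows (e.g. /users/{id}) over regex normalization.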

Grafana Dashboards Loading Slowly

Slow dashboards are almost always caused by unoptimized PromQL queries hitting raw metrics instead of recording rules.

Before (slow — computes on every dashboard load):

sum by (service) (rate(http_requests_total{namespace="production"}[5m]))

After (fast — uses pre-computed recording rule):

service:http_requests:rate5m{namespace="production"}

Other optimization tips:

  • Set dashboard time range to 6 hours or less by default. Longer ranges query more data.
  • Use $__rate_interval instead of hardcoded intervals like [5m].
  • Add template variables for namespace and service to filter queries instead of aggregating everything.

Alertmanager Not Sending Notifications

# Check Alertmanager status
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# View active alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[0:5]'

# Check alert routing (shows which receiver an alert would hit)
curl -s http://localhost:9093/api/v2/alerts/groups | jq '.[] | {receiver, alerts: [.alerts[].labels.alertname]}'

# Test webhook connectivity from inside the cluster
# (the operator runs Alertmanager as a StatefulSet, not a Deployment)
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -- \
  wget -qO- --timeout=5 https://hooks.slack.com/services/... 2>&1

Common issues:

  1. Slack webhook URL changed — re-create the secret with the new URL.
  2. Alert is in pending state — it hasn't fired long enough to meet the for duration.
  3. Inhibition rules — a lower-severity alert may be suppressed by a higher-severity one.
  4. GroupWait too long — set groupWait: 30s for critical alerts.

Part 9: SLO-Based Monitoring

The most mature monitoring setup I deploy uses SLOs (Service Level Objectives) as the foundation for all alerting.

Defining SLOs

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.rules
      interval: 30s
      rules:
        # Availability SLO: 99.9% of requests succeed
        - record: slo:api_availability:ratio
          expr: |
            1 - (
              sum(rate(http_requests_total{namespace="production",status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{namespace="production"}[5m]))
            )

        # Latency SLO: 99% of requests complete in < 500ms
        - record: slo:api_latency:ratio
          expr: |
            sum(rate(http_request_duration_seconds_bucket{
              namespace="production",
              le="0.5"
            }[5m]))
            /
            sum(rate(http_request_duration_seconds_count{
              namespace="production"
            }[5m]))

        # Error budget remaining (30-day window)
        - record: slo:api_availability:error_budget_remaining
          expr: |
            1 - (
              (1 - slo:api_availability:ratio)
              /
              (1 - 0.999)
            )

    - name: slo.alerts
      rules:
        # Burn rate alert: 2% of monthly error budget consumed in 1 hour
        - alert: SLOHighBurnRate
          expr: |
            slo:api_availability:error_budget_remaining < 0.98
            and
            (1 - slo:api_availability:ratio) > (14.4 * (1 - 0.999))
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API availability SLO burn rate is critical"
            description: "At the current error rate, the monthly error budget will be exhausted in about two days."
            runbook: "https://wiki.example.com/runbooks/slo-burn-rate"

The burn rate approach avoids two problems: alerting too early on minor blips, and alerting too late on sustained degradation. A 14.4x burn rate means you'll exhaust your monthly error budget in roughly two days (720h / 14.4 ≈ 50 hours) if it continues — that's worth paging someone.
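The arithmetic is simple enough to sanity-check in code. The 14.4x and 6x rates below are the commonly used multi-window values from SRE practice, shown here purely for illustration:

```go
package main

import "fmt"

// hoursToExhaustion: at a given burn rate, an error budget sized for
// windowHours lasts windowHours / burnRate hours.
func hoursToExhaustion(windowHours, burnRate float64) float64 {
	return windowHours / burnRate
}

func main() {
	// 14.4x burn over a 30-day (720h) window: page someone.
	fmt.Printf("%.1f hours\n", hoursToExhaustion(720, 14.4)) // ≈ 50 hours, about 2 days
	// 6x burn: budget lasts ~5 days, often a ticket rather than a page.
	fmt.Printf("%.1f hours\n", hoursToExhaustion(720, 6)) // ≈ 120 hours
}
```

A 1x burn rate, for comparison, exhausts the budget in exactly the window: you are spending it precisely as fast as the SLO allows.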

SLO Grafana Dashboard

Add an error budget panel to your service overview dashboard:

{
  "title": "Error Budget Remaining (30d)",
  "type": "gauge",
  "targets": [
    {
      "expr": "slo:api_availability:error_budget_remaining * 100",
      "legendFormat": "Budget Remaining"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100,
      "thresholds": {
        "steps": [
          { "color": "red", "value": 0 },
          { "color": "yellow", "value": 25 },
          { "color": "green", "value": 50 }
        ]
      }
    }
  }
}

Part 10: Scaling Prometheus

When a Single Prometheus Isn't Enough

For clusters with more than 500 nodes or 5 million active series, a single Prometheus instance runs into memory and storage limits. Options:

Functional sharding — run multiple Prometheus instances, each scraping different workloads:

# prometheus-apps.yaml - Scrapes application metrics
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        monitoring-target: applications
    externalLabels:
      shard: apps

# prometheus-infra.yaml - Scrapes infrastructure metrics
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        monitoring-target: infrastructure
    externalLabels:
      shard: infra

Use Thanos Query to provide a unified view across shards:

helm install thanos-query bitnami/thanos \
  --namespace monitoring \
  --set 'query.stores[0]=prometheus-apps-thanos:10901' \
  --set 'query.stores[1]=prometheus-infra-thanos:10901'

Point Grafana at Thanos Query instead of individual Prometheus instances. Your dashboards work exactly the same — Thanos handles the fan-out and deduplication.

What I Wish Someone Told Me

  1. Start with USE and RED methods. For infrastructure: Utilization, Saturation, Errors. For services: Rate, Errors, Duration. These cover 90% of your monitoring needs.
  2. Recording rules are not optional. A dashboard that takes 30 seconds to load won't get used during an incident.
  3. High-cardinality labels will destroy Prometheus. Never use user IDs, request IDs, or timestamps as label values. Each unique combination creates a new time series.
  4. Alert fatigue kills on-call. Every alert should have a runbook. Every page should require human action. If it can be automated, it shouldn't page you.
  5. Monitor the monitoring. If Prometheus goes down and you don't notice, you have no monitoring at all. Set up external checks on your monitoring stack.
  6. SLOs before dashboards. Define what "healthy" means for each service before building dashboards. Without SLOs, you're just looking at graphs — you're not making decisions.
  7. Label standardization matters early. Agree on label names (service vs app, environment vs env) before you have 50 ServiceMonitors. Renaming labels later is painful.

The goal isn't to collect every possible metric. The goal is to answer two questions at any time: "Is the system healthy?" and "If not, where is it broken?" Build toward that, and you'll have a monitoring stack that earns its keep.

Invest the time upfront to build this stack properly. A well-configured Prometheus with meaningful recording rules, SLO-based alerts, and Grafana dashboards that tell a story will save your team hundreds of hours in incident response over its lifetime. The alternative — waking up to an outage with no metrics, no alerts, and no dashboards — is a pain I've felt too many times to recommend to anyone. Build the stack. Trust the stack. Then make it better, one recording rule at a time.

Riku Tanaka

SRE & Observability Engineer

If it's not measured, it doesn't exist. SLO-driven, metrics-obsessed, and the person who gets paged at 3 AM so you don't have to. Observability isn't optional.
