
Building a Complete Prometheus + Grafana Monitoring Stack from Scratch

Riku Tanaka · 15 min read

If It's Not Measured, It Doesn't Exist

I've been paged at every hour of the night. The difference between a 5-minute incident and a 5-hour one is almost always the same thing: observability. Teams with good monitoring detect issues before users do, diagnose root causes from dashboards instead of guesswork, and resolve incidents in minutes instead of hours.

This guide builds a complete monitoring stack from zero. Not a toy setup — a production-grade system with service discovery, recording rules, meaningful alerts, and dashboards that actually help during incidents. By the end, you'll have the same monitoring infrastructure I deploy for production Kubernetes clusters.

Architecture Overview

┌──────────────────────────────────────────────────┐
│                    Grafana                        │
│           (Dashboards, Exploration)               │
└────────────┬───────────────────┬─────────────────┘
             │                   │
    ┌────────▼────────┐  ┌──────▼──────────┐
    │   Prometheus     │  │   Alertmanager   │
    │  (Metrics Store) │  │  (Notification)  │
    └────────┬────────┘  └─────────────────┘
    ┌────────▼────────────────────────────┐
    │         Scrape Targets              │
    │  ┌─────────┐ ┌──────┐ ┌─────────┐  │
    │  │node-exp.│ │kube-  │ │app      │  │
    │  │         │ │state  │ │metrics  │  │
    │  └─────────┘ └──────┘ └─────────┘  │
    └─────────────────────────────────────┘

Part 1: Installing the Stack with Helm

kube-prometheus-stack

The community Helm chart gives you Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in one deployment. This is the right starting point.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create a comprehensive values file:

# values-monitoring.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 40GB

    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m

    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # Scrape interval and evaluation
    scrapeInterval: 30s
    evaluationInterval: 30s

    # Enable remote write for long-term storage
    remoteWrite:
      - url: "http://thanos-receive.monitoring:19291/api/v1/receive"
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: "go_.*"
            action: drop  # Don't send Go runtime metrics to long-term

    # Service discovery for PodMonitors and ServiceMonitors
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

    # Additional scrape configs for non-k8s targets
    additionalScrapeConfigs:
      - job_name: 'external-node-exporter'
        static_configs:
          - targets:
              - 'bastion-host:9100'
              - 'build-server:9100'
            labels:
              environment: infrastructure

grafana:
  adminPassword: ""  # Use external secret
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: gp3

  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m

  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
      folderAnnotation: grafana_folder
      provider:
        foldersFromFilesStructure: true
    datasources:
      enabled: true

  grafana.ini:
    server:
      root_url: https://grafana.example.com
    auth.generic_oauth:
      enabled: true
      name: SSO
      allow_sign_up: true
      scopes: openid profile email
      # client_id, client_secret, auth_url, and token_url come from your
      # identity provider; inject them via env vars or an external secret
    security:
      cookie_secure: true
      strict_transport_security: true

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 128Mi
        cpu: 50m
      limits:
        memory: 256Mi
        cpu: 200m

    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi

nodeExporter:
  resources:
    requests:
      memory: 64Mi
      cpu: 50m
    limits:
      memory: 128Mi
      cpu: 200m

kubeStateMetrics:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
      cpu: 200m

Deploy it:

kubectl create namespace monitoring

helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-monitoring.yaml \
  --version 67.4.0 \
  --wait

Part 2: Instrumenting Your Applications

ServiceMonitor for Kubernetes Services

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      scrapeTimeout: 10s
      metricRelabelings:
        # Drop high-cardinality metrics you don't need. (Beware action: keep
        # here: a keep rule discards every metric that does NOT match its
        # regex, so a stray keep would silently drop everything else.)
        - sourceLabels: [__name__]
          regex: "go_gc_.*"
          action: drop
  namespaceSelector:
    matchNames:
      - production

PodMonitor for Pods Without Services

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      monitoring: enabled
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

Application Instrumentation (Go Example)

package main

import (
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests by method, path, and status",
        },
        []string{"method", "path", "status"},
    )

    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency in seconds",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )

    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
)

// responseWriter wraps http.ResponseWriter to capture the status code
// so it can be used as a label value.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        activeConnections.Inc()
        defer activeConnections.Dec()

        timer := prometheus.NewTimer(
            httpRequestDuration.WithLabelValues(r.Method, r.URL.Path),
        )
        defer timer.ObserveDuration()

        rw := &responseWriter{ResponseWriter: w, statusCode: 200}
        next.ServeHTTP(rw, r)

        // Caution: raw r.URL.Path is a cardinality risk if paths embed IDs.
        // Normalize them (see the troubleshooting section).
        httpRequestsTotal.WithLabelValues(
            r.Method, r.URL.Path, strconv.Itoa(rw.statusCode),
        ).Inc()
    })
}

func handleRoot(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok"))
}

func main() {
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    mux.Handle("/", instrumentHandler(http.HandlerFunc(handleRoot)))
    http.ListenAndServe(":8080", mux)
}

Part 3: Recording Rules for Performance

Recording rules pre-compute expensive queries. Without them, your dashboards are slow and Prometheus burns CPU on repeated aggregations.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: http.rules
      interval: 30s
      rules:
        # Request rate by service
        - record: service:http_requests:rate5m
          expr: |
            sum by (service, namespace) (
              rate(http_requests_total[5m])
            )

        # Error rate by service
        - record: service:http_errors:rate5m
          expr: |
            sum by (service, namespace) (
              rate(http_requests_total{status=~"5.."}[5m])
            )

        # Error ratio (for SLO dashboards)
        - record: service:http_error_ratio:rate5m
          expr: |
            service:http_errors:rate5m
            /
            service:http_requests:rate5m

        # P50, P90, P99 latency by service
        - record: service:http_request_duration_seconds:p50
          expr: |
            histogram_quantile(0.50,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

        - record: service:http_request_duration_seconds:p90
          expr: |
            histogram_quantile(0.90,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

        - record: service:http_request_duration_seconds:p99
          expr: |
            histogram_quantile(0.99,
              sum by (service, namespace, le) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            )

    - name: kubernetes.rules
      interval: 30s
      rules:
        # CPU utilization by namespace
        - record: namespace:container_cpu_usage:sum
          expr: |
            sum by (namespace) (
              rate(container_cpu_usage_seconds_total{
                container!="",
                image!=""
              }[5m])
            )

        # Memory utilization by namespace
        - record: namespace:container_memory_working_set_bytes:sum
          expr: |
            sum by (namespace) (
              container_memory_working_set_bytes{
                container!="",
                image!=""
              }
            )

        # Pod restart rate
        - record: namespace:kube_pod_container_restarts:rate1h
          expr: |
            sum by (namespace, pod) (
              increase(kube_pod_container_status_restarts_total[1h])
            )
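Those histogram_quantile() rules are easier to trust once you see what the function does: find the bucket where the target rank lands, then linearly interpolate inside it. Here is a toy Go version of that logic. It is deliberately simplified (no +Inf bucket handling, no NaN edge cases like the real implementation) and the bucket values are made up:

```go
package main

import "fmt"

// bucket mirrors one Prometheus histogram bucket: cumulative count of
// observations with value <= le.
type bucket struct {
	le    float64 // upper bound (the "le" label)
	count float64 // cumulative count
}

// quantile approximates PromQL's histogram_quantile(): locate the bucket
// where q*total falls, then linearly interpolate within that bucket.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			frac := (rank - prevCount) / (b.count - prevCount)
			return lower + (b.le-lower)*frac
		}
		lower = b.le
		prevCount = b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 requests: 90 under 100ms, 99 under 500ms, all under 1s.
	buckets := []bucket{{0.1, 90}, {0.5, 99}, {1, 100}}
	fmt.Printf("p50 ≈ %.4fs\n", quantile(0.50, buckets)) // ≈ 0.056s
	fmt.Printf("p99 ≈ %.4fs\n", quantile(0.99, buckets)) // ≈ 0.5s
}
```

The interpolation is why bucket boundaries matter: a quantile can never be more precise than the bucket it lands in.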

Part 4: Alerting Rules That Don't Page You for Nothing

This is where most monitoring setups fail. Alert on symptoms, not causes. Page on user impact, not internal metrics.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.alerts
      rules:
        # High error rate (user-facing)
        - alert: HighErrorRate
          expr: |
            service:http_error_ratio:rate5m > 0.01
          for: 5m
          labels:
            severity: critical
            team: "{{ $labels.namespace }}"
          annotations:
            summary: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
            description: "Error rate exceeds 1% SLO for 5 minutes."
            runbook: "https://wiki.example.com/runbooks/high-error-rate"
            dashboard: "https://grafana.example.com/d/slo-overview"

        # High latency (user-facing)
        - alert: HighLatencyP99
          expr: |
            service:http_request_duration_seconds:p99 > 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} p99 latency is {{ $value }}s"
            runbook: "https://wiki.example.com/runbooks/high-latency"

    - name: infrastructure.alerts
      rules:
        # Node is running out of disk
        - alert: NodeDiskPressure
          expr: |
            (
              node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"}
            ) < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has < 10% disk space"

        # Pod CrashLoopBackOff
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

        # Persistent volume filling up
        - alert: PersistentVolumeFillingUp
          expr: |
            (
              kubelet_volume_stats_available_bytes
              / kubelet_volume_stats_capacity_bytes
            ) < 0.15
            and
            predict_linear(kubelet_volume_stats_available_bytes[6h], 24 * 3600) < 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} will fill within 24h"

    - name: prometheus.alerts
      rules:
        # Prometheus itself is having issues
        - alert: PrometheusTargetDown
          expr: up == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"

        # Too many scrape errors
        - alert: PrometheusScrapeErrors
          expr: |
            increase(prometheus_target_scrapes_exceeded_sample_limit_total[1h]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Scrape target hitting sample limit"
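The predict_linear() call in the PVC alert deserves a note: it fits a least-squares line to the samples in the range and extrapolates forward. A toy Go version of the same idea, with illustrative data (simplified relative to the real implementation):

```go
package main

import "fmt"

// predictLinear fits y = a + b*t by least squares and returns the value
// extrapolated secondsAhead past the last sample, the same idea as
// PromQL's predict_linear().
func predictLinear(ts, vals []float64, secondsAhead float64) float64 {
	n := float64(len(ts))
	var sumT, sumV, sumTV, sumTT float64
	for i := range ts {
		sumT += ts[i]
		sumV += vals[i]
		sumTV += ts[i] * vals[i]
		sumTT += ts[i] * ts[i]
	}
	b := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
	a := (sumV - b*sumT) / n
	return a + b*(ts[len(ts)-1]+secondsAhead)
}

func main() {
	// Free bytes shrinking roughly 1 GiB per hour over 6 hourly samples:
	ts := []float64{0, 3600, 7200, 10800, 14400, 18000}
	free := []float64{20e9, 19e9, 18e9, 17e9, 16e9, 15e9}
	// Negative means the volume fills up within the horizon.
	fmt.Printf("%.2e bytes free in 24h\n", predictLinear(ts, free, 24*3600))
}
```

That is exactly the alert's condition: extrapolated available bytes below zero within 24 hours.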

Part 5: Alertmanager Configuration

Route alerts to the right people through the right channels:

# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'namespace', 'service']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: default-slack
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pagerduty-critical
        repeatInterval: 1h
        continue: true  # Also send to Slack
      - matchers:
          - name: severity
            value: critical
        receiver: critical-slack
      - matchers:
          - name: severity
            value: warning
        receiver: warning-slack
        repeatInterval: 12h

  receivers:
    - name: default-slack
      slackConfigs:
        - channel: '#alerts-default'
          apiURL:
            name: slack-webhook
            key: url
          title: '{{ .GroupLabels.alertname }}'
          text: >-
            {{ range .Alerts }}
            *{{ .Labels.severity | toUpper }}*: {{ .Annotations.summary }}
            {{ .Annotations.description }}
            {{ if .Annotations.runbook }}Runbook: {{ .Annotations.runbook }}{{ end }}
            {{ end }}
          sendResolved: true

    - name: critical-slack
      slackConfigs:
        - channel: '#alerts-critical'
          apiURL:
            name: slack-webhook
            key: url
          sendResolved: true

    - name: warning-slack
      slackConfigs:
        - channel: '#alerts-warning'
          apiURL:
            name: slack-webhook
            key: url
          sendResolved: true

    - name: pagerduty-critical
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-key
            key: routing-key
          severity: critical
          description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'

Part 6: Grafana Dashboards as Code

Store dashboards in ConfigMaps so they're version-controlled and survive Grafana restarts:

apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
  annotations:
    grafana_folder: "Service Dashboards"
data:
  service-overview.json: |
    {
      "dashboard": {
        "title": "Service Overview",
        "uid": "service-overview",
        "tags": ["services", "sre"],
        "timezone": "browser",
        "refresh": "30s",
        "panels": [
          {
            "title": "Request Rate",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
            "targets": [
              {
                "expr": "sum by (service) (service:http_requests:rate5m)",
                "legendFormat": "{{ service }}"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
            "targets": [
              {
                "expr": "service:http_error_ratio:rate5m * 100",
                "legendFormat": "{{ service }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    { "color": "green", "value": null },
                    { "color": "yellow", "value": 0.5 },
                    { "color": "red", "value": 1 }
                  ]
                }
              }
            }
          },
          {
            "title": "P99 Latency",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
            "targets": [
              {
                "expr": "service:http_request_duration_seconds:p99",
                "legendFormat": "{{ service }}"
              }
            ],
            "fieldConfig": {
              "defaults": { "unit": "s" }
            }
          },
          {
            "title": "Active Pods",
            "type": "stat",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
            "targets": [
              {
                "expr": "sum by (namespace) (kube_pod_status_phase{phase='Running'})",
                "legendFormat": "{{ namespace }}"
              }
            ]
          }
        ]
      }
    }

Part 7: Long-Term Storage with Thanos

Prometheus retention should be 15-30 days. For long-term metrics, add Thanos sidecar.

# Add to kube-prometheus-stack values
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: config.yaml

    # Keep 15 days locally
    retention: 15d

Thanos object storage config:

# thanos-objstore-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  config.yaml: |
    type: S3
    config:
      bucket: monitoring-thanos-store
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1

Deploy Thanos components:

helm repo add bitnami https://charts.bitnami.com/bitnami

helm install thanos bitnami/thanos \
  --namespace monitoring \
  --set 'query.stores[0]=prometheus-kube-prometheus-stack-thanos-discovery.monitoring:10901' \
  --set compactor.enabled=true \
  --set compactor.retentionResolutionRaw=30d \
  --set compactor.retentionResolution5m=180d \
  --set compactor.retentionResolution1h=365d \
  --set storegateway.enabled=true \
  --set existingObjstoreSecret=thanos-objstore

This gives you 30 days of raw resolution, 6 months at 5-minute resolution, and a year at 1-hour resolution. Enough to spot trends, do capacity planning, and satisfy auditors.

The Monitoring Stack Checklist

| Component | Purpose | Without It |
|---|---|---|
| Prometheus | Metrics collection and short-term storage | No metrics at all |
| node-exporter | Host-level metrics (CPU, memory, disk, network) | Blind to infrastructure issues |
| kube-state-metrics | Kubernetes object metrics (pods, deployments) | Can't see K8s state |
| Recording rules | Pre-computed aggregations | Slow dashboards, high CPU |
| Alerting rules | Automated incident detection | Manual monitoring only |
| Alertmanager | Alert routing and deduplication | Alert storms, no routing |
| Grafana | Visualization and exploration | Raw PromQL only |
| Thanos/Cortex | Long-term storage | Lose metrics after retention |

Part 8: Troubleshooting Common Issues

Prometheus Running Out of Memory

This is the most common operational issue. Prometheus memory usage is proportional to the number of active time series.
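Why series count rather than scrape volume? Each unique label combination is its own time series, so the worst case for one metric is the product of its per-label cardinalities. A quick Go illustration (the label counts are hypothetical):

```go
package main

import "fmt"

// worstCaseSeries returns the upper bound on time series one metric can
// create: the product of the number of distinct values per label.
func worstCaseSeries(labelValues map[string]int) int {
	n := 1
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	// A histogram with 5 methods, 40 paths, and 12 buckets looks harmless:
	fmt.Println(worstCaseSeries(map[string]int{"method": 5, "path": 40, "le": 12})) // 2400
	// ...until someone labels by user ID:
	fmt.Println(worstCaseSeries(map[string]int{"method": 5, "user_id": 100000, "le": 12})) // 6000000
}
```

One careless label turns a 2,400-series metric into six million series, which is why the fixes below all attack cardinality.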

# Port-forward Prometheus first:
#   kubectl port-forward -n monitoring svc/prometheus-operated 9090

# Check current time series count (top metrics by series count)
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'

# Find the highest cardinality metrics
curl -s http://localhost:9090/api/v1/status/tsdb | jq '
  .data.seriesCountByMetricName |
  sort_by(-.value) |
  .[0:20] |
  .[] | "\(.name): \(.value) series"'

Common culprits and fixes:

| Metric | Typical Cause | Fix |
|---|---|---|
| apiserver_request_duration_seconds_bucket | Too many le buckets | Drop with relabeling |
| container_* | Monitoring paused/stopped containers | Filter container!="" |
| http_request_duration_seconds_bucket | High-cardinality path labels | Normalize URL paths |
| go_gc_* | Every Go service exports these | Drop with relabeling |

Drop high-cardinality metrics you don't need:

# In your ServiceMonitor or scrape config
metricRelabelings:
  # Drop Go garbage collector metrics (rarely needed)
  - sourceLabels: [__name__]
    regex: "go_(gc|memstats|threads|info)_.*"
    action: drop

  # Drop unused histogram buckets
  - sourceLabels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop

  # Normalize high-cardinality URL paths
  - sourceLabels: [path]
    regex: "/api/v1/users/[0-9]+"
    targetLabel: path
    replacement: "/api/v1/users/:id"
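Relabeling fixes this at scrape time, but you can also normalize in the application itself so raw paths never become label values at all. A minimal Go sketch; the regex and placeholder are illustrative, not a complete router-aware solution:

```go
package main

import (
	"fmt"
	"regexp"
)

// Numeric path segments are the usual cardinality bomb. Collapsing them
// to a placeholder mirrors the relabeling rule above, done client-side.
var idSegment = regexp.MustCompile(`/[0-9]+(/|$)`)

// normalizePath replaces purely numeric segments with ":id".
func normalizePath(path string) string {
	return idSegment.ReplaceAllString(path, "/:id$1")
}

func main() {
	fmt.Println(normalizePath("/api/v1/users/42"))          // /api/v1/users/:id
	fmt.Println(normalizePath("/api/v1/users/42/orders/7")) // /api/v1/users/:id/orders/:id
}
```

If you use a router like chi or gorilla/mux, prefer the route template it already knows (e.g. /users/{id}) over regex normalization.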

Grafana Dashboards Loading Slowly

Slow dashboards are almost always caused by unoptimized PromQL queries hitting raw metrics instead of recording rules.

Before (slow — computes on every dashboard load):

sum by (service) (rate(http_requests_total{namespace="production"}[5m]))

After (fast — uses pre-computed recording rule):

service:http_requests:rate5m{namespace="production"}

Other optimization tips:

  • Set dashboard time range to 6 hours or less by default. Longer ranges query more data.
  • Use $__rate_interval instead of hardcoded intervals like [5m].
  • Add template variables for namespace and service to filter queries instead of aggregating everything.

Alertmanager Not Sending Notifications

# Check Alertmanager status
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# View active alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[0:5]'

# Check alert routing (shows which receiver an alert would hit)
curl -s http://localhost:9093/api/v2/alerts/groups | jq '.[] | {receiver, alerts: [.alerts[].labels.alertname]}'

# Test webhook connectivity from inside the cluster
# (the operator runs Alertmanager as a StatefulSet, not a Deployment)
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -- \
  wget -qO- --timeout=5 https://hooks.slack.com/services/... 2>&1

Common issues:

  1. Slack webhook URL changed — re-create the secret with the new URL.
  2. Alert is in pending state — it hasn't fired long enough to meet the for duration.
  3. Inhibition rules — a lower-severity alert may be suppressed by a higher-severity one.
  4. GroupWait too long — set groupWait: 30s for critical alerts.

Part 9: SLO-Based Monitoring

The most mature monitoring setup I deploy uses SLOs (Service Level Objectives) as the foundation for all alerting.

Defining SLOs

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.rules
      interval: 30s
      rules:
        # Availability SLO: 99.9% of requests succeed
        - record: slo:api_availability:ratio
          expr: |
            1 - (
              sum(rate(http_requests_total{namespace="production",status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{namespace="production"}[5m]))
            )

        # Latency SLO: 99% of requests complete in < 500ms
        - record: slo:api_latency:ratio
          expr: |
            sum(rate(http_request_duration_seconds_bucket{
              namespace="production",
              le="0.5"
            }[5m]))
            /
            sum(rate(http_request_duration_seconds_count{
              namespace="production"
            }[5m]))

        # Error budget remaining (30-day window)
        - record: slo:api_availability:error_budget_remaining
          expr: |
            1 - (
              (1 - slo:api_availability:ratio)
              /
              (1 - 0.999)
            )

    - name: slo.alerts
      rules:
        # Burn rate alert: 2% of monthly error budget consumed in 1 hour
        - alert: SLOHighBurnRate
          expr: |
            slo:api_availability:error_budget_remaining < 0.98
            and
            (1 - slo:api_availability:ratio) > (14.4 * (1 - 0.999))
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API availability SLO burn rate is critical"
            description: "At the current error rate, the monthly error budget will be exhausted in about two days."
            runbook: "https://wiki.example.com/runbooks/slo-burn-rate"

The burn rate approach avoids two problems: alerting too early on minor blips, and alerting too late on sustained degradation. A 14.4x burn rate means you'll exhaust your monthly error budget in roughly two days (720h / 14.4 ≈ 50 hours) if it continues — that's worth paging someone.
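The arithmetic is simple enough to sanity-check in code. The 14.4x and 6x rates below are the commonly used multi-window values from SRE practice, shown here purely for illustration:

```go
package main

import "fmt"

// hoursToExhaustion: at a given burn rate, an error budget sized for
// windowHours lasts windowHours / burnRate hours.
func hoursToExhaustion(windowHours, burnRate float64) float64 {
	return windowHours / burnRate
}

func main() {
	// 14.4x burn over a 30-day (720h) window: page someone.
	fmt.Printf("%.1f hours\n", hoursToExhaustion(720, 14.4)) // ≈ 50 hours, about 2 days
	// 6x burn: budget lasts ~5 days, often a ticket rather than a page.
	fmt.Printf("%.1f hours\n", hoursToExhaustion(720, 6)) // ≈ 120 hours
}
```

A 1x burn rate, for comparison, exhausts the budget in exactly the window: you are spending it precisely as fast as the SLO allows.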

SLO Grafana Dashboard

Add an error budget panel to your service overview dashboard:

{
  "title": "Error Budget Remaining (30d)",
  "type": "gauge",
  "targets": [
    {
      "expr": "slo:api_availability:error_budget_remaining * 100",
      "legendFormat": "Budget Remaining"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100,
      "thresholds": {
        "steps": [
          { "color": "red", "value": 0 },
          { "color": "yellow", "value": 25 },
          { "color": "green", "value": 50 }
        ]
      }
    }
  }
}

Part 10: Scaling Prometheus

When a Single Prometheus Isn't Enough

For clusters with more than 500 nodes or 5 million active series, a single Prometheus instance runs into memory and storage limits. Options:

Functional sharding — run multiple Prometheus instances, each scraping different workloads:

# prometheus-apps.yaml - Scrapes application metrics
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        monitoring-target: applications
    externalLabels:
      shard: apps

# prometheus-infra.yaml - Scrapes infrastructure metrics
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        monitoring-target: infrastructure
    externalLabels:
      shard: infra

Use Thanos Query to provide a unified view across shards:

helm install thanos-query bitnami/thanos \
  --namespace monitoring \
  --set 'query.stores[0]=prometheus-apps-thanos:10901' \
  --set 'query.stores[1]=prometheus-infra-thanos:10901'

Point Grafana at Thanos Query instead of individual Prometheus instances. Your dashboards work exactly the same — Thanos handles the fan-out and deduplication.

What I Wish Someone Told Me

  1. Start with USE and RED methods. For infrastructure: Utilization, Saturation, Errors. For services: Rate, Errors, Duration. These cover 90% of your monitoring needs.
  2. Recording rules are not optional. A dashboard that takes 30 seconds to load won't get used during an incident.
  3. High-cardinality labels will destroy Prometheus. Never use user IDs, request IDs, or timestamps as label values. Each unique combination creates a new time series.
  4. Alert fatigue kills on-call. Every alert should have a runbook. Every page should require human action. If it can be automated, it shouldn't page you.
  5. Monitor the monitoring. If Prometheus goes down and you don't notice, you have no monitoring at all. Set up external checks on your monitoring stack.
  6. SLOs before dashboards. Define what "healthy" means for each service before building dashboards. Without SLOs, you're just looking at graphs — you're not making decisions.
  7. Label standardization matters early. Agree on label names (service vs app, environment vs env) before you have 50 ServiceMonitors. Renaming labels later is painful.

The goal isn't to collect every possible metric. The goal is to answer two questions at any time: "Is the system healthy?" and "If not, where is it broken?" Build toward that, and you'll have a monitoring stack that earns its keep.

Invest the time upfront to build this stack properly. A well-configured Prometheus with meaningful recording rules, SLO-based alerts, and Grafana dashboards that tell a story will save your team hundreds of hours in incident response over its lifetime. The alternative — waking up to an outage with no metrics, no alerts, and no dashboards — is a pain I've felt too many times to recommend to anyone. Build the stack. Trust the stack. Then make it better, one recording rule at a time.

Riku Tanaka

SRE & Observability Engineer

If it's not measured, it doesn't exist. SLO-driven, metrics-obsessed, and the person who gets paged at 3 AM so you don't have to. Observability isn't optional.
