Monitoring

Metrics, alerting, dashboards, and keeping your systems healthy.

Monitoring·Tutorial·Beginner·Fresh

Distributed Tracing with Jaeger: Pinpointing Latency Bottlenecks in Microservices

Microservices give you deployment flexibility and team autonomy, but they'll absolutely destroy your ability to debug latency issues if you don't have the...

Asif Muzammil·May 1, 2026

8 min read

Monitoring·Tutorial·Beginner·Current

Prometheus Scrape Target Down: Diagnosing and Fixing "connection refused" Errors Step by Step

If you've spent any time with Prometheus, you've seen it: that red `DOWN` label on the Targets page, accompanied by the dreaded `connection refused` error....

Muhammad Hassan·Apr 28, 2026

8 min read

Monitoring·Tutorial·Beginner

DNS Troubleshooting for DevOps: dig, nslookup, and Common Failures

A practical DNS troubleshooting guide for DevOps engineers — dig commands, nslookup patterns, common production failure modes, and how to diagnose each one.

Muhammad Hassan·Mar 29, 2026

7 min read

Monitoring·Tutorial·Beginner

Elasticsearch Cluster Sizing for Production: Nodes, Shards, and Memory

A practical guide to sizing Elasticsearch clusters for production — covering node roles, shard counts, heap tuning, and capacity planning formulas.

Majid Iqbal Nayyar·Mar 29, 2026

7 min read

Monitoring·Deep Dive·Intermediate

Building a Complete Prometheus + Grafana Monitoring Stack from Scratch

Build a production Prometheus and Grafana monitoring stack from scratch — service discovery, recording rules, alerting, and dashboards.

Riku Tanaka·Mar 23, 2026

15 min read

Monitoring·Quick Ref·Beginner

PromQL: Cheat Sheet

PromQL cheat sheet with copy-paste query examples for rates, aggregations, histograms, label matching, recording rules, and alerting expressions.

Riku Tanaka·Mar 23, 2026

2 min read

Monitoring·Tutorial·Intermediate

Scalable Log Aggregation with Grafana Loki and Promtail

Deploy Grafana Loki and Promtail for cost-effective, scalable log aggregation — without indexing yourself into bankruptcy.

Riku Tanaka·Mar 22, 2026

9 min read

Monitoring·Tutorial·Beginner

OpenTelemetry Collector: Deploying Your Observability Pipeline the Right Way

Deploy and configure the OpenTelemetry Collector to unify traces, metrics, and logs into a single pipeline — with production-tested patterns.

Riku Tanaka·Mar 22, 2026

8 min read

Monitoring·Tutorial·Intermediate

Prometheus Recording Rules: Fix Your Query Performance Before It Breaks Grafana

Use Prometheus recording rules to pre-compute expensive queries, speed up dashboards, and make SLO calculations reliable at scale.

Riku Tanaka·Mar 22, 2026

10 min read

Monitoring·Tutorial·Intermediate

Prometheus Alerting Rules That Don't Wake You Up for Nothing

Design Prometheus alerting rules that catch real incidents and ignore noise — practical patterns from years of on-call experience.

Riku Tanaka·Mar 20, 2026

9 min read

Monitoring·Tutorial·Intermediate

Designing Grafana Dashboards That SREs Actually Use

Build Grafana dashboards that surface real signals instead of decorating walls — a structured approach rooted in SRE principles.

Riku Tanaka·Mar 20, 2026

9 min read

Monitoring·Tutorial·Intermediate

Implementing SLOs and Error Budgets From Scratch

A step-by-step guide to implementing SLOs and error budgets using Prometheus — from defining SLIs to building burn-rate alerts.

Riku Tanaka·Mar 20, 2026

9 min read