PromQL: Cheat Sheet
Selectors & Matchers
http_requests_total{method="GET"} # Exact match
http_requests_total{handler=~"/api/.*"} # Regex match
http_requests_total{status!="200"} # Negative match
http_requests_total{method!~"OPTIONS|HEAD"} # Negative regex
Rates & Counters
rate(http_requests_total[5m]) # Per-second rate over 5m (use for counters)
irate(http_requests_total[5m]) # Instant per-second rate (last two samples in the window)
increase(http_requests_total[1h]) # Total increase over 1h (extrapolated, may be non-integer)
delta(temperature_celsius[5m]) # Change in gauge over 5m
Aggregations
sum(rate(http_requests_total[5m])) # Total rate
sum by (status) (rate(http_requests_total[5m])) # Group by status
avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) # Mean rate across non-idle CPU series, per instance
topk(5, container_memory_usage_bytes) # Top 5 by memory
count(up == 1) # Count targets up
| Operator | Description |
|---|---|
| sum | Total across series |
| avg | Arithmetic mean |
| min / max | Minimum / maximum value |
| count | Number of series |
| topk / bottomk | Top/bottom K series |
| quantile | Quantile across series |
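The table lists quantile and bottomk, which the examples above do not show. A brief sketch reusing metric names from this sheet; the 0.9 and 3 values are arbitrary choices for illustration:

```promql
# 90th percentile of the per-series request rates (across series, not over time)
quantile(0.9, rate(http_requests_total[5m]))
# The 3 filesystems with the least available space
bottomk(3, node_filesystem_avail_bytes)
# Drop one label instead of listing the labels to keep
sum without (instance) (rate(http_requests_total[5m]))
```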
Histograms
# 99th percentile request duration
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# 95th percentile grouped by handler
histogram_quantile(0.95,
sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# Average duration from histogram
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])
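The first query in this section passes every bucket series to histogram_quantile unaggregated, so it returns one quantile per label combination. A sketch of the service-wide variants, assuming a single overall value is what you want; the le label must survive the aggregation because histogram_quantile needs the bucket boundaries:

```promql
# 99th percentile across all instances and handlers
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Service-wide average duration (aggregate before dividing)
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
```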
Binary Operations
# Error ratio (5xx / total)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Disk usage as a fraction (0-1); multiply by 100 for a percentage
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)
# Filter series above threshold
rate(http_requests_total[5m]) > 100
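Arithmetic between two vectors matches series by their labels, so grouping both sides by the same label yields one ratio per group. A sketch, assuming http_requests_total carries the handler label used in the selector examples:

```promql
# Error ratio per handler (both sides keep only the handler label)
sum by (handler) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (handler) (rate(http_requests_total[5m]))
# Return 1 or 0 for every series instead of filtering
rate(http_requests_total[5m]) > bool 100
```

Handlers with no 5xx series have nothing on the left-hand side to match, so they are absent from the result rather than reported as 0.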
Common Alert Expressions
# Error rate above 5%
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
# Instance down
up == 0
# Disk predicted to fill within 4 hours (linear fit over the last 1h)
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
# Memory above 90%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
# Pod restarting frequently
increase(kube_pod_container_status_restarts_total[1h]) > 5
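up == 0 only matches targets that are still being scraped; if a target vanishes from service discovery, its up series disappears and the expression returns nothing. Two complementary sketches, borrowing the job="api" selector from the functions table; the 0.5 threshold is an arbitrary illustration:

```promql
# No up series exists for the job at all
absent(up{job="api"})
# Fewer than half of the job's targets are up
avg(up{job="api"}) < 0.5
```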
Useful Functions
| Function | Purpose |
|---|---|
| rate(m[5m]) | Per-second rate of counter |
| increase(m[1h]) | Total increase of counter |
| histogram_quantile(0.99, ...) | Percentile from histogram |
| predict_linear(m[1h], 3600) | Linear extrapolation |
| absent(up{job="api"}) | Returns 1 if no series exist |
| label_replace(...) | Rewrite labels |
| avg_over_time(m[1h]) | Average of gauge over time |
| max_over_time(m[6h]) | Max of gauge over time |
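label_replace takes, in order: the input vector, the destination label, the replacement string, the source label, and a regex. A minimal sketch, assuming instance values look like host:port; the host label name is a hypothetical choice:

```promql
# Copy the host part of instance ("host:port") into a new host label
label_replace(up, "host", "$1", "instance", "([^:]+):.*")
```

The regex must match the whole source value; series that do not match are returned unchanged.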
Subqueries
# Max of 5m rate, over 1h at 1m resolution
max_over_time(rate(http_requests_total[5m])[1h:1m])
# Average of up over 24 hours (fraction of successful scrapes); a plain range selector, no subquery needed
avg_over_time(up[24h])
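The [range:resolution] suffix works with any *_over_time function, and the resolution may be left out. Two further sketches on the same metric; the windows and the 0.95 quantile are arbitrary:

```promql
# 95th percentile of the 5m request rate over the last hour, sampled every minute
quantile_over_time(0.95, rate(http_requests_total[5m])[1h:1m])
# Omitting the resolution falls back to the global evaluation interval
min_over_time(rate(http_requests_total[5m])[30m:])
```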
Riku Tanaka
SRE & Observability Engineer