
Implementing SLOs and Error Budgets From Scratch

Riku Tanaka · 9 min read

Reliability Without a Number Is Just a Feeling

Every team says they care about reliability. But ask them how reliable their service is right now, and you get vague answers. "It's been pretty stable." "We had an issue last week but it's fine now." These are feelings, not measurements.

SLOs — Service Level Objectives — turn reliability into a number. And error budgets turn that number into a decision-making framework. The Google SRE book calls this the foundation of the SRE practice, and after implementing SLOs across dozens of services, I agree completely. Once you have SLOs, every operational decision becomes clearer.

Let me walk through the entire implementation, from choosing your SLIs to building multi-window burn-rate alerts.

Step 1: Define Your SLIs

A Service Level Indicator (SLI) is a measurement of the service behavior that users care about. Not CPU. Not memory. User-visible behavior.

For most request-serving systems, two SLIs cover the majority of the user experience:

Availability SLI

The proportion of requests that succeed.

# Availability SLI: successful requests / total requests
sum(rate(http_requests_total{service="$service", status_class!="5xx"}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))

A deliberate choice here: 4xx responses count as successful. A 404 is the system working correctly — the user asked for something that doesn't exist. A 429 (rate limit) is the system protecting itself. Only 5xx responses represent the system failing to do its job.
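The queries above assume the request counter carries a status_class label. How that label gets attached is instrumentation-specific; as a rough sketch (hypothetical helper names, not from any particular client library), the mapping your middleware applies might look like this:

```python
def status_class(code: int) -> str:
    # Collapse an HTTP status code into the coarse class label the SLI
    # queries filter on: 200 -> "2xx", 404 -> "4xx", 503 -> "5xx".
    return f"{code // 100}xx"

def counts_against_availability(code: int) -> bool:
    # Per the SLI definition above, only 5xx counts as a failure;
    # 4xx (including 404 and 429) is the system working as intended.
    return status_class(code) == "5xx"
```

Labeling at the class level rather than per-code also keeps the counter's cardinality low, which matters once you have many services.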

Latency SLI

The proportion of requests served faster than a threshold.

# Latency SLI: requests under 300ms / total requests
sum(rate(http_request_duration_seconds_bucket{service="$service", le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="$service"}[5m]))

The threshold (300ms in this example) should come from user expectations, not system capability. If your users expect sub-second responses, set the threshold there. Don't set it at the P99 of your current performance — that's circular.

Step 2: Set Your SLO Targets

An SLO is a target value for your SLI over a time window. "99.9% of requests succeed over a rolling 30-day window."

Here's how to think about target selection:

SLO Target | Monthly Error Budget | Downtime Equivalent (30 days)
99%        | 1%                   | ~7.2 hours
99.5%      | 0.5%                 | ~3.6 hours
99.9%      | 0.1%                 | ~43 minutes
99.95%     | 0.05%                | ~22 minutes
99.99%     | 0.01%                | ~4.3 minutes
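The downtime column is simple arithmetic, and it's worth being able to reproduce it for any target. A quick sketch, assuming a flat 30-day window (a calendar month's extra fraction of a day nudges the numbers slightly):

```python
def monthly_downtime(slo_target: float, window_days: int = 30) -> float:
    # Minutes of total downtime that exactly consume the error budget,
    # assuming every error comes from hard downtime at full traffic.
    return (1 - slo_target) * window_days * 24 * 60

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {monthly_downtime(target):.1f} minutes")
```

The "full traffic" assumption matters: a request-based SLI charges you for failed requests, not wall-clock minutes, so an outage during low-traffic hours burns less budget than the table implies.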

Start lower than you think you need. A 99.9% SLO is ambitious for most services. If you're not sure, measure your current performance for two weeks and set the SLO slightly above your worst day. You can always tighten it later.

The Google SRE book has a critical insight here: your SLO should be set at the point where users start to complain. Not where your system fails — where users notice. If your users are happy at 99.5%, setting a 99.99% SLO creates unnecessary pressure and constrains engineering velocity for no user benefit.

Step 3: Calculate Your Error Budget

The error budget is simply the inverse of the SLO.

Error Budget = 1 - SLO Target

For a 99.9% SLO over 30 days:

Error Budget = 1 - 0.999 = 0.001 (0.1%)

This means you can afford 0.1% of your requests to fail over the 30-day window. Here's how to track remaining budget in Prometheus:

# Error budget remaining (as a fraction of total budget)
# 1.0 = full budget remaining, 0.0 = budget exhausted
1 - (
  (
    sum(increase(http_requests_total{service="$service", status_class="5xx"}[30d]))
    /
    sum(increase(http_requests_total{service="$service"}[30d]))
  ) / 0.001
)
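The same arithmetic outside PromQL, as a minimal sketch (hypothetical helper, 99.9% SLO assumed as in the article) for checking where the dashboard number comes from:

```python
def error_budget_remaining(errors: int, total: int, slo: float = 0.999) -> float:
    # Fraction of the error budget still unspent over the window:
    # 1.0 = untouched, 0.0 = exhausted, negative = SLO already violated.
    budget = 1 - slo              # e.g. 0.001 for a 99.9% SLO
    error_rate = errors / total
    return 1 - (error_rate / budget)

# 500 failed requests out of 1,000,000 burns half the 99.9% budget.
print(round(error_budget_remaining(500, 1_000_000), 4))  # → 0.5
```

Note the value can go negative; that's useful to keep rather than clamp, because it shows how far past the SLO you are.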

Display this as a Grafana single-stat panel with thresholds:

thresholds:
  mode: absolute      # panel value is already a percentage (0-100)
  steps:
    - color: red
      value: null     # Below a quarter remaining — slow down
    - color: yellow
      value: 25       # Half or more consumed — worth watching
    - color: green
      value: 50       # Comfortable margin remains

Step 4: Recording Rules for Performance

Calculating SLIs from raw metrics on every dashboard load is expensive. Use Prometheus recording rules to pre-compute them.

groups:
  - name: slo-recording-rules
    interval: 30s
    rules:
      # Availability SLI over various windows
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      - record: sli:availability:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[30m])) by (service)
          /
          sum(rate(http_requests_total[30m])) by (service)

      - record: sli:availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[1h])) by (service)
          /
          sum(rate(http_requests_total[1h])) by (service)

      - record: sli:availability:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[6h])) by (service)
          /
          sum(rate(http_requests_total[6h])) by (service)

      - record: sli:availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[30d])) by (service)
          /
          sum(rate(http_requests_total[30d])) by (service)

      # Latency SLI over various windows
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)

      - record: sli:latency:ratio_rate1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1h])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[1h])) by (service)

      # Error budget remaining
      - record: slo:error_budget:remaining
        expr: |
          1 - (
            (1 - sli:availability:ratio_rate30d)
            / 0.001
          )

Step 5: Multi-Window, Multi-Burn-Rate Alerts

This is where SLOs become operationally powerful. Instead of alerting on arbitrary thresholds, you alert based on how fast you're burning your error budget.

The burn rate is the ratio of actual error rate to the error rate that would exactly exhaust your budget over the full window.

# Burn rate = actual error rate / tolerable error rate
# For a 99.9% SLO, tolerable error rate = 0.001
(1 - sli:availability:ratio_rate1h) / 0.001
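To build intuition for what a given burn rate means, time-to-exhaustion is just the window divided by the burn rate. A quick sketch, assuming a 30-day window:

```python
def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    # At a burn rate of 1.0 the budget lasts exactly the full window;
    # higher rates shrink the runway proportionally.
    return window_days / burn_rate

print(days_to_exhaustion(1.0))             # → 30.0
print(round(days_to_exhaustion(14.4), 2))  # → 2.08
```

This is why 14.4 is the canonical fast-burn threshold: it's the rate at which a month of budget disappears in about two days.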

A burn rate of 1 means you'll exactly exhaust your budget by the end of the 30-day window. A burn rate of 14.4 means you'll exhaust it in roughly 2 days. The Google SRE Workbook recommends these alert windows:

groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn: pages the on-call
      # Burns through 2% of 30-day budget in 1 hour
      - alert: SLOFastBurn
        expr: |
          (1 - sli:availability:ratio_rate1h{service="api-gateway"}) / 0.001 > 14.4
            and
          (1 - sli:availability:ratio_rate5m{service="api-gateway"}) / 0.001 > 14.4
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Fast error budget burn on {{ $labels.service }}"
          description: >
            Burn rate is {{ $value | printf "%.1f" }}x. A sustained burn
            above 14.4x exhausts the 30-day error budget in under two days.
          runbook_url: "https://wiki.internal/runbooks/slo-fast-burn"

      # Slow burn: creates a ticket
      # Burns through 5% of 30-day budget in 6 hours
      - alert: SLOSlowBurn
        expr: |
          (1 - sli:availability:ratio_rate6h{service="api-gateway"}) / 0.001 > 6
            and
          (1 - sli:availability:ratio_rate30m{service="api-gateway"}) / 0.001 > 6
        for: 5m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Slow error budget burn on {{ $labels.service }}"
          description: >
            Burn rate is {{ $value | printf "%.1f" }}x over 6 hours.
            Investigate during business hours.
          runbook_url: "https://wiki.internal/runbooks/slo-slow-burn"

The and clause is the key to this approach. It requires both a long window (to avoid noise from transient spikes) and a short window (to confirm the problem is ongoing right now). This dramatically reduces false positives.

Step 6: Error Budget Policies

An error budget without a policy is just a dashboard. The policy defines what happens when budget is consumed. Here's a template:

# Error Budget Policy: api-gateway
# SLO: 99.9% availability over 30 days

budget_remaining_above_50_percent:
  - Feature development proceeds normally
  - Standard deployment cadence

budget_remaining_25_to_50_percent:
  - Deployments require SRE approval
  - Focus shifts to reliability-related work
  - Review recent changes for regressions

budget_remaining_below_25_percent:
  - Feature freeze until budget recovers
  - All engineering effort directed at reliability
  - Postmortem required for any further budget consumption

budget_exhausted:
  - Change freeze (emergency fixes only)
  - Escalation to engineering leadership
  - Joint SRE and development review of system architecture
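A policy like this can also gate a deploy pipeline directly. As a hedged sketch (hypothetical tier names, not part of the template above), CI might map the remaining-budget fraction to an action before allowing a release:

```python
def policy_tier(budget_remaining: float) -> str:
    # Map the remaining-budget fraction (1.0 = full, 0.0 = exhausted)
    # to the policy tiers defined above; thresholds mirror the template.
    if budget_remaining <= 0.0:
        return "change-freeze"    # emergency fixes only
    if budget_remaining < 0.25:
        return "feature-freeze"   # reliability work only
    if budget_remaining < 0.50:
        return "sre-approval"     # deployments need sign-off
    return "normal"               # standard cadence

print(policy_tier(0.8))   # → normal
print(policy_tier(0.3))   # → sre-approval
print(policy_tier(0.1))   # → feature-freeze
```

Automating the check removes the temptation to argue with the number on release day, which is the whole point of having a policy.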

This is where the error budget becomes a shared language between SRE and product teams. When the product manager asks "can we ship this risky feature?" you don't need to argue. You check the budget. If there's room, ship it. If not, fix reliability first.

Step 7: The SLO Dashboard

Bring it all together in a Grafana dashboard that the team checks daily.

Row 1 — Current State:

# Error budget remaining (single stat, percentage)
slo:error_budget:remaining{service="$service"} * 100

# Current availability (single stat)
sli:availability:ratio_rate30d{service="$service"} * 100

# Current burn rate (single stat)
(1 - sli:availability:ratio_rate1h{service="$service"}) / 0.001

Row 2 — Trends:

# Availability SLI over time (time series)
sli:availability:ratio_rate5m{service="$service"}

# Error budget consumption over time (time series)
1 - slo:error_budget:remaining{service="$service"}

Row 3 — Budget Attribution:

# Error rate by endpoint — which paths consume the most budget
sum(rate(http_requests_total{service="$service", status_class="5xx"}[1h])) by (endpoint)
/
sum(rate(http_requests_total{service="$service"}[1h])) by (endpoint)

This last panel is critical. When budget is being consumed, you need to know which endpoints are responsible. Not all errors are equal — a failing health check endpoint and a failing checkout endpoint have very different business impacts.

Common Mistakes When Starting Out

Setting SLOs too tight. A 99.99% SLO for an internal tool that has three users is over-engineering. Match the SLO to actual user expectations.

Measuring the wrong thing. If your SLI doesn't correlate with user happiness, your SLO is meaningless. Validate by comparing SLI dips with user-reported issues.

No error budget policy. Without a policy, the error budget is an interesting metric with no teeth. Get stakeholder agreement on the policy before you launch.

Ignoring dependencies. Your service can't be more reliable than its least reliable critical dependency. If your database is 99.9% available, setting a 99.99% SLO for the service that depends on it is self-defeating.

Not iterating. Your first SLOs will be wrong. That's fine. Review them quarterly. Tighten if users expect more, loosen if you're over-investing in reliability at the cost of velocity.

The Long Game

SLOs are not a project you finish. They're an ongoing practice. Start with one service and one SLI. Get the recording rules and alerts working. Build the dashboard. Establish the error budget policy. Then expand to the next service.

Within six months, you'll have a reliability framework that replaces arguments with data, turns vague concerns into measurable thresholds, and gives every team a shared language for talking about the trade-off between velocity and stability.

That's what SRE is about. Not perfection. Not zero downtime. Measured, intentional reliability that serves the users and the business equally.

Riku Tanaka

SRE & Observability Engineer

If it's not measured, it doesn't exist. SLO-driven, metrics-obsessed, and the person who gets paged at 3 AM so you don't have to. Observability isn't optional.
