Skip to content

SLOs & Error Budgets

Service Level Objectives (SLOs) are the foundation of reliability engineering.

What is an SLO?

An SLO is a target for service reliability, expressed as:

"99.9% of requests should succeed over a 30-day window"

Components:

  • Objective: The target (99.9%)
  • Indicator: What we measure (successful requests)
  • Window: Time period (30 days)

Error Budgets

The error budget is the inverse of your SLO:

Error Budget = 100% - SLO Objective

For a 99.9% SLO over 30 days:

Error Budget = 0.1% × 30 days × 24 hours × 60 minutes
             = 43.2 minutes of allowed downtime

How NthLayer Uses SLOs

1. Automatic Generation

Define your tier, get sensible defaults:

name: payment-api
tier: critical  # Gets 99.95% availability target
type: api

2. Dashboard Visualization

Generated dashboards show:

  • Current SLO compliance
  • Error budget remaining
  • Burn rate trends

3. Portfolio View

See all services at once:

nthlayer portfolio

4. Live Queries

Query Prometheus for real-time status:

nthlayer slo collect payment-api.yaml

SLO Types

Availability SLO

Percentage of successful requests:

resources:
  - kind: SLO
    name: availability
    spec:
      objective: 99.95
      window: 30d
      indicator:
        type: availability
        query: |
          sum(rate(http_requests_total{service="$service",status!~"5.."}[5m])) /
          sum(rate(http_requests_total{service="$service"}[5m]))

Latency SLO

Percentage of requests under a threshold:

resources:
  - kind: SLO
    name: latency-p99
    spec:
      objective: 99.0
      window: 30d
      threshold_ms: 200
      indicator:
        type: latency
        percentile: 99
        query: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m]))
          )

Throughput SLO

Minimum requests/operations per second:

resources:
  - kind: SLO
    name: throughput
    spec:
      objective: 99.0
      window: 30d
      threshold_rps: 100
      indicator:
        type: throughput
        query: sum(rate(http_requests_total{service="$service"}[5m]))

Tier-Based Defaults

Tier Availability Latency (p99) Error Budget
Critical 99.95% 200ms 21.6 min/month
Standard 99.9% 500ms 43.2 min/month
Low 99.5% 1000ms 216 min/month

Budget Consumption

NthLayer tracks how much budget has been consumed:

Status Budget Consumed Action
Healthy < 80% Continue normal development
Warning 80-100% Slow down, focus on stability
Critical 100-150% Freeze changes, investigate
Exhausted > 150% Incident mode, all hands

Best Practices

1. Start Conservative

Begin with achievable targets and tighten over time:

# Start here
tier: standard  # 99.9%

# After proving stability
tier: critical  # 99.95%

2. Match Business Impact

Tier should reflect business criticality:

  • Critical: Payment processing, authentication
  • Standard: Main application features
  • Low: Internal tools, analytics

3. Use Multiple SLOs

Cover different failure modes:

resources:
  - kind: SLO
    name: availability
    spec:
      objective: 99.95

  - kind: SLO
    name: latency-p99
    spec:
      objective: 99.0
      threshold_ms: 200

  - kind: SLO
    name: latency-p50
    spec:
      objective: 99.9
      threshold_ms: 50

4. Review Regularly

Use the portfolio view for weekly reviews:

nthlayer portfolio --format json > slo-review-$(date +%Y%m%d).json

Further Reading