drift¶

Analyze reliability drift for a service by detecting degradation trends over time.

Synopsis¶

nthlayer drift <service-file> [options]

Description¶

The drift command queries Prometheus for historical SLO metrics, analyzes trends using linear regression, and detects patterns that indicate reliability degradation.

This enables proactive reliability management - identifying issues before they become incidents by detecting gradual budget erosion, sudden drops, or volatile patterns.

Exit Codes¶

Code	Severity	Meaning
0	None/Info	No significant drift or improving trend
1	Warning	Drift detected, investigate soon
2	Critical	Severe drift, immediate action needed

Options¶

Option	Description
`--prometheus-url URL`	Prometheus server URL (or use `NTHLAYER_PROMETHEUS_URL` env var)
`--environment ENV`	Environment name (dev, staging, prod)
`--window WINDOW`	Analysis window (e.g., `30d`, `14d`). Uses tier default if not specified
`--slo SLO`	SLO to analyze (default: `availability`)
`--format FORMAT`	Output format: `table` or `json`
`--demo`	Show demo output with sample data

Examples¶

Basic Analysis¶

nthlayer drift services/payment-api.yaml \
  --prometheus-url http://prometheus:9090

Output:

╭──────────────────────────────────────────────────────────────╮
│  Drift Analysis: payment-api                                 │
╰──────────────────────────────────────────────────────────────╯

SLO: availability
Window: 30d
Data Range: 2024-12-09 → 2025-01-08

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric               ┃         Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Current Budget       │        72.34% │
│ Trend                │   -0.523%/week│
│ Pattern              │ Gradual Decline│
│ Fit Quality (R²)     │         0.847 │
│ Data Points          │           720 │
└──────────────────────┴───────────────┘

Projection:
  Days to Exhaustion   138 days
  Budget in 30d        70.25%
  Budget in 60d        68.17%
  Confidence           85%

⚠ Severity: WARN
Error budget declining at 0.52% per week with high confidence (R²=0.85).

Recommendation: Investigate recent changes. Common causes: increased traffic,
dependency degradation, or configuration drift. Run `nthlayer verify` to check
metric coverage.

JSON Output for CI/CD¶

nthlayer drift services/api.yaml \
  --prometheus-url http://prometheus:9090 \
  --format json

Custom Analysis Window¶

# Analyze last 14 days instead of tier default
nthlayer drift services/api.yaml \
  --prometheus-url http://prometheus:9090 \
  --window 14d

Analyze Specific SLO¶

nthlayer drift services/api.yaml \
  --prometheus-url http://prometheus:9090 \
  --slo latency_p99

Drift Patterns¶

The analyzer classifies drift into six patterns:

Pattern	Description	Typical Cause
Stable	No significant trend	Healthy service
Gradual Decline	Slow, steady budget erosion	Creeping technical debt
Gradual Improvement	Slow, steady recovery	Recent fixes taking effect
Step Change Down	Sudden budget drop	Incident, bad deploy
Step Change Up	Sudden improvement	Fix deployed, incident resolved
Volatile	High variance, no clear trend	Intermittent issues

Tier-Based Defaults¶

Default analysis windows and thresholds vary by service tier:

Tier	Default Window	Warn Threshold	Critical Threshold
Critical	30d	-0.2%/week	-0.5%/week
Standard	30d	-0.5%/week	-1.0%/week
Low	14d	Disabled by default	Disabled by default

Integration with check-deploy¶

Combine drift analysis with deployment gates:

nthlayer check-deploy services/api.yaml \
  --prometheus-url http://prometheus:9090 \
  --include-drift

This adds drift analysis to the deployment gate check, blocking deploys if critical drift is detected.

Integration with portfolio¶

View drift across all services:

nthlayer portfolio --drift --prometheus-url http://prometheus:9090

Output includes a drift analysis table showing trends, patterns, and exhaustion projections for each service.

CI/CD Integration¶

GitHub Actions¶

jobs:
  check-reliability:
    steps:
      - name: Check for Drift
        run: |
          nthlayer drift services/api.yaml \
            --prometheus-url ${{ secrets.PROMETHEUS_URL }} \
            --format json > drift-report.json

          # Fail on critical drift
          if [ $? -eq 2 ]; then
            echo "::error::Critical drift detected - investigate before deploying"
            exit 1
          fi

Weekly Drift Reports¶

# .github/workflows/drift-report.yml
name: Weekly Drift Report

on:
  schedule:
    - cron: '0 9 * * 1'  # Monday 9am

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Generate Drift Report
        run: |
          nthlayer portfolio --drift \
            --prometheus-url ${{ secrets.PROMETHEUS_URL }} \
            --format markdown > drift-report.md

      - name: Post to Slack
        uses: slackapi/slack-github-action@v1
        with:
          payload-file-path: drift-report.md

Environment Variables¶

Variable	Description
`NTHLAYER_PROMETHEUS_URL`	Prometheus server URL
`NTHLAYER_METRICS_USER`	Basic auth username
`NTHLAYER_METRICS_PASSWORD`	Basic auth password

Algorithm Details¶

Trend Calculation¶

Drift uses linear regression (least squares) on time-series data:

Query Prometheus for slo:error_budget_remaining:ratio over the analysis window
Calculate slope (budget change per week) and R² (fit quality)
Project days until budget exhaustion based on current trend

Severity Classification¶

Priority order for severity assignment:

Projected exhaustion within critical window → CRITICAL
Step change down detected → CRITICAL
Slope exceeds critical threshold → CRITICAL
Projected exhaustion within warn window → WARN
Slope exceeds warn threshold → WARN
Any negative slope → INFO
Otherwise → NONE