Reliability Shift Left

What is Shift Left?

"Shift Left" means moving validation and testing earlier in the software development lifecycle. Instead of discovering issues in production, you catch them during development or CI/CD.

Traditional approach:

Code → Deploy → Monitor → Incident → Fix → Repeat
                           ↑
                   "We found out the hard way"

Shift Left approach:

Code → Validate → Verify → Gate → Deploy → Monitor
         ↑          ↑        ↑
   "Is it valid?" "Does it exist?" "Is it safe?"

How NthLayer Shifts Reliability Left

1. Contract Verification (`nthlayer verify`)

Before deploying, verify that the metrics your SLOs depend on actually exist in Prometheus:

$ nthlayer verify service.yaml --prometheus-url $PROM_URL

Verifying metrics for payment-api...

  ✓ http_requests_total{service="payment-api"}
  ✓ http_request_duration_seconds_bucket{service="payment-api"}
  ✗ http_requests_total{service="payment-api",status=~"5.."}  NOT FOUND

Contract verification failed: 1 metric(s) not found

Pipeline integration:

# GitHub Actions
- name: Verify SLO Metrics
  run: nthlayer verify service.yaml --prometheus-url $PROM_URL
  # Fails the pipeline if any required metric is missing
  env:
    PROM_URL: ${{ secrets.PROMETHEUS_URL }}
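
Conceptually, a check like this comes down to asking Prometheus whether any series match each selector. The sketch below is not NthLayer's implementation. It is a minimal illustration that assumes the standard Prometheus HTTP API endpoint `/api/v1/series`, and the helper names (`build_series_url`, `series_exists`, `find_missing`) are hypothetical. The network call is passed in as a function so the logic can be exercised without a live server.

```python
import json
from typing import Callable, Iterable
from urllib.parse import urlencode
from urllib.request import urlopen


def build_series_url(prom_url: str, selector: str) -> str:
    """Build a /api/v1/series query URL for a single series selector."""
    query = urlencode({"match[]": selector})
    return prom_url.rstrip("/") + "/api/v1/series?" + query


def series_exists(payload: dict) -> bool:
    """True when Prometheus reports success and at least one matching series."""
    return payload.get("status") == "success" and len(payload.get("data", [])) > 0


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (performs a real network call)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def find_missing(
    prom_url: str,
    selectors: Iterable[str],
    fetch: Callable[[str], dict] = fetch_json,
) -> list[str]:
    """Return the selectors for which Prometheus has no matching series."""
    return [
        selector
        for selector in selectors
        if not series_exists(fetch(build_series_url(prom_url, selector)))
    ]
```

A CI script built on this could exit non-zero whenever `find_missing(...)` returns a non-empty list, which mirrors the pass/fail behaviour shown in the output above.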

2. Deployment Gates (`nthlayer check-deploy`)

Block deployments when the error budget is exhausted:

$ nthlayer check-deploy service.yaml --prometheus-url $PROM_URL

╭──────────────────────────────────────────────────────────────╮
│  Deployment Gate Check                                       │
╰──────────────────────────────────────────────────────────────╯

  Service:       payment-api
  Tier:          critical
  Window:        30d

  SLO Results:
    availability   99.97%  (target: 99.95%)   budget: 42% remaining   ⚠ WARNING
    latency_p99    187ms   (target: 200ms)    budget: 78% remaining   ✓ OK

  Decision:  ⚠ PROCEED WITH CAUTION
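
The "budget remaining" figure follows from simple arithmetic on the SLO target and the observed success ratio. The function below is a sketch of that standard error-budget calculation, not NthLayer's code, and the name `error_budget_remaining` is hypothetical.

```python
def error_budget_remaining(target: float, observed: float) -> float:
    """Fraction of the error budget still unspent.

    Both arguments are success ratios (for example 0.9995 for 99.95%).
    The result is 1.0 when nothing has been spent, 0.0 when the budget is
    exactly used up, and negative when the SLO target has been missed.
    """
    allowed_failure = 1.0 - target
    if allowed_failure <= 0:
        raise ValueError("target must be below 1.0 to leave any error budget")
    observed_failure = 1.0 - observed
    return 1.0 - observed_failure / allowed_failure
```

For a 99.95% target, an observed availability of 99.971% gives `error_budget_remaining(0.9995, 0.99971)` of roughly 0.42, meaning about 42% of the budget is left.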

Exit codes:

- `0`: Deploy approved
- `1`: Warning (budget low, but allowed)
- `2`: Blocked (budget exhausted)
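
Most CI systems treat any non-zero exit code as a failure, so a pipeline that should continue on a warning has to interpret the code explicitly. The wrapper below is a sketch of one way to do that. The names `interpret_exit_code`, `deploy_allowed`, and `run_gate` are hypothetical, and treating any other code as an error is an assumption.

```python
import subprocess

# Exit codes as documented above; anything else is treated as an error (assumption).
DECISIONS = {0: "approved", 1: "warning", 2: "blocked"}


def interpret_exit_code(code: int) -> str:
    """Translate a check-deploy exit code into a decision label."""
    return DECISIONS.get(code, "error")


def deploy_allowed(code: int) -> bool:
    """Warnings are allowed through; blocked and unknown codes are not."""
    return interpret_exit_code(code) in ("approved", "warning")


def run_gate(prom_url: str, service_file: str = "service.yaml") -> int:
    """Run the gate command from this page and return its raw exit code."""
    argv = ["nthlayer", "check-deploy", service_file, "--prometheus-url", prom_url]
    return subprocess.run(argv, check=False).returncode
```

A calling script would pass the result of `run_gate(...)` to `deploy_allowed(...)` and exit with status 0 only when it returns `True`.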

3. PromQL Validation (`nthlayer apply --lint`)

Catch PromQL syntax errors and references to missing metrics before the rules reach Prometheus:

$ nthlayer apply service.yaml --lint

Applied 4 resources in 0.3s → generated/payment-api/

Validating alerts with pint...
  ✓ 12 rules validated

  ⚠ [promql/series] Line 45: metric "http_errors_total" not found
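
This page does not say whether lint warnings change the command's exit code. If a pipeline needs to act on individual findings, one option is to parse them from the output. The sketch below assumes finding lines look exactly like the sample above (`[check] Line N: message`). The `Finding` type and both function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    check: str    # e.g. "promql/series"
    line: int     # line number reported by the linter
    message: str  # human-readable description


# Assumed format, taken from the sample output: "[check] Line N: message"
_FINDING = re.compile(r"\[(?P<check>[^\]]+)\]\s+Line\s+(?P<line>\d+):\s+(?P<message>.+)$")


def parse_finding(text: str) -> Optional[Finding]:
    """Parse one output line; return None if it is not a finding."""
    match = _FINDING.search(text)
    if match is None:
        return None
    return Finding(match["check"], int(match["line"]), match["message"].strip())


def collect_findings(output: str) -> list[Finding]:
    """Collect every finding from the full lint output."""
    parsed = (parse_finding(line) for line in output.splitlines())
    return [finding for finding in parsed if finding is not None]
```

A strict pipeline could then fail the build whenever `collect_findings(...)` returns a non-empty list.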

The Google SRE Connection

NthLayer automates concepts from the Google SRE Book:

| SRE Concept | Manual Process | NthLayer Automation |
|---|---|---|
| Production Readiness Review | Multi-week checklist | `nthlayer verify` in CI |
| Error Budget Policy | Spreadsheet tracking | `nthlayer check-deploy` gates |
| Release Engineering | Manual runbooks | Generated artifacts + GitOps |
| Monitoring Standards | Wiki pages | `service.yaml` spec |

CI/CD Pipeline Example

Tekton Pipeline

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: deploy-with-reliability-gates
spec:
  tasks:
    - name: generate
      taskRef:
        name: nthlayer-apply
      params:
        - name: service-file
          value: service.yaml

    - name: verify-metrics
      taskRef:
        name: nthlayer-verify
      runAfter: [generate]

    - name: check-budget
      taskRef:
        name: nthlayer-check-deploy
      runAfter: [verify-metrics]

    - name: deploy
      taskRef:
        name: kubectl-apply
      runAfter: [check-budget]
      # Only runs if all gates pass

GitHub Actions

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install NthLayer
        run: pip install nthlayer

      - name: Generate & Lint
        run: nthlayer apply service.yaml --lint

      - name: Verify Metrics Exist
        run: nthlayer verify service.yaml --prometheus-url $PROM_URL
        env:
          PROM_URL: ${{ secrets.PROMETHEUS_URL }}

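      - name: Check Deployment Gate
        # Exit code 1 is a warning (deploy allowed); any other non-zero code fails the step
        run: |
          nthlayer check-deploy service.yaml --prometheus-url $PROM_URL || [ $? -eq 1 ]
        env:
          PROM_URL: ${{ secrets.PROMETHEUS_URL }}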

      - name: Deploy
        if: success()
        run: kubectl apply -f generated/
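
Both pipelines above run the same ordered gates and stop at the first one that fails. For local runs or CI systems without native step ordering, the same sequence can be sketched in a few lines. This is an illustration only: `default_gates` and `first_failed_gate` are hypothetical names, and only the commands themselves come from this page.

```python
import subprocess
from typing import Callable, Optional

# Each gate is (command, exit codes that count as passing).
Gate = tuple[list[str], frozenset[int]]


def default_gates(prom_url: str, service_file: str = "service.yaml") -> list[Gate]:
    """The three gates from this page, in the order the pipelines run them."""
    return [
        (["nthlayer", "apply", service_file, "--lint"], frozenset({0})),
        (["nthlayer", "verify", service_file, "--prometheus-url", prom_url], frozenset({0})),
        # check-deploy: exit code 1 is a warning that still allows the deploy.
        (["nthlayer", "check-deploy", service_file, "--prometheus-url", prom_url], frozenset({0, 1})),
    ]


def run_command(argv: list[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(argv, check=False).returncode


def first_failed_gate(
    gates: list[Gate],
    run: Callable[[list[str]], int] = run_command,
) -> Optional[list[str]]:
    """Run gates in order; return the first failing command, or None if all pass."""
    for argv, passing_codes in gates:
        if run(argv) not in passing_codes:
            return argv  # later gates are intentionally not run
    return None
```

A deploy step would run only when `first_failed_gate(default_gates(prom_url))` returns `None`.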

Benefits

Prevent, Don't React

| Traditional | Shift Left |
|---|---|
| Deploy first, monitor later | Validate before deploy |
| Alert after incident | Block risky deploys |
| SLOs as documentation | SLOs as enforcement |
| Manual PRR checklist | Automated verification |

Measurable Outcomes

Teams that shift reliability left typically see:

- 60% reduction in incidents caused by missing monitoring
- Over 80% faster SLO setup (about 5 minutes instead of 20 hours)
- Zero deploys with missing metrics
- Reduced MTTR, because dashboards exist from day 1