Adoption Path¶
NthLayer can be adopted incrementally. You don't need to enable all features on day one. This guide walks through a proven three-phase approach that lets teams build confidence before enabling enforcement.
Overview¶
| Phase | What You Do | Risk Level | Time to Value |
|---|---|---|---|
| 1. Generate | Run locally, review output | None | 1 day |
| 2. Validate | Add to CI, warnings only | Low | 1 week |
| 3. Protect | Enable gates, block deploys | Medium | 2-4 weeks |
Phase 1: Generate Only¶
Goal: See what NthLayer produces without any CI/CD integration.
Duration: 1-3 days
Steps¶
- Install NthLayer
- Create a service spec
- Generate artifacts locally
- Review the output
- Compare to your existing setup
  - Are the generated alerts better than what you have?
  - Does the dashboard cover what you need?
  - Are SLO targets reasonable for this service?
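For the comparison step, a quick diff of alert names can show where the generated alerts and your current ones overlap. The following is a minimal sketch, assuming both sets of alerts are Prometheus rule files (where each rule has an `alert: <Name>` line); `alert_names` and `compare_alerts` are hypothetical helper names, not part of NthLayer. It uses a regular expression rather than a YAML parser, so quoted or unusually formatted names may need adjusting.

```python
import re

# Matches lines such as "  - alert: HighErrorRate" in a Prometheus rule file.
ALERT_LINE = re.compile(r"^[ \t]*-?[ \t]*alert:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def alert_names(rules_text: str) -> set[str]:
    """Collect the alert names declared in the text of a rule file."""
    return set(ALERT_LINE.findall(rules_text))


def compare_alerts(existing: str, generated: str) -> dict[str, list[str]]:
    """Group alert names by whether they appear in one file or both."""
    old, new = alert_names(existing), alert_names(generated)
    return {
        "only_existing": sorted(old - new),
        "only_generated": sorted(new - old),
        "in_both": sorted(old & new),
    }
```

Names that appear only in your existing rules are candidates for the customization list; names that appear only in the generated rules are coverage you did not have before.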
Success Criteria¶
- [ ] Generated artifacts look correct
- [ ] You understand what each file does
- [ ] You've identified any customizations needed
What You Learn¶
- How tier affects defaults
- What NthLayer generates vs what you need to customize
- Whether your service.yaml needs adjustments
Phase 2: Validate in CI¶
Goal: Run NthLayer in CI to catch issues early, but don't block deploys yet.
Duration: 1-2 weeks
Steps¶
- Add NthLayer to your CI pipeline
  ```yaml
  # .github/workflows/ci.yml
  - name: Generate and validate reliability config
    run: |
      pip install nthlayer
      nthlayer apply services/${{ matrix.service }}.yaml --lint
  ```
- Enable verification in warning mode
  ```yaml
  - name: Verify metrics exist (warnings only)
    run: |
      nthlayer verify services/${{ matrix.service }}.yaml --no-fail
    env:
      PROMETHEUS_URL: ${{ secrets.PROMETHEUS_URL }}
  ```
- Commit generated artifacts
What to Watch For¶
- Lint failures: Invalid PromQL in generated alerts
- Verification warnings: Missing metrics in Prometheus
- Drift: Generated files that weren't committed
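The drift item above can be checked mechanically in CI. The following is a minimal sketch, assuming generated artifacts are written under a `generated/` directory (a hypothetical path; substitute wherever your artifacts land) and that the job runs inside a git checkout. It parses the output of `git status --porcelain`, where each line is a two-character status code, a space, and a path. Paths containing special characters are quoted by git and are not handled here.

```python
import subprocess


def drifted_paths(porcelain: str, prefix: str = "generated/") -> list[str]:
    """Return paths under `prefix` that git reports as changed or untracked."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        path = line[3:]  # drop the two status characters and the space
        if " -> " in path:
            # Renames are printed as "old -> new"; keep the new path.
            path = path.split(" -> ", 1)[1]
        if path.startswith(prefix):
            paths.append(path)
    return paths


def check_drift(prefix: str = "generated/") -> int:
    """Run git and return 1 if regenerated files differ from what is committed."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    drift = drifted_paths(result.stdout, prefix)
    for path in drift:
        print(f"drift: {path} was regenerated but not committed")
    return 1 if drift else 0
```

Run `check_drift()` after the `nthlayer apply` step. While you are still in warning mode, print the result instead of using it as the step's exit status.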
Success Criteria¶
- [ ] CI runs NthLayer on every PR
- [ ] Team reviews NthLayer output in PRs
- [ ] No unexpected lint failures
- [ ] Verification warnings are understood (not necessarily fixed)
What You Learn¶
- Which services are missing required metrics
- Whether your Prometheus setup works with NthLayer
- Team comfort level with the generated artifacts
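When verification warns about a missing metric, it can help to confirm the gap directly against Prometheus. The sketch below uses the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and the same `PROMETHEUS_URL` variable as the CI step. It assumes the server needs no authentication, and the helper names are hypothetical. It is not how `nthlayer verify` works internally, only a way to double-check its warnings.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url: str, metric: str) -> str:
    """Build an instant-query URL that counts the series for `metric`."""
    params = urlencode({"query": f"count({metric})"})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def has_series(response: dict) -> bool:
    """True if an instant-query response contains at least one sample."""
    return response.get("status") == "success" and bool(response["data"]["result"])


def metric_exists(metric: str) -> bool:
    """Ask the Prometheus at PROMETHEUS_URL whether `metric` has any series now."""
    url = query_url(os.environ["PROMETHEUS_URL"], metric)
    with urlopen(url, timeout=10) as resp:
        return has_series(json.load(resp))
```

An empty result means only that no series exist at query time. A metric emitted solely on rare events may still be instrumented correctly.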
Phase 3: Protect in CD¶
Goal: Enable deployment gates that block risky deploys.
Duration: 2-4 weeks (gradual rollout)
Steps¶
- Start with non-critical services
  Pick 2-3 standard- or low-tier services first.
- Enable check-deploy in warning mode
  ```yaml
  # CD pipeline
  - name: Check deployment gate
    run: |
      nthlayer check-deploy services/${{ matrix.service }}.yaml || echo "Gate warning (not blocking)"
    env:
      PROMETHEUS_URL: ${{ secrets.PROMETHEUS_URL }}
  ```
- Monitor for false positives
  Track:

  - How often would deploys have been blocked?
  - Were those blocks justified?
  - Were there any false positives?
- Graduate to blocking mode
  ```yaml
  - name: Check deployment gate
    run: |
      nthlayer check-deploy services/${{ matrix.service }}.yaml
      # Now exit code 2 will fail the pipeline
  ```
- Expand to critical services
  Expand only after confidence is built on the lower tiers.
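The only difference between warning and blocking mode in the steps above is whether the gate's exit status reaches the CI runner. Where the `|| echo` idiom is awkward (for example, when one script serves both modes), the same logic can be sketched as a small Python wrapper. `pipeline_status` and `run_gate` are hypothetical helper names. The only exit-code meaning assumed is the one stated above: a blocked deploy exits with 2.

```python
import subprocess


def pipeline_status(gate_exit_code: int, blocking: bool) -> int:
    """Translate the gate's exit code into the status the CI step should return."""
    if gate_exit_code == 0:
        return 0
    if blocking:
        # Pass the code through so a non-zero status fails the step.
        return gate_exit_code
    # Warning mode: report the result but let the deploy continue.
    print(f"Gate warning (not blocking): exit code {gate_exit_code}")
    return 0


def run_gate(spec_path: str, blocking: bool) -> int:
    """Run `nthlayer check-deploy` on a service spec and apply the mode."""
    result = subprocess.run(["nthlayer", "check-deploy", spec_path])
    return pipeline_status(result.returncode, blocking)
```

Graduating a service then means changing one boolean per service rather than editing the pipeline step.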
Rollout Schedule¶
| Week | Services | Mode |
|---|---|---|
| 1 | 2-3 low tier | Warning only |
| 2 | All low tier | Blocking |
| 3 | Standard tier | Warning only |
| 4 | Standard tier | Blocking |
| 5+ | Critical tier | Warning, then blocking |
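If the gate mode is driven from configuration, the schedule above can be encoded as data. This is a sketch under two stated assumptions. Week 1 is treated as applying to the whole low tier, although the table limits it to 2-3 services. Critical services stay in warning mode, because the table does not say when they switch to blocking; add a row once you decide. The function name `gate_mode` is hypothetical.

```python
# (first_week, tier, mode) rows transcribed from the rollout table.
SCHEDULE = [
    (1, "low", "warning"),
    (2, "low", "blocking"),
    (3, "standard", "warning"),
    (4, "standard", "blocking"),
    (5, "critical", "warning"),  # assumption: add a "blocking" row when ready
]


def gate_mode(tier: str, week: int) -> str:
    """Return "off", "warning", or "blocking" for a tier in a given rollout week."""
    mode = "off"
    for first_week, row_tier, row_mode in SCHEDULE:
        # Rows are in week order, so the latest matching row wins.
        if row_tier == tier and week >= first_week:
            mode = row_mode
    return mode
```

For example, `gate_mode("standard", 2)` returns `"off"`, because standard-tier services do not enter warning mode until week 3.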
Success Criteria¶
- [ ] Gates correctly block deploys with exhausted error budgets
- [ ] No false positives blocking valid deploys
- [ ] Team trusts the gate decisions
- [ ] Escalation path exists for gate overrides
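To judge the false-positive criterion with numbers rather than impressions, one option is to log each gate result during the warning period and have a reviewer mark whether each would-be block was justified. The following is a minimal sketch. `GateRecord` and its fields are hypothetical bookkeeping, not something NthLayer emits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateRecord:
    service: str
    would_block: bool          # the gate reported a block while in warning mode
    justified: Optional[bool]  # set by a human reviewer; None means not reviewed


def summarize(records: list[GateRecord]) -> dict[str, float]:
    """Summarize how often the gate would have blocked and how often it was wrong."""
    total = len(records)
    blocks = [r for r in records if r.would_block]
    reviewed = [r for r in blocks if r.justified is not None]
    false_pos = [r for r in reviewed if not r.justified]
    return {
        "deploys": total,
        "would_block": len(blocks),
        "block_rate": len(blocks) / total if total else 0.0,
        "reviewed_blocks": len(reviewed),
        "false_positives": len(false_pos),
        "false_positive_rate": len(false_pos) / len(reviewed) if reviewed else 0.0,
    }
```

Blocks that have not been reviewed are excluded from the false-positive rate, so a low rate means little until most blocks have been reviewed.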
What You Learn¶
- Whether your SLO targets are realistic
- How often services are actually at risk
- Team response to automated enforcement
Common Adoption Patterns¶
Pattern A: Platform Team Drives¶
- Platform team adopts NthLayer
- Creates org-wide service templates
- Onboards service teams one by one
- Mandates adoption for new services
Best for: Organizations with strong platform teams
Pattern B: Service Team Experiments¶
- One service team tries NthLayer
- Shares results with other teams
- Organic adoption spreads
- Platform team eventually standardizes
Best for: Bottom-up engineering cultures
Pattern C: Incident-Driven¶
- Major incident reveals monitoring gaps
- NthLayer adopted for affected services
- Expanded based on incident learnings
- Eventually becomes standard
Best for: Organizations learning from failures
Rollback Plan¶
If adoption isn't working:
Phase 3 → Phase 2¶
- Remove `check-deploy` from CD
- Keep `verify --no-fail` in CI
- Investigate why gates were problematic
Phase 2 → Phase 1¶
- Remove NthLayer from CI
- Continue using generated artifacts manually
- Investigate lint/verify issues
Phase 1 → Nothing¶
- Stop using NthLayer
- Keep existing monitoring setup
- Document what didn't work for future reference
Timeline Summary¶
| Milestone | Typical Duration |
|---|---|
| First service.yaml created | Day 1 |
| First generated artifacts reviewed | Day 1-3 |
| NthLayer running in CI | Week 1 |
| First service with blocking gate | Week 3-4 |
| All services with blocking gates | Month 2-3 |
| Full org standardization | Month 3-6 |
The key is incremental confidence: each phase proves value before the next adds risk.