SLAs, SLOs & SLIs
The three pillars of reliability engineering. Define what "reliable" means, measure it, and create agreements around it.
1. The Restaurant Analogy
2. Understanding the Hierarchy
SLI (Service Level Indicator): A quantitative metric that measures a specific aspect of service performance.
SLO (Service Level Objective): A target value or range for an SLI. The internal goal for reliability.
SLA (Service Level Agreement): A contract with customers that defines service expectations and the remedies owed if they are not met.
3. SLI: Service Level Indicators
Common SLIs
Choosing Good SLIs
Good SLIs:
- Directly reflect user experience
- Measurable and quantifiable
- Actionable (you can improve them)
- Understandable by stakeholders
Poor SLIs:
- CPU utilization (internal metric)
- Lines of code (vanity metric)
- Number of deploys (not user-facing)
- Uptime without context
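As a rough illustration of user-facing SLIs, the sketch below computes an availability ratio and a nearest-rank latency percentile from a batch of request records. The `Request` record, its fields, and both function names are hypothetical, and treating any 2xx status as "good" is an assumption that a real service may need to refine.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int         # HTTP status code returned to the user
    latency_ms: float   # time taken to serve the request


def availability_sli(requests):
    """Fraction of requests that returned a 2xx status (good / total)."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# 98 fast successes and 2 slow failures
sample = [Request(200, 100.0)] * 98 + [Request(500, 900.0)] * 2
print(availability_sli(sample))
print(latency_percentile(sample, 50), latency_percentile(sample, 99))
```

In this sample the median latency looks healthy while the 99th percentile exposes the slow requests, which is why tail percentiles tend to reflect user experience better than averages.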
4. SLO: Service Level Objectives
SLO Components
The "Nines" of Availability
| Availability | Downtime/Year | Downtime/Month | Example |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.3 hours | Internal tools |
| 99.9% (three 9s) | 8.76 hours | 43.8 minutes | Standard web apps |
| 99.95% | 4.38 hours | 21.9 minutes | Cloud providers (typical) |
| 99.99% (four 9s) | 52.6 minutes | 4.4 minutes | Financial systems |
| 99.999% (five 9s) | 5.26 minutes | 26.3 seconds | Telecom, medical |
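The downtime figures in the table follow from multiplying the period by the unavailable fraction (100% minus the target). A minimal sketch, assuming a 365-day year and an average month of one twelfth of a year; `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(availability_pct, period):
    """Downtime permitted within `period` at the given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period * (1 - availability_pct / 100)


YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = YEAR / 12            # assumption: average month, about 30.4 days

for target in (99, 99.9, 99.95, 99.99, 99.999):
    per_year = allowed_downtime(target, YEAR)
    per_month = allowed_downtime(target, MONTH)
    print(f"{target}%: {per_year} per year, {per_month} per month")
```

A different month length (for example a fixed 30-day window) shifts the monthly figures slightly, so the measurement window should be stated alongside the target.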
5. Error Budgets
Error Budget Calculation
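An error budget is the complement of the SLO: 100% minus the target. The sketch below assumes a request-based SLO and uses exact fractions to avoid floating-point rounding; the function names are hypothetical.

```python
from fractions import Fraction


def error_budget(slo_pct):
    """Error budget as an exact fraction: 100% minus the SLO target."""
    return 1 - Fraction(str(slo_pct)) / 100


def allowed_failures(slo_pct, total_requests):
    """Whole number of failed requests the budget permits."""
    return int(error_budget(slo_pct) * total_requests)


def budget_remaining(slo_pct, total_requests, failed_requests):
    """Share of the budget still unspent (negative once overspent).

    Raises ZeroDivisionError for a 100% SLO, which leaves no budget at all.
    """
    allowed = error_budget(slo_pct) * total_requests
    return float(1 - failed_requests / allowed)


# A 99.9% SLO over 1,000,000 requests
print(allowed_failures(99.9, 1_000_000))
print(budget_remaining(99.9, 1_000_000, 250))
```

The same arithmetic applies to time-based SLOs by substituting minutes in the window for the request count.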
Using Error Budgets
Budget Available → Ship Fast
- Deploy new features
- Run experiments
- Take calculated risks
- Innovate quickly
Budget Exhausted → Slow Down
- Freeze non-critical deploys
- Focus on reliability fixes
- Investigate root causes
- Improve observability
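One way to turn the two states above into a decision is to compare how much budget has been spent with how much of the window has elapsed. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the intermediate "caution" state is an addition for illustration rather than part of the two-state policy above.

```python
def burn_rate(budget_consumed, window_elapsed):
    """Budget spent relative to time elapsed; 1.0 means exactly on pace.

    Both arguments are fractions between 0 and 1.
    """
    if not 0 < window_elapsed <= 1:
        raise ValueError("window_elapsed must be in (0, 1]")
    return budget_consumed / window_elapsed


def deploy_policy(budget_consumed, window_elapsed):
    """Map budget state to a release posture."""
    if budget_consumed >= 1.0:
        return "freeze"    # budget exhausted: reliability work only
    if burn_rate(budget_consumed, window_elapsed) > 1.0:
        return "caution"   # assumed extra tier: spending faster than time passes
    return "ship"          # budget available: keep shipping


# Half the budget gone one week into a 30-day window
print(burn_rate(0.5, 7 / 30))
print(deploy_policy(0.5, 7 / 30))
```

A burn rate above 1.0 means the budget will run out before the window ends if nothing changes, so it can act as an early warning before a full freeze is needed.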
6. SLA: Service Level Agreements
Real-World SLA Examples
SLA Best Practices
- Make the SLA less strict than the SLO, so the internal target is missed before the contractual one
- Define clear measurement methods
- Specify exclusions (maintenance windows, etc.)
- Include remedy/compensation terms
- Define escalation procedures
- Review and update regularly
7. Putting It All Together
Example: E-commerce Checkout Service
SLIs:
- Availability: % of checkout requests that succeed (2xx responses)
- Latency: Time to complete checkout (P99)
- Error Rate: % of payment processing attempts that fail
SLOs:
- Availability: 99.95% over 30 days
- Latency: P99 < 500ms
- Error Rate: < 0.1% payment failures
SLA:
- Guarantee: 99.9% uptime (less strict than the SLO)
- Remedy: 10% credit if uptime is below 99.9%, 25% if below 99.5%
- Exclusions: Scheduled maintenance, force majeure
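The checkout example can be sketched as two checks: one against the internal SLO targets and one against the SLA credit tiers. The function names are hypothetical, and the sketch assumes excluded time (scheduled maintenance, force majeure) has already been removed from the measured figures.

```python
def slo_report(availability_pct, p99_latency_ms, payment_failure_pct):
    """Compare measured SLIs with the example SLO targets."""
    return {
        "availability": availability_pct >= 99.95,
        "latency": p99_latency_ms < 500,
        "error_rate": payment_failure_pct < 0.1,
    }


def sla_credit_pct(measured_uptime_pct):
    """Service credit owed under the example SLA tiers, as a percentage."""
    if measured_uptime_pct < 99.5:
        return 25
    if measured_uptime_pct < 99.9:
        return 10
    return 0


# A month at 99.92% uptime misses the SLO but still meets the SLA
print(slo_report(99.92, 430, 0.05))
print(sla_credit_pct(99.92))
```

The gap between 99.95% and 99.9% is the buffer: the team sees an SLO miss and can react before any customer credit is owed.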
8. Key Takeaways
9. Interview Follow-up Questions
Common follow-up questions interviewers ask
10. Test Your Understanding
5 questions
Which statement correctly describes the relationship between SLI, SLO, and SLA?
A service has a 99.9% availability SLO. How much downtime is allowed per month?
Your error budget is 50% consumed one week into the month. What should you do?
Why is P99 latency often more important than average latency?
Your SLA promises 99.95% uptime. What should your SLO be?