Module 0 - Core Concepts

SLAs, SLOs & SLIs

The three pillars of reliability engineering. Define what "reliable" means, measure it, and create agreements around it.

10 min read · Reliability

Industry SLA Reference
AWS EC2: 99.99%
Google Cloud: 99.95%
Stripe API: 99.99%
Slack: 99.99%

1. The Restaurant Analogy

Simple Analogy
Imagine a restaurant that promises "food delivered in 30 minutes or it's free."
SLI
The actual delivery time measured for each order (28 min, 32 min, etc.)
SLO
Internal goal: 95% of orders delivered in under 30 minutes
SLA
Public promise: If not delivered in 30 min, you get it free (the consequence)
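
To make the analogy concrete, here is a minimal Python sketch. The delivery times are made-up sample data, and the helper names (`on_time_fraction`, `free_meals`) are hypothetical. It also assumes an order that takes exactly 30 minutes counts as late, which the analogy does not spell out.

```python
# Hypothetical delivery times in minutes; each value is one SLI measurement.
delivery_minutes = [28, 32, 25, 29, 31, 27, 24, 26, 29, 22]

PROMISE_MINUTES = 30  # the "30 minutes or it's free" threshold
SLO_TARGET = 0.95     # internal goal: 95% of orders in under 30 minutes


def on_time_fraction(times, limit):
    """Fraction of orders delivered in strictly under `limit` minutes."""
    on_time = sum(1 for t in times if t < limit)
    return on_time / len(times)


def free_meals(times, limit):
    """Orders that trigger the SLA consequence (assumed: at or over the limit)."""
    return sum(1 for t in times if t >= limit)


fraction = on_time_fraction(delivery_minutes, PROMISE_MINUTES)
print(f"On time: {fraction:.0%}, SLO met: {fraction >= SLO_TARGET}")
print(f"Free meals owed: {free_meals(delivery_minutes, PROMISE_MINUTES)}")
```

With this sample, 8 of 10 orders arrive in under 30 minutes, so the 95% SLO is missed and two orders trigger the SLA consequence.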

2. Understanding the Hierarchy

SLI
Service Level Indicator
What we measure

A quantitative metric that measures a specific aspect of service performance.

Example: Request latency, error rate, availability percentage
SLO
Service Level Objective
What we target

A target value or range for an SLI. The internal goal for reliability.

Example: 99.9% of requests complete in <200ms over 30 days
SLA
Service Level Agreement
What we promise (with consequences)

A contract with customers that defines service expectations and remedies if not met.

Example: 99.5% uptime or customer receives service credits
Key Insight
SLO stricter than SLA, always. Your internal target (SLO) should be stricter than your customer promise (SLA). This gives you a buffer to catch and fix problems before you owe customers money or lose their trust.

3. SLI: Service Level Indicators

An SLI is a carefully chosen metric that directly reflects user experience. It answers: "How is the service performing right now?"

Common SLIs

Availability
Formula: Successful requests / Total requests × 100
Example: 99.95% of requests returned 2xx/3xx
Good for: All services
Latency
Formula: Time from request to response
Example: P99 latency = 180ms
Good for: User-facing APIs
Throughput
Formula: Requests processed per second
Example: 10,000 RPS sustained
Good for: High-volume systems
Error Rate
Formula: Failed requests / Total requests × 100
Example: 0.01% 5xx error rate
Good for: All services
Durability
Formula: Data successfully stored / Data written
Example: 99.999999999% (11 nines)
Good for: Storage systems
Freshness
Formula: Time since last data update
Example: Data updated within 5 minutes
Good for: Analytics, caching
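
As a rough illustration, several of these SLIs can be computed from a list of request records. This is a sketch under assumptions: the `Request` type and the sample data are hypothetical, 4xx responses are counted as neither successes nor server failures, and the percentile uses the nearest-rank method (other definitions interpolate and give slightly different values).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time from request to response


def availability_pct(requests):
    """Successful (2xx/3xx) requests / total requests x 100."""
    ok = sum(1 for r in requests if 200 <= r.status < 400)
    return ok / len(requests) * 100


def error_rate_pct(requests):
    """Failed (5xx) requests / total requests x 100."""
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests) * 100


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast successes, 1 slow success, 1 server error.
sample = [Request(200, 50.0)] * 98 + [Request(200, 900.0), Request(500, 1200.0)]
latencies = [r.latency_ms for r in sample]

print(f"Availability: {availability_pct(sample):.2f}%")
print(f"Error rate:   {error_rate_pct(sample):.2f}%")
print(f"Mean latency: {sum(latencies) / len(latencies):.1f} ms")
print(f"P99 latency:  {percentile(latencies, 99):.1f} ms")
```

In this sample the mean latency is 70 ms while the P99 is 900 ms. The average hides the slow tail that some users actually experience, which is why percentile latencies are usually preferred as SLIs.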

Choosing Good SLIs

✓ Good SLI Characteristics
  • Directly reflects user experience
  • Measurable and quantifiable
  • Actionable (you can improve it)
  • Understandable by stakeholders
✗ Poor SLI Choices
  • CPU utilization (internal metric)
  • Lines of code (vanity metric)
  • Number of deploys (not user-facing)
  • Uptime without context

4. SLO: Service Level Objectives

An SLO is a target value for an SLI, measured over a time window. It answers: "What level of reliability are we aiming for?"

SLO Components

99.9% of requests will have latency < 200ms measured over 30 days
Target (99.9%)
The threshold you're aiming for
SLI (<200ms latency)
What you're measuring
Window (30 days)
Time period for measurement
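
The three components map naturally onto a small data structure. The sketch below is illustrative only: `LatencySLO` and its field names are hypothetical, and the caller is assumed to pass in exactly the latencies recorded during the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    target_pct: float    # Target: share of requests that must be "good"
    threshold_ms: float  # SLI threshold: a request is good if latency < threshold
    window_days: int     # Window: period the supplied measurements must cover

    def compliance_pct(self, latencies_ms):
        """Percentage of requests in the window that met the latency threshold."""
        good = sum(1 for ms in latencies_ms if ms < self.threshold_ms)
        return good / len(latencies_ms) * 100

    def is_met(self, latencies_ms):
        """True when the measured compliance reaches the target."""
        return self.compliance_pct(latencies_ms) >= self.target_pct


slo = LatencySLO(target_pct=99.9, threshold_ms=200, window_days=30)

# Hypothetical 30-day window: 10,000 requests, 15 of them slower than 200 ms.
window = [120.0] * 9985 + [450.0] * 15
print(f"Compliance: {slo.compliance_pct(window):.2f}%, SLO met: {slo.is_met(window)}")
```

Here 99.85% of requests are fast enough, which falls short of the 99.9% target, so the SLO is not met for this window.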

The "Nines" of Availability

Availability         Downtime/Year    Downtime/Month    Example
99% (two 9s)         3.65 days        7.3 hours         Internal tools
99.9% (three 9s)     8.76 hours       43.8 minutes      Standard web apps
99.95%               4.38 hours       21.9 minutes      Cloud providers (typical)
99.99% (four 9s)     52.6 minutes     4.4 minutes       Financial systems
99.999% (five 9s)    5.26 minutes     26.3 seconds      Telecom, medical

Monthly figures assume an average month (365/12, about 30.4 days), so they differ slightly from figures based on a 30-day window.
More Nines = Exponentially Harder
Going from 99% to 99.9% cuts the allowed downtime by 10x. Going from 99.9% to 99.99% cuts it by another 10x. Each additional nine requires significantly more investment in redundancy, testing, and operations.
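
The downtime figures for each availability level follow from simple arithmetic. A minimal sketch, assuming a 365-day year and an average month of 365/12 days (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60           # 525,600 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12  # average month, about 30.4 days


def allowed_downtime_minutes(availability_pct, period_minutes):
    """Minutes of downtime permitted in a period at the given availability."""
    return period_minutes * (100 - availability_pct) / 100


for pct in (99, 99.9, 99.95, 99.99, 99.999):
    per_year = allowed_downtime_minutes(pct, MINUTES_PER_YEAR)
    per_month = allowed_downtime_minutes(pct, MINUTES_PER_MONTH)
    print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

Each extra nine divides the allowed downtime by ten, which is the arithmetic behind the jump in difficulty.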

5. Error Budgets

An Error Budget is the allowed amount of unreliability. If your SLO is 99.9%, your error budget is 0.1% (the complement: 100% minus the SLO).

Error Budget Calculation

SLO Target:
99.9%
Error Budget:
100% - 99.9% = 0.1%
In a 30-day window:
30 days × 24h × 60min × 0.1% = 43.2 minutes
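
The same calculation as a short Python sketch, extended to track how much budget is left after some downtime. The function names are hypothetical, and it assumes the budget is measured in minutes of downtime rather than in failed requests.

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Total allowed downtime in the window: window length x (100% - SLO)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_pct) / 100


def budget_remaining_pct(slo_pct, downtime_minutes, window_days=30):
    """Share of the error budget still unspent (negative once the SLO is blown)."""
    budget = error_budget_minutes(slo_pct, window_days)
    return (budget - downtime_minutes) / budget * 100


budget = error_budget_minutes(99.9)
remaining = budget_remaining_pct(99.9, downtime_minutes=10.8)
print(f"Budget: {budget:.1f} min, remaining: {remaining:.1f}%")
```

For a 99.9% SLO over 30 days the budget is 43.2 minutes, so 10.8 minutes of downtime leaves about 75% of it.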

Using Error Budgets

Budget Available → Ship Fast

  • Deploy new features
  • Run experiments
  • Take calculated risks
  • Move fast with innovation

Budget Exhausted → Slow Down

  • Freeze non-critical deploys
  • Focus on reliability fixes
  • Investigate root causes
  • Improve observability
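
One way to turn this into a decision rule is sketched below. The two outcomes "ship fast" and "slow down" come from the lists above. The intermediate "caution" state and the burn-rate threshold are assumptions added for illustration, not part of the original policy. Burn rate here means the fraction of budget used divided by the fraction of the window elapsed, so a value above 1 means the budget will run out before the window ends if the trend continues.

```python
def burn_rate(budget_used, window_elapsed):
    """Budget spent relative to an even spend across the window (both as fractions)."""
    if not 0 < window_elapsed <= 1:
        raise ValueError("window_elapsed must be in (0, 1]")
    return budget_used / window_elapsed


def deploy_policy(budget_used, window_elapsed, caution_burn_rate=1.0):
    """Map error-budget state to a release posture."""
    if budget_used >= 1.0:
        return "slow down"   # budget exhausted: freeze non-critical deploys
    if burn_rate(budget_used, window_elapsed) > caution_burn_rate:
        return "caution"     # assumed extra state: spending too fast
    return "ship fast"       # budget available


# 50% of the budget consumed one week into a 30-day window.
print(burn_rate(0.5, 7 / 30))
print(deploy_policy(0.5, 7 / 30))
```

In the example the burn rate is a little over 2, so the budget would be gone roughly halfway through the window. That is a signal to slow down risky changes before the budget is actually exhausted.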

6. SLA: Service Level Agreements

An SLA is a formal contract between provider and customer that defines service expectations and consequences for not meeting them (usually financial).

Real-World SLA Examples

AWS EC2
99.99% uptime
Remedy: 10% credit for <99.99%, 30% for <99%, 100% for <95%
Google Cloud
99.95% uptime
Remedy: 10-50% service credits based on downtime
Stripe API
99.99% uptime
Remedy: Service credits for extended outages
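
Tiered remedies like these amount to a lookup from measured uptime to a credit percentage. The sketch below uses the EC2 tiers exactly as listed above. The function and constant names are hypothetical, and real SLAs define their own measurement rules and exclusions, so treat this as an illustration rather than any provider's actual calculation.

```python
# (uptime threshold %, credit %) pairs taken from the EC2 example above.
EC2_STYLE_TIERS = [(99.99, 10), (99.0, 30), (95.0, 100)]


def service_credit_pct(uptime_pct, tiers):
    """Return the credit for the lowest threshold the measured uptime falls below."""
    for threshold, credit in sorted(tiers):  # check the lowest threshold first
        if uptime_pct < threshold:
            return credit
    return 0  # SLA met: no credit owed


for uptime in (99.995, 99.5, 98.5, 94.0):
    print(f"{uptime}% uptime -> {service_credit_pct(uptime, EC2_STYLE_TIERS)}% credit")
```

Sorting by threshold means the most severe tier is checked first, so an uptime of 94% earns the 100% credit rather than the 10% one.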

SLA Best Practices

  • ✓ SLA should be less strict than SLO
  • ✓ Define clear measurement methods
  • ✓ Specify exclusions (maintenance windows, etc.)
  • ✓ Include remedy/compensation terms
  • ✓ Define escalation procedures
  • ✓ Review and update regularly

7. Putting It All Together

Example: E-commerce Checkout Service

SLIs (What we measure)
  • Availability: % of successful checkout requests (2xx responses)
  • Latency: Time to complete checkout (P99)
  • Error Rate: % of payment processing attempts that fail
SLOs (What we target)
  • Availability: 99.95% over 30 days
  • Latency: P99 < 500ms over 30 days
  • Error Rate: < 0.1% payment failures over 30 days
SLA (What we promise)
  • Guarantee: 99.9% uptime (less strict than SLO)
  • Remedy: 10% credit if <99.9%, 25% if <99.5%
  • Exclusions: Scheduled maintenance, force majeure
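
The checkout example can be written down as data and checked against measured values. This is a sketch only: the dictionary layout, the function names, and the measured numbers are all hypothetical, and exclusions such as scheduled maintenance are ignored.

```python
# SLO targets from the example: ("min", x) means value >= x, ("max", x) means value < x.
CHECKOUT_SLOS = {
    "availability_pct": ("min", 99.95),
    "p99_latency_ms": ("max", 500),
    "payment_failure_pct": ("max", 0.1),
}
SLA_UPTIME_PCT = 99.9  # external guarantee, less strict than the availability SLO


def breached_slos(measured):
    """Names of the SLOs that the measured values fail to meet."""
    breaches = []
    for name, (kind, target) in CHECKOUT_SLOS.items():
        value = measured[name]
        ok = value >= target if kind == "min" else value < target
        if not ok:
            breaches.append(name)
    return breaches


def sla_breached(measured):
    """True when availability drops below the contractual guarantee."""
    return measured["availability_pct"] < SLA_UPTIME_PCT


# Hypothetical month: availability dips below the SLO but stays above the SLA.
month = {"availability_pct": 99.93, "p99_latency_ms": 420, "payment_failure_pct": 0.05}
print(breached_slos(month))
print(sla_breached(month))
```

In this hypothetical month the availability SLO is missed while the SLA still holds. That gap is the safety buffer: the team gets an internal warning before any credits are owed.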

8. Key Takeaways

1. SLI = What you measure (the metric). Focus on user-facing metrics.
2. SLO = What you target (internal goal). Your reliability north star.
3. SLA = What you promise (external contract with consequences).
4. Always set SLO stricter than SLA to maintain a safety buffer.
5. Error budgets balance reliability with innovation. Use them wisely.
6. Each additional "nine" of availability allows 10x less downtime and costs significantly more to achieve.
7. Start with 3-5 meaningful SLIs. Too many metrics = noise.

10. Test Your Understanding

1. Which statement correctly describes the relationship between SLI, SLO, and SLA?
2. A service has a 99.9% availability SLO. How much downtime is allowed per month?
3. Your error budget is 50% consumed one week into the month. What should you do?
4. Why is P99 latency often more important than average latency?
5. Your SLA promises 99.95% uptime. What should your SLO be?