Module 0 - Core Concepts

SLAs, SLOs & SLIs

The three pillars of reliability engineering. Define what "reliable" means, measure it, and create agreements around it.

10 min read | Reliability


Industry SLA Reference

AWS EC2: 99.99%

Google Cloud: 99.95%

Stripe API: 99.99%

Slack: 99.99%

1. The Restaurant Analogy

Simple Analogy

Imagine a restaurant that promises "food delivered in 30 minutes or it's free."

SLI

The actual delivery time measured for each order (28 min, 32 min, etc.)

SLO

Internal goal: 95% of orders delivered in under 30 minutes

SLA

Public promise: If not delivered in 30 min, you get it free (the consequence)

2. Understanding the Hierarchy

SLI

Service Level Indicator

What we measure

A quantitative metric that measures a specific aspect of service performance.

Example: Request latency, error rate, availability percentage

SLO

Service Level Objective

What we target

A target value or range for an SLI. The internal goal for reliability.

Example: 99.9% of requests complete in <200ms over 30 days

SLA

Service Level Agreement

What we promise (with consequences)

A contract with customers that defines service expectations and remedies if not met.

Example: 99.5% uptime or customer receives service credits

Key Insight

Your SLO should always be stricter than your SLA. Set the internal target (SLO) higher than the customer promise (SLA), for example a 99.9% SLO behind a 99.5% SLA. This gives you a buffer before you owe customers money or damage trust.
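
To make the buffer concrete, here is a minimal sketch that converts both targets into allowed downtime and subtracts them. The function names, the 30-day window, and the 99.9% / 99.5% targets are illustrative assumptions, not values from any real contract.

```python
def allowed_downtime_min(target_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime a target permits over the window."""
    return window_days * 24 * 60 * (100.0 - target_pct) / 100.0


def sla_buffer_min(slo_pct: float, sla_pct: float, window_days: int = 30) -> float:
    """Downtime that misses the SLO but still stays inside the SLA."""
    if slo_pct <= sla_pct:
        raise ValueError("the SLO must be stricter (higher) than the SLA")
    return allowed_downtime_min(sla_pct, window_days) - allowed_downtime_min(slo_pct, window_days)


# Illustrative targets: 99.9% internal SLO, 99.5% customer-facing SLA.
buffer_min = sla_buffer_min(slo_pct=99.9, sla_pct=99.5)
print(f"Buffer before the SLA is breached: {buffer_min:.1f} min")
```

With these numbers the SLA tolerates 216 minutes of downtime in 30 days and the SLO only 43.2, so the team has about 172.8 minutes to react between missing its own goal and owing credits.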

3. SLI: Service Level Indicators

An SLI is a carefully chosen metric that directly reflects user experience. It answers: "How is the service performing right now?"

Common SLIs

Availability

Formula: Successful requests / Total requests × 100

Example: 99.95% of requests returned 2xx/3xx

Good for: All services

Latency

Formula: Time from request to response

Example: P99 latency = 180ms

Good for: User-facing APIs

Throughput

Formula: Requests processed per second

Example: 10,000 RPS sustained

Good for: High-volume systems

Error Rate

Formula: Failed requests / Total requests × 100

Example: 0.01% 5xx error rate

Good for: All services

Durability

Formula: Data successfully stored / Data written

Example: 99.999999999% (11 nines)

Good for: Storage systems

Freshness

Formula: Time since last data update

Example: Data updated within 5 minutes

Good for: Analytics, caching
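
The availability, error rate, and latency formulas above can be sketched over a list of request records. The `Request` type and the sample data are hypothetical, and the percentile uses the nearest-rank definition, which is only one of several in common use. The type hints assume Python 3.9 or later.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    latency_ms: float   # time from request to response


def availability_pct(requests: list[Request]) -> float:
    """Successful (2xx/3xx) requests / total requests x 100."""
    ok = sum(1 for r in requests if 200 <= r.status < 400)
    return 100.0 * ok / len(requests)


def error_rate_pct(requests: list[Request]) -> float:
    """Failed (5xx) requests / total requests x 100."""
    failed = sum(1 for r in requests if r.status >= 500)
    return 100.0 * failed / len(requests)


def percentile_latency_ms(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile: smallest latency with at least pct% of requests at or below it."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample of 100 requests: 97 fast successes, one slower success,
# one slow redirect, and one server error.
sample = [Request(200, 50.0)] * 97 + [
    Request(200, 180.0),
    Request(302, 400.0),
    Request(503, 900.0),
]

print("availability %:", availability_pct(sample))
print("error rate %:", error_rate_pct(sample))
print("P50 ms:", percentile_latency_ms(sample, 50))
print("P99 ms:", percentile_latency_ms(sample, 99))
```

In this sample the median is 50 ms while the P99 is 400 ms, which is why tail percentiles are usually a better latency SLI than an average: a small share of slow requests is invisible in the middle of the distribution.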

Choosing Good SLIs

✓ Good SLI Characteristics

• Directly reflects user experience
• Measurable and quantifiable
• Actionable (you can improve it)
• Understandable by stakeholders

✗ Poor SLI Choices

• CPU utilization (internal metric)
• Lines of code (vanity metric)
• Number of deploys (not user-facing)
• Uptime without context

4. SLO: Service Level Objectives

An SLO is a target value for an SLI, measured over a time window. It answers: "What level of reliability are we aiming for?"

SLO Components

99.9% of requests will have latency < 200ms measured over 30 days

Target (99.9%)

The threshold you're aiming for

SLI (<200ms latency)

What you're measuring

Window (30 days)

Time period for measurement
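
The three components map naturally onto a small data structure. The sketch below is one possible shape, not a standard API: `LatencySLO` and its fields are hypothetical names, and the caller is assumed to pass only latencies collected inside the stated window. The type hints assume Python 3.9 or later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    target_pct: float     # target, e.g. 99.9 (% of requests that must be "good")
    threshold_ms: float   # SLI threshold, e.g. 200 (a request is "good" if faster)
    window_days: int      # window, e.g. 30 (period the latencies were collected over)

    def good_pct(self, latencies_ms: list[float]) -> float:
        """Share of requests that finished under the threshold."""
        good = sum(1 for ms in latencies_ms if ms < self.threshold_ms)
        return 100.0 * good / len(latencies_ms)

    def is_met(self, latencies_ms: list[float]) -> bool:
        return self.good_pct(latencies_ms) >= self.target_pct


slo = LatencySLO(target_pct=99.9, threshold_ms=200, window_days=30)

# Hypothetical window: 10,000 requests, 15 of them slower than 200 ms.
latencies = [120.0] * 9985 + [350.0] * 15
print(f"{slo.good_pct(latencies):.2f}% good, SLO met: {slo.is_met(latencies)}")
```

Fifteen slow requests out of 10,000 is already too many for a 99.9% target, which allows only ten.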

The "Nines" of Availability

Availability	Downtime/Year	Downtime/Month (average month)	Example
99% (two 9s)	3.65 days	7.3 hours	Internal tools
99.9% (three 9s)	8.76 hours	43.8 minutes	Standard web apps
99.95%	4.38 hours	21.9 minutes	Cloud providers (typical)
99.99% (four 9s)	52.6 minutes	4.4 minutes	Financial systems
99.999% (five 9s)	5.26 minutes	26.3 seconds	Telecom, medical

More Nines = Exponentially Harder

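Going from 99% to 99.9% leaves one tenth of the allowed downtime, and going from 99.9% to 99.99% cuts it by another factor of 10. As a rule of thumb, each additional nine is roughly 10x harder to achieve and requires significantly more investment in redundancy, testing, and operations.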
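
The downtime figures in the table can be reproduced with a few lines of arithmetic. This sketch assumes a 365-day year and an average month of one twelfth of a year; `downtime_minutes` and `humanize` are illustrative helper names.

```python
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12   # average month, about 30.4 days


def downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Allowed downtime = period x (100% - availability)."""
    return period_minutes * (100.0 - availability_pct) / 100.0


def humanize(minutes: float) -> str:
    """Pick a readable unit for a duration given in minutes."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    if minutes >= 1:
        return f"{minutes:.1f} minutes"
    return f"{minutes * 60:.1f} seconds"


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year = humanize(downtime_minutes(pct, MINUTES_PER_YEAR))
    per_month = humanize(downtime_minutes(pct, MINUTES_PER_MONTH))
    print(f"{pct}%  year: {per_year:<14} month: {per_month}")
```

Each step down the list divides the allowed downtime by ten (or by two for the 99.95% row), which is the arithmetic behind the "more nines" rule of thumb.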

5. Error Budgets

An error budget is the allowed amount of unreliability: 100% minus the SLO. If your SLO is 99.9%, your error budget is 0.1%.

Error Budget Calculation

SLO Target:

99.9%

Error Budget:

100% - 99.9% = 0.1%

In a 30-day window:

30 days × 24h × 60min × 0.1% = 43.2 minutes
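
The same calculation, extended to track how much of the budget has been spent, might look like the sketch below. The function names and the 10.8 minutes of downtime are made up for illustration.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """window x (100% - SLO), expressed in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_pct) / 100.0


def budget_status(slo_pct: float, downtime_used_min: float, window_days: int = 30) -> dict:
    """Summarize budget size, share consumed, and minutes left for the window."""
    budget = error_budget_minutes(slo_pct, window_days)
    return {
        "budget_min": round(budget, 2),
        "used_pct": round(100.0 * downtime_used_min / budget, 1),
        "remaining_min": round(budget - downtime_used_min, 2),
    }


# Hypothetical month: 10.8 minutes of downtime against a 99.9% SLO.
status = budget_status(slo_pct=99.9, downtime_used_min=10.8)
print(status)
```

Here 10.8 minutes of downtime consumes a quarter of the 43.2-minute budget, leaving 32.4 minutes for the rest of the window.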

Using Error Budgets

Budget Available → Ship Fast

• Deploy new features
• Run experiments
• Take calculated risks
• Move fast with innovation

Budget Exhausted → Slow Down

• Freeze non-critical deploys
• Focus on reliability fixes
• Investigate root causes
• Improve observability
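
As a rough sketch, the two modes above can be encoded as a simple decision rule. The names `BudgetPolicy`, `ACTIONS`, and `budget_policy` are hypothetical, and real policies often add intermediate states that this example leaves out.

```python
from enum import Enum


class BudgetPolicy(Enum):
    SHIP_FAST = "ship fast"
    SLOW_DOWN = "slow down"


# Actions taken from the two lists above.
ACTIONS = {
    BudgetPolicy.SHIP_FAST: [
        "deploy new features",
        "run experiments",
        "take calculated risks",
    ],
    BudgetPolicy.SLOW_DOWN: [
        "freeze non-critical deploys",
        "focus on reliability fixes",
        "investigate root causes",
        "improve observability",
    ],
}


def budget_policy(budget_min: float, used_min: float) -> BudgetPolicy:
    """Budget left over means ship; a spent budget means slow down."""
    return BudgetPolicy.SHIP_FAST if used_min < budget_min else BudgetPolicy.SLOW_DOWN


# Hypothetical month: 45 minutes of downtime against a 43.2-minute budget.
policy = budget_policy(budget_min=43.2, used_min=45.0)
print(policy.value, "->", ACTIONS[policy])
```

Making the rule explicit keeps the ship-or-freeze decision tied to the measured budget instead of to whoever argues loudest.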

6. SLA: Service Level Agreements

An SLA is a formal contract between provider and customer that defines service expectations and consequences for not meeting them (usually financial).

Real-World SLA Examples

AWS EC2

99.99% uptime

Remedy: 10% credit for <99.99%, 30% for <99%, 100% for <95%

Google Cloud

99.95% uptime

Remedy: 10-50% service credits based on downtime

Stripe API

99.99% uptime

Remedy: Service credits for extended outages
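
Tiered remedies like these amount to a threshold lookup. The sketch below uses the EC2 tiers as quoted above purely as sample data; it is not an authoritative copy of any provider's contract, and `CREDIT_TIERS` and `service_credit_pct` are illustrative names.

```python
# (uptime threshold %, credit %) pairs, ordered from the worst tier to the best.
# Values are the EC2 remedy as quoted in this article, used only as sample data.
CREDIT_TIERS = [(95.0, 100), (99.0, 30), (99.99, 10)]


def service_credit_pct(measured_uptime_pct: float, tiers=CREDIT_TIERS) -> int:
    """Return the credit for the first (worst) tier the measured uptime falls below."""
    for threshold, credit in tiers:
        if measured_uptime_pct < threshold:
            return credit
    return 0  # SLA met, no credit owed


for uptime in (99.995, 99.5, 98.0, 94.0):
    print(f"{uptime}% uptime -> {service_credit_pct(uptime)}% credit")
```

Ordering the tiers from worst to best means the first match is always the largest credit that applies.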

SLA Best Practices

✓ SLA should be less strict than SLO
✓ Define clear measurement methods
✓ Specify exclusions (maintenance windows, etc.)

✓ Include remedy/compensation terms
✓ Define escalation procedures
✓ Review and update regularly

7. Putting It All Together

Example: E-commerce Checkout Service

SLIs (What we measure)

• Availability: % of successful checkout requests (2xx responses)
• Latency: Time to complete checkout (P99)
• Error Rate: % of payment processing attempts that fail

SLOs (What we target)

• Availability: 99.95% over 30 days
• Latency: P99 < 500ms
• Error Rate: < 0.1% payment failures

SLA (What we promise)

• Guarantee: 99.9% uptime (less strict than SLO)
• Remedy: 10% credit if <99.9%, 25% if <99.5%
• Exclusions: Scheduled maintenance, force majeure
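
The checkout example can be expressed as a small check that compares one window of measurements against the SLOs and the SLA. This is a sketch under assumptions: `CheckoutMeasurements` and the sample values are hypothetical, request availability is treated as "uptime" for the SLA, and exclusions such as scheduled maintenance are ignored. The type hints assume Python 3.9 or later.

```python
from dataclasses import dataclass


@dataclass
class CheckoutMeasurements:
    availability_pct: float      # % of checkout requests answered with 2xx
    p99_latency_ms: float        # P99 time to complete checkout
    payment_failure_pct: float   # % of payment attempts that failed


def slo_report(m: CheckoutMeasurements) -> dict[str, bool]:
    """True means the SLO from the example above was met for the window."""
    return {
        "availability": m.availability_pct >= 99.95,
        "latency": m.p99_latency_ms < 500,
        "error_rate": m.payment_failure_pct < 0.1,
    }


def sla_credit_pct(uptime_pct: float) -> int:
    """Credit owed under the example SLA: 10% below 99.9%, 25% below 99.5%."""
    if uptime_pct < 99.5:
        return 25
    if uptime_pct < 99.9:
        return 10
    return 0


# Hypothetical 30-day window: availability slips under the SLO but not under the SLA.
month = CheckoutMeasurements(availability_pct=99.92, p99_latency_ms=430, payment_failure_pct=0.04)
print(slo_report(month))
print("SLA credit:", sla_credit_pct(month.availability_pct), "%")
```

In this window the availability SLO is missed while the SLA is still met, so no credit is owed. That gap is the safety buffer the stricter SLO is meant to provide.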

8. Key Takeaways

1. SLI = What you measure (the metric). Focus on user-facing metrics.

2. SLO = What you target (internal goal). Your reliability north star.

3. SLA = What you promise (external contract with consequences).

4. Always set your SLO stricter than your SLA to maintain a safety buffer.

5. Error budgets balance reliability with innovation. Use them wisely.

6. Each additional "nine" of availability cuts allowed downtime by 10x and is roughly 10x harder and more expensive to achieve.

7. Start with 3-5 meaningful SLIs. Too many metrics = noise.

9. Interview Follow-up Questions

Common follow-up questions interviewers ask

10. Test Your Understanding

5 questions

Which statement correctly describes the relationship between SLI, SLO, and SLA?

A service has a 99.9% availability SLO. How much downtime is allowed per month?

Your error budget is 50% consumed one week into the month. What should you do?

Why is P99 latency often more important than average latency?

Your SLA promises 99.95% uptime. What should your SLO be?
