Module 4 - Scaling

High Availability

Designing systems that stay up even when things fail.

1. The Hospital Analogy

Simple Analogy
A hospital has backup generators, multiple doctors on call, and emergency protocols. If one system fails, another takes over. The goal: never stop serving patients. That's high availability.

High Availability (HA) means a system remains operational for a high percentage of the time. It is measured in "nines": 99.9% (3 nines) allows about 8.76 hours of downtime per year.

2. Availability Levels

Availability        Downtime/Year    Example
99% (2 nines)       3.65 days        Internal tools
99.9% (3 nines)     8.76 hours       Most SaaS apps
99.99% (4 nines)    52.6 minutes     E-commerce, fintech
99.999% (5 nines)   5.26 minutes     Telecom, healthcare
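The downtime figures for each availability level follow from simple arithmetic: the unavailable fraction multiplied by the hours in a year. A minimal sketch, assuming a 365-day year (the function name `downtime_hours_per_year` is illustrative, not from any library):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime, in hours, allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


# Print the downtime budget for each level, in hours and in minutes.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why each level is much harder to reach than the one before it.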

3. Achieving HA

Redundancy

No single point of failure. Run multiple instances of every critical component.
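To see why redundancy helps, consider N identical instances where the service is down only if all of them are down at once. A rough sketch, assuming failures are independent (real outages are often correlated, so treat the result as an upper bound); `combined_availability` is a hypothetical helper name:

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances, assuming independent failures.

    The group is unavailable only when every instance is down at the same time.
    """
    if instances < 1:
        raise ValueError("need at least one instance")
    probability_all_down = (1 - instance_availability) ** instances
    return 1 - probability_all_down


# Two or three instances that are each 99% available.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under this assumption, two instances at 99% each give roughly 99.99% combined, which is the arithmetic behind "multiple instances of everything".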

Load Balancing

Distribute traffic across instances. If one fails, the others absorb its load.
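One simple way to distribute traffic is round-robin selection that skips instances currently marked as down. The sketch below is illustrative only; `RoundRobinBalancer` and the backend names are assumptions, and production load balancers do far more (connection draining, weighting, and so on):

```python
class RoundRobinBalancer:
    """Cycles through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to consider

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical names
balancer.mark_down("app-2")
print([balancer.pick() for _ in range(4)])
```

With `app-2` marked down, requests alternate between the two remaining backends.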

Failover

Automatic switch to backup when primary fails.
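At the client level, failover can be sketched as trying replicas in priority order and moving on when one is unreachable. This is a simplified illustration; `call_with_failover` and the two replica functions are hypothetical, and real failover usually also involves promoting the backup and redirecting all traffic:

```python
def call_with_failover(replicas, request):
    """Send the request to the first replica that responds.

    `replicas` is a list of callables in priority order (primary first).
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as error:
            last_error = error  # remember the failure, try the next replica
    raise RuntimeError("all replicas failed") from last_error


def primary(request):
    raise ConnectionError("primary is unreachable")  # simulated outage


def backup(request):
    return f"backup handled {request}"


print(call_with_failover([primary, backup], "GET /orders"))
```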

Health Checks

Detect failures quickly and route around them.
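A common pattern is to mark a target unhealthy only after several consecutive failed probes, so that one dropped packet does not trigger a failover. A minimal sketch, assuming a threshold of three and that a single successful probe restores the target (`HealthTracker` is an illustrative name):

```python
class HealthTracker:
    """Marks a target unhealthy after several consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health verdict."""
        if probe_ok:
            self.consecutive_failures = 0  # one success restores the target
        else:
            self.consecutive_failures += 1
        return self.is_healthy


tracker = HealthTracker(failure_threshold=3)
for result in (True, False, False, False, True):
    print(result, "->", "healthy" if tracker.record(result) else "unhealthy")
```

A lower threshold detects failures faster but raises the chance of false alarms; the right value depends on probe interval and how noisy the network is.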

Multi-AZ / Multi-Region

Survive data center or region failures.

4. Failure Modes

Hardware Failure

Disks, network links, and servers fail. Use redundancy.

Software Bugs

Code crashes. Use circuit breakers and rollbacks.
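A circuit breaker stops calling a failing component for a while instead of repeatedly hitting it. The sketch below is one simplified version of the pattern, not a specific library's API; the class name, threshold, and timeout values are assumptions:

```python
import time


class CircuitBreaker:
    """Opens after repeated failures, then allows a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a flaky call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical function) means that after three failures, callers get an immediate error for the next 30 seconds rather than waiting on a component that is known to be broken.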

Network Partitions

Nodes can't communicate. Handle it gracefully: time out, retry, and degrade rather than hang.
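One widely used way to handle a partition gracefully is a majority quorum: a node keeps accepting writes only while it can reach more than half of the cluster, so two isolated halves cannot both make changes. This is just one possible strategy, sketched with a hypothetical `has_quorum` helper:

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True when a strict majority of the cluster is reachable.

    `reachable_nodes` counts this node itself plus every peer it can contact.
    """
    return reachable_nodes > cluster_size // 2


# A 5-node cluster split 3 / 2 by a network partition.
print(has_quorum(3, 5))  # majority side may keep accepting writes
print(has_quorum(2, 5))  # minority side should refuse writes or go read-only
```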

Dependency Failures

An external service is down. Use fallbacks and caching.

5. Key Takeaways

1. HA = staying operational despite failures
2. Measured in nines: 99.9% = 8.76 hrs/year downtime
3. Eliminate single points of failure with redundancy
4. Use health checks + failover for quick recovery
5. Multi-AZ/region protects against site failures

Quiz

1. Your SLA requires 99.99% uptime. Max downtime per year?

2. Single database, no replica. What's the risk?