Module 4 - Scaling
High Availability
Designing systems that stay up even when things fail.
1. The Hospital Analogy
Simple Analogy
A hospital has backup generators, multiple doctors on call, and emergency protocols. If one system fails, another takes over. The goal: never stop serving patients. That's high availability.
High Availability (HA) means a system remains operational for a high percentage of the time. It is measured in "nines": 99.9% (3 nines) allows about 8.76 hours of downtime per year.
2. Availability Levels
| Availability | Downtime/Year | Example |
|---|---|---|
| 99% (2 nines) | 3.65 days | Internal tools |
| 99.9% (3 nines) | 8.76 hours | Most SaaS apps |
| 99.99% (4 nines) | 52.6 minutes | E-commerce, fintech |
| 99.999% (5 nines) | 5.26 minutes | Telecom, healthcare |
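
The downtime budgets in the table come from simple arithmetic: the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year; the function name `downtime_minutes_per_year` is made up for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Print the budget for each level listed in the table.
for level in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why moving from 3 nines to 4 nines usually requires automated recovery rather than manual intervention.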
3. Achieving HA
Redundancy
No single point of failure. Multiple instances of everything.
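
To see why redundancy helps, consider a rough estimate of the combined availability of several identical instances. This sketch assumes failures are independent and that one healthy instance is enough to serve traffic; real outages are often correlated (shared power, shared deploys), so treat the result as an upper bound. The function name is illustrative.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances.

    Assumes independent failures and that a single healthy instance
    can serve all traffic.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The system is down only when every instance is down at the same time.
    return 1 - (1 - instance_availability) ** instances


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two instances at 99% each give roughly 99.99% together.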
Load Balancing
Distribute traffic across instances. If one fails, the others handle its share.
Failover
Automatic switch to backup when primary fails.
Health Checks
Detect failures quickly and route around them.
Multi-AZ / Multi-Region
Survive data center or region failures.
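
Load balancing, health checks, and failover work together. The following is a minimal in-memory sketch, not a production load balancer: the `Backend` and `LoadBalancer` classes are hypothetical, and the `healthy` flag stands in for the result of a real probe (for example, a periodic request to a health endpoint).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True  # stand-in for the latest health check result


class LoadBalancer:
    """Round-robin over backends, skipping any that failed their health check."""

    def __init__(self, backends: list[Backend]) -> None:
        self.backends = backends
        self._next = 0  # round-robin cursor

    def pick(self) -> Backend:
        # Try each backend at most once, starting from the cursor.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = LoadBalancer([Backend("a"), Backend("b"), Backend("c")])
print([lb.pick().name for _ in range(3)])

lb.backends[1].healthy = False  # simulate a failed health check on "b"
print([lb.pick().name for _ in range(4)])  # traffic now skips "b"
```

The failover here is implicit: once a backend is marked unhealthy, requests route around it without any change on the client side. If every backend is unhealthy, `pick` raises, which is the case multi-AZ or multi-region deployments are meant to make unlikely.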
4. Failure Modes
Hardware Failure
Disks, network links, and servers fail. Use redundancy.
Software Bugs
Code crashes. Use circuit breakers and rollbacks.
Network Partitions
Nodes can't communicate. Handle gracefully.
Dependency Failures
External service down. Use fallbacks, caching.
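
Circuit breakers and fallbacks can be combined into one small wrapper. The sketch below is one simplified way to do it, not a reference implementation: the `CircuitBreaker` class, its parameters, and `flaky_pricing_service` are all hypothetical. After a set number of consecutive failures it stops calling the dependency for a cooldown period and returns the fallback (for example, a cached value) instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Skip a failing dependency for a cooldown period and serve a fallback."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, do not touch the dependency until the cooldown has elapsed.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                return fallback()
            self._opened_at = None  # cooldown over: allow one trial call
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or reopen) the circuit
            return fallback()
        self._failures = 0  # a success resets the failure count
        return result


def flaky_pricing_service() -> str:
    raise ConnectionError("dependency is down")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=60)
for _ in range(3):
    # The third call never reaches the service: the circuit is already open.
    print(breaker.call(flaky_pricing_service, lambda: "cached price"))
```

Failing fast this way keeps a slow or broken dependency from tying up threads and connections in the caller, which is how one outage tends to cascade into several.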
5. Key Takeaways
1. HA = staying operational despite failures
2. Measured in nines: 99.9% = about 8.76 hrs/year downtime
3. Eliminate single points of failure with redundancy
4. Use health checks + failover for quick recovery
5. Multi-AZ/region protects against site failures
Quiz
1. Your SLA requires 99.99% uptime. What is the maximum downtime per year?
2. You run a single database with no replica. What's the risk?