Module 4 - Scaling

High Availability

Designing systems that stay up even when things fail.

1. The Hospital Analogy

Simple Analogy

A hospital has backup generators, multiple doctors on call, and emergency protocols. If one system fails, another takes over. The goal: never stop serving patients. That's high availability.

High Availability (HA) means a system remains operational for a high percentage of time. It is measured in "nines": 99.9% (3 nines) allows about 8.76 hours of downtime per year.

2. Availability Levels

Availability	Downtime/Year	Example
99% (2 nines)	3.65 days	Internal tools
99.9% (3 nines)	8.76 hours	Most SaaS apps
99.99% (4 nines)	52.6 minutes	E-commerce, fintech
99.999% (5 nines)	5.26 minutes	Telecom, healthcare
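
The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_hours_per_year` is illustrative, not from any library):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored here


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every additional nine tends to be much harder to achieve than the previous one.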

3. Achieving HA

Redundancy

No single point of failure. Multiple instances of everything.
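
To see why redundancy helps, consider a rough estimate of the availability of several identical instances running in parallel. This sketch assumes that instances fail independently and that the service is up as long as at least one instance is up; real failures are often correlated, so treat the result as an upper bound. The function name is illustrative.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N parallel instances, assuming independent failures.

    The service is down only when every instance is down at the same time.
    """
    all_down_probability = (1 - instance_availability) ** instances
    return 1 - all_down_probability


# Two instances at 99% each give roughly 99.99% under the independence assumption.
print(f"{combined_availability(0.99, 2):.4%}")
```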

Load Balancing

Distribute traffic across instances. If one fails, the others absorb its load.
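
A minimal sketch of round-robin load balancing that skips instances marked as down. The class and method names are hypothetical; production load balancers (hardware, cloud, or proxy based) do considerably more.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_down("app-2")
print([balancer.next_backend() for _ in range(4)])
```

With `app-2` marked down, requests alternate between the two remaining backends.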

Failover

Automatic switch to backup when primary fails.
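
Failover can be sketched at the client level as "try the primary, then each backup in order". The helper below is an illustrative assumption, not a standard API; real failover usually also involves promoting the backup and preventing both nodes from acting as primary.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(targets: Sequence[Callable[[], T]]) -> T:
    """Try each target in order (primary first) and return the first success."""
    last_error = None
    for target in targets:
        try:
            return target()
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
    raise RuntimeError("all targets failed") from last_error


def primary():
    raise ConnectionError("primary is down")


def backup():
    return "served by backup"


print(call_with_failover([primary, backup]))
```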

Health Checks

Detect failures quickly and route around them.
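
One common way to avoid reacting to a single dropped probe is to mark an instance unhealthy only after several consecutive failed checks. A minimal sketch, with a hypothetical class name and an assumed threshold of three:

```python
class HealthTracker:
    """Track probe results and report unhealthy after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> None:
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


tracker = HealthTracker(failure_threshold=3)
for result in (True, False, False, False):
    tracker.record(result)
print(tracker.healthy)
```

A lower threshold detects failures faster but risks removing instances because of transient blips; the right value depends on the probe interval and the downtime budget.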

Multi-AZ / Multi-Region

Survive data center or region failures.

4. Failure Modes

Hardware Failure

Disks, network, servers fail. Use redundancy.

Software Bugs

Code crashes. Use circuit breakers, rollback.
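
A circuit breaker stops calling a failing component for a while so that errors do not pile up, then lets a trial call through to see whether it has recovered. The sketch below is a simplified illustration; the class name, thresholds, and injectable `clock` parameter are assumptions, not a specific library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap risky calls as `breaker.call(some_function, arg)` and treat `CircuitOpenError` as a signal to use a fallback instead of waiting on a component that is known to be failing.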

Network Partitions

Nodes can't communicate. Handle gracefully: use timeouts and retries rather than waiting indefinitely.

Dependency Failures

External service down. Use fallbacks, caching.
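
A fallback combined with caching can be sketched as "serve the last known good value when the dependency is down". The wrapper below is a hypothetical illustration; it ignores cache expiry, which a real system would need to decide how stale a value may be.

```python
class CachedFallback:
    """Wrap a fetch function and fall back to the last good value on failure."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.last_good = {}

    def get(self, key, default=None):
        try:
            value = self.fetch(key)
        except Exception:
            # Dependency failed: serve the cached value, or the default if none.
            return self.last_good.get(key, default)
        self.last_good[key] = value
        return value


service_up = {"value": True}


def fetch_rate(currency):
    # Stand-in for a call to an external service.
    if not service_up["value"]:
        raise ConnectionError("rate service is down")
    return 1.25


rates = CachedFallback(fetch_rate)
rates.get("EUR")              # succeeds and caches the value
service_up["value"] = False
print(rates.get("EUR"))       # served from the cache while the service is down
```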

5. Key Takeaways

1. HA = staying operational despite failures

2. Measured in nines: 99.9% = about 8.76 hrs/year downtime

3. Eliminate single points of failure with redundancy

4. Use health checks + failover for quick recovery

5. Multi-AZ/region protects against site failures

Quiz

1. Your SLA requires 99.99% uptime. Max downtime per year?

2. You run a single database with no replica. What's the risk?