Module 4 — Scaling
High Availability
Design systems that stay up even when components fail. The goal: 99.99% uptime.
1. The Hospital Analogy
💡 Simple Analogy
A hospital can't close when one doctor is sick. It has backup doctors (redundancy), multiple operating rooms (no single point of failure), and emergency protocols (failover).
High availability means your system keeps running even when parts fail.
2. Measuring Availability
| Availability | Downtime/Year | Downtime/Month | Typical Use |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Internal tools |
| 99.9% | 8.76 hours | 43.8 min | Most SaaS |
| 99.99% | 52.6 min | 4.38 min | E-commerce, banking |
| 99.999% | 5.26 min | 26 sec | Cloud infrastructure |
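The figures in the table follow from one line of arithmetic: downtime budget = period × (1 − availability). A minimal sketch, assuming a 365-day year and a month of one twelfth of that; the function name `allowed_downtime_minutes` is made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(pct)
        per_month = allowed_downtime_minutes(pct, MINUTES_PER_YEAR / 12)
        print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

For 99.99% this gives 52.56 minutes per year, which the table rounds to 52.6.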
Cost of Nines
Each additional nine cuts the allowed downtime by a factor of ten, and each one is much harder to reach than the last. Going from 99.9% to 99.99% might require multi-region deployment, which can triple your infrastructure costs.
3. HA Strategies
Redundancy
Multiple instances of everything. No single point of failure.
2+ app servers, 3+ database replicas, redundant load balancers
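To see why redundancy helps, it can be useful to put numbers on it. The sketch below assumes that instances fail independently and that a single healthy instance can carry the load. Real systems rarely meet both assumptions fully (shared power, network, or a bad deploy can take out several instances at once), so treat the result as an upper bound. The function name is hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances.

    Assumes independent failures and that one healthy instance is enough.
    The group is down only when every instance is down at the same time.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1 - (1 - instance_availability) ** instances


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "instance(s):", f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two servers that are each 99% available give roughly 99.99% together, which is why adding a second instance is usually the first HA step.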
Failover
Automatic switch to backup when primary fails.
Database auto-failover, DNS failover, active-passive clusters
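The core of failover is simple: try the primary, and if it fails, switch to the next backup in priority order. The sketch below shows that idea at the client level. The `primary` and `standby` functions are stand-ins invented for this example; production failover (database promotion, DNS changes) involves far more than a retry loop.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[Callable[[], T]]) -> T:
    """Try each endpoint in priority order and return the first success."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return endpoint()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # remember the failure and move to the next backup
    raise RuntimeError("all endpoints failed") from last_error


def primary() -> str:
    # Hypothetical primary that is currently down.
    raise ConnectionError("primary is down")


def standby() -> str:
    # Hypothetical backup that takes over.
    return "served by standby"


if __name__ == "__main__":
    print(call_with_failover([primary, standby]))  # prints "served by standby"
```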
Health Checks
Continuously monitor components, remove unhealthy ones.
LB health checks, Kubernetes liveness probes, heartbeat monitoring
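A load balancer's health checking can be pictured as a counter per backend: after a few consecutive failed checks the backend is taken out of rotation, and a successful check brings it back. This is a simplified sketch, not how any particular load balancer is implemented; the class name, backend names, and threshold of 3 are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BackendPool:
    """Tracks consecutive health-check failures and hides unhealthy backends."""

    backends: List[str]
    failure_threshold: int = 3  # assumed value; tune per system
    _failures: Dict[str, int] = field(default_factory=dict, init=False, repr=False)

    def record_check(self, backend: str, healthy: bool) -> None:
        # One successful check resets the counter, so a recovered backend rejoins.
        self._failures[backend] = 0 if healthy else self._failures.get(backend, 0) + 1

    def healthy_backends(self) -> List[str]:
        return [b for b in self.backends
                if self._failures.get(b, 0) < self.failure_threshold]


if __name__ == "__main__":
    pool = BackendPool(["app-1", "app-2"])
    for _ in range(3):
        pool.record_check("app-2", healthy=False)
    print(pool.healthy_backends())  # prints ['app-1']
```

Requiring several consecutive failures avoids removing a backend because of one slow response.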
Geographic Distribution
Deploy across multiple availability zones or regions.
Multi-AZ RDS, cross-region replication, global load balancing
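Global load balancing boils down to a routing decision: send each user to the closest region that is still healthy. A minimal sketch of that decision, with made-up region names and latencies; real global load balancers rely on DNS, anycast, or provider-specific routing policies rather than a function like this.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-latency region that is currently healthy, or None."""
    # Regions with no health record are treated as unhealthy.
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    return min(candidates, key=latency_ms.__getitem__, default=None)


if __name__ == "__main__":
    latency = {"us-east": 20.0, "eu-west": 90.0}  # hypothetical measurements
    print(pick_region(latency, {"us-east": True, "eu-west": True}))   # prints us-east
    print(pick_region(latency, {"us-east": False, "eu-west": True}))  # prints eu-west
```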
4. Common Failure Points
Identify and eliminate single points of failure:
SPOF: Single Load Balancer
Fix: Multiple LBs with DNS failover
SPOF: Single Database
Fix: Primary + replica with auto-failover
SPOF: Single Datacenter
Fix: Multi-AZ or multi-region deployment
SPOF: Single DNS Provider
Fix: Secondary DNS provider
SPOF: Shared Dependencies
Fix: Circuit breakers, fallbacks
SPOF: Human Error
Fix: Automation, infrastructure as code
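The circuit breaker fix for shared dependencies deserves a closer look. The idea: after repeated failures, stop calling the dependency for a while and serve a fallback instead, then let a trial call through to see whether it has recovered. The sketch below is a simplified, single-threaded illustration; the class, its defaults (3 failures, 30 seconds), and the injectable `clock` parameter are assumptions for this example, not a specific library's API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to the shared dependency, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function. The fallback keeps the page working with degraded content while the dependency is down.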
5. Active-Active vs Active-Passive
Active-Active
All instances handle traffic simultaneously.
✓ Better resource utilization
✓ Little or no failover time (remaining instances absorb the traffic)
✗ More complex sync
Active-Passive
Standby waits, takes over on failure.
✓ Simpler architecture
✓ No multi-writer conflicts (the standby still needs replicated data)
✗ Wasted standby resources
✗ Failover takes time
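The difference between the two modes shows up in how requests are routed. In the sketch below, active-active spreads requests across every node, while active-passive sends everything to the first healthy node in priority order. Node names and function names are made up for illustration, and data synchronization (the hard part of active-active) is left out entirely.

```python
from itertools import cycle
from typing import Iterator, List, Set


def active_active(nodes: List[str]) -> Iterator[str]:
    """Round-robin across every node; all of them serve traffic."""
    return cycle(nodes)


def active_passive(nodes: List[str], healthy: Set[str]) -> str:
    """Send all traffic to the first healthy node in priority order."""
    for node in nodes:
        if node in healthy:
            return node
    raise RuntimeError("no healthy node available")


if __name__ == "__main__":
    router = active_active(["node-a", "node-b"])
    print([next(router) for _ in range(4)])  # prints ['node-a', 'node-b', 'node-a', 'node-b']

    nodes = ["primary", "standby"]
    print(active_passive(nodes, {"primary", "standby"}))  # prints primary
    print(active_passive(nodes, {"standby"}))             # prints standby
```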
6. Key Takeaways
1. High Availability = system stays up when parts fail.
2. Measured in nines: 99.99% = 52 min downtime/year.
3. Redundancy: no single point of failure.
4. Failover: automatic switch to backup.
5. Health checks: detect and remove unhealthy components.
6. Multi-AZ is minimum; multi-region for 99.99%+.