Module 4 — Scaling

High Availability

Design systems that stay up even when components fail. The goal: 99.99% uptime.

1. The Hospital Analogy

💡 Simple Analogy
A hospital can't close when one doctor is sick. It has backup doctors (redundancy), multiple operating rooms (no single point of failure), and emergency protocols (failover).

High availability means your system keeps running even when parts fail.

2. Measuring Availability

Availability   Downtime/Year   Downtime/Month   Typical Use
99%            3.65 days       7.3 hours        Internal tools
99.9%          8.76 hours      43.8 min         Most SaaS
99.99%         52.6 min        4.38 min         E-commerce, banking
99.999%        5.26 min        26 sec           Cloud infrastructure
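
The downtime figures in the table follow directly from the availability percentage. The sketch below reproduces them, assuming a 365-day year and an average month of 730 hours; the function name `allowed_downtime` is illustrative and not part of any library.

```python
from datetime import timedelta
from typing import Tuple

DAYS_PER_YEAR = 365               # assumption: non-leap year
HOURS_PER_MONTH = 365 * 24 / 12   # assumption: average month of 730 hours


def allowed_downtime(availability_pct: float) -> Tuple[timedelta, timedelta]:
    """Return the (per-year, per-month) downtime budget for an availability target."""
    unavailable = 1 - availability_pct / 100
    per_year = timedelta(days=DAYS_PER_YEAR * unavailable)
    per_month = timedelta(hours=HOURS_PER_MONTH * unavailable)
    return per_year, per_month


for pct in (99, 99.9, 99.99, 99.999):
    year, month = allowed_downtime(pct)
    print(f"{pct}%: {year} per year, {month} per month")
```

Each extra nine divides the downtime budget by ten, which is why the budget shrinks from days to seconds across the four rows.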

Cost of Nines

Each additional nine cuts the allowed downtime by a factor of ten and is disproportionately harder to achieve. Going from 99.9% to 99.99% might require multi-region deployment, which can roughly triple your infrastructure costs.

3. HA Strategies

Redundancy

Multiple instances of everything. No single point of failure.

2+ app servers
3+ database replicas
Redundant load balancers
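
To see why redundancy raises availability, the following sketch applies the standard probability formulas for redundant and chained components. It assumes failures are independent, which is optimistic in practice (a shared power feed or a bad deploy can take out every replica at once). The function names are illustrative.

```python
def parallel_availability(single: float, replicas: int) -> float:
    """Availability of redundant replicas where any one healthy replica suffices.

    Assumes each replica fails independently of the others.
    """
    return 1 - (1 - single) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical stack: each tier built from 99%-available instances.
app_tier = parallel_availability(0.99, replicas=2)
db_tier = parallel_availability(0.99, replicas=3)
print(f"whole stack: {serial_availability(app_tier, db_tier):.6f}")
```

Under the independence assumption, two 99% servers in parallel reach 99.99%, while two 99% components in series drop to about 98%. This is why every tier in the request path needs its own redundancy.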

Failover

Automatic switch to backup when primary fails.

Database auto-failover
DNS failover
Active-passive clusters
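
A minimal sketch of the failover idea, assuming a two-node cluster whose health flags are set by some external monitor. The `Node` and `ActivePassiveCluster` classes are hypothetical and ignore real-world concerns such as data replication and split-brain protection.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActivePassiveCluster:
    """Routes traffic to one active node and promotes the standby on failure."""

    def __init__(self, primary: Node, standby: Node) -> None:
        self.active = primary
        self.standby = standby

    def route(self) -> Node:
        """Return the node that should receive traffic, failing over if needed."""
        if not self.active.healthy:
            if not self.standby.healthy:
                raise RuntimeError("no healthy node available")
            # Promote the standby; the failed node becomes the new standby.
            self.active, self.standby = self.standby, self.active
        return self.active


primary, standby = Node("db-primary"), Node("db-standby")
cluster = ActivePassiveCluster(primary, standby)
primary.healthy = False          # simulate a primary outage
print(cluster.route().name)      # traffic now goes to the promoted standby
```

Real failover also has to decide who is allowed to declare the primary dead, which is usually handled by a quorum or an external coordinator rather than a single flag.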

Health Checks

Continuously monitor components, remove unhealthy ones.

LB health checks
Kubernetes liveness probes
Heartbeat monitoring
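
One common pattern is to remove a backend only after several consecutive failed probes, so a single dropped request does not eject a healthy server. The sketch below illustrates that pattern; the `HealthChecker` class, its `probe` callback, and the threshold value are all assumptions made for the example.

```python
from typing import Callable, Dict, List


class HealthChecker:
    """Tracks consecutive probe failures and reports which backends stay in rotation."""

    def __init__(self, probe: Callable[[str], bool], failure_threshold: int = 3) -> None:
        self.probe = probe                          # returns True if the backend responds
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}

    def run_once(self, backends: List[str]) -> List[str]:
        """Probe every backend once and return those still in rotation."""
        in_rotation = []
        for backend in backends:
            if self.probe(backend):
                self.failures[backend] = 0          # any success resets the counter
            else:
                self.failures[backend] = self.failures.get(backend, 0) + 1
            if self.failures[backend] < self.failure_threshold:
                in_rotation.append(backend)
        return in_rotation


# Hypothetical probe: "app-2" never responds.
checker = HealthChecker(probe=lambda name: name != "app-2", failure_threshold=2)
for _ in range(2):
    healthy = checker.run_once(["app-1", "app-2"])
print(healthy)
```

In a real system `probe` would issue an HTTP or TCP check with a timeout, and `run_once` would be called on a fixed interval.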

Geographic Distribution

Deploy across multiple availability zones or regions.

Multi-AZ RDS
Cross-region replication
Global load balancing

4. Common Failure Points

Identify and eliminate single points of failure:

SPOF: Single Load Balancer
Fix: Multiple LBs with DNS failover
SPOF: Single Database
Fix: Primary + replica with auto-failover
SPOF: Single Datacenter
Fix: Multi-AZ or multi-region deployment
SPOF: Single DNS Provider
Fix: Secondary DNS provider
SPOF: Shared Dependencies
Fix: Circuit breakers, fallbacks
SPOF: Human Error
Fix: Automation, infrastructure as code
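
The circuit breaker fix for shared dependencies can be sketched as follows. This is a simplified illustration, not a production implementation: the `CircuitBreaker` class, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period and serves a fallback."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to stay open before retrying
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()           # open: skip the dependency entirely
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0                   # success closes the breaker
        return result
```

A caller would wrap each request to the shared dependency, for example `breaker.call(fetch_prices, lambda: cached_prices)`, where both callables are hypothetical. While the breaker is open, requests return the fallback immediately instead of waiting on timeouts.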

5. Active-Active vs Active-Passive

Active-Active

All instances handle traffic simultaneously.

✓ Better resource utilization
✓ No failover time
✗ More complex sync

Active-Passive

Standby waits, takes over on failure.

✓ Simpler architecture
✓ No write conflicts (only one node serves traffic at a time)
✗ Wasted standby resources
✗ Failover takes time
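
The utilization trade-off can be made concrete with a small calculation. The sketch below assumes identical nodes and a design goal of surviving a given number of node failures at full load; `usable_capacity_fraction` is an illustrative name.

```python
def usable_capacity_fraction(nodes: int, tolerated_failures: int = 1) -> float:
    """Fraction of total fleet capacity that can carry traffic while still
    leaving enough headroom to absorb `tolerated_failures` node losses."""
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - tolerated_failures) / nodes


# Active-passive pair: the standby idles, so half the capacity is usable.
print(usable_capacity_fraction(2))
# Active-active with four nodes: three quarters of the capacity is usable.
print(usable_capacity_fraction(4))
```

Under these assumptions, active-active only improves utilization once there are more than two nodes; a two-node active-active pair still has to keep each node at or below half load to survive losing the other.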

6. Key Takeaways

1. High Availability = system stays up when parts fail.
2. Measured in nines: 99.99% = about 52.6 min downtime/year.
3. Redundancy: no single point of failure.
4. Failover: automatic switch to backup.
5. Health checks: detect and remove unhealthy components.
6. Multi-AZ is the minimum; go multi-region for 99.99%+.