Module 2 — Traffic & Load Management

Health Checks & Failover

How load balancers detect unhealthy servers and automatically route traffic around them to maintain system availability.


1. The Doctor Checkup Analogy

Simple Analogy
Think of a sports team with a doctor. Before each game, the doctor checks every player. If a player is injured (unhealthy), they sit on the bench (removed from rotation). The coach only puts healthy players in the game. If a benched player recovers, the doctor clears them to play again.

The load balancer is the coach, health checks are the doctor, and servers are the players.

A health check is a mechanism by which the load balancer periodically verifies that backend servers are alive and capable of handling requests. Unhealthy servers are automatically removed from the pool.

2. Types of Health Checks

🔄 Active Health Checks
  • LB sends periodic probe requests to servers
  • Common: HTTP GET /health every 10-30 seconds
  • Checks: status code, response body, response time
  • Proactive: detects issues before user traffic is affected
  • Example: Ping /health, expect 200 OK
👁️ Passive Health Checks
  • Monitors real user traffic for errors
  • Tracks: 5xx errors, timeouts, connection failures
  • No extra network overhead
  • Reactive: detects issues from actual failures
  • Example: 3 consecutive 500 errors → unhealthy
Best Practice

Use both active and passive checks together. Active catches issues proactively, passive catches issues active checks might miss (like slow responses to complex queries).
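
To make the passive side concrete, here is a minimal sketch of how a proxy layer might watch the responses it forwards and pull a server out of rotation after a run of failures. The failure threshold and the markUnhealthy hook are illustrative assumptions, not the API of any particular load balancer.

Passive Failure Tracking (sketch, Node.js)
// Track consecutive failures per server; eject after FAILURE_THRESHOLD in a row.
const FAILURE_THRESHOLD = 3;            // e.g. 3 consecutive 5xx/timeouts → unhealthy
const consecutiveFailures = new Map();  // serverId → current failure streak

function markUnhealthy(serverId) {
  // Illustrative hook: remove the server from the rotation pool here.
  console.log(`${serverId} marked unhealthy, removing from rotation`);
}

function recordResponse(serverId, statusCode, timedOut) {
  const failed = timedOut || statusCode >= 500;
  if (!failed) {
    consecutiveFailures.set(serverId, 0);   // any success resets the streak
    return;
  }
  const streak = (consecutiveFailures.get(serverId) || 0) + 1;
  consecutiveFailures.set(serverId, streak);
  if (streak >= FAILURE_THRESHOLD) markUnhealthy(serverId);
}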

3. Health Check Simulator

Watch health checks in action. Click "Crash Server" to see failover happen automatically.

[Interactive simulator: three servers (A, B, C) start healthy and are probed every 2 seconds. Crashing a server causes its checks to fail, the load balancer marks it unhealthy and routes traffic to the remaining servers, and it rejoins the pool once its checks pass again.]

4. Health Check Configuration

Key parameters for configuring health checks:

• Interval (10-30s): How often to check each server
• Timeout (3-5s): Max wait time for a health response
• Unhealthy Threshold (2-3): Consecutive failures before marking a server unhealthy
• Healthy Threshold (2-3): Consecutive successes before marking it healthy again
• Path (/health): Endpoint to probe (for HTTP checks)
AWS ALB Health Check Config
{
  "healthCheck": {
    "path": "/health",
    "protocol": "HTTP",
    "port": 8080,
    "interval": 30,        // Check every 30 seconds
    "timeout": 5,          // Wait max 5 seconds
    "unhealthyThreshold": 2,  // 2 failures = unhealthy
    "healthyThreshold": 3,    // 3 successes = healthy
    "matcher": {
      "httpCode": "200-299"   // Accept 2xx as healthy
    }
  }
}
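
To show how these parameters interact, here is a hedged sketch of the probe loop a load balancer runs per server. It assumes Node 18+ (global fetch and AbortSignal.timeout); the config values mirror the AWS example above, and the server object shape is an assumption for illustration.

Active Probe Loop (sketch, Node.js)
// Periodically probe one server and apply the unhealthy/healthy thresholds.
const config = {
  path: '/health',
  interval: 30_000,        // probe every 30 seconds
  timeout: 5_000,          // give up on a probe after 5 seconds
  unhealthyThreshold: 2,
  healthyThreshold: 3,
};

function monitor(server) { // server: { url, healthy, failures, successes }
  setInterval(async () => {
    let ok = false;
    try {
      const res = await fetch(server.url + config.path, {
        signal: AbortSignal.timeout(config.timeout),
      });
      ok = res.status >= 200 && res.status <= 299;   // matcher: 2xx counts as healthy
    } catch {
      ok = false;                                    // timeout or connection error
    }

    if (ok) {
      server.failures = 0;
      if (!server.healthy && ++server.successes >= config.healthyThreshold) {
        server.healthy = true;                       // restored to rotation
        server.successes = 0;
      }
    } else {
      server.successes = 0;
      if (server.healthy && ++server.failures >= config.unhealthyThreshold) {
        server.healthy = false;                      // removed from rotation
        server.failures = 0;
      }
    }
  }, config.interval);
}

monitor({ url: 'http://10.0.0.5:8080', healthy: true, failures: 0, successes: 0 });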

5. Health Endpoint Design

Your /health endpoint should check critical dependencies and return appropriate status.

Good Health Endpoint (Node.js)
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkRedis(),
    disk: checkDiskSpace(),
    memory: checkMemory()
  };
  
  const healthy = Object.values(checks).every(c => c.ok);
  
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks
  });
});
✓ Good Health Response
HTTP 200 OK
{
  "status": "healthy",
  "database": "connected",
  "cache": "connected"
}
✗ Unhealthy Response
HTTP 503 Service Unavailable
{
  "status": "unhealthy",
  "database": "timeout",
  "cache": "connected"
}
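
The checkDatabase and checkRedis helpers used in the endpoint above are not shown in this module. One hedged way to write them, with an eye on the "too slow" pitfall in section 7, is to bound every dependency probe with a short timeout. The db and redis stand-ins and the 500 ms budget are assumptions, not any specific library's API.

Dependency Checks with Timeouts (sketch, Node.js)
// Stand-ins for whatever clients your app already uses (replace with real ones).
const db = { query: async (sql) => 1 };
const redis = { ping: async () => 'PONG' };

// Bound each dependency probe so a hung database can't hang the health check.
function withTimeout(promise, ms, name) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`${name} check timed out`)), ms)
  );
  return Promise.race([promise, timeout]);
}

async function checkDatabase() {
  try {
    await withTimeout(db.query('SELECT 1'), 500, 'database'); // cheap liveness query
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

async function checkRedis() {
  try {
    await withTimeout(redis.ping(), 500, 'cache');
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}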

6. Failover Strategies

What happens when a server becomes unhealthy?

1. Detection: Health check fails (e.g., 2 consecutive timeouts or 500 errors)
2. Marking: Server is marked as "unhealthy" in the LB's server pool
3. Draining: Existing connections may complete, but no new traffic is sent
4. Isolation: Server is removed from rotation; all traffic goes to healthy servers
5. Recovery: LB continues health checks. After N consecutive successes, the server is restored
Connection Draining

Good load balancers support "connection draining"—allowing in-flight requests to complete before fully removing a server. AWS calls this "deregistration delay" (default 300s).
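
On the application side, graceful shutdown is the mirror image of connection draining: on SIGTERM (for example during a deploy or scale-in), stop accepting new connections, let in-flight requests finish, then exit. Below is a minimal sketch for the Express-style app from section 5; the port and the 30-second grace period are assumptions.

Graceful Shutdown (sketch, Node.js)
const server = app.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections...');
  server.close(() => {
    // Fires once all in-flight requests have completed.
    console.log('Drained cleanly, exiting');
    process.exit(0);
  });
  // Safety net: force exit if draining outlasts the grace period.
  setTimeout(() => process.exit(1), 30_000).unref();
});

Pairing this with the load balancer's deregistration delay means new traffic stops arriving before the process exits, so requests are not dropped mid-flight.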

7. Common Pitfalls

• Health endpoint too simple: GET /health always returns 200. Instead, check the database, cache, and other critical dependencies.
• Health endpoint too slow: the health check queries the database with no timeout. Health checks should complete in under 1 second.
• Thresholds too aggressive: a single failure immediately marks the server unhealthy. Use 2-3 failures with reasonable intervals.
• No graceful shutdown: the server stops immediately on SIGTERM. Instead, stop accepting new requests, drain existing ones, then exit.

8. Key Takeaways

1. Health checks are essential for automatic failover and high availability.
2. Use both active and passive checks for comprehensive monitoring.
3. Configure reasonable thresholds: not too aggressive, not too lenient.
4. Your /health endpoint should check real dependencies (DB, cache, etc.).
5. Implement graceful shutdown to handle deploys without dropping requests.
6. Connection draining allows in-flight requests to complete during failover.