Module 2 — Traffic & Load Management

Health Checks & Failover

How load balancers detect unhealthy servers and automatically route traffic around them to maintain system availability.


1. The Doctor Checkup Analogy

Simple Analogy
Think of a sports team with a doctor. Before each game, the doctor checks every player. If a player is injured (unhealthy), they sit on the bench (removed from rotation). The coach only puts healthy players in the game. If a benched player recovers, the doctor clears them to play again.

The load balancer is the coach, health checks are the doctor, and servers are the players.

A health check is a mechanism by which the load balancer periodically verifies that backend servers are alive and capable of handling requests. Unhealthy servers are automatically removed from the pool.

2. Types of Health Checks

🔄 Active Health Checks
  • LB sends periodic probe requests to servers
  • Common: HTTP GET /health every 10-30 seconds
  • Checks: status code, response body, response time
  • Proactive: detects issues before user traffic is affected
  • Example: Ping /health, expect 200 OK
👁️ Passive Health Checks
  • Monitors real user traffic for errors
  • Tracks: 5xx errors, timeouts, connection failures
  • No extra network overhead
  • Reactive: detects issues from actual failures
  • Example: 3 consecutive 500 errors → unhealthy
Best Practice

Use both active and passive checks together. Active catches issues proactively, passive catches issues active checks might miss (like slow responses to complex queries).
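
To make the passive side concrete, here is a minimal sketch of how a proxy layer might watch the responses it forwards and pull a server out of rotation after a run of failures. The failure threshold and the markUnhealthy hook are illustrative assumptions, not the API of any particular load balancer.

Passive Failure Tracking (sketch, Node.js)
// Track consecutive failures per server; eject after FAILURE_THRESHOLD in a row.
const FAILURE_THRESHOLD = 3;            // e.g. 3 consecutive 5xx/timeouts → unhealthy
const consecutiveFailures = new Map();  // serverId → current failure streak

function markUnhealthy(serverId) {
  // Illustrative hook: remove the server from the rotation pool here.
  console.log(`${serverId} marked unhealthy, removing from rotation`);
}

function recordResponse(serverId, statusCode, timedOut) {
  const failed = timedOut || statusCode >= 500;
  if (!failed) {
    consecutiveFailures.set(serverId, 0);   // any success resets the streak
    return;
  }
  const streak = (consecutiveFailures.get(serverId) || 0) + 1;
  consecutiveFailures.set(serverId, streak);
  if (streak >= FAILURE_THRESHOLD) markUnhealthy(serverId);
}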

3. Health Check Simulator

Watch health checks in action. Click "Crash Server" to see failover happen automatically.

[Interactive simulator: three servers (A, B, C) start healthy and are probed every 2 seconds. Crashing a server causes its checks to fail, the load balancer marks it unhealthy and routes traffic to the remaining servers, and it rejoins the pool once its checks pass again.]

4. Health Check Configuration

Key parameters for configuring health checks:

• Interval (10-30s): How often to check each server
• Timeout (3-5s): Max wait time for a health response
• Unhealthy Threshold (2-3): Consecutive failures before marking a server unhealthy
• Healthy Threshold (2-3): Consecutive successes before marking it healthy again
• Path (/health): Endpoint to probe (for HTTP checks)
AWS ALB Health Check Config
{
  "healthCheck": {
    "path": "/health",
    "protocol": "HTTP",
    "port": 8080,
    "interval": 30,        // Check every 30 seconds
    "timeout": 5,          // Wait max 5 seconds
    "unhealthyThreshold": 2,  // 2 failures = unhealthy
    "healthyThreshold": 3,    // 3 successes = healthy
    "matcher": {
      "httpCode": "200-299"   // Accept 2xx as healthy
    }
  }
}
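
To show how these parameters interact, here is a hedged sketch of the probe loop a load balancer runs per server. It assumes Node 18+ (global fetch and AbortSignal.timeout); the config values mirror the AWS example above, and the server object shape is an assumption for illustration.

Active Probe Loop (sketch, Node.js)
// Periodically probe one server and apply the unhealthy/healthy thresholds.
const config = {
  path: '/health',
  interval: 30_000,        // probe every 30 seconds
  timeout: 5_000,          // give up on a probe after 5 seconds
  unhealthyThreshold: 2,
  healthyThreshold: 3,
};

function monitor(server) { // server: { url, healthy, failures, successes }
  setInterval(async () => {
    let ok = false;
    try {
      const res = await fetch(server.url + config.path, {
        signal: AbortSignal.timeout(config.timeout),
      });
      ok = res.status >= 200 && res.status <= 299;   // matcher: 2xx counts as healthy
    } catch {
      ok = false;                                    // timeout or connection error
    }

    if (ok) {
      server.failures = 0;
      if (!server.healthy && ++server.successes >= config.healthyThreshold) {
        server.healthy = true;                       // restored to rotation
        server.successes = 0;
      }
    } else {
      server.successes = 0;
      if (server.healthy && ++server.failures >= config.unhealthyThreshold) {
        server.healthy = false;                      // removed from rotation
        server.failures = 0;
      }
    }
  }, config.interval);
}

monitor({ url: 'http://10.0.0.5:8080', healthy: true, failures: 0, successes: 0 });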

5. Health Endpoint Design

Your /health endpoint should check critical dependencies and return appropriate status.

Good Health Endpoint (Node.js)
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkRedis(),
    disk: checkDiskSpace(),
    memory: checkMemory()
  };
  
  const healthy = Object.values(checks).every(c => c.ok);
  
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks
  });
});
✓ Good Health Response
HTTP 200 OK
{
  "status": "healthy",
  "database": "connected",
  "cache": "connected"
}
✗ Unhealthy Response
HTTP 503 Service Unavailable
{
  "status": "unhealthy",
  "database": "timeout",
  "cache": "connected"
}
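
The checkDatabase and checkRedis helpers used in the endpoint above are not shown in this module. One hedged way to write them, with an eye on the "too slow" pitfall in section 7, is to bound every dependency probe with a short timeout. The db and redis stand-ins and the 500 ms budget are assumptions, not any specific library's API.

Dependency Checks with Timeouts (sketch, Node.js)
// Stand-ins for whatever clients your app already uses (replace with real ones).
const db = { query: async (sql) => 1 };
const redis = { ping: async () => 'PONG' };

// Bound each dependency probe so a hung database can't hang the health check.
function withTimeout(promise, ms, name) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`${name} check timed out`)), ms)
  );
  return Promise.race([promise, timeout]);
}

async function checkDatabase() {
  try {
    await withTimeout(db.query('SELECT 1'), 500, 'database'); // cheap liveness query
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

async function checkRedis() {
  try {
    await withTimeout(redis.ping(), 500, 'cache');
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}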

6. Failover Strategies

What happens when a server becomes unhealthy?

1. Detection: Health check fails (e.g., 2 consecutive timeouts or 500 errors)
2. Marking: Server is marked as "unhealthy" in the LB's server pool
3. Draining: Existing connections may complete, but no new traffic is sent
4. Isolation: Server is removed from rotation; all traffic goes to healthy servers
5. Recovery: LB continues health checks. After N consecutive successes, the server is restored
Connection Draining

Good load balancers support "connection draining"—allowing in-flight requests to complete before fully removing a server. AWS calls this "deregistration delay" (default 300s).
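
On the application side, graceful shutdown is the mirror image of connection draining: on SIGTERM (for example during a deploy or scale-in), stop accepting new connections, let in-flight requests finish, then exit. Below is a minimal sketch for the Express-style app from section 5; the port and the 30-second grace period are assumptions.

Graceful Shutdown (sketch, Node.js)
const server = app.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections...');
  server.close(() => {
    // Fires once all in-flight requests have completed.
    console.log('Drained cleanly, exiting');
    process.exit(0);
  });
  // Safety net: force exit if draining outlasts the grace period.
  setTimeout(() => process.exit(1), 30_000).unref();
});

Pairing this with the load balancer's deregistration delay means new traffic stops arriving before the process exits, so requests are not dropped mid-flight.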

7. Common Pitfalls

• Health endpoint too simple: GET /health always returns 200. Instead, check the database, cache, and other critical dependencies.
• Health endpoint too slow: the health check queries the database with no timeout. Health checks should complete in under 1 second.
• Thresholds too aggressive: a single failure immediately marks the server unhealthy. Use 2-3 failures with reasonable intervals.
• No graceful shutdown: the server stops immediately on SIGTERM. Instead, stop accepting new requests, drain existing ones, then exit.

8. Key Takeaways

1. Health checks are essential for automatic failover and high availability.
2. Use both active and passive checks for comprehensive monitoring.
3. Configure reasonable thresholds: not too aggressive, not too lenient.
4. Your /health endpoint should check real dependencies (DB, cache, etc.).
5. Implement graceful shutdown to handle deploys without dropping requests.
6. Connection draining allows in-flight requests to complete during failover.