Health Checks & Failover
How load balancers detect unhealthy servers and automatically route traffic around them to maintain system availability.
1. The Doctor Checkup Analogy
Think of a sports team: the load balancer is the coach, health checks are the team doctor, and servers are the players. The coach only sends in players the doctor has cleared to play.
A health check is a mechanism by which the load balancer periodically verifies that backend servers are alive and capable of handling requests. Unhealthy servers are automatically removed from the pool.
2. Types of Health Checks
Active health checks:
- The load balancer sends periodic probe requests to servers
- Common: HTTP GET /health every 10-30 seconds
- Checks: status code, response body, response time
- Proactive: detects issues before user traffic is affected
- Example: GET /health, expect 200 OK
Passive health checks:
- The load balancer monitors real user traffic for errors
- Tracks: 5xx errors, timeouts, connection failures
- No extra network overhead
- Reactive: detects issues from actual failures
- Example: 3 consecutive 500 errors → unhealthy
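
To make the active-check criteria concrete (status code, response body, response time), here is a minimal sketch of how a single probe might be sent and judged. The function names `probe` and `evaluateProbe` and the options `maxLatencyMs` and `expectedBody` are hypothetical, chosen for illustration rather than taken from any particular load balancer. It assumes Node 18+ for the global `fetch`.

```javascript
// Decide whether one probe result counts as healthy.
// `maxLatencyMs` and `expectedBody` are hypothetical options for this sketch.
function evaluateProbe(result, { maxLatencyMs = 5000, expectedBody = null } = {}) {
  if (result.error) return { ok: false, reason: 'connection failed' };
  if (result.status < 200 || result.status > 299) {
    return { ok: false, reason: `unexpected status ${result.status}` };
  }
  if (expectedBody !== null && !String(result.body ?? '').includes(expectedBody)) {
    return { ok: false, reason: 'unexpected body' };
  }
  if (result.elapsedMs > maxLatencyMs) {
    return { ok: false, reason: 'too slow' };
  }
  return { ok: true, reason: 'ok' };
}

// Send one probe request, giving up after `timeoutMs` milliseconds.
async function probe(url, timeoutMs = 5000) {
  const started = Date.now();
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    const body = await res.text();
    return { status: res.status, body, elapsedMs: Date.now() - started };
  } catch (error) {
    // Timeouts and refused connections both end up here.
    return { error, elapsedMs: Date.now() - started };
  } finally {
    clearTimeout(timer);
  }
}
```

A balancer would call something like `evaluateProbe(await probe(url))` on a timer for each backend. Keeping the judgment in a pure function makes the rules easy to test without a network.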
Use both active and passive checks together. Active checks catch issues proactively; passive checks catch issues that active checks might miss (such as slow responses to complex queries).
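
One way to combine the two is to feed every observation, whether from a probe or from real traffic, into the same per-server counter. The sketch below is an illustration under that assumption, not a description of any specific product. The class name `HealthTracker` and the option names `unhealthyThreshold` and `healthyThreshold` are hypothetical.

```javascript
// Tracks one server's health from a stream of pass/fail observations.
class HealthTracker {
  constructor({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
    this.unhealthyThreshold = unhealthyThreshold;
    this.healthyThreshold = healthyThreshold;
    this.healthy = true;
    this.streak = 0; // consecutive results that contradict the current state
  }

  // Record one observation and return the (possibly updated) health state.
  record(ok) {
    if (ok === this.healthy) {
      this.streak = 0; // observation agrees with the current state
      return this.healthy;
    }
    this.streak += 1;
    const needed = this.healthy ? this.unhealthyThreshold : this.healthyThreshold;
    if (this.streak >= needed) {
      this.healthy = !this.healthy; // flip only after enough consecutive results
      this.streak = 0;
    }
    return this.healthy;
  }
}

const tracker = new HealthTracker({ unhealthyThreshold: 3, healthyThreshold: 2 });
tracker.record(false); // passive: a user request returned 500
tracker.record(false); // active: the /health probe timed out
tracker.record(false); // third consecutive failure, server is now unhealthy
```

Requiring several consecutive results before flipping state keeps a single slow response or dropped packet from ejecting a server that is otherwise fine.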
4. Health Check Configuration
Key parameters for configuring health checks:

    {
      "healthCheck": {
        "path": "/health",
        "protocol": "HTTP",
        "port": 8080,
        "interval": 30,              // Check every 30 seconds
        "timeout": 5,                // Wait max 5 seconds
        "unhealthyThreshold": 2,     // 2 failures = unhealthy
        "healthyThreshold": 3,       // 3 successes = healthy
        "matcher": {
          "httpCode": "200-299"      // Accept 2xx as healthy
        }
      }
    }

5. Health Endpoint Design
Your /health endpoint should check critical dependencies and return an appropriate status code: 200 when every dependency is reachable, 503 otherwise.
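
    const express = require('express');
    const app = express();

    // checkDatabase, checkRedis, checkDiskSpace, and checkMemory are assumed to be
    // defined elsewhere; each returns (or resolves to) an object with a boolean `ok`.
    app.get('/health', async (req, res) => {
      try {
        // Run the asynchronous dependency checks in parallel to keep the probe fast.
        const [database, cache] = await Promise.all([checkDatabase(), checkRedis()]);
        const checks = { database, cache, disk: checkDiskSpace(), memory: checkMemory() };
        const healthy = Object.values(checks).every(c => c.ok);
        res.status(healthy ? 200 : 503).json({
          status: healthy ? 'healthy' : 'unhealthy',
          timestamp: new Date().toISOString(),
          checks
        });
      } catch (err) {
        // A check that throws is treated as a failed health check.
        res.status(503).json({ status: 'unhealthy', error: err.message });
      }
    });

HTTP 200 OK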

    {
      "status": "healthy",
      "database": "connected",
      "cache": "connected"
    }

HTTP 503 Service Unavailable

    {
      "status": "unhealthy",
      "database": "timeout",
      "cache": "connected"
    }

6. Failover Strategies
What happens when a server becomes unhealthy?
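
The simplest answer is that the balancer stops selecting that server for new requests. As a rough sketch, assuming a round-robin policy, the selection loop might skip unhealthy entries as shown here. The class name `RoundRobinPool` and its methods are hypothetical.

```javascript
// Round-robin selection that routes around unhealthy servers.
class RoundRobinPool {
  constructor(names) {
    this.servers = names.map((name) => ({ name, healthy: true }));
    this.cursor = 0;
  }

  // Called by the health checker when a server's state changes.
  setHealth(name, healthy) {
    const server = this.servers.find((s) => s.name === name);
    if (server) server.healthy = healthy;
  }

  // Return the next healthy server name, or null if none is available.
  next() {
    for (let i = 0; i < this.servers.length; i++) {
      const server = this.servers[this.cursor];
      this.cursor = (this.cursor + 1) % this.servers.length;
      if (server.healthy) return server.name;
    }
    return null;
  }
}

const pool = new RoundRobinPool(['a', 'b', 'c']);
pool.setHealth('b', false); // health check marked "b" unhealthy
const picks = [pool.next(), pool.next(), pool.next(), pool.next()];
console.log(picks.join(',')); // a,c,a,c
```

Because the server stays in the list and is only flagged, it can rejoin the rotation as soon as its health checks pass again.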
Good load balancers support "connection draining": in-flight requests are allowed to complete before a server is fully removed from the pool. AWS Elastic Load Balancing calls this "deregistration delay" (default 300 seconds).
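
The draining idea can be sketched as a small state holder: once draining starts, the server refuses new requests, and it becomes removable when it is idle or when the delay has elapsed. This is an illustrative model, not how any specific load balancer implements it. The class `DrainingServer` and its method names are hypothetical, and the 300000 ms default only mirrors the 300-second figure above.

```javascript
// Models one backend during connection draining.
class DrainingServer {
  constructor(name) {
    this.name = name;
    this.inFlight = 0;
    this.drainStartedAt = null; // null means the server still accepts traffic
  }

  // Called before routing a request; false means "pick another server".
  tryAccept() {
    if (this.drainStartedAt !== null) return false;
    this.inFlight += 1;
    return true;
  }

  // Called when a request completes.
  finish() {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }

  startDraining(nowMs) {
    if (this.drainStartedAt === null) this.drainStartedAt = nowMs;
  }

  // Safe to remove once idle, or once the drain delay has elapsed.
  canRemove(nowMs, drainDelayMs = 300000) {
    if (this.drainStartedAt === null) return false;
    return this.inFlight === 0 || nowMs - this.drainStartedAt >= drainDelayMs;
  }
}

const server = new DrainingServer('app-1');
server.tryAccept();                        // one request is in flight
server.startDraining(0);                   // draining begins at t = 0 ms
const acceptedAfterDrain = server.tryAccept();   // false: no new requests
const removableWhileBusy = server.canRemove(1000); // false: still in flight
server.finish();                           // the in-flight request completes
const removableWhenIdle = server.canRemove(2000);  // true: idle, safe to remove
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the sketch deterministic and easy to test.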