Health Checks Deep Dive
How load balancers, orchestrators, and monitors know if your service is alive and healthy.
1. The Doctor Checkup Analogy
Health checks are endpoints that report whether a service is functioning correctly. Load balancers, Kubernetes, and monitoring systems use them to make routing and restart decisions.
2. Types of Health Checks
Liveness Probe
Purpose: Is the process running?
Checks: Basic HTTP 200 response. Process is alive and not deadlocked.
On failure: Restart the container/process.
GET /health/live → 200 OK
Readiness Probe
Purpose: Can it handle traffic?
Checks: All dependencies connected. Ready to serve requests.
On failure: Remove from the load balancer. Don't restart.
GET /health/ready → 200 OK (if DB connected)
Startup Probe
Purpose: Is it still starting up?
Checks: Give slow-starting apps time to initialize.
On failure: If it is still failing after the timeout, kill and restart.
Check every 5s for up to 5 minutes
3. What to Check
Liveness (Shallow)
- Process is responding
- Not deadlocked
- Basic HTTP response
- Should NOT check dependencies
Readiness (Deep)
- Database connection pool healthy
- Redis/cache reachable
- Required config loaded
- Dependent services reachable
Common Mistake
Don't put dependency checks in the liveness probe! If the database is down, restarting your service won't fix it; you'll just create a restart loop.
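The difference between the two probes can be sketched as a small decision function. This is an illustrative model only, not how any orchestrator is actually implemented; `decideAction` and its return labels are hypothetical names chosen for this example.

```javascript
// Hypothetical model of how an orchestrator reacts to probe results.
// 'restart', 'remove-from-lb', and 'serve' are illustrative labels, not real API values.
function decideAction({ live, ready }) {
  if (!live) return 'restart';          // process is hung or dead: a restart can help
  if (!ready) return 'remove-from-lb';  // dependency problem: stop routing, keep the process
  return 'serve';
}

// Database down, shallow liveness: liveness still passes, only readiness fails.
const dbDown = decideAction({ live: true, ready: false });

// The common mistake: liveness also checks the database, so it fails too.
const dbDownWithDeepLiveness = decideAction({ live: false, ready: false });

console.log(dbDown);                 // remove-from-lb
console.log(dbDownWithDeepLiveness); // restart
```

With a deep liveness check, every pod is restarted while the database is down, and each restarted pod fails the same check again.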
4. Implementation Example
Express.js Health Endpoints
// Liveness - always return 200 if process is running
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness - check dependencies
app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    res.status(200).json({
      status: 'ready',
      checks: { database: 'ok', redis: 'ok' }
    });
  } catch (error) {
    res.status(503).json({
      status: 'not ready',
      error: error.message
    });
  }
});
5. Kubernetes Configuration
Pod Spec Example
containers:
  - name: api
    image: my-api:1.0
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3  # Restart after 3 failures
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 2  # Remove from LB after 2 failures

initialDelaySeconds: Wait before first check. Give app time to start.
periodSeconds: How often to check. 5-30s typical.
failureThreshold: Consecutive failures before action. 2-5 typical.
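To see how these settings interact, the arithmetic below estimates how long a probe must keep failing before Kubernetes acts. It is a rough estimate under simplifying assumptions (it ignores initialDelaySeconds and per-probe timeouts), and `secondsUntilAction` is a hypothetical helper written for this example.

```javascript
// Rough estimate: seconds of consecutive failures before the kubelet acts.
// Simplification: ignores initialDelaySeconds and per-probe timeouts.
function secondsUntilAction({ periodSeconds, failureThreshold }) {
  return periodSeconds * failureThreshold;
}

// Values taken from the pod spec above.
const liveness = secondsUntilAction({ periodSeconds: 10, failureThreshold: 3 });
const readiness = secondsUntilAction({ periodSeconds: 5, failureThreshold: 2 });

// Startup probe: checking every 5s for up to 5 minutes
// corresponds to failureThreshold = 300 / 5.
const startupThreshold = (5 * 60) / 5;

console.log(liveness);         // 30
console.log(readiness);        // 10
console.log(startupThreshold); // 60
```

With these numbers, a failing pod leaves the load balancer after roughly 10 seconds, while a hung process is restarted after roughly 30 seconds.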
6. Real-World Dry Run
Scenario: Database goes down during traffic
Key insight: Pods stayed alive (no restarts). They just stopped receiving traffic until ready. This is exactly what you want.
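The scenario can be replayed with a small simulation of the readiness probe. This is a sketch, not Kubernetes code: `simulateReadiness` is a hypothetical function, and it assumes the pod starts in rotation and that a single passing check is enough to bring it back (Kubernetes' default successThreshold of 1).

```javascript
// Illustrative simulation of readiness handling for one pod.
// Assumptions: the pod starts in rotation; one success re-adds it.
function simulateReadiness(results, failureThreshold) {
  let consecutiveFailures = 0;
  let inRotation = true;
  return results.map((ok) => {
    if (ok) {
      consecutiveFailures = 0;
      inRotation = true;
    } else {
      consecutiveFailures += 1;
      if (consecutiveFailures >= failureThreshold) inRotation = false;
    }
    return inRotation;
  });
}

// true = readiness check passed, false = failed (database down).
const timeline = simulateReadiness([true, false, false, false, true], 2);
console.log(JSON.stringify(timeline)); // [true,true,false,false,true]
```

The pod leaves rotation on the second consecutive failure and returns on the first passing check. Nothing in this path restarts it, because liveness is evaluated separately and keeps passing.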
7. Best Practices
Keep Liveness Probe Simple
Just return 200. No dependency checks. Avoid restart loops.
Set Reasonable Timeouts
Don't set the timeout to 1s if your check needs 2s. The probe will fail even when the service is healthy.
Use Startup Probe for Slow Apps
Some apps, such as large Java services, can take 30-60s to start. Use a startup probe to avoid premature kills.
Log Health Check Results
Log failures with details. Helps debug intermittent issues.
Return Structured Response
JSON with individual check status. Makes debugging easier.
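Several of these practices (timeouts, logging, structured responses) can be combined in one helper. The following is a minimal sketch, not a definitive implementation: `withTimeout` and `runChecks` are hypothetical names, and the checks are passed in as plain async functions, so it does not depend on any particular database or cache client.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run each named check in turn and report its individual status.
async function runChecks(checks, timeoutMs) {
  const result = { status: 'ready', checks: {} };
  for (const [name, check] of Object.entries(checks)) {
    try {
      await withTimeout(check(), timeoutMs);
      result.checks[name] = 'ok';
    } catch (error) {
      result.status = 'not ready';
      result.checks[name] = error.message;
      // Log failures with details to help debug intermittent issues.
      console.error(`health check failed: ${name}: ${error.message}`);
    }
  }
  return result;
}
```

In a readiness handler this could be called as `runChecks({ database: () => db.query('SELECT 1'), redis: () => redis.ping() }, 2000)`, responding with 200 when `status` is `'ready'` and 503 otherwise. Because the checks run one after another, the worst-case duration is the number of checks times `timeoutMs`, so the probe's own timeout needs to be longer than that.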
8. Key Takeaways
Quiz
1. Database is down. What should happen to your API pods?
2. Why separate liveness from readiness?