Module 7 - Observability

Health Checks Deep Dive

How load balancers, orchestrators, and monitors know if your service is alive and healthy.

1. The Doctor Checkup Analogy

Simple Analogy
A routine checkup has different levels: Are you breathing? (liveness). Can you walk and talk? (readiness). How's your blood pressure and cholesterol? (deep health). Health checks work the same way: different checks for different purposes.

Health checks are endpoints that report whether a service is functioning correctly. Load balancers, Kubernetes, and monitoring systems use them to make routing and restart decisions.

2. Types of Health Checks

Liveness Probe

Purpose: Is the process running?

Checks: The process answers a basic HTTP request with 200, showing it is alive and not deadlocked.

On failure: Restart the container/process.

GET /health/live → 200 OK

Readiness Probe

Purpose: Can it handle traffic?

Checks: All dependencies connected. Ready to serve requests.

On failure: Remove from the load balancer. Don't restart.

GET /health/ready → 200 OK (if DB connected)

Startup Probe

Purpose: Is it still starting up?

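Checks: Whether initialization has finished. Gives slow-starting apps time to initialize before they are judged by the other probes.

On failure: If the app has not started before the timeout, kill and restart it.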

Check every 5s for up to 5 minutes
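
The "every 5s for up to 5 minutes" budget maps onto two settings: the check interval and the number of failed checks tolerated. A minimal sketch of that arithmetic, assuming the Kubernetes field names periodSeconds and failureThreshold; the helper name startupProbeSettings is hypothetical:

```javascript
// Hypothetical helper: derive startup probe settings from a startup budget.
// Assumes the orchestrator gives up after `failureThreshold` consecutive
// failed checks, spaced `periodSeconds` apart.
function startupProbeSettings(maxStartupSeconds, periodSeconds) {
  if (periodSeconds <= 0 || maxStartupSeconds <= 0) {
    throw new RangeError('both arguments must be positive');
  }
  return {
    periodSeconds,
    // Round up so the budget is never shorter than requested.
    failureThreshold: Math.ceil(maxStartupSeconds / periodSeconds),
  };
}

// Check every 5s for up to 5 minutes (300s).
const settings = startupProbeSettings(5 * 60, 5);
console.log(settings); // { periodSeconds: 5, failureThreshold: 60 }
```

In other words, 60 tolerated failures at 5-second intervals gives the app 300 seconds to come up before it is killed.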

3. What to Check

Liveness (Shallow)
  • Process is responding
  • Not deadlocked
  • Basic HTTP response
  • Should NOT check dependencies
Readiness (Deep)
  • Database connection pool healthy
  • Redis/cache reachable
  • Required config loaded
  • Dependent services reachable
Common Mistake

Don't put dependency checks in the liveness probe! If the database is down, restarting your service won't fix it; you'll just create a restart loop.

4. Implementation Example

Express.js Health Endpoints
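// Assumes `db` and `redis` are client objects initialized elsewhere in the app
const express = require('express');
const app = express();

// Liveness - always return 200 if process is running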
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness - check dependencies
app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    
    // Check Redis
    await redis.ping();
    
    res.status(200).json({ 
      status: 'ready',
      checks: { database: 'ok', redis: 'ok' }
    });
  } catch (error) {
    res.status(503).json({ 
      status: 'not ready',
      error: error.message 
    });
  }
});

5. Kubernetes Configuration

Pod Spec Example
containers:
  - name: api
    image: my-api:1.0
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3    # Restart after 3 failures
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 2    # Remove from LB after 2 failures
initialDelaySeconds

Wait before first check. Give app time to start.

periodSeconds

How often to check. 5-30s typical.

failureThreshold

Consecutive failures before action. 2-5 typical.
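
Together, periodSeconds and failureThreshold determine roughly how long a failure goes unnoticed. A rough sketch of that arithmetic, applied to the values in the pod spec above; the function name is hypothetical, and the estimate ignores per-check timeouts and probe scheduling jitter:

```javascript
// Hypothetical helper: approximate seconds between a failure starting and
// the orchestrator acting on it (restart or removal from the Service).
// Ignores per-check timeouts, so treat the result as a rough estimate.
function approxDetectionSeconds({ periodSeconds, failureThreshold }) {
  return periodSeconds * failureThreshold;
}

const liveness = { periodSeconds: 10, failureThreshold: 3 };
const readiness = { periodSeconds: 5, failureThreshold: 2 };

console.log(approxDetectionSeconds(liveness));  // 30
console.log(approxDetectionSeconds(readiness)); // 10
```

With these values, a pod that cannot serve is pulled from rotation after about 10 seconds, while a hung process is restarted after about 30. Shorter periods detect problems faster at the cost of more probe traffic.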

6. Real-World Dry Run

Scenario: Database goes down during traffic

1. Database becomes unreachable
   All pods still alive (liveness passes)
2. Readiness probe fails
   The DB check fails, so /health/ready returns 503
3. K8s removes pods from Service
   Traffic stops reaching unhealthy pods
4. Load balancer routes to healthy pods
   Other clusters/regions serve traffic
5. Database recovers
   Readiness passes, pods receive traffic again

Key insight: Pods stayed alive (no restarts). They just stopped receiving traffic until ready. This is exactly what you want.
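
The scenario can be traced with a small simulation. This is only an illustrative sketch, not how Kubernetes is implemented: the function name is hypothetical, failureThreshold mirrors the readiness setting from the pod spec, and a single passing check is assumed to be enough to put the pod back in rotation.

```javascript
// Simulates how consecutive readiness results move a pod in and out of
// rotation. One success is assumed to be enough to mark the pod ready again.
function simulateReadiness(results, failureThreshold) {
  let ready = true;
  let consecutiveFailures = 0;
  const timeline = [];

  for (const ok of results) {
    if (ok) {
      consecutiveFailures = 0;
      ready = true; // back in rotation, no restart involved
    } else {
      consecutiveFailures += 1;
      if (consecutiveFailures >= failureThreshold) {
        ready = false; // removed from the load balancer
      }
    }
    timeline.push(ready ? 'serving' : 'removed');
  }
  return timeline;
}

// DB healthy, then down for three checks, then recovered.
const timeline = simulateReadiness([true, false, false, false, true], 2);
console.log(timeline.join(' -> '));
// serving -> serving -> removed -> removed -> serving
```

Note that the first failed check does not remove the pod; only the second consecutive failure does, which is what keeps a single slow response from taking a pod out of service.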

7. Best Practices

Keep Liveness Probe Simple

Just return 200. No dependency checks. Avoid restart loops.

Set Reasonable Timeouts

Don't set the timeout to 1s if your check needs 2s; that leads to false failures.
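
The same reasoning applies inside the endpoint: a readiness handler that waits indefinitely on a slow dependency will blow through whatever timeout the prober uses (Kubernetes probes have a timeoutSeconds field, 1 second by default). One possible approach, sketched here with a hypothetical withTimeout helper, is to cap each dependency check so the handler always answers within a known bound:

```javascript
// Hypothetical helper: reject if a dependency check takes longer than `ms`.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${name} check timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Example: a stand-in check that needs 50ms, run with a 10ms budget.
const slowCheck = () => new Promise((resolve) => setTimeout(resolve, 50));

withTimeout(slowCheck(), 10, 'database')
  .then(() => console.log('ok'))
  .catch((err) => console.log(err.message));
// database check timed out after 10ms
```

The per-check budget should sit comfortably below the probe's own timeout, so a slow dependency produces a clear 503 rather than a probe that simply never gets a response.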

Use Startup Probe for Slow Apps

Some apps, such as large Java services, can take 30-60s to start. Use a startup probe to avoid premature kills.

Log Health Check Results

Log failures with details. Helps debug intermittent issues.

Return Structured Response

JSON with individual check status. Makes debugging easier.
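
A minimal sketch of one way to produce such a response, assuming each dependency check is an async function that throws on failure. The runChecks name and the stand-in checks are hypothetical and not part of Express or any client library:

```javascript
// Runs every named check and reports each result individually.
// `checks` maps a name to a function that throws (or rejects) on failure.
async function runChecks(checks) {
  const names = Object.keys(checks);
  // The async wrapper turns synchronous throws into rejections too.
  const settled = await Promise.allSettled(
    names.map(async (name) => checks[name]())
  );

  const results = {};
  let allOk = true;
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      results[names[i]] = 'ok';
    } else {
      allOk = false;
      const reason = outcome.reason;
      results[names[i]] =
        `failed: ${reason instanceof Error ? reason.message : String(reason)}`;
    }
  });

  return {
    httpStatus: allOk ? 200 : 503,
    body: { status: allOk ? 'ready' : 'not ready', checks: results },
  };
}

// Stand-in checks; a real service would call its own DB and cache clients.
runChecks({
  database: async () => { throw new Error('connection refused'); },
  redis: async () => 'PONG',
}).then((report) => console.log(JSON.stringify(report)));
// {"httpStatus":503,"body":{"status":"not ready","checks":{"database":"failed: connection refused","redis":"ok"}}}
```

Inside a readiness handler, the result could be sent with res.status(report.httpStatus).json(report.body). Because every check runs even when an earlier one fails, the response shows exactly which dependency is unhealthy.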

8. Key Takeaways

1. Liveness: Is process alive? Simple check. Restart if fails.
2. Readiness: Can it handle traffic? Check dependencies. Remove from LB if fails.
3. Never put dependency checks in liveness: it causes restart loops.
4. Use startup probe for slow-starting applications.
5. Return structured JSON with individual check status.

Quiz

1. Database is down. What should happen to your API pods?

2. Why separate liveness from readiness?