Module 7 - Observability

Health Checks Deep Dive

How load balancers, orchestrators, and monitors know if your service is alive and healthy.

1. The Doctor Checkup Analogy

Simple Analogy

A routine checkup has different levels: Are you breathing? (liveness). Can you walk and talk? (readiness). How's your blood pressure and cholesterol? (deep health). Health checks work the same way: different checks for different purposes.

Health checks are endpoints that report whether a service is functioning correctly. They are used by load balancers, Kubernetes, and monitoring systems to make routing and restart decisions.

2. Types of Health Checks

Liveness Probe

Purpose: Is the process running?

Checks: Basic HTTP 200 response. Process is alive and not deadlocked.

On failure: Restart the container/process.

GET /health/live → 200 OK

Readiness Probe

Purpose: Can it handle traffic?

Checks: All dependencies connected. Ready to serve requests.

On failure: Remove from load balancer. Don't restart.

GET /health/ready → 200 OK (if DB connected)

Startup Probe

Purpose: Is it still starting up?

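Checks: Whether initialization has finished. Gives slow-starting apps time to initialize; in Kubernetes, liveness and readiness probes do not run until the startup probe succeeds.

On failure: If still failing after the timeout, kill and restart.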

Check every 5s for up to 5 minutes
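
To see how "every 5s for up to 5 minutes" could translate into probe settings, the arithmetic can be sketched as below. The field names periodSeconds and failureThreshold follow the Kubernetes probe fields used later in this module; both helper functions are hypothetical, and the result is an approximation that ignores initialDelaySeconds and per-probe timeouts.

```javascript
// Hypothetical helper: roughly how long a startup probe keeps trying
// before the container is killed. Ignores initialDelaySeconds and timeouts.
function startupBudgetSeconds({ periodSeconds, failureThreshold }) {
  return periodSeconds * failureThreshold;
}

// Hypothetical helper: how many failures to allow so that startup
// gets at least `budgetSeconds` at the given probe interval.
function failureThresholdFor(budgetSeconds, periodSeconds) {
  return Math.ceil(budgetSeconds / periodSeconds);
}

const threshold = failureThresholdFor(5 * 60, 5);
console.log(threshold); // 60
console.log(startupBudgetSeconds({ periodSeconds: 5, failureThreshold: threshold })); // 300
```

Under these assumptions, a startup probe with periodSeconds: 5 and failureThreshold: 60 gives the app about five minutes to come up.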

3. What to Check

Liveness (Shallow)

• Process is responding
• Not deadlocked
• Basic HTTP response
• Should NOT check dependencies

Readiness (Deep)

• Database connection pool healthy
• Redis/cache reachable
• Required config loaded
• Dependent services reachable

Common Mistake

Don't put dependency checks in the liveness probe! If the database is down, restarting your service won't fix it; you'll just create a restart loop.

4. Implementation Example

Express.js Health Endpoints

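// Assumes Express is installed. `db` and `redis` are assumed to be
// already-configured clients exposing query() and ping() respectively.
const express = require('express');
const app = express();

// Liveness - always return 200 if process is running
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness - check dependencies
app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');

    // Check Redis
    await redis.ping();

    res.status(200).json({
      status: 'ready',
      checks: { database: 'ok', redis: 'ok' }
    });
  } catch (error) {
    // 503 tells the prober this instance should not receive traffic
    res.status(503).json({
      status: 'not ready',
      error: error.message
    });
  }
});

// Port matches the probe configuration used in this module
app.listen(3000);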

5. Kubernetes Configuration

Pod Spec Example

containers:
  - name: api
    image: my-api:1.0
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3    # Restart after 3 failures
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 2    # Remove from LB after 2 failures

initialDelaySeconds

Wait before first check. Give app time to start.

periodSeconds

How often to check. 5-30s typical.

failureThreshold

Consecutive failures before action. 2-5 typical.
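
The word "consecutive" matters: a single success resets the count. The following is a minimal sketch of that counting logic, not how Kubernetes implements it; createProbeTracker is a hypothetical helper written only for illustration.

```javascript
// Hypothetical helper illustrating consecutive-failure counting.
function createProbeTracker(failureThreshold) {
  let consecutiveFailures = 0;
  return {
    // Record one probe result; returns true once the threshold is reached.
    record(success) {
      consecutiveFailures = success ? 0 : consecutiveFailures + 1;
      return consecutiveFailures >= failureThreshold;
    },
  };
}

const tracker = createProbeTracker(3);
// Two failures, one success (resets the count), then three failures in a row.
const probeResults = [false, false, true, false, false, false];
const actionTaken = probeResults.map((ok) => tracker.record(ok));
console.log(actionTaken.indexOf(true)); // 5: only the sixth probe triggers the action
```

With periodSeconds: 10 and failureThreshold: 3, acting on a failure therefore takes on the order of 30 seconds, which is the trade-off between fast detection and tolerance for transient blips.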

6. Real-World Dry Run

Scenario: Database goes down during traffic

1. Database becomes unreachable. All pods are still alive (liveness passes).

2. Readiness probe fails. The DB check makes /health/ready return 503.

3. K8s removes pods from the Service. Traffic stops reaching unhealthy pods.

4. Load balancer routes to healthy pods. If every pod in this cluster depends on the same database, other clusters/regions serve the traffic.

5. Database recovers. Readiness passes, and pods receive traffic again.

Key insight: Pods stayed alive (no restarts). They just stopped receiving traffic until ready. This is exactly what you want.
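
The scenario above can be replayed with a toy model. This is a deliberately simplified simulation, not real Kubernetes behavior: the pod objects, the probePod function, and the three-step timeline are all assumptions made for illustration.

```javascript
// Toy model: liveness ignores dependencies, readiness requires the database.
function probePod(pod, dbUp) {
  const live = pod.processRunning;
  const ready = pod.processRunning && dbUp;
  if (!live) pod.restarts += 1; // only a liveness failure triggers a restart
  return { name: pod.name, live, ready };
}

const pods = [
  { name: 'api-1', processRunning: true, restarts: 0 },
  { name: 'api-2', processRunning: true, restarts: 0 },
];

// Timeline: database up, database down, database recovered.
const servingPods = [true, false, true].map((dbUp) =>
  pods
    .map((pod) => probePod(pod, dbUp))
    .filter((status) => status.ready)
    .map((status) => status.name)
);

console.log(servingPods.map((names) => names.length)); // pods receiving traffic per step
console.log(pods.map((pod) => pod.restarts));          // restart counts after the outage
```

In this model both pods serve traffic before and after the outage, none serve during it, and the restart counts stay at zero throughout.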

7. Best Practices

Keep Liveness Probe Simple

Just return 200. No dependency checks. Avoid restart loops.

Set Reasonable Timeouts

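Don't set the probe timeout to 1s if your check needs 2s; every probe will be counted as a failure even though the service is healthy. Note that the Kubernetes timeoutSeconds field defaults to 1.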

Use Startup Probe for Slow Apps

Some apps (large Java services, for example) can take 30-60s or more to start. Use a startup probe so the liveness probe doesn't kill them before they finish initializing.

Log Health Check Results

Log failures with details. Helps debug intermittent issues.

Return Structured Response

JSON with individual check status. Makes debugging easier.
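
One way to get per-check status, and to stop a hung dependency from stalling the whole endpoint, is to run each check under its own timeout and collect the results. The sketch below uses only standard JavaScript promises; withTimeout and runChecks are hypothetical helpers, and the check functions passed in stand in for real calls such as db.query('SELECT 1') or redis.ping().

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every check (even if one fails) and report each result separately.
async function runChecks(checks, timeoutMs = 1000) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), timeoutMs);
        return [name, { status: 'ok' }];
      } catch (error) {
        return [name, { status: 'fail', error: error.message }];
      }
    })
  );
  const healthy = entries.every(([, result]) => result.status === 'ok');
  return {
    httpStatus: healthy ? 200 : 503,
    body: { status: healthy ? 'ready' : 'not ready', checks: Object.fromEntries(entries) },
  };
}

// Usage with stand-in checks: the database passes, Redis fails.
runChecks({
  database: async () => {},
  redis: async () => { throw new Error('connection refused'); },
}, 500).then((result) => console.log(JSON.stringify(result, null, 2)));
```

A readiness handler could then send result.body with result.httpStatus, so a failing response names the dependency that caused it.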

8. Key Takeaways

1. Liveness: Is the process alive? Simple check. Restart if it fails.

2. Readiness: Can it handle traffic? Check dependencies. Remove from LB if it fails.

3. Never put dependency checks in liveness; they cause restart loops.

4. Use a startup probe for slow-starting applications.

5. Return structured JSON with individual check status.

Quiz

1. Database is down. What should happen to your API pods?

2. Why separate liveness from readiness?