Module 7 - Observability

Metrics & Monitoring

Numbers that tell you what's happening in your system, before users complain.

1. The Car Dashboard Analogy

Simple Analogy
A car dashboard shows speed, fuel, temperature, and warning lights. You glance at it to know if everything's okay. Metrics are your application's dashboard: at a glance, you should know whether the system is healthy, slow, or on fire.

Metrics are numerical measurements collected over time. Monitoring is the practice of collecting, storing, and visualizing these metrics to understand system health.

2. The Four Golden Signals

Google's SRE book recommends focusing on these four metrics for any service:

Latency

How long requests take. P50, P95, P99 percentiles.

P99 latency = 250ms means 99% of requests complete in 250ms or less
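As a quick sketch, percentiles can be computed from raw latency samples with the nearest-rank method (real monitoring backends usually estimate them from histogram buckets instead; the sample values below are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that at least p% of samples fall at or below."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 95 fast requests plus a handful of slow outliers (milliseconds)
latencies = [20] * 95 + [250, 300, 400, 800, 1200]

print(percentile(latencies, 50))  # 20  -- the typical request
print(percentile(latencies, 99))  # 800 -- the tail a user actually hits
```

Note how the median says everything is fine while the P99 exposes the tail; that gap is exactly why latency dashboards plot several percentiles.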

Traffic

How much demand. Requests per second (RPS).

1000 RPS during peak, 100 RPS at night

Errors

Rate of failed requests. 5xx errors, exceptions.

Error rate = 0.1% means 1 in 1,000 requests fails

Saturation

How full the system is. CPU, memory, queue depth.

CPU at 80% = getting saturated, might slow down

3. Metric Types

Counter

Monotonically increasing value. Only goes up (or resets to zero, e.g. on process restart).

Total requests, error count, bytes sent

Gauge

Value that goes up and down. Current state.

Current connections, CPU usage, queue size

Histogram

Distribution of values. Bucketed counts.

Request latency distribution, response size distribution

Summary

Like a histogram, but percentiles are precomputed on the client rather than at query time.

P50, P95, P99 latency
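A minimal in-memory sketch of the three core metric types (a real client library such as prometheus_client adds thread safety, labels, and an exposition format, and uses cumulative histogram buckets; the buckets here are non-cumulative for simplicity):

```python
class Counter:
    """Monotonically increasing; can only be incremented."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Current state; moves in both directions."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Bucketed counts of observed values (upper bounds in ms)."""
    def __init__(self, buckets=(50, 100, 250, 500, float("inf"))):
        self.counts = {bound: 0 for bound in buckets}
    def observe(self, value):
        for bound in self.counts:
            if value <= bound:
                self.counts[bound] += 1
                break

requests = Counter()
requests.inc()           # one request served
connections = Gauge()
connections.set(42)      # 42 connections currently open
latency = Histogram()
latency.observe(180)     # lands in the <=250ms bucket
```

The split matters for storage and querying: a counter's rate over time gives you traffic, a gauge is read as-is, and a histogram's buckets are what the backend uses to estimate percentiles.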

4. Real-World Dry Run

Scenario: Users report the site is slow

1. Check the error-rate dashboard: error rate is 0.5%, slightly elevated but not spiking.
2. Check the latency dashboard: P99 jumped from 200ms to 2000ms. Found it!
3. Check saturation metrics: database CPU at 95%. Bottleneck identified.
4. Drill into the DB metrics: query rate doubled due to a missing cache.
5. Root cause: the cache server restarted and all traffic hit the DB. Fix: warm the cache.

Without metrics, this would be hours of log searching. With metrics, it's minutes of dashboard investigation.

5. Common Tools

Prometheus

Time-series database. Pull-based metrics collection. Industry standard.

Grafana

Visualization and dashboards. Works with Prometheus, InfluxDB, etc.

Datadog

SaaS observability platform. Metrics, logs, traces in one.

CloudWatch

AWS native monitoring. Integrates with all AWS services.

New Relic

APM and infrastructure monitoring. Auto-instrumentation.

StatsD

Simple daemon for aggregating metrics. Push-based.

6. Best Practices

Use Percentiles, Not Averages

P99 catches the bad cases; the average hides outliers: (1000ms + 10ms + 10ms) / 3 = 340ms average (looks fine!)

Add Dimensions/Labels

Metric per endpoint, per host, per region, e.g. request_latency{endpoint="/api/orders", method="POST"}
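One way to picture what labels do: each unique label combination becomes its own time series. A simplified sketch (metric and label names here are illustrative, not a real client API):

```python
from collections import defaultdict

class LabeledCounter:
    """Counter that keeps one series per unique label combination."""
    def __init__(self, name, label_names):
        self.name = name
        self.label_names = label_names
        self.series = defaultdict(int)  # label tuple -> count
    def inc(self, **labels):
        key = tuple(labels[n] for n in self.label_names)
        self.series[key] += 1

requests = LabeledCounter("request_total", ["endpoint", "method"])
requests.inc(endpoint="/api/orders", method="POST")
requests.inc(endpoint="/api/orders", method="POST")
requests.inc(endpoint="/api/users", method="GET")

print(requests.series[("/api/orders", "POST")])  # 2
print(len(requests.series))                      # 2 distinct series so far
```

This is why labels make dashboards powerful: you can sum across all series for total traffic, or filter to one endpoint when debugging.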

Set Baseline Alerts

Know what's normal. Alert on deviation, not absolute values.

Don't Over-Instrument

High cardinality (too many distinct label values) = expensive storage and slow queries.
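Why cardinality explodes: the number of time series is the product of the distinct values of every label. The numbers below are illustrative:

```python
# Distinct values per label on a single metric
endpoints, methods, hosts, regions = 50, 4, 200, 10

series = endpoints * methods * hosts * regions
print(series)  # 400000 time series for ONE metric
```

Adding a user_id label to that metric would multiply the series count by the number of users, which is why per-user labels are a classic cardinality mistake.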

7. Key Takeaways

1. Four golden signals: Latency, Traffic, Errors, Saturation
2. Counter (cumulative), Gauge (current), Histogram (distribution)
3. Use percentiles (P99) instead of averages for latency
4. Add dimensions (endpoint, region) but avoid high cardinality
5. Prometheus + Grafana is the standard open-source stack

Quiz

1. Which golden signal would detect a slow database?

2. Which metric type best represents the current number of open connections?