Metrics & Monitoring
Numbers that tell you what's happening in your system, before users complain.
1. The Car Dashboard Analogy
Metrics are numerical measurements collected over time. Monitoring is the practice of collecting, storing, and visualizing these metrics to understand system health.
2. The Four Golden Signals
Google's SRE book recommends focusing on these four metrics for any service:
Latency
How long requests take. P50, P95, P99 percentiles.
P99 latency = 250ms means 99% of requests complete in 250ms or less; the slowest 1% take longer
Traffic
How much demand. Requests per second (RPS).
1000 RPS during peak, 100 RPS at night
Errors
Rate of failed requests. 5xx errors, exceptions.
Error rate = 0.1% means 1 in 1000 requests fails
Saturation
How full is the system. CPU, memory, queue depth.
CPU at 80% = getting saturated, might slow down
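The four signals above can be derived from raw request records. The following is a minimal sketch, not a production collector: the `Request` record shape, the `golden_signals` function, and the sample data are all hypothetical, and saturation is simply passed in as a CPU fraction because it comes from the host rather than from requests. Percentiles use the nearest-rank method, one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status: int


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_busy_fraction):
    """Summarize one time window of requests into the four golden signals."""
    durations = [r.duration_ms for r in requests]
    failed = sum(1 for r in requests if r.status >= 500)  # count 5xx responses
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation_cpu": cpu_busy_fraction,
    }


# Made-up sample: 1000 requests in a 10-second window.
# 989 fast successes, 10 slow successes, 1 server error.
sample = (
    [Request(20.0, 200)] * 989
    + [Request(250.0, 200)] * 10
    + [Request(900.0, 500)]
)
signals = golden_signals(sample, window_seconds=10, cpu_busy_fraction=0.8)
print(signals)
```

With this sample, the median stays at 20ms while P99 lands on 250ms, which is why the percentiles are tracked separately.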
3. Metric Types
Counter
Monotonically increasing value. Only goes up, except when it resets to zero (for example, on process restart).
Gauge
Value that goes up and down. Current state.
Histogram
Distribution of values. Bucketed counts.
Summary
Like a histogram, but calculates percentiles on the client.
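To make the differences concrete, here is a small pure-Python sketch of the first three types. These classes are illustrative only and are not the API of any real metrics library; the class and method names are assumptions. The histogram here stores per-bucket counts, whereas some systems (Prometheus, for one) expose cumulative buckets. A summary is omitted; it would keep enough state on the client to report percentiles directly.

```python
import bisect


class Counter:
    """Monotonically increasing total, e.g. requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Current value that can move in either direction, e.g. open connections."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations into buckets by upper bound, plus a running sum."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot catches observations above the largest bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1


requests_total = Counter()
open_connections = Gauge()
latency_ms = Histogram([50, 100, 250, 500])

for duration in [12, 48, 100, 230, 900]:
    requests_total.inc()
    latency_ms.observe(duration)

# Two connections open, one closes: the gauge goes up and then down.
open_connections.inc()
open_connections.inc()
open_connections.dec()

print(requests_total.value, open_connections.value, latency_ms.bucket_counts)
```

The counter only ever records "how many so far", the gauge records "how many right now", and the histogram keeps enough shape to estimate percentiles later on the server side.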
4. Real-World Dry Run
Scenario: Users report the site is slow
Without metrics, this would be hours of log searching. With metrics, it's minutes of dashboard investigation.
5. Common Tools
Prometheus
Time-series database. Pull-based metrics collection. Industry standard.
Grafana
Visualization and dashboards. Works with Prometheus, InfluxDB, etc.
Datadog
SaaS observability platform. Metrics, logs, traces in one.
CloudWatch
AWS native monitoring. Integrates with all AWS services.
New Relic
APM and infrastructure monitoring. Auto-instrumentation.
StatsD
Simple daemon for aggregating metrics. Push-based.
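The push versus pull distinction also shows up in what a single sample looks like on the wire. The sketch below formats one sample in the StatsD line style (`<name>:<value>|<type>`, sent by the application to the daemon) and one in the Prometheus text exposition style (served by the application and scraped by Prometheus). The helper names and metric names are hypothetical, only the three basic StatsD types are handled, and label-value escaping is skipped for brevity.

```python
def statsd_line(name, value, metric_type):
    """Format one StatsD payload line: <name>:<value>|<type>."""
    allowed = {"c", "g", "ms"}  # counter, gauge, timer
    if metric_type not in allowed:
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def prometheus_line(name, labels, value):
    """Format one sample in the Prometheus text exposition style.

    Simplification: label values are not escaped, so they must not contain
    quotes, backslashes, or newlines.
    """
    if labels:
        rendered = ",".join(
            f'{key}="{val}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"


# Hypothetical metric names, for illustration only.
pushed = statsd_line("orders.created", 1, "c")
scraped = prometheus_line(
    "http_requests_total",
    {"method": "POST", "endpoint": "/api/orders"},
    1027,
)
print(pushed)
print(scraped)
```

In the push model the application sends a line like the first one each time something happens; in the pull model it keeps a running total and reports it whenever it is scraped.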
6. Best Practices
Use Percentiles, Not Averages
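P99 catches the bad cases. Average hides outliers. Three requests taking 1000ms, 10ms, and 10ms average 340ms, a number that describes none of them: it hides both the 1000ms outlier and the typical 10ms case.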
Add Dimensions/Labels
Break each metric down per endpoint, per host, per region, for example: request_latency{endpoint='/api/orders', method='POST'}
Set Baseline Alerts
Know what's normal. Alert on deviation, not absolute values.
Don't Over-Instrument
High cardinality (too many unique combinations of label values) = expensive storage and slow queries. Avoid unbounded labels such as user IDs.
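Three of the practices above come down to simple arithmetic, sketched here under stated assumptions. The function names, the 50% tolerance, and the label counts are made up for illustration; the latency values are the three requests from the percentile example. The cardinality estimate is a worst case that assumes every label combination actually occurs.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def series_count(distinct_values_per_label):
    """Worst-case number of time series: product of distinct values per label."""
    return math.prod(distinct_values_per_label.values())


def deviates_from_baseline(current, baseline, tolerance=0.5):
    """True when current differs from baseline by more than tolerance (a fraction)."""
    return abs(current - baseline) > tolerance * baseline


# Percentiles vs. averages, using the three requests from the text.
latencies_ms = [1000, 10, 10]
average_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

# Cardinality: a bounded label set vs. the same set plus a per-user label.
bounded = series_count({"endpoint": 50, "method": 4, "region": 5})
with_user_id = series_count(
    {"endpoint": 50, "method": 4, "region": 5, "user_id": 10_000}
)

# Baseline alerting: normal traffic is assumed to be 100 RPS at this hour.
traffic_dropped = deviates_from_baseline(current=30, baseline=100)
traffic_normal = deviates_from_baseline(current=120, baseline=100)

print(average_ms, p50_ms, p99_ms)
print(bounded, with_user_id)
print(traffic_dropped, traffic_normal)
```

The average (340ms) sits between a typical request (10ms) and the worst one (1000ms) and matches neither, while one extra unbounded label multiplies the series count from a thousand to ten million.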
7. Key Takeaways
Quiz
1. Which golden signal would detect a slow database?
2. Current number of open connections is best represented by?