Module 7 - Observability

Alerting Strategies

Wake up for real problems. Sleep through the noise.

1. The Fire Alarm Analogy

Simple Analogy
A smoke detector that goes off every time you cook is useless: you'll ignore it or disconnect it. A good smoke detector only alarms when there's real danger. Your alerting system should work the same way: alert on real problems, not noise.

Alerting is the practice of automatically notifying a human when something needs attention. The goal is to catch real issues early while minimizing the false alarms that cause alert fatigue.

2. Alert Fatigue is Real

• 80% of alerts are ignored in high-alert environments
• 3 AM: the pager wakes you. False alarm. Now you can't sleep.
• 100+ alerts per day leads to ignoring all of them
• 1-2 actionable alerts per shift is the goal

Every alert should require human action. If it doesn't, it shouldn't be an alert; it should be a log or a dashboard metric.

3. Alert Severity Levels

P1 - Critical

Service down. Users impacted. Revenue loss. Immediate response.

Page on-call. All hands. 5-min response.

P2 - High

Degraded service. Some users affected. Could escalate.

Page on-call during business hours. 30-min response.

P3 - Medium

Issue needs attention but service is functional.

Ticket created. Fix within 24 hours.

P4 - Low

Minor issue. No user impact. Cleanup or tech debt.

Ticket created. Fix in next sprint.
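A minimal sketch of how a severity table like this can drive routing decisions. The policy values mirror the P1-P4 levels above; route_alert and the policy dict are hypothetical helpers for illustration, not any vendor's API.

```python
from datetime import datetime

# Hypothetical response policy keyed by severity (mirrors the P1-P4 levels above).
RESPONSE_POLICY = {
    "P1": {"page": "always",         "response_target_min": 5},
    "P2": {"page": "business_hours", "response_target_min": 30},
    "P3": {"page": "never",          "response_target_min": 24 * 60},  # fix within 24h
    "P4": {"page": "never",          "response_target_min": None},     # next sprint
}

def route_alert(severity: str, now: datetime) -> str:
    """Decide whether an alert pages on-call or just opens a ticket."""
    policy = RESPONSE_POLICY[severity]
    business_hours = 9 <= now.hour < 17 and now.weekday() < 5
    if policy["page"] == "always":
        return "page-oncall"
    if policy["page"] == "business_hours" and business_hours:
        return "page-oncall"
    return "create-ticket"

print(route_alert("P2", datetime(2024, 6, 3, 14, 0)))  # Monday 2 PM -> page-oncall
print(route_alert("P2", datetime(2024, 6, 3, 3, 0)))   # Monday 3 AM -> create-ticket
```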

4. What to Alert On

Alert On (Symptoms)
  • Error rate > 1% for 5 minutes
  • P99 latency > 2s for 10 minutes
  • Zero successful requests for 1 minute
  • Disk space < 10%
Don't Alert On (Causes)
  • Single node down (if redundant)
  • CPU spike lasting 30 seconds
  • One failed request
  • Deployment in progress

Alert on symptoms (users are affected), not causes (a server is slow). Users don't care if one node is down; they care whether the service is slow.
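The symptom thresholds above translate naturally into declarative rules plus a tiny check. A minimal sketch, assuming a homegrown rule format; the field and metric names are illustrative, not Prometheus or Datadog syntax.

```python
# Symptom-based alert rules: each describes user-visible impact, a threshold,
# and how long the condition must hold before firing.
SYMPTOM_RULES = [
    {"name": "HighErrorRate",  "metric": "error_rate",      "op": ">",  "threshold": 0.01, "for_seconds": 5 * 60},
    {"name": "SlowRequests",   "metric": "latency_p99_s",   "op": ">",  "threshold": 2.0,  "for_seconds": 10 * 60},
    {"name": "NoTraffic",      "metric": "success_rps",     "op": "==", "threshold": 0.0,  "for_seconds": 60},
    {"name": "DiskAlmostFull", "metric": "disk_free_ratio", "op": "<",  "threshold": 0.10, "for_seconds": 5 * 60},
]

def breaches(rule: dict, value: float) -> bool:
    """True if the current metric value violates the rule's threshold."""
    if rule["op"] == ">":
        return value > rule["threshold"]
    if rule["op"] == "<":
        return value < rule["threshold"]
    return value == rule["threshold"]

print(breaches(SYMPTOM_RULES[0], 0.03))  # True: 3% error rate breaches the 1% rule
```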

5. Reducing Alert Noise

Add Duration Thresholds

CPU > 90% for 30 seconds, not just CPU > 90%. Prevents alerting on brief spikes.
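A sketch of enforcing a duration threshold: the condition must stay true for the whole window before the alert fires, and any dip resets the timer. The class and names are illustrative.

```python
import time

class DurationThreshold:
    """Fire only if the condition has been continuously true for `for_seconds`."""

    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.breach_started = None  # when the condition first became true

    def update(self, condition_true: bool, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        if not condition_true:
            self.breach_started = None      # brief spike ended: reset, no alert
            return False
        if self.breach_started is None:
            self.breach_started = now       # spike just started: arm the timer
        return (now - self.breach_started) >= self.for_seconds

# CPU > 90% for 30 seconds, not just CPU > 90%:
cpu_check = DurationThreshold(for_seconds=30)
print(cpu_check.update(True, now=0))    # False: just started
print(cpu_check.update(True, now=10))   # False: only 10s sustained
print(cpu_check.update(True, now=35))   # True: sustained for 35s
```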

Use Percentiles

Alert if P99 > 2s, not if any request > 2s. Tolerates occasional slow requests.
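A sketch of alerting on a percentile over a recent window rather than on individual requests; it uses a simple nearest-rank calculation, not any specific monitoring product's math.

```python
def percentile(values: list[float], pct: float) -> float:
    """Approximate nearest-rank percentile over a window of observations."""
    ordered = sorted(values)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# One slow request (8s) among a thousand fast ones does not breach P99 > 2s.
latencies = [0.2] * 995 + [0.4] * 4 + [8.0]
p99 = percentile(latencies, 99)
print(p99, p99 > 2.0)   # 0.2 False: the single outlier does not fire the alert
```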

Group Related Alerts

Database down causes 10 services to fail. Send 1 alert, not 10.

Auto-resolve Transient Issues

If the issue resolves in 2 minutes, log it but don't page.

Time-based Suppression

Don't alert during deployments or maintenance windows.
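A sketch of time-based suppression, assuming maintenance and deploy windows are tracked as simple start/end pairs; the window list and helper names are illustrative.

```python
from datetime import datetime

# Illustrative list of suppression windows (deploys, planned maintenance).
MAINTENANCE_WINDOWS = [
    (datetime(2024, 6, 3, 2, 0), datetime(2024, 6, 3, 3, 0)),   # nightly deploy
]

def suppressed(now: datetime) -> bool:
    """True if alerts should be muted because we are inside a known window."""
    return any(start <= now < end for start, end in MAINTENANCE_WINDOWS)

def maybe_page(alert_name: str, now: datetime) -> None:
    if suppressed(now):
        print(f"[muted] {alert_name} during maintenance window")
        return
    print(f"[PAGE] {alert_name}")

maybe_page("HighErrorRate", datetime(2024, 6, 3, 2, 30))  # muted
maybe_page("HighErrorRate", datetime(2024, 6, 3, 4, 0))   # pages
```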

6. Real-World Dry Run

Scenario: Designing alerts for an e-commerce checkout

Alert                  | Condition                     | Severity
Checkout Down          | Success rate = 0% for 1 min   | P1
High Error Rate        | Error rate > 5% for 5 min     | P1
Checkout Slow          | P99 latency > 3s for 10 min   | P2
Elevated Errors        | Error rate > 1% for 10 min    | P2
Payment Gateway Errors | Stripe errors > 2% for 5 min  | P3
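The same table can be kept in version control as data. A minimal sketch in the illustrative rule format used earlier; metric names such as checkout_error_rate are assumptions, not metrics any specific tool provides.

```python
# The dry-run table above, encoded as rule definitions (illustrative format).
CHECKOUT_ALERTS = [
    {"name": "Checkout Down",          "expr": "checkout_success_rate == 0", "for": "1m",  "severity": "P1"},
    {"name": "High Error Rate",        "expr": "checkout_error_rate > 0.05", "for": "5m",  "severity": "P1"},
    {"name": "Checkout Slow",          "expr": "checkout_latency_p99_s > 3", "for": "10m", "severity": "P2"},
    {"name": "Elevated Errors",        "expr": "checkout_error_rate > 0.01", "for": "10m", "severity": "P2"},
    {"name": "Payment Gateway Errors", "expr": "stripe_error_rate > 0.02",   "for": "5m",  "severity": "P3"},
]
```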

7. Common Tools

PagerDuty

Industry standard on-call management. Escalations, schedules.

Opsgenie

Atlassian's alerting. Integrates with Jira, Confluence.

Prometheus Alertmanager

Open source. Groups, dedupes, routes alerts.

Datadog Monitors

SaaS. Anomaly detection, composite conditions.

Grafana Alerting

Built into Grafana. Alert on dashboard panels.

Slack/Teams

For P3/P4 alerts. Not for paging.
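For low-severity notifications a chat message is usually enough. A minimal sketch posting to a Slack incoming webhook, assuming the requests library is installed; the webhook URL is a placeholder you would generate in Slack, and the alert details are made up for illustration.

```python
import requests

# Placeholder: create an incoming webhook in Slack and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_low_severity(alert_name: str, severity: str, detail: str) -> None:
    """Post P3/P4 alerts to a channel instead of paging anyone."""
    payload = {"text": f":warning: [{severity}] {alert_name}: {detail}"}
    resp = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
    resp.raise_for_status()

notify_low_severity("DiskAlmostFull", "P4", "replica at 12% free disk")
```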

8. Key Takeaways

1. Every alert should be actionable; if not, make it a log
2. Alert on symptoms (user impact), not causes (node down)
3. Use duration thresholds to prevent alerts on brief, transient spikes
4. Define clear severity levels (P1-P4) with response expectations
5. Alert fatigue is real: aim for 1-2 actionable alerts per shift

Quiz

1. One server (of 10) goes down but service is unaffected. What to do?

2. Why alert on 'error rate > 1% for 5 minutes' instead of 'any error'?