Alerting Strategies
Wake up for real problems. Sleep through the noise.
1. The Fire Alarm Analogy
Alerting is automatically notifying humans when something needs attention. The goal is catching real issues early while minimizing false alarms that cause alert fatigue.
2. Alert Fatigue is Real
Every alert should require human action. If it doesn't, it shouldn't be an alert; it should be a log line or a dashboard metric. Alerts that fire without needing action train responders to ignore the ones that do.
3Alert Severity Levels
P1 - Critical
Service down. Users impacted. Revenue loss. Immediate response.
Page on-call. All hands. 5-min response.
P2 - High
Degraded service. Some users affected. Could escalate.
Page on-call during business hours. 30-min response.
P3 - Medium
Issue needs attention but service is functional.
Ticket created. Fix within 24 hours.
P4 - Low
Minor issue. No user impact. Cleanup or tech debt.
Ticket created. Fix in next sprint.
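The severity levels above map naturally to routing rules. A minimal sketch, assuming a hypothetical `ROUTING` table and `route` helper that mirror the P1-P4 levels described here, not any particular tool's API:

```python
# Hypothetical severity-to-response mapping, following the P1-P4
# definitions above. Field names are illustrative.
ROUTING = {
    "P1": {"action": "page",   "when": "24x7",           "response": "5 min"},
    "P2": {"action": "page",   "when": "business hours", "response": "30 min"},
    "P3": {"action": "ticket", "when": "any",            "response": "24 hours"},
    "P4": {"action": "ticket", "when": "any",            "response": "next sprint"},
}

def route(severity: str) -> str:
    """Return how an alert of this severity reaches a human."""
    return ROUTING[severity]["action"]
```

Keeping this mapping as explicit data makes the page-versus-ticket decision reviewable, instead of being scattered across individual alert definitions.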
4. What to Alert On
Alert On (Symptoms)
- ✓Error rate > 1% for 5 minutes
- ✓P99 latency > 2s for 10 minutes
- ✓Zero successful requests for 1 minute
- ✓Free disk space < 10%
Don't Alert On (Causes)
- ✗Single node down (if redundant)
- ✗CPU spike lasting 30 seconds
- ✗One failed request
- ✗Deployment in progress
Alert on symptoms (users are affected), not causes (a server is slow). Users don't care if one node is down; they care if the service is slow.
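The symptom-based rules above can be made concrete in a small check. This is an illustrative sketch, assuming per-minute `(errors, total)` request counts; it fires only when the user-visible error rate stays above a threshold for the whole window, so a single failed request never pages:

```python
def error_rate_alert(samples, threshold=0.01, window=5):
    """Fire only if the error rate exceeds `threshold` for `window`
    consecutive minutes. `samples` is a list of (errors, total)
    request counts, one entry per minute, oldest first."""
    if len(samples) < window:
        return False
    return all(
        total > 0 and errors / total > threshold
        for errors, total in samples[-window:]
    )
```

Four clean minutes followed by one bad minute stays quiet; five consecutive bad minutes pages.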
5. Reducing Alert Noise
Add Duration Thresholds
CPU > 90% for 5 minutes, not just CPU > 90% at a single instant. Prevents paging on brief spikes.
Use Percentiles
Alert if P99 > 2s, not if any request > 2s. Tolerates occasional slow requests.
Group Related Alerts
Database down causes 10 services to fail. Send 1 alert, not 10.
Auto-resolve Transient Issues
If the issue resolves in 2 minutes, log it but don't page.
Time-based Suppression
Don't alert during deployments or maintenance windows.
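Two of the techniques above, duration thresholds and percentiles, fit in a few lines. A rough sketch with hypothetical helpers; the 90% CPU and P99 figures come from the examples in this section:

```python
import math

def sustained(values, threshold, duration):
    """True only if every one of the last `duration` samples breaches
    `threshold`; a brief spike never fires."""
    return len(values) >= duration and all(
        v > threshold for v in values[-duration:]
    )

def p99(latencies):
    """Nearest-rank 99th percentile. One slow request out of many
    barely moves it, so alerting on P99 tolerates occasional outliers."""
    ordered = sorted(latencies)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]
```

A CPU series like `[95, 96, 40, 97]` never satisfies `sustained(..., 90, 3)`, which is exactly the brief-spike case the duration threshold is meant to suppress.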
6. Real-World Dry Run
Scenario: Designing alerts for an e-commerce checkout
| Alert | Condition | Severity |
|---|---|---|
| Checkout Down | Success rate = 0% for 1 min | P1 |
| High Error Rate | Error rate > 5% for 5 min | P1 |
| Checkout Slow | P99 latency > 3s for 10 min | P2 |
| Elevated Errors | Error rate > 1% for 10 min | P2 |
| Payment Gateway Errors | Stripe errors > 2% for 5 min | P3 |
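The table above can also be expressed directly as data. A sketch only; field names are illustrative, and in practice these definitions would live in your monitoring tool's rule config:

```python
# Each entry mirrors one row of the checkout alert table above.
CHECKOUT_ALERTS = [
    {"name": "Checkout Down",          "condition": "success_rate == 0% for 1 min",  "severity": "P1"},
    {"name": "High Error Rate",        "condition": "error_rate > 5% for 5 min",     "severity": "P1"},
    {"name": "Checkout Slow",          "condition": "p99_latency > 3s for 10 min",   "severity": "P2"},
    {"name": "Elevated Errors",        "condition": "error_rate > 1% for 10 min",    "severity": "P2"},
    {"name": "Payment Gateway Errors", "condition": "stripe_errors > 2% for 5 min",  "severity": "P3"},
]

def pages_oncall(alert):
    """Per the severity levels above: P1 and P2 page a human,
    P3 and P4 become tickets."""
    return alert["severity"] in ("P1", "P2")
```

Note how the same metric (error rate) appears twice at different thresholds and windows: a sharp spike pages immediately as P1, while a slow burn surfaces as P2.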
7. Common Tools
PagerDuty
Industry standard on-call management. Escalations, schedules.
Opsgenie
Atlassian's alerting. Integrates with Jira, Confluence.
Prometheus Alertmanager
Open source. Groups, dedupes, routes alerts.
Datadog Monitors
SaaS. Anomaly detection, composite conditions.
Grafana Alerting
Built into Grafana. Alert on dashboard panels.
Slack/Teams
For P3/P4 alerts. Not for paging.
8. Key Takeaways
Quiz
1. One server (of 10) goes down but service is unaffected. What to do?
2. Why alert on 'error rate > 1% for 5 minutes' instead of 'any error'?