Core Performance Metrics
The key numbers that define system health: Latency, Throughput, Availability, and Error Rate.
1. Latency
Types of Latency
• Network: time for data to travel across the network
• Processing: time spent processing in your code
• Database: time to execute queries and return data
• Waiting: time spent in queues or waiting on I/O operations
Latency Breakdown: Typical Web Request
Measuring Latency: Percentiles Matter
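Average latency hides problems. Use percentiles such as p50 (the median), p95, and p99, where pN is the latency that N% of requests stay at or below.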
At 1M requests/day, a p99 of 450ms means about 10,000 requests per day (1% of traffic) take 450ms or longer. Amazon is often cited as having found that every 100ms of added latency cost about 1% in sales. High percentiles reflect real user pain.
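A minimal sketch of why the average misleads, using the nearest-rank method on made-up latencies. The sample data and the `percentile` helper are illustrative assumptions, not measurements from a real system; production monitoring tools often use interpolation or histograms instead.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [50] * 98 + [450, 900]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Requests per day at or above p99, for 1M requests/day (1% of traffic).
requests_per_day = 1_000_000
slow_requests_per_day = requests_per_day * (100 - 99) // 100

print(f"mean={mean_ms}ms p50={p50}ms p99={p99}ms slow/day={slow_requests_per_day}")
```

The mean stays close to the fast requests, while p99 exposes the slow tail that the paragraph above describes.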
How to Reduce Latency
2. Throughput
Throughput vs Latency
• High throughput can increase latency (queuing under load)
• Optimize for what matters most for your use case
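One way to see the queuing effect is a textbook M/M/1 queue (a single server with random arrivals and service times). This model is an assumption brought in for illustration and is not something the source specifies; real systems rarely match it exactly, but the shape of the curve is typical.

```python
def mm1_latency_ms(service_time_ms, arrival_rps):
    """Mean time in system (queue wait + service) for an M/M/1 queue.

    Returns infinity once offered load reaches or exceeds capacity.
    """
    capacity_rps = 1000 / service_time_ms      # max sustainable requests/sec
    utilization = arrival_rps / capacity_rps   # fraction of capacity in use
    if utilization >= 1:
        return float("inf")
    return service_time_ms / (1 - utilization)

# Hypothetical server: 10 ms per request, so about 100 RPS of capacity.
for rps in (10, 50, 90, 99):
    print(f"{rps:>3} RPS -> {mm1_latency_ms(10, rps):.1f} ms")
```

Under this model, latency roughly doubles at 50% utilization and grows about tenfold at 90%, which is why pushing a server toward its maximum throughput tends to hurt latency.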
Factors Affecting Throughput
Real-World Throughput Numbers
| System | Throughput | Notes |
|---|---|---|
| Single Node.js | 1K-10K RPS | Event loop, single thread |
| Single Go Server | 10K-100K RPS | Goroutines, multi-core |
| Redis | 100K+ RPS | In-memory, simple ops |
| PostgreSQL | 1K-50K QPS | Depends on query complexity |
| Kafka | 1M+ messages/sec | Per broker, sequential I/O |
3. Availability
The Nines Table
| Availability | Downtime/Year | Downtime/Month | Typical Use |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.3 hours | Internal tools, batch jobs |
| 99.9% (three 9s) | 8.76 hours | 43.8 minutes | Standard SaaS products |
| 99.99% (four 9s) | 52.6 minutes | 4.4 minutes | E-commerce, financial |
| 99.999% (five 9s) | 5.26 minutes | 26 seconds | Critical infrastructure |
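The downtime columns follow directly from the availability percentage. A short sketch, assuming a 365-day year and an average month of one twelfth of a year (730 hours), which appears to be the convention the table uses:

```python
HOURS_PER_YEAR = 365 * 24            # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730, an average month

def downtime_hours(availability_pct, period_hours):
    """Allowed downtime in hours for a given availability over a period."""
    return (1 - availability_pct / 100) * period_hours

for pct in (99, 99.9, 99.99, 99.999):
    yearly_h = downtime_hours(pct, HOURS_PER_YEAR)
    monthly_min = downtime_hours(pct, HOURS_PER_MONTH) * 60
    print(f"{pct}%: {yearly_h:.2f} h/year, {monthly_min:.2f} min/month")
```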
Calculating System Availability
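A common way to estimate the availability of a composed system, sketched here under the assumption that component failures are independent: components in series (all must work) multiply their availabilities, while redundant replicas in parallel (any one suffices) multiply their failure probabilities. The function names are illustrative.

```python
from math import prod

def serial_availability(components):
    """All components must be up: multiply availabilities."""
    return prod(components)

def parallel_availability(replicas):
    """At least one replica must be up: 1 minus the chance that all are down."""
    return 1 - prod(1 - a for a in replicas)

# Hypothetical request path: load balancer -> app server -> database.
chain = serial_availability([0.999, 0.999, 0.999])

# Two redundant app servers at 99% each.
redundant_pair = parallel_availability([0.99, 0.99])

print(f"chain: {chain:.6f}, redundant pair: {redundant_pair:.6f}")
```

Chaining dependencies lowers availability below that of any single component, while redundancy raises it, provided the replicas do not share a common failure cause.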
How to Achieve High Availability
4. Error Rate
Types of Errors
4xx Client Errors
• 400 Bad Request - malformed input
• 401 Unauthorized - not logged in
• 403 Forbidden - no permission
• 404 Not Found - resource doesn't exist
• 429 Too Many Requests - rate limited
5xx Server Errors
• 500 Internal Server Error - bug/crash
• 502 Bad Gateway - upstream failed
• 503 Service Unavailable - overloaded
• 504 Gateway Timeout - upstream slow
Error Budget Concept
If your SLO is 99.9% availability, you have an "error budget" of 0.1%: the share of requests (or time) that may fail before the SLO is breached.
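A rough sketch of tracking an error budget by request count. The function names and traffic figures are hypothetical, and teams may equally measure the budget in minutes of downtime rather than failed requests.

```python
def error_budget_requests(slo_pct, total_requests):
    """Number of requests that may fail in the window without breaching the SLO."""
    return total_requests * (100 - slo_pct) / 100

def budget_remaining_fraction(slo_pct, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget_requests(slo_pct, total_requests)
    return 1 - failed_requests / budget

# Hypothetical window: 1M requests against a 99.9% SLO, with 250 failures so far.
budget = error_budget_requests(99.9, 1_000_000)
remaining = budget_remaining_fraction(99.9, 1_000_000, 250)

print(f"budget: {budget:.0f} failed requests, remaining: {remaining:.0%}")
```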
Healthy Error Rate Targets
| Metric | Good | Acceptable | Critical |
|---|---|---|---|
| 5xx Rate | < 0.01% | < 0.1% | > 1% |
| Timeout Rate | < 0.1% | < 0.5% | > 2% |
| 4xx Rate | < 1% | < 5% | > 10% |
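The thresholds in the table can be applied to raw status codes. A minimal sketch, assuming a plain list of HTTP status codes as input; the table leaves the range between "Acceptable" and "Critical" unnamed, so the "degraded" label below is an assumption.

```python
from collections import Counter

def error_rates_pct(status_codes):
    """Return the 4xx and 5xx rates as percentages of all requests."""
    total = len(status_codes)
    if total == 0:
        raise ValueError("no requests to measure")
    classes = Counter(code // 100 for code in status_codes)
    return {"4xx": classes[4] / total * 100, "5xx": classes[5] / total * 100}

def classify_5xx(rate_pct):
    """Map a 5xx rate (in percent) onto the table's bands."""
    if rate_pct < 0.01:
        return "good"
    if rate_pct < 0.1:
        return "acceptable"
    if rate_pct > 1:
        return "critical"
    return "degraded"  # assumed label for the unnamed 0.1% to 1% range

# Hypothetical sample: 10,000 requests with 8 client errors and 2 server errors.
codes = [200] * 9990 + [404] * 8 + [500] * 2
rates = error_rates_pct(codes)
status = classify_5xx(rates["5xx"])

print(rates, status)
```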
5. Putting It All Together
The Trade-offs
Typical SLOs by Service Type
| Service Type | Latency (p99) | Availability | Error Rate |
|---|---|---|---|
| User-facing API | < 200ms | 99.9% | < 0.1% |
| Payment Service | < 500ms | 99.99% | < 0.01% |
| Search Service | < 100ms | 99.9% | < 0.5% |
| Batch Processing | N/A (throughput matters) | 99% | < 1% |
6. Try It: Latency vs Throughput Simulator
Use this highway analogy to understand how latency and throughput interact. Cars represent requests, lanes represent servers. Add lanes (horizontal scaling) or increase speed (vertical scaling) to see the effects.
7. Key Takeaways
8. Interview Follow-up Questions
9. Test Your Understanding
Your service has an average latency of 100ms and P99 latency of 800ms. What does this tell you?
Which formula correctly represents throughput?
A service has 99.9% availability. How much downtime is allowed per month?
Which is NOT typically included in error rate calculations?
Batching database writes typically: