Capacity Planning
The strategic process of determining the infrastructure needed to meet current and future demands while optimizing cost and performance.
1. What is Capacity Planning?
Capacity planning answers a few concrete questions about a system: how many servers, how much storage, how much bandwidth? Get it wrong and you either crash under load or burn money on unused resources.
Compute
- CPU cores
- Memory (RAM)
- Number of servers
- Container instances
Storage
- Disk capacity
- IOPS (I/O operations per second)
- Storage type (SSD/HDD)
- Backup storage
Network
- Bandwidth
- Connections/second
- Latency requirements
- CDN capacity
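The "how many servers?" question can be sketched as simple arithmetic. The snippet below is a minimal illustration, not a sizing method taken from this guide: the function name `servers_needed`, the traffic figure of 12,000 requests/second, and the 500 requests/second per-server limit are all hypothetical, and the 30% default headroom is taken from the 20-30% range mentioned later in the quiz.

```python
import math

def servers_needed(peak_rps: float, rps_per_server: float, headroom: float = 0.3) -> int:
    """Servers required to carry peak_rps while leaving `headroom` spare on each one.

    All inputs are assumptions the planner has to supply (for example from
    load tests); nothing here is measured automatically.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Only part of each server's measured limit is treated as usable.
    usable_rps = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)

# Hypothetical numbers: 12,000 req/s at peak, 500 req/s per server from a load test.
print(servers_needed(peak_rps=12_000, rps_per_server=500))  # prints 35
```

The same shape of calculation applies to storage and network: divide expected peak demand by the usable share of each unit's capacity and round up.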
2. The Capacity Planning Process
3. Key Metrics to Track
- Unexpected traffic spikes
- Graceful handling of failures
- Time to scale up when needed
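Reading the three items above as the things spare capacity has to absorb, a rough check is to ask what utilization looks like after a failure and how large a spike fits in the remaining headroom. This is a sketch under simplifying assumptions (load spread evenly, total load unchanged when a server fails); both function names are hypothetical.

```python
def utilization_after_failures(servers: int, utilization: float, failed: int = 1) -> float:
    """Average utilization on the surviving servers once `failed` servers drop out.

    Assumes the load is spread evenly and the total load stays constant.
    """
    survivors = servers - failed
    if survivors <= 0:
        raise ValueError("no servers left to take the load")
    return utilization * servers / survivors

def max_spike_factor(utilization: float) -> float:
    """Largest traffic multiple the fleet can absorb before reaching 100%."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization

# Hypothetical fleet: 10 servers averaging 70% CPU.
print(round(utilization_after_failures(10, 0.70, failed=1), 3))  # one server lost
print(round(max_spike_factor(0.70), 2))                          # spike that still fits
```

With these example numbers, losing one server pushes the rest to roughly 78%, and traffic can grow by about 1.4x before the fleet saturates. Anything beyond that has to be covered by scaling up, which takes time.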
4. Capacity Planning Strategies
Lead Strategy
Add capacity BEFORE you need it
- ✓ No performance degradation
- ✓ Handles unexpected spikes
- ✓ Peace of mind
- ✗ Higher costs
- ✗ Potential waste
- ✗ Capital tied up
Lag Strategy
Add capacity AFTER demand increases
- ✓ Lower costs
- ✓ No waste
- ✓ Efficient resource use
- ✗ Risk of overload
- ✗ Performance issues during scaling
- ✗ Reactive
Match Strategy
Add capacity AS demand grows
- ✓ Balanced cost/performance
- ✓ Responsive
- ✓ Efficient
- ✗ Requires accurate forecasting
- ✗ More operational overhead
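The difference between the lead, lag, and match strategies can be made concrete with a toy model. The sketch below is only an illustration: the demand series is hypothetical, capacity is assumed to be bought in fixed blocks of 100 units, and the next period's real demand stands in for a forecast (which a real planner would not have).

```python
import math

def provisioned(demand, strategy, step=100):
    """Capacity per period, bought in blocks of `step` units (toy model)."""
    plan = []
    for i, current in enumerate(demand):
        if strategy == "lead":
            basis = max(demand[i:i + 2])   # size for this period and the next one
        elif strategy == "match":
            basis = current                # size for this period only
        elif strategy == "lag":
            basis = demand[max(i - 1, 0)]  # size for demand already observed
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        plan.append(math.ceil(basis / step) * step)
    return plan

demand = [120, 180, 260, 330]  # hypothetical demand per period
for strategy in ("lead", "match", "lag"):
    plan = provisioned(demand, strategy)
    shortfall = sum(max(d - c, 0) for d, c in zip(demand, plan))
    idle = sum(max(c - d, 0) for d, c in zip(demand, plan))
    print(strategy, plan, "shortfall:", shortfall, "idle:", idle)
```

In this toy run, lead never falls short but carries the most idle capacity, lag carries the least idle capacity but falls short in the last two periods, and match sits in between. Match only looks this clean because the model hands it a perfect view of current demand, which is the "requires accurate forecasting" caveat above.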
Auto-Scaling
Automatically adjust capacity based on metrics
- ✓ Optimal resource use
- ✓ Handles spikes
- ✓ Cost-efficient
- ✗ Scaling lag time
- ✗ Complex configuration
- ✗ Cold start issues
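A common way to "adjust capacity based on metrics" is a proportional rule: scale the fleet by the ratio of the observed metric to its target. The sketch below assumes that rule (the Kubernetes Horizontal Pod Autoscaler documents a formula of this form); the function name, the instance limits, and the 90% reading are hypothetical, and the 60% target echoes the quiz scenario later in this guide.

```python
import math

def desired_instances(current: int, metric: float, target: float,
                      min_instances: int = 1, max_instances: int = 100) -> int:
    """Proportional scaling rule: size the fleet so the metric returns to target.

    `metric` and `target` must be in the same unit (for example average CPU %).
    """
    wanted = math.ceil(current * metric / target)
    # Clamp to configured bounds so a bad reading cannot scale without limit.
    return max(min_instances, min(max_instances, wanted))

# Hypothetical reading: 10 instances at 90% average CPU against a 60% target.
print(desired_instances(current=10, metric=90, target=60))  # prints 15
```

The rule only decides how many instances to ask for. New instances still need time to boot and warm up, so the scaling lag and cold start drawbacks listed above apply regardless of how the target is computed.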
5. Capacity Planning Example
E-commerce Platform Planning
6. Common Bottlenecks
7. Tools for Capacity Planning
Monitoring
- Prometheus + Grafana
- Datadog
- New Relic
- CloudWatch
Load testing
- k6
- Locust
- JMeter
- Gatling
Profiling
- pprof
- async-profiler
- py-spy
- Flame Graphs
8. Key Takeaways
9. Interview Follow-up Questions
Common follow-up questions interviewers ask
10. Test Your Understanding
Your servers are running at 70% CPU on average. What is the appropriate response?
Which metric is MOST important to monitor for a database server?
Auto-scaling is triggered when CPU hits 60%. Traffic spikes and within 30 seconds you need 10x capacity. What happens?
What is the main purpose of keeping 20-30% headroom in capacity?
You're planning for Black Friday and expect 5x normal traffic. Normal traffic needs 10 servers. How many should you provision?