Module 0 - Core Concepts

Capacity Planning

The strategic process of determining the infrastructure needed to meet current and future demands while optimizing cost and performance.

12 min read · Infrastructure

1. What is Capacity Planning?

Simple Analogy
Think of planning a wedding. You need to know: How many guests? How much food? How many tables? Order too little = disaster. Order too much = waste.

Capacity planning is the same for systems. How many servers? How much storage? How much bandwidth? Get it wrong and you either crash under load or burn money on unused resources.
Capacity planning is the process of determining the production capacity needed to meet changing demand. It involves analyzing current usage, forecasting future needs, and provisioning resources accordingly, all while balancing cost and performance.
Compute
  • CPU cores
  • Memory (RAM)
  • Number of servers
  • Container instances
Storage
  • Disk capacity
  • IOPS (I/O operations)
  • Storage type (SSD/HDD)
  • Backup storage
Network
  • Bandwidth
  • Connections/second
  • Latency requirements
  • CDN capacity

2. The Capacity Planning Process

1. Gather Current Metrics
Collect data on current usage: CPU utilization, memory, disk I/O, network throughput, request rates. Use monitoring tools like Prometheus, Datadog, or CloudWatch.
2. Analyze Usage Patterns
Identify peak hours, seasonal trends, and growth rates. Understand: When is traffic highest? What triggers spikes? What's the growth trajectory?
3. Forecast Future Demand
Project future needs based on business plans, historical growth, and expected events (launches, marketing campaigns, holidays).
4. Model Capacity Scenarios
Create models for different scenarios: baseline growth, aggressive growth, and worst-case spikes. Plan for at least 2x headroom.
5. Plan Infrastructure Changes
Determine what needs to be added: more servers, larger instances, additional regions, database upgrades, CDN expansion.
6. Implement & Monitor
Roll out changes, set up alerts for capacity thresholds, and continuously monitor to validate assumptions.
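
Steps 3 and 4 can be sketched as a small projection model. The following is a minimal illustration in Python, assuming demand compounds at a fixed monthly rate. The growth rates, the spike multiplier, the 500 QPS starting point, and the function names are all hypothetical; only the 2x headroom factor comes from the step description above.

```python
def project_peak(current_peak_qps: float, monthly_growth: float, months: int) -> float:
    """Compound the current peak load forward at a fixed monthly growth rate."""
    return current_peak_qps * (1 + monthly_growth) ** months


def capacity_target(projected_qps: float, spike_multiplier: float = 1.0,
                    headroom_factor: float = 2.0) -> float:
    """Load to provision for: projected peak, times any spike, times headroom."""
    return projected_qps * spike_multiplier * headroom_factor


# Hypothetical scenarios: (monthly growth rate, spike multiplier).
SCENARIOS = {
    "baseline": (0.05, 1.0),
    "aggressive": (0.10, 1.0),
    "worst-case spike": (0.10, 3.0),
}

current_peak_qps = 500.0  # example value, not a measurement
for name, (growth, spike) in SCENARIOS.items():
    projected = project_peak(current_peak_qps, growth, months=12)
    target = capacity_target(projected, spike_multiplier=spike)
    print(f"{name}: projected peak {projected:.0f} QPS, provision for {target:.0f} QPS")
```

Real forecasts rarely follow a single growth rate, so a model like this is best treated as a way to compare scenarios side by side rather than as a prediction.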

3. Key Metrics to Track

CPU Utilization
Target: 40-70% average
Sustained >80% needs attention
Leave headroom for spikes
Memory Usage
Target: 60-80% average
>90% risks OOM kills
Monitor for memory leaks
Disk I/O
Target: <70% of max IOPS
High wait times indicate bottleneck
Consider SSD for high I/O
Network Throughput
Target: <60% of capacity
Saturation causes drops
Use CDN for static content
Request Latency
Target: P99 <500ms (typical)
Increasing latency = capacity issue
Track P50, P95, P99
Error Rate
Target: <0.1%
Errors spike under load
5xx errors often signal exhausted capacity
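
As a rough illustration of how these targets might be turned into checks, the sketch below computes P50/P95/P99 with the nearest-rank method and flags metrics that exceed the limits listed above. The metric key names, function names, and sample values are hypothetical, and treating each target as a hard upper bound is a simplifying assumption.

```python
import math

# Upper limits read off the targets above; the key names are made up for this sketch.
LIMITS = {
    "cpu_pct": 80.0,         # sustained CPU above 80% needs attention
    "memory_pct": 90.0,      # above 90% risks OOM kills
    "disk_iops_pct": 70.0,   # share of max IOPS
    "network_pct": 60.0,     # share of link capacity
    "p99_latency_ms": 500.0,
    "error_rate_pct": 0.1,
}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches(observed):
    """Return the names of observed metrics that exceed their limit, sorted."""
    return sorted(name for name, value in observed.items()
                  if name in LIMITS and value > LIMITS[name])


# Made-up sample data for illustration.
latencies_ms = [80, 95, 110, 120, 130, 150, 180, 240, 400, 650]
observed = {
    "cpu_pct": 72.0,
    "memory_pct": 93.0,
    "p99_latency_ms": percentile(latencies_ms, 99),
}
print({p: percentile(latencies_ms, p) for p in (50, 95, 99)})
print(breaches(observed))
```

With only ten samples the P95 and P99 both land on the slowest request, which is one reason tail percentiles need a large sample window to be meaningful.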
The 80% Rule
Never plan to run at 100% capacity. Keep at least 20% headroom for:
  • Unexpected traffic spikes
  • Graceful handling of failures
  • Time to scale up when needed
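
The rule translates directly into sizing arithmetic: divide peak load by the share of each server you are willing to use. A minimal sketch, assuming load spreads evenly across servers; the function name and the 50 QPS per-server figure are hypothetical.

```python
import math


def servers_needed(peak_qps: float, qps_per_server: float,
                   max_utilization: float = 0.8) -> int:
    """Servers required so that peak load uses at most max_utilization of each one."""
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    usable_qps = qps_per_server * max_utilization
    return math.ceil(peak_qps / usable_qps)


print(servers_needed(500, 50, max_utilization=1.0))  # sizing with no headroom
print(servers_needed(500, 50))                       # sizing with 20% headroom
```

For a 500 QPS peak at 50 QPS per server, the headroom raises the count from 10 servers to 13.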

4. Capacity Planning Strategies

Lead Strategy

Add capacity BEFORE you need it

Pros
  • No performance degradation
  • Handles unexpected spikes
  • Peace of mind
Cons
  • Higher costs
  • Potential waste
  • Capital tied up
Best for: Critical systems, unpredictable growth

Lag Strategy

Add capacity AFTER demand increases

Pros
  • Lower costs
  • No waste
  • Efficient resource use
Cons
  • Risk of overload
  • Performance issues during scaling
  • Reactive
Best for: Cost-sensitive, predictable workloads

Match Strategy

Add capacity AS demand grows

Pros
  • Balanced cost/performance
  • Responsive
  • Efficient
Cons
  • Requires accurate forecasting
  • More operational overhead
Best for: Most production systems

Auto-Scaling

Automatically adjust capacity based on metrics

Pros
  • Optimal resource use
  • Handles spikes
  • Cost-efficient
Cons
  • Scaling lag time
  • Complex configuration
  • Cold start issues
Best for: Variable workloads, cloud-native
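
Metric-based auto-scaling often comes down to a proportional rule: scale the instance count by the ratio of the observed metric to its target. The Kubernetes Horizontal Pod Autoscaler documents a rule of this form. The sketch below is a simplified stand-alone version, not that project's implementation, and the function name, bounds, and sample values are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 40) -> int:
    """Proportional scaling: grow or shrink by the ratio of observed to target metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to configured bounds so a metric spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(10, current_metric=90, target_metric=60))   # scale out
print(desired_replicas(10, current_metric=30, target_metric=60))   # scale in
print(desired_replicas(10, current_metric=600, target_metric=60))  # capped at max_replicas
```

The rule only decides how many instances to run. It says nothing about how quickly they become ready, which is where the scaling lag and cold-start issues listed above come from.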

5. Capacity Planning Example

E-commerce Platform Planning

Current State
  • Daily Orders: 50,000
  • Peak QPS: 500
  • Avg Latency: 120ms
  • Server Count: 10
Growth Forecast (1 Year)
  • Expected Orders: 150,000 (+200%)
  • Projected Peak QPS: 1,500 (+200%)
  • Target Latency: <100ms
  • Holiday Spike: 3x normal
Capacity Plan
Compute: Scale from 10 to 25 servers (with auto-scaling to 40 for peaks)
Database: Upgrade to larger instance, add read replica
Cache: Increase Redis cluster from 3 to 6 nodes
CDN: Add additional edge locations for new markets
Timeline: Q1: Database upgrade, Q2: Server scaling, Q3: CDN expansion
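
A quick sanity check on a plan like this is to compute the per-server load it implies. The sketch below assumes traffic is spread evenly across servers and, since the example does not say so explicitly, that the 3x holiday multiplier applies on top of the projected 1,500 QPS. The function name is hypothetical.

```python
def qps_per_server(peak_qps: float, servers: int) -> float:
    """Average load each server carries at peak, assuming even distribution."""
    return peak_qps / servers


today = qps_per_server(500, 10)
planned = qps_per_server(1_500, 25)
# Assumption: the 3x holiday multiplier is applied to the projected 1,500 QPS.
holiday = qps_per_server(1_500 * 3, 40)

for label, value in [
    ("today, 10 servers", today),
    ("next year, 25 servers", planned),
    ("holiday peak, 40 servers", holiday),
]:
    print(f"{label}: {value:.1f} QPS per server")
```

Under that reading, each server would carry more load than it does today, so the plan would depend on the database, cache, and CDN changes raising per-server throughput. Load testing is the usual way to confirm an assumption like that before committing to the numbers.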

6. Common Bottlenecks

Database
Symptoms: High query latency, connection pool exhaustion, replication lag
Solutions: Read replicas, connection pooling, query optimization, sharding
Application Servers
Symptoms: High CPU/memory, slow response times, request queuing
Solutions: Horizontal scaling, code optimization, caching, async processing
Network
Symptoms: High latency, packet drops, bandwidth saturation
Solutions: CDN, compression, connection reuse, regional deployment
Third-Party APIs
Symptoms: Timeout errors, inconsistent latency, rate limiting
Solutions: Circuit breakers, caching, fallbacks, retry with backoff
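
Of the third-party API mitigations, retry with backoff is the simplest to show in code. The following is a minimal sketch using exponential backoff with random jitter; the function name, default delays, and the `flaky_api` stand-in are hypothetical, and production code would normally retry only errors known to be transient.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, sleeping a random, exponentially growing delay between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # jitter spreads retries from many clients


attempts = {"count": 0}


def flaky_api():
    """Hypothetical third-party call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"


# Pass a no-op sleep so the demonstration runs instantly.
print(retry_with_backoff(flaky_api, sleep=lambda seconds: None))
```

Capping both the attempt count and the delay matters for capacity: unbounded retries against a struggling dependency add load exactly when it can least absorb it.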

7. Tools for Capacity Planning

Monitoring
  • Prometheus + Grafana
  • Datadog
  • New Relic
  • CloudWatch
Load Testing
  • k6
  • Locust
  • JMeter
  • Gatling
Profiling
  • pprof
  • async-profiler
  • py-spy
  • Flame Graphs

8. Key Takeaways

1. Capacity planning = ensuring you have the right resources at the right time at the right cost.
2. Track the Big 4: CPU, Memory, Disk I/O, Network. Watch utilization AND saturation.
3. Always maintain 20-30% headroom. Never plan to run at 100%.
4. Use auto-scaling for variable workloads, but understand its limitations (lag time, cold starts).
5. Plan for peaks, not averages. Holiday traffic can be 3-10x normal.
6. Capacity planning is continuous. Review monthly, adjust quarterly.

9. Interview Follow-up Questions

Common follow-up questions interviewers ask

10. Test Your Understanding

5 questions

1. Your servers are running at 70% CPU on average. What is the appropriate response?

2. Which metric is MOST important to monitor for a database server?

3. Auto-scaling is triggered when CPU hits 60%. Traffic spikes and within 30 seconds you need 10x capacity. What happens?

4. What is the main purpose of keeping 20-30% headroom in capacity?

5. You're planning for Black Friday and expect 5x normal traffic. Normal traffic needs 10 servers. How many should you provision?
