Mission Compile - System Design Learning Platform

1. What is Capacity Planning?

Simple Analogy

Think of planning a wedding. You need to know: How many guests? How much food? How many tables? Order too little = disaster. Order too much = waste.

Capacity planning is the same for systems. How many servers? How much storage? How much bandwidth? Get it wrong and you either crash under load or burn money on unused resources.

Capacity planning is the process of determining the production capacity needed to meet changing demand. It involves analyzing current usage, forecasting future needs, and provisioning resources accordingly, all while balancing cost and performance.

Compute

• CPU cores
• Memory (RAM)
• Number of servers
• Container instances

Storage

• Disk capacity
• IOPS (I/O operations)
• Storage type (SSD/HDD)
• Backup storage

Network

• Bandwidth
• Connections/second
• Latency requirements
• CDN capacity

2. The Capacity Planning Process

1. Gather Current Metrics

Collect data on current usage: CPU utilization, memory, disk I/O, network throughput, request rates. Use monitoring tools like Prometheus, Datadog, or CloudWatch.

2. Analyze Usage Patterns

Identify peak hours, seasonal trends, and growth rates. Understand: When is traffic highest? What triggers spikes? What's the growth trajectory?
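As a rough illustration of this step, the sketch below derives a peak-to-average ratio and a compound monthly growth rate from raw counts. The function names and the sample data are hypothetical; real input would come from your monitoring system.

```python
from statistics import mean

def peak_to_average(hourly_requests):
    """Return (index of the busiest hour, peak / average ratio)."""
    avg = mean(hourly_requests)
    peak = max(hourly_requests)
    return hourly_requests.index(peak), peak / avg

def monthly_growth_rate(first_total, last_total, months_elapsed):
    """Compound monthly growth rate between two monthly totals."""
    return (last_total / first_total) ** (1 / months_elapsed) - 1

# Hypothetical day of traffic: 24 hourly request counts.
hourly = [100] * 8 + [300] * 4 + [500] * 2 + [300] * 6 + [100] * 4
busiest_hour, ratio = peak_to_average(hourly)
print(f"busiest hour: {busiest_hour}, peak/avg: {ratio:.2f}")
print(f"monthly growth: {monthly_growth_rate(100, 121, 2):.1%}")
```

A high peak-to-average ratio suggests that sizing for the average would leave the system short during the busiest hours.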

3. Forecast Future Demand

Project future needs based on business plans, historical growth, and expected events (launches, marketing campaigns, holidays).
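One simple way to turn historical growth into a projection is compound growth with an optional multiplier for known events. This is a minimal sketch under that assumption; `project_demand` and its parameters are illustrative names, and real forecasts usually also account for seasonality.

```python
def project_demand(current, monthly_growth, months, event_multiplier=1.0):
    """Project demand assuming steady compound growth.

    event_multiplier models a one-off event (launch, campaign, holiday)
    applied on top of the projected baseline.
    """
    return current * (1 + monthly_growth) ** months * event_multiplier

# Hypothetical inputs: 100 QPS today, growing 10% per month.
baseline = project_demand(100, 0.10, 2)
with_campaign = project_demand(100, 0.10, 2, event_multiplier=3.0)
print(f"baseline: {baseline:.0f} QPS, during campaign: {with_campaign:.0f} QPS")
```

For scale, tripling demand in 12 months corresponds to roughly 9.6% compound growth per month.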

4. Model Capacity Scenarios

Create models for different scenarios: baseline growth, aggressive growth, and worst-case spikes. Plan for at least 2x headroom.
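The sketch below models the three scenarios as multipliers on the current peak and converts each into a server count. The multipliers, the per-server throughput, and the reading of "2x headroom" as provisioning for twice the scenario's peak are all assumptions made for illustration.

```python
import math

# Hypothetical multipliers on the current peak for each scenario.
SCENARIOS = {
    "baseline": 1.5,
    "aggressive": 3.0,
    "worst_case_spike": 9.0,
}

def servers_for(peak_qps, per_server_qps, headroom_factor=2.0):
    """Servers needed to serve peak_qps with a headroom multiplier."""
    return math.ceil(peak_qps * headroom_factor / per_server_qps)

def model_scenarios(current_peak_qps, per_server_qps):
    """Return {scenario name: required server count}."""
    return {
        name: servers_for(current_peak_qps * multiplier, per_server_qps)
        for name, multiplier in SCENARIOS.items()
    }

# Assumed: 500 QPS peak today, each server sustains 200 QPS.
plan = model_scenarios(500, 200)
for name, count in plan.items():
    print(f"{name}: {count} servers")
```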

5. Plan Infrastructure Changes

Determine what needs to be added: more servers, larger instances, additional regions, database upgrades, CDN expansion.

6. Implement & Monitor

Roll out changes, set up alerts for capacity thresholds, and continuously monitor to validate assumptions.
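A capacity alert is often more useful when it reports how long until a threshold is reached rather than only that it was crossed. The following sketch assumes linear growth in utilization; the function name and the 80% default are illustrative.

```python
def days_until_threshold(current_pct, daily_growth_pct, threshold_pct=80.0):
    """Estimate days until utilization reaches threshold_pct.

    Returns 0.0 if already at or above the threshold, and None if
    utilization is flat or shrinking (the threshold is never reached).
    Assumes linear growth, which is a simplification.
    """
    if current_pct >= threshold_pct:
        return 0.0
    if daily_growth_pct <= 0:
        return None
    return (threshold_pct - current_pct) / daily_growth_pct

# Hypothetical disk at 60% full, growing 0.5 percentage points per day.
print(days_until_threshold(60.0, 0.5))  # 40.0
```

Comparing this estimate with the lead time needed to add capacity shows whether the alert fires early enough to act on.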

3. Key Metrics to Track

CPU Utilization

✓ Target: 40-70% average

⚠ Sustained >80% needs attention

Leave headroom for spikes

Memory Usage

✓ Target: 60-80% average

⚠ >90% risks OOM kills

Monitor for memory leaks

Disk I/O

✓ Target: <70% of max IOPS

⚠ High I/O wait times indicate a bottleneck

Consider SSD for high I/O

Network Throughput

✓ Target: <60% of capacity

⚠ Saturation causes packet drops

Use CDN for static content

Request Latency

✓ Target: P99 <500ms (typical)

⚠ Rising latency often signals a capacity issue

Track P50, P95, P99

Error Rate

✓ Target: <0.1%

⚠ Errors spike under load

5xx errors often indicate a capacity problem
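To make the latency targets above concrete, here is a minimal sketch that computes P50, P95, and P99 from raw samples using the nearest-rank method and checks P99 against the 500ms target. It assumes a non-empty list of latencies in milliseconds; monitoring systems typically estimate percentiles from histograms instead, so treat this as illustrative only.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def latency_report(latencies_ms, p99_target_ms=500):
    """Summarize P50/P95/P99 and flag whether P99 meets the target."""
    report = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
    report["p99_within_target"] = report["p99"] < p99_target_ms
    return report

# Hypothetical samples: mostly fast requests with a slow tail.
samples = [100] * 98 + [900] * 2
print(latency_report(samples))
```

In this sample the median looks healthy while P99 misses the target, which is why tracking tail percentiles matters more than tracking the average.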

The 80% Rule

Never plan to run at 100% capacity. Keep at least 20% headroom for:

• Unexpected traffic spikes
• Graceful handling of failures
• Time to scale up when needed
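The 80% rule translates directly into sizing arithmetic: divide the expected peak by the usable share of each server's capacity, not by its full capacity. A minimal sketch, where the per-server throughput is an assumed figure you would obtain from load testing:

```python
import math

def servers_needed(peak_qps, per_server_max_qps, max_utilization=0.8):
    """Servers required so that none runs above max_utilization at peak."""
    usable_qps = per_server_max_qps * max_utilization
    return math.ceil(peak_qps / usable_qps)

# Assumed: 1,500 QPS peak, each server saturates at 100 QPS.
print(servers_needed(1500, 100))                       # with 20% headroom
print(servers_needed(1500, 100, max_utilization=1.0))  # no headroom
```

With these assumed inputs the headroom costs four extra servers (19 instead of 15), which is the price of absorbing spikes and failures without degradation.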

4. Capacity Planning Strategies

Lead Strategy

Add capacity BEFORE you need it

Pros

✓ No performance degradation
✓ Handles unexpected spikes
✓ Peace of mind

Cons

✗ Higher costs
✗ Potential waste
✗ Capital tied up

Best for: Critical systems, unpredictable growth

Lag Strategy

Add capacity AFTER demand increases

Pros

✓ Lower costs
✓ No waste
✓ Efficient resource use

Cons

✗ Risk of overload
✗ Performance issues during scaling
✗ Reactive

Best for: Cost-sensitive, predictable workloads

Match Strategy

Add capacity AS demand grows

Pros

✓ Balanced cost/performance
✓ Responsive
✓ Efficient

Cons

✗ Requires accurate forecasting
✗ More operational overhead

Best for: Most production systems

Auto-Scaling

Automatically adjust capacity based on metrics

Pros

✓ Optimal resource use
✓ Handles spikes
✓ Cost-efficient

Cons

✗ Scaling lag time
✗ Complex configuration
✗ Cold start issues

Best for: Variable workloads, cloud-native
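Many target-tracking autoscalers use a proportional rule: scale the instance count by the ratio of the observed metric to its target. The Kubernetes Horizontal Pod Autoscaler documents this formula; the sketch below reproduces only that core calculation and omits tolerances, stabilization windows, and cooldowns. The function name and default bounds are illustrative.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100):
    """Proportional scaling: replicas * (observed / target), clamped."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# 10 instances at 90% CPU with a 60% target: scale out to 15.
print(desired_replicas(10, 90, 60))
# A 10x spike is capped by max_replicas, one of auto-scaling's limits.
print(desired_replicas(10, 600, 60, max_replicas=40))
```

The calculation itself is instant, but new instances still take time to boot and warm up, which is where the scaling lag and cold start issues listed above come from.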

5. Capacity Planning Example

E-commerce Platform Planning

Current State

• Daily Orders: 50,000
• Peak QPS: 500
• Avg Latency: 120ms
• Server Count: 10

Growth Forecast (1 Year)

• Expected Orders: 150,000 (+200%)
• Projected Peak QPS: 1,500 (+200%)
• Target Latency: <100ms
• Holiday Spike: 3x normal

Capacity Plan

• Compute: Scale from 10 to 25 servers (with auto-scaling to 40 for peaks)

• Database: Upgrade to larger instance, add read replica

• Cache: Increase Redis cluster from 3 to 6 nodes

• CDN: Add additional edge locations for new markets

• Timeline: Q1: Database upgrade, Q2: Server scaling, Q3: CDN expansion
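A quick sanity check of the compute numbers is to look at the load each server would carry. The sketch below assumes traffic is spread evenly across servers and that the 3x holiday spike applies to the projected 1,500 peak QPS; neither assumption is stated in the plan itself.

```python
import math

def qps_per_server(peak_qps, servers):
    """Average load per server, assuming an even spread."""
    return peak_qps / servers

current = qps_per_server(500, 10)       # today
planned = qps_per_server(1500, 25)      # next year, normal peak
holiday = qps_per_server(1500 * 3, 40)  # next year, holiday peak

print(f"current: {current} QPS/server")
print(f"planned: {planned} QPS/server")
print(f"holiday: {holiday} QPS/server")

# Servers required if per-server load stayed at today's level.
print(math.ceil(1500 / current), math.ceil(1500 * 3 / current))
```

Under these assumptions the plan asks each server to carry more than twice today's peak load during the holidays, so it only holds if the servers currently have that much spare capacity or if the database and cache upgrades raise per-server throughput. Load testing is the way to confirm this before committing to the numbers.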

6. Common Bottlenecks

Database

Symptoms: High query latency, connection pool exhaustion, replication lag

Solutions: Read replicas, connection pooling, query optimization, sharding

Application Servers

Symptoms: High CPU/memory, slow response times, request queuing

Solutions: Horizontal scaling, code optimization, caching, async processing
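Request queuing can be estimated with Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies it to sizing worker or connection pools; the function names are illustrative, and the even spread across servers is an assumption.

```python
import math

def in_flight_requests(qps, avg_latency_ms):
    """Little's Law: concurrency = arrival rate * time in system."""
    return qps * avg_latency_ms / 1000

def per_server_concurrency(qps, avg_latency_ms, servers):
    """Concurrent requests each server must handle, assuming an even spread."""
    return math.ceil(in_flight_requests(qps, avg_latency_ms) / servers)

# 500 QPS at 120ms average latency across 10 servers.
print(in_flight_requests(500, 120))          # 60.0
print(per_server_concurrency(500, 120, 10))  # 6
```

If latency doubles under load, the number of in-flight requests doubles too, which is how a slow dependency can exhaust worker and connection pools even at constant traffic.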

Network

Symptoms: High latency, packet drops, bandwidth saturation

Solutions: CDN, compression, connection reuse, regional deployment

Third-Party APIs

Symptoms: Timeout errors, inconsistent latency, rate limiting

Solutions: Circuit breakers, caching, fallbacks, retry with backoff
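As an illustration of retry with backoff, the sketch below retries a failing call with exponentially growing, randomized delays so that many clients do not retry in lockstep against a struggling dependency. The function name and defaults are hypothetical, and production code should retry only on errors known to be transient rather than on every exception.

```python
import random
import time

def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Call fn(), retrying on failure with exponential backoff and jitter.

    The delay before retry n is a random value between 0 and
    min(max_delay, base_delay * 2**n). The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # simplification: narrow this in real code
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))

# Hypothetical flaky dependency: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_api():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"

print(call_with_retry(flaky_api, sleep=lambda seconds: None))
```

Capping the number of attempts matters for capacity: unbounded retries multiply the load on a dependency that is already over capacity.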

7. Tools for Capacity Planning

Monitoring

• Prometheus + Grafana
• Datadog
• New Relic
• CloudWatch

Load Testing

• k6
• Locust
• JMeter
• Gatling

Profiling

• pprof
• async-profiler
• py-spy
• Flame Graphs

8. Key Takeaways

1. Capacity planning = ensuring you have the right resources at the right time at the right cost.

2. Track the Big 4: CPU, Memory, Disk I/O, Network. Watch utilization AND saturation.

3. Always maintain 20-30% headroom. Never plan to run at 100%.

4. Use auto-scaling for variable workloads, but understand its limitations (lag time, cold starts).

5. Plan for peaks, not averages. Holiday traffic can be 3-10x normal.

6. Capacity planning is continuous. Review monthly, adjust quarterly.

9. Interview Follow-up Questions

Common follow-up questions interviewers ask

10. Test Your Understanding

5 questions

1. Your servers are running at 70% CPU on average. What is the appropriate response?

2. Which metric is MOST important to monitor for a database server?

3. Auto-scaling is triggered when CPU hits 60%. Traffic spikes and within 30 seconds you need 10x capacity. What happens?

4. What is the main purpose of keeping 20-30% headroom in capacity?

5. You're planning for Black Friday and expect 5x normal traffic. Normal traffic needs 10 servers. How many should you provision?
