Module 11 - Interview Prep

Bottleneck Analysis

Finding and fixing the weakest link in your system.

1. The Highway Traffic Analogy

Simple Analogy
A 6-lane highway narrowing to 2 lanes creates a traffic jam, no matter how fast you drive before or after that point. Systems work the same way. If your database handles 1K QPS but your app servers push 10K QPS, the database is your bottleneck. Optimizing the app servers won't help.

A bottleneck is the component that limits overall system throughput. A system can only process requests as fast as its slowest part, so identifying and addressing bottlenecks is the key to scaling.
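The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming a serial pipeline in which every request passes through every component; the function name `find_bottleneck` and the dictionary keys are made up for this example, and the figures are the ones from the analogy.

```python
def find_bottleneck(capacities_qps):
    """Return (name, capacity) of the component with the lowest capacity.

    Assumes every request passes through every component, so the
    lowest-capacity component caps the throughput of the whole system.
    """
    name = min(capacities_qps, key=capacities_qps.get)
    return name, capacities_qps[name]


# Figures from the analogy: app servers push 10K QPS, the database handles 1K QPS.
capacities = {"app_servers": 10_000, "database": 1_000}
bottleneck, max_qps = find_bottleneck(capacities)
print(f"Bottleneck: {bottleneck}, system capped at {max_qps} QPS")

# Making the app servers 10x faster does not change the result.
capacities["app_servers"] = 100_000
print(find_bottleneck(capacities))
```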

2. Common Bottleneck Types

Database Bottleneck

Symptoms

Slow queries

Connection pool exhausted

High CPU on DB server

Lock contention

Solutions

Add read replicas

Implement caching

Optimize queries/indexes

Shard the database

Network Bottleneck

Symptoms

High latency between services

Bandwidth saturation

Cross-region calls

Solutions

CDN for static content

Compress responses

Move services closer

Batch requests

CPU Bottleneck

Symptoms

100% CPU utilization

Slow response under load

Request queuing

Solutions

Horizontal scaling

Optimize algorithms

Async processing

Caching computed results

Memory Bottleneck

Symptoms

OOM errors

Excessive GC pauses

Swapping to disk

Solutions

Increase instance size

Reduce in-memory data

Streaming processing

Pagination
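Streaming and pagination both come down to holding one chunk in memory at a time instead of the full result set. A minimal sketch, assuming an offset/limit style data source; `fetch_page`, `iter_rows`, and `ROWS` are hypothetical names standing in for a real query layer.

```python
ROWS = list(range(10))  # stand-in for a large table


def fetch_page(offset, limit):
    """Hypothetical data source: returns at most `limit` rows starting at `offset`."""
    return ROWS[offset:offset + limit]


def iter_rows(fetch, page_size):
    """Yield rows one at a time while holding only one page in memory."""
    offset = 0
    while True:
        page = fetch(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)


# The consumer processes rows as they arrive instead of loading everything first.
total = sum(iter_rows(fetch_page, page_size=3))
print(total)
```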

3. Identifying Bottlenecks

1. Look at the Data Flow: Trace a request from client to response. Where does the time go?
2. Find the Synchronous Path: What's on the critical path? What must complete before the response is sent?
3. Check Utilization: Which component is at or near 100%? That's likely your bottleneck.
4. Calculate Capacity: What's the theoretical max throughput of each component?
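Steps 3 and 4 can be approximated with simple arithmetic. The sketch below uses Little's Law (throughput is at most concurrency divided by latency) to estimate capacity, then ranks components by utilization. All numbers and component names are hypothetical, and it assumes every request hits every component once.

```python
def max_throughput_qps(concurrency, latency_s):
    """Upper bound on throughput from Little's Law: concurrency / latency."""
    return concurrency / latency_s


def rank_by_utilization(offered_qps, capacities_qps):
    """Return (component, utilization) pairs, most utilized first."""
    utilization = {name: offered_qps / cap for name, cap in capacities_qps.items()}
    return sorted(utilization.items(), key=lambda item: item[1], reverse=True)


# Hypothetical: 50 worker threads, each request holds a worker for 200 ms.
app_capacity = max_throughput_qps(concurrency=50, latency_s=0.2)  # about 250 QPS

capacities = {"app_server": app_capacity, "database": 400, "cache": 5_000}
ranked = rank_by_utilization(offered_qps=200, capacities_qps=capacities)
for name, u in ranked:
    print(f"{name}: {u:.0%} utilized")
```

The component at the top of the ranking is the first candidate to investigate, not a guaranteed culprit.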

Rule of thumb: Start at the database. 80% of the time, that's where the bottleneck is. Then check network, then compute.

4. Worked Example: E-commerce Checkout

Problem: Checkout takes 5 seconds under load
Component          Latency   Capacity                   Status
Load Balancer      10ms      100K QPS                   OK
App Server         200ms     1K QPS per server (x10)    OK
Inventory Check    500ms     500 QPS                    WARNING
Payment Service    800ms     100 QPS                    CRITICAL
Database Write     100ms     1K QPS                     OK
Analysis

Bottleneck: Payment service at 100 QPS, taking 800ms per request.

Solutions: (1) Add more payment service instances. (2) Make payment async: confirm the order first, then process the payment in the background. (3) Use a payment gateway that batches requests.
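The same analysis can be reproduced from the table figures. This sketch assumes the five stages run serially for each checkout, that the payment service's 100 QPS is a per-instance figure, and that capacity scales linearly with instances; none of these is stated in the example. The 500 QPS target is likewise only an illustrative choice (it matches the next-lowest capacity).

```python
import math

# (component, latency in ms, capacity in QPS), taken from the table above.
stages = [
    ("Load Balancer", 10, 100_000),
    ("App Server", 200, 10_000),  # 1K QPS per server x 10 servers
    ("Inventory Check", 500, 500),
    ("Payment Service", 800, 100),
    ("Database Write", 100, 1_000),
]

# Unloaded end-to-end latency if the stages run one after another.
total_latency_ms = sum(latency for _, latency, _ in stages)

# The lowest-capacity stage caps throughput for the whole checkout.
bottleneck = min(stages, key=lambda stage: stage[2])

# Assumption: 100 QPS per payment instance, scaling linearly.
target_qps = 500
instances_needed = math.ceil(target_qps / bottleneck[2])

print(f"Serial latency: {total_latency_ms} ms")
print(f"Bottleneck: {bottleneck[0]} at {bottleneck[2]} QPS")
print(f"Payment instances for {target_qps} QPS: {instances_needed}")
```

Under these assumptions the unloaded latency is about 1.6 seconds, so the 5 seconds observed under load would be consistent with requests queuing at the saturated payment stage. Once payment is scaled, the inventory check becomes the next limit.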

5. Interviewer Questions

"What's the bottleneck in your design?"

Look at your high-level design (HLD). Which component handles the most load? Which scales least well?

"How would you scale 10x?"

Identify the current bottleneck, solve it, then find the next one. It's iterative.

"What breaks first under load?"

Usually: database → external APIs → app servers → load balancer

"How would you find bottlenecks in production?"

Monitoring: latency per component, saturation metrics, distributed tracing.
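Per-component latency can be collected with very little code. The following is a minimal in-process sketch, not a substitute for a real tracing system; the `timed` helper, the component names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings_ms = defaultdict(list)  # component name -> list of observed latencies


@contextmanager
def timed(component):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[component].append((time.perf_counter() - start) * 1000)


def handle_checkout():
    with timed("inventory_check"):
        time.sleep(0.002)  # stand-in for a fast call
    with timed("payment"):
        time.sleep(0.03)   # stand-in for a slow call


for _ in range(3):
    handle_checkout()

averages = {name: sum(v) / len(v) for name, v in timings_ms.items()}
slowest = max(averages, key=averages.get)
print(f"Slowest component: {slowest} ({averages[slowest]:.1f} ms average)")
```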

6. Resolution Strategies

Scale Up

Bigger machines. Quick fix but limited ceiling.

When: Small systems, vertical limits not reached

Scale Out

More machines. Requires stateless design.

When: Compute-bound, embarrassingly parallel

Cache

Reduce load on slow components.

When: Read-heavy, data doesn't change often
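A small in-process example shows how a cache reduces load on a slow component. It uses `functools.lru_cache` from the Python standard library; `slow_lookup` and `get_product` are hypothetical names, and the counter stands in for real database traffic. There is no expiry or invalidation here, which a real cache for changing data would need.

```python
from functools import lru_cache

db_calls = 0  # counts how often the "database" is actually hit


def slow_lookup(product_id):
    """Stand-in for an expensive database query."""
    global db_calls
    db_calls += 1
    return f"product-{product_id}"


@lru_cache(maxsize=1024)
def get_product(product_id):
    """Serve repeated reads from memory; only misses reach the database."""
    return slow_lookup(product_id)


# Six reads, but only three distinct products.
for pid in [1, 2, 1, 1, 2, 3]:
    get_product(pid)

print(f"Reads: 6, database calls: {db_calls}")
print(get_product.cache_info())
```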

Async

Move work off critical path.

When: Work can be done later, user doesn't need immediate result
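One way to move work off the critical path is a queue drained by a background worker. This sketch uses the standard library's `queue` and `threading` modules; the `checkout` and `payment_worker` functions are hypothetical, and a production system would more likely use a durable message queue so work survives a crash.

```python
import queue
import threading

payment_queue = queue.Queue()
processed = []  # order ids whose payment has been handled


def payment_worker():
    """Drain the queue in the background; None is the shutdown signal."""
    while True:
        order_id = payment_queue.get()
        if order_id is None:
            payment_queue.task_done()
            break
        processed.append(order_id)  # stand-in for the slow payment call
        payment_queue.task_done()


def checkout(order_id):
    """Respond immediately; the payment itself happens later."""
    payment_queue.put(order_id)
    return {"order_id": order_id, "status": "payment pending"}


worker = threading.Thread(target=payment_worker, daemon=True)
worker.start()

responses = [checkout(order_id) for order_id in range(3)]
payment_queue.join()     # wait until the worker has handled every order
payment_queue.put(None)  # ask the worker to stop
worker.join()

print(responses[0])
print(f"Payments processed in background: {processed}")
```

The trade-off is that the user sees "pending" rather than a final result, so the design needs a way to report failures afterwards.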

Shard

Partition data/load across nodes.

When: Data is too large for single node
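A simple form of sharding routes each key to a node by hashing it. The sketch below is a minimal hash-modulo router, assuming string keys and a fixed shard count; `shard_for` is a made-up name. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a string key to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same key always lands on the same shard.
print(shard_for("user-42", 4))

# Keys spread across shards, so each node holds only part of the data.
distribution = Counter(shard_for(f"user-{i}", 4) for i in range(1000))
print(sorted(distribution.items()))
```

With plain modulo routing, changing the shard count remaps most keys, which is why schemes such as consistent hashing are often used when shards are added or removed.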

Optimize

Better algorithms, queries, code.

When: Before scaling, always check for inefficiencies

7. Key Takeaways

1. Bottleneck = the component limiting overall throughput. System speed = slowest part.
2. The database is usually the first bottleneck. Then external APIs, then compute.
3. Trace the request path. Find where time is spent.
4. Solve iteratively. Fix one bottleneck, find the next.
5. In interviews, proactively identify bottlenecks in your design.

Quiz

1. App servers at 30% CPU, database at 95% CPU. Where's the bottleneck?

2. What is the best way to reduce database load for a read-heavy workload?