Module 2 — Traffic & Load

Rate Limiting

Control how many requests a client can make. Protect your APIs from abuse and overload.

1. What is Rate Limiting?

Rate Limiting is a technique to control the rate at which clients can make requests to your API. It limits the number of requests a user, IP, or API key can make within a specified time window.
Simple Analogy
Think of a nightclub bouncer:

🚪 Entry Limit: "Only 10 people can enter per minute"
⏱️ Time Window: the counter resets every minute
🚫 Over Limit: "Sorry, wait for the next minute"

Why Rate Limiting?

🛡️ Prevent Abuse: stop bad actors from overwhelming your service with requests (e.g., a scraper trying to download your entire database).

⚖️ Ensure Fairness: one user shouldn't monopolize resources at others' expense (e.g., a free-tier user making 10x more requests than paid users).

🏗️ Protect Infrastructure: prevent cascading failures when demand exceeds capacity (e.g., a viral tweet causing 100x normal traffic).

💰 Control Costs: limit expensive operations like AI calls or third-party APIs (e.g., GPT-4 API calls at $0.06 per 1K tokens).

2. Rate Limiting Algorithms

There are several algorithms to implement rate limiting, each with different trade-offs:

🪣 1. Token Bucket

Most common. Allows bursts. Used by AWS, Stripe.
How it works:
  1. Bucket holds tokens (e.g., max 10 tokens)
  2. Tokens are added at a constant rate (e.g., 1 per second)
  3. Each request takes 1 token from the bucket
  4. No tokens? Request rejected (429)
  5. The bucket never exceeds max capacity
Visual: the bucket fills with tokens over time and drains as requests consume them.

Pros:
+ Allows short bursts (e.g., 10 quick requests)
+ Smooths out traffic over time
Cons:
- Slightly more complex to implement
- Need to track the last refill time
Implementation:
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=1):
        self.capacity = capacity        # Max tokens the bucket holds
        self.tokens = capacity          # Current tokens (starts full)
        self.refill_rate = refill_rate  # Tokens added per second
        self.last_refill = time.monotonic()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True  # Request allowed
        return False     # Rate limited (429)

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
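A quick usage sketch (the request count is illustrative): the initial full bucket absorbs a burst, then requests are rejected until tokens refill.

# Usage: a 10-token bucket refilling at 1 token/second.
bucket = TokenBucket(capacity=10, refill_rate=1)

for i in range(12):
    if bucket.allow_request():
        print(f"request {i}: allowed")
    else:
        print(f"request {i}: rejected (429)")
# The first ~10 requests pass on the initial full bucket; the rest
# are rejected until the ~1 token/second refill catches up.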
📋 2. Sliding Window Log

Most accurate. Stores timestamps of each request.
How it works:
  1. Store the timestamp of every request
  2. On a new request, remove timestamps older than the window
  3. Count the remaining timestamps
  4. If count < limit, allow; else reject
Example (5 req/min):
Window: [10:00:05, 10:00:23, 10:00:45, 10:00:58]
New request at 10:01:02
Remove timestamps < 10:00:02 → none removed: [10:00:05, 10:00:23, 10:00:45, 10:00:58]
Count = 4, limit = 5 → ALLOWED

Pros:
+ Most accurate: no boundary issues
+ Smooth rate limiting
Cons:
- High memory usage (stores all timestamps)
- O(n) to count requests
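For comparison with the token bucket implementation above, here is a minimal single-process sketch of a sliding window log (class and parameter names are illustrative):

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit=5, window=60):
        self.limit = limit    # Max requests per window
        self.window = window  # Window size in seconds
        self.log = deque()    # Timestamps of accepted requests

    def allow_request(self):
        now = time.monotonic()
        # Step 2: drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        # Steps 3-4: count what's left and compare to the limit
        if len(self.log) < self.limit:
            self.log.append(now)
            return True  # Allowed
        return False     # Rejected (429)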
🪟 3. Fixed Window Counter

Simplest. Counter resets at fixed intervals.
How it works:
  1. Divide time into fixed windows (e.g., 1 minute)
  2. Keep a counter for each window
  3. Increment the counter on each request
  4. If the counter exceeds the limit, reject until the window ends
Visual (limit: 5/min):
10:00-10:01 → 4 requests → OK
10:01-10:02 → 7 requests → 2 rejected
10:02-10:03 → 2 requests → OK
⚠️ Boundary Problem:
5 requests at 10:00:59 + 5 requests at 10:01:01 = 10 requests in 2 seconds. All 10 pass because they fall in different windows, so a client can briefly hit double the intended rate.
Pros:
+ Simple to implement
+ Low memory (one counter per window)
Cons:
- Bursts at window boundaries
- Not smooth rate limiting
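A minimal single-process sketch of a fixed window counter (names are illustrative); note that it inherits the boundary problem described above:

import time

class FixedWindowCounter:
    def __init__(self, limit=5, window=60):
        self.limit = limit    # Max requests per window
        self.window = window  # Window size in seconds
        self.current_window = None
        self.count = 0

    def allow_request(self):
        # Identify the fixed window this request falls into
        window_id = int(time.time() // self.window)
        if window_id != self.current_window:
            self.current_window = window_id  # New window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True  # Allowed
        return False     # Rejected until the window rolls over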
📊 4. Sliding Window Counter

Best of both worlds. Weighted average of windows.
How it works:
  1. Track counts for the current and previous window
  2. Estimate the rolling count: weight the previous window's count by how much of it still overlaps the sliding window, and add the current window's full count
  3. e.g., 25% into the current window: prev_count × 0.75 + curr_count
  4. Compare the estimate to the limit
Example (limit: 5/min):
Previous window (10:00-10:01): 4 requests
Current window (10:01-10:02): 2 requests
Time: 10:01:30 (50% into current window, so 50% of the previous window still overlaps)
Estimated: 4 × 0.5 + 2 = 4
4 < 5 limit → ALLOWED
Pros:
+ Smooth rate limiting
+ Low memory (two counters)
+ No boundary burst problem
Cons:
- Approximate, not exact
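A minimal single-process sketch of the weighted estimate described above (names are illustrative):

import time

class SlidingWindowCounter:
    def __init__(self, limit=5, window=60):
        self.limit = limit
        self.window = window        # Window size in seconds
        self.current_window = None  # Id of the window curr_count tracks
        self.curr_count = 0
        self.prev_count = 0

    def allow_request(self):
        now = time.time()
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # Roll the counters forward. If more than one window passed
            # with no traffic, the previous window's count is zero.
            adjacent = (self.current_window is not None
                        and window_id == self.current_window + 1)
            self.prev_count = self.curr_count if adjacent else 0
            self.curr_count = 0
            self.current_window = window_id
        # Weight the previous window by how much of it still overlaps
        # the sliding window, then add the current window's full count.
        fraction = (now % self.window) / self.window
        estimated = self.prev_count * (1 - fraction) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True  # Allowed
        return False     # Rejected (429)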

Algorithm Comparison

Algorithm              | Memory   | Accuracy | Burst Handling | Complexity
Token Bucket           | Low      | High     | Allows bursts  | Medium
Sliding Window Log     | High     | Perfect  | Smooth         | Medium
Fixed Window           | Very Low | Low      | Boundary burst | Simple
Sliding Window Counter | Low      | Good     | Smooth         | Medium

3. What to Rate Limit By

User ID

Each authenticated user gets their own limit

+ Fair per-user
+ Works with auth
- Doesn't stop unauthenticated abuse
- User can create multiple accounts
user:12345 → 100 requests/hour
API Key

Each API key (usually per application) gets a limit

+ Track per-app usage
+ Easy to revoke
- Doesn't distinguish users within app
- Key can be shared
api_key:sk_live_xxx → 1000 requests/hour
IP Address

Limit by client IP address

+ Works without auth
+ Stops basic DoS
- NAT: many users share IP
- Proxies hide real IP
- VPN bypass
ip:192.168.1.1 → 50 requests/minute
Endpoint

Different limits for different API endpoints

+ Protect expensive operations
+ Fine-grained control
- Complex to manage
- May need multiple checks
/search → 10/min, /login → 5/min, /profile → 100/min
Combine Multiple Keys

Use multiple rate limit keys together: user_id + endpoint or api_key + IP. This prevents a single user from hammering one expensive endpoint.
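A sketch of composing keys, reusing the TokenBucket class from earlier; the per-endpoint limits are illustrative:

# One bucket per (user, endpoint) pair. Limits are (capacity, tokens/sec).
ENDPOINT_LIMITS = {
    "/search":  (10, 10 / 60),    # 10/min
    "/login":   (5, 5 / 60),      # 5/min
    "/profile": (100, 100 / 60),  # 100/min
}

buckets = {}  # Composite key → TokenBucket

def allow(user_id, endpoint):
    key = f"user:{user_id}:{endpoint}"  # Composite rate-limit key
    if key not in buckets:
        capacity, rate = ENDPOINT_LIMITS.get(endpoint, (100, 100 / 60))
        buckets[key] = TokenBucket(capacity=capacity, refill_rate=rate)
    return buckets[key].allow_request()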

4. Where to Implement

API Gateway / Load Balancer

Rate limit before traffic reaches your servers

+ Protects all services
+ Early rejection
+ Centralized
- Less context about request
- Coarse-grained
Examples: AWS API Gateway, Kong, Nginx, Cloudflare
Application Layer

Rate limit within your application code

+ Fine-grained control
+ Access to user context
+ Custom logic
- Must implement in each service
- Traffic reaches server first
Examples: Express rate-limit middleware, Django Ratelimit
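As a sketch of the application-layer approach in this module's Python, a decorator can wrap handlers with the TokenBucket from earlier (in-process only; the return shape is illustrative, and a distributed setup needs the shared store described next):

from functools import wraps

bucket = TokenBucket(capacity=10, refill_rate=1)  # Shared by this process only

def rate_limited(handler):
    @wraps(handler)
    def wrapper(*args, **kwargs):
        if not bucket.allow_request():
            # Reject before doing any real work
            return {"error": "rate_limit_exceeded"}, 429
        return handler(*args, **kwargs)
    return wrapper

@rate_limited
def get_profile(user_id):
    return {"user_id": user_id}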
Distributed Rate Limiter

Centralized rate limiting service with Redis/Memcached

+ Consistent across instances
+ Scales horizontally
- Extra network hop
- Redis as dependency
Examples: Redis + Lua scripts, rate-limiter-flexible

Distributed Rate Limiting with Redis

-- Sliding window log in Redis (Lua script for atomicity)
local key = KEYS[1]
local window = tonumber(ARGV[1])  -- window size in seconds
local limit = tonumber(ARGV[2])   -- max requests
local now = tonumber(ARGV[3])     -- current timestamp

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add current request
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- Allowed
else
    return 0  -- Rate limited
end
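A sketch of calling this script from Python with redis-py, assuming a local Redis instance and that the Lua source above is stored in the string lua_script; the key format and default limits are illustrative:

import time
import redis

r = redis.Redis(host="localhost", port=6379)
limiter = r.register_script(lua_script)  # lua_script = the Lua source above

def allow_request(user_id, window=60, limit=100):
    key = f"ratelimit:{user_id}"
    # KEYS[1] = key; ARGV = window size, limit, current timestamp
    return limiter(keys=[key], args=[window, limit, time.time()]) == 1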

5. Rate Limit Response

Standard HTTP Response Headers

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please retry after 30 seconds.",
  "retry_after": 30
}
429 Too Many Requests → standard status code for rate limiting
Retry-After: 30 → seconds until the client should retry
X-RateLimit-Limit: 100 → max requests allowed in the window
X-RateLimit-Remaining: 0 → requests left in the current window
X-RateLimit-Reset: 1640000000 → Unix timestamp when the window resets
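A sketch of producing this response from inside a web handler; Flask is an illustrative choice here, and the helper would be called from within a Flask view:

from flask import jsonify

def rate_limit_response(retry_after, limit, reset_at):
    # Mirrors the 429 response above: JSON body plus the standard headers.
    response = jsonify({
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Please retry after {retry_after} seconds.",
        "retry_after": retry_after,
    })
    response.status_code = 429
    response.headers["Retry-After"] = str(retry_after)
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = "0"
    response.headers["X-RateLimit-Reset"] = str(reset_at)
    return response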
Don't Reveal Too Much

Be careful about information disclosure. Detailed rate limit info helps legitimate clients but also helps attackers know exactly how fast they can abuse your API.

6. Real-World Examples

Twitter/X API
Free tier: 1,500 tweets/month read
Basic: 100 tweets/month write
Pro: 1M tweets/month read
Strategy: Per-app limits, tiered pricing, separate read/write
GitHub API
Unauthenticated: 60 requests/hour
Authenticated: 5,000 requests/hour
GitHub Actions: 1,000 requests/hour per repo
Strategy: Higher limits for authenticated users, search more restricted
Stripe API
Live mode: 100 requests/second
Test mode: 25 requests/second
Some endpoints: 100/hour (like list all customers)
Strategy: Token bucket, allows bursts, different limits per endpoint
OpenAI API
Tokens per minute (TPM)
Requests per minute (RPM)
Varies by model: GPT-4 more restricted than GPT-3.5
Strategy: Dual limits: both request count AND token count

7. Best Practices

Return Retry-After: tell clients when they can retry.
Tip: include the seconds to wait in both the header and the response body.

Use Multiple Limits: different limits for different operations.
Tip: /search: 10/min, /read: 100/min, /write: 20/min.

Graceful Degradation: apply soft limits before hard limits (see the sketch after this list).
Tip: at 80%, warn; at 100%, reject.

Monitor and Alert: track rate limit hits.
Tip: alert if legitimate users frequently hit limits.

Different Tiers: higher limits for paying customers.
Tip: Free: 100/hr, Pro: 1000/hr, Enterprise: custom.

Whitelist Internal Services: don't rate limit service-to-service calls.
Tip: use separate auth for internal vs. external traffic.
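A sketch combining the tiered limits and soft-limit tips above; tier names, thresholds, and the string return values are illustrative:

TIER_LIMITS = {"free": 100, "pro": 1000}  # Requests per hour
SOFT_THRESHOLD = 0.8                      # Warn at 80% of the limit

def check_limit(tier, used):
    limit = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
    if used >= limit:
        return "reject"  # Hard limit: respond 429
    if used >= limit * SOFT_THRESHOLD:
        return "warn"    # Soft limit: serve, but flag the client
    return "ok"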

8. Key Takeaways

1. Rate limiting controls how many requests a client can make in a time period.
2. Algorithms: Token Bucket (bursts OK), Sliding Window (smooth), Fixed Window (simple but has boundary issues).
3. Limit by user ID, API key, IP, or endpoint; often combine multiple keys.
4. Use Redis for distributed rate limiting across multiple servers.
5. Return a 429 status with a Retry-After header.
6. In interviews: discuss algorithm choice, where to implement, and how to handle distributed systems.