Module 2 — Traffic & Load

Rate Limiting

Control how many requests a client can make. Protect your APIs from abuse and overload.

1. What is Rate Limiting?

Rate Limiting is a technique to control the rate at which clients can make requests to your API. It limits the number of requests a user, IP, or API key can make within a specified time window.
Simple Analogy
Think of a nightclub bouncer:

🚪 Entry Limit: "Only 10 people can enter per minute"
⏱️ Time Window: the counter resets every minute
🚫 Over Limit: "Sorry, wait for the next minute"

Why Rate Limiting?

🛡️ Prevent Abuse: stop bad actors from overwhelming your service with requests (e.g., a scraper trying to download your entire database).

⚖️ Ensure Fairness: one user shouldn't monopolize resources at others' expense (e.g., a free-tier user making 10x more requests than paid users).

🏗️ Protect Infrastructure: prevent cascading failures when demand exceeds capacity (e.g., a viral tweet causing 100x normal traffic).

💰 Control Costs: limit expensive operations like AI calls or third-party APIs (e.g., GPT-4 API calls at $0.06 per 1K tokens).

2. Rate Limiting Algorithms

There are several algorithms to implement rate limiting, each with different trade-offs:

🪣 1. Token Bucket

Most common. Allows bursts. Used by AWS, Stripe.
How it works:
  1. Bucket holds tokens (e.g., max 10 tokens)
  2. Tokens are added at a constant rate (e.g., 1 per second)
  3. Each request takes 1 token from the bucket
  4. No tokens? Request rejected (429)
  5. The bucket never exceeds max capacity
Visual: the bucket fills with tokens over time and drains as requests consume them.

Pros:
+ Allows short bursts (e.g., 10 quick requests)
+ Smooths out traffic over time
Cons:
- Slightly more complex to implement
- Need to track the last refill time
Implementation:
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=1):
        self.capacity = capacity        # Max tokens the bucket holds
        self.tokens = capacity          # Current tokens (starts full)
        self.refill_rate = refill_rate  # Tokens added per second
        self.last_refill = time.monotonic()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True  # Request allowed
        return False     # Rate limited (429)

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
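A quick usage sketch (the request count is illustrative): the initial full bucket absorbs a burst, then requests are rejected until tokens refill.

# Usage: a 10-token bucket refilling at 1 token/second.
bucket = TokenBucket(capacity=10, refill_rate=1)

for i in range(12):
    if bucket.allow_request():
        print(f"request {i}: allowed")
    else:
        print(f"request {i}: rejected (429)")
# The first ~10 requests pass on the initial full bucket; the rest
# are rejected until the ~1 token/second refill catches up.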
📋 2. Sliding Window Log

Most accurate. Stores timestamps of each request.
How it works:
  1. Store the timestamp of every request
  2. On a new request, remove timestamps older than the window
  3. Count the remaining timestamps
  4. If count < limit, allow; else reject
Example (5 req/min):
Window: [10:00:05, 10:00:23, 10:00:45, 10:00:58]
New request at 10:01:02
Remove timestamps < 10:00:02 → none removed: [10:00:05, 10:00:23, 10:00:45, 10:00:58]
Count = 4, limit = 5 → ALLOWED

Pros:
+ Most accurate: no boundary issues
+ Smooth rate limiting
Cons:
- High memory usage (stores all timestamps)
- O(n) to count requests
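For comparison with the token bucket implementation above, here is a minimal single-process sketch of a sliding window log (class and parameter names are illustrative):

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit=5, window=60):
        self.limit = limit    # Max requests per window
        self.window = window  # Window size in seconds
        self.log = deque()    # Timestamps of accepted requests

    def allow_request(self):
        now = time.monotonic()
        # Step 2: drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        # Steps 3-4: count what's left and compare to the limit
        if len(self.log) < self.limit:
            self.log.append(now)
            return True  # Allowed
        return False     # Rejected (429)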
🪟 3. Fixed Window Counter

Simplest. Counter resets at fixed intervals.
How it works:
  1. Divide time into fixed windows (e.g., 1 minute)
  2. Keep a counter for each window
  3. Increment the counter on each request
  4. If the counter exceeds the limit, reject until the window ends
Visual (limit: 5/min):
10:00-10:01 → 4 requests → OK
10:01-10:02 → 7 requests → 2 rejected
10:02-10:03 → 2 requests → OK
⚠️ Boundary Problem:
5 requests at 10:00:59 + 5 requests at 10:01:01 = 10 requests in 2 seconds. All 10 pass because they fall in different windows, so a client can briefly hit double the intended rate.
Pros:
+ Simple to implement
+ Low memory (one counter per window)
Cons:
- Bursts at window boundaries
- Not smooth rate limiting
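A minimal single-process sketch of a fixed window counter (names are illustrative); note that it inherits the boundary problem described above:

import time

class FixedWindowCounter:
    def __init__(self, limit=5, window=60):
        self.limit = limit    # Max requests per window
        self.window = window  # Window size in seconds
        self.current_window = None
        self.count = 0

    def allow_request(self):
        # Identify the fixed window this request falls into
        window_id = int(time.time() // self.window)
        if window_id != self.current_window:
            self.current_window = window_id  # New window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True  # Allowed
        return False     # Rejected until the window rolls over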
📊 4. Sliding Window Counter

Best of both worlds. Weighted average of windows.
How it works:
  1. Track counts for the current and previous window
  2. Estimate the rolling count: weight the previous window's count by how much of it still overlaps the sliding window, and add the current window's full count
  3. e.g., 25% into the current window: prev_count × 0.75 + curr_count
  4. Compare the estimate to the limit
Example (limit: 5/min):
Previous window (10:00-10:01): 4 requests
Current window (10:01-10:02): 2 requests
Time: 10:01:30 (50% into current window, so 50% of the previous window still overlaps)
Estimated: 4 × 0.5 + 2 = 4
4 < 5 limit → ALLOWED
Pros:
+ Smooth rate limiting
+ Low memory (two counters)
+ No boundary burst problem
Cons:
- Approximate, not exact
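A minimal single-process sketch of the weighted estimate described above (names are illustrative):

import time

class SlidingWindowCounter:
    def __init__(self, limit=5, window=60):
        self.limit = limit
        self.window = window        # Window size in seconds
        self.current_window = None  # Id of the window curr_count tracks
        self.curr_count = 0
        self.prev_count = 0

    def allow_request(self):
        now = time.time()
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # Roll the counters forward. If more than one window passed
            # with no traffic, the previous window's count is zero.
            adjacent = (self.current_window is not None
                        and window_id == self.current_window + 1)
            self.prev_count = self.curr_count if adjacent else 0
            self.curr_count = 0
            self.current_window = window_id
        # Weight the previous window by how much of it still overlaps
        # the sliding window, then add the current window's full count.
        fraction = (now % self.window) / self.window
        estimated = self.prev_count * (1 - fraction) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True  # Allowed
        return False     # Rejected (429)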

Algorithm Comparison

Algorithm              | Memory   | Accuracy | Burst Handling | Complexity
Token Bucket           | Low      | High     | Allows bursts  | Medium
Sliding Window Log     | High     | Perfect  | Smooth         | Medium
Fixed Window           | Very Low | Low      | Boundary burst | Simple
Sliding Window Counter | Low      | Good     | Smooth         | Medium

3. What to Rate Limit By

User ID

Each authenticated user gets their own limit

+ Fair per-user
+ Works with auth
- Doesn't stop unauthenticated abuse
- User can create multiple accounts
user:12345 → 100 requests/hour
API Key

Each API key (usually per application) gets a limit

+ Track per-app usage
+ Easy to revoke
- Doesn't distinguish users within app
- Key can be shared
api_key:sk_live_xxx → 1000 requests/hour
IP Address

Limit by client IP address

+ Works without auth
+ Stops basic DoS
- NAT: many users share IP
- Proxies hide real IP
- VPN bypass
ip:192.168.1.1 → 50 requests/minute
Endpoint

Different limits for different API endpoints

+ Protect expensive operations
+ Fine-grained control
- Complex to manage
- May need multiple checks
/search → 10/min, /login → 5/min, /profile → 100/min
Combine Multiple Keys

Use multiple rate limit keys together: user_id + endpoint or api_key + IP. This prevents a single user from hammering one expensive endpoint.
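A sketch of composing keys, reusing the TokenBucket class from earlier; the per-endpoint limits are illustrative:

# One bucket per (user, endpoint) pair. Limits are (capacity, tokens/sec).
ENDPOINT_LIMITS = {
    "/search":  (10, 10 / 60),    # 10/min
    "/login":   (5, 5 / 60),      # 5/min
    "/profile": (100, 100 / 60),  # 100/min
}

buckets = {}  # Composite key → TokenBucket

def allow(user_id, endpoint):
    key = f"user:{user_id}:{endpoint}"  # Composite rate-limit key
    if key not in buckets:
        capacity, rate = ENDPOINT_LIMITS.get(endpoint, (100, 100 / 60))
        buckets[key] = TokenBucket(capacity=capacity, refill_rate=rate)
    return buckets[key].allow_request()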

4. Where to Implement

API Gateway / Load Balancer

Rate limit before traffic reaches your servers

+ Protects all services
+ Early rejection
+ Centralized
- Less context about request
- Coarse-grained
Examples: AWS API Gateway, Kong, Nginx, Cloudflare
Application Layer

Rate limit within your application code

+ Fine-grained control
+ Access to user context
+ Custom logic
- Must implement in each service
- Traffic reaches server first
Examples: Express rate-limit middleware, Django Ratelimit
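As a sketch of the application-layer approach in this module's Python, a decorator can wrap handlers with the TokenBucket from earlier (in-process only; the return shape is illustrative, and a distributed setup needs the shared store described next):

from functools import wraps

bucket = TokenBucket(capacity=10, refill_rate=1)  # Shared by this process only

def rate_limited(handler):
    @wraps(handler)
    def wrapper(*args, **kwargs):
        if not bucket.allow_request():
            # Reject before doing any real work
            return {"error": "rate_limit_exceeded"}, 429
        return handler(*args, **kwargs)
    return wrapper

@rate_limited
def get_profile(user_id):
    return {"user_id": user_id}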
Distributed Rate Limiter

Centralized rate limiting service with Redis/Memcached

+ Consistent across instances
+ Scales horizontally
- Extra network hop
- Redis as dependency
Examples: Redis + Lua scripts, rate-limiter-flexible

Distributed Rate Limiting with Redis

-- Sliding window log in Redis (Lua script for atomicity)
local key = KEYS[1]
local window = tonumber(ARGV[1])  -- window size in seconds
local limit = tonumber(ARGV[2])   -- max requests
local now = tonumber(ARGV[3])     -- current timestamp

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add current request
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- Allowed
else
    return 0  -- Rate limited
end
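A sketch of calling this script from Python with redis-py, assuming a local Redis instance and that the Lua source above is stored in the string lua_script; the key format and default limits are illustrative:

import time
import redis

r = redis.Redis(host="localhost", port=6379)
limiter = r.register_script(lua_script)  # lua_script = the Lua source above

def allow_request(user_id, window=60, limit=100):
    key = f"ratelimit:{user_id}"
    # KEYS[1] = key; ARGV = window size, limit, current timestamp
    return limiter(keys=[key], args=[window, limit, time.time()]) == 1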

5. Rate Limit Response

Standard HTTP Response Headers

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please retry after 30 seconds.",
  "retry_after": 30
}
429 Too Many Requests → standard status code for rate limiting
Retry-After: 30 → seconds until the client should retry
X-RateLimit-Limit: 100 → max requests allowed in the window
X-RateLimit-Remaining: 0 → requests left in the current window
X-RateLimit-Reset: 1640000000 → Unix timestamp when the window resets
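A sketch of producing this response from inside a web handler; Flask is an illustrative choice here, and the helper would be called from within a Flask view:

from flask import jsonify

def rate_limit_response(retry_after, limit, reset_at):
    # Mirrors the 429 response above: JSON body plus the standard headers.
    response = jsonify({
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Please retry after {retry_after} seconds.",
        "retry_after": retry_after,
    })
    response.status_code = 429
    response.headers["Retry-After"] = str(retry_after)
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = "0"
    response.headers["X-RateLimit-Reset"] = str(reset_at)
    return response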
Don't Reveal Too Much

Be careful about information disclosure. Detailed rate limit info helps legitimate clients but also helps attackers know exactly how fast they can abuse your API.

6. Real-World Examples

Twitter/X API
Free tier: 1,500 tweets/month read
Basic: 100 tweets/month write
Pro: 1M tweets/month read
Strategy: Per-app limits, tiered pricing, separate read/write
GitHub API
Unauthenticated: 60 requests/hour
Authenticated: 5,000 requests/hour
GitHub Actions: 1,000 requests/hour per repo
Strategy: Higher limits for authenticated users, search more restricted
Stripe API
Live mode: 100 requests/second
Test mode: 25 requests/second
Some endpoints: 100/hour (like list all customers)
Strategy: Token bucket, allows bursts, different limits per endpoint
OpenAI API
Tokens per minute (TPM)
Requests per minute (RPM)
Varies by model: GPT-4 more restricted than GPT-3.5
Strategy: Dual limits: both request count AND token count

7. Best Practices

Return Retry-After: tell clients when they can retry.
Tip: include the seconds to wait in both the header and the response body.

Use Multiple Limits: different limits for different operations.
Tip: /search: 10/min, /read: 100/min, /write: 20/min.

Graceful Degradation: apply soft limits before hard limits (see the sketch after this list).
Tip: at 80%, warn; at 100%, reject.

Monitor and Alert: track rate limit hits.
Tip: alert if legitimate users frequently hit limits.

Different Tiers: higher limits for paying customers.
Tip: Free: 100/hr, Pro: 1000/hr, Enterprise: custom.

Whitelist Internal Services: don't rate limit service-to-service calls.
Tip: use separate auth for internal vs. external traffic.
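A sketch combining the tiered limits and soft-limit tips above; tier names, thresholds, and the string return values are illustrative:

TIER_LIMITS = {"free": 100, "pro": 1000}  # Requests per hour
SOFT_THRESHOLD = 0.8                      # Warn at 80% of the limit

def check_limit(tier, used):
    limit = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
    if used >= limit:
        return "reject"  # Hard limit: respond 429
    if used >= limit * SOFT_THRESHOLD:
        return "warn"    # Soft limit: serve, but flag the client
    return "ok"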

8. Key Takeaways

1. Rate limiting controls how many requests a client can make in a time period.
2. Algorithms: Token Bucket (bursts OK), Sliding Window (smooth), Fixed Window (simple but has boundary issues).
3. Limit by user ID, API key, IP, or endpoint; often combine multiple keys.
4. Use Redis for distributed rate limiting across multiple servers.
5. Return a 429 status with a Retry-After header.
6. In interviews: discuss algorithm choice, where to implement, and how to handle distributed systems.