HLD Problem

Design Notification System

Design a scalable notification system supporting push notifications, SMS, email, and in-app messages with prioritization, rate limiting, and personalization.

35 min read · Medium

1. Requirements Gathering

Functional Requirements
  • Send push notifications (iOS, Android, Web)
  • Send SMS messages
  • Send emails (transactional & marketing)
  • In-app notifications
  • User notification preferences
  • Template management
  • Scheduling (send later)
  • Notification history
  • Analytics and tracking
Non-Functional Requirements
  • High availability (99.99%)
  • Low latency for critical notifications (< 1s in-system processing; per-priority delivery SLAs are listed in 4.2)
  • At-least-once delivery guarantee
  • Handle 10M+ notifications/minute
  • Rate limiting per user/channel
  • Soft real-time (eventual delivery)
  • Cost optimization

2. Capacity Estimation

Scale Numbers

Notifications/Day: 500M
Peak/Minute: 10M
Users: 100M
Channels: 4

Traffic Breakdown by Channel

Push Notifications: 60% (~300M/day)
In-App: 25% (~125M/day)
Email: 12% (~60M/day)
SMS: 3% (~15M/day)
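
The scale numbers above can be sanity-checked with a short back-of-the-envelope calculation. The sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
# Back-of-the-envelope check of the scale numbers stated above.
NOTIFICATIONS_PER_DAY = 500_000_000
PEAK_PER_MINUTE = 10_000_000
CHANNEL_SHARE = {"push": 0.60, "in_app": 0.25, "email": 0.12, "sms": 0.03}

SECONDS_PER_DAY = 24 * 60 * 60

# Average and peak throughput in notifications per second.
avg_per_second = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
peak_per_second = PEAK_PER_MINUTE / 60
peak_to_avg = peak_per_second / avg_per_second

# Daily volume per channel from the traffic breakdown.
per_channel_per_day = {
    channel: round(NOTIFICATIONS_PER_DAY * share)
    for channel, share in CHANNEL_SHARE.items()
}

print(f"average: {avg_per_second:,.0f}/s")
print(f"peak:    {peak_per_second:,.0f}/s")
print(f"peak/avg ratio: {peak_to_avg:.1f}x")
for channel, count in per_channel_per_day.items():
    print(f"{channel}: {count:,}/day")
```

The peak rate is well over an order of magnitude above the daily average, so queues and workers need to be sized (or auto-scaled) for the peak rather than the average.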

3. High-Level Architecture

System Architecture

Event sources: User Actions, Scheduled Jobs, System Events, External APIs
Notification API: REST/gRPC endpoints
Message Queue (Kafka): Decouple producers & consumers
Core services:
  • Notification Service: Orchestration
  • Preference Service: User settings
  • Template Service: Message templates
  • Rate Limiter: Throttling
Channel Workers:
  • Push Worker: APNS, FCM
  • SMS Worker: Twilio, SNS
  • Email Worker: SES, SendGrid
  • In-App Worker: WebSocket
Data stores:
  • PostgreSQL: Users, prefs
  • Redis: Rate limits, cache
  • Cassandra: Notification logs
  • S3: Templates, media

4. Core Components Deep Dive

4.1 Notification Flow

  1. Trigger received: Event triggers notification (order placed, message received)
  2. Publish to queue: Notification request added to Kafka topic
  3. Check preferences: Query user preferences - which channels enabled?
  4. Rate limiting: Check if user hit notification limits
  5. Render template: Inject dynamic data into message template
  6. Route to channels: Send to appropriate channel workers
  7. Deliver: Each worker sends to external provider (APNS, Twilio, etc.)
  8. Log result: Store delivery status for analytics
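
Steps 3 to 6 of the flow can be sketched as a single function that a Notification Service consumer might run for each request taken off the queue. This is a minimal sketch, not a definitive implementation: `PREFERENCES`, `TEMPLATES`, `allowed_by_rate_limit`, and `process` are hypothetical in-memory stand-ins for the Preference Service, Template Service, and Rate Limiter, and treating a missing preference as "disabled" is an assumption.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class NotificationRequest:
    user_id: str
    template_id: str
    channels: list[str]
    data: dict[str, str] = field(default_factory=dict)


# Hypothetical in-memory stand-ins for the Preference and Template services.
PREFERENCES = {"12345": {"push": True, "email": False}}
TEMPLATES = {"order_shipped": Template("Order $order_id has shipped.")}


def allowed_by_rate_limit(user_id: str, channel: str) -> bool:
    """Placeholder for step 4; a real check would consult the rate limiter."""
    return True


def process(request: NotificationRequest) -> list[dict]:
    """Run steps 3-6 for one request and return one job per target channel."""
    prefs = PREFERENCES.get(request.user_id, {})
    # Step 3: keep only the channels the user has enabled (assumed opt-in).
    enabled = [c for c in request.channels if prefs.get(c, False)]
    # Step 4: drop channels on which the user has hit a limit.
    permitted = [c for c in enabled if allowed_by_rate_limit(request.user_id, c)]
    if not permitted:
        return []
    # Step 5: inject the dynamic data into the message template.
    body = TEMPLATES[request.template_id].substitute(request.data)
    # Step 6: emit one job per channel for the channel workers.
    return [
        {"channel": c, "user_id": request.user_id, "body": body}
        for c in permitted
    ]


jobs = process(
    NotificationRequest(
        "12345", "order_shipped", ["push", "email"], {"order_id": "ORD-789"}
    )
)
print(jobs)
```

Here the user has disabled email, so only a push job is produced. Checking preferences and limits before rendering avoids template work for notifications that will be dropped anyway.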

4.2 Priority Handling

Critical
  • OTP codes
  • Security alerts
  • Payment confirmations
SLA: < 30 seconds
High
  • Order updates
  • Direct messages
  • Mentions
SLA: < 1 minute
Normal
  • Marketing
  • Recommendations
  • Weekly digests
SLA: < 5 minutes
Separate Queues by Priority
Use a dedicated Kafka topic or SQS queue for each priority level. Give the critical queue more consumer instances so that it drains faster and is never stuck behind bulk marketing traffic.
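
One way to express the per-priority split is a mapping from priority to topic plus a weighted share of the consumer pool. The topic names, the weights, and the helper names `topic_for` and `allocate_consumers` below are all assumptions made for illustration; the source only says that the critical queue gets more consumers.

```python
# Hypothetical topic names, one per priority level.
TOPIC_BY_PRIORITY = {
    "critical": "notifications.critical",
    "high": "notifications.high",
    "normal": "notifications.normal",
}

# Hypothetical relative share of consumer instances per priority.
CONSUMER_WEIGHT = {"critical": 5, "high": 3, "normal": 2}


def topic_for(priority: str) -> str:
    """Return the topic a request of the given priority is published to."""
    try:
        return TOPIC_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}") from None


def allocate_consumers(total: int) -> dict[str, int]:
    """Split a pool of consumer instances across priorities by weight."""
    if total < len(CONSUMER_WEIGHT):
        raise ValueError("need at least one consumer per priority")
    weight_sum = sum(CONSUMER_WEIGHT.values())
    # Round down, but never leave a priority level without a consumer.
    allocation = {
        p: max(1, total * w // weight_sum) for p, w in CONSUMER_WEIGHT.items()
    }
    # Instances left over after rounding down go to the critical queue.
    leftover = total - sum(allocation.values())
    if leftover > 0:
        allocation["critical"] += leftover
    return allocation


print(topic_for("critical"))
print(allocate_consumers(20))
```

Keeping priorities in separate topics means a backlog of marketing messages cannot delay an OTP, because the two never share a partition or a consumer.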

4.3 Rate Limiting

Per User Limits:
  • Push: Max 5/hour, 20/day
  • SMS: Max 3/hour, 5/day
  • Email: Max 10/day
Global Limits:
  • SMS provider rate limits
  • Email sending reputation
  • Push provider quotas
Use Redis with a sliding-window algorithm for distributed rate limiting, so that all workers share the same per-user counters.
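
The sliding-window idea can be illustrated with an in-memory sliding-window log that enforces the per-user limits listed above. This is a single-process sketch only: the class name `SlidingWindowLimiter` is hypothetical, and a distributed version would keep the same timestamp log in Redis, for example in a sorted set per user and channel (`ZADD` to record, `ZREMRANGEBYSCORE` to evict, `ZCARD` to count).

```python
import time
from collections import defaultdict, deque

HOUR, DAY = 3600, 86400

# Per-user limits from the list above: (window in seconds, max sends).
LIMITS = {
    "push": [(HOUR, 5), (DAY, 20)],
    "sms": [(HOUR, 3), (DAY, 5)],
    "email": [(DAY, 10)],
}


class SlidingWindowLimiter:
    """In-memory sliding-window log; stands in for a Redis-backed limiter."""

    def __init__(self, limits=LIMITS):
        self._limits = limits
        # (user_id, channel) -> timestamps of accepted sends, oldest first.
        self._sent = defaultdict(deque)

    def allow(self, user_id, channel, now=None):
        """Record and permit a send unless any window is already full."""
        now = time.time() if now is None else now
        log = self._sent[(user_id, channel)]
        windows = self._limits.get(channel, [])
        # Evict entries older than the longest window for this channel.
        longest = max((w for w, _ in windows), default=0)
        while log and log[0] <= now - longest:
            log.popleft()
        # Reject if any window already holds its maximum number of sends.
        for window, limit in windows:
            in_window = sum(1 for t in log if t > now - window)
            if in_window >= limit:
                return False
        log.append(now)
        return True


limiter = SlidingWindowLimiter()
results = [limiter.allow("u1", "push", now=10_000 + i) for i in range(6)]
print(results)
```

Unlike a fixed-window counter, the sliding window does not reset at the top of the hour, so a user cannot receive a burst at the end of one window followed by another burst at the start of the next.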

5. Channel Deep Dives

Push Notifications
Providers:
  • APNS (iOS)
  • FCM (Android/Web)
Flow:
  1. Get device token from user record
  2. Build platform-specific payload
  3. Send to APNS/FCM
  4. Handle delivery receipts
Challenges:
  • Token refresh
  • Silent vs alert
  • Payload size limits
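
Step 2 of the push flow and the "silent vs alert" and "payload size" challenges can be sketched as follows. The payload shapes follow the publicly documented APNs `aps` dictionary and the FCM HTTP v1 `message` object; the helper names are hypothetical, and the 4096-byte limit is an assumption that should be checked against current provider documentation.

```python
import json

# Assumed size limit for a regular push payload; verify with the provider.
MAX_PAYLOAD_BYTES = 4096


def build_apns_payload(title: str, body: str, silent: bool = False) -> dict:
    """Build an APNs payload; silent pushes carry no alert, only a wake-up flag."""
    if silent:
        return {"aps": {"content-available": 1}}
    return {"aps": {"alert": {"title": title, "body": body}, "sound": "default"}}


def build_fcm_message(token: str, title: str, body: str, silent: bool = False) -> dict:
    """Build an FCM HTTP v1 request body for a single device token."""
    message = {"token": token}
    if silent:
        # Data-only message: delivered to the app without a system alert.
        message["data"] = {"title": title, "body": body}
    else:
        message["notification"] = {"title": title, "body": body}
    return {"message": message}


def check_size(payload: dict) -> int:
    """Return the encoded size in bytes, or raise if it exceeds the limit."""
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    size = len(encoded.encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {size} bytes, limit is {MAX_PAYLOAD_BYTES}")
    return size


apns = build_apns_payload("Order shipped", "Order ORD-789 is on its way.")
fcm = build_fcm_message("device-token", "Order shipped", "Order ORD-789 is on its way.")
print(check_size(apns), check_size(fcm))
```

Checking the size before sending lets the worker truncate the body or fall back to a shorter template instead of having the provider reject the request.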
SMS
Providers:
  • Twilio
  • AWS SNS
  • Plivo
Flow:
  1. Validate phone number format
  2. Select provider based on region/cost
  3. Send via provider API
  4. Handle delivery status webhooks
Challenges:
  • Cost (expensive)
  • Country regulations
  • Carrier filtering
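
Steps 1 and 2 of the SMS flow can be sketched as a format check followed by a cheapest-provider lookup. The E.164 pattern (a plus sign, then up to 15 digits with a non-zero leading digit) is standard, but the `RELATIVE_COST` table contains made-up placeholder values, and `is_valid_e164` and `select_provider` are hypothetical helper names.

```python
import re

# E.164: "+" followed by 2 to 15 digits, the first of which is not zero.
E164 = re.compile(r"\+[1-9]\d{1,14}")

# Placeholder relative costs per region; NOT real provider pricing.
RELATIVE_COST = {
    "twilio": {"US": 2, "IN": 3},
    "sns": {"US": 1, "IN": 4},
    "plivo": {"IN": 2},
}


def is_valid_e164(number: str) -> bool:
    """Step 1: validate the phone number format."""
    return E164.fullmatch(number) is not None


def select_provider(region: str, unavailable=frozenset()) -> str:
    """Step 2: pick the cheapest available provider that serves the region."""
    candidates = [
        (costs[region], name)
        for name, costs in RELATIVE_COST.items()
        if region in costs and name not in unavailable
    ]
    if not candidates:
        raise LookupError(f"no SMS provider available for region {region!r}")
    return min(candidates)[1]


print(is_valid_e164("+14155550123"), select_provider("US"))
```

The `unavailable` parameter gives the worker a simple failover path: when a provider's circuit breaker is open, it is excluded and the next-cheapest provider is chosen.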
Email
Providers:
  • AWS SES
  • SendGrid
  • Mailgun
Flow:
  1. Build HTML/text email from template
  2. Add tracking pixels and links
  3. Send via email provider
  4. Handle bounces and complaints
Challenges:
  • Spam filtering
  • Sender reputation
  • Bounce handling
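
Step 4 of the email flow protects sender reputation by not mailing addresses that bounce or complain. A minimal sketch of a suppression list follows; the event names, the `SOFT_BOUNCE_LIMIT` threshold, and the case-insensitive address matching are assumptions, and real providers report these events through their own webhook formats.

```python
from collections import Counter

# Hypothetical threshold: suppress after this many consecutive soft bounces.
SOFT_BOUNCE_LIMIT = 3


class SuppressionList:
    """Tracks addresses that should no longer receive email."""

    def __init__(self):
        self._suppressed = set()
        self._soft_bounces = Counter()

    def record(self, address: str, event: str) -> None:
        """Apply one provider event (assumed names) to the list."""
        address = address.strip().lower()
        if event in ("hard_bounce", "complaint"):
            # Permanent failure or spam complaint: stop sending immediately.
            self._suppressed.add(address)
        elif event == "soft_bounce":
            # Temporary failure: suppress only after repeated bounces.
            self._soft_bounces[address] += 1
            if self._soft_bounces[address] >= SOFT_BOUNCE_LIMIT:
                self._suppressed.add(address)
        elif event == "delivered":
            # A successful delivery resets the soft-bounce streak.
            self._soft_bounces.pop(address, None)
        else:
            raise ValueError(f"unknown event: {event!r}")

    def can_send(self, address: str) -> bool:
        return address.strip().lower() not in self._suppressed


suppression = SuppressionList()
suppression.record("a@example.com", "hard_bounce")
print(suppression.can_send("a@example.com"))
```

The Email Worker would call `can_send` before step 3 so that suppressed addresses never reach the provider.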

6. API Design

POST /api/v1/notifications/send
Send a notification
{
  "user_id": "12345",
  "template_id": "order_shipped",
  "channels": ["push", "email"],
  "data": {
    "order_id": "ORD-789",
    "tracking_url": "..."
  },
  "priority": "high"
}
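
A handler for this endpoint would typically validate the body before publishing it to the queue. The sketch below shows one possible validation step. The channel identifiers `sms` and `in_app`, the default priority of `normal`, and the names `SendRequest` and `parse_send_request` are assumptions; the example request only shows `push`, `email`, and `high`.

```python
import json
from dataclasses import dataclass

# Assumed identifiers for the four channels and three priority levels.
VALID_CHANNELS = {"push", "sms", "email", "in_app"}
VALID_PRIORITIES = {"critical", "high", "normal"}


@dataclass(frozen=True)
class SendRequest:
    user_id: str
    template_id: str
    channels: tuple
    data: dict
    priority: str


def parse_send_request(raw: str) -> SendRequest:
    """Parse and validate the JSON body of a send request."""
    body = json.loads(raw)
    for key in ("user_id", "template_id", "channels"):
        if key not in body:
            raise ValueError(f"missing field: {key}")
    if not isinstance(body["channels"], list) or not body["channels"]:
        raise ValueError("channels must be a non-empty list")
    unknown = set(body["channels"]) - VALID_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")
    # Assumption: requests without an explicit priority are treated as normal.
    priority = body.get("priority", "normal")
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    return SendRequest(
        user_id=str(body["user_id"]),
        template_id=str(body["template_id"]),
        channels=tuple(body["channels"]),
        data=dict(body.get("data", {})),
        priority=priority,
    )


request = parse_send_request(
    '{"user_id": "12345", "template_id": "order_shipped",'
    ' "channels": ["push", "email"],'
    ' "data": {"order_id": "ORD-789"}, "priority": "high"}'
)
print(request)
```

Rejecting malformed requests at the API keeps invalid messages out of Kafka, where they would otherwise fail repeatedly and end up in a dead letter queue.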
POST /api/v1/notifications/bulk
Send to multiple users (async)
GET /api/v1/users/:id/preferences
Get user notification preferences
PUT /api/v1/users/:id/preferences
Update notification preferences

7. Scaling Strategies

Horizontal Scaling
  • Add workers per channel independently
  • Kafka partitions for parallelism
  • Auto-scale based on queue depth
  • Regional deployments for latency
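
Auto-scaling on queue depth reduces to a small calculation: how many workers are needed to drain the current backlog within a target time. The function name, parameters, and default bounds below are hypothetical. With Kafka, consumers in a group beyond the topic's partition count sit idle, so `max_workers` would normally be capped at the number of partitions.

```python
import math


def desired_workers(
    queue_depth: int,
    per_worker_rate: float,
    target_drain_seconds: float,
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Workers needed to drain `queue_depth` messages within the target time.

    `per_worker_rate` is the number of messages one worker handles per second.
    """
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    # Clamp to the allowed range so the pool never scales to zero or runs away.
    return max(min_workers, min(max_workers, needed))


print(desired_workers(queue_depth=60_000, per_worker_rate=100, target_drain_seconds=60))
```

A tighter `target_drain_seconds` for the critical queue than for the normal queue is one way to turn the per-priority SLAs into scaling policy.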
Reliability
  • Idempotency keys prevent duplicates
  • Dead letter queues for failures
  • Retry with exponential backoff
  • Circuit breakers for providers
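
Three of the reliability techniques above (idempotency keys, retries with exponential backoff, and a dead letter queue) fit together in one delivery loop. The sketch below is illustrative only: `TransientError`, `deliver_once`, and the in-memory `seen` set and `dead_letters` list are hypothetical stand-ins for a provider error type, a shared key store, and a real DLQ. The random jitter on each delay is an addition not mentioned in the source.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Upper bound on the wait before each retry: base * 2**n, capped."""
    return [min(cap, base * 2**n) for n in range(max_attempts - 1)]


def deliver_once(key, send, seen, dead_letters, max_attempts=4, sleep=time.sleep):
    """Send at most once per idempotency key, retrying transient failures."""
    if key in seen:
        # Idempotency: this notification was already delivered.
        return "duplicate"
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            send()
        except TransientError:
            if attempt == max_attempts - 1:
                # Retries exhausted: park the message for later inspection.
                dead_letters.append(key)
                return "dead_lettered"
            # Full jitter: wait a random time up to the exponential bound.
            sleep(random.uniform(0, delays[attempt]))
        else:
            seen.add(key)
            return "sent"


seen, dead_letters, calls = set(), [], []


def flaky_send():
    """Fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("provider timeout")


outcome = deliver_once("order-789:push", flaky_send, seen, dead_letters, sleep=lambda s: None)
print(outcome, len(calls))
```

The key is recorded only after a successful send, so a crash mid-delivery leads to a retry rather than a lost notification.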
Cost Optimization
  • Batch SMS/email where possible
  • Use cheapest provider per region
  • Aggregate low-priority notifications
  • Time-shift non-urgent sends
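
Aggregating low-priority notifications can be as simple as grouping pending messages by user and sending one digest instead of many individual sends. The function name `build_digests` and the digest wording are hypothetical.

```python
from collections import defaultdict


def build_digests(pending):
    """Group (user_id, message) pairs into one digest body per user."""
    grouped = defaultdict(list)
    for user_id, message in pending:
        grouped[user_id].append(message)
    digests = {}
    for user_id, messages in grouped.items():
        if len(messages) == 1:
            # A single message is sent as-is rather than wrapped in a digest.
            digests[user_id] = messages[0]
        else:
            lines = "\n".join(f"- {m}" for m in messages)
            digests[user_id] = f"You have {len(messages)} updates:\n{lines}"
    return digests


digests = build_digests(
    [("u1", "New follower"), ("u2", "Weekly summary ready"), ("u1", "Price drop")]
)
print(digests["u1"])
```

Besides cutting provider cost, digests also count as a single send against the per-user rate limits.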
Observability
  • Track delivery rates per channel
  • Alert on high failure rates
  • End-to-end latency tracking
  • Provider health monitoring

8. Key Takeaways

  1. A message queue decouples notification triggers from delivery.
  2. Priority queues ensure critical notifications get delivered first.
  3. Rate limiting prevents notification fatigue and respects provider limits.
  4. Per-channel workers allow independent scaling and provider abstraction.
  5. Retries provide at-least-once delivery; idempotency keys suppress the duplicates that retries would otherwise create.
  6. User preferences ensure that opt-outs and channel selections are respected.