HLD Problem
Design Notification System
Design a scalable notification system supporting push notifications, SMS, email, and in-app messages with prioritization, rate limiting, and personalization.
35 min read · Medium
1. Requirements Gathering
Functional Requirements
- Send push notifications (iOS, Android, Web)
- Send SMS messages
- Send emails (transactional & marketing)
- In-app notifications
- User notification preferences
- Template management
- Scheduling (send later)
- Notification history
- Analytics and tracking
Non-Functional Requirements
- High availability (99.99%)
- Low latency for critical notifications (< 1s)
- At-least-once delivery guarantee
- Handle 10M+ notifications/minute
- Rate limiting per user/channel
- Soft real-time (eventual delivery)
- Cost optimization
2. Capacity Estimation
Scale Numbers
- Notifications/Day: 500M
- Peak/Minute: 10M
- Users: 100M
- Channels: 4
Traffic Breakdown by Channel
- Push Notifications: 60% (~300M/day)
- In-App: 25% (~125M/day)
- Email: 12% (~60M/day)
- SMS: 3% (~15M/day)
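The scale numbers above can be turned into per-second rates with a few lines of arithmetic. This is a rough sketch that only uses the figures already stated; the variable names are illustrative.

```python
# Back-of-the-envelope rates derived from the scale numbers above.
DAILY_NOTIFICATIONS = 500_000_000
PEAK_PER_MINUTE = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60

CHANNEL_SHARE = {"push": 0.60, "in_app": 0.25, "email": 0.12, "sms": 0.03}

avg_per_second = DAILY_NOTIFICATIONS / SECONDS_PER_DAY   # about 5,787/s
peak_per_second = PEAK_PER_MINUTE / 60                   # about 166,667/s
peak_to_avg = peak_per_second / avg_per_second           # 28.8x

# Daily volume per channel, matching the traffic breakdown.
per_channel_daily = {
    channel: round(DAILY_NOTIFICATIONS * share)
    for channel, share in CHANNEL_SHARE.items()
}
```

The peak is roughly 29 times the daily average, which is why the queue and the workers are sized for bursts rather than for the mean rate.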
3. High-Level Architecture
System Architecture
- User Actions
- Scheduled Jobs
- System Events
- External APIs
↓
Notification API: REST/gRPC endpoints
↓
Message Queue (Kafka): Decouple producers & consumers
↓
- Notification Service: Orchestration
- Preference Service: User settings
- Template Service: Message templates
- Rate Limiter: Throttling
↓
Channel Workers
- Push Worker: APNS, FCM
- SMS Worker: Twilio, SNS
- Email Worker: SES, SendGrid
- In-App Worker: WebSocket
↓
- PostgreSQL: Users, prefs
- Redis: Rate limits, cache
- Cassandra: Notification logs
- S3: Templates, media
4. Core Components Deep Dive
4.1 Notification Flow
1. Trigger received: Event triggers notification (order placed, message received)
2. Publish to queue: Notification request added to Kafka topic
3. Check preferences: Query user preferences - which channels enabled?
4. Rate limiting: Check if user hit notification limits
5. Render template: Inject dynamic data into message template
6. Route to channels: Send to appropriate channel workers
7. Deliver: Each worker sends to external provider (APNS, Twilio, etc.)
8. Log result: Store delivery status for analytics
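Steps 3 to 6 of the flow can be sketched as a single orchestration function. This is a minimal illustration, not a reference implementation: `PREFERENCES`, `TEMPLATES`, `process`, and the `allow` callback are hypothetical in-memory stand-ins for the Preference Service, Template Service, and Rate Limiter.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class NotificationRequest:
    user_id: str
    template_id: str
    channels: list
    data: dict
    priority: str = "normal"


# Hypothetical in-memory stand-ins for the Preference and Template services.
PREFERENCES = {"12345": {"push": True, "email": False}}
TEMPLATES = {"order_shipped": Template("Order $order_id has shipped.")}


def process(request, allow, outbox):
    """Run steps 3-6 for one request; return the channels it was routed to."""
    enabled = PREFERENCES.get(request.user_id, {})
    routed = []
    for channel in request.channels:
        if not enabled.get(channel, False):       # step 3: preference check
            continue
        if not allow(request.user_id, channel):   # step 4: rate limiting
            continue
        # step 5: inject dynamic data into the message template
        body = TEMPLATES[request.template_id].safe_substitute(request.data)
        # step 6: hand off to the per-channel worker queue
        outbox.setdefault(channel, []).append((request.user_id, body))
        routed.append(channel)
    return routed
```

Here `outbox` plays the role of the per-channel worker queues. In this sketch a request for `["push", "email"]` from user `12345` is routed to push only, because email is disabled in the stand-in preferences.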
4.2 Priority Handling
Critical
- OTP codes
- Security alerts
- Payment confirmations
SLA: < 30 seconds
High
- Order updates
- Direct messages
- Mentions
SLA: < 1 minute
Normal
- Marketing
- Recommendations
- Weekly digests
SLA: < 5 minutes
Separate Queues by Priority
Use a dedicated Kafka topic or SQS queue for each priority level. The critical queue gets more consumer instances so that it drains faster than the others.
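One way to picture the per-priority split is a routing table plus a weighted consumer allocation. The topic names and the 5:3:2 weighting below are assumptions made for illustration; the source does not specify them.

```python
# Hypothetical topic names, one per priority level.
TOPIC_BY_PRIORITY = {
    "critical": "notifications.critical",
    "high": "notifications.high",
    "normal": "notifications.normal",
}

# Assumed weighting: critical traffic gets the largest share of consumers.
CONSUMER_WEIGHT = {"critical": 5, "high": 3, "normal": 2}


def topic_for(priority):
    """Map a priority to its topic; unknown priorities fall back to normal."""
    return TOPIC_BY_PRIORITY.get(priority, TOPIC_BY_PRIORITY["normal"])


def allocate_consumers(total):
    """Split consumer instances by weight, with at least one per priority.

    Integer division may leave a few instances unassigned when `total`
    is not a multiple of the weight sum.
    """
    weight_sum = sum(CONSUMER_WEIGHT.values())
    return {
        priority: max(1, total * weight // weight_sum)
        for priority, weight in CONSUMER_WEIGHT.items()
    }
```

With 20 consumer instances this gives 10 for critical, 6 for high, and 4 for normal.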
4.3 Rate Limiting
Per User Limits:
- Push: Max 5/hour, 20/day
- SMS: Max 3/hour, 5/day
- Email: Max 10/day
Global Limits:
- SMS provider rate limits
- Email sending reputation
- Push provider quotas
Use Redis with a sliding-window algorithm for distributed rate limiting.
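The sliding-window idea can be sketched in memory using the per-user limits listed above. This is a single-process illustration only. In a Redis-backed version the per-key log would typically be a sorted set, trimmed with `ZREMRANGEBYSCORE` and counted with `ZCARD`, and that mapping is an assumption rather than something the source prescribes. The class and constant names are hypothetical.

```python
from collections import defaultdict, deque

HOUR, DAY = 3600, 86400

# Per-user limits from the list above: (window_seconds, max_events).
LIMITS = {
    "push": [(HOUR, 5), (DAY, 20)],
    "sms": [(HOUR, 3), (DAY, 5)],
    "email": [(DAY, 10)],
}


class SlidingWindowLimiter:
    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.events = defaultdict(deque)  # (user_id, channel) -> send timestamps

    def allow(self, user_id, channel, now):
        """Return True and record the send if every window has room."""
        log = self.events[(user_id, channel)]
        windows = self.limits.get(channel, [])
        longest = max((window for window, _ in windows), default=0)
        # Drop timestamps that fall outside the longest window.
        while log and log[0] <= now - longest:
            log.popleft()
        for window, max_events in windows:
            recent = sum(1 for t in log if t > now - window)
            if recent >= max_events:
                return False
        log.append(now)
        return True
```

With the SMS limits, three sends in quick succession are allowed and a fourth in the same hour is rejected. Once the hour window slides past the earliest send, another send is allowed, as long as the daily cap of five has not been reached.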
5. Channel Deep Dives
Push Notifications
Providers:
- APNS (iOS)
- FCM (Android/Web)
Flow:
- Get device token from user record
- Build platform-specific payload
- Send to APNS/FCM
- Handle delivery receipts
Challenges:
- Token refresh
- Silent vs alert
- Payload size limits
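The "build platform-specific payload" step and the payload size challenge can be sketched together. The payload shapes follow the general APNS (`aps` dictionary) and FCM (`notification` plus `data`) conventions, but the 4096-byte limit and the function name are assumptions to verify against current provider documentation.

```python
import json

MAX_PAYLOAD_BYTES = 4096  # assumed limit; check the provider documentation


def build_push_payload(platform, title, body, data=None):
    """Return the encoded payload for one platform, or raise if it is too large."""
    data = data or {}
    if platform == "ios":
        # APNS-style: display fields under "aps", custom keys beside it.
        payload = {"aps": {"alert": {"title": title, "body": body}}, **data}
    elif platform in ("android", "web"):
        # FCM-style: separate notification and data sections.
        payload = {"notification": {"title": title, "body": body}, "data": data}
    else:
        raise ValueError(f"unknown platform: {platform}")
    encoded = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(encoded) > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"payload is {len(encoded)} bytes, over the {MAX_PAYLOAD_BYTES}-byte limit"
        )
    return encoded
```

Checking the size before sending lets the worker truncate or reject the message locally instead of spending a provider call on a request that would fail.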
SMS
Providers:
- Twilio
- AWS SNS
- Plivo
Flow:
- Validate phone number format
- Select provider based on region/cost
- Send via provider API
- Handle delivery status webhooks
Challenges:
- Cost (expensive)
- Country regulations
- Carrier filtering
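The first two steps of the SMS flow, validating the number and selecting a provider by region and cost, might look like the sketch below. The prices are made-up illustrative values, not real provider rates, and `PRICES` and `pick_provider` are hypothetical names.

```python
import re

# E.164: a plus sign, then up to 15 digits with a non-zero first digit.
E164 = re.compile(r"\+[1-9]\d{1,14}")

# Made-up per-message prices, keyed by provider, then country calling code.
PRICES = {
    "twilio": {"1": 0.008, "91": 0.08},
    "sns": {"1": 0.006, "91": 0.09},
    "plivo": {"1": 0.007, "91": 0.07},
}


def pick_provider(phone, healthy=frozenset(PRICES)):
    """Return the cheapest healthy provider that covers the number's region."""
    if not E164.fullmatch(phone):
        raise ValueError(f"not an E.164 number: {phone!r}")
    digits = phone[1:]
    candidates = []
    for provider, table in PRICES.items():
        if provider not in healthy:
            continue
        # The longest matching calling-code prefix wins for this provider.
        codes = [code for code in table if digits.startswith(code)]
        if codes:
            candidates.append((table[max(codes, key=len)], provider))
    if not candidates:
        raise LookupError(f"no provider covers {phone}")
    return min(candidates)[1]
```

Passing a reduced `healthy` set is one way to combine cost-based routing with provider failover: when the cheapest provider is marked unhealthy, the next cheapest is chosen.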
Email
Providers:
- AWS SES
- SendGrid
- Mailgun
Flow:
- Build HTML/text email from template
- Add tracking pixels and links
- Send via email provider
- Handle bounces and complaints
Challenges:
- Spam filtering
- Sender reputation
- Bounce handling
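Bounce and complaint handling usually feeds a suppression list that the email worker consults before sending. The sketch below assumes a simple policy: suppress on a hard bounce or a complaint, and after three consecutive soft bounces. The event names and the threshold are illustrative, not any provider's webhook format.

```python
from collections import Counter

SOFT_BOUNCE_LIMIT = 3  # assumed threshold


class SuppressionList:
    def __init__(self):
        self.suppressed = set()
        self.soft_bounces = Counter()

    def record(self, address, event):
        """event: "hard_bounce", "soft_bounce", "complaint", or "delivered"."""
        address = address.lower()  # simplification: treat addresses as case-insensitive
        if event in ("hard_bounce", "complaint"):
            self.suppressed.add(address)
        elif event == "soft_bounce":
            self.soft_bounces[address] += 1
            if self.soft_bounces[address] >= SOFT_BOUNCE_LIMIT:
                self.suppressed.add(address)
        elif event == "delivered":
            # A successful delivery resets the consecutive soft-bounce count.
            self.soft_bounces.pop(address, None)

    def can_send(self, address):
        return address.lower() not in self.suppressed
```

Keeping suppressed addresses out of future sends is what protects the sender reputation listed among the challenges.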
6. API Design
POST /api/v1/notifications/send
Send a notification
{
  "user_id": "12345",
  "template_id": "order_shipped",
  "channels": ["push", "email"],
  "data": {
    "order_id": "ORD-789",
    "tracking_url": "..."
  },
  "priority": "high"
}
POST /api/v1/notifications/bulk
Send to multiple users (async)
GET /api/v1/users/:id/preferences
Get user notification preferences
PUT /api/v1/users/:id/preferences
Update notification preferences
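A server-side validator for the send request body could look like the sketch below. The allowed channel and priority names are assumptions inferred from the rest of the design. The request example only shows `push`, `email`, and `high`, and `in_app` in particular is a guess at the wire name.

```python
import json

# Assumed vocabularies; only "push", "email", and "high" appear in the example.
ALLOWED_CHANNELS = {"push", "sms", "email", "in_app"}
ALLOWED_PRIORITIES = {"critical", "high", "normal"}


def parse_send_request(raw):
    """Validate a JSON body for the send endpoint and return a normalized dict."""
    body = json.loads(raw)
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    errors = []
    for key in ("user_id", "template_id"):
        value = body.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"{key} must be a non-empty string")
    channels = body.get("channels")
    if not isinstance(channels, list) or not channels:
        errors.append("channels must be a non-empty list")
    else:
        unknown = [
            c for c in channels
            if not isinstance(c, str) or c not in ALLOWED_CHANNELS
        ]
        if unknown:
            errors.append(f"unknown channels: {unknown!r}")
    priority = body.get("priority", "normal")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        errors.append(f"unknown priority: {priority!r}")
    data = body.get("data", {})
    if not isinstance(data, dict):
        errors.append("data must be an object")
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "user_id": body["user_id"],
        "template_id": body["template_id"],
        "channels": channels,
        "data": data,
        "priority": priority,
    }
```

Rejecting malformed requests at the API keeps bad messages out of the queue, where they would otherwise fail later in a channel worker.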
7. Scaling Strategies
Horizontal Scaling
- Add workers per channel independently
- Kafka partitions for parallelism
- Auto-scale based on queue depth
- Regional deployments for latency
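Auto-scaling on queue depth can be reduced to a small sizing formula. The sketch assumes each worker processes a roughly constant number of messages per second and that the backlog should drain within a target time. The worker count is capped at the partition count, because consumers beyond that in a Kafka consumer group would sit idle. The function and parameter names are hypothetical.

```python
import math


def desired_workers(queue_depth, per_worker_rate, drain_seconds, partitions, min_workers=1):
    """Workers needed to drain `queue_depth` messages within `drain_seconds`.

    per_worker_rate is messages per second per worker (assumed constant).
    """
    needed = math.ceil(queue_depth / (per_worker_rate * drain_seconds))
    return max(min_workers, min(needed, partitions))
```

For example, a backlog of 120,000 messages at 200 messages per second per worker and a 60-second drain target needs 10 workers. A backlog of 10 million would need 834, so on a 64-partition topic the result is capped at 64 and the partition count becomes the bottleneck.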
Reliability
- Idempotency keys prevent duplicates
- Dead letter queues for failures
- Retry with exponential backoff
- Circuit breakers for providers
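Idempotency keys, retries with exponential backoff, and the dead letter queue fit together as sketched below. `TransientError`, `deliver_once`, and the in-memory `seen` set are hypothetical; in a real deployment the set of seen keys would live in a shared store. The delays use full jitter, which is one common choice and not something the source specifies.

```python
import random
import time


class TransientError(Exception):
    """Raised by a provider call that is safe to retry (hypothetical)."""


def backoff_delays(attempts, base=0.5, cap=30.0):
    """Delay ceilings in seconds: base * 2**n, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def deliver_once(key, send, seen, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Send unless `key` was already delivered; retry transient failures."""
    if key in seen:
        return "duplicate"
    delays = backoff_delays(max_attempts)
    for attempt, ceiling in enumerate(delays, start=1):
        try:
            send()
        except TransientError:
            if attempt < max_attempts:
                sleep(rng() * ceiling)  # full jitter: random delay up to the ceiling
            continue
        seen.add(key)
        return "sent"
    return "dead_letter"  # caller moves the message to the dead letter queue
```

Note that the key is recorded only after a successful send, so a crash between the send and the record can still produce a duplicate. Passing the same idempotency key to the provider, where it is supported, is one way to narrow that gap.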
Cost Optimization
- Batch SMS/email where possible
- Use cheapest provider per region
- Aggregate low-priority notifications
- Time-shift non-urgent sends
Observability
- Track delivery rates per channel
- Alert on high failure rates
- End-to-end latency tracking
- Provider health monitoring
8. Key Takeaways
1. Message queue decouples notification triggers from delivery.
2. Priority queues ensure critical notifications get delivered first.
3. Rate limiting prevents notification fatigue and respects provider limits.
4. Per-channel workers allow independent scaling and provider abstraction.
5. Retries provide at-least-once delivery; idempotency keys suppress the duplicates that retries would otherwise create.
6. Checking user preferences ensures that opt-outs and channel selections are respected.