Notification System
- Summary generated by ChatGPT when I was an interviewee
Problem Statement
Design a Notification System that can send notifications (Email, SMS, Push) to users in real-time.
Clarifications
| Question | Answer |
|---|---|
| Supported Channels | Email, SMS, Push Notifications |
| Mode | Real-time |
| Scale | 10M notifications/day, ~2K/sec peak |
| Retry Logic | Yes, with DLQ for failed retries |
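
To sanity-check the Scale row, here is a quick back-of-the-envelope calculation. It only uses the two figures from the table; the variable names are mine. 10M/day works out to roughly 116 notifications/sec on average, so a 2K/sec peak is about 17x the average, which is the burst the queue has to absorb.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_volume = 10_000_000  # notifications/day, from the Scale row
peak_rate = 2_000          # notifications/sec at peak, from the Scale row

average_rate = daily_volume / SECONDS_PER_DAY
peak_to_average = peak_rate / average_rate

print(f"average: {average_rate:.1f}/sec")
print(f"peak is {peak_to_average:.1f}x the average")
```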
High-Level Flow
- Trigger Event: Some user or system action triggers a notification event.
- Notification Service: Receives the event and validates user preferences.
- Publish to Queue: Notification service publishes the message to Kafka/RabbitMQ.
- Consumers: Channel-specific workers (Email/SMS/Push) consume messages.
- Delivery: Workers send notifications through external providers (e.g., SendGrid, Twilio).
- Retries/DLQ: Failed attempts are retried or moved to a Dead Letter Queue.
High-Level Architecture
Core Components:
- Notification Service: Handles validation, preference check, and publishing.
- Kafka/RabbitMQ: Message broker for decoupled communication.
- Channel Workers: Email, SMS, Push consumers.
- External Providers: SendGrid, Twilio, Firebase, etc.
- DLQ: For failed messages and manual reprocessing.
- Redis: For caching idempotency keys and user preferences.
- Monitoring Layer: Grafana/Prometheus for metrics and alerts.
Flow:
Trigger → Notification Service → Kafka → Channel Workers → Providers → User
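
The flow can be sketched end to end in a few lines. This is a minimal in-process sketch, not a real deployment: `InMemoryBroker` stands in for Kafka/RabbitMQ, the provider is just a callable, and all class and function names are hypothetical. It also assumes users with no stored preferences receive nothing, which is a policy choice the notes do not specify.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Notification:
    user_id: str
    channel: str  # "EMAIL", "SMS", or "PUSH"
    template: str
    data: dict = field(default_factory=dict)


class InMemoryBroker:
    """Stand-in for Kafka/RabbitMQ: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None


class NotificationService:
    def __init__(self, broker, preferences):
        self.broker = broker
        self.preferences = preferences  # user_id -> set of enabled channels

    def handle(self, notification):
        """Check preferences, then publish to the channel's topic."""
        enabled = self.preferences.get(notification.user_id, set())
        if notification.channel not in enabled:
            return False  # user opted out (or unknown); nothing is published
        self.broker.publish(notification.channel, notification)
        return True


def run_worker(broker, channel, send):
    """Drain one channel's queue, handing each message to a provider callable."""
    delivered = 0
    while (message := broker.poll(channel)) is not None:
        send(message)
        delivered += 1
    return delivered


broker = InMemoryBroker()
service = NotificationService(broker, {"123": {"EMAIL", "PUSH"}})
accepted = service.handle(Notification("123", "EMAIL", "ORDER_SHIPPED", {"orderId": "A123"}))
rejected = service.handle(Notification("123", "SMS", "ORDER_SHIPPED"))

outbox = []  # stand-in for an external provider such as SendGrid
delivered = run_worker(broker, "EMAIL", outbox.append)
```

The point of the split is that the service only validates and publishes; the workers own delivery, so a slow provider backs up one channel's queue without blocking the others.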
Reliability & Fault Tolerance
- Retries: Implement retry logic with exponential backoff.
- DLQ: Store permanently failed messages for manual handling.
- Idempotency: Use a Redis-based `eventId` or `messageId` with a TTL to prevent duplicate notifications.
- Backup Providers: Fall back to alternate providers when the primary fails.
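
The retry, fallback, and DLQ bullets can be combined into one delivery loop. This is a rough sketch under assumptions: providers are plain callables that raise a hypothetical `ProviderError` on failure, the DLQ is a Python list, and the attempt count, base delay, and cap are illustrative values rather than numbers from the notes.

```python
import time


class ProviderError(Exception):
    """Raised by a provider callable when a send attempt fails."""


def backoff_delays(base_seconds, max_attempts, cap_seconds=30.0):
    """Delay before each retry: base, 2*base, 4*base, ... capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(max_attempts - 1)]


def deliver(message, providers, dead_letters, max_attempts=3,
            base_seconds=0.5, sleep=time.sleep):
    """Try each provider in order, retrying with backoff; dead-letter if all fail."""
    delays = backoff_delays(base_seconds, max_attempts)
    last_error = None
    for name, send in providers:
        for attempt in range(max_attempts):
            try:
                send(message)
                return name  # name of the provider that succeeded
            except ProviderError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(delays[attempt])
    # Every provider exhausted its attempts: park the message for manual handling.
    dead_letters.append({"message": message, "error": str(last_error)})
    return None


def flaky_primary(message):
    raise ProviderError("primary unavailable")


sent, dlq, waits = [], [], []
used = deliver(
    "hello",
    [("primary", flaky_primary), ("backup", sent.append)],
    dlq,
    sleep=waits.append,  # record the delays instead of actually sleeping
)
```

Here the primary fails three times (waiting 0.5s, then 1.0s between attempts), the backup succeeds on its first try, and nothing lands in the DLQ. In practice some random jitter is usually added to the delays so that many workers do not retry in lockstep.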
Scalability
- Use Kafka partitioning (by `userId` or `notificationType`) for parallel consumption.
- Scale workers horizontally based on partitions.
- Use idempotency keys to prevent duplicate sends when consumers crash and recover.
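
A small sketch of why partition count bounds worker parallelism. The hash here is CRC32, chosen only because it is stable across runs; it is a stand-in and not the hash Kafka's own partitioner uses. The function names and the round-robin assignment are assumptions for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, workers):
    """Round-robin partitions across workers; surplus workers get nothing."""
    assignment = {worker: [] for worker in workers}
    for partition in range(num_partitions):
        assignment[workers[partition % len(workers)]].append(partition)
    return assignment


# The same userId always lands on the same partition, so one user's
# notifications are consumed in order by a single worker.
same_partition = partition_for("user-123", 6) == partition_for("user-123", 6)

balanced = assign_partitions(4, ["w1", "w2"])
over_provisioned = assign_partitions(2, ["w1", "w2", "w3"])
```

With 2 partitions and 3 workers, `w3` is assigned nothing, which is the practical meaning of "scale workers based on partitions": adding workers beyond the partition count does not add throughput. Partitioning by `notificationType` gives far fewer distinct keys than `userId`, so it is more prone to hot partitions.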
Idempotency Design
- Storage: Redis
- Key: `eventId` or `userId:templateType`
- TTL: ~24 hours to avoid unbounded growth
- Locking: Use Redis locks to handle concurrent sends
- Cleanup: Expire automatically via TTL or cron job
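
The key-plus-TTL idea boils down to "set if absent, with expiry", which in Redis is `SET key value NX EX seconds`. The sketch below mimics that behavior with an in-memory dict so it runs without a Redis server; `IdempotencyStore`, `claim`, and `send_once` are hypothetical names, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl`."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires_at = {}  # key -> expiry timestamp

    def claim(self, key):
        """Return True only for the first caller to claim `key` within the TTL."""
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # key still live: someone already handled this event
        self._expires_at[key] = now + self.ttl_seconds
        return True


def send_once(store, event_id, send, message):
    """Send the message unless this event_id was already claimed."""
    if not store.claim(event_id):
        return False  # duplicate: skip the send
    send(message)
    return True


now = [0.0]  # fake clock so expiry can be shown without waiting
store = IdempotencyStore(ttl_seconds=10, clock=lambda: now[0])
sent = []

first = send_once(store, "evt-1", sent.append, "order shipped")
duplicate = send_once(store, "evt-1", sent.append, "order shipped")
now[0] = 11.0  # move past the TTL
after_expiry = send_once(store, "evt-1", sent.append, "order shipped")
```

One caveat with claiming before sending: if the send then fails, the key blocks the retry until the TTL expires. A fuller design would release the key on failure or store a status (pending/sent) alongside it. Because a single set-if-absent call is atomic in Redis, it may cover the concurrent-send case without a separate lock.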
Monitoring & Metrics
| Metric | Component | Description |
|---|---|---|
| Success Rate | All | Percentage of successful sends |
| Error Rate | All | Failure count / total attempts |
| Latency | Service + Worker | Time from trigger to delivery |
| Queue Depth | Kafka | Number of pending messages |
| Retry Count | DLQ | Number of retried messages |
| CPU/Memory | All | System health of services |
Alerting:
- Alert if the error rate exceeds 5% over a 5-minute window.
- Alert if queue depth > threshold.
- Alert on DLQ growth or worker unresponsiveness.
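
The error-rate rule is a sliding-window calculation. In a real setup this would most likely live in the monitoring stack as an alert rule rather than in application code; the class below is only a sketch of the logic, with hypothetical names and the 5% / 5-minute values taken from the rule above.

```python
from collections import deque


class ErrorRateAlert:
    def __init__(self, threshold=0.05, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, succeeded), oldest first

    def record(self, timestamp, succeeded):
        """Record one send attempt and drop attempts that left the window."""
        self._events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now):
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def error_rate(self, now):
        """Failures divided by total attempts within the window ending at `now`."""
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self, now):
        return self.error_rate(now) > self.threshold


alert = ErrorRateAlert()
for t in range(94):
    alert.record(t, True)   # 94 successful sends
for t in range(94, 100):
    alert.record(t, False)  # 6 failed sends

firing = alert.should_alert(100)    # 6/100 = 6% within the window
recovered = alert.should_alert(1000)  # all attempts have aged out
```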
Extensibility - User Preferences
- Store preferences in persistent DB (Postgres/DynamoDB).
- Cache in Redis using a `userId → preferences` hash.
- On updates, invalidate the Redis entry or refresh it via an update event.
- Apply preference checks before publishing to Kafka.
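
The cache-plus-invalidate pattern looks roughly like this. It is a sketch only: a dict stands in for the Redis hash, another dict stands in for Postgres/DynamoDB, and `PreferenceCache` and `update_preferences` are hypothetical names. The `db_reads` counter is there just to show when the database is actually hit.

```python
class PreferenceCache:
    """Cache-aside lookup: read the cache first, fall back to the database."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # callable: user_id -> set of channels
        self._cache = {}                   # stand-in for the Redis hash
        self.db_reads = 0

    def get(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._load_from_db(user_id)
            self.db_reads += 1
        return self._cache[user_id]

    def invalidate(self, user_id):
        self._cache.pop(user_id, None)


database = {"123": {"EMAIL", "PUSH"}}  # stand-in for the persistent store
cache = PreferenceCache(lambda user_id: set(database.get(user_id, set())))


def update_preferences(user_id, channels):
    """Write to the database first, then drop the stale cache entry."""
    database[user_id] = set(channels)
    cache.invalidate(user_id)


before = cache.get("123")   # miss: loaded from the database
cache.get("123")            # hit: no database read
update_preferences("123", {"SMS"})
after = cache.get("123")    # miss again, because the entry was invalidated
```

Invalidating rather than overwriting the cache on update keeps the write path simple; the next read repopulates the entry from the source of truth.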
Optional API Contract Example
POST /notify
{
  "userId": "123",
  "type": "EMAIL",
  "template": "ORDER_SHIPPED",
  "data": { "orderId": "A123" }
}
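
A possible way to validate this payload before it reaches the queue. The field names come from the example above; the rules themselves (required non-empty strings, `type` limited to the three supported channels, `data` optional) are my assumptions, as are the names `NotifyRequest` and `parse_notify_request`.

```python
import json
from dataclasses import dataclass, field

ALLOWED_TYPES = {"EMAIL", "SMS", "PUSH"}  # the supported channels


@dataclass(frozen=True)
class NotifyRequest:
    user_id: str
    type: str
    template: str
    data: dict = field(default_factory=dict)


def parse_notify_request(body):
    """Parse and validate a POST /notify body; raise ValueError if it is invalid."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    for name in ("userId", "type", "template"):
        value = payload.get(name)
        if not isinstance(value, str) or not value:
            raise ValueError(f"'{name}' must be a non-empty string")
    if payload["type"] not in ALLOWED_TYPES:
        raise ValueError(f"'type' must be one of {sorted(ALLOWED_TYPES)}")
    data = payload.get("data", {})
    if not isinstance(data, dict):
        raise ValueError("'data' must be an object")
    return NotifyRequest(payload["userId"], payload["type"], payload["template"], data)


example_body = """
{
  "userId": "123",
  "type": "EMAIL",
  "template": "ORDER_SHIPPED",
  "data": { "orderId": "A123" }
}
"""
request = parse_notify_request(example_body)
```

Rejecting bad requests at the API edge keeps malformed messages out of the queue, so the DLQ only collects genuine delivery failures.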
NFRs
| Requirement | Description |
|---|---|
| Availability | High; keep accepting and queueing requests, since a short delivery delay is tolerable |
| Latency | Less than 3 seconds for real-time delivery |
| Durability | Guaranteed message persistence via Kafka |
| Scalability | Horizontally scalable consumers |
| Reliability | Retry + DLQ + backup provider |