Notification System

  • Summary generated by ChatGPT when I was the interviewee

🧩 Problem Statement

Design a Notification System that can send notifications (Email, SMS, Push) to users in real-time.

🧠 Clarifications

| Question           | Answer                               |
| ------------------ | ------------------------------------ |
| Supported Channels | Email, SMS, Push Notifications       |
| Mode               | Real-time                            |
| Scale              | 10M notifications/day, ~2K/sec peak  |
| Retry Logic        | Yes, with DLQ for failed retries     |
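
Rough capacity check: 10M notifications/day averages to about 10,000,000 / 86,400 ≈ 116/sec, so the stated ~2K/sec peak is roughly a 17x burst over the average that the queue and workers must be sized to absorb.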

๐Ÿ—๏ธ High-Level Flowโ€‹

  1. Trigger Event: Some user or system action triggers a notification event.
  2. Notification Service: Receives the event and validates user preferences.
  3. Publish to Queue: Notification service publishes the message to Kafka/RabbitMQ.
  4. Consumers: Channel-specific workers (Email/SMS/Push) consume messages.
  5. Delivery: Workers send notifications through external providers (e.g., SendGrid, Twilio).
  6. Retries/DLQ: Failed attempts are retried or moved to a Dead Letter Queue.
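
A minimal sketch of steps 2–3 (preference check and publish), assuming the kafka-python client; the topic naming scheme and the check_preferences() helper are illustrative assumptions, not fixed parts of the design.

```python
import json
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_trigger_event(user_id: str, channel: str, template: str, data: dict):
    # Step 2: validate the request against stored user preferences (stubbed here).
    if not check_preferences(user_id, channel):
        return None

    event = {
        "eventId": str(uuid.uuid4()),   # later used as the idempotency key
        "userId": user_id,
        "type": channel,                # EMAIL | SMS | PUSH
        "template": template,
        "data": data,
    }
    # Step 3: publish to a channel-specific topic, keyed by userId so one
    # user's notifications stay ordered within a partition.
    producer.send(f"notifications.{channel.lower()}",
                  key=user_id.encode("utf-8"),
                  value=event)
    return event["eventId"]

def check_preferences(user_id: str, channel: str) -> bool:
    # Placeholder: a real system would read the preference cache/DB here.
    return True
```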

🧩 High-Level Architecture

Core Components:

  • Notification Service: Handles validation, preference check, and publishing.
  • Kafka/RabbitMQ: Message broker for decoupled communication.
  • Channel Workers: Email, SMS, Push consumers.
  • External Providers: SendGrid, Twilio, Firebase, etc.
  • DLQ: For failed messages and manual reprocessing.
  • Redis: For caching idempotency keys and user preferences.
  • Monitoring Layer: Grafana/Prometheus for metrics and alerts.

Flow: Trigger → Notification Service → Kafka → Channel Workers → Providers → User
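
A minimal channel-worker sketch for the consumer side of this flow, again assuming kafka-python; send_via_provider() is a placeholder standing in for a real provider SDK such as SendGrid or Twilio.

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "notifications.email",
    bootstrap_servers="localhost:9092",
    group_id="email-workers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def send_via_provider(event: dict) -> None:
    # Placeholder for the external provider call (SendGrid, Twilio, FCM, ...).
    print(f"sending {event['template']} to user {event['userId']}")

for message in consumer:
    event = message.value
    try:
        send_via_provider(event)   # delivery via the external provider
    except Exception as exc:
        # Hand off to the retry/DLQ path (see Reliability section below).
        print(f"delivery failed for {event.get('eventId')}: {exc}")
```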

โš™๏ธ Reliability & Fault Toleranceโ€‹

  • Retries: Implement retry logic with exponential backoff.
  • DLQ: Store permanently failed messages for manual handling.
  • Idempotency: Use Redis-based eventId or messageId with TTL to prevent duplicate notifications.
  • Backup Providers: Fallback to alternate providers when primary fails.
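
A sketch of the retry and DLQ behavior described above, assuming kafka-python; the attempt limit, backoff base, and DLQ topic name are illustrative choices.

```python
import json
import time
from kafka import KafkaProducer

MAX_ATTEMPTS = 5
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def deliver_with_retries(event: dict, send_fn) -> bool:
    for attempt in range(MAX_ATTEMPTS):
        try:
            send_fn(event)
            return True
        except Exception:
            # Exponential backoff: 1s, 2s, 4s, 8s, ... between attempts.
            time.sleep(2 ** attempt)
    # Permanently failed: park the message on the DLQ for manual handling.
    producer.send("notifications.dlq", value=event)
    return False
```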

🧭 Scalability

  • Use Kafka partitioning (by userId or notificationType) for parallel consumption.
  • Scale workers horizontally based on partitions.
  • Use idempotency keys to prevent duplicate sends when consumers crash and recover.
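
A sketch of the partitioning side, assuming kafka-python's admin client; the partition count is an arbitrary illustrative choice. With N partitions, up to N workers in one consumer group consume in parallel, while keying by userId keeps each user's notifications on a single partition.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    # 12 partitions -> up to 12 email workers consuming in parallel.
    # replication_factor=1 suits a local single-broker setup; use 3 in production.
    NewTopic(name="notifications.email", num_partitions=12, replication_factor=1),
])
```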

๐Ÿ” Idempotency Designโ€‹

  • Storage: Redis
  • Key: eventId or userId:templateType
  • TTL: ~24 hours to avoid unbounded growth
  • Locking: Use Redis locks to handle concurrent sends
  • Cleanup: Expire automatically via TTL or cron job
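
A sketch of this check, assuming redis-py; the key format and 24-hour TTL follow the bullets above. An atomic SET NX EX covers both the idempotency key and its expiry, which avoids a separate lock in the common case.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def claim_event(event_id: str, ttl_seconds: int = 24 * 3600) -> bool:
    # SET ... NX EX is atomic: only the first worker to claim the eventId wins,
    # so a crashed-and-recovered consumer will not send the same notification twice.
    return r.set(f"notif:{event_id}", "sent", nx=True, ex=ttl_seconds) is not None

# Usage inside a worker:
# if claim_event(event["eventId"]):
#     send_via_provider(event)
```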

📊 Monitoring & Metrics

| Metric       | Component        | Description                    |
| ------------ | ---------------- | ------------------------------ |
| Success Rate | All              | Percentage of successful sends |
| Error Rate   | All              | Failure count / total attempts |
| Latency      | Service + Worker | Time from trigger to delivery  |
| Queue Depth  | Kafka            | Number of pending messages     |
| Retry Count  | DLQ              | Number of retried messages     |
| CPU/Memory   | All              | System health of services      |

Alerting:

  • Alert if the error rate exceeds 5% over a 5-minute window.
  • Alert if queue depth exceeds a configured threshold.
  • Alert on DLQ growth or unresponsive workers.
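
A sketch of how a worker could expose the metrics in the table, assuming the prometheus_client library; metric and label names are illustrative. The alerting rules above would then be written against these series in Prometheus/Grafana.

```python
from prometheus_client import Counter, Histogram, start_http_server

SENDS = Counter("notifications_sent_total", "Send attempts", ["channel", "status"])
LATENCY = Histogram("notification_delivery_seconds",
                    "Trigger-to-delivery latency", ["channel"])

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

def record_send(channel: str, ok: bool, seconds: float) -> None:
    SENDS.labels(channel=channel, status="success" if ok else "error").inc()
    LATENCY.labels(channel=channel).observe(seconds)
```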

โš™๏ธ Extensibility - User Preferencesโ€‹

  • Store preferences in persistent DB (Postgres/DynamoDB).
  • Cache in Redis using a userId → preferences hash.
  • On preference updates, invalidate the Redis entry or refresh it via an update event.
  • Apply preference checks before publishing to Kafka.
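
A read-through cache sketch for the bullets above, assuming redis-py; the key format, the one-hour TTL, and load_preferences_from_db() are illustrative assumptions.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def get_preferences(user_id: str) -> dict:
    key = f"prefs:{user_id}"
    cached = r.hgetall(key)
    if cached:
        return {k.decode(): v.decode() == "1" for k, v in cached.items()}
    prefs = load_preferences_from_db(user_id)   # Postgres/DynamoDB lookup
    r.hset(key, mapping={ch: "1" if on else "0" for ch, on in prefs.items()})
    r.expire(key, 3600)
    return prefs

def invalidate_preferences(user_id: str) -> None:
    # Called from the preference-update path so stale cache entries are dropped.
    r.delete(f"prefs:{user_id}")

def load_preferences_from_db(user_id: str) -> dict:
    # Placeholder for the persistent store query.
    return {"EMAIL": True, "SMS": False, "PUSH": True}
```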

🧾 Optional API Contract Example

POST /notify

```json
{
  "userId": "123",
  "type": "EMAIL",
  "template": "ORDER_SHIPPED",
  "data": { "orderId": "A123" }
}
```
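
A sketch of a handler for this contract, assuming FastAPI and pydantic; publish_notification() is a placeholder for the preference-check-and-publish flow described earlier.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class NotifyRequest(BaseModel):
    userId: str
    type: str          # EMAIL | SMS | PUSH
    template: str
    data: dict

@app.post("/notify")
def notify(req: NotifyRequest):
    if req.type not in {"EMAIL", "SMS", "PUSH"}:
        raise HTTPException(status_code=400, detail="unsupported channel")
    event_id = publish_notification(req.dict())   # enqueue via the Notification Service
    return {"status": "queued", "eventId": event_id}

def publish_notification(payload: dict) -> str:
    # Placeholder for the preference check + Kafka publish described above.
    return "evt-123"
```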

📚 NFRs

| Requirement  | Description                                |
| ------------ | ------------------------------------------ |
| Availability | High; short delivery delays are tolerable  |
| Latency      | Less than 3 seconds for real-time delivery |
| Durability   | Guaranteed message persistence via Kafka   |
| Scalability  | Horizontally scalable consumers            |
| Reliability  | Retry + DLQ + backup provider              |