Design a highly scalable notification system for 1M+ users supporting push, SMS, and email with strict reliability guarantees:
- No duplicate sends
- No missed notifications
- Graceful degradation when providers fail
- Horizontal scalability

                    +----------------------+
                    | Client Applications  |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    |   Notification API   |
                    |   (Gateway Layer)    |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    | Notification Service |
                    | Validation + Routing |
                    +----------+-----------+
                               |
                    Writes Notification Jobs
                               |
                               v
                    +----------------------+
                    |      PostgreSQL      |
                    | Notifications Table  |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    |     Outbox Table     |
                    +----------+-----------+
                               |
                         CDC / Poller
                               |
                               v
                    +----------------------+
                    |       RabbitMQ       |
                    +----------+-----------+
                               |
            +------------------+------------------+
            |                  |                  |
            v                  v                  v
    +---------------+  +---------------+  +---------------+
    | Email Worker  |  |  SMS Worker   |  |  Push Worker  |
    +-------+-------+  +-------+-------+  +-------+-------+
            |                  |                  |
            v                  v                  v
    +---------------+  +---------------+  +---------------+
    |   Provider    |  |   Provider    |  |   Provider    |
    |   Adapters    |  |   Adapters    |  |   Adapters    |
    +-------+-------+  +-------+-------+  +-------+-------+
            |                  |                  |
            v                  v                  v
      SendGrid/SES       Twilio/Termii        FCM/APNs

When a user triggers a notification (email, SMS, or push), the system follows a strict flow to ensure reliability and no duplication:
- The API receives the request and validates it.
- The Notification Service creates a notification record in PostgreSQL.
- In the same database transaction, an outbox event is stored.
- A background publisher reads from the outbox table and publishes messages to RabbitMQ.
- RabbitMQ routes the message to the appropriate channel queue (email, SMS, or push).
- Channel workers consume the message and process it.
- Workers send the notification via external providers.
- The notification status is updated in PostgreSQL.
- If a failure occurs, the message goes through the retry or fallback flow.
This flow ensures durability, traceability, and fault tolerance from request to delivery.
Each request carries a unique idempotency key:
idempotency_key = SHA256(user_id + template_id + event_id)
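The fields should be joined with a delimiter or length prefix so that different inputs cannot concatenate to the same string. The key is enforced by a unique constraint in PostgreSQL, which prevents duplicate processing across: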
- API retries
- Queue redeliveries
- Worker restarts
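As a minimal sketch of the key derivation above, the following assumes all three identifiers are available as strings. The function name and the length-prefix encoding are illustrative choices, not part of the original design.

```python
import hashlib


def idempotency_key(user_id: str, template_id: str, event_id: str) -> str:
    """Return a hex SHA-256 digest over the three identifiers."""
    # Length-prefix each field (an assumed encoding) so that
    # ("ab", "c", ...) and ("a", "bc", ...) produce different keys.
    parts = (user_id, template_id, event_id)
    material = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

The same inputs always yield the same 64-character key, so an API retry of the same event collides with the unique constraint instead of creating a second notification.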
The notification and its outbox event are written in a single database transaction. A background publisher then pushes the events to RabbitMQ.
Because both rows commit or roll back together, no message can be lost between the database and the queue layer.
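The single-transaction write described above can be sketched as follows. This uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the design itself targets PostgreSQL, and the function names and reduced column set are assumptions for illustration.

```python
import json
import sqlite3
import uuid


def make_demo_db():
    """In-memory stand-in for the PostgreSQL schema (reduced columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notifications (id TEXT PRIMARY KEY, user_id TEXT, "
        "channel TEXT, payload TEXT, status TEXT, idempotency_key TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate_id TEXT, "
        "event_type TEXT, payload TEXT, processed INTEGER)"
    )
    return conn


def create_notification(conn, user_id, channel, payload, idem_key):
    """Insert the notification and its outbox event atomically.

    Returns the new notification id, or None if the idempotency key
    was already used (a duplicate request).
    """
    notification_id = str(uuid.uuid4())
    body = json.dumps(payload)
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO notifications "
                "(id, user_id, channel, payload, status, idempotency_key) "
                "VALUES (?, ?, ?, ?, 'PENDING', ?)",
                (notification_id, user_id, channel, body, idem_key),
            )
            conn.execute(
                "INSERT INTO outbox "
                "(id, aggregate_id, event_type, payload, processed) "
                "VALUES (?, ?, ?, ?, 0)",
                (str(uuid.uuid4()), notification_id, f"{channel}.send", body),
            )
    except sqlite3.IntegrityError:
        return None  # unique constraint on idempotency_key rejected it
    return notification_id
```

A duplicate request fails on the first insert, so neither a second notification row nor a second outbox event is written.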
Delivery semantics:
- RabbitMQ provides at-least-once delivery
- Workers are designed to be idempotent
- Notification state is persisted in PostgreSQL for consistency
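Since delivery is at-least-once, a worker may see the same message twice. One way to make the worker idempotent, sketched here under assumptions, is to claim the row with a conditional status update and skip the send when the claim fails. `sqlite3` again stands in for PostgreSQL, and `claim` and `handle_delivery` are hypothetical names.

```python
import sqlite3


def claim(conn, notification_id):
    """Atomically move a notification to PROCESSING.

    Returns True only for the one caller that wins the transition.
    """
    with conn:
        cur = conn.execute(
            "UPDATE notifications SET status = 'PROCESSING' "
            "WHERE id = ? AND status IN ('PENDING', 'RETRYING')",
            (notification_id,),
        )
    return cur.rowcount == 1


def handle_delivery(conn, notification_id, send):
    """Process one queue delivery; `send` is the provider call (assumed)."""
    if not claim(conn, notification_id):
        return "skipped"  # redelivery or already handled: just ack
    provider_message_id = send(notification_id)
    with conn:
        conn.execute(
            "UPDATE notifications SET status = 'SENT', "
            "provider_message_id = ? WHERE id = ?",
            (provider_message_id, notification_id),
        )
    return "sent"
```

A worker that crashes after claiming leaves the row in PROCESSING, so a real deployment would likely also need a timeout that returns stale rows to a claimable state.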
Notifications table:
- id (UUID)
- user_id (UUID)
- channel (email | sms | push)
- payload (JSONB)
- status (PENDING | PROCESSING | SENT | FAILED | RETRYING | DLQ)
- provider
- provider_message_id
- retry_count
- idempotency_key (unique)
- created_at, updated_at
Indexes:
- (user_id, created_at)
- (status)
- UNIQUE (idempotency_key)
Outbox table:
- id
- aggregate_id
- event_type
- payload
- processed
- created_at
Exchange:
- notifications.exchange (topic)
Queue settings:
- durable: true (queues survive a broker restart)
- persistent messages, so queued messages also survive a restart
- a message TTL on the retry queue to control the retry delay
- priority enabled for critical notifications (OTP > alerts > marketing)
Example queues:
- notifications.email.queue
- notifications.sms.queue
- notifications.push.queue
- notifications.retry.queue
- notifications.dlq.queue
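The durability and priority settings above map onto a small amount of declaration data. The sketch below expresses them as plain Python values that a client library would pass when declaring queues and publishing. `x-max-priority` is RabbitMQ's queue argument for priority queues and delivery mode 2 marks a message persistent; the numeric priority levels are assumptions.

```python
DURABLE = True       # queue definition survives a broker restart
MAX_PRIORITY = 10    # assumed ceiling for this design

# Arguments supplied when declaring each channel queue.
QUEUE_ARGUMENTS = {"x-max-priority": MAX_PRIORITY}

# Assumed numeric levels for the ordering OTP > alerts > marketing.
PRIORITY_BY_KIND = {"otp": 9, "alert": 5, "marketing": 1}


def message_properties(kind):
    """Per-message properties for a notification of the given kind."""
    return {
        "delivery_mode": 2,                  # persistent message
        "priority": PRIORITY_BY_KIND[kind],  # higher is delivered first
    }
```

Keeping the number of distinct priority levels small is generally advisable, since each level adds bookkeeping inside the broker.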
To guarantee no message loss between publisher and RabbitMQ:
- Publisher confirms are enabled
- Every publish waits for broker ACK/NACK
- On NACK or timeout, message is retried
This ensures messages are never silently dropped.
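The confirm-and-retry behaviour can be combined with the outbox poller in one loop: a row is marked processed only after the broker acknowledges the publish. In this sketch `publish` is a hypothetical callable that returns True on a broker ACK and False on a NACK (or raises `TimeoutError`); in practice it would wrap a client channel with publisher confirms enabled. `sqlite3` stands in for PostgreSQL.

```python
import sqlite3


def relay_outbox(conn, publish, batch_size=100, max_attempts=3):
    """Publish unprocessed outbox rows; return how many were confirmed."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE processed = 0 ORDER BY created_at LIMIT ?",
        (batch_size,),
    ).fetchall()
    confirmed = 0
    for row_id, event_type, payload in rows:
        acked = False
        for _ in range(max_attempts):
            try:
                acked = publish(event_type, payload)
            except TimeoutError:
                acked = False  # treat a confirm timeout like a NACK
            if acked:
                break
        if not acked:
            continue  # row stays unprocessed; the next poll retries it
        with conn:
            conn.execute(
                "UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,)
            )
        confirmed += 1
    return confirmed
```

If the relay crashes after the ACK but before the update, the row is published again on the next poll. That duplicate is expected under at-least-once delivery and is absorbed by the idempotent workers.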
Routing keys:
- email.send
- sms.send
- push.send
- notification.retry
A dead letter exchange (DLX) drives retries:
Main Queue → Failure → DLX → Retry Queue → TTL expires → back to Main Queue
After the maximum number of retries, the message moves to the DLQ for permanent-failure handling.
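One possible wiring for the loop above, together with the retry-limit check, is sketched below. The queue arguments (`x-dead-letter-exchange`, `x-message-ttl`) and the `x-death` header are standard RabbitMQ features; the separate retry exchange name, the 30-second TTL, and `MAX_RETRIES` are assumptions for illustration only.

```python
MAIN_EXCHANGE = "notifications.exchange"
RETRY_EXCHANGE = "notifications.retry.exchange"  # hypothetical name
MAX_RETRIES = 4  # assumed limit

# A rejected message leaves its channel queue through the DLX and keeps
# its original routing key (for example "email.send").
CHANNEL_QUEUE_ARGS = {"x-dead-letter-exchange": RETRY_EXCHANGE}

# The retry queue has no consumers. When the TTL expires, the message is
# dead-lettered back to the main exchange and routed by its original key.
RETRY_QUEUE_ARGS = {
    "x-message-ttl": 30_000,  # milliseconds, assumed
    "x-dead-letter-exchange": MAIN_EXCHANGE,
}


def death_count(headers, queue):
    """How many times the message was rejected from `queue`."""
    total = 0
    for entry in (headers or {}).get("x-death", []):
        if entry.get("queue") == queue and entry.get("reason") == "rejected":
            total += int(entry.get("count", 0))
    return total


def next_hop(headers, queue):
    """Decide where a failed message goes: another retry or the DLQ."""
    return "retry" if death_count(headers, queue) < MAX_RETRIES else "dlq"
```

On a failure the worker would call `next_hop` with the delivery's headers, then either reject the message without requeue (sending it around the loop) or publish it to the DLQ and acknowledge it.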
The prefetch count controls consumer throughput. Suggested ranges per channel:
- SMS/OTP: 5–10
- Email: 10–50
- Push: 50–100
Rule:
prefetch ≈ worker concurrency × 2
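As a worked example of the rule, the helper below doubles the worker concurrency and clamps the result to the channel's range listed above. The function name and the clamping step are assumptions; the original only gives the ranges and the approximate rule.

```python
# Per-channel prefetch ranges taken from the list above.
PREFETCH_RANGE = {"sms": (5, 10), "email": (10, 50), "push": (50, 100)}


def prefetch_for(channel, concurrency):
    """Apply prefetch ≈ concurrency × 2, clamped to the channel's range."""
    low, high = PREFETCH_RANGE[channel]
    return max(low, min(high, concurrency * 2))
```

For instance, an SMS worker running 4 concurrent sends gets a prefetch of 8, while an email worker with 2 concurrent sends is raised to the range minimum of 10.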
Workers are stateless and horizontally scalable.
Responsibilities:
- Template rendering
- Provider selection
- Idempotency check
- Retry handling
- Status updates
Retry policy:
- exponential backoff (30s → 2m → 10m → 30m)
- only retry transient failures
- permanent failures go to DLQ
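The retry rules above can be sketched as a small decision function. The delay schedule comes from the list; the set of HTTP status codes treated as transient is an assumption, since the original does not say how failures are classified.

```python
# Backoff schedule from the list above, in seconds: 30s, 2m, 10m, 30m.
BACKOFF_SECONDS = (30, 120, 600, 1800)

# Assumed classification: timeouts, throttling, and server-side errors.
TRANSIENT_STATUS = frozenset({408, 429, 500, 502, 503, 504})


def next_delay(retry_count):
    """Delay before the next attempt, or None when retries are exhausted."""
    if 0 <= retry_count < len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[retry_count]
    return None


def decide(status_code, retry_count):
    """Return (new_status, delay_seconds) for a failed provider call."""
    if status_code not in TRANSIENT_STATUS:
        return ("DLQ", None)  # permanent failure: do not retry
    delay = next_delay(retry_count)
    if delay is None:
        return ("DLQ", None)  # transient, but out of attempts
    return ("RETRYING", delay)
```

One caveat when mapping these delays onto RabbitMQ: with per-message TTLs, a message is only expired once it reaches the head of its queue, so mixing delays in one retry queue can hold a 30-second retry behind a 30-minute one. A common workaround is one retry queue per delay step.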
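If a provider fails, the worker fails over to the secondary provider for that channel:
Twilio → Termii
SendGrid → SES
FCM → APNs (Apple devices only; APNs cannot deliver to Android devices)
A circuit breaker on each provider prevents cascading failures.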
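A minimal sketch of provider failover guarded by a circuit breaker follows. The thresholds, the class and function names, and the shape of the provider list are all assumptions; the send functions are placeholders for real provider adapters.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a pause."""

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the pause has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def send_with_failover(providers, message):
    """Try providers in order; each entry is (name, send_fn, breaker)."""
    for name, send, breaker in providers:
        if not breaker.allow():
            continue  # circuit open: skip without calling the provider
        try:
            message_id = send(message)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, message_id
    raise RuntimeError("all providers unavailable")
```

Once the primary's breaker opens, later sends go straight to the secondary without paying for a failing call, which is what keeps a provider outage from backing up the workers.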
If an entire channel fails, the notification falls back to the next channel:
Push → SMS → Email
Whether and how far to fall back depends on the notification priority.
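To illustrate priority-based channel fallback, the sketch below builds the ordered list of channels to try. The fallback order comes from the text; the number of fallbacks allowed per priority is an assumed policy, as the original does not specify one.

```python
# Fallback order from the text above.
FALLBACK_ORDER = ("push", "sms", "email")

# Assumed policy: how many extra channels each priority may fall back to.
MAX_FALLBACKS = {"otp": 2, "alert": 1, "marketing": 0}


def channel_plan(first_channel, priority):
    """Channels to attempt, in order, starting from `first_channel`."""
    start = FALLBACK_ORDER.index(first_channel)
    allowed = MAX_FALLBACKS[priority]
    return list(FALLBACK_ORDER[start:start + 1 + allowed])
```

Under this policy an OTP that starts on push may end up on SMS and then email, while a marketing message is never moved to a costlier channel.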
Redis is used for caching hot paths such as templates, rate limits, and deduplication locks.
Reliability is achieved through idempotency keys, transactional outbox, RabbitMQ durable queues, publisher confirms, DLX retries, and idempotent workers.
The system scales horizontally across the API, the workers, and the queue consumers for each channel.
| Factor | RabbitMQ | Kafka |
|---|---|---|
| Use case fit | Message routing (email vs SMS vs push) | Event streaming |
| Ops complexity | Lower | Higher |
| Replay capability | No | Yes |
| Approx. throughput (workload-dependent) | 20K msg/sec | 100K+ msg/sec |
For notifications: RabbitMQ wins due to flexible routing, simpler ops, and sufficient throughput.
Example flow (OTP over SMS):
- API receives request
- Notification + outbox stored in DB
- Publisher sends event to RabbitMQ
- SMS worker consumes message
- Provider sends OTP
- Status updated in DB
- On failure → retry or fallback provider
This design provides a reliable, scalable notification system using RabbitMQ with:
- Strong delivery guarantees
- No duplicates via idempotency
- No message loss via transactional outbox + publisher confirms
- Controlled retries via DLX
- Graceful degradation across channels
- Horizontal scalability for 1M+ users