Multi-Region Failover & Disaster Recovery

Multi-region failover is the architectural capability of a distributed system to shift traffic from one geographically isolated deployment (e.g., Mumbai) to another (e.g., Singapore) when the primary region becomes unavailable.

Why It Exists (Risks of Single-Region)

Cloud Outages: Provider-level failures (AWS, Azure, GCP).
Physical Disasters: Power failures, natural disasters.
Human Error: Deployment bugs, misconfigurations.
Compliance: Data residency and sovereignty laws (e.g., GDPR).
Performance: Latency optimization by serving users from the nearest region.

Types of Failures & Solutions

Failures occur at different layers and require specific mitigation strategies.

Layer	Example Failure	Solution / Mitigation
Compute	K8s cluster crash	Regional redundancy, automated scaling.
Network	Region isolation	Circuit breakers, cross-region routing.
Data	Corruption / Bad script	Delayed replication, Point-in-Time Recovery (PITR).
Application	Poison Pill (bad code)	Canary deployments, blue-green strategies.
Dependency	DNS / Storage failure	Anycast IP, multi-provider DNS, regional storage replication.

Recovery Metrics (RTO vs. RPO)

These two metrics dictate your architectural choice and its cost.

Metric	Definition	Technical Impact
RTO (Recovery Time Objective)	The maximum acceptable delay between failure and service restoration	Influences traffic steering (DNS vs. Anycast) and server readiness (Warm vs. Cold)
RPO (Recovery Point Objective)	The maximum amount of data loss (measured in time) tolerated during a failure	Dictates database replication strategy (Sync vs. Async)

Strategy Selection Matrix

Strategy	RTO	RPO	Replication	Cost	Use Case
Backup & Restore	Hours+	24 Hours+	Periodic snapshots	Lowest	Non-critical/Archival
Pilot Light	Minutes+	Minutes	Async replication of core data only	Low	General Business Apps
Warm Standby	Minutes	Seconds+	Small live footprint	Medium	SaaS / E-commerce
Active-Active (Multi-Leader)	Near-Zero	Seconds	Bi-directional Async	High	High-scale Global Apps
Active-Active (Strongly Consistent)	Zero	Zero	Synchronous ACK	Highest	Banking / Payments
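
The matrix can be read as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets both targets. The sketch below shows that idea. The numeric RTO/RPO values and the names `Strategy`, `STRATEGIES`, and `cheapest_strategy` are illustrative assumptions, since the matrix only gives qualitative ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_seconds: float  # assumed typical recovery time
    rpo_seconds: float  # assumed typical data-loss window


# Ordered cheapest first, mirroring the matrix rows.
# The numbers are illustrative assumptions, not provider figures.
STRATEGIES = [
    Strategy("Backup & Restore", 4 * 3600, 24 * 3600),
    Strategy("Pilot Light", 30 * 60, 10 * 60),
    Strategy("Warm Standby", 5 * 60, 30),
    Strategy("Active-Active (Multi-Leader)", 5, 5),
    Strategy("Active-Active (Strongly Consistent)", 0, 0),
]


def cheapest_strategy(target_rto_s: float, target_rpo_s: float) -> Strategy:
    """Return the first (cheapest) strategy that satisfies both targets."""
    for strategy in STRATEGIES:
        if (strategy.rto_seconds <= target_rto_s
                and strategy.rpo_seconds <= target_rpo_s):
            return strategy
    # Nothing cheaper qualifies, so fall back to the strictest option.
    return STRATEGIES[-1]
```

For example, a target of one hour RTO and fifteen minutes RPO resolves to Pilot Light under these assumed numbers, while any requirement of zero data loss lands on the strongly consistent row.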

Multi-Region Architectures

A. Active-Passive (Hot/Warm Standby)

Mechanism: All traffic goes to Region A. Region B stays idle or scaled down until a failover is triggered.
Pros: Simpler consistency, lower operational overhead.
Cons: Non-zero RTO (time to flip the switch), "Cold Cache" problems, wasted infrastructure cost.

B. Active-Active

Mechanism: Both regions serve traffic simultaneously using Geo-routing or Latency-based routing.
Pros: Near-zero RTO, better latency for global users.
Cons: Extremely complex data consistency, risk of Split-Brain.

The "Split-Brain" Problem

Split-brain occurs when the connection between regions fails but both regions stay alive. Each assumes the other is dead and promotes itself to "Leader," so both accept writes and the data diverges.

Solutions:

Quorum (Majority Rule): Using an odd number of nodes (or a 3rd witness region). A region can only accept writes if it sees >50% of the nodes (e.g., 2 out of 3).
Fencing (STONITH): "Shoot The Other Node In The Head." Disabling the power or network access of the isolated region to prevent it from writing data.
Generation Clock: Using monotonically increasing "Term" numbers (Raft/Paxos). Old leaders with lower term numbers are rejected.
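
The quorum rule and the generation clock can be sketched in a few lines. This is a minimal illustration, not a Raft or Paxos implementation: `has_quorum` and `Replica` are hypothetical names, and real systems also handle elections, log matching, and persistence.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True if this side of the partition sees a strict majority (>50%)."""
    return reachable_nodes * 2 > total_nodes


class Replica:
    """Accepts writes only from a leader whose term is not stale."""

    def __init__(self) -> None:
        self.current_term = 0
        self.log = []  # list of (term, entry) tuples

    def append(self, leader_term: int, entry: str) -> bool:
        # Reject leaders from an older generation: they were deposed
        # during a partition and must not write.
        if leader_term < self.current_term:
            return False
        self.current_term = leader_term
        self.log.append((leader_term, entry))
        return True
```

With three nodes, a region that can reach two of them keeps writing, while the isolated region sees one of three and stops. If the old leader reconnects with a lower term, its writes are refused.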

Database Failover: The Hardest Part

Application servers are stateless and easy to move; databases carry state and are bound by the CAP Theorem.

Replication Models:

Asynchronous: Fast performance, but risk of data loss (RPO > 0).
Synchronous: Zero data loss (RPO = 0), but high latency because every write must wait for an ACK from the other region.
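
The trade-off can be made concrete with a toy model. The sketch below is an in-memory simulation under simplifying assumptions (no network, no real latency). `Primary`, `ship_pending`, and `lost_on_failover` are hypothetical names used only for illustration.

```python
class Primary:
    """Toy primary that replicates to one secondary region."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.local = []    # writes committed in the primary region
        self.replica = []  # writes the secondary region has applied
        self.pending = []  # acknowledged writes not yet shipped (async only)

    def write(self, record: str) -> None:
        self.local.append(record)
        if self.synchronous:
            # Synchronous: the client is acknowledged only after the
            # secondary has the record, which costs a cross-region round trip.
            self.replica.append(record)
        else:
            # Asynchronous: acknowledge immediately, ship later.
            self.pending.append(record)

    def ship_pending(self) -> None:
        """Simulate the replication stream catching up."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def lost_on_failover(self) -> list:
        """Writes acknowledged to clients but absent from the secondary."""
        return list(self.pending)
```

If the primary region fails between `write` and `ship_pending`, everything in `pending` is lost, which is exactly the RPO > 0 window. In synchronous mode `pending` is always empty.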

Conflict Resolution (Active-Active):

Last Write Wins (LWW): Simple but risky, because clock skew between regions can cause a newer write to be silently discarded.
CRDTs (Conflict-free Replicated Data Types): Automatic merging (e.g., counters, sets).
Vector Clocks: Tracking causality to see which update happened "after" another.
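
Two of these techniques are small enough to sketch directly: a grow-only counter (G-Counter), one of the simplest CRDTs, and a vector clock comparison. The function names and the region keys are assumptions chosen for illustration.

```python
def merge_gcounter(a: dict, b: dict) -> dict:
    """Merge two grow-only counters by taking the per-region maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def gcounter_value(counter: dict) -> int:
    """The counter's value is the sum of every region's local count."""
    return sum(counter.values())


def compare_clocks(a: dict, b: dict) -> str:
    """Classify vector clock a relative to b."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # a happened before b
    if b_le_a:
        return "after"    # a happened after b
    return "concurrent"   # neither saw the other: a true conflict
```

The merge is commutative, associative, and idempotent, so regions can exchange state in any order and converge on the same result. A "concurrent" result from `compare_clocks` signals a conflict that still needs a resolution policy.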

Traffic Steering & Health Detection

How do you know when to failover?

Detection:

Synthetic Probes: Pinging /health endpoints.
Deep Health Checks: Checking if the DB is writable, not just if the server is "on."
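
A detector along these lines might look like the sketch below. The probes are injected as plain callables so the example stays self-contained. Requiring several consecutive failures before triggering failover is an added assumption (a common way to avoid reacting to a single blip), and `FailoverDetector` and `deep_health` are hypothetical names.

```python
from typing import Callable


def deep_health(server_up: Callable[[], bool],
                db_writable: Callable[[], bool]) -> Callable[[], bool]:
    """Combine a shallow liveness probe with a database write check."""
    return lambda: server_up() and db_writable()


class FailoverDetector:
    """Signals failover after N consecutive failed probes (assumed policy)."""

    def __init__(self, probe: Callable[[], bool],
                 failure_threshold: int = 3) -> None:
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True when failover should be triggered."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (timeout, connection refused) counts as a failure.
            healthy = False
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

In practice `server_up` would call the /health endpoint and `db_writable` would attempt a small write, with `check` invoked on a fixed interval.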

Steering:

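DNS-Based: Fast to set up, but failover is delayed by DNS caching. Resolvers and clients keep serving the old record until its TTL expires, and some hold it longer.
Global Load Balancing (Anycast): Avoids the DNS caching delay. A single IP is advertised globally, and the network routes traffic to the nearest healthy edge.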
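
A back-of-the-envelope estimate shows why the TTL matters for RTO. The formula below is a rough sketch: it assumes detection takes a fixed number of failed probes at a fixed interval and that every resolver honors the TTL, which is optimistic. The function and parameter names are hypothetical.

```python
def estimated_dns_failover_seconds(probe_interval_s: float,
                                   failure_threshold: int,
                                   dns_ttl_s: float) -> float:
    """Rough time from failure until TTL-honoring clients reach the new region."""
    detection_s = probe_interval_s * failure_threshold
    # After the record is updated, caches may serve the old answer
    # for up to one full TTL.
    return detection_s + dns_ttl_s
```

With a 10-second probe interval, 3 failed probes, and a 60-second TTL, this gives 90 seconds, so a sub-minute RTO is out of reach for DNS steering under these assumptions.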

Operational Best Practices

Idempotency: Ensure retry logic doesn't duplicate transactions (e.g., double-charging).
Thundering Herd: Avoid DB crashes after failover by "warming" the cache in the secondary region or using request coalescing.
Chaos Engineering: Periodically "kill" a region in production (in the style of Netflix's Chaos Kong, the region-level counterpart to Chaos Monkey) to verify that failover actually works.
No Auto-Failback: Once a region recovers, do not automatically shift traffic back. Stabilize and sync data manually first to avoid "flapping."
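
The idempotency point can be illustrated with an idempotency key: the client sends the same key on every retry, and the server returns the stored result instead of repeating the side effect. This is a minimal in-memory sketch. `PaymentService`, the receipt format, and the dictionary store are assumptions, and a real deployment would keep the keys in a store that is replicated to the secondary region.

```python
class PaymentService:
    """Toy payment service that deduplicates retries by idempotency key."""

    def __init__(self) -> None:
        self._results = {}  # idempotency_key -> receipt already issued
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key: str, account: str,
               amount_cents: int) -> str:
        # A retry with a known key returns the original receipt
        # without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        receipt = f"receipt-{len(self.charges)}"
        self._results[idempotency_key] = receipt
        return receipt
```

A client that times out during failover and retries with the same key gets the same receipt back, and the account is charged once.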
