Zero Downtime: Designing an Azure Multi-Region Architecture

architectraghu posted Nov 30, 2025 · 6 min read

Introduction

High availability isn’t just a buzzword to throw into a system design interview; it’s the difference between a minor hiccup and a career-defining disaster. We often build for the "happy path," assuming Azure regions are invincible fortresses. But cables get cut, weather happens, and yes, sometimes entire regions go dark.

If your application lives in a single region, your uptime is at the mercy of that specific geography. Today, we’re going to walk through moving from a fragile, single-region setup to a robust Active-Active Multi-Region Architecture that can survive a total regional outage without your users ever noticing.

The Story: The 3 AM Wake-Up Call

It was 3:14 AM on a Tuesday—it’s always a Tuesday. My phone buzzed off the nightstand, vibrating with the distinctive, heart-stopping pattern of PagerDuty.

"East US is down," the message read.

I groggily opened my laptop, hoping it was just a blip. It wasn't. An underlying storage outage in the primary region had cascaded, taking our entire e-commerce platform offline. We had a "Disaster Recovery" plan, but it was a cold backup in West Europe. Restoring it meant updating DNS records manually, warming up cold caches, and praying the database backups were consistent.

It took us four hours to get back online. In internet time, that’s an eternity. We lost revenue, but worse, we lost trust.

That morning, fueled by too much coffee and regret, we made a pact: Never again. We were going to build a system that could lose an entire region and keep humming along. Here is exactly how we did it.

Core Concepts: Active-Passive vs. Active-Active

Before we write a single line of code, we need to agree on the strategy.

Active-Passive: You have a primary region handling traffic and a secondary region sitting idle (or receiving replication data). Failover usually requires manual intervention or a delay.

Active-Active: Both regions are live. They both handle traffic simultaneously. If one dies, the global load balancer simply stops sending users there. This is the holy grail for zero downtime.

To achieve this, we need three layers of abstraction:

The Brain (Traffic Routing): Something global to direct users.
The Muscle (Compute): Stateless application servers.
The Heart (Data): A database that accepts writes in multiple places.

Step-by-Step Guide

We are going to build this using Azure Front Door, Azure App Service, and Cosmos DB.

1. The Entry Point: Azure Front Door

Older setups often relied on Azure Traffic Manager, which routes at the DNS level, so failover has to wait for cached DNS records to expire. Modern web apps should use Azure Front Door. It operates at Layer 7 (HTTP/S), uses the Microsoft global edge network to speed up content, and can fail over far faster than DNS-based routing because its edge nodes actively health-probe each origin and stop routing to one that fails.

Why Front Door? It terminates TLS at the edge and routes traffic to the closest healthy origin.
The trick: We set up an "Origin Group" containing our East US and West Europe endpoints.
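
An origin group on its own does not receive traffic; Front Door also needs an endpoint (the public hostname) and a route that ties the endpoint to the group. Here is a minimal sketch of those two pieces, assuming the profile, origin group, and origin resources defined in the Terraform section of this post. The names "example-endpoint" and "example-route" are placeholders I made up for illustration.

```terraform
# Sketch only: the endpoint and route names are hypothetical placeholders.
resource "azurerm_cdn_frontdoor_endpoint" "main" {
  name                     = "example-endpoint"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.main.id
}

resource "azurerm_cdn_frontdoor_route" "main" {
  name                          = "example-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.main.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.main.id

  # Both regional origins serve this route; an origin that fails its
  # health probes is taken out of rotation by Front Door.
  cdn_frontdoor_origin_ids = [
    azurerm_cdn_frontdoor_origin.east_us.id,
    azurerm_cdn_frontdoor_origin.west_eu.id,
  ]

  supported_protocols    = ["Http", "Https"]
  patterns_to_match      = ["/*"]      # Send every path to the origin group
  forwarding_protocol    = "HttpsOnly" # Always talk HTTPS to the origins
  https_redirect_enabled = true        # Redirect plain HTTP at the edge
  link_to_default_domain = true        # Serve on the endpoint's default hostname
}
```

With equal priority on both origins, latency decides which region a user lands on, which is what makes the setup active-active rather than active-passive.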

2. The Compute Layer: Go Stateless

This is where most developers get stuck. You cannot store session state (like a shopping cart or user login) in the memory of your web server. If a user hits East US for Request A and West Europe for Request B, the session will be lost.

Solution: Externalize state. Use Azure Cache for Redis (the Enterprise tier supports active geo-replication) or just keep state in the database.
The Code: Ensure your application connects to the local region's database endpoint to minimize latency.
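
To make this concrete, here is a rough sketch of the two regional App Service deployments. It assumes the resource labels east_app and west_app that the Front Door configuration later in this post refers to. The plan names, app names, SKU, and the APP_REGION setting are all assumptions for illustration; your app would read that setting (or something like it) to pick its local database endpoint.

```terraform
# Sketch only: names, SKU, and the APP_REGION setting are hypothetical.
resource "azurerm_service_plan" "east" {
  name                = "plan-eastus"
  resource_group_name = azurerm_resource_group.main.name
  location            = "eastus"
  os_type             = "Linux"
  sku_name            = "P1v3"
}

resource "azurerm_service_plan" "west" {
  name                = "plan-westeu"
  resource_group_name = azurerm_resource_group.main.name
  location            = "westeurope"
  os_type             = "Linux"
  sku_name            = "P1v3"
}

resource "azurerm_linux_web_app" "east_app" {
  name                    = "example-app-eastus" # Must be globally unique
  resource_group_name     = azurerm_resource_group.main.name
  location                = azurerm_service_plan.east.location
  service_plan_id         = azurerm_service_plan.east.id
  https_only              = true
  client_affinity_enabled = false # No sticky sessions: any instance can serve any request

  site_config {}

  app_settings = {
    "APP_REGION" = "East US" # Hypothetical: the app uses this to choose its local data endpoint
  }
}

resource "azurerm_linux_web_app" "west_app" {
  name                    = "example-app-westeu" # Must be globally unique
  resource_group_name     = azurerm_resource_group.main.name
  location                = azurerm_service_plan.west.location
  service_plan_id         = azurerm_service_plan.west.id
  https_only              = true
  client_affinity_enabled = false

  site_config {}

  app_settings = {
    "APP_REGION" = "West Europe"
  }
}
```

The two apps are deliberately identical apart from location and the region setting. If they ever need to differ in code or configuration, that is usually a sign that state has crept back into the compute layer.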

3. The Data Layer: Multi-Region Writes

This is the hardest part. In a traditional SQL setup, you usually have one Writer and multiple Readers. If the Writer region goes down, you have to promote a Reader, which takes time.

Enter Azure Cosmos DB. It supports Multi-Region Writes. You can write to East US and West Europe simultaneously, and Cosmos DB handles the replication magic in the background.
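Consistency: For zero downtime, we usually accept "Session" consistency. It guarantees that, within a single client session, a user always reads their own writes, which is enough for most apps. If a user’s requests can land in different regions, the session token has to travel with them (for example in a cookie or header) to keep that guarantee.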
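
Here is a minimal sketch of what that account could look like in Terraform. The account name is a placeholder, and the argument names multiple_write_locations_enabled and automatic_failover_enabled assume version 4.x of the azurerm provider; on 3.x the equivalents are enable_multiple_write_locations and enable_automatic_failover.

```terraform
# Sketch only: account name is hypothetical; argument names assume azurerm 4.x.
resource "azurerm_cosmosdb_account" "main" {
  name                = "example-cosmos-account" # Must be globally unique and lowercase
  resource_group_name = azurerm_resource_group.main.name
  location            = "eastus"
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  multiple_write_locations_enabled = true # Every listed region accepts writes
  automatic_failover_enabled       = true

  consistency_policy {
    consistency_level = "Session" # Read-your-own-writes within a client session
  }

  geo_location {
    location          = "eastus"
    failover_priority = 0
  }

  geo_location {
    location          = "westeurope"
    failover_priority = 1
  }
}
```

On the application side, each regional deployment would still need to tell the Cosmos DB SDK which region it prefers, so that reads and writes stay local.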

Infrastructure as Code (Terraform)

Let's look at how to define the critical Front Door configuration in Terraform. This snippet creates an origin group (Front Door's term for a backend pool) that health-checks both regions.

```terraform
resource "azurerm_cdn_frontdoor_profile" "main" {
  name                = "example-frontdoor"
  resource_group_name = azurerm_resource_group.main.name
  sku_name            = "Premium_AzureFrontDoor" # Required for some advanced features
}

resource "azurerm_cdn_frontdoor_origin_group" "main" {
  name                     = "example-origin-group"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.main.id

  load_balancing {
    sample_size                        = 4  # Recent probe samples considered per origin
    successful_samples_required        = 3  # Samples that must succeed for "healthy"
    additional_latency_in_milliseconds = 50 # Origins within 50 ms of the fastest share traffic
  }

  health_probe {
    path                = "/health"
    protocol            = "Https"
    interval_in_seconds = 30 # Shorter interval detects failure sooner, at the cost of more probe traffic
    request_type        = "HEAD"
  }
}

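# Add East US Origin
resource "azurerm_cdn_frontdoor_origin" "east_us" {
  name                           = "app-eastus"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main.id
  enabled                        = true
  certificate_name_check_enabled = true # Required argument: validate the origin's TLS certificate name
  host_name                      = azurerm_linux_web_app.east_app.default_hostname
  origin_host_header             = azurerm_linux_web_app.east_app.default_hostname # App Service routes by Host header
  http_port                      = 80
  https_port                     = 443
  priority                       = 1    # Same priority in both regions = active-active
  weight                         = 1000
}

# Add West Europe Origin
resource "azurerm_cdn_frontdoor_origin" "west_eu" {
  name                           = "app-westeu"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main.id
  enabled                        = true
  certificate_name_check_enabled = true
  host_name                      = azurerm_linux_web_app.west_app.default_hostname
  origin_host_header             = azurerm_linux_web_app.west_app.default_hostname
  http_port                      = 80
  https_port                     = 443
  priority                       = 1
  weight                         = 1000
}
```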

Best Practices & "Gotchas"

1. The Cost Factor

Warning: Multi-region is expensive. You are effectively doubling your compute and paying for cross-region data transfer.
Tip: Use auto-scaling rules. In "peace time," run your secondary region with minimal instances (maybe 2). Configure scale-out rules to handle the load only if the primary region fails or traffic spikes.
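
As a rough sketch of that tip, an autoscale setting on the secondary region's App Service plan might look like the following. It assumes a plan resource labelled azurerm_service_plan.west, which is not defined in this post, and the thresholds and instance counts are illustrative numbers, not recommendations.

```terraform
# Sketch only: azurerm_service_plan.west is a hypothetical plan for the West Europe app.
resource "azurerm_monitor_autoscale_setting" "west" {
  name                = "autoscale-westeu"
  resource_group_name = azurerm_resource_group.main.name
  location            = "westeurope"
  target_resource_id  = azurerm_service_plan.west.id

  profile {
    name = "default"

    capacity {
      default = 2  # "Peace time" footprint
      minimum = 2
      maximum = 10 # Headroom to absorb the other region's traffic
    }

    # Scale out when average CPU stays above 70% for 5 minutes.
    rule {
      metric_trigger {
        metric_name        = "CpuPercentage"
        metric_resource_id = azurerm_service_plan.west.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "2"
        cooldown  = "PT5M"
      }
    }

    # Scale back in once load drops, so you stop paying for idle instances.
    rule {
      metric_trigger {
        metric_name        = "CpuPercentage"
        metric_resource_id = azurerm_service_plan.west.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT10M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 30
      }

      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M"
      }
    }
  }
}
```

Keep in mind that scale-out takes minutes, so the minimum instance count needs to be enough to survive the first wave of failed-over traffic.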

2. Conflict Resolution

If User A updates a record in the US, and User B updates the same record in Europe at the exact same millisecond, who wins?
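Last Write Wins (LWW): The default in Cosmos DB. It compares the system timestamp property _ts (or a numeric property you choose) and keeps the version with the higher value. For most apps, this is fine.
Custom Merge: If you need complex logic, you'll need to write a merge stored procedure and register it in the container's conflict resolution policy.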
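
The policy is set per container. Below is a minimal sketch of a container that spells out the Last Write Wins policy explicitly. It assumes a Cosmos DB account resource labelled azurerm_cosmosdb_account.main; the database name, container name, partition key, and stored procedure path are made up for illustration. The partition_key_paths argument assumes azurerm 4.x (3.x uses partition_key_path with a single string).

```terraform
# Sketch only: database, container, and partition key are hypothetical.
resource "azurerm_cosmosdb_sql_database" "shop" {
  name                = "shop"
  resource_group_name = azurerm_resource_group.main.name
  account_name        = azurerm_cosmosdb_account.main.name
}

resource "azurerm_cosmosdb_sql_container" "orders" {
  name                = "orders"
  resource_group_name = azurerm_resource_group.main.name
  account_name        = azurerm_cosmosdb_account.main.name
  database_name       = azurerm_cosmosdb_sql_database.shop.name
  partition_key_paths = ["/customerId"]

  # Last Write Wins: the version with the higher value at this path is kept.
  conflict_resolution_policy {
    mode                     = "LastWriterWins"
    conflict_resolution_path = "/_ts"
  }

  # Custom merge alternative (hypothetical stored procedure name):
  # conflict_resolution_policy {
  #   mode                          = "Custom"
  #   conflict_resolution_procedure = "dbs/shop/colls/orders/sprocs/mergeOrders"
  # }
}
```

Whichever mode you pick, decide it up front and per container, since the right answer for a shopping cart is rarely the right answer for an inventory count.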

3. Chaos Engineering

You cannot trust this architecture until you break it.
Action: Schedule a "Game Day." Manually turn off the Web App in the primary region during a low-traffic window. Watch your Front Door metrics. Does traffic shift automatically? Did anyone notice?

Community Corner

I’d love to hear how you handle data consistency in distributed systems!

Do you use SQL Failover Groups instead of Cosmos?
Have you ever had a failover that didn't work? What happened?
Drop a comment below. The best war story gets a shoutout in next week's newsletter!

FAQ

Q: Can I use Azure SQL Database for active-active?
A: Generally, no. Azure SQL is typically Active-Passive (one writer). You can use "Failover Groups" to automate the failover, but there is usually a short period (seconds to minutes) of write unavailability.

Q: What is the difference between Availability Zones and Regions?
A: Zones are separate datacenters within the same region (e.g., separate buildings in East US). They protect against fire/power failure in one building. Regions are hundreds of miles apart. They protect against natural disasters affecting a whole city.

Q: How do I handle scheduled deployments?
A: Deploy to one region at a time. Deploy to the "passive" or secondary region first, run smoke tests, and then deploy to the primary. This ensures you don't break the whole world simultaneously.

Q: Does Front Door add latency?
A: Actually, it often reduces it. Front Door uses "Anycast" to onboard user traffic at the nearest Microsoft Edge node (PoP) and rides the dedicated Microsoft backbone network to your app, avoiding the slower public internet.

Q: What is RTO and RPO?
A: RTO (Recovery Time Objective): The maximum time you can afford to be down after a failure (Goal: Near zero). RPO (Recovery Point Objective): The maximum amount of recent data, measured in time, that you can afford to lose (Goal: Zero).

Conclusion

Building a Zero-Downtime Multi-Region architecture is an investment. It requires more complexity in your deployment pipelines and data handling. But the first time you sleep through a major regional outage because Front Door quietly shifted traffic to Europe, you’ll know it was worth every penny.

Next Step: Go to your Azure Portal right now and check your current application. Is it in a single region? If yes, map out what components (SQL, Storage, Compute) would need to change to support a second region. Start small—maybe just replicate your storage first!

5 Comments

Andrew Mewborn (verified) • Dec 1, 2025

That 3 AM wake up part really hit me. Nice point from the author about going fully stateless. How do you usually handle conflict resolution across regions?

architectraghu • Dec 2, 2025

@Andrew Mewborn Totally, that 3 AM page is usually when “stateless” stops being an architecture buzzword and becomes a survival strategy.

On conflict resolution, I try to keep it as boring and predictable as possible:

For important domains, I default to single-writer per domain (one logical write region) and use other regions as read replicas.
Where I do allow multi-region writes, I keep it narrow and use simple rules like last-write-wins for non-critical data and app-level merge logic for business-critical stuff (e.g., store both versions and have a clear way to resolve).

I’m really curious how you’ve handled this on your side. If you’ve hit any nasty edge cases with conflicts, I’d love to hear what bit you.

mohamed.cybersec • Dec 1, 2025

Great post really, keep going.

vibewithsoham · 2025-12-02T01:58:55+0000

Loved this... felt like listening to a real 3 AM war story that ends with an actually actionable blueprint instead of buzzwords.

James Dayal (verified) • Dec 1, 2025

Nice article...
