Zero Downtime: Designing an Azure Multi-Region Architecture


Introduction

High availability isn’t just a buzzword to throw into a system design interview; it’s the difference between a minor hiccup and a career-defining disaster. We often build for the "happy path," assuming Azure regions are invincible fortresses. But cables get cut, weather happens, and yes, sometimes entire regions go dark.

If your application lives in a single region, your uptime is at the mercy of that specific geography. Today, we’re going to walk through moving from a fragile, single-region setup to a robust Active-Active Multi-Region Architecture that can survive a total regional outage without your users ever noticing.


The Story: The 3 AM Wake-Up Call

It was 3:14 AM on a Tuesday—it’s always a Tuesday. My phone buzzed off the nightstand, vibrating with the distinctive, heart-stopping pattern of PagerDuty.

"East US is down," the message read.

I groggily opened my laptop, hoping it was just a blip. It wasn't. An underlying storage outage in the primary region had cascaded, taking our entire e-commerce platform offline. We had a "Disaster Recovery" plan, but it was a cold backup in West Europe. Restoring it meant updating DNS records manually, warming up cold caches, and praying the database backups were consistent.

It took us four hours to get back online. In internet time, that’s an eternity. We lost revenue, but worse, we lost trust.

That morning, fueled by too much coffee and regret, we made a pact: Never again. We were going to build a system that could lose an entire region and keep humming along. Here is exactly how we did it.


Core Concepts: Active-Passive vs. Active-Active

Before we write a single line of code, we need to agree on the strategy.

Active-Passive: You have a primary region handling traffic and a secondary region sitting idle (or receiving replication data). Failover usually requires manual intervention or a delay.

Active-Active: Both regions are live. They both handle traffic simultaneously. If one dies, the global load balancer simply stops sending users there. This is the holy grail for zero downtime.

To achieve this, we need three layers of abstraction:

  • The Brain (Traffic Routing): Something global to direct users.
  • The Muscle (Compute): Stateless application servers.
  • The Heart (Data): A database that accepts writes in multiple places.

Step-by-Step Guide

We are going to build this using Azure Front Door, Azure App Service, and Cosmos DB.

1. The Entry Point: Azure Front Door

Older setups used Azure Traffic Manager, which is DNS-based, so failover has to wait for DNS TTLs and client caches to expire. Modern web apps should use Azure Front Door. It operates at Layer 7 (HTTP/S), serves traffic over the Microsoft global edge network, and actively health-probes each origin, so it stops routing to a failed region without any DNS change. How quickly it reacts depends on the probe interval and sample settings.

Why Front Door? It terminates TLS at the edge and routes each request to the lowest-latency healthy origin.
The trick: We set up an "Origin Group" containing our East US and West Europe endpoints.

2. The Compute Layer: Go Stateless

This is where most developers get stuck. You cannot store session state (like a shopping cart or user login) in the memory of your web server. If a user hits East US for Request A and West Europe for Request B, the session will be lost.

Solution: Externalize state. Use Azure Cache for Redis (the Enterprise tier supports active geo-replication) or just keep state in the database.
The Code: Ensure your application connects to the local region's database endpoint to minimize latency.
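
As a rough sketch of what a stateless regional app could look like in Terraform, the block below defines the East US web app with session affinity turned off and a health endpoint for probes. The resource names, the SKU, and the COSMOS_PREFERRED_REGION app setting are illustrative assumptions, not something the platform requires; your application code would have to read that setting and pass it to its database client.

```terraform
# Sketch only: names, SKU, and the app setting key are hypothetical.
resource "azurerm_service_plan" "east" {
  name                = "plan-eastus"
  resource_group_name = azurerm_resource_group.main.name
  location            = "eastus"
  os_type             = "Linux"
  sku_name            = "P1v3"
}

resource "azurerm_linux_web_app" "east_app" {
  name                = "example-app-eastus" # Must be globally unique
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_service_plan.east.location
  service_plan_id     = azurerm_service_plan.east.id
  https_only          = true

  # No sticky sessions: any instance in any region can serve any request.
  client_affinity_enabled = false

  site_config {
    always_on                         = true
    health_check_path                 = "/health"
    health_check_eviction_time_in_min = 2
  }

  app_settings = {
    # Hypothetical setting: the app reads it to prefer the nearby database region.
    "COSMOS_PREFERRED_REGION" = "East US"
  }
}
```

The West Europe app would mirror this with its own plan, location, and preferred region, so that the only difference between the two deployments is configuration.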

3. The Data Layer: Multi-Region Writes

This is the hardest part. In a traditional SQL setup, you usually have one Writer and multiple Readers. If the Writer region goes down, you have to promote a Reader, which takes time.

Enter Azure Cosmos DB. It supports Multi-Region Writes. You can write to East US and West Europe simultaneously, and Cosmos DB handles the replication magic in the background.
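Consistency: For zero downtime, we usually accept "Session" consistency, which is also the Cosmos DB default. It guarantees that a client reads its own writes within the same session, which is enough for the vast majority of apps. Note that Strong consistency is not available on accounts with multi-region writes.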
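
A minimal sketch of such an account in Terraform might look like the following. It assumes the azurerm 4.x provider (in 3.x the two flags were named enable_multiple_write_locations and enable_automatic_failover), and the account name is a placeholder.

```terraform
# Sketch only: account name is hypothetical and must be globally unique.
resource "azurerm_cosmosdb_account" "main" {
  name                = "example-cosmos"
  resource_group_name = azurerm_resource_group.main.name
  location            = "eastus"
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  # Every region listed below accepts writes.
  multiple_write_locations_enabled = true
  automatic_failover_enabled       = true

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = "eastus"
    failover_priority = 0
  }

  geo_location {
    location          = "westeurope"
    failover_priority = 1
  }
}
```

Each geo_location block adds a replica region; with multi-region writes enabled, the application in each region can write to its nearest replica.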


Infrastructure as Code (Terraform)

Let's look at how to define the critical Front Door configuration in Terraform. This snippet creates an origin group that health-probes both regions and load-balances across them.

```terraform
resource "azurerm_cdn_frontdoor_profile" "main" {
  name                = "example-frontdoor"
  resource_group_name = azurerm_resource_group.main.name
  sku_name            = "Premium_AzureFrontDoor" # Required for some advanced features
}

resource "azurerm_cdn_frontdoor_origin_group" "main" {
  name                     = "example-origin-group"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.main.id
  session_affinity_enabled = false # Stateless apps should not pin users to one origin

  load_balancing {
    sample_size                        = 4
    successful_samples_required        = 3
    additional_latency_in_milliseconds = 50
  }

  health_probe {
    path                = "/health"
    protocol            = "Https"
    interval_in_seconds = 30 # Shorter interval means faster failover but more probe traffic
    request_type        = "HEAD"
  }
}

# Add East US Origin
resource "azurerm_cdn_frontdoor_origin" "east_us" {
  name                           = "app-eastus"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main.id
  enabled                        = true
  certificate_name_check_enabled = true
  host_name                      = azurerm_linux_web_app.east_app.default_hostname
  origin_host_header             = azurerm_linux_web_app.east_app.default_hostname
  http_port                      = 80
  https_port                     = 443
  priority                       = 1 # Same priority in both regions = active-active
  weight                         = 1000
}

# Add West Europe Origin
resource "azurerm_cdn_frontdoor_origin" "west_eu" {
  name                           = "app-westeu"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main.id
  enabled                        = true
  certificate_name_check_enabled = true
  host_name                      = azurerm_linux_web_app.west_app.default_hostname
  origin_host_header             = azurerm_linux_web_app.west_app.default_hostname
  http_port                      = 80
  https_port                     = 443
  priority                       = 1
  weight                         = 1000
}
```
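
An origin group does not receive traffic until an endpoint and a route point at it. As a hedged sketch, the two resources below would complete the picture; the endpoint and route names are placeholders, and the route assumes you want every path served over HTTPS from the default Front Door domain.

```terraform
# Sketch only: endpoint and route names are hypothetical.
resource "azurerm_cdn_frontdoor_endpoint" "main" {
  name                     = "example-endpoint"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.main.id
}

resource "azurerm_cdn_frontdoor_route" "main" {
  name                          = "example-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.main.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.main.id

  # Listing both origins also makes Terraform create them before the route.
  cdn_frontdoor_origin_ids = [
    azurerm_cdn_frontdoor_origin.east_us.id,
    azurerm_cdn_frontdoor_origin.west_eu.id,
  ]

  supported_protocols    = ["Http", "Https"]
  patterns_to_match      = ["/*"]
  forwarding_protocol    = "HttpsOnly"
  https_redirect_enabled = true
  link_to_default_domain = true
}
```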

Best Practices & "Gotchas"

1. The Cost Factor

Warning: Multi-region is expensive. You are effectively doubling your compute and paying for cross-region data transfer.
Tip: Use auto-scaling rules. In "peace time," run your secondary region with minimal instances (maybe 2). Configure scale-out rules so it can absorb the full load if the primary region fails or traffic spikes. Scale-out is not instant, so leave enough headroom to ride out the first few minutes.
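
One way to express that tip is a CPU-based autoscale setting on the secondary region's App Service plan. This is a sketch under assumptions: azurerm_service_plan.west is a hypothetical plan for the West Europe app, and the thresholds, instance counts, and cooldowns are starting points to tune, not recommendations.

```terraform
# Sketch only: the referenced plan and all thresholds are assumptions.
resource "azurerm_monitor_autoscale_setting" "west" {
  name                = "autoscale-westeu"
  resource_group_name = azurerm_resource_group.main.name
  location            = "westeurope"
  target_resource_id  = azurerm_service_plan.west.id

  profile {
    name = "default"

    capacity {
      default = 2
      minimum = 2  # "Peace time" baseline
      maximum = 10 # Ceiling for absorbing a full regional failover
    }

    # Scale out when average CPU stays above 70% for five minutes.
    rule {
      metric_trigger {
        metric_name        = "CpuPercentage"
        metric_resource_id = azurerm_service_plan.west.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "2"
        cooldown  = "PT5M"
      }
    }

    # Scale back in slowly once the load drops.
    rule {
      metric_trigger {
        metric_name        = "CpuPercentage"
        metric_resource_id = azurerm_service_plan.west.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT10M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 30
      }

      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M"
      }
    }
  }
}
```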

2. Conflict Resolution

If User A updates a record in the US, and User B updates the same record in Europe at the exact same millisecond, who wins?
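Last Write Wins (LWW): The default in Cosmos DB. It compares the system-managed _ts timestamp on each conflicting version of the item and keeps the one with the higher value. For most apps, this is fine, as long as you accept that the losing write is discarded.
Custom Merge: If you need complex logic, you can set a custom conflict resolution policy and register a merge stored procedure (API for NoSQL only). If no procedure is registered, conflicts land in the conflicts feed for your application to resolve.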
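
The policy is set per container. A minimal sketch, assuming the azurerm 4.x provider and an existing account resource named azurerm_cosmosdb_account.main (a hypothetical name), with placeholder database, container, and partition key names:

```terraform
# Sketch only: database, container, and partition key are hypothetical.
resource "azurerm_cosmosdb_sql_database" "shop" {
  name                = "shop"
  resource_group_name = azurerm_resource_group.main.name
  account_name        = azurerm_cosmosdb_account.main.name
}

resource "azurerm_cosmosdb_sql_container" "orders" {
  name                = "orders"
  resource_group_name = azurerm_resource_group.main.name
  account_name        = azurerm_cosmosdb_account.main.name
  database_name       = azurerm_cosmosdb_sql_database.shop.name
  partition_key_paths = ["/customerId"]

  conflict_resolution_policy {
    # Highest value at this path wins; /_ts is the system timestamp.
    mode                     = "LastWriterWins"
    conflict_resolution_path = "/_ts"

    # For custom merge logic, use mode = "Custom" and set
    # conflict_resolution_procedure to the stored procedure's resource path.
  }
}
```

Pointing conflict_resolution_path at your own numeric property (a version counter, for example) is an option if wall-clock timestamps are not the ordering you want.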

3. Chaos Engineering

You cannot trust this architecture until you break it.
Action: Schedule a "Game Day." Manually turn off the Web App in the primary region during a low-traffic window. Watch your Front Door metrics. Does traffic shift automatically? Did anyone notice?


Community Corner

I’d love to hear how you handle data consistency in distributed systems!

  • Do you use SQL Failover Groups instead of Cosmos?
  • Have you ever had a failover that didn't work? What happened?
  • Drop a comment below. The best war story gets a shoutout in next week's newsletter!

FAQ

Q: Can I use Azure SQL Database for active-active?
A: Generally, no. Azure SQL is typically Active-Passive (one writer). You can use "Failover Groups" to automate the failover, but there is usually a short period (seconds to minutes) of write unavailability.

Q: What is the difference between Availability Zones and Regions?
A: Zones are separate datacenters within the same region (e.g., separate buildings in East US). They protect against fire/power failure in one building. Regions are hundreds of miles apart. They protect against natural disasters affecting a whole city.

Q: How do I handle scheduled deployments?
A: Deploy to one region at a time. Deploy to the "passive" or secondary region first, run smoke tests, and then deploy to the primary. This ensures you don't break the whole world simultaneously.

Q: Does Front Door add latency?
A: Actually, it often reduces it. Front Door uses "Anycast" to onboard user traffic at the nearest Microsoft edge node (PoP) and rides the dedicated Microsoft backbone network to your app, avoiding the slower public internet.

Q: What is RTO and RPO?
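A: RTO (Recovery Time Objective): The maximum time you can afford to be down after a failure (Goal: Near zero). RPO (Recovery Point Objective): The maximum amount of data, measured in time, that you can afford to lose (Goal: Zero, though with asynchronous cross-region replication the most recent writes in a failed region can be lost or delayed).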


Conclusion

Building a Zero-Downtime Multi-Region architecture is an investment. It requires more complexity in your deployment pipelines and data handling. But the first time you sleep through a major regional outage because Front Door quietly shifted traffic to Europe, you’ll know it was worth every penny.

Next Step: Go to your Azure Portal right now and check your current application. Is it in a single region? If yes, map out what components (SQL, Storage, Compute) would need to change to support a second region. Start small—maybe just replicate your storage first!
