How a Single NAT Gateway Can Silently Kill Your AWS High Availability

— Originally published at dev.to

A real-world lesson from a production-like AWS lab challenge

### The Scenario That Should Scare You

Imagine this: your AWS environment has two Availability Zones, public and private subnets, an Application Load Balancer, and Auto Scaling. Your architecture diagram looks solid. Then AZ-A goes down. Your ALB shifts traffic to AZ-B, where your EC2 instances are running fine. But your application is still broken.

Why? Because every private subnet instance, including those in AZ-B, routes its outbound traffic through a single NAT Gateway sitting in AZ-A, which is now unreachable.

You didn't have a highly available architecture. You had the illusion of one.

### Understanding the Problem: NAT Gateways Are Zonal

A NAT Gateway is not a regional resource. It lives in a specific Availability Zone.

When you create a NAT Gateway, you place it in a specific subnet, which belongs to a specific AZ. If that AZ goes down, your NAT Gateway goes down with it.

Many teams create a single NAT Gateway to save costs, then route all private subnet traffic across all AZs through that one gateway:

Private Subnet AZ-A → 0.0.0.0/0 → nat-09xxxxx (AZ-A) ✅
Private Subnet AZ-B → 0.0.0.0/0 → nat-09xxxxx (AZ-A) ❌

The private subnet in AZ-B is routing through a NAT Gateway in AZ-A. This is a cross-AZ dependency, and a silent Single Point of Failure.
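To make the dependency concrete, here is a minimal Python sketch that models each private subnet's default route and asks which subnets lose outbound access when one AZ fails. The class, the function name, and the `nat-az-a` identifier are illustrative placeholders, not real resource IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateSubnet:
    name: str
    az: str
    default_route_nat: str  # NAT Gateway targeted by the 0.0.0.0/0 route


def subnets_losing_egress(subnets, nat_az, failed_az):
    """Subnets in surviving AZs whose default route points into the failed AZ."""
    return [
        s.name
        for s in subnets
        if s.az != failed_az and nat_az[s.default_route_nat] == failed_az
    ]


# Hypothetical layout mirroring the routes above: one shared NAT Gateway in AZ-A.
nat_az = {"nat-az-a": "AZ-A"}
subnets = [
    PrivateSubnet("private-az-a", "AZ-A", "nat-az-a"),
    PrivateSubnet("private-az-b", "AZ-B", "nat-az-a"),
]

print(subnets_losing_egress(subnets, nat_az, failed_az="AZ-A"))
# ['private-az-b']
print(subnets_losing_egress(subnets, nat_az, failed_az="AZ-B"))
# []
```

The asymmetry is the point: losing AZ-B costs you nothing extra, while losing AZ-A takes the healthy AZ-B subnet's outbound access with it.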

### What I Found in the Lab

The lab presented a VPC with this structure:

| Resource | CIDR / Details |
|---|---|
| VPC | 10.0.0.0/16 |
| Public Subnet AZ-A | 10.0.128.0/20 |
| Public Subnet AZ-B | 10.0.144.0/20 |
| Private Subnet 1A (AZ-A) | 10.0.0.0/19 |
| Private Subnet 1B (AZ-A) | 10.0.192.0/21 |
| Private Subnet 2A (AZ-B) | 10.0.32.0/19 |
| Private Subnet 2B (AZ-B) | 10.0.200.0/21 |

Two NAT Gateways existed: one in AZ-A, one in AZ-B. At first glance, this looked correct.

But when I inspected the Route Tables, the problem was immediately visible. All four private subnet Route Tables had the same entry:

Destination: 0.0.0.0/0 → Target: nat-09xxxxxxxx (AZ-A)

The NAT Gateway in AZ-B existed, but nobody was using it. It was provisioned but completely disconnected from the routing logic. The two private subnets in AZ-B were silently depending on the NAT Gateway in AZ-A for all outbound internet traffic.
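A provisioned-but-unused gateway like this is easy to detect once you compare the set of NAT Gateways with the set of default-route targets. The following is a small sketch of that comparison; the IDs and route table names are hypothetical stand-ins for the lab's resources.

```python
def unused_nat_gateways(nat_gateway_ids, default_route_targets):
    """Return NAT Gateways that no route table's 0.0.0.0/0 route points to."""
    referenced = set(default_route_targets.values())
    return sorted(set(nat_gateway_ids) - referenced)


# Hypothetical IDs standing in for the lab's two gateways.
nat_gateway_ids = ["nat-az-a", "nat-az-b"]

# Route table name -> NAT Gateway targeted by its default route, as found in the lab.
default_route_targets = {
    "private-1a": "nat-az-a",
    "private-1b": "nat-az-a",
    "private-2a": "nat-az-a",
    "private-2b": "nat-az-a",
}

print(unused_nat_gateways(nat_gateway_ids, default_route_targets))
# ['nat-az-b']
```

Any non-empty result is worth a second look: you are either paying for a gateway nobody uses, or some subnet is routing through the wrong one.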

### Why This Happens

There are two common causes:

**1. Cost-cutting gone wrong**

Teams create one NAT Gateway to reduce costs, then forget that high availability requires one per AZ. A NAT Gateway costs approximately $0.045/hour (the exact price varies by region) plus per-GB data processing charges. Running two instead of one adds roughly $32/month in hourly charges, a small price compared to the cost of an outage.

**2. Infrastructure drift**

The architecture was correct at some point, then someone modified the Route Tables manually, or via a flawed IaC change, and the second NAT Gateway became orphaned without anyone noticing.

No alerts, no errors, no warnings. Everything looks fine until AZ-A goes down. This is what makes this particular SPOF so dangerous: **it is completely invisible during normal operations.**

### The Fix: One NAT Gateway Per AZ, One Route Table Per Private Subnet

The solution is straightforward: each private subnet must route its outbound internet traffic through the NAT Gateway **in its own Availability Zone.**

**Correct routing after the fix:**

```plaintext
Private Subnet 1A (AZ-A) → 0.0.0.0/0 → nat-AZ-A ✅
Private Subnet 1B (AZ-A) → 0.0.0.0/0 → nat-AZ-A ✅
Private Subnet 2A (AZ-B) → 0.0.0.0/0 → nat-AZ-B ✅
Private Subnet 2B (AZ-B) → 0.0.0.0/0 → nat-AZ-B ✅
```

**Step 1: Identify which NAT Gateway belongs to which AZ**

Go to **VPC → NAT Gateways**, click each NAT Gateway, and check the **Subnet** field. The subnet tells you which AZ the gateway belongs to.

**Step 2: Fix the Route Tables for AZ-B private subnets**

1. Go to **VPC → Route Tables**
2. Find the Route Table associated with **Private Subnet 2A (AZ-B)**
3. Click **Edit routes**
4. Change the target of `0.0.0.0/0` from `nat-AZ-A` to `nat-AZ-B`
5. Save changes
6. Repeat for **Private Subnet 2B (AZ-B)**

**Step 3: Verify**

All four private subnet Route Tables should now point exclusively to the NAT Gateway in their own AZ. If AZ-A goes down, the AZ-B subnets keep their outbound internet path and are completely self-sufficient.
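The verification step can also be scripted rather than clicked through. Below is a hedged sketch of such an audit in Python: it takes an EC2 client object (for example one created with boto3) and relies on the `describe_subnets`, `describe_nat_gateways`, and `describe_route_tables` calls and their response fields. The function name is my own, and the sketch assumes every private subnet has an explicit Route Table association, as in the lab; it ignores pagination and subnets that fall back to the VPC's main Route Table.

```python
def cross_az_nat_routes(ec2, vpc_id):
    """Return (subnet_id, subnet_az, nat_id, nat_az) for every default route
    that sends a subnet's traffic to a NAT Gateway in a different AZ."""
    vpc_filter = [{"Name": "vpc-id", "Values": [vpc_id]}]

    subnet_az = {
        s["SubnetId"]: s["AvailabilityZone"]
        for s in ec2.describe_subnets(Filters=vpc_filter)["Subnets"]
    }
    # A NAT Gateway lives in the AZ of the subnet it was created in.
    nat_az = {
        n["NatGatewayId"]: subnet_az[n["SubnetId"]]
        for n in ec2.describe_nat_gateways()["NatGateways"]
        if n["SubnetId"] in subnet_az
    }

    findings = []
    for table in ec2.describe_route_tables(Filters=vpc_filter)["RouteTables"]:
        # NAT Gateways targeted by this table's default route (normally zero or one).
        nat_ids = [
            r["NatGatewayId"]
            for r in table["Routes"]
            if r.get("DestinationCidrBlock") == "0.0.0.0/0" and "NatGatewayId" in r
        ]
        for assoc in table.get("Associations", []):
            subnet_id = assoc.get("SubnetId")
            if subnet_id not in subnet_az:  # main-table or gateway association
                continue
            for nat_id in nat_ids:
                if nat_id in nat_az and nat_az[nat_id] != subnet_az[subnet_id]:
                    findings.append(
                        (subnet_id, subnet_az[subnet_id], nat_id, nat_az[nat_id])
                    )
    return findings
```

With boto3 installed and credentials configured, calling `cross_az_nat_routes(boto3.client("ec2"), vpc_id)` with your own VPC ID should return an empty list once Step 2 is complete. Before the fix, it would list both AZ-B private subnets.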
### Getting It Right From the Start: Terraform

If you're provisioning your VPC with Infrastructure as Code, which you should be, here's how to enforce this pattern correctly with Terraform from day one.

```hcl
# NAT Gateway in AZ-A
resource "aws_eip" "nat_a" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_a" {
  allocation_id = aws_eip.nat_a.id
  subnet_id     = aws_subnet.public_a.id

  tags = { Name = "nat-gateway-az-a" }
}

# NAT Gateway in AZ-B
resource "aws_eip" "nat_b" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_b" {
  allocation_id = aws_eip.nat_b.id
  subnet_id     = aws_subnet.public_b.id

  tags = { Name = "nat-gateway-az-b" }
}

# Route Table for AZ-A private subnets: default route via the AZ-A NAT Gateway
resource "aws_route_table" "private_a" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_a.id
  }

  tags = { Name = "private-rt-az-a" }
}

# Route Table for AZ-B private subnets: default route via the AZ-B NAT Gateway
resource "aws_route_table" "private_b" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_b.id
  }

  tags = { Name = "private-rt-az-b" }
}

# Associations for AZ-A
resource "aws_route_table_association" "private_1a" {
  subnet_id      = aws_subnet.private_1a.id
  route_table_id = aws_route_table.private_a.id
}

resource "aws_route_table_association" "private_1b" {
  subnet_id      = aws_subnet.private_1b.id
  route_table_id = aws_route_table.private_a.id
}

# Associations for AZ-B
resource "aws_route_table_association" "private_2a" {
  subnet_id      = aws_subnet.private_2a.id
  route_table_id = aws_route_table.private_b.id
}

resource "aws_route_table_association" "private_2b" {
  subnet_id      = aws_subnet.private_2b.id
  route_table_id = aws_route_table.private_b.id
}
```

The beauty of this approach: **the correct pattern is enforced by design.** Each AZ has its own NAT Gateway, its own Route Table, and explicit associations. Drift becomes much harder to introduce unnoticed: every intended change goes through code review, and a later `terraform plan` should surface any manual edit to these routes as a difference from the code.
### The Broader Lesson: Designing for Failure

AWS high availability is built on one fundamental principle:

> **Assume everything will fail. Design so that the failure of any single component does not bring down the entire system.**

A NAT Gateway is a component. An Availability Zone is a failure domain. When you route cross-AZ traffic through a single NAT Gateway, you create an invisible dependency that violates this principle, and the worst part is that **everything looks fine until the moment it isn't.**

The AWS Well-Architected Framework's Reliability Pillar specifically calls for eliminating Single Points of Failure. A shared NAT Gateway is a textbook SPOF, easy to miss precisely because the architecture looks correct at first glance.

### Key Takeaways

- A NAT Gateway is **zonal**: it belongs to one specific Availability Zone
- Routing all private subnet traffic through a single NAT Gateway creates a **hidden Single Point of Failure**
- The fix: **one NAT Gateway per AZ, with every private subnet's Route Table pointing to the NAT Gateway in its own AZ**
- Use **Terraform** to enforce this pattern by design and guard against infrastructure drift
- The cost of two NAT Gateways (~$32/month extra) is nothing compared to the cost of an outage


This article is part of my AWS Solutions Architect Associate (SAA-C03) preparation series. I document real hands-on lab experiences, architecture challenges, and lessons learned along the way.

Follow along for more practical AWS architecture and Infrastructure as Code content.
