AWS Architect & Design - Decision Matrix: NAT per AZ vs Single NAT vs Regional NAT for EKS Environments

Final comparison table (all approaches)

Approach	Architecture (AZ / NAT)	Design	Cost profile	Parity to PROD	Terraform complexity	Risk	Best-practice fit	Pros	Cons	Typical fit (envs)
A. Zonal NAT per AZ (classic HA)	3 AZ / 3 NAT (one per AZ)	Each private subnet routes `0.0.0.0/0` to the same-AZ NAT	Highest baseline NAT hourly; lowest cross-AZ egress cost	Excellent	Medium (per-AZ loops for NAT + routes)	Low	Strong	HA for outbound; avoids cross-AZ NAT traffic; “boring” operations	Pays for 3 NATs even when non-prod is quiet	PROD, UAT, Stage if pre-prod
B. Single zonal NAT shared (cost compromise)	3 AZ / 1 NAT (in 1 AZ)	All private subnets (all AZs) route to a single NAT	Lower hourly NAT; may add cross-AZ data transfer for 2 AZs	Good topology parity, weaker egress parity	Low–Medium (simple NAT, but routing must be consistent)	Medium	Compromise	Keeps 3-AZ layout; saves NAT hourly cost	AZ containing NAT becomes egress SPOF; cross-AZ charges/latency possible	Stage/Dev only if outages acceptable & traffic low
C. Single-AZ lower envs	1 AZ / 1 NAT	Reduce VPC + cluster footprint to one AZ	Lowest overall (fewer subnets/nodes/LBs + 1 NAT)	Poor (no multi-AZ testing)	Low–Medium (parameterized modules)	Medium–High (late discovery of AZ issues)	OK if explicitly accepted	Cheapest + simplest; reduces many moving parts	Multi-AZ behavior not tested until UAT/PROD	DEV (often), sometimes Stage (sandbox)
D. Regional NAT Gateway (newer, simplified HA)	3 AZ / “1 NAT object” (regional mode auto-expands across AZs)	Set NAT availability to regional; AWS manages multi-AZ expansion/contraction	Can approach per-AZ cost depending on AZ expansion; simpler ops	Excellent	Low (fewer NAT resources/route concerns)	Low–Medium (newer feature adoption)	Strong (modern best practice option)	“HA by default” + simpler architecture; reduces need to manage per-AZ NAT placement	Newer operational patterns; Terraform changes (auto↔manual) can recreate	PROD/UAT/Stage, possibly Dev if you keep 3 AZ
E. NAT Instance (DIY)	1–3 AZ / EC2-based NAT	Replace NAT GW with EC2 NAT instances + routes	Lowest if tiny, but adds ops overhead	Depends	High (patching/scaling/HA)	Medium–High	Not preferred for platform teams	Cheap for DEV; can stop when unused	You own uptime, scaling, patching, throughput limits	DEV only (if extreme cost pressure)
Add-on: VPC Endpoints (EKS cost lever)	Works with any of the above	Use S3 gateway endpoint + interface endpoints (ECR, CloudWatch Logs, STS/KMS/SSM, etc.)	Often reduces NAT $/GB materially	Improves consistency & reliability	Medium (endpoint set + SG/policies)	Low	Strong	Reduces NAT traffic, improves security (no internet path)	Endpoint hourly costs exist; needs policy/governance	All envs (especially PROD/UAT)

Source-backed notes (why these rows look like they do)

AWS explicitly advises same-AZ placement or NAT per AZ to reduce cross-AZ data transfer charges. (AWS Documentation)
AWS EKS best practices recommend VPC endpoints to reduce costs and avoid internet traversal for AWS services. (AWS Documentation)
Regional NAT Gateway: AWS docs + “What’s New” describe regional mode and that you no longer need a public subnet to host a regional NAT. (AWS Documentation)
Terraform support and lifecycle caveats for regional NAT (availability_mode, availability_zone_address, recreation behavior) are documented in the provider registry. (Terraform Registry)

Environment-by-environment recommendation table (EKS)

Environment	Recommended AZ footprint	Recommended NAT strategy	Why (practical outcome)
PROD	3 AZ	A (Zonal NAT per AZ) or D (Regional NAT)	Multi-AZ egress should not have an AZ SPOF; minimize cross-AZ NAT traffic; keep ops predictable. (AWS Documentation)
UAT	3 AZ	Match PROD (A or D)	UAT’s job is to catch prod-like scaling/AZ behaviors early.
STAGE	Depends on purpose	If pre-prod: match PROD (A or D). If sandbox: C (1 AZ) or B (3 AZ + 1 NAT)	Pre-prod should validate multi-AZ. Sandbox can trade HA for cost. Cross-AZ NAT costs matter if you keep 3 AZ with 1 zonal NAT. (AWS Documentation)
DEV	1 AZ (default)	C (1 AZ + 1 NAT); optionally endpoints	Cheapest while still functional. Keep 3 AZ only if you truly need early multi-AZ validation. (AWS Documentation)

Zonal NAT vs Regional NAT (quick decision points)

Topic	Zonal NAT (classic)	Regional NAT (new)
HA model	You build HA by deploying one NAT per AZ	“HA by default” via regional availability mode and automatic AZ expansion (AWS Documentation)
Ops model	More resources, more route-table patterns	Simpler NAT footprint; fewer NAT objects
Terraform	Straightforward per-AZ loops	Supported, but switching auto/manual modes can recreate resources (Terraform Registry)
Best fit	Teams wanting mature, widely-used patterns	Teams wanting simpler HA with less NAT sprawl

This blog addresses the following topics:

NAT Gateway Strategy for Multi-Account EKS: HA, Cost, and Parity Trade-offs
EKS Network Egress Design Guide: Zonal vs Regional NAT and Environment Sizing
Standardizing VPC Egress Across PROD/UAT/STAGE/DEV for EKS
Balancing Cost and Reliability: NAT Gateways, AZ Footprint, and VPC Endpoints in EKS
Reference Architecture: Multi-AZ EKS with Optimized NAT and Private Connectivity
Decision Matrix: NAT per AZ vs Single NAT vs Regional NAT for EKS Environments
Reducing NAT Spend in EKS: Environment Tiers + VPC Endpoints Blueprint
EKS Networking Parity Playbook: When to Match PROD and When to Cut Cost

Below is a consolidated recommendation guide for NAT strategy across PROD / UAT / STAGE / DEV for an EKS-based platform in separate AWS accounts, with 3 AZs, 3 public + 3 private subnets per VPC.

Executive conclusion

If an environment runs workloads across multiple AZs, the clean “AWS classic” design is NAT Gateway per AZ (zonal NAT) with each private subnet routing to the NAT in the same AZ. This avoids AZ-level egress SPOF and avoids cross-AZ data transfer for NAT traffic. (AWS Documentation)
“3 AZ + 1 zonal NAT GW” (single NAT in one AZ, shared by all AZs) is a cost compromise, not best practice: it introduces an AZ-level SPOF for egress and typically adds cross-AZ charges for the two “remote” AZs. (Repost)
New option (late 2025): Regional NAT Gateway (“availability mode = regional”) gives you one NAT object that automatically expands across AZs based on workload presence, and simplifies route-table/placement concerns. It’s designed for “HA by default” with simpler ops. (AWS Documentation)
For EKS specifically, the biggest NAT cost driver is often AWS service traffic (ECR image pulls, logs/metrics, STS/KMS, etc.). Use VPC endpoints aggressively to reduce NAT processing/data transfer where possible. (AWS Documentation)

Key AWS facts to anchor decisions

Zonal NAT Gateway is created in a specific AZ and has redundancy within that AZ (not across AZs). (AWS Documentation)
AWS explicitly recommends reducing NAT data transfer charges by keeping traffic in the same AZ (i.e., use NAT per AZ for multi-AZ workloads), because cross-AZ paths can add cost. (Repost)
Regional NAT Gateway automatically expands/contracts across AZs based on workload presence. (AWS Documentation)
Terraform supports regional NAT configuration via availability_mode and (optionally) availability_zone_address blocks. (Terraform Registry)

Approaches you should consider (with Pros/Cons)

Approach A — Multi-AZ + Zonal NAT GW per AZ (3 NATs for 3 AZs)

Design

NAT GW in each public subnet (one per AZ).
Each private subnet’s route table points 0.0.0.0/0 to the NAT in the same AZ.

Pros

Best HA for egress (no AZ-level NAT SPOF).
Avoids cross-AZ NAT traffic/cost.
Most aligned with long-standing AWS reference patterns.

Cons

Highest baseline NAT hourly cost (3x NAT resources).
More resources (though IaC makes it mostly variable-driven).

Terraform complexity

Moderate: for_each/count per AZ for NAT + per-AZ route tables (or per-AZ associations).
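
As a rough illustration of the per-AZ loop described above, here is a minimal Python sketch of the mapping that the IaC has to produce: one NAT per AZ, and each private route table's default route pointing at the NAT in its own AZ. The resource names (nat-<az>, public-<az>, private-rt-<az>) are hypothetical labels chosen for this example, not AWS or Terraform identifiers.

```python
from typing import Dict, List


def plan_same_az_routes(azs: List[str]) -> Dict[str, Dict[str, str]]:
    """Build a per-AZ plan: one NAT per AZ, default route stays in the same AZ.

    All names below are hypothetical labels used only for illustration.
    """
    plan: Dict[str, Dict[str, str]] = {}
    for az in azs:
        nat_name = f"nat-{az}"
        plan[az] = {
            "nat_gateway": nat_name,
            "public_subnet": f"public-{az}",          # where the zonal NAT lives
            "private_route_table": f"private-rt-{az}",
            "default_route": f"0.0.0.0/0 -> {nat_name}",
        }
    return plan


if __name__ == "__main__":
    example_azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
    for az, entry in plan_same_az_routes(example_azs).items():
        print(az, entry["private_route_table"], entry["default_route"])
```

The point of the sketch is that the AZ list is the only input; in Terraform the same shape would typically be driven by for_each or count over that list.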

Risk

Low.

Best fit

PROD and UAT (and Stage if it is “pre-prod”).

Best-practice alignment

Strong. (AWS Documentation)

Approach B — Multi-AZ + Single zonal NAT GW shared (1 NAT for 3 AZs)

Design

One NAT GW in a single public subnet (e.g., AZ-a).
All private subnet route tables (AZ-a/b/c) point 0.0.0.0/0 to that one NAT.

Pros

Lower NAT hourly cost vs 3 NATs.
Keeps 3-AZ subnet topology (better parity than single-AZ environments).

Cons

AZ-level SPOF for outbound: if the NAT’s AZ is impaired, all private subnets lose internet egress (even if other AZs are fine). (AWS Documentation)
Likely cross-AZ NAT traffic costs for the other AZs; AWS guidance is to keep high-traffic resources in the same AZ as the NAT to reduce transfer charges. (Repost)
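
To make the hourly-versus-cross-AZ trade-off concrete, the following Python sketch compares the monthly cost of NAT per AZ with a single shared zonal NAT. The default rates (0.045 per NAT hour, 0.045 per GB processed, 0.02 per GB crossing an AZ boundary, 730 hours per month) are illustrative assumptions, not quoted prices; substitute the current rates for your region.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_cost_per_az_nat(az_count, total_gb, nat_hourly=0.045, nat_per_gb=0.045):
    """Approach A: one NAT per AZ, no cross-AZ hop on the way to the NAT."""
    return az_count * nat_hourly * HOURS_PER_MONTH + total_gb * nat_per_gb


def monthly_cost_single_nat(total_gb, remote_gb, nat_hourly=0.045,
                            nat_per_gb=0.045, cross_az_per_gb=0.02):
    """Approach B: one NAT; remote_gb is the traffic from AZs without a NAT."""
    return (nat_hourly * HOURS_PER_MONTH
            + total_gb * nat_per_gb
            + remote_gb * cross_az_per_gb)


def break_even_remote_gb(az_count, nat_hourly=0.045, cross_az_per_gb=0.02):
    """Remote-AZ GB per month at which the extra NATs pay for themselves."""
    extra_hourly_cost = (az_count - 1) * nat_hourly * HOURS_PER_MONTH
    return extra_hourly_cost / cross_az_per_gb


if __name__ == "__main__":
    print(round(break_even_remote_gb(3), 1), "GB/month of remote-AZ traffic")
```

Under these assumed rates, the two extra NATs cost about as much as roughly 3,285 GB per month of cross-AZ NAT traffic; below that volume the shared NAT is cheaper on paper, before the availability risk is taken into account.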

Terraform complexity

Low–moderate: simpler NAT resources, but route tables still need careful handling.

Risk

Medium (acceptable only if the environment can tolerate losing outbound internet during AZ events).
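
The blast radius of that risk can be sketched in a few lines of Python. The model below is a simplification: it only tracks which AZ hosts the NAT that each AZ's private subnets route to, and it uses short hypothetical AZ labels ("a", "b", "c").

```python
from typing import Dict, List


def azs_losing_egress(routes: Dict[str, str], failed_az: str) -> List[str]:
    """Return the AZs whose private subnets lose NAT egress when failed_az is impaired.

    routes maps each private-subnet AZ to the AZ hosting the NAT it routes to.
    Workloads running inside failed_az are impaired regardless; this function
    only reports egress loss caused by where the NAT is placed.
    """
    return sorted(az for az, nat_az in routes.items() if nat_az == failed_az)


AZS = ["a", "b", "c"]
per_az_nat = {az: az for az in AZS}       # Approach A: NAT in every AZ
single_nat = {az: "a" for az in AZS}      # Approach B: one NAT in AZ "a"

if __name__ == "__main__":
    print("A, AZ a fails:", azs_losing_egress(per_az_nat, "a"))
    print("B, AZ a fails:", azs_losing_egress(single_nat, "a"))
    print("B, AZ b fails:", azs_losing_egress(single_nat, "b"))
```

With NAT per AZ, an impaired AZ only takes its own egress with it; with the shared NAT, an impairment in the NAT's AZ removes egress for every AZ.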

Best fit

STAGE/DEV only if you explicitly accept reduced egress HA and low traffic volume.

Best-practice alignment

Compromise (not “best practice” for HA). (Repost)

Approach C — Single-AZ lower environments (1 AZ + 1 NAT)

Design

Only one AZ in the VPC for STAGE/DEV (or DEV only).

Pros

Lowest overall cost: fewer subnets, fewer nodes/LBs, smaller footprint.
Simplest networking.

Cons

Parity drift: you won’t test multi-AZ behavior (scheduling spread, AZ capacity edge cases, load balancer zonal behavior, failover patterns, etc.) until UAT/PROD.

Terraform complexity

Low if your module is fully parameterized (az_count=1|3), although the conditional logic still adds some branching; often higher in practice if you maintain separate topology modules per environment.

Risk

Medium–high for “late discovery” of multi-AZ issues.

Best fit

DEV (usually OK), sometimes STAGE if stage is just a sandbox and not “pre-prod”.

Approach D — Regional NAT Gateway (one NAT object, multi-AZ behavior managed by AWS)

Design

Create NAT GW with availability_mode = "regional".
AWS expands/contracts the NAT across AZs based on workload presence, which removes the need to operate one NAT per AZ yourself. (AWS Documentation)

Pros

Simplifies NAT architecture and ongoing ops (fewer NAT resources to reason about).
“HA by default” posture (AWS-managed multi-AZ expansion model). (AWS Documentation)
Terraform support exists (availability_mode, optional availability_zone_address). (Terraform Registry)

Cons / watchouts

This is newer, so you’ll want strong rollout discipline (testing + observability + runbooks).
Depending on how you configure addresses, resource changes can trigger recreation (Terraform notes). (Terraform Registry)

Terraform complexity

Lowest for NAT (often fewer resources, less per-AZ looping).

Risk

Low–medium (mostly “newness” + change management).

Best fit

Strong candidate for PROD/UAT/STAGE where you want simplicity without losing multi-AZ posture.

Approach E — NAT Instances (only for very low-value envs)

Not a best practice for long-run managed-platform operations:

You trade NAT GW cost for instance management (patching, scaling, HA, monitoring).
Only consider for DEV if cost pressure is extreme.

EKS-specific cost and design guidance (applies to all approaches)

Reduce NAT dependency with VPC endpoints

EKS environments commonly send large volumes of traffic to AWS services; use VPC endpoints so those flows bypass NAT, which reduces NAT charges. (AWS Documentation)

Common high-value endpoints in EKS VPCs:

Gateway endpoints: S3 (and DynamoDB if used)
Interface endpoints (often worth it): ECR (api + dkr), CloudWatch Logs, STS, KMS, SSM, EC2, ELB (as relevant)
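
As a small aid for turning the list above into configuration, this Python sketch builds endpoint service names using the common com.amazonaws.<region>.<service> pattern. The function name and the service short-name list are assumptions made for this example; check the services actually offered in your region before relying on it.

```python
from typing import Dict, List

# Short names taken from the list above; availability can vary by region.
GATEWAY_SERVICES = ["s3"]
INTERFACE_SERVICES = [
    "ecr.api", "ecr.dkr", "logs", "sts", "kms", "ssm", "ec2",
    "elasticloadbalancing",
]


def endpoint_service_names(region: str, include_dynamodb: bool = False) -> Dict[str, List[str]]:
    """Return gateway and interface endpoint service names for one region."""
    prefix = f"com.amazonaws.{region}"
    gateway = GATEWAY_SERVICES + (["dynamodb"] if include_dynamodb else [])
    return {
        "gateway": [f"{prefix}.{svc}" for svc in gateway],
        "interface": [f"{prefix}.{svc}" for svc in INTERFACE_SERVICES],
    }


if __name__ == "__main__":
    names = endpoint_service_names("us-east-1", include_dynamodb=True)
    for kind, services in names.items():
        for service in services:
            print(kind, service)
```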

This is often a bigger win than “1 NAT vs 3 NAT” because traffic that goes through an endpoint no longer incurs the per-GB NAT processing charge and no longer crosses AZs to reach a NAT.
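
Because interface endpoints have their own hourly and per-GB charges, the saving depends on volume. The sketch below estimates the monthly traffic at which an interface endpoint becomes cheaper than sending the same bytes through NAT. The default rates (0.01 per endpoint-hour per AZ, 0.01 per GB through the endpoint, 0.045 per GB through NAT) are illustrative assumptions, not quoted prices.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def interface_endpoint_monthly_cost(endpoint_count, az_count, gb,
                                    hourly=0.01, per_gb=0.01):
    """Fixed hourly charge per endpoint per AZ, plus per-GB processing."""
    return endpoint_count * az_count * hourly * HOURS_PER_MONTH + gb * per_gb


def nat_processing_cost(gb, per_gb=0.045):
    """Per-GB NAT processing for the same traffic (hourly NAT cost excluded)."""
    return gb * per_gb


def break_even_gb(endpoint_count, az_count, endpoint_hourly=0.01,
                  endpoint_per_gb=0.01, nat_per_gb=0.045):
    """Monthly GB above which the endpoints cost less than NAT processing."""
    if nat_per_gb <= endpoint_per_gb:
        raise ValueError("endpoints never break even at these rates")
    fixed = endpoint_count * az_count * endpoint_hourly * HOURS_PER_MONTH
    return fixed / (nat_per_gb - endpoint_per_gb)


if __name__ == "__main__":
    print(round(break_even_gb(endpoint_count=1, az_count=3), 1), "GB/month")
```

Gateway endpoints (S3, DynamoDB) carry no hourly charge, so this break-even only matters for interface endpoints; low-traffic DEV environments may not reach it for every service.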

Recommendation by environment

PROD (must be boring, resilient)

Recommended

Zonal NAT per AZ (Approach A) OR
Regional NAT (Approach D) if you want simpler ops and are comfortable adopting the newer model after validation.

Avoid

Single zonal NAT shared across AZs (Approach B).

Why

Production multi-AZ workloads should not have an AZ-level egress SPOF. (AWS Documentation)

UAT (closest to PROD)

Recommended

Match PROD architecture (same AZ count and NAT strategy).
If PROD uses Approach A, UAT uses A. If PROD uses D, UAT uses D.

Why

UAT is where you want to catch “prod-like” issues early (networking, scaling, AZ placement behaviors).

STAGE (depends on what “stage” means in your org)

If STAGE is truly pre-prod / release rehearsal

Treat it like PROD/UAT: Approach A or D.

If STAGE is mainly integration sandbox / shared testing

Options:
- Approach C (single-AZ) for cost, accepting parity drift.
- Approach B (3 AZ + 1 zonal NAT) only if traffic is low and you explicitly accept egress SPOF + some cross-AZ costs. (Repost)
- Approach D (regional NAT), which is attractive here because it preserves multi-AZ posture with simpler ops. (AWS Documentation)

DEV (optimize cost, accept downtime)

Recommended default

Single-AZ (Approach C) is usually the best cost/maintenance outcome.

Alternative

If you want to keep 3-AZ subnet topology for early detection of AZ-related issues:
- Prefer Regional NAT (Approach D) over “3 AZ + 1 zonal NAT”, because it avoids the “deliberate AZ SPOF” pattern while keeping ops simpler. (AWS Documentation)

Side-by-side summary (what to pick when)

If your priority is “AWS classic best practice + lowest risk”:

PROD/UAT/STAGE(pre-prod): A
DEV: C
Endpoints everywhere

If your priority is “keep parity but simplify NAT management”:

PROD/UAT/STAGE: D
DEV: C (or D if you insist on multi-AZ dev)
Endpoints everywhere

If your priority is “maximum cost cutting in non-prod”:

PROD/UAT: A (or D after adoption)
STAGE: C (single-AZ)
DEV: C
Endpoints still recommended
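
The three priority profiles above can be captured as a simple lookup, which is sometimes handy for documenting the choice in tooling. This is a minimal Python sketch; the profile keys are hypothetical names invented for this example, and STAGE under the first profile assumes Stage is pre-prod.

```python
# Approach letters (A, C, D) refer to the approaches described in this guide.
RECOMMENDATIONS = {
    "classic_lowest_risk": {"PROD": "A", "UAT": "A", "STAGE": "A", "DEV": "C"},
    "parity_simpler_nat": {"PROD": "D", "UAT": "D", "STAGE": "D", "DEV": "C"},
    "max_nonprod_savings": {"PROD": "A", "UAT": "A", "STAGE": "C", "DEV": "C"},
}


def recommend(priority: str, env: str) -> str:
    """Return the approach letter for a priority profile and environment."""
    try:
        return RECOMMENDATIONS[priority][env.upper()]
    except KeyError:
        raise ValueError(f"unknown priority/env: {priority!r}/{env!r}") from None


if __name__ == "__main__":
    for profile in RECOMMENDATIONS:
        print(profile, {env: recommend(profile, env) for env in ("PROD", "UAT", "STAGE", "DEV")})
```

VPC endpoints are recommended in every profile, so they are left out of the lookup.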

Terraform notes (practical maintainability)

Keeping modules maintainable long-run

Model these as variables, not separate templates:

az_count (1 or 3)
nat_strategy (per_az_zonal | single_zonal | regional)
enable_vpc_endpoints (true/false + list)
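
To show how these three inputs constrain each other, here is a language-neutral sketch in Python (not Terraform) of the same variable model with basic validation. The class name, field for the endpoint list, and the nat_resource_count helper are assumptions made for illustration; only az_count, nat_strategy, and enable_vpc_endpoints come from the list above.

```python
from dataclasses import dataclass
from typing import Tuple

VALID_NAT_STRATEGIES = ("per_az_zonal", "single_zonal", "regional")


@dataclass(frozen=True)
class EgressConfig:
    az_count: int
    nat_strategy: str
    enable_vpc_endpoints: bool = True
    vpc_endpoint_services: Tuple[str, ...] = ()  # hypothetical: the "list" part

    def __post_init__(self):
        if self.az_count not in (1, 3):
            raise ValueError("az_count must be 1 or 3")
        if self.nat_strategy not in VALID_NAT_STRATEGIES:
            raise ValueError(f"nat_strategy must be one of {VALID_NAT_STRATEGIES}")

    def nat_resource_count(self) -> int:
        """Number of NAT gateway resources the module would declare."""
        if self.nat_strategy == "per_az_zonal":
            return self.az_count
        return 1  # single zonal NAT, or one regional NAT object


if __name__ == "__main__":
    prod = EgressConfig(az_count=3, nat_strategy="per_az_zonal")
    dev = EgressConfig(az_count=1, nat_strategy="single_zonal")
    print(prod.nat_resource_count(), dev.nat_resource_count())
```

In Terraform the equivalent checks would usually live in variable validation blocks, with the NAT count derived in a local value.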

Regional NAT in Terraform

Terraform resource docs include availability_mode and availability_zone_address for regional NAT scenarios. (Terraform Registry)

Final recommendation

PROD & UAT: multi-AZ workloads must not depend on a single AZ for egress, so use NAT per AZ (zonal) or adopt Regional NAT after validation. (AWS Documentation)
STAGE: decide what Stage is:
- If “pre-prod”: match PROD/UAT.
- If “sandbox”: single-AZ is acceptable; otherwise consider Regional NAT to preserve multi-AZ without per-AZ NAT sprawl. (AWS Documentation)
DEV: single-AZ is usually best.
Across all envs: implement VPC endpoints to reduce NAT traffic and cost for EKS. (AWS Documentation)

Rajesh Kumar
