
Final comparison table (all approaches)
| Approach | Architecture (AZ / NAT) | Design | Cost profile | Parity to PROD | Terraform complexity | Risk | Best-practice fit | Pros | Cons | Typical fit (envs) |
|---|---|---|---|---|---|---|---|---|---|---|
| A. Zonal NAT per AZ (classic HA) | 3 AZ / 3 NAT (one per AZ) | Each private subnet routes 0.0.0.0/0 to the same-AZ NAT | Highest baseline NAT hourly; lowest cross-AZ egress cost | Excellent | Medium (per-AZ loops for NAT + routes) | Low | Strong | HA for outbound; avoids cross-AZ NAT traffic; “boring” operations | Pays for 3 NATs even when non-prod is quiet | PROD, UAT, Stage if pre-prod |
| B. Single zonal NAT shared (cost compromise) | 3 AZ / 1 NAT (in 1 AZ) | All private subnets (all AZs) route to a single NAT | Lower hourly NAT; may add cross-AZ data transfer for 2 AZs | Good topology parity, weaker egress parity | Low–Medium (simple NAT, but routing must be consistent) | Medium | Compromise | Keeps 3-AZ layout; saves NAT hourly cost | AZ containing NAT becomes egress SPOF; cross-AZ charges/latency possible | Stage/Dev only if outages acceptable & traffic low |
| C. Single-AZ lower envs | 1 AZ / 1 NAT | Reduce VPC + cluster footprint to one AZ | Lowest overall (fewer subnets/nodes/LBs + 1 NAT) | Poor (no multi-AZ testing) | Low–Medium (parameterized modules) | Medium–High (late discovery of AZ issues) | OK if explicitly accepted | Cheapest + simplest; reduces many moving parts | Multi-AZ behavior not tested until UAT/PROD | DEV (often), sometimes Stage (sandbox) |
| D. Regional NAT Gateway (newer, simplified HA) | 3 AZ / “1 NAT object” (regional mode auto-expands across AZs) | Set NAT availability to regional; AWS manages multi-AZ expansion/contraction | Can approach per-AZ cost depending on AZ expansion; simpler ops | Excellent | Low (fewer NAT resources/route concerns) | Low–Medium (newer feature adoption) | Strong (modern best practice option) | “HA by default” + simpler architecture; reduces need to manage per-AZ NAT placement | Newer operational patterns; Terraform changes (auto↔manual) can force NAT recreation | PROD/UAT/Stage, possibly Dev if you keep 3 AZ |
| E. NAT Instance (DIY) | 1–3 AZ / EC2-based NAT | Replace NAT GW with EC2 NAT instances + routes | Lowest if tiny, but adds ops overhead | Depends | High (patching/scaling/HA) | Medium–High | Not preferred for platform teams | Cheap for DEV; can stop when unused | You own uptime, scaling, patching, throughput limits | DEV only (if extreme cost pressure) |
| Add-on: VPC Endpoints (EKS cost lever) | Works with any of the above | Use S3 gateway endpoint + interface endpoints (ECR, CloudWatch Logs, STS/KMS/SSM, etc.) | Often materially reduces NAT data-processing ($/GB) charges | Improves consistency & reliability | Medium (endpoint set + SG/policies) | Low | Strong | Reduces NAT traffic, improves security (no internet path) | Endpoint hourly costs exist; needs policy/governance | All envs (especially PROD/UAT) |
Source-backed notes (why the rows look the way they do)
- AWS explicitly advises same-AZ placement or NAT per AZ to reduce cross-AZ data transfer charges. (AWS Documentation)
- AWS EKS best practices recommend VPC endpoints to reduce costs and avoid internet traversal for AWS services. (AWS Documentation)
- Regional NAT Gateway: AWS docs + “What’s New” describe regional mode and that you no longer need a public subnet to host a regional NAT. (AWS Documentation)
- Terraform support and lifecycle caveats for regional NAT (`availability_mode`, `availability_zone_address`, recreation behavior) are documented in the provider registry. (Terraform Registry)
Environment-by-environment recommendation table (EKS)
| Environment | Recommended AZ footprint | Recommended NAT strategy | Why (practical outcome) |
|---|---|---|---|
| PROD | 3 AZ | A (Zonal NAT per AZ) or D (Regional NAT) | Multi-AZ egress should not have an AZ SPOF; minimize cross-AZ NAT traffic; keep ops predictable. (AWS Documentation) |
| UAT | 3 AZ | Match PROD (A or D) | UAT’s job is to catch prod-like scaling/AZ behaviors early. |
| STAGE | Depends on purpose | If pre-prod: match PROD (A or D). If sandbox: C (1 AZ) or B (3 AZ + 1 NAT) | Pre-prod should validate multi-AZ. Sandbox can trade HA for cost. Cross-AZ NAT costs matter if you keep 3 AZ with 1 zonal NAT. (AWS Documentation) |
| DEV | 1 AZ (default) | C (1 AZ + 1 NAT); optionally endpoints | Cheapest while still functional. Keep 3 AZ only if you truly need early multi-AZ validation. (AWS Documentation) |
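
To let the table above drive automation, such as a pipeline check that flags an environment drifting from its intended footprint, one option is to encode it as data. The following is a minimal Python sketch. The `recommend` helper, the `stage_is_preprod` flag, and the dictionary layout are hypothetical names. The approach letters refer to the comparison table.

```python
# Approach letter -> AZ count it implies, per the comparison table (A/B/D keep 3 AZs, C uses 1).
AZ_COUNT = {"A": 3, "B": 3, "C": 1, "D": 3}

# Recommended approaches per environment, in order of preference (hypothetical encoding of the table).
RECOMMENDATIONS = {
    "PROD": ["A", "D"],
    "UAT": ["A", "D"],            # in practice UAT should copy whichever one PROD actually runs
    "STAGE_PREPROD": ["A", "D"],
    "STAGE_SANDBOX": ["C", "B"],  # B only if egress SPOF and cross-AZ charges are accepted
    "DEV": ["C"],
}


def recommend(env: str, stage_is_preprod: bool = True) -> list[tuple[str, int]]:
    """Return (approach, az_count) options for an environment, most preferred first."""
    key = env.upper()
    if key == "STAGE":
        key = "STAGE_PREPROD" if stage_is_preprod else "STAGE_SANDBOX"
    if key not in RECOMMENDATIONS:
        raise ValueError(f"unknown environment: {env}")
    return [(approach, AZ_COUNT[approach]) for approach in RECOMMENDATIONS[key]]


print(recommend("prod"))                           # [('A', 3), ('D', 3)]
print(recommend("stage", stage_is_preprod=False))  # [('C', 1), ('B', 3)]
```

Each recommendation is returned as a pair of approach and AZ count. This shows directly that the two sandbox-Stage options differ in footprint, not only in NAT layout.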
Zonal NAT vs Regional NAT (quick decision points)
| Topic | Zonal NAT (classic) | Regional NAT (new) |
|---|---|---|
| HA model | You build HA by deploying one NAT per AZ | “HA by default” via regional availability mode and automatic AZ expansion (AWS Documentation) |
| Ops model | More resources, more route-table patterns | Simpler NAT footprint; fewer NAT objects |
| Terraform | Straightforward per-AZ loops | Supported, but switching auto/manual modes can recreate resources (Terraform Registry) |
| Best fit | Teams wanting mature, widely-used patterns | Teams wanting simpler HA with less NAT sprawl |
The following concerns are addressed in this blog:
- NAT Gateway Strategy for Multi-Account EKS: HA, Cost, and Parity Trade-offs
- EKS Network Egress Design Guide: Zonal vs Regional NAT and Environment Sizing
- Standardizing VPC Egress Across PROD/UAT/STAGE/DEV for EKS
- Balancing Cost and Reliability: NAT Gateways, AZ Footprint, and VPC Endpoints in EKS
- Reference Architecture: Multi-AZ EKS with Optimized NAT and Private Connectivity
- Decision Matrix: NAT per AZ vs Single NAT vs Regional NAT for EKS Environments
- Reducing NAT Spend in EKS: Environment Tiers + VPC Endpoints Blueprint
- EKS Networking Parity Playbook: When to Match PROD and When to Cut Cost
Below is a consolidated recommendation guide for NAT strategy across PROD / UAT / STAGE / DEV for an EKS-based platform in separate AWS accounts, with 3 AZs, 3 public + 3 private subnets per VPC.
Executive conclusion
- If an environment runs workloads across multiple AZs, the clean “AWS classic” design is NAT Gateway per AZ (zonal NAT) with each private subnet routing to the NAT in the same AZ. This avoids AZ-level egress SPOF and avoids cross-AZ data transfer for NAT traffic. (AWS Documentation)
- “3 AZ + 1 zonal NAT GW” (single NAT in one AZ, shared by all AZs) is a cost compromise, not best practice: it introduces an AZ-level SPOF for egress and typically adds cross-AZ charges for the two “remote” AZs. (Repost)
- New option (late 2025): Regional NAT Gateway (“availability mode = regional”) gives you one NAT object that automatically expands across AZs based on workload presence, and simplifies route-table/placement concerns. It’s designed for “HA by default” with simpler ops. (AWS Documentation)
- For EKS specifically, the biggest NAT cost driver is often AWS service traffic (ECR image pulls, logs/metrics, STS/KMS, etc.). Use VPC endpoints aggressively to reduce NAT processing/data transfer where possible. (AWS Documentation)
Key AWS facts to anchor decisions
- Zonal NAT Gateway is created in a specific AZ and has redundancy within that AZ (not across AZs). (AWS Documentation)
- AWS explicitly recommends reducing NAT data transfer charges by keeping traffic in the same AZ (i.e., use NAT per AZ for multi-AZ workloads), because cross-AZ paths can add cost. (Repost)
- Regional NAT Gateway automatically expands/contracts across AZs based on workload presence. (AWS Documentation)
- Terraform supports regional NAT configuration via `availability_mode` and (optionally) `availability_zone_address` blocks. (Terraform Registry)
Approaches you should consider (with Pros/Cons)
Approach A — Multi-AZ + Zonal NAT GW per AZ (3 NATs for 3 AZs)
Design
- NAT GW in each public subnet (one per AZ).
- Each private subnet’s route table points `0.0.0.0/0` to the NAT in the same AZ.
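The per-AZ routing rule can be expressed as a small check that a module or CI step could run. The check builds the default route for each private subnet and confirms that none of them targets a NAT in another AZ. This is a minimal Python sketch. The AZ names and `nat-<az>` IDs are hypothetical placeholders, not real resource IDs.

```python
def per_az_routes(azs: list[str]) -> dict[str, tuple[str, str, str]]:
    """Approach A: private subnet AZ -> (destination, NAT ID, AZ hosting that NAT)."""
    # One NAT GW per AZ, each living in that AZ's public subnet.
    return {az: ("0.0.0.0/0", f"nat-{az}", az) for az in azs}


def cross_az_hops(routes: dict[str, tuple[str, str, str]]) -> list[str]:
    """Private-subnet AZs whose default route leaves the AZ to reach its NAT."""
    return [subnet_az for subnet_az, (_, _, nat_az) in routes.items() if nat_az != subnet_az]


routes = per_az_routes(["us-east-1a", "us-east-1b", "us-east-1c"])
print(routes["us-east-1b"])   # ('0.0.0.0/0', 'nat-us-east-1b', 'us-east-1b')
print(cross_az_hops(routes))  # []
```

Each route carries the NAT's AZ explicitly, so the same `cross_az_hops` check also works for any other routing layout you feed it.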
Pros
- Best HA for egress (no AZ-level NAT SPOF).
- Avoids cross-AZ NAT traffic/cost.
- Most aligned with long-standing AWS reference patterns.
Cons
- Highest baseline NAT hourly cost (3x NAT resources).
- More resources (though IaC makes it mostly variable-driven).
Terraform complexity
- Moderate: `for_each`/`count` per AZ for NAT + per-AZ route tables (or per-AZ associations).
Risk
- Low.
Best fit
- PROD and UAT (and Stage if it is “pre-prod”).
Best-practice alignment
- Strong. (AWS Documentation)
Approach B — Multi-AZ + Single zonal NAT GW shared (1 NAT for 3 AZs)
Design
- One NAT GW in a single public subnet (e.g., AZ-a).
- All private subnet route tables (AZ-a/b/c) point `0.0.0.0/0` to that one NAT.
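With one shared NAT, every private subnet outside the NAT's AZ reaches it across an AZ boundary. That traffic picks up inter-AZ data transfer charges on top of NAT processing. The following is a back-of-the-envelope Python sketch. The per-AZ traffic figures are made-up inputs. `inter_az_rate_per_gb` is a placeholder you would fill in from current AWS pricing for your region. The function names are hypothetical.

```python
def cross_az_gb(nat_gb_by_az: dict[str, float], nat_az: str) -> float:
    """GB exchanged with a single zonal NAT by private subnets in *other* AZs.

    nat_gb_by_az should count both directions (requests out and responses back),
    since both legs cross the AZ boundary on their way to or from the NAT.
    """
    return sum(gb for az, gb in nat_gb_by_az.items() if az != nat_az)


def monthly_cross_az_cost(nat_gb_by_az: dict[str, float], nat_az: str, inter_az_rate_per_gb: float) -> float:
    # The rate is an input, not a hard-coded price: take it from current AWS data transfer pricing.
    return cross_az_gb(nat_gb_by_az, nat_az) * inter_az_rate_per_gb


traffic = {"az-a": 400.0, "az-b": 350.0, "az-c": 250.0}  # hypothetical monthly GB per AZ
print(cross_az_gb(traffic, nat_az="az-a"))  # 600.0
```

In this made-up example, 600 of 1,000 GB (60%) takes the cross-AZ path. The share grows with how much traffic originates outside the NAT's AZ.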
Pros
- Lower NAT hourly cost vs 3 NATs.
- Keeps 3-AZ subnet topology (better parity than single-AZ environments).
Cons
- AZ-level SPOF for outbound: if the NAT’s AZ is impaired, all private subnets lose internet egress (even if other AZs are fine). (AWS Documentation)
- Likely cross-AZ NAT traffic costs for the other AZs; AWS guidance is to keep high-traffic resources in the same AZ as the NAT to reduce transfer charges. (Repost)
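The SPOF difference between Approaches A and B becomes clear if you ask, for each AZ that might be impaired, which *healthy* AZs lose egress as a side effect. This is a minimal Python sketch with hypothetical AZ labels and function names. It models only NAT placement, not the rest of the data path.

```python
AZS = ["az-a", "az-b", "az-c"]  # hypothetical AZ labels


def nat_az_for(subnet_az: str, strategy: str, shared_nat_az: str = "az-a") -> str:
    """AZ hosting the NAT that a private subnet in subnet_az routes 0.0.0.0/0 to."""
    if strategy == "per_az":   # Approach A: same-AZ NAT
        return subnet_az
    if strategy == "single":   # Approach B: every AZ uses the NAT in shared_nat_az
        return shared_nat_az
    raise ValueError(f"unknown strategy: {strategy}")


def healthy_azs_losing_egress(impaired_az: str, strategy: str) -> list[str]:
    """AZs that are themselves fine but lose internet egress because their NAT is in impaired_az."""
    return [az for az in AZS if az != impaired_az and nat_az_for(az, strategy) == impaired_az]


for impaired in AZS:
    print(impaired, healthy_azs_losing_egress(impaired, "per_az"), healthy_azs_losing_egress(impaired, "single"))
# az-a [] ['az-b', 'az-c']
# az-b [] []
# az-c [] []
```

With per-AZ NAT, an AZ event never spreads to other AZs' egress. With a shared NAT, impairment of the one AZ that hosts the NAT takes egress away from the two healthy AZs as well.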
Terraform complexity
- Low–moderate: simpler NAT resources, but route tables still need careful handling.
Risk
- Medium (acceptable only if the environment can tolerate losing outbound internet during AZ events).
Best fit
- STAGE/DEV only if you explicitly accept reduced egress HA and low traffic volume.
Best-practice alignment
- Compromise (not “best practice” for HA). (Repost)
Approach C — Single-AZ lower environments (1 AZ + 1 NAT)
Design
- Use only one AZ in the VPC for STAGE/DEV (or for DEV only).
Pros
- Lowest overall cost: fewer subnets, fewer nodes/LBs, smaller footprint.
- Simplest networking.
Cons
- Parity drift: you won’t test multi-AZ behavior (scheduling spread, AZ capacity edge cases, load balancer zonal behavior, failover patterns, etc.) until UAT/PROD.
Terraform complexity
- Often higher in practice if you maintain separate topology modules; can be low if your module is fully parameterized (`az_count = 1|3`), but you’ll still have more branching.
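Keeping Approach C inside the same module mostly comes down to deriving subnets from `az_count` instead of hard-coding three. This is a minimal Python sketch of that derivation, using the standard `ipaddress` module. The `/16` VPC CIDR, the `/20` subnet size, the fixed slot layout, and the `plan_subnets` name are all illustrative assumptions. They are not EKS sizing advice, since pod IP consumption often calls for larger private subnets. A Terraform module would typically do the same arithmetic with `cidrsubnet()`.

```python
import ipaddress
from itertools import islice

MAX_AZS = 3  # always reserve CIDR slots for 3 AZs, even when az_count == 1


def plan_subnets(vpc_cidr: str, azs: list[str], az_count: int, new_prefix: int = 20) -> dict[str, dict[str, str]]:
    """Public/private subnet CIDRs for the first az_count AZs, taken from fixed per-AZ slots."""
    if az_count not in (1, 3):
        raise ValueError("az_count must be 1 or 3")
    if len(azs) < az_count:
        raise ValueError("not enough AZs supplied")
    # Slots 0..2 are public, slots 3..5 are private; growing from 1 to 3 AZs never renumbers a subnet.
    slots = list(islice(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix), 2 * MAX_AZS))
    if len(slots) < 2 * MAX_AZS:
        raise ValueError("VPC CIDR is too small for 6 subnets of this size")
    chosen = azs[:az_count]
    return {
        "public": {az: str(slots[i]) for i, az in enumerate(chosen)},
        "private": {az: str(slots[MAX_AZS + i]) for i, az in enumerate(chosen)},
    }


azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(plan_subnets("10.0.0.0/16", azs, az_count=1))
# {'public': {'us-east-1a': '10.0.0.0/20'}, 'private': {'us-east-1a': '10.0.48.0/20'}}
```

The fixed-slot choice matters if a DEV environment later grows from 1 to 3 AZs. The existing subnet keeps its CIDR, so the change only adds resources instead of forcing replacements.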
Risk
- Medium–high for “late discovery” of multi-AZ issues.
Best fit
- DEV (usually OK), sometimes STAGE if stage is just a sandbox and not “pre-prod”.
Approach D — Regional NAT Gateway (one NAT object, multi-AZ behavior managed by AWS)
Design
- Create the NAT GW with `availability_mode = "regional"`.
- AWS expands/contracts it across AZs based on workload presence, which removes the need to operate “one NAT per AZ” yourself. (AWS Documentation)
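Routing is one place where the regional model's simplification shows up. Instead of one default-route target per AZ, every private subnet points at the same NAT object, so the default route no longer forces separate route tables per AZ. Below is a conceptual Python sketch comparing the two. The NAT IDs are placeholders. `active_nat_azs` is a deliberately simplified stand-in for the AWS-managed expansion the docs describe. It does not model how AWS actually decides placement.

```python
def default_route_targets(private_azs: list[str], strategy: str) -> dict[str, str]:
    """Private subnet AZ -> NAT ID its 0.0.0.0/0 route targets (IDs are placeholders)."""
    if strategy == "per_az_zonal":   # Approach A
        return {az: f"nat-zonal-{az}" for az in private_azs}
    if strategy == "regional":       # Approach D: one NAT object for the whole VPC
        return {az: "nat-regional" for az in private_azs}
    raise ValueError(f"unknown strategy: {strategy}")


def min_private_route_tables(targets: dict[str, str]) -> int:
    # As far as the default route is concerned, subnets with the same target can share a route table.
    return len(set(targets.values()))


def active_nat_azs(workloads_by_az: dict[str, int]) -> list[str]:
    # Toy stand-in for AWS-managed expansion/contraction: presence follows AZs that have workloads.
    return sorted(az for az, count in workloads_by_az.items() if count > 0)


azs = ["az-a", "az-b", "az-c"]
print(min_private_route_tables(default_route_targets(azs, "per_az_zonal")))  # 3
print(min_private_route_tables(default_route_targets(azs, "regional")))      # 1
print(active_nat_azs({"az-a": 4, "az-b": 0, "az-c": 2}))                     # ['az-a', 'az-c']
```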
Pros
- Simplifies NAT architecture and ongoing ops (fewer NAT resources to reason about).
- “HA by default” posture (AWS-managed multi-AZ expansion model). (AWS Documentation)
- Terraform support exists (`availability_mode`, optional `availability_zone_address`). (Terraform Registry)
Cons / watchouts
- This is newer, so you’ll want strong rollout discipline (testing + observability + runbooks).
- Depending on how you configure addresses, resource changes can trigger recreation (Terraform notes). (Terraform Registry)
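Some regional NAT changes can force replacement, so it is worth gating applies on the plan. Terraform's machine-readable plan (`terraform show -json <planfile>`) lists each change under `resource_changes` with a `change.actions` array. A replacement appears as both `delete` and `create`. Below is a minimal Python sketch that flags replaced `aws_nat_gateway` resources. The names `nat_gateway_replacements` and `check_plan_file` are hypothetical helpers.

```python
import json


def nat_gateway_replacements(plan: dict) -> list[str]:
    """Addresses of aws_nat_gateway resources the plan would destroy and re-create."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # Replacement shows up as ["delete", "create"] or, with create_before_destroy, ["create", "delete"].
        if rc.get("type") == "aws_nat_gateway" and "delete" in actions and "create" in actions:
            flagged.append(rc["address"])
    return flagged


def check_plan_file(path: str) -> int:
    """Return a non-zero exit code if any NAT gateway in the JSON plan would be replaced."""
    with open(path) as f:
        hits = nat_gateway_replacements(json.load(f))
    for address in hits:
        print(f"NAT gateway would be replaced: {address}")
    return 1 if hits else 0
```

A CI job could run `terraform plan -out=tfplan`, then `terraform show -json tfplan > plan.json`, then `sys.exit(check_plan_file("plan.json"))` before any apply. This forces a human decision whenever egress would be interrupted.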
Terraform complexity
- Lowest for NAT (often fewer resources, less per-AZ looping).
Risk
- Low–medium (mostly “newness” + change management).
Best fit
- Strong candidate for PROD/UAT/STAGE where you want simplicity without losing multi-AZ posture.
Approach E — NAT Instances (only for very low-value envs)
Not a best practice for long-run managed-platform operations:
- You trade NAT GW cost for instance management (patching, scaling, HA, monitoring).
- Only consider for DEV if cost pressure is extreme.
EKS-specific cost and design guidance (applies to all approaches)
Reduce NAT dependency with VPC endpoints
EKS environments commonly send large volumes of traffic to AWS services (image pulls, logs, API calls); use VPC endpoints to keep those flows off the NAT and reduce NAT charges. (AWS Documentation)
Common high-value endpoints in EKS VPCs:
- Gateway endpoints: S3 (and DynamoDB if used)
- Interface endpoints (often worth it): ECR (api + dkr), CloudWatch Logs, STS, KMS, SSM, EC2, ELB (as relevant)
This is often a bigger win than “1 NAT vs 3 NAT” because it shrinks the volume of traffic billed at NAT per-GB processing rates and keeps that traffic from crossing AZs to reach a NAT.
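Two small helpers make the endpoint advice concrete. The first builds endpoint service names for the list above. It uses the `com.amazonaws.<region>.<service>` naming of commercial AWS regions; other partitions differ. The second is break-even arithmetic. An interface endpoint carries an hourly charge per AZ plus its own per-GB processing charge, so it pays off only once enough traffic moves off the NAT. All rates are inputs you supply from current AWS pricing. The 730 hours/month default is an approximation. The function names are hypothetical.

```python
GATEWAY = ["s3"]  # S3 gateway endpoint; ECR stores image layers in S3, so image pulls depend on it too
INTERFACE = ["ecr.api", "ecr.dkr", "logs", "sts", "kms", "ssm", "ec2", "elasticloadbalancing"]


def endpoint_service_names(region: str, use_dynamodb: bool = False) -> dict[str, list[str]]:
    """Service names grouped by endpoint type (trim INTERFACE to what your workloads actually call)."""
    gateway = GATEWAY + (["dynamodb"] if use_dynamodb else [])
    return {
        "Gateway": [f"com.amazonaws.{region}.{svc}" for svc in gateway],
        "Interface": [f"com.amazonaws.{region}.{svc}" for svc in INTERFACE],
    }


def interface_endpoint_net_savings(offloaded_gb: float, nat_per_gb: float, endpoint_per_gb: float,
                                   endpoint_hourly_per_az: float, az_count: int, hours: float = 730.0) -> float:
    """Monthly NAT processing avoided minus what one interface endpoint (deployed in az_count AZs) costs."""
    nat_cost_avoided = offloaded_gb * nat_per_gb
    endpoint_cost = offloaded_gb * endpoint_per_gb + endpoint_hourly_per_az * az_count * hours
    return nat_cost_avoided - endpoint_cost


def break_even_gb(nat_per_gb: float, endpoint_per_gb: float, endpoint_hourly_per_az: float,
                  az_count: int, hours: float = 730.0) -> float:
    """Monthly GB above which the endpoint is cheaper than sending the same traffic through NAT."""
    if nat_per_gb <= endpoint_per_gb:
        raise ValueError("endpoint never breaks even if its per-GB rate is not below the NAT rate")
    return endpoint_hourly_per_az * az_count * hours / (nat_per_gb - endpoint_per_gb)


print(endpoint_service_names("eu-west-1")["Gateway"])  # ['com.amazonaws.eu-west-1.s3']
```

Gateway endpoints (S3, DynamoDB) do not carry the hourly charge, so the break-even question mainly applies to the interface endpoints.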
Recommendation by environment
PROD (must be boring, resilient)
Recommended
- Zonal NAT per AZ (Approach A) OR
- Regional NAT (Approach D) if you want simpler ops and are comfortable adopting the newer model after validation.
Avoid
- Single zonal NAT shared across AZs (Approach B).
Why
- Production multi-AZ workloads should not have an AZ-level egress SPOF. (AWS Documentation)
UAT (closest to PROD)
Recommended
- Match PROD architecture (same AZ count and NAT strategy).
- If PROD uses Approach A, UAT uses A. If PROD uses D, UAT uses D.
Why
- UAT is where you want to catch “prod-like” issues early (networking, scaling, AZ placement behaviors).
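“Match PROD” is easy to state and easy to drift from over time. If each environment's network settings are stored as plain values, such as the module variables listed in the Terraform notes, a CI job can diff UAT against PROD. A minimal Python sketch; the dict shape and the `parity_drift` name are hypothetical conventions.

```python
PARITY_KEYS = ("az_count", "nat_strategy", "enable_vpc_endpoints")


def parity_drift(prod: dict, other: dict, keys: tuple[str, ...] = PARITY_KEYS) -> dict[str, tuple]:
    """Return {setting: (prod_value, other_value)} for every parity setting that differs."""
    return {k: (prod.get(k), other.get(k)) for k in keys if prod.get(k) != other.get(k)}


prod = {"az_count": 3, "nat_strategy": "per_az_zonal", "enable_vpc_endpoints": True}
uat = {"az_count": 3, "nat_strategy": "regional", "enable_vpc_endpoints": True}
print(parity_drift(prod, uat))  # {'nat_strategy': ('per_az_zonal', 'regional')}
```

An empty result means UAT matches PROD on the settings that matter for egress behavior. Any entry is either an intentional, documented exception or drift to fix.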
STAGE (depends on what “stage” means in your org)
If STAGE is truly pre-prod / release rehearsal
- Treat it like PROD/UAT: Approach A or D.
If STAGE is mainly integration sandbox / shared testing
- Either:
  - Approach C (single-AZ) for cost, accepting parity drift; or
  - Approach B (3 AZ + 1 zonal NAT) only if traffic is low and you explicitly accept egress SPOF + some cross-AZ costs. (Repost)
- Approach D (regional NAT) is also attractive here because it preserves multi-AZ posture with simpler ops. (AWS Documentation)
DEV (optimize cost, accept downtime)
Recommended default
- Single-AZ (Approach C) is usually the best cost/maintenance outcome.
Alternative
- If you want to keep 3-AZ subnet topology for early detection of AZ-related issues:
  - Prefer Regional NAT (Approach D) over “3 AZ + 1 zonal NAT”, because it avoids the “deliberate AZ SPOF” pattern while keeping ops simpler. (AWS Documentation)
Side-by-side summary (what to pick when)
If your priority is “AWS classic best practice + lowest risk”:
- PROD/UAT/STAGE(pre-prod): A
- DEV: C
- Endpoints everywhere
If your priority is “keep parity but simplify NAT management”:
- PROD/UAT/STAGE: D
- DEV: C (or D if you insist on multi-AZ dev)
- Endpoints everywhere
If your priority is “maximum cost cutting in non-prod”:
- PROD/UAT: A (or D after adoption)
- STAGE: C (single-AZ)
- DEV: C
- Endpoints still recommended
Terraform notes (practical maintainability)
Keeping modules maintainable long-run
Model these as variables, not separate templates:
- `az_count` (1 or 3)
- `nat_strategy` (one of `per_az_zonal`, `single_zonal`, `regional`)
- `enable_vpc_endpoints` (true/false + list)
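The following sketch shows the derivation those three variables could drive. It is written in Python for illustration. A Terraform module would express the same rules with variable `validation` blocks and conditional `count`/`for_each`. The strategy names come from the variable list. The `nat_plan` function and its returned keys are hypothetical.

```python
VALID_AZ_COUNTS = (1, 3)
VALID_STRATEGIES = ("per_az_zonal", "single_zonal", "regional")


def nat_plan(az_count: int, nat_strategy: str) -> dict:
    """Validate inputs and derive how many NAT gateways of each kind the module would create."""
    if az_count not in VALID_AZ_COUNTS:
        raise ValueError(f"az_count must be one of {VALID_AZ_COUNTS}")
    if nat_strategy not in VALID_STRATEGIES:
        raise ValueError(f"nat_strategy must be one of {VALID_STRATEGIES}")

    warnings = []
    if nat_strategy == "single_zonal" and az_count > 1:
        # Approach B: surface the deliberate trade-off instead of silently allowing it.
        warnings.append("shared zonal NAT: egress SPOF and cross-AZ charges for other AZs")

    if nat_strategy == "per_az_zonal":
        counts = {"zonal": az_count, "regional": 0}
    elif nat_strategy == "single_zonal":
        counts = {"zonal": 1, "regional": 0}
    else:
        counts = {"zonal": 0, "regional": 1}
    return {"nat_gateways": counts, "warnings": warnings}


print(nat_plan(3, "per_az_zonal"))
# {'nat_gateways': {'zonal': 3, 'regional': 0}, 'warnings': []}
```

With `az_count = 1`, `per_az_zonal` and `single_zonal` produce the same single NAT, which is why the warning only fires for the multi-AZ case.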
Regional NAT in Terraform
Terraform resource docs include availability_mode and availability_zone_address for regional NAT scenarios. (Terraform Registry)
Final recommendation
- PROD & UAT: Multi-AZ must not have an egress AZ SPOF → NAT per AZ (zonal) or adopt Regional NAT after validation. (AWS Documentation)
- STAGE: decide what Stage is:
  - If “pre-prod”: match PROD/UAT.
  - If “sandbox”: single-AZ is acceptable; otherwise consider Regional NAT to preserve multi-AZ without per-AZ NAT sprawl. (AWS Documentation)
- DEV: single-AZ is usually best.
- Across all envs: implement VPC endpoints to reduce NAT traffic and cost for EKS. (AWS Documentation)
I’m a DevOps/SRE/DevSecOps/Cloud Expert passionate about sharing knowledge and experiences. I have worked at Cotocus. I share tech blogs at DevOps School, travel stories at Holiday Landmark, stock market tips at Stocks Mantra, health and fitness guidance at My Medic Plus, product reviews at TrueReviewNow, and SEO strategies at Wizbrand.
Please find my social handles as below;
Rajesh Kumar Personal Website
Rajesh Kumar at YOUTUBE
Rajesh Kumar at INSTAGRAM
Rajesh Kumar at X
Rajesh Kumar at FACEBOOK
Rajesh Kumar at LINKEDIN
Rajesh Kumar at WIZBRAND