
This page is generated from skills/eks-best-practices/references/networking.md. Edit the source, not this page.

EKS Networking — VPC CNI & IP Management

Part of: eks-best-practices

Purpose: VPC CNI configuration, subnet/CIDR planning, IPv4 vs IPv6, custom networking, Security Groups for Pods, and IP address management

For ingress, load balancing, DNS, and private clusters, see: Networking — Ingress & DNS

For network policies and east/west traffic control, see: Security — Runtime & Network


Table of Contents

  1. VPC CNI Deep Dive
  2. VPC CNI Operations
  3. Subnet Planning
  4. IPv4 vs IPv6
  5. Security Groups for Pods
  6. Multus CNI

VPC CNI Deep Dive

The Amazon VPC CNI assigns pods real VPC IP addresses, enabling native VPC networking (security groups, NACLs, flow logs all work). It has two components:

  • CNI binary — invoked by kubelet on pod add/remove, wires up pod networking
  • ipamd (aws-node DaemonSet) — long-running IPAM daemon that manages ENIs and maintains a warm pool of IPs or prefixes

Mode Decision Matrix

| Mode | IP Usage | Pod Density | Best For |
|---|---|---|---|
| Secondary IP (default) | 1 IP per pod from subnet | Limited by ENI × IPs per ENI | Most workloads, simplest setup |
| Prefix Delegation | /28 prefix per ENI slot | ~4-16× more pods per node | High pod density, IP-constrained VPCs |
| Custom Networking | Pods use different subnet/CIDR | Same as mode used | Separate pod CIDR from node CIDR |

Secondary IP Mode (Default)

Each pod receives one secondary private IP from an ENI attached to the node. The warm pool pre-allocates IPs for fast pod startup.

Max pods per node = (Number of ENIs × (IPs per ENI - 1)) + 2

ENI counts and IPs-per-ENI vary by instance type and change over time — use the live max-pods-calculator.sh script as the source of truth rather than relying on a static table:

./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0
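
As a rough cross-check of the calculator's output, the secondary IP arithmetic can be sketched in Python. This is a minimal sketch, not the script's actual code: the function name is hypothetical, and the m5.large figures (3 ENIs, 10 IPv4 addresses per ENI) are assumed values that should be confirmed against the live script.

```python
def max_pods_secondary_ip(enis: int, ips_per_eni: int) -> int:
    """Estimate max pods per node in secondary IP mode.

    The first address on each ENI is the ENI's own primary IP and is not
    handed to pods. The +2 covers pods that use host networking
    (aws-node and kube-proxy) and so need no pod IP.
    """
    return enis * (ips_per_eni - 1) + 2


# m5.large (assumed limits: 3 ENIs, 10 IPv4 addresses per ENI)
print(max_pods_secondary_ip(3, 10))  # 29
```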

IP cooldown: When a pod is deleted, its IP enters a 30-second cooldown cache before returning to the warm pool. This prevents premature IP recycling while kube-proxy updates iptables rules on all nodes.

Prefix Delegation Mode

Instead of assigning individual IPs, the CNI assigns /28 prefixes (16 IPs each) to ENI slots — dramatically increasing pod density without additional ENIs.

Enable prefix delegation:

kubectl set env daemonset aws-node \
-n kube-system \
ENABLE_PREFIX_DELEGATION=true

Max pods with prefix mode = (ENIs × (IPs per ENI - 1) × 16) + 2

# Calculate for prefix mode
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0 \
--cni-prefix-delegation-enabled
# Result: 110 (vs 29 in secondary IP mode)
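
The gap between the raw formula and the calculator's result of 110 comes from a recommended ceiling on pods per node. The sketch below illustrates that, assuming the ceiling is 110 for instance types with fewer than 30 vCPUs and 250 otherwise; the exact threshold handling inside the script and the function name are assumptions.

```python
def max_pods_prefix_mode(enis: int, ips_per_eni: int, vcpus: int) -> int:
    """Estimate max pods per node with prefix delegation enabled."""
    # Each non-primary slot on an ENI holds a /28 prefix (16 addresses).
    raw = enis * (ips_per_eni - 1) * 16 + 2
    # Assumed recommended ceiling: 110 below 30 vCPUs, 250 otherwise.
    ceiling = 110 if vcpus < 30 else 250
    return min(raw, ceiling)


# m5.large (assumed: 3 ENIs, 10 addresses per ENI, 2 vCPUs)
# Raw formula gives 434, which the ceiling reduces to 110.
print(max_pods_prefix_mode(3, 10, vcpus=2))  # 110
```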

You must update the max-pods setting on nodes when enabling prefix mode — the default reflects secondary IP mode limits:

# In managed node group launch template user data:
--use-max-pods false --kubelet-extra-args '--max-pods=110'

Prefix allocation is faster than ENI attachment. Attaching a prefix to an existing ENI completes in under a second vs ~10 seconds for a new ENI. In most cases, the CNI only needs a single ENI per node in prefix mode.

✅ DO:

  • Use prefix delegation when running >30 pods per node
  • Set WARM_PREFIX_TARGET=1 (default) — good balance of fast startup and IP efficiency
  • Use WARM_IP_TARGET (set <16) if you need tighter IP conservation per node
  • Use VPC Subnet CIDR reservations to ensure contiguous /28 blocks are available
  • Use similar instance types in the same node group — the lowest max-pods value applies to all nodes in the group
  • Update max-pods on nodes when enabling prefix mode

❌ DON'T:

  • Enable prefix delegation on fragmented subnets without reservations — prefix attachment will fail with InsufficientCidrBlocks
  • Mix secondary IP and prefix delegation modes in the same cluster
  • Downgrade VPC CNI below v1.9.0 after enabling prefix mode — you must delete and recreate nodes if you downgrade
  • Rely on rolling replacement of existing nodes when switching to prefix mode. Instead, create new node groups, cordon and drain the old ones, then delete them

Custom Networking

Assigns pod IPs from a different CIDR than node IPs, typically using a secondary VPC CIDR from CG-NAT space (100.64.0.0/10):

# Enable custom networking
kubectl set env daemonset aws-node -n kube-system \
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

# Create ENIConfig per AZ
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a
spec:
  subnet: subnet-0123456789abcdef0  # Pod subnet in us-east-1a
  securityGroups:
    - sg-0123456789abcdef0

Automate AZ-based ENIConfig selection — name ENIConfigs after AZs and set:

kubectl set env daemonset aws-node -n kube-system \
ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

Kubernetes labels nodes with topology.kubernetes.io/zone automatically, so the CNI picks the right ENIConfig per AZ without manual node labeling.

Max pods with custom networking is lower because the primary ENI is not used for pod IPs:

# Without prefix: (3 ENIs - 1) * (10 IPs/ENI - 1) + 2 = 20 (m5.large)
# With prefix: (3 ENIs - 1) * ((10 - 1) * 16) + 2 = 290 (m5.large)
# Recommended: 110 (CPU/memory typically exhausted before IPs)
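
The arithmetic in the comments above can be reproduced with a short Python sketch. The function name is hypothetical and the m5.large limits (3 ENIs, 10 addresses per ENI) are taken from the comments; the only difference from the standard formulas is that one ENI is subtracted because the primary ENI carries no pod IPs.

```python
def max_pods_custom_networking(enis: int, ips_per_eni: int,
                               prefix_delegation: bool = False) -> int:
    """Estimate max pods when custom networking is enabled."""
    # With prefix delegation each usable slot holds a /28 (16 addresses).
    per_eni = (ips_per_eni - 1) * (16 if prefix_delegation else 1)
    # The primary ENI is excluded from pod IP allocation.
    return (enis - 1) * per_eni + 2


print(max_pods_custom_networking(3, 10))                          # 20
print(max_pods_custom_networking(3, 10, prefix_delegation=True))  # 290
```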

Use custom networking when:

  • Node subnet CIDR is exhausted but you have other available CIDRs
  • Pod traffic must appear from a different CIDR (for firewall rules)
  • Deploying multiple EKS clusters connecting to on-premise datacenters (CG-NAT space avoids RFC1918 conflicts)

Avoid custom networking when:

  • Ready for IPv6 — IPv6 eliminates IP exhaustion without the operational overhead
  • CG-NAT space already in use — consider an alternate CNI or IPv6
  • Overlapping CIDRs — custom networking alone can't solve this; use a private NAT gateway with transit gateway instead

VPC CNI Configuration Tuning

| Setting | Default | Recommended | Purpose |
|---|---|---|---|
| WARM_IP_TARGET | N/A | 2-5 | Pre-allocated IPs for fast pod startup |
| MINIMUM_IP_TARGET | N/A | 5-10 | Minimum IPs to keep available |
| WARM_ENI_TARGET | 1 | 0 (if using IP targets) | Pre-allocated ENIs |
| WARM_PREFIX_TARGET | 1 | 1 | Pre-allocated prefixes (prefix mode) |
| POD_SECURITY_GROUP_ENFORCING_MODE | strict | standard (for NLB/NodeLocal DNS) | SG for Pods traffic mode |

WARM_IP_TARGET and MINIMUM_IP_TARGET override WARM_PREFIX_TARGET when set. Use WARM_IP_TARGET for fine-grained control; use WARM_PREFIX_TARGET for simplicity.
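
To get a feel for how the two IP targets interact, the following Python sketch models the number of addresses a node would hold for a given pod count. It is a simplified model under stated assumptions, not ipamd's actual logic: it assumes the node keeps `WARM_IP_TARGET` free addresses on top of those in use, never drops below `MINIMUM_IP_TARGET`, and in prefix mode rounds up to whole /28 prefixes. The function name is hypothetical.

```python
def target_allocated_ips(assigned_pods: int, warm_ip_target: int,
                         minimum_ip_target: int = 0,
                         prefix_mode: bool = False) -> int:
    """Approximate how many addresses a node holds under the IP targets."""
    wanted = max(assigned_pods + warm_ip_target, minimum_ip_target)
    if prefix_mode:
        # Prefixes are allocated 16 addresses at a time; round up.
        wanted = -(-wanted // 16) * 16
    return wanted


print(target_allocated_ips(3, warm_ip_target=5, minimum_ip_target=10))   # 10
print(target_allocated_ips(20, warm_ip_target=5, minimum_ip_target=10))  # 25
print(target_allocated_ips(20, 5, 10, prefix_mode=True))                 # 32
```

Under this model, a small `WARM_IP_TARGET` keeps idle addresses low, while `MINIMUM_IP_TARGET` sets the floor for nodes expected to fill up quickly.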

Warm pool trade-off: Warm ENIs still consume IPs from your subnet CIDR. In secondary IP mode on a 3-node cluster with WARM_ENI_TARGET=1, the CNI can consume 43+ IPs just for warm pools before any application pods are scheduled.


VPC CNI Operations

Use Managed Add-On

Deploy VPC CNI as an EKS managed add-on rather than self-managed. Managed add-ons provide:

  • Validated compatibility with your EKS version
  • Automatic drift prevention — EKS reconciles managed fields every 15 minutes
  • Simpler upgrades via EKS API/Console/CLI

Frequently-used fields like WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET are not managed and won't be overwritten by drift prevention.

EKS Auto Mode

With EKS Auto Mode, AWS fully manages VPC CNI configuration. You don't install or upgrade networking add-ons. Use Auto Mode when you want AWS to handle CNI operations entirely.

Use Separate IAM Role for CNI

By default, VPC CNI inherits the node IAM role. This gives the CNI (and potentially compromised pods on the node) access to all permissions on the node role.

Strongly recommended: Create a dedicated IAM role for the CNI with only AmazonEKS_CNI_Policy attached. Use IRSA or Pod Identity to bind it to the aws-node service account:

# Specify CNI role when creating managed add-on
aws eks create-addon --cluster-name my-cluster \
--addon-name vpc-cni \
--service-account-role-arn arn:aws:iam::123456789012:role/eks-cni-role

Then remove AmazonEKS_CNI_Policy from the node role. For IPv6 clusters, create a custom IAM policy — the managed AmazonEKS_CNI_Policy only covers IPv4.

Backup CNI Settings Before Update

VPC CNI runs on the data plane, so EKS does not auto-upgrade it. Before updating:

# Backup current settings
kubectl get daemonset aws-node -n kube-system -o yaml > aws-k8s-cni-backup.yaml

Upgrade one minor version at a time (e.g., 1.9 → 1.10 → 1.11). Never delete the DaemonSet during upgrade — that causes application downtime.
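
The one-minor-version-at-a-time rule can be turned into an explicit list of hops before starting. This is a small illustrative sketch with a hypothetical helper name; it only handles upgrades within a single major version and does not check which versions are actually published for your EKS release.

```python
from typing import List


def minor_upgrade_path(current: str, target: str) -> List[str]:
    """List the minor versions to step through, one hop at a time."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current version")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(minor_upgrade_path("1.9", "1.11"))  # ['1.10', '1.11']
```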

Handle Liveness/Readiness Probe Failures

On data-intensive clusters, high CPU usage can cause aws-node health probe failures, leaving pods stuck in ContainerCreating. Increase the probe timeout:

# Default timeoutSeconds: 10; increase if experiencing probe failures
livenessProbe:
  timeoutSeconds: 30
readinessProbe:
  timeoutSeconds: 30

Also ensure the CPU resource request for aws-node is appropriate (the default of 25m may be too low under heavy load).

IPTables Forward Policy (Custom AMIs)

If using custom AMIs (not EKS Optimized), ensure the iptables forward policy is set to ACCEPT in kubelet.service. Many systems default to DROP, which breaks pod networking.


Subnet Planning

EKS Cluster Architecture

An EKS cluster spans two VPCs:

  • AWS-managed VPC — hosts the Kubernetes control plane (not visible in your account)
  • Customer-managed VPC — hosts nodes, pods, load balancers, and other infrastructure

Nodes connect to the control plane through cross-account ENIs (X-ENIs) that EKS places in your cluster subnets. EKS creates up to 4 X-ENIs across the subnets you specify at cluster creation.

Control Plane Endpoint Access

| Mode | Node → API Server Path | External Access | Use When |
|---|---|---|---|
| Public only (default) | Leaves VPC via NAT/IGW → public endpoint | Yes | Dev/test, simplest setup |
| Public + Private | Stays in VPC via X-ENIs | Yes | Production with external CI/CD access |
| Private only | Stays in VPC via X-ENIs | No (VPC/connected networks only) | High-security, regulated environments |

With public-only, nodes need a public IP or NAT gateway to reach the API server. With private enabled, traffic stays within the VPC via X-ENIs — lower latency and no internet dependency.

VPC CIDR: 10.0.0.0/16 (65,536 IPs)
├── Cluster Subnets (X-ENIs only, NOT for nodes)
│   ├── 10.0.0.0/28 (16 IPs) - us-east-1a
│   ├── 10.0.0.16/28 (16 IPs) - us-east-1b
│   └── 10.0.0.32/28 (16 IPs) - us-east-1c
├── Public Subnets (load balancers, NAT gateways)
│   ├── 10.0.16.0/20 (4,096 IPs) - us-east-1a
│   ├── 10.0.32.0/20 (4,096 IPs) - us-east-1b
│   └── 10.0.48.0/20 (4,096 IPs) - us-east-1c
├── Private Subnets (nodes + pods)
│   ├── 10.0.64.0/18 (16,384 IPs) - us-east-1a
│   ├── 10.0.128.0/18 (16,384 IPs) - us-east-1b
│   └── 10.0.192.0/18 (16,384 IPs) - us-east-1c
└── (Optional) Pod-only Subnets (with custom networking)
    └── 100.64.0.0/16 (secondary CIDR, 65,536 IPs)

Dedicated cluster subnets (/28) prevent X-ENI IP consumption from competing with node/pod IPs. During cluster upgrades, EKS provisions additional ENIs in cluster subnets — if nodes share these subnets, IP contention can block upgrades.
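
A subnet layout like this is easy to get subtly wrong (a block that does not start on its prefix boundary, or two blocks that overlap), so it can help to validate the plan before creating anything. The sketch below uses Python's standard `ipaddress` module; the function name and subnet labels are hypothetical, and the CIDRs are illustrative values for a 10.0.0.0/16 VPC.

```python
import ipaddress
from itertools import combinations


def check_subnet_plan(vpc_cidr, subnets):
    """Return a list of problems found in a {name: cidr} subnet plan."""
    vpc = ipaddress.ip_network(vpc_cidr)
    problems, parsed = [], {}
    for name, cidr in subnets.items():
        try:
            # Strict parsing rejects CIDRs with host bits set.
            net = ipaddress.ip_network(cidr)
        except ValueError as exc:
            problems.append(f"{name}: {exc}")
            continue
        if not net.subnet_of(vpc):
            problems.append(f"{name}: {net} is outside {vpc}")
        parsed[name] = net
    for (a, net_a), (b, net_b) in combinations(parsed.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    return problems


plan = {
    "cluster-a": "10.0.0.0/28", "cluster-b": "10.0.0.16/28", "cluster-c": "10.0.0.32/28",
    "public-a": "10.0.16.0/20", "public-b": "10.0.32.0/20", "public-c": "10.0.48.0/20",
    "private-a": "10.0.64.0/18", "private-b": "10.0.128.0/18", "private-c": "10.0.192.0/18",
}
print(check_subnet_plan("10.0.0.0/16", plan))                        # []
print(check_subnet_plan("10.0.0.0/16", {"public-a": "10.0.1.0/20"}))  # one problem reported
```

The second call shows the kind of mistake this catches: 10.0.1.0/20 is not aligned to a /20 boundary, so it is reported instead of silently accepted.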

Subnet Tagging Requirements

# Public subnets (for internet-facing ALB/NLB)
kubernetes.io/role/elb = 1

# Private subnets (for internal ALB/NLB)
kubernetes.io/role/internal-elb = 1

# All subnets used by EKS
kubernetes.io/cluster/<cluster-name> = shared # or "owned"
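
If subnets are tagged from a script or an infrastructure-as-code wrapper, the tag set above can be generated rather than typed by hand. This is an illustrative sketch only: the helper name and the cluster name `my-cluster` are hypothetical, and the resulting dictionary would still need to be passed to whatever tool applies the tags.

```python
def subnet_tags(cluster_name: str, public: bool, ownership: str = "shared") -> dict:
    """Build the EKS-related tags for one subnet."""
    if ownership not in ("shared", "owned"):
        raise ValueError("ownership must be 'shared' or 'owned'")
    # Public subnets host internet-facing load balancers, private ones internal.
    role_key = "kubernetes.io/role/elb" if public else "kubernetes.io/role/internal-elb"
    return {
        role_key: "1",
        f"kubernetes.io/cluster/{cluster_name}": ownership,
    }


print(subnet_tags("my-cluster", public=True))
print(subnet_tags("my-cluster", public=False, ownership="owned"))
```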

IP Exhaustion Strategies

| Strategy | Complexity | IP Gain | Trade-offs |
|---|---|---|---|
| Prefix delegation | Low | 4-16× | Requires contiguous /28 blocks; use subnet reservations |
| Secondary CIDR | Medium | Up to /16 | 100.64.0.0/10 (CG-NAT) recommended |
| Custom networking | Medium | Separate CIDR | More ENIConfig management; primary ENI unused for pods |
| IPv6 | High | Unlimited | Irreversible; dual-stack complexity |
| Private NAT gateway | Medium | N/A | Solves overlapping CIDRs; adds NAT GW cost |

IPv4 vs IPv6

Decision Guide

| Factor | IPv4 | IPv6 |
|---|---|---|
| IP availability | Limited; plan for exhaustion | Virtually unlimited (/80 prefix per node ≈ 10¹⁴ addresses) |
| Setup complexity | Standard | Requires dual-stack VPC with /56 CIDR |
| Reversibility | Can switch modes | Irreversible; IPv6 is for the cluster's lifetime |
| Instance requirement | Any | Nitro-based instances only |
| CNI mode | Secondary IP or prefix delegation | Prefix mode only (auto-enabled) |
| WARM_IP/ENI tuning | Required for optimization | Not needed; prefix assigned at bootstrap |
| AWS service support | Full | Most services (verify specific ones) |
| Network policy | Full support | Full support (VPC CNI 1.14+) |
| Load balancers | ALB/NLB full support | ALB/NLB dual-stack (requires LBC; in-tree controller doesn't support IPv6) |
| Recommendation | Default choice | Use when IPv4 exhaustion is a real concern |

IPv6 Technical Details

Services get ULA addresses: Kubernetes services receive IPv6 addresses from Unique Local Address (ULA) space, auto-assigned at cluster creation and not modifiable.

Private subnets use EIGW: In IPv6, every address is internet-routable. Private subnets use egress-only internet gateways (EIGW) — allows outbound traffic while blocking all inbound.

Private IPv6 addressing (since August 2024): You can now use private IPv6 addresses via VPC IPAM instead of public GUA addresses.

IPv4 Egress from IPv6 Pods

IPv6 pods can still reach IPv4 endpoints. The VPC CNI uses a host-local secondary plugin that assigns each pod a non-routable IPv4 address from 169.254.172.0/22 (node-unique, up to 1024 addresses). Outbound IPv4 traffic is SNATed to the node's primary IPv4 address.

Pod (169.254.172.x) → SNAT to Node Primary IPv4 → NAT Gateway → Internet

DNS64 warning: Disable DNS64 on subnets where IPv6 pods run. When DNS64 is enabled, DNS returns synthesized IPv6 addresses for IPv4-only endpoints, routing traffic through NAT64 via the NAT gateway instead of direct SNAT — causing unexpected NAT gateway costs.

IPv6 Operational Considerations

Max pods formula for IPv6:

(ENIs × (IPs per ENI - 1) × 16) + 2
# m5.large: (3 × 9 × 16) + 2 = 434

In practice, CPU and memory exhaust before IPs. Managed node groups calculate max pods automatically — don't override unless using self-managed nodes.

Fargate in IPv6 clusters: Fargate pods consume both an IPv4 and IPv6 address from the VPC. Size dual-stack subnets for growth — new Fargate pods cannot be scheduled if the subnet has no available IPv4 addresses, regardless of IPv6 availability.

Load balancers: Use the AWS Load Balancer Controller (not the in-tree controller) with dual-stack annotations:

annotations:
  alb.ingress.kubernetes.io/ip-address-type: dualstack
  alb.ingress.kubernetes.io/target-type: ip

Re-evaluate custom networking: If you enabled custom networking to solve IPv4 exhaustion, it's no longer necessary with IPv6. Remove the overhead unless you have a separate security requirement for it.


Security Groups for Pods

Security Groups for Pods assigns AWS security groups directly to individual pods via branch ENIs, rather than sharing the node's security group. This enables AWS-level network isolation per pod — useful for controlling pod access to AWS services like RDS, ElastiCache, and other VPC resources.

For network policy vs SG for Pods comparison, and policy enforcement details, see: Security — Runtime & Network

When to Use

  • Pods need direct access to AWS services (RDS, ElastiCache) and you want to reuse existing SG rules
  • You need AWS-native audit trail (VPC Flow Logs per pod)
  • Migrating from EC2 instances to EKS and preserving existing SG-based access controls

Enforcing Modes

| Mode | Behavior | Use When |
|---|---|---|
| strict (default) | Only branch ENI SG applies; SNAT disabled; all traffic leaves node via VPC | Complete pod-to-AWS isolation needed |
| standard | Both node SG and branch ENI SG apply | Using with Network Policy, NodeLocal DNSCache, or need externalTrafficPolicy: Local |

Strict mode impact: All pod traffic — even pod-to-pod on the same node — traverses the VPC network. This increases VPC traffic and breaks NodeLocal DNSCache.

Operational Requirements

Disable TCP early demux for liveness/readiness probes in strict mode:

kubectl edit daemonset aws-node -n kube-system
# Under initContainer, set:
# DISABLE_TCP_EARLY_DEMUX=true

Branch ENI capacity is additive to the existing secondary IP limit per instance type. An m5.large supports up to 9 branch ENIs in addition to the 27 secondary IPs on its standard ENIs. However, pods using SG for Pods still count toward max-pods, so consider increasing max-pods.

Tag a single SG with kubernetes.io/cluster/$name when multiple SGs are assigned to a pod. This allows the AWS Load Balancer Controller to find and update rules for routing traffic to the pod.

NAT for outbound: Source NAT is disabled for pods with security groups. Deploy these pods on private subnets with a NAT gateway and enable external SNAT:

kubectl set env daemonset -n kube-system aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true

terminationGracePeriodSeconds must be non-zero (default 30s is fine). When set to zero, the CNI doesn't clean up the pod network, leaving branch ENIs unreclaimed.

Fargate: SG for Pods works on Fargate. Without a SecurityGroupPolicy, Fargate pods get the cluster security group. Include the cluster SG in your SecurityGroupPolicy for simplicity, otherwise add all minimum required rules manually:

# Find cluster security group
aws eks describe-cluster --name CLUSTER_NAME \
--query 'cluster.resourcesVpcConfig.clusterSecurityGroupId'

Limitations and caveats:

  • Windows nodes and non-Nitro instances are not supported
  • NodeLocal DNSCache does not work in strict mode
  • With custom networking, SG for Pods uses the security group from SecurityGroupPolicy, not the one from ENIConfig

Multus CNI

Multus enables multiple network interfaces on pods, required for workloads such as telco, DPDK, and SR-IOV applications.

How It Works

Multus acts as a meta-plugin that delegates to the primary CNI (VPC CNI) for the default interface and to additional CNI plugins for secondary interfaces. Pods define their network attachments through annotations referencing NetworkAttachmentDefinition CRDs.

Enabling Multus

WARNING: The thick-plugin variant has a known pod-lookup race condition that can break ALL pod creation cluster-wide. Only enable Multus after verifying your version includes the fix (v4.1.1+), or use thin-plugin mode instead.

Deploy Multus as a DaemonSet using the upstream manifests into kube-system:

multus:
  # WARNING: thick-plugin has a pod-lookup race that breaks ALL pod creation.
  # Only enable after verifying your Multus version includes the fix,
  # or use thin-plugin mode instead.
  enabled: false
  image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.0-thick

Multus is deployed via kubectl_manifest resources that apply the upstream thick-plugin DaemonSet manifests directly into kube-system (not via Helm). The only config keys that matter are enabled and image.

When it is safe to enable:

  • You are using thin-plugin mode (multus-cni:v4.1.0-thin or later) which avoids the race entirely
  • You have confirmed your thick-plugin version includes the pod-lookup race fix (v4.1.1+)
  • You have tested in a non-production cluster first and validated pod creation is not affected
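
The first two conditions can be checked mechanically against the configured image before flipping `enabled` to true. The sketch below is a hypothetical pre-flight helper, not part of Multus or this repository; it assumes image tags follow the `vX.Y.Z-thick` / `vX.Y.Z-thin` pattern used in this section and takes the v4.1.1 fix version from the warning above.

```python
import re

FIXED_THICK_VERSION = (4, 1, 1)  # first thick release with the race fix, per this guide


def multus_image_is_safe(image: str) -> bool:
    """Return True if the image is thin-plugin, or thick-plugin at/after the fix."""
    tag = image.rsplit(":", 1)[-1]  # e.g. "v4.1.0-thick"
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-(thick|thin)", tag)
    if match is None:
        raise ValueError(f"unrecognized Multus tag: {tag!r}")
    version = tuple(int(g) for g in match.group(1, 2, 3))
    if match.group(4) == "thin":
        return True  # thin-plugin mode avoids the race
    return version >= FIXED_THICK_VERSION


print(multus_image_is_safe("ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.0-thick"))  # False
print(multus_image_is_safe("ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.1-thick"))  # True
```

A check like this does not replace the third condition: testing in a non-production cluster is still needed.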

NetworkAttachmentDefinition Example

After Multus is installed, create NetworkAttachmentDefinitions for secondary interfaces:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
  namespace: my-app
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "10.10.0.0/16"
      }
    }
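
Because `spec.config` is a JSON document embedded as a string, a stray comma or quote only surfaces when a pod fails to attach. One way to avoid hand-written JSON is to generate the manifest; the sketch below does that with Python's standard `json` module. The function name is hypothetical and the values mirror the example above. The output is JSON, which `kubectl apply -f` also accepts.

```python
import json


def macvlan_nad(name: str, namespace: str, master: str, subnet: str) -> dict:
    """Build a NetworkAttachmentDefinition with a macvlan CNI config."""
    cni_config = {
        "cniVersion": "0.3.1",
        "type": "macvlan",
        "master": master,
        "mode": "bridge",
        "ipam": {"type": "host-local", "subnet": subnet},
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name, "namespace": namespace},
        # The CRD expects the CNI config as a JSON string, not a nested object.
        "spec": {"config": json.dumps(cni_config, indent=2)},
    }


manifest = macvlan_nad("macvlan-conf", "my-app", "eth1", "10.10.0.0/16")
print(json.dumps(manifest, indent=2))
```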

Reference it in a pod annotation:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-conf

Node Security Group for Multus

If Multus secondary interfaces need access to specific network resources, add additional node security group rules:

node_sg_additional_rules:
  multus_traffic:
    description: "Allow Multus secondary interface traffic"
    protocol: -1
    from_port: 0
    to_port: 0
    type: ingress
    cidr_blocks: ["10.10.0.0/16"]
