This page is generated from skills/eks-best-practices/SKILL.md. Edit the source, not this page.

EKS Best Practices

Comprehensive guidance for designing, deploying, and operating Amazon EKS clusters, consolidated from the AWS EKS Best Practices Guide, the AWS EKS HA/Resiliency Guide, and the terraform-aws-modules/terraform-aws-eks examples.

When to Use This Skill

Activate this skill when:

  • Designing a new EKS cluster architecture
  • Choosing between EKS compute options (Fargate, MNG, Karpenter, Auto Mode)
  • Configuring EKS networking (VPC CNI, ingress, service mesh)
  • Implementing EKS security (IAM, pod security, secrets)
  • Planning cluster upgrades or migrations
  • Reviewing EKS architecture decisions
  • Working with terraform-aws-modules/terraform-aws-eks examples
  • Optimizing EKS cost or scaling to large clusters

Don't use this skill for:

  • Generic Kubernetes concepts (Claude knows these)
  • Provider-specific API reference (link to AWS docs)
  • Non-EKS container orchestration (ECS, Lambda)
  • Step-by-step EKS upgrade execution — this skill covers upgrade strategy and architectural decisions, not the per-version procedures themselves.

EKS Architecture Decision Framework

When to Use EKS

| Requirement | EKS | ECS | Lambda |
| --- | --- | --- | --- |
| Kubernetes ecosystem | ✅ Native K8s | ❌ AWS-proprietary | ❌ |
| Portable across clouds | ✅ Standard K8s API | ❌ AWS-only | ❌ AWS-only |
| Long-running services | ✅ | ✅ | ⚠️ 15 min limit |
| Operational overhead | Medium | Low | Lowest |
| GPU/ML workloads | ✅ Best support | Limited | ❌ |
| Complex networking | ✅ Full control | Medium | Limited |
| Team has K8s expertise | Required | Not required | Not required |

EKS Deployment Models

| Model | Description | Operational Overhead | Use When |
| --- | --- | --- | --- |
| EKS Standard | Full control over nodes, add-ons, networking | Medium-High | Need full customization |
| EKS Auto Mode | AWS manages nodes, add-ons, scaling | Low | Want minimal ops, standard workloads |
| EKS with Fargate | Serverless pods, per-pod billing | Low | Batch, low-density workloads |
| EKS on Outposts | Run EKS on-premises | High | Data residency, low-latency edge |
| EKS Anywhere | EKS on your own infrastructure | Highest | Air-gapped, custom hardware |

Shared Responsibility

| Component | AWS Manages | You Manage |
| --- | --- | --- |
| Control plane | API server, etcd, HA, patching | RBAC, admission control, audit logging |
| Data plane (MNG) | AMI updates, node health | Instance type, scaling, pod scheduling |
| Data plane (Fargate) | Everything | Pod spec, resource requests |
| Data plane (Auto Mode) | Node lifecycle, OS patching | Workload definitions |
| Networking | ENI attachment, VPC CNI releases | Subnet design, IP planning, ingress |
| Security | Control plane auth | IAM, pod security, secrets, network policies |

Compute Selection Matrix

Decision Table

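| Factor | Fargate | MNG | Karpenter | Auto Mode | Self-Managed |
| --- | --- | --- | --- | --- | --- |
| Best for | Batch, small scale | Stable, predictable | Dynamic, varied | Minimal ops | Custom AMI/kernel |
| Scaling | Per-pod | ASG-based | Fast, flexible | AWS-managed | Manual ASG |
| Spot support | ❌ | ✅ | ✅ Native | ✅ | ✅ |
| GPU support | ❌ | ✅ | ✅ | ✅ | ✅ |
| DaemonSets | ❌ | ✅ | ✅ | ✅ | ✅ |
| Cost model | Per vCPU/GB/hr | Per EC2 instance | Per EC2 instance | Per EC2 instance | Per EC2 instance |
| Max pods/node | 1 | ENI-based | ENI-based | AWS-managed | ENI-based |
| Node SSH | ❌ | ✅ | ✅ | ❌ | ✅ |
| Operational overhead | Lowest | Low | Low | Lowest | Highest |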

Quick Decision Guide

  • Default choice: Karpenter — best balance of flexibility, cost, and automation
  • Zero ops priority: EKS Auto Mode — AWS manages nodes, add-ons, and scaling via managed Karpenter. Best for teams that want Kubernetes benefits without operational overhead around upgrades, autoscaling, load balancing, and storage
  • Serverless/batch: Fargate — no nodes to manage, per-pod billing
  • Predictable, stable: MNG — familiar ASG model, managed updates
  • Custom requirements: Self-managed — full control, highest overhead

✅ DO:

  • Use Karpenter as the default node autoscaler for new clusters
  • Run system components (CoreDNS, Karpenter) on MNG or Fargate
  • Use multiple instance types for availability and cost optimization

❌ DON'T:

  • Use self-managed nodes without a specific technical requirement
  • Run Fargate for GPU or DaemonSet-dependent workloads
  • Mix Karpenter and Cluster Autoscaler on the same node groups
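
As an illustration of the "multiple instance types" and Spot guidance above, here is a minimal sketch of a Karpenter NodePool. It assumes the `karpenter.sh/v1` API; the NodePool name `general-purpose`, the EC2NodeClass name `default`, and the CPU limit are hypothetical values chosen for the example, not settings taken from this skill.

```yaml
# Sketch only: assumes Karpenter's v1 API and an existing EC2NodeClass named "default".
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose          # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # hypothetical EC2NodeClass
      requirements:
        # Allow both Spot and On-Demand capacity.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Allow x86 and Graviton so more instance types qualify.
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Broad instance families improve availability and price.
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
  limits:
    cpu: "1000"                  # illustrative cap on total provisioned vCPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Keeping the requirements broad gives Karpenter more instance types to choose from; workloads that cannot run on arm64 or Spot would need their own node selectors or a separate NodePool.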

Networking Quick Reference

VPC CNI Mode Decision

| Mode | Use When | Pod Density |
| --- | --- | --- |
| Secondary IP (default) | Most workloads, simple setup | Limited by ENI × IPs per ENI |
| Prefix Delegation | >30 pods/node, IP-constrained VPC | 4-16× more pods per node |
| Custom Networking | Pods need different CIDR than nodes | Same as underlying mode |
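
Prefix delegation is switched on through environment variables on the VPC CNI's `aws-node` DaemonSet. The fragment below is a sketch of a patch to that DaemonSet, not a complete manifest; the `WARM_PREFIX_TARGET` value is an illustrative choice, and clusters that manage the VPC CNI as an EKS add-on would normally set the same variables through the add-on configuration instead.

```yaml
# Patch fragment for the aws-node DaemonSet in kube-system (sketch, not a full manifest).
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            # Assign /28 prefixes to ENI slots instead of individual secondary IPs.
            - name: ENABLE_PREFIX_DELEGATION
              value: "true"
            # Keep one spare prefix warm per node (illustrative value).
            - name: WARM_PREFIX_TARGET
              value: "1"
```

Nodes launched before the change keep their old max-pods setting, so the higher density generally applies only to nodes created afterwards.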

Ingress Pattern Selection

| Pattern | Best For | Key Feature |
| --- | --- | --- |
| ALB (via LBC) | HTTP/HTTPS web apps | Native WAF, Cognito auth |
| NLB (via LBC) | TCP/UDP, gRPC, low latency | Static IPs, source IP preservation |
| Gateway API | Multi-team, new deployments | ✅ Recommended standard |
| VPC Lattice | Cross-VPC service-to-service | No sidecar, IAM auth |
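
For the ALB row, a minimal Ingress handled by the AWS Load Balancer Controller might look like the sketch below. It assumes the controller is installed and an `alb` IngressClass exists; the `web` Service and `team-a` namespace are hypothetical.

```yaml
# Sketch: internet-facing ALB routing to pod IPs via the AWS Load Balancer Controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web                      # hypothetical
  namespace: team-a              # hypothetical
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    # "ip" targets register pod IPs directly rather than node ports.
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web        # hypothetical Service
                port:
                  number: 80
```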

IPv4 vs IPv6

| Factor | IPv4 | IPv6 |
| --- | --- | --- |
| Default choice | ✅ Yes | When facing IP exhaustion |
| AWS service support | Full | Most (check specific services) |
| Complexity | Standard | Requires dual-stack VPC |

For detailed networking guidance, see: Networking — VPC CNI & IP | Networking — Ingress & DNS

Security Essentials

IAM Strategy

| Approach | Use When | Setup |
| --- | --- | --- |
| Pod Identity | ✅ New workloads (EKS 1.24+) | EKS add-on + association |
| IRSA | Older clusters, Fargate | OIDC provider + trust policy |

Key rules:

  • ✅ Use Pod Identity for new workloads — simpler setup, session tags, role chaining
  • ✅ Use EKS access entries (API mode) over aws-auth ConfigMap
  • ✅ Move VPC CNI permissions from node role to Pod Identity/IRSA
  • ❌ Don't use wildcard conditions in IRSA trust policies
  • ❌ Don't attach application permissions to node IAM roles
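
To make the Pod Identity setup concrete, here is a sketch in CloudFormation YAML (chosen for this example only; the same resources exist in Terraform). The cluster name, namespace, and service account are hypothetical, and the role carries no permissions policy here. It assumes the EKS Pod Identity Agent add-on is already installed on the cluster.

```yaml
# Sketch: an IAM role trusted by EKS Pod Identity, bound to one service account.
Resources:
  AppRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: pods.eks.amazonaws.com   # Pod Identity service principal
            Action:
              - sts:AssumeRole
              - sts:TagSession                  # needed for session tags
  AppPodIdentity:
    Type: AWS::EKS::PodIdentityAssociation
    Properties:
      ClusterName: example-cluster              # hypothetical
      Namespace: team-a                         # hypothetical
      ServiceAccount: app                       # hypothetical
      RoleArn: !GetAtt AppRole.Arn
```

Unlike IRSA, the trust policy does not reference a per-cluster OIDC provider, so one role definition can be reused across clusters; the binding to a namespace and service account lives in the association.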

Pod Security Baseline

Apply Pod Security Admission (PSA) labels to all namespaces:

# Minimum: enforce baseline, warn on restricted
metadata:
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted

Secrets Management

| Approach | Complexity | Best For |
| --- | --- | --- |
| External Secrets Operator | Medium | ✅ GitOps workflows |
| Secrets Store CSI | Medium | Mount secrets as volumes |
| KMS envelope encryption | Low | Encrypt etcd secrets |

Always enable KMS envelope encryption for Kubernetes secrets.
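
As one way to express this, the sketch below uses an eksctl `ClusterConfig` (eksctl is an assumption of this example; this skill's own examples are Terraform-based). The cluster name, region, and key ARN are placeholders.

```yaml
# Sketch: enable KMS envelope encryption of Kubernetes secrets with eksctl.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster          # placeholder
  region: us-west-2              # placeholder
secretsEncryption:
  # Placeholder ARN; use a customer managed KMS key in the cluster's region.
  keyARN: arn:aws:kms:us-west-2:111122223333:key/EXAMPLE-KEY-ID
```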

For detailed security guidance, see: Security Reference | Runtime & Network | Supply Chain & Compliance

Reliability Essentials

Pod Disruption Budgets

Create PDBs for every production workload with >1 replica:

| Workload | Recommended PDB |
| --- | --- |
| Stateless (3+ replicas) | minAvailable: "50%" |
| Stateful quorum (3) | maxUnavailable: 1 |
| Batch/job | maxUnavailable: "50%" |
| Singleton | No PDB (would block all disruptions) |
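
A PDB for the stateless case could be written as in this sketch; the name, namespace, and `app: web` label are hypothetical and must match the workload's own pod labels.

```yaml
# Sketch: keep at least half of the matching pods available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                  # hypothetical
  namespace: team-a              # hypothetical
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: web                   # hypothetical; must match the Deployment's pod labels
```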

Health Probe Strategy

| Probe | Purpose | Key Rule |
| --- | --- | --- |
| Startup | Wait for slow init | Use for apps >10s startup |
| Readiness | Traffic routing | ✅ Check dependencies here |
| Liveness | Detect deadlocks | ❌ Never check dependencies |

Critical rule: Liveness probes must NOT check external dependencies. If the database goes down and liveness checks the DB, ALL pods restart — causing cascading failure.
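
The three probes might be wired up as in the sketch below. The image, port, and the `/healthz` and `/ready` paths are hypothetical; the assumption is that `/healthz` only reports on the process itself while `/ready` also checks downstream dependencies.

```yaml
# Sketch: startup, readiness, and liveness probes with separate endpoints.
apiVersion: v1
kind: Pod
metadata:
  name: web                          # hypothetical
spec:
  containers:
    - name: web
      image: example.com/web:1.0     # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:
        # Allows up to 24 x 5s = 120s for slow initialization before other probes start.
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 24
      readinessProbe:
        # May check dependencies: a failure only removes the pod from Service endpoints.
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 10
      livenessProbe:
        # Process-local check only: a failure restarts the container.
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```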

Graceful Shutdown Pattern

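spec:
  # Total shutdown budget; it includes the time spent in the preStop hook.
  terminationGracePeriodSeconds: 60
  containers:
    - lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent to the container.
            command: ["/bin/sh", "-c", "sleep 15"]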

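Why sleep 15: The preStop hook runs before the container receives SIGTERM, so the sleep gives kube-proxy and the load balancer time to remove the pod from traffic routing first. The sleep counts against terminationGracePeriodSeconds, so keep the grace period longer than the sleep plus the application's own shutdown time.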

Multi-AZ Distribution

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web # hypothetical label; must match the pods being spread

For detailed reliability guidance, see: Reliability & Resiliency — Core (see also reliability-advanced.md for DR, deployment strategies, and large-cluster guidance)

Cluster Upgrade Strategy

Upgrade Sequence (Strict Order)

1. Control Plane → 2. EKS Add-ons → 3. Data Plane → 4. Custom Add-ons

Pre-Upgrade Checklist

  1. Check EKS Cluster Insights for upgrade readiness
  2. Scan for deprecated APIs (Pluto, kube-no-trouble)
  3. Verify add-on compatibility with target version
  4. Test in non-prod environment first
  5. Ensure PDBs are configured for graceful node drain
  6. Back up cluster state (Velero or GitOps repo)

Upgrade Strategy Decision

| Factor | In-Place | Blue-Green |
| --- | --- | --- |
| Risk | Low-Medium | Lowest |
| Cost | No extra | 2× during migration |
| Rollback | ❌ No CP rollback | ✅ Switch back |
| Use when | ✅ Most upgrades | Critical workloads |

Data Plane with Karpenter

Karpenter automatically replaces nodes via drift detection after control plane upgrade. Control the speed with disruption.budgets:

disruption:
  budgets:
    - nodes: "10%" # Max 10% of nodes replaced at a time
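
Budgets can also be scoped to time windows. The fragment below is a sketch of a NodePool `spec.disruption` block that adds a window with no voluntary disruptions; the cron schedule, duration, and consolidation settings are illustrative assumptions. When several budgets are active at once, Karpenter applies the most restrictive one.

```yaml
# Fragment of a NodePool spec (sketch): cap node replacement and pause it during a window.
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 5m
  budgets:
    # At most 10% of this NodePool's nodes disrupted at any time.
    - nodes: "10%"
    # No voluntary disruptions for 8 hours starting 09:00 on weekdays (schedule is in UTC).
    - nodes: "0"
      schedule: "0 9 * * mon-fri"
      duration: 8h
```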

For detailed upgrade guidance, see: Cluster Upgrades Reference

Autoscaling Quick Reference

Node Autoscaler Selection

| | Karpenter | Cluster Autoscaler | Auto Mode |
| --- | --- | --- | --- |
| Default choice | ✅ Yes | Legacy/Outposts | Minimal ops |
| Scale-up speed | ~30s | ~60-90s | AWS-managed |
| Consolidation | ✅ Built-in | ❌ | ✅ |
| Customization | High | Medium | Low |

Pod Autoscaler Selection

| Scaler | Trigger | Use Case |
| --- | --- | --- |
| HPA | CPU, memory, custom | Stateless services |
| VPA | Historical usage | Right-sizing (recommendation mode) |
| KEDA | External events (SQS, Kafka) | Event-driven workloads |
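
A common pairing from the table is HPA for scaling out and VPA in recommendation mode for right-sizing. The sketch below assumes a Deployment named `web` (hypothetical), Metrics Server for the HPA, and the VPA CRDs and recommender installed in the cluster; the replica bounds and CPU target are illustrative.

```yaml
# Sketch: HPA scales replicas on CPU; VPA only publishes recommendations.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    # "Off" = recommendation mode: no pods are evicted or resized automatically.
    updateMode: "Off"
```

Running VPA with `updateMode: "Off"` avoids it fighting the HPA over the same CPU signal; the recommendations are read from the VPA object's status and applied to resource requests by hand or through GitOps.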

For detailed autoscaling guidance, see: Autoscaling Reference | Karpenter Reference

Terraform Examples Quick Start

Based on terraform-aws-modules/terraform-aws-eks.

Example Selection

| Starting Point | Recommended Example |
| --- | --- |
| General production | karpenter (MNG for system + Karpenter for workloads) |
| Minimal ops | eks-auto-mode |
| Managed nodes | eks-managed-node-group (AL2023 or Bottlerocket) |
| Full node control | self-managed-node-group |
| Platform capabilities | eks-capabilities (ArgoCD, ACK, KRO) |
| Hybrid/edge | eks-hybrid-nodes |

Common Deployment Topologies

Private cluster with Karpenter:

VPC (3 AZs, terraform-aws-modules/vpc/aws)
├── Private subnets → EKS nodes (MNG for system, Karpenter for workloads)
├── Public subnets → ALB (internet-facing)
├── Intra subnets → EKS control plane ENIs
└── NAT Gateway → 1 per AZ for production

Multi-tenant platform:

EKS Cluster (terraform-aws-modules/eks/aws)
├── kube-system (platform: CoreDNS, kube-proxy, VPC CNI)
├── karpenter (Karpenter controller on MNG)
├── monitoring (shared: Prometheus, Grafana)
├── ingress (shared: AWS LBC)
├── team-a namespace (RBAC, NetworkPolicy, ResourceQuota)
├── team-b namespace (RBAC, NetworkPolicy, ResourceQuota)
└── team-c namespace (RBAC, NetworkPolicy, ResourceQuota)
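
The per-team guardrails in the multi-tenant layout could start from something like this sketch for `team-a`. The quota values are illustrative, and the NetworkPolicy only takes effect if a policy engine is active (for example, network policy support enabled in the VPC CNI); that is an assumption about the cluster.

```yaml
# Sketch: baseline quota and default-deny ingress for one tenant namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"           # illustrative values
    requests.memory: 64Gi
    limits.memory: 128Gi
    pods: "200"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}                # selects every pod in the namespace
  policyTypes:
    - Ingress                    # no ingress rules listed, so all ingress is denied
```

Teams then add narrower allow policies on top of the default deny, for example to admit traffic from the shared ingress namespace.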

For detailed examples and terraform patterns, see: Terraform Examples Reference

Cost Optimization Quick Wins

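| Action | Savings | Effort |
| --- | --- | --- |
| Graviton (arm64) | 20-40% | Low |
| Spot for non-critical | 60-90% | Low |
| Karpenter consolidation | 20-30% | Low |
| VPA right-sizing | 15-30% | Medium |
| gp3 over gp2 | 20% on EBS | Low |
| VPC endpoints | Cuts NAT data processing charges for AWS service traffic | Low |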
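
For the gp3 row, a default StorageClass backed by the EBS CSI driver could look like the sketch below. It assumes the EBS CSI driver add-on is installed; the class name and the choice to make it the default are illustrative.

```yaml
# Sketch: gp3 as the default StorageClass via the EBS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
# Delay volume creation until the pod is scheduled, so the volume lands in the pod's AZ.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

If an older gp2 class is still annotated as default, remove that annotation so only one default StorageClass remains.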

For detailed cost guidance, see: Cost Optimization Reference | For scalability guidance, see: Scalability Reference

Observability Quick Reference

| Pillar | AWS-Managed | Open Source |
| --- | --- | --- |
| Metrics | Container Insights | AMP + Grafana |
| Logs | CloudWatch Logs | OpenSearch, Loki |
| Traces | X-Ray | ADOT + Jaeger/Tempo |

Essential: Enable EKS audit logging and GuardDuty EKS Runtime Monitoring for security visibility.
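
Control plane audit logging is a cluster-level setting. As a sketch, this is how it could be expressed in an eksctl `ClusterConfig` (eksctl is an assumption of this example; the Terraform module exposes an equivalent setting). The cluster name and region are placeholders, and GuardDuty is enabled separately at the account level.

```yaml
# Sketch: send audit and authenticator control plane logs to CloudWatch Logs.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster          # placeholder
  region: us-west-2              # placeholder
cloudWatch:
  clusterLogging:
    enableTypes: ["audit", "authenticator"]
```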

For detailed observability guidance, see: Observability Reference

EKS Capabilities

EKS Capabilities are AWS-managed features installed and updated as part of the EKS platform. They run in AWS-owned infrastructure separate from your clusters, with AWS handling scaling, patching, and upgrading.

| Capability | What It Does | When to Use Managed | When to Self-Manage |
| --- | --- | --- | --- |
| ArgoCD | GitOps continuous delivery | Multi-account hub-and-spoke, IAM IDC integration, minimal ops | Custom plugins, air-gapped, existing ArgoCD investment |
| ACK | Manage AWS resources via K8s CRDs (S3, RDS, IAM, etc.) | Standard AWS resource management | Specific controller version pinning, custom config |
| KRO | Platform abstractions via ResourceGraphDefinitions | Golden path templates, multi-resource compositions | Early adoption risk concerns, custom reconciliation logic |

Combined pattern: ArgoCD deploys ACK resources + KRO compositions via GitOps, providing a single workflow for both infrastructure and applications.
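
In that pattern, the ACK resources and KRO compositions live in Git and an ArgoCD Application syncs them. A minimal sketch follows; the repository URL, path, and namespaces are hypothetical, and the managed ArgoCD capability may expect a different Application namespace or destination than the self-managed defaults shown here.

```yaml
# Sketch: an ArgoCD Application that syncs a directory of ACK/KRO manifests from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-infra                              # hypothetical
  namespace: argocd                                 # self-managed default; may differ when managed
spec:
  project: default
  source:
    repoURL: https://example.com/org/platform.git   # hypothetical repository
    targetRevision: main
    path: infra                                     # hypothetical directory of manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: platform                             # hypothetical
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual drift in the cluster
    syncOptions:
      - CreateNamespace=true
```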

For detailed ArgoCD patterns, see: ArgoCD Patterns Reference

Detailed References

This skill uses progressive disclosure. Essential guidance is in this main file; detailed reference material is loaded on demand:

  • Security — IAM, Cluster Access Manager, Pod Identity, IRSA, pod security standards, multi-tenancy, secrets management, data encryption
  • Security — Runtime & Network — Runtime threat detection (GuardDuty, seccomp, AppArmor, Falco), network policies, SG for pods, encryption in transit, detective controls
  • Security — Supply Chain & Compliance — Image security (SBOMs, attestations, ECR hardening), infrastructure hardening (Bottlerocket, CIS benchmarks), regulatory compliance, incident response
  • Networking — VPC CNI modes (secondary IP, prefix delegation, custom networking), subnet/CIDR planning, IPv4 vs IPv6, Security Groups for Pods, IP address management
  • Networking — Ingress & DNS — Ingress patterns (ALB, NLB, Gateway API), AWS Load Balancer Controller, service mesh, DNS/CoreDNS tuning, private cluster connectivity
  • Reliability & Resiliency — Core — HA patterns, PDBs, health probes, load balancer health checks, lifecycle hooks, topology spread, resource management
  • Reliability & Resiliency — Advanced — disaster recovery, zonal shift, deployment strategies, large cluster guidance, chaos engineering, admission-controller topology enforcement
  • Autoscaling — Autoscaler selection, Cluster Autoscaler (IAM, Spot, overprovisioning, parameter tuning), HPA, VPA, KEDA, CoreDNS autoscaling
  • Karpenter — Operational best practices, NodePools, EC2NodeClass, Spot/interruption handling, consolidation, multiple NodePool strategy, cost controls, resource management, private clusters, CoreDNS with Karpenter
  • Cluster Upgrades — In-place and blue-green upgrades, pre-upgrade validation, add-on management, API deprecation detection, version skew policy, Bottlerocket updates, rollback procedures
  • Cost Optimization — CFM framework, compute/networking/storage cost strategies, observability cost management, Spot, Graviton, tagging, Kubecost
  • Scalability — Scaling theory (churn rate, QPS), control plane (APF, monitoring), data plane (node sizing, diversity), cluster services (CoreDNS, Metrics Server), workload patterns, IPVS, large-cluster guidance
  • Observability — Observability strategy, CloudWatch Container Insights & Application Signals, Prometheus/Grafana, control plane monitoring, network performance monitoring, logging architecture, distributed tracing, GPU/AI-ML observability, detective controls, alerting patterns
  • Terraform Examples — terraform-aws-modules/terraform-aws-eks examples, submodules, add-on management, Provisioned Control Plane, EFA, VPC patterns, deployment topologies
  • ArgoCD Patterns — ArgoCD architecture, App of Apps, ApplicationSets, GitOps Bridge, multi-cluster patterns (hub-and-spoke, decentralized, hybrid), EKS ArgoCD Capability (managed vs self-managed, migration), ACK/KRO integration, multi-tenant RBAC
  • Container Registry — ECR architecture, operating models, image promotion, vulnerability scanning, base image curation, lifecycle policies, pull-through cache, repository creation templates, managed signing (AWS Signer), archival storage class, registry configuration
  • EKS Auto Mode — Auto Mode architecture, managed NodePools/NodeClasses, migration from standard EKS, comparison with self-managed Karpenter, limitations and FAQ

How to use: When you need detailed information on a topic, reference the appropriate guide. Claude will load it on demand.
