This page is generated from skills/eks-best-practices/references/reliability-advanced.md. Edit the source, not this page.

Reliability & Resiliency — Advanced / Operational

Part of: eks-best-practices

Purpose: Disaster recovery, deployment strategies, cluster-level enforcement, zonal shift, large-cluster guidance, and chaos engineering for Amazon EKS.


Table of Contents

  1. Disaster Recovery
  2. EKS Zonal Shift (ARC Integration)
  3. Enforcing Default Topology Spread via Admission Controller
  4. Velero Backup Tiers
  5. Recovery Scenarios
  6. Deployment Strategies
  7. Large Cluster Guidance
  8. Chaos Engineering with AWS FIS

Disaster Recovery

Backup Strategy with Velero

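# Install Velero with the AWS plugin
# Assumes IRSA for credentials; the role ARN below is a placeholder
velero install \
  --provider aws \
  --bucket my-velero-bucket \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --no-secret \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/velero

# Schedule daily backups at 02:00, retained for 30 days (720h)
velero schedule create daily-backup \
  --schedule "0 2 * * *" \
  --ttl 720h \
  --include-namespaces production,staging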

DR Patterns

| Pattern | RPO | RTO | Cost |
|---|---|---|---|
| Backup/Restore | Hours | Hours | Low |
| Pilot Light | Minutes | 30-60 min | Medium |
| Warm Standby | Seconds | Minutes | High |
| Active-Active | Near-zero | Near-zero | Highest |

EKS-specific DR considerations:

  • Back up cluster configuration (add-ons, RBAC, CRDs) separately from workloads
  • Use GitOps (ArgoCD/Flux) for declarative cluster state — simplifies recovery
  • Test restore procedures regularly
  • Cross-region ECR replication for image availability

EKS Zonal Shift (ARC Integration)

Amazon Application Recovery Controller (ARC) zonal shift allows you to shift traffic away from an impaired Availability Zone for EKS workloads. When an AZ experiences degradation, zonal shift removes that AZ from the load balancer target group, redirecting traffic to healthy AZs without requiring application changes.

| Aspect | Detail |
|---|---|
| What it does | Removes an AZ from ALB/NLB target groups, shifting traffic to healthy AZs |
| Trigger | Manual via console/API, or automated via zonal autoshift |
| Duration | Up to 72 hours per shift, extendable |
| EKS impact | Pods in the shifted AZ stop receiving traffic but continue running |
| Pod scheduling | Existing pods remain; new pods still schedule to all AZs unless topology constraints prevent it |
| Prerequisites | Multi-AZ deployment, topology spread constraints, sufficient capacity in remaining AZs |

When to use zonal shift:

  • AZ-level impairment (network, storage, compute degradation)
  • Elevated error rates from a specific AZ
  • Proactive shift during planned AZ maintenance
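
A manual shift through the API might look like the sketch below. It assumes the cluster itself is registered as the ARC resource; the cluster name, account ID, ARN, and AZ ID are placeholders, and the `run` helper is an illustrative dry-run wrapper that only prints the commands unless `DRY_RUN=0` is set.

```shell
# Dry-run helper: prints each command by default; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholders (assumptions): replace with your own values.
CLUSTER_NAME="my-cluster"
CLUSTER_ARN="arn:aws:eks:us-east-1:111122223333:cluster/my-cluster"
IMPAIRED_AZ_ID="use1-az1"   # AZ ID, not the AZ name

# One-time setup: enable zonal shift on the cluster.
run aws eks update-cluster-config \
  --name "$CLUSTER_NAME" \
  --zonal-shift-config enabled=true

# During an incident: shift traffic away from the impaired AZ for 2 hours.
run aws arc-zonal-shift start-zonal-shift \
  --resource-identifier "$CLUSTER_ARN" \
  --away-from "$IMPAIRED_AZ_ID" \
  --expires-in 2h \
  --comment "Elevated error rate in $IMPAIRED_AZ_ID"
```

Choosing a short expiry and extending it if needed keeps a forgotten shift from outliving the incident.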

Limitations:

  • Does not evict or reschedule pods — only affects traffic routing
  • Requires sufficient capacity in remaining AZs to handle full load
  • Works with ALB and NLB only (not ClusterIP or NodePort services)

DO:

  • Ensure topology spread constraints distribute pods across all AZs before relying on zonal shift
  • Size capacity for N-1 AZ operation (if 3 AZs, each AZ should handle 50% of peak load)
  • Test zonal shift in non-production before relying on it in production

DON'T:

  • Use zonal shift as a substitute for proper multi-AZ pod distribution
  • Assume pods will automatically move — zonal shift only affects traffic, not scheduling
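
The N-1 sizing rule is simple arithmetic: when one of N zones is shifted away, the remaining N-1 zones share the full peak, so each must be able to carry at least 100/(N-1) percent of it. A minimal sketch, assuming load is spread evenly across zones (the function name `per_az_capacity_pct` is illustrative):

```shell
# Percentage of peak load each AZ must be able to carry so that the
# remaining zones absorb everything when one zone is shifted away.
# Assumes traffic is spread evenly; rounds up to a whole percent.
per_az_capacity_pct() {
  az_count=$1
  if [ "$az_count" -lt 2 ]; then
    echo "need at least 2 AZs" >&2
    return 1
  fi
  remaining=$((az_count - 1))
  # Integer ceiling of 100 / remaining
  echo $(( (100 + remaining - 1) / remaining ))
}

per_az_capacity_pct 3   # prints 50
per_az_capacity_pct 4   # prints 34
```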

Enforcing Default Topology Spread via Admission Controller

The default KubeSchedulerConfiguration cannot be changed in Amazon EKS. The built-in cluster-level defaults use high maxSkew values (3 for hostname, 5 for zone) with whenUnsatisfiable: ScheduleAnyway, so they act as soft preferences that are too permissive for small deployments. To enforce stricter topology spread, use a mutating admission controller.

Approach with Kyverno: Create a mutating policy that injects topologySpreadConstraints into Deployments that don't already specify them. The policy matches Deployments with replicas >= 2 and adds zone-based and node-based spread constraints with maxSkew: 1.

Approach with Gatekeeper: Create a ConstraintTemplate that validates Deployments have topologySpreadConstraints defined, and rejects those without. Optionally scope to namespaces with a specific label (e.g., ha=true).

| Approach | Type | Behavior |
|---|---|---|
| Kyverno mutate | Inject defaults | Adds constraints if missing; doesn't override explicit ones |
| Gatekeeper validate | Reject non-compliant | Blocks Deployments without constraints; teams must add their own |
| Kyverno validate + mutate | Both | Injects defaults AND validates minimum requirements |

Recommendation: Use Kyverno mutate to inject sensible defaults, so teams get topology spread automatically without needing to know the details. Add a validate policy for critical namespaces that require explicit constraints.
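
One way the Kyverno mutate approach could look is sketched below. It is an illustration, not a vetted policy: the policy name, the choice of `ScheduleAnyway`, and the assumption that every Deployment's pod template carries an `app` label are all mine, so adjust the label selector to your own labeling convention. The script only writes the manifest to a local file for review.

```shell
# Write a Kyverno ClusterPolicy to a local file for review before applying.
cat > default-topology-spread.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-topology-spread   # hypothetical policy name
spec:
  rules:
    - name: inject-topology-spread
      match:
        any:
          - resources:
              kinds:
                - Deployment
      preconditions:
        all:
          # Only mutate Deployments with 2 or more replicas
          - key: "{{ request.object.spec.replicas || `1` }}"
            operator: GreaterThanOrEquals
            value: 2
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                # "+( )" adds the field only when it is not already set
                +(topologySpreadConstraints):
                  - maxSkew: 1
                    topologyKey: topology.kubernetes.io/zone
                    whenUnsatisfiable: ScheduleAnyway   # assumption: soft constraint
                    labelSelector:
                      matchLabels:
                        # Assumes the pod template has an "app" label
                        app: "{{ request.object.spec.template.metadata.labels.app }}"
                  - maxSkew: 1
                    topologyKey: kubernetes.io/hostname
                    whenUnsatisfiable: ScheduleAnyway
                    labelSelector:
                      matchLabels:
                        app: "{{ request.object.spec.template.metadata.labels.app }}"
EOF

# After review, apply with: kubectl apply -f default-topology-spread.yaml
```

Test the policy in a non-production cluster first, since Deployments without the assumed label would not resolve the selector variable.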


Velero Backup Tiers

| Tier | Scope | Frequency | Retention | What to Back Up |
|---|---|---|---|---|
| Production | K8s resources | Hourly | 30 days | Namespaces, Deployments, Services, ConfigMaps, Secrets, CRDs |
| Production | Persistent volumes | Every 4 hours | 30 days | EBS snapshots for stateful workloads |
| Non-production | K8s resources + PVs | Daily | 7 days | Same as production but less frequent |

What NOT to back up:

  • Resources managed by GitOps (ArgoCD/Flux will reconcile from Git)
  • Node-level state (Karpenter will reprovision)
  • Cached data (Redis, Memcached — ephemeral by design)

What to ALWAYS back up:

  • Custom resources and CRDs not in Git
  • Persistent volume data (databases, file storage)
  • Secrets not managed by external secrets store
  • Namespace-level RBAC and resource quotas (if not in Git)

DO:

  • Encrypt backups with KMS CMK
  • Store backups in a separate AWS account or region
  • Test restore quarterly in an isolated environment
  • Use Velero schedules (not manual backups) for consistency

DON'T:

  • Back up everything — exclude GitOps-managed resources to avoid conflicts on restore
  • Skip PV backups for stateful workloads
  • Store backups in the same account as the cluster (blast radius)
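
The tiers above could be expressed as Velero schedules roughly as follows. This is a sketch: the schedule names are hypothetical, the `production` and `staging` namespaces are carried over from the earlier example, and the `run` helper is an illustrative dry-run wrapper that only prints the commands unless `DRY_RUN=0` is set.

```shell
# Dry-run helper: prints each command by default; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Convert a retention period in days to the hour-based TTL used with --ttl.
ttl_hours() {
  echo "$(( $1 * 24 ))h"
}

# Production: Kubernetes resources hourly, 30-day retention, no volume snapshots.
run velero schedule create prod-resources-hourly \
  --schedule "0 * * * *" \
  --ttl "$(ttl_hours 30)" \
  --include-namespaces production \
  --snapshot-volumes=false

# Production: volume snapshots every 4 hours, 30-day retention.
# This schedule also captures the resource manifests alongside the snapshots.
run velero schedule create prod-volumes-4h \
  --schedule "0 */4 * * *" \
  --ttl "$(ttl_hours 30)" \
  --include-namespaces production \
  --snapshot-volumes=true

# Non-production: resources and volumes daily, 7-day retention.
run velero schedule create nonprod-daily \
  --schedule "0 2 * * *" \
  --ttl "$(ttl_hours 7)" \
  --include-namespaces staging \
  --snapshot-volumes=true
```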

Recovery Scenarios

| Scenario | Impact | Recovery Mechanism | Estimated Recovery Time |
|---|---|---|---|
| Single pod failure | One pod down | Kubernetes self-healing (ReplicaSet recreates pod) | 10-30 seconds |
| Node failure | All pods on node down | Karpenter provisions replacement node, pods rescheduled | 1-3 minutes |
| AZ impairment | Pods in one AZ degraded | Zonal shift (traffic) + topology spread (pods already distributed) | 1-5 minutes (traffic shift) |
| Add-on failure | Cluster functionality degraded | Helm rollback or GitOps revert | 5-15 minutes |
| Control plane issue | API server unavailable | AWS-managed recovery (automatic) | 5-15 minutes (AWS SLA) |
| Full cluster loss | Everything down | Velero restore + GitOps reconciliation + DNS switch | 1-4 hours |
| Region failure | All AZs down | Multi-region failover (if configured) | 15-60 minutes |

Full Cluster Recovery Steps (High Level)

  1. Provision new EKS cluster (Terraform apply)
  2. Install core add-ons (VPC CNI, CoreDNS, Karpenter)
  3. Restore Velero backup (K8s resources + PV snapshots)
  4. Reconcile GitOps repository (ArgoCD sync)
  5. Validate workloads are running and healthy
  6. Switch DNS/traffic to new cluster
  7. Verify end-to-end functionality
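
The steps above can be sketched as a runbook script. Everything named here is a placeholder or assumption (cluster name, backup name, Argo CD application, hosted zone ID, change-batch file, health URL), and the `run` helper is an illustrative dry-run wrapper that only prints the commands unless `DRY_RUN=0` is set, so the sequence can be rehearsed safely.

```shell
# Dry-run helper: prints each command by default; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical names: replace with your own.
CLUSTER_NAME="prod-recovery"
BACKUP_NAME="daily-backup-20240101020000"
ROOT_APP="platform-root"            # Argo CD application to sync
HOSTED_ZONE_ID="Z0000000EXAMPLE"

# 1. Provision the new cluster
run terraform apply -auto-approve

# 2. Install core add-ons (Karpenter is typically installed via Helm; omitted here)
run aws eks create-addon --cluster-name "$CLUSTER_NAME" --addon-name vpc-cni
run aws eks create-addon --cluster-name "$CLUSTER_NAME" --addon-name coredns

# 3. Restore from backup (Velero must already be installed in the new cluster
#    and pointed at the same backup bucket)
run velero restore create --from-backup "$BACKUP_NAME" --wait

# 4. Reconcile GitOps state
run argocd app sync "$ROOT_APP"

# 5. Validate workloads
run kubectl wait --for=condition=Available deployment --all \
  --namespace production --timeout=600s

# 6. Switch DNS (change batch file prepared in advance)
run aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch file://switch-to-new-cluster.json

# 7. Verify end to end
run curl --fail --silent --show-error https://app.example.com/healthz
```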

Deployment Strategies

Rolling Updates (Default)

Control update behavior with maxUnavailable and maxSurge:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # No downtime: new pods are created before old ones are removed
    maxSurge: 1         # One extra pod during rollout

The default maxUnavailable: 25% means that with 100 replicas, as few as 75 pods may be available during a rollout. If your app needs 80 or more, set maxUnavailable: 20% or lower.
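
To check a given setting before rollout, the arithmetic can be sketched as below. Kubernetes rounds a percentage maxUnavailable down to a whole number of pods; the function name `min_available` is illustrative.

```shell
# Minimum number of pods kept available during a rolling update when
# maxUnavailable is given as a percentage (rounded down to whole pods).
min_available() {
  replicas=$1
  max_unavailable_pct=$2
  # Integer division rounds down, matching the maxUnavailable rounding rule
  unavailable=$(( replicas * max_unavailable_pct / 100 ))
  echo $(( replicas - unavailable ))
}

min_available 100 25   # prints 75
min_available 100 20   # prints 80
min_available 10 25    # prints 8 (2.5 rounds down to 2)
```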

Use kubectl rollout undo deployment <name> to roll back quickly to the previous revision.

Blue/Green Deployments

Create a new Deployment identical to the current version, verify pods are healthy, then switch the Service selector to point to the new Deployment. Automate with Flux, Jenkins, Spinnaker, or AWS Load Balancer Controller.

Canary Deployments

Deploy the new version with fewer replicas alongside the existing Deployment, divert a small percentage of traffic, and progressively increase if metrics are healthy. Use Flagger with Istio or AWS App Mesh for automated canary progression.


Large Cluster Guidance

For clusters approaching scale limits:

| Issue | Threshold | Solution |
|---|---|---|
| kube-proxy latency | >1000 Services | Switch to ipvs mode |
| EC2 API throttling | Frequent node scaling | Configure CNI to cache IPs, use larger instance types |
| etcd size | Approaching 8GB | Monitor apiserver_storage_size_bytes, reduce CRD churn |
| DNS pressure | >500 nodes | Deploy NodeLocal DNSCache, enable CoreDNS auto-scaling |

Chaos Engineering with AWS FIS

AWS Fault Injection Service (FIS) provides managed chaos engineering experiments for EKS. FIS integrates with EKS to inject faults at the pod, node, and AZ level, validating that your resilience mechanisms (PDBs, topology spread, autoscaling) work as expected.

Common EKS Experiments

| Experiment | What It Tests | Target |
|---|---|---|
| Pod delete | Self-healing, PDB behavior | Specific pods by label |
| Node terminate | Node replacement, pod rescheduling | EC2 instances in node group |
| AZ failure | Multi-AZ resilience, zonal shift | Subnet/AZ disruption |
| CPU stress | HPA scaling, resource limits | Pods or nodes |
| Network disruption | Timeout handling, circuit breakers | Pod network |
| DNS failure | DNS caching, fallback behavior | CoreDNS disruption |

Experiment Progression

| Phase | Experiments | Scope |
|---|---|---|
| 1. Start small | Delete single pod, verify PDB | One namespace, non-production |
| 2. Node level | Terminate one node, verify rescheduling | One node group |
| 3. Multi-pod | Delete multiple pods simultaneously | Multiple namespaces |
| 4. AZ level | Simulate AZ failure, verify topology spread | One AZ |
| 5. Steady state | Run experiments continuously in production | Automated, with guardrails |

DO:

  • Start with non-production environments
  • Define steady-state hypothesis before each experiment (what "healthy" looks like)
  • Set stop conditions (abort if error rate exceeds threshold)
  • Run experiments during business hours with the team available

DON'T:

  • Run AZ-level experiments without verifying multi-AZ pod distribution first
  • Skip PDB validation before running node-terminate experiments
  • Run chaos experiments without monitoring and alerting in place

Prerequisite reading: reliability-core.md.

