Source

This page is generated from skills/eks-best-practices/references/reliability-advanced.md. Edit the source, not this page.

Reliability & Resiliency — Advanced / Operational

Part of: eks-best-practices

Purpose: Disaster recovery, deployment strategies, cluster-level enforcement, zonal shift, large-cluster guidance, and chaos engineering for Amazon EKS.

Disaster Recovery
EKS Zonal Shift (ARC Integration)
Enforcing Default Topology Spread via Admission Controller
Velero Backup Tiers
Recovery Scenarios
Deployment Strategies
Large Cluster Guidance
Chaos Engineering with AWS FIS

Disaster Recovery

Backup Strategy with Velero

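# Install Velero with the AWS plugin.
# --no-secret assumes Velero gets AWS credentials from an IAM role
# (IRSA or EKS Pod Identity); use --secret-file <path> for static credentials.
velero install \
  --provider aws \
  --bucket my-velero-bucket \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --no-secret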

# Schedule daily backups
velero schedule create daily-backup \
  --schedule "0 2 * * *" \
  --ttl 720h \
  --include-namespaces production,staging

DR Patterns

Pattern	RPO	RTO	Cost
Backup/Restore	Hours	Hours	Low
Pilot Light	Minutes	30-60 min	Medium
Warm Standby	Seconds	Minutes	High
Active-Active	Near-zero	Near-zero	Highest

EKS-specific DR considerations:

Back up cluster configuration (add-ons, RBAC, CRDs) separately from workloads
Use GitOps (ArgoCD/Flux) for declarative cluster state — simplifies recovery
Test restore procedures regularly
Replicate ECR images across regions so images stay pullable during a regional outage

EKS Zonal Shift (ARC Integration)

Amazon Application Recovery Controller (ARC) zonal shift allows you to shift traffic away from an impaired Availability Zone for EKS workloads. When an AZ experiences degradation, zonal shift removes that AZ from the load balancer target group, redirecting traffic to healthy AZs without requiring application changes.

Aspect	Detail
What it does	Removes an AZ from ALB/NLB target groups, shifting traffic to healthy AZs
Trigger	Manual via console/API, or automated via zonal autoshift
Duration	Up to 72 hours per shift, extendable
EKS impact	Pods in the shifted AZ stop receiving traffic but continue running
Pod scheduling	Existing pods remain. When EKS zonal shift is enabled on the cluster, nodes in the shifted AZ are cordoned, so new pods schedule only to the remaining AZs
Prerequisites	Multi-AZ deployment, topology spread constraints, sufficient capacity in remaining AZs

When to use zonal shift:

AZ-level impairment (network, storage, compute degradation)
Elevated error rates from a specific AZ
Proactive shift during planned AZ maintenance

Limitations:

Does not evict or reschedule pods — only affects traffic routing
Requires sufficient capacity in remaining AZs to handle full load
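Load balancer shifts apply to ALB and NLB only; in-cluster (ClusterIP) traffic is shifted only when EKS zonal shift is enabled on the cluster, which removes pod endpoints in the impaired AZ from EndpointSlices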

DO:

Ensure topology spread constraints distribute pods across all AZs before relying on zonal shift
Size capacity for N-1 AZ operation (with 3 AZs, the 2 remaining AZs must each handle 50% of peak load)
Test zonal shift in non-production before relying on it in production

DON'T:

Use zonal shift as a substitute for proper multi-AZ pod distribution
Assume pods will automatically move: zonal shift redirects traffic but never evicts or relocates running pods
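
The multi-AZ distribution that zonal shift depends on can be declared on the workload itself. The following is a minimal sketch, not a definitive manifest: the Deployment name `web`, the `app: web` label, the replica count, and the image are all hypothetical placeholders.

```yaml
# Hypothetical workload: name, labels, replica count, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        # Keep the pod count in each zone within 1 of every other zone.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          # ScheduleAnyway treats the constraint as a preference.
          # DoNotSchedule is stricter but can leave pods Pending
          # when one zone cannot accept new pods.
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```

With 6 replicas across 3 AZs, this spreads pods 2/2/2. Losing one AZ leaves 4 pods serving, which is why the N-1 sizing rule above matters.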

Enforcing Default Topology Spread via Admission Controller

The default KubeSchedulerConfiguration cannot be changed in Amazon EKS. The built-in defaults use high maxSkew values (3 for hostname, 5 for zone) with whenUnsatisfiable: ScheduleAnyway, which is too permissive for small deployments: a 3-replica Deployment can land entirely in one zone without exceeding a zone skew of 5. To enforce stricter topology spread, use a mutating admission controller.

Approach with Kyverno: Create a mutating policy that injects topologySpreadConstraints into Deployments that don't already specify them. The policy matches Deployments with replicas >= 2 and adds zone-based and node-based spread constraints with maxSkew: 1.
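
A minimal sketch of such a Kyverno policy follows. It assumes every Deployment's pod template carries an `app` label, which is a convention introduced here for illustration and not something the source requires. The policy and rule names are hypothetical. Validate it against your Kyverno version before use.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-topology-spread   # hypothetical policy name
spec:
  rules:
    - name: inject-zone-and-node-spread
      match:
        any:
          - resources:
              kinds:
                - Deployment
      preconditions:
        all:
          # Only mutate Deployments with 2 or more replicas.
          # Falls back to 1 if replicas is absent from the request.
          - key: "{{ request.object.spec.replicas || `1` }}"
            operator: GreaterThanOrEquals
            value: 2
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                # The +() anchor adds the field only if it is not already set,
                # so explicit constraints are never overridden.
                +(topologySpreadConstraints):
                  - maxSkew: 1
                    topologyKey: topology.kubernetes.io/zone
                    whenUnsatisfiable: ScheduleAnyway
                    labelSelector:
                      matchLabels:
                        # Assumption: pod templates carry an "app" label.
                        app: "{{ request.object.spec.template.metadata.labels.app }}"
                  - maxSkew: 1
                    topologyKey: kubernetes.io/hostname
                    whenUnsatisfiable: ScheduleAnyway
                    labelSelector:
                      matchLabels:
                        app: "{{ request.object.spec.template.metadata.labels.app }}"
```

`ScheduleAnyway` is used here so that an injected default never blocks scheduling. Teams that need hard guarantees can set `DoNotSchedule` explicitly in their own manifests.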

Approach with Gatekeeper: Create a ConstraintTemplate that validates Deployments have topologySpreadConstraints defined, and rejects those without. Optionally scope to namespaces with a specific label (e.g., ha=true).
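
A sketch of the Gatekeeper variant might look like the following. The template kind `K8sRequireTopologySpread` and the constraint name are hypothetical, and the Rego is one possible way to express the check, not an official library template.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiretopologyspread
spec:
  crd:
    spec:
      names:
        kind: K8sRequireTopologySpread   # hypothetical kind
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiretopologyspread

        # Report a violation when a Deployment has no topologySpreadConstraints.
        violation[{"msg": msg}] {
          input.review.object.kind == "Deployment"
          not has_spread(input.review.object)
          msg := sprintf("Deployment %v must define topologySpreadConstraints", [input.review.object.metadata.name])
        }

        has_spread(obj) {
          count(obj.spec.template.spec.topologySpreadConstraints) > 0
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequireTopologySpread
metadata:
  name: require-topology-spread
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
    # Only enforce in namespaces labeled ha=true.
    namespaceSelector:
      matchLabels:
        ha: "true"
```

The ConstraintTemplate must be applied and established before the constraint that references it.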

Approach	Type	Behavior
Kyverno mutate	Inject defaults	Adds constraints if missing; doesn't override explicit ones
Gatekeeper validate	Reject non-compliant	Blocks Deployments without constraints; teams must add their own
Kyverno validate + mutate	Both	Injects defaults AND validates minimum requirements

Recommendation: Use Kyverno mutate to inject sensible defaults, so teams get topology spread automatically without needing to know the details. Add a validate policy for critical namespaces that require explicit constraints.

Velero Backup Tiers

Tier	Scope	Frequency	Retention	What to Back Up
Production	K8s resources	Hourly	30 days	Namespaces, Deployments, Services, ConfigMaps, Secrets, CRDs
Production	Persistent volumes	Every 4 hours	30 days	EBS snapshots for stateful workloads
Non-production	K8s resources + PVs	Daily	7 days	Same as production but less frequent
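
The two production tiers can be expressed as Velero Schedule resources. This is a sketch under assumptions: the schedule names and the `production` namespace are placeholders, and Velero is assumed to run in the `velero` namespace.

```yaml
# Tier 1: Kubernetes resources, hourly, 30-day retention.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-resources-hourly   # hypothetical name
  namespace: velero
spec:
  schedule: "0 * * * *"
  template:
    includedNamespaces:
      - production
    snapshotVolumes: false      # resources only; volumes are covered below
    ttl: 720h0m0s
---
# Tier 2: persistent volumes, every 4 hours, 30-day retention.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-volumes-4h         # hypothetical name
  namespace: velero
spec:
  schedule: "0 */4 * * *"
  template:
    includedNamespaces:
      - production
    includedResources:
      - persistentvolumeclaims
      - persistentvolumes
    snapshotVolumes: true
    ttl: 720h0m0s
```

A non-production tier would follow the same shape with a daily cron expression and `ttl: 168h0m0s` for 7-day retention.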

What NOT to back up:

Resources managed by GitOps (ArgoCD/Flux will reconcile from Git)
Node-level state (Karpenter will reprovision)
Cached data (Redis, Memcached — ephemeral by design)

What to ALWAYS back up:

Custom resources and CRDs not in Git
Persistent volume data (databases, file storage)
Secrets not managed by an external secrets store
Namespace-level RBAC and resource quotas (if not in Git)

DO:

Encrypt backups with a customer managed KMS key (CMK)
Store backups in a separate AWS account or region
Test restore quarterly in an isolated environment
Use Velero schedules (not manual backups) for consistency

DON'T:

Back up everything — exclude GitOps-managed resources to avoid conflicts on restore
Skip PV backups for stateful workloads
Store backups in the same account as the cluster (blast radius)
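
As a sketch of the encryption and isolation guidance above, a Velero BackupStorageLocation can point at a bucket in another region and name a KMS key. The bucket name, region, and key alias below are hypothetical. Cross-account access additionally needs a bucket policy and key policy that are not shown here.

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  default: true
  objectStorage:
    bucket: my-velero-bucket-dr      # hypothetical bucket outside the cluster's region
  config:
    region: us-west-2                # hypothetical DR region
    kmsKeyId: alias/velero-backups   # hypothetical customer managed key alias
```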

Recovery Scenarios

Scenario	Impact	Recovery Mechanism	Estimated Recovery Time
Single pod failure	One pod down	Kubernetes self-healing (ReplicaSet recreates pod)	10-30 seconds
Node failure	All pods on node down	Karpenter provisions replacement node, pods rescheduled	1-3 minutes
AZ impairment	Pods in one AZ degraded	Zonal shift (traffic) + topology spread (pods already distributed)	1-5 minutes (traffic shift)
Add-on failure	Cluster functionality degraded	Helm rollback or GitOps revert	5-15 minutes
Control plane issue	API server unavailable	AWS-managed recovery (automatic)	5-15 minutes
Full cluster loss	Everything down	Velero restore + GitOps reconciliation + DNS switch	1-4 hours
Region failure	All AZs down	Multi-region failover (if configured)	15-60 minutes

Full Cluster Recovery Steps (High Level)

1. Provision new EKS cluster (Terraform apply)
2. Install core add-ons (VPC CNI, CoreDNS, Karpenter)
3. Restore Velero backup (K8s resources + PV snapshots)
4. Reconcile GitOps repository (ArgoCD sync)
5. Validate workloads are running and healthy
6. Switch DNS/traffic to new cluster
7. Verify end-to-end functionality
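
The Velero restore step can be sketched as a Restore resource. It assumes Velero is already installed on the new cluster and pointed at the same backup bucket. The backup name is a hypothetical example of the `<schedule-name>-<timestamp>` pattern Velero uses for scheduled backups.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: full-cluster-restore   # hypothetical name
  namespace: velero
spec:
  backupName: daily-backup-20250101020000   # hypothetical backup to restore from
  includedNamespaces:
    - production
    - staging
  restorePVs: true               # recreate volumes from their snapshots
  existingResourcePolicy: none   # leave resources that already exist untouched
```

Leaving existing resources untouched keeps the restore from fighting with anything the GitOps tool has already reconciled.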

Deployment Strategies

Rolling Updates (Default)

Control update behavior with maxUnavailable and maxSurge:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0     # No downtime — new pods created before old ones removed
    maxSurge: 1           # One extra pod during rollout

The default maxUnavailable: 25% means if you have 100 pods, only 75 may be active during rollout. If your app needs 80+, set maxUnavailable: 20% or lower.
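
A worked instance of that arithmetic, as a sketch with a hypothetical `web` Deployment and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # hypothetical name
spec:
  replicas: 100
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 20%   # 20% of 100 = 20 pods may be down, so at least 80 stay available
      maxSurge: 25%         # up to 25 extra pods may exist during the rollout
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # placeholder image
```

When the percentage does not divide evenly, Kubernetes rounds `maxUnavailable` down and `maxSurge` up.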

If a rollout misbehaves, roll back quickly with kubectl rollout undo deployment <name>.

Blue/Green Deployments

Create a new Deployment identical to the current version, verify pods are healthy, then switch the Service selector to point to the new Deployment. Automate with Flux, Jenkins, Spinnaker, or AWS Load Balancer Controller.
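
The selector switch can be sketched as follows. It assumes the two Deployments label their pods `version: blue` and `version: green`; the labels, names, and ports are hypothetical conventions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web          # hypothetical name
spec:
  selector:
    app: web
    version: green   # previously "blue"; changing this value is the cutover
  ports:
    - port: 80
      targetPort: 8080
```

Rolling back is the same edit in reverse, provided the blue Deployment is kept running until the green version is verified.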

Canary Deployments

Deploy the new version with fewer replicas alongside the existing Deployment, divert a small percentage of traffic, and progressively increase if metrics are healthy. Use Flagger with Istio or AWS App Mesh for automated canary progression.
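
A sketch of automated canary progression using a Flagger Canary resource follows. It assumes Flagger and Istio are installed and that Flagger can query request metrics. The names, ports, and thresholds are illustrative only.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web                # hypothetical name
  namespace: production
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m      # how often to evaluate metrics
    threshold: 5      # failed checks before automatic rollback
    stepWeight: 10    # shift 10% more traffic per successful interval
    maxWeight: 50     # promote once the canary handles 50% of traffic
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99     # require at least 99% successful requests
        interval: 1m
```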

Large Cluster Guidance

For clusters approaching scale limits:

Issue	Threshold	Solution
kube-proxy latency	>1000 Services	Switch to `ipvs` mode
EC2 API throttling	Frequent node scaling	Configure CNI to cache IPs, use larger instance types
etcd size	Approaching 8GB	Monitor `apiserver_storage_size_bytes`, reduce CRD churn
DNS pressure	>500 nodes	Deploy NodeLocal DNSCache, enable CoreDNS auto-scaling
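
For the kube-proxy row, the switch to `ipvs` mode can be sketched as configuration values for the EKS kube-proxy managed add-on. This assumes the add-on is managed by EKS and that the required IPVS kernel modules are available on the nodes. The round-robin scheduler is one choice among several.

```yaml
# Configuration values for the kube-proxy EKS add-on (a sketch; verify against
# the add-on's configuration schema for your version).
mode: ipvs
ipvs:
  scheduler: rr   # round-robin; other IPVS schedulers are available
```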

Chaos Engineering with AWS FIS

AWS Fault Injection Service (FIS) provides managed chaos engineering experiments for EKS. FIS integrates with EKS to inject faults at the pod, node, and AZ level, validating that your resilience mechanisms (PDBs, topology spread, autoscaling) work as expected.

Common EKS Experiments

Experiment	What It Tests	Target
Pod delete	Self-healing, PDB behavior	Specific pods by label
Node terminate	Node replacement, pod rescheduling	EC2 instances in node group
AZ failure	Multi-AZ resilience, zonal shift	Subnet/AZ disruption
CPU stress	HPA scaling, resource limits	Pods or nodes
Network disruption	Timeout handling, circuit breakers	Pod network
DNS failure	DNS caching, fallback behavior	CoreDNS disruption
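
The pod delete experiment can be sketched as a CloudFormation resource. Every name, ARN, and account ID below is a placeholder, and the parameter names should be checked against the current FIS action reference. The template also assumes a Kubernetes service account with the RBAC permissions FIS needs already exists in the target namespace.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of a FIS pod-delete experiment (all names and ARNs are placeholders)
Resources:
  PodDeleteExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Delete one pod matching a label and observe recovery
      RoleArn: arn:aws:iam::111122223333:role/fis-eks-experiment-role   # placeholder
      Targets:
        webPods:
          ResourceType: aws:eks:pod
          SelectionMode: COUNT(1)          # affect a single pod
          Parameters:
            clusterIdentifier: my-cluster  # placeholder cluster
            namespace: staging
            selectorType: labelSelector
            selectorValue: app=web
      Actions:
        deletePod:
          ActionId: aws:eks:pod-delete
          Parameters:
            kubernetesServiceAccount: fis-experiment-sa   # placeholder service account
          Targets:
            Pods: webPods
      StopConditions:
        # Abort the experiment if this alarm fires.
        - Source: aws:cloudwatch:alarm
          Value: arn:aws:cloudwatch:us-east-1:111122223333:alarm:web-5xx-rate   # placeholder
      Tags:
        Name: eks-pod-delete
```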

Experiment Progression

Phase	Experiments	Scope
1. Start small	Delete single pod, verify PDB	One namespace, non-production
2. Node level	Terminate one node, verify rescheduling	One node group
3. Multi-pod	Delete multiple pods simultaneously	Multiple namespaces
4. AZ level	Simulate AZ failure, verify topology spread	One AZ
5. Steady state	Run experiments continuously in production	Automated, with guardrails

DO:

Start with non-production environments
Define steady-state hypothesis before each experiment (what "healthy" looks like)
Set stop conditions (abort if error rate exceeds threshold)
Run experiments during business hours with the team available

DON'T:

Run AZ-level experiments without verifying multi-AZ pod distribution first
Skip PDB validation before running node-terminate experiments
Run chaos experiments without monitoring and alerting in place
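
A PodDisruptionBudget to validate before node-level experiments might look like this sketch. The name, namespace, label, and threshold are hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # hypothetical name
  namespace: production
spec:
  minAvailable: 80%      # evictions are refused if they would drop availability below 80%
  selector:
    matchLabels:
      app: web
```

PDBs limit voluntary evictions such as node drains. They do not protect against abrupt instance termination, so a node-terminate experiment also tests how the workload behaves when the budget cannot be honored.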

Prerequisite reading: reliability-core.md.
