Source

This page is generated from skills/eks-operation-review/references/workload-configuration.md. Edit the source, not this page.

Vendored skill

This skill is sourced from eks-operation-review, also maintained by the APEX team.

Workload Configuration

Purpose

Assess workload resilience: resource requests/limits, health probes, disruption budgets, image hygiene, and storage configuration.

Checks to Execute

5.1 — Resource Requests and Limits

What to check:

Running pods missing resource requests or limits
LimitRange resources in namespaces
ResourceQuota resources in namespaces
Recent OOMKilled events
Admission webhooks enforcing resources (Kyverno, Gatekeeper)

How to check:

List pods (Running) across all namespaces → inspect spec.containers[].resources.requests and .limits
Count pods with no requests vs total running pods → calculate percentage
List LimitRange resources across all namespaces
List ResourceQuota resources across all namespaces
Get events with reason=OOMKilling (count occurrences — >5 in recent events = AMBER, >20 = RED)
List ValidatingWebhookConfigurations and MutatingWebhookConfigurations
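The percentage calculation in the steps above can be sketched as follows. This is a minimal illustration, assuming the pod list has already been fetched as JSON (for example in the shape returned by `kubectl get pods -A -o json`). The function names and the sample data are hypothetical, and the sketch treats a pod as "having requests" only when every container declares them, which is one possible reading of the check.

```python
from typing import Any


def has_requests(pod: dict[str, Any]) -> bool:
    """True only if every container in the pod declares at least one request."""
    containers = pod.get("spec", {}).get("containers", [])
    return bool(containers) and all(
        (c.get("resources") or {}).get("requests") for c in containers
    )


def requests_coverage(pod_list: dict[str, Any]) -> float:
    """Percentage of Running pods whose containers all declare requests."""
    running = [
        p for p in pod_list.get("items", [])
        if p.get("status", {}).get("phase") == "Running"
    ]
    if not running:
        return 0.0
    covered = sum(1 for p in running if has_requests(p))
    return 100.0 * covered / len(running)


# Hypothetical sample; real input would come from the cluster.
sample = {
    "items": [
        {"status": {"phase": "Running"},
         "spec": {"containers": [{"resources": {"requests": {"cpu": "100m"}}}]}},
        {"status": {"phase": "Running"},
         "spec": {"containers": [{"resources": {}}]}},
        {"status": {"phase": "Pending"},
         "spec": {"containers": [{"resources": {}}]}},
    ]
}
print(f"{requests_coverage(sample):.1f}% of running pods have requests")
```

The Pending pod is excluded from both the numerator and the denominator, so only Running pods influence the result.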

Rating:

🟢 GREEN: >90% of pods have requests, LimitRange/ResourceQuota in place, admission enforcement
🟡 AMBER: Most pods have requests but no enforcement mechanism, or frequent OOMKills
🔴 RED: Majority of pods missing requests, no LimitRange, no enforcement
⬜ UNKNOWN: Should not happen with live access
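One way to turn the rating criteria above into a repeatable decision is sketched below. Several interpretations here are assumptions rather than part of the source: "majority missing requests" is read as fewer than 50% of pods having requests, "LimitRange/ResourceQuota in place" is read as either one being present, and the OOM thresholds (more than 5 for AMBER, more than 20 for RED) are taken from the "How to check" step. The function name is hypothetical.

```python
def rate_resources(
    pct_with_requests: float,
    has_limit_range: bool,
    has_resource_quota: bool,
    has_admission_enforcement: bool,
    oom_events: int,
) -> str:
    """Map the 5.1 findings to GREEN / AMBER / RED (assumed interpretation)."""
    # Assumption: either LimitRange or ResourceQuota counts as "in place".
    enforced = (has_limit_range or has_resource_quota) and has_admission_enforcement

    # Assumption: a majority of pods without requests, or more than 20 OOM
    # events, is enough for RED on its own.
    if pct_with_requests < 50 or oom_events > 20:
        return "RED"
    if pct_with_requests > 90 and enforced and oom_events <= 5:
        return "GREEN"
    return "AMBER"


print(rate_resources(95.0, True, True, True, 0))
print(rate_resources(95.0, True, False, False, 0))
print(rate_resources(40.0, False, False, False, 0))
```

Anything that is neither clearly RED nor clearly GREEN falls through to AMBER, which keeps ambiguous clusters from being rated too generously.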

Key talking point: Without resource requests, the scheduler has no information about what a pod needs when it places it. Avoid setting CPU limits equal to requests, because a tight CPU limit can cause unnecessary throttling.

5.2 — Health Probes Configured

What to check:

Deployments missing readiness probes
Deployments missing liveness probes
Deployments missing startup probes (important for slow-starting apps: JVM-based apps such as Java, Kotlin, or Scala, or any app that takes longer than 10s to initialize)
Pods in CrashLoopBackOff (may indicate bad liveness probes)

How to check:

List Deployments across all namespaces → inspect containers for readinessProbe, livenessProbe, startupProbe
Count deployments missing each probe type
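List pods across all namespaces → check status.containerStatuses[].state.waiting.reason for CrashLoopBackOff (a crash-looping pod usually still reports phase Running, so filtering on phase alone can miss it)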
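The counting steps above can be sketched as follows, assuming Deployments and pods are available as JSON lists (for example from `kubectl get deployments -A -o json` and `kubectl get pods -A -o json`). The function names and sample data are hypothetical. The sketch counts a Deployment as missing a probe type if any of its containers lacks it, and it detects CrashLoopBackOff from the container waiting reason rather than from the pod phase.

```python
from collections import Counter
from typing import Any

PROBE_TYPES = ("readinessProbe", "livenessProbe", "startupProbe")


def missing_probe_counts(deployment_list: dict[str, Any]) -> Counter:
    """Count Deployments where at least one container lacks each probe type."""
    missing: Counter = Counter()
    for deploy in deployment_list.get("items", []):
        containers = (
            deploy.get("spec", {})
            .get("template", {})
            .get("spec", {})
            .get("containers", [])
        )
        for probe in PROBE_TYPES:
            if any(probe not in c for c in containers):
                missing[probe] += 1
    return missing


def crashlooping_pods(pod_list: dict[str, Any]) -> list[str]:
    """Return namespace/name for pods with a container in CrashLoopBackOff."""
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        if any(
            ((s.get("state") or {}).get("waiting") or {}).get("reason")
            == "CrashLoopBackOff"
            for s in statuses
        ):
            meta = pod.get("metadata", {})
            names.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return names


# Hypothetical sample data.
sample_deployments = {
    "items": [
        {"metadata": {"name": "api"},
         "spec": {"template": {"spec": {"containers": [
             {"name": "api", "readinessProbe": {}, "livenessProbe": {}}]}}}},
        {"metadata": {"name": "worker"},
         "spec": {"template": {"spec": {"containers": [{"name": "worker"}]}}}},
    ]
}
sample_pods = {
    "items": [
        {"metadata": {"namespace": "prod", "name": "worker-abc"},
         "status": {"phase": "Running", "containerStatuses": [
             {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"namespace": "prod", "name": "api-xyz"},
         "status": {"phase": "Running", "containerStatuses": [
             {"state": {"running": {}}}]}},
    ]
}
print(dict(missing_probe_counts(sample_deployments)))
print(crashlooping_pods(sample_pods))
```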

Rating:

🟢 GREEN: >90% of deployments have readiness probes, startup probes on slow-starting apps
🟡 AMBER: Readiness probes on most but not all, or no startup probes for JVM apps
🔴 RED: Majority of deployments missing readiness probes
⬜ UNKNOWN: Cannot determine if apps are slow-starting without more context

5.3 — Pod Disruption Budgets (PDBs)

What to check:

PDB resources and their settings
Multi-replica deployments without PDBs
PDBs with disruptionsAllowed=0 (blocks upgrades). If disruptionsAllowed=0 AND replicas=1, mark RED (single point of failure that also blocks node drains)
Single-replica deployments (inherently not disruption-safe)

How to check:

List PodDisruptionBudgets across all namespaces → check minAvailable, maxUnavailable, disruptionsAllowed
List Deployments with replicas > 1 → compare against PDB coverage
List Deployments with replicas == 1
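The comparison of Deployments against PDB coverage can be sketched as below. This is a simplified illustration with stated limitations: it only evaluates `matchLabels` selectors (ignoring `matchExpressions` and empty selectors), it matches against the Deployment's pod template labels, and the function names and sample objects are hypothetical.

```python
from typing import Any


def selects(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every matchLabels pair is present in the pod template labels."""
    return bool(match_labels) and all(
        labels.get(key) == value for key, value in match_labels.items()
    )


def qualified_name(obj: dict[str, Any]) -> str:
    meta = obj.get("metadata", {})
    return f"{meta.get('namespace')}/{meta.get('name')}"


def pdb_report(deployments: dict[str, Any], pdbs: dict[str, Any]) -> dict[str, list[str]]:
    """Group Deployments by PDB coverage and list PDBs that block disruptions."""
    report: dict[str, list[str]] = {
        "covered": [], "uncovered": [], "single_replica": [], "blocking_pdbs": [],
    }
    pdb_items = pdbs.get("items", [])
    for pdb in pdb_items:
        if pdb.get("status", {}).get("disruptionsAllowed") == 0:
            report["blocking_pdbs"].append(qualified_name(pdb))

    for deploy in deployments.get("items", []):
        spec = deploy.get("spec", {})
        replicas = spec.get("replicas", 1)  # Kubernetes defaults replicas to 1
        if replicas == 1:
            report["single_replica"].append(qualified_name(deploy))
        elif replicas > 1:
            labels = spec.get("template", {}).get("metadata", {}).get("labels", {})
            namespace = deploy.get("metadata", {}).get("namespace")
            has_pdb = any(
                pdb.get("metadata", {}).get("namespace") == namespace
                and selects(
                    pdb.get("spec", {}).get("selector", {}).get("matchLabels", {}),
                    labels,
                )
                for pdb in pdb_items
            )
            report["covered" if has_pdb else "uncovered"].append(qualified_name(deploy))
    return report


def _deploy(name: str, replicas: int) -> dict[str, Any]:
    """Build a hypothetical Deployment object for the example."""
    return {
        "metadata": {"namespace": "prod", "name": name},
        "spec": {"replicas": replicas,
                 "template": {"metadata": {"labels": {"app": name}}}},
    }


sample_deployments = {"items": [_deploy("web", 3), _deploy("batch", 2), _deploy("admin", 1)]}
sample_pdbs = {"items": [{
    "metadata": {"namespace": "prod", "name": "web-pdb"},
    "spec": {"minAvailable": 2, "selector": {"matchLabels": {"app": "web"}}},
    "status": {"disruptionsAllowed": 1},
}]}
print(pdb_report(sample_deployments, sample_pdbs))
```

A PDB that appears in `blocking_pdbs` and selects a Deployment in `single_replica` corresponds to the RED case described in the check list.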

Rating:

🟢 GREEN: PDBs on all multi-replica production deployments with reasonable settings
🟡 AMBER: PDBs on some but not all, or some PDBs blocking disruptions
🔴 RED: No PDBs at all, or critical workloads running single-replica
⬜ UNKNOWN: Cannot determine which deployments are "production" vs "dev"

5.4 — Image Tag Hygiene

What to check:

Running pods using :latest tag or no tag
ECR repositories: tag immutability and scan-on-push settings
Image registries in use (ECR vs Docker Hub vs other)

How to check:

List running pods → inspect container images for :latest or missing tag
Use AWS API to describe ECR repositories → check imageTagMutability and imageScanningConfiguration
Aggregate image registries from pod specs
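The image inspection steps above can be sketched as follows. The parsing follows the common container image reference convention (the first path component is a registry host when it contains a dot or a colon, or is `localhost`; otherwise Docker Hub is implied). The function names are hypothetical, digest-pinned images are treated as not floating, and the ECR helper assumes repository entries shaped like the `repositories` list returned by the ECR DescribeRepositories API.

```python
from typing import Any, Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    name = image.split("@", 1)[0]          # drop any digest
    last_segment = name.rsplit("/", 1)[-1]  # avoids mistaking a registry port for a tag
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return None


def is_floating(image: str) -> bool:
    """True for :latest or untagged images that are not pinned by digest."""
    if "@sha256:" in image:
        return False
    tag = image_tag(image)
    return tag is None or tag == "latest"


def registry_of(image: str) -> str:
    """Return the registry host, defaulting to docker.io when none is given."""
    first = image.split("/", 1)[0]
    if "/" in image and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def ecr_findings(repositories: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Flag ECR repositories with mutable tags or scan-on-push disabled."""
    findings: dict[str, list[str]] = {}
    for repo in repositories:
        issues = []
        if repo.get("imageTagMutability") != "IMMUTABLE":
            issues.append("tags are mutable")
        if not (repo.get("imageScanningConfiguration") or {}).get("scanOnPush", False):
            issues.append("scan-on-push disabled")
        if issues:
            findings[repo.get("repositoryName", "<unnamed>")] = issues
    return findings


for ref in ("nginx", "nginx:1.25", "localhost:5000/app", "quay.io/org/app:latest"):
    print(ref, is_floating(ref), registry_of(ref))
```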

Rating:

🟢 GREEN: No :latest in production, ECR with tag immutability, scan-on-push enabled
🟡 AMBER: Mostly versioned tags but some :latest, or ECR without immutability
🔴 RED: :latest widely used, or images from untrusted public registries
⬜ UNKNOWN: Cannot determine if tags are mutable without ECR access

5.5 — Persistent Volume & Stateful Workload Configuration

What to check:

StorageClasses: provisioner, reclaimPolicy, volumeBindingMode, gp2 vs gp3
PVCs and their status
CSI drivers installed
EBS CSI driver add-on status
VolumeSnapshotClasses (backup support)
StatefulSets
Deprecated in-tree volume plugin usage

How to check:

List StorageClasses → check for gp3 default, Retain policy, WaitForFirstConsumer
List PVCs across all namespaces
List CSIDrivers
Describe addon aws-ebs-csi-driver
List VolumeSnapshotClasses. If 404/NotFound (CRD not installed) → no snapshot support configured, factor into rating. If 403/Forbidden → mark snapshot capability UNKNOWN.
List StatefulSets across all namespaces
List PersistentVolumes → check for spec.awsElasticBlockStore (deprecated in-tree)
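The StorageClass and PersistentVolume checks above can be sketched as below, assuming both lists are available as JSON. The function names and sample objects are hypothetical. The sketch only flags gp2 when the `type` parameter is set explicitly, and it assumes the Kubernetes defaults of `Delete` and `Immediate` when `reclaimPolicy` or `volumeBindingMode` is absent from the object.

```python
from typing import Any

IN_TREE_EBS = "kubernetes.io/aws-ebs"
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def storage_class_findings(sc: dict[str, Any]) -> list[str]:
    """Return the 5.5 concerns that apply to a single StorageClass."""
    findings = []
    if sc.get("provisioner") == IN_TREE_EBS:
        findings.append("deprecated in-tree provisioner")
    if (sc.get("parameters") or {}).get("type") == "gp2":
        findings.append("gp2 volume type")
    if sc.get("reclaimPolicy", "Delete") != "Retain":
        findings.append("reclaimPolicy is not Retain")
    if sc.get("volumeBindingMode", "Immediate") != "WaitForFirstConsumer":
        findings.append("volumeBindingMode is not WaitForFirstConsumer")
    return findings


def is_default(sc: dict[str, Any]) -> bool:
    """True if the StorageClass carries the default-class annotation."""
    annotations = sc.get("metadata", {}).get("annotations") or {}
    return annotations.get(DEFAULT_ANNOTATION) == "true"


def in_tree_volumes(pv_list: dict[str, Any]) -> list[str]:
    """Names of PersistentVolumes that use spec.awsElasticBlockStore."""
    return [
        pv.get("metadata", {}).get("name")
        for pv in pv_list.get("items", [])
        if "awsElasticBlockStore" in pv.get("spec", {})
    ]


# Hypothetical sample objects.
legacy_sc = {
    "metadata": {"name": "gp2", "annotations": {DEFAULT_ANNOTATION: "true"}},
    "provisioner": IN_TREE_EBS,
    "parameters": {"type": "gp2"},
    "reclaimPolicy": "Delete",
    "volumeBindingMode": "Immediate",
}
modern_sc = {
    "metadata": {"name": "gp3"},
    "provisioner": "ebs.csi.aws.com",
    "parameters": {"type": "gp3"},
    "reclaimPolicy": "Retain",
    "volumeBindingMode": "WaitForFirstConsumer",
}
sample_pvs = {"items": [
    {"metadata": {"name": "pv-old"}, "spec": {"awsElasticBlockStore": {"volumeID": "vol-example"}}},
    {"metadata": {"name": "pv-new"}, "spec": {"csi": {"driver": "ebs.csi.aws.com"}}},
]}
print(storage_class_findings(legacy_sc), is_default(legacy_sc))
print(storage_class_findings(modern_sc), is_default(modern_sc))
print(in_tree_volumes(sample_pvs))
```

Whether a `Delete` reclaim policy is a real problem still depends on the workload, which is why the rating keeps an UNKNOWN option for that case.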

Rating:

🟢 GREEN: gp3, Retain policy, WaitForFirstConsumer, CSI driver managed, snapshots configured
🟡 AMBER: gp2 still in use, or Delete policy on production volumes, or no snapshots
🔴 RED: Deprecated in-tree plugin, Delete policy on databases, or no backup strategy
N/A: No stateful workloads on EKS
⬜ UNKNOWN: Cannot determine if Delete policy is intentional (dev) vs accidental (prod)
