Source
This page is generated from skills/eks-operation-review/references/addon-management.md. Edit the source, not this page.
Vendored skill
This skill is sourced from eks-operation-review, also maintained by the APEX team.
Add-on & Component Management
Purpose
Assess add-on management maturity, node health monitoring, and cluster insights usage.
Checks to Execute
10.1 — Core Add-ons Managed via EKS Managed Add-ons
What to check:
- All EKS managed add-ons: name, version, status, health
- Compare installed versions against latest compatible for the cluster version
- Self-managed add-ons in kube-system (Helm releases)
- Deprecated in-tree EBS plugin usage
How to check:
- List addons → describe each for version, status, health issues
- For each of the 4 core add-ons (vpc-cni, coredns, kube-proxy, aws-ebs-csi-driver):
  - Check if installed as managed add-on
  - Compare installed version vs latest compatible
- List PersistentVolumes → check for spec.awsElasticBlockStore (deprecated in-tree)
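The PersistentVolume check can be sketched as below. This is a minimal illustration, not part of the skill itself: it assumes the PV list has already been fetched and parsed (for example from `kubectl get pv -o json`), and the function name and sample data are hypothetical.

```python
def find_in_tree_ebs_volumes(pv_list: dict) -> list:
    """Return the names of PersistentVolumes that use the in-tree EBS plugin."""
    names = []
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        # In-tree EBS volumes carry spec.awsElasticBlockStore;
        # CSI-provisioned volumes carry spec.csi instead.
        if "awsElasticBlockStore" in spec:
            names.append(pv["metadata"]["name"])
    return names


# Hypothetical sample shaped like a Kubernetes PersistentVolumeList.
sample_pvs = {
    "items": [
        {"metadata": {"name": "legacy-data"},
         "spec": {"awsElasticBlockStore": {"volumeID": "vol-0abc", "fsType": "ext4"}}},
        {"metadata": {"name": "csi-data"},
         "spec": {"csi": {"driver": "ebs.csi.aws.com", "volumeHandle": "vol-0def"}}},
    ]
}

print(find_in_tree_ebs_volumes(sample_pvs))  # ['legacy-data']
```

Any name returned here counts as deprecated in-tree plugin usage for the rating.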
Rating:
- 🟢 GREEN: All core add-ons are EKS Managed, on latest or N-1 version, healthy
- 🟡 AMBER: Managed but behind (>1 minor version), or mix of managed and self-managed
- 🔴 RED: Core add-ons self-managed with no version tracking, health issues, or deprecated in-tree plugin
- ⬜ UNKNOWN: Cannot list add-ons
Key talking point: EKS does NOT auto-update add-ons when you upgrade the control plane. A cluster upgraded to 1.31 that still runs the vpc-cni version it had on 1.27 is a ticking time bomb.
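One possible way to turn the 10.1 rating criteria into code is sketched here. Several things are assumptions rather than part of the source: the `AddonState` record, the reading of "behind" as a gap in the add-on's own minor version, treating a major-version difference as far behind, and treating a core add-on with no managed entry as self-managed. "No version tracking" cannot be detected from the API, so the sketch only returns RED for that case when no core add-on is managed at all.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

CORE_ADDONS = ("vpc-cni", "coredns", "kube-proxy", "aws-ebs-csi-driver")


@dataclass
class AddonState:
    """Hypothetical record built from the add-on listing for one add-on."""
    name: str
    managed: bool          # installed as an EKS managed add-on
    installed: str = ""    # e.g. "v1.18.3-eksbuild.1"
    latest: str = ""       # latest version compatible with the cluster
    healthy: bool = True   # no health issues reported


def minor_gap(installed: str, latest: str) -> int:
    """Minor-version distance between two add-on version strings."""
    def parse(version: str):
        match = re.match(r"v?(\d+)\.(\d+)", version)
        if match is None:
            raise ValueError(f"unrecognized version: {version!r}")
        return int(match.group(1)), int(match.group(2))

    (i_major, i_minor), (l_major, l_minor) = parse(installed), parse(latest)
    if i_major != l_major:
        return 99  # assumption: a major-version difference counts as far behind
    return max(0, l_minor - i_minor)


def rate_core_addons(addons: Optional[List[AddonState]],
                     in_tree_ebs_in_use: bool = False) -> str:
    if addons is None:          # the add-on listing itself failed
        return "UNKNOWN"
    by_name = {a.name: a for a in addons}
    managed = [by_name[n] for n in CORE_ADDONS
               if n in by_name and by_name[n].managed]
    if in_tree_ebs_in_use or not managed or any(not a.healthy for a in managed):
        return "RED"
    if len(managed) < len(CORE_ADDONS):
        return "AMBER"          # mix of managed and self-managed
    if any(minor_gap(a.installed, a.latest) > 1 for a in managed):
        return "AMBER"          # managed but more than one minor version behind
    return "GREEN"


current = [AddonState(n, True, "v1.18.3-eksbuild.1", "v1.19.0-eksbuild.1")
           for n in CORE_ADDONS]
print(rate_core_addons(current))  # GREEN
```

The version strings in the usage example are illustrative only; real add-ons each follow their own version line.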
10.2 — Node Health Monitoring & Auto-Repair
What to check:
- EKS Node Monitoring Agent add-on (eks-node-monitoring-agent)
- Node auto-repair configuration on managed node groups
- GPU nodes (need NMA for GPU failure detection)
- Current node conditions
How to check:
- Describe addon eks-node-monitoring-agent
- List node groups → describe each → check nodeRepairConfig
- List nodes → check for nvidia.com/gpu in capacity (GPU nodes)
- List nodes → inspect conditions for MemoryPressure, DiskPressure, etc.
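The two node-level checks above might look like the following sketch. It assumes the node list has already been fetched and parsed (for example from `kubectl get nodes -o json`). The function names and sample data are hypothetical, and the rule that condition types ending in "Ready" should be "True" while all others should be "False" is a heuristic of this sketch, not something the source specifies.

```python
def gpu_nodes(node_list: dict) -> list:
    """Names of nodes that advertise nvidia.com/gpu in status.capacity."""
    names = []
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings in the Kubernetes API, e.g. "1".
        if int(capacity.get("nvidia.com/gpu", "0")) > 0:
            names.append(node["metadata"]["name"])
    return names


def unhealthy_conditions(node_list: dict) -> dict:
    """Map node name -> condition types whose status signals a problem."""
    problems = {}
    for node in node_list.get("items", []):
        bad = []
        for cond in node.get("status", {}).get("conditions", []):
            # Heuristic (assumption): "...Ready" conditions are healthy when
            # "True"; pressure-style conditions are healthy when "False".
            expected = "True" if cond["type"].endswith("Ready") else "False"
            if cond["status"] != expected:
                bad.append(cond["type"])
        if bad:
            problems[node["metadata"]["name"]] = bad
    return problems


# Hypothetical sample shaped like a Kubernetes NodeList.
sample_nodes = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"capacity": {"cpu": "8", "nvidia.com/gpu": "1"},
                    "conditions": [{"type": "Ready", "status": "True"},
                                   {"type": "MemoryPressure", "status": "False"}]}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"capacity": {"cpu": "4"},
                    "conditions": [{"type": "Ready", "status": "True"},
                                   {"type": "DiskPressure", "status": "True"}]}},
    ]
}

print(gpu_nodes(sample_nodes))             # ['gpu-node-1']
print(unhealthy_conditions(sample_nodes))  # {'cpu-node-1': ['DiskPressure']}
```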
Rating:
- 🟢 GREEN: NMA installed and node auto-repair enabled
- 🟡 AMBER: No NMA but node conditions monitored, or NMA without auto-repair
- 🔴 RED: No node health monitoring beyond basic Kubernetes conditions, especially with GPU workloads
- ⬜ UNKNOWN: Should not happen with live access
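As a rough sketch, the 10.2 rating could be expressed as a small function. The inputs are assumptions: `auto_repair` is a hypothetical mapping from managed node group name to whether node auto-repair is enabled for it, and `conditions_monitored` is a reviewer's judgment on whether node conditions are alerted on somewhere. Requiring auto-repair on every node group for GREEN is this sketch's reading of the rubric, and auto-repair without NMA falls through to the monitoring question because the rubric does not cover it.

```python
from typing import Dict


def rate_node_health(nma_installed: bool,
                     auto_repair: Dict[str, bool],
                     conditions_monitored: bool) -> str:
    # Assumption: GREEN requires auto-repair on every managed node group.
    repair_everywhere = bool(auto_repair) and all(auto_repair.values())
    if nma_installed and repair_everywhere:
        return "GREEN"
    if nma_installed or conditions_monitored:
        return "AMBER"   # NMA without auto-repair, or conditions monitored without NMA
    return "RED"         # nothing beyond basic Kubernetes conditions


print(rate_node_health(True, {"general": True, "gpu": True}, False))   # GREEN
print(rate_node_health(True, {"general": True, "gpu": False}, False))  # AMBER
print(rate_node_health(False, {}, False))                              # RED
```

When GPU nodes are present, a RED result here deserves extra emphasis in the review, since the source calls out GPU workloads specifically.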
10.3 — EKS Cluster Insights Reviewed
What to check:
- All cluster insights with status
- Count by status (PASSING, WARNING, ERROR)
- Details on any ERROR or WARNING insights
How to check:
- Get EKS Insights for the cluster
- For any non-PASSING insights → get detailed description and recommendation
Rating:
- 🟢 GREEN: Insights reviewed, no ERROR/WARNING, or all addressed
- 🟡 AMBER: WARNING insights unaddressed
- 🔴 RED: ERROR insights ignored
- ⬜ UNKNOWN: Insights API not accessible
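The count-by-status step and the 10.3 rating can be sketched together. The input is assumed to be a list of insight summaries in which each item has an `id` and an `insightStatus.status` field. `addressed_ids` is a hypothetical set of insight IDs the team has already handled, since "addressed" is a review judgment, not an API field. Passing `None` models an inaccessible Insights API.

```python
from collections import Counter
from typing import FrozenSet, List, Optional


def count_by_status(insights: List[dict]) -> Counter:
    """Count insights per status (PASSING, WARNING, ERROR, ...)."""
    return Counter(i["insightStatus"]["status"] for i in insights)


def rate_insights(insights: Optional[List[dict]],
                  addressed_ids: FrozenSet[str] = frozenset()) -> str:
    if insights is None:        # Insights API not accessible
        return "UNKNOWN"
    open_statuses = {i["insightStatus"]["status"]
                     for i in insights if i["id"] not in addressed_ids}
    if "ERROR" in open_statuses:
        return "RED"
    if "WARNING" in open_statuses:
        return "AMBER"
    # Simplification: any other per-insight status is treated as non-blocking.
    return "GREEN"


# Hypothetical sample; IDs and names are made up for illustration.
sample_insights = [
    {"id": "a1", "name": "Deprecated APIs", "insightStatus": {"status": "PASSING"}},
    {"id": "b2", "name": "Add-on compatibility", "insightStatus": {"status": "WARNING"}},
    {"id": "c3", "name": "Kubelet version skew", "insightStatus": {"status": "ERROR"}},
]

print(count_by_status(sample_insights)["ERROR"])                 # 1
print(rate_insights(sample_insights))                            # RED
print(rate_insights(sample_insights, frozenset({"c3"})))         # AMBER
print(rate_insights(sample_insights, frozenset({"b2", "c3"})))   # GREEN
```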