Source

This page is generated from skills/eks-operation-review/references/addon-management.md. Edit the source, not this page.

Vendored skill

This skill is sourced from eks-operation-review, also maintained by the APEX team.

Add-on & Component Management

Purpose

Assess add-on management maturity, node health monitoring, and cluster insights usage.

Checks to Execute

10.1 — Core Add-ons Managed via EKS Managed Add-ons

What to check:

All EKS managed add-ons: name, version, status, health
Compare installed versions against latest compatible for the cluster version
Self-managed add-ons in kube-system (Helm releases)
Deprecated in-tree EBS plugin usage

How to check:

List addons → describe each for version, status, health issues
For each of the 4 core add-ons (vpc-cni, coredns, kube-proxy, aws-ebs-csi-driver):
- Check if installed as managed add-on
- Compare installed version vs latest compatible
List PersistentVolumes → check for spec.awsElasticBlockStore (deprecated in-tree)
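
The in-tree plugin check can be sketched as a small filter. This is a minimal illustration, assuming the input is the parsed JSON from `kubectl get pv -o json`; the function name and the sample data are hypothetical and not part of the skill.

```python
def find_in_tree_ebs_volumes(pv_list):
    """Return names of PersistentVolumes that use the in-tree EBS plugin.

    pv_list is assumed to be the parsed JSON of `kubectl get pv -o json`.
    """
    flagged = []
    for pv in pv_list.get("items", []):
        # In-tree EBS volumes carry spec.awsElasticBlockStore;
        # CSI-provisioned volumes carry spec.csi instead.
        if "awsElasticBlockStore" in pv.get("spec", {}):
            flagged.append(pv["metadata"]["name"])
    return flagged


# Hypothetical sample: one legacy in-tree volume, one CSI volume.
sample_pvs = {
    "items": [
        {"metadata": {"name": "pv-legacy"},
         "spec": {"awsElasticBlockStore": {"volumeID": "vol-0abc", "fsType": "ext4"}}},
        {"metadata": {"name": "pv-csi"},
         "spec": {"csi": {"driver": "ebs.csi.aws.com", "volumeHandle": "vol-0def"}}},
    ]
}

print(find_in_tree_ebs_volumes(sample_pvs))
```

Any name returned here counts toward the deprecated in-tree plugin criterion in the rating that follows.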

Rating:

🟢 GREEN: All core add-ons are EKS Managed, on latest or N-1 version, healthy
🟡 AMBER: Managed but more than one minor version behind the latest compatible version, or a mix of managed and self-managed
🔴 RED: Core add-ons self-managed with no version tracking, health issues, or deprecated in-tree plugin
⬜ UNKNOWN: Cannot list add-ons
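
One possible way to encode this rubric is shown below. It is a sketch under stated assumptions, not the skill's actual logic: the record fields (`managed`, `versions_behind`, `healthy`) are hypothetical, and treating "all core add-ons self-managed" as "no version tracking" is an interpretation of the RED criterion.

```python
def rate_core_addons(core_addons, in_tree_ebs_in_use=False):
    """Map findings for the core add-ons to a rating string.

    core_addons: list of dicts with hypothetical keys
      "name", "managed" (bool), "versions_behind" (int), "healthy" (bool),
      or None when add-ons could not be listed.
    """
    if core_addons is None:
        return "UNKNOWN"

    # Health issues or the deprecated in-tree EBS plugin are RED on their own.
    if in_tree_ebs_in_use or any(not a["healthy"] for a in core_addons):
        return "RED"

    managed = [a for a in core_addons if a["managed"]]
    if not managed:
        # Assumption: nothing managed is treated as "no version tracking".
        return "RED"

    # A mix of managed and self-managed add-ons is AMBER.
    if len(managed) < len(core_addons):
        return "AMBER"

    # Managed, but more than one minor version behind, is AMBER.
    if any(a["versions_behind"] > 1 for a in managed):
        return "AMBER"

    return "GREEN"


# Hypothetical findings: everything managed, vpc-cni two versions behind.
findings = [
    {"name": "vpc-cni", "managed": True, "versions_behind": 2, "healthy": True},
    {"name": "coredns", "managed": True, "versions_behind": 0, "healthy": True},
    {"name": "kube-proxy", "managed": True, "versions_behind": 1, "healthy": True},
    {"name": "aws-ebs-csi-driver", "managed": True, "versions_behind": 0, "healthy": True},
]
print(rate_core_addons(findings))
```

Checking the RED conditions first keeps the most severe finding from being masked by a milder one.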

Key talking point: EKS does NOT auto-update add-ons when you upgrade the control plane. A cluster upgraded to 1.31 that still runs the vpc-cni version it had on 1.27 is a ticking time bomb.

10.2 — Node Health Monitoring & Auto-Repair

What to check:

EKS Node Monitoring Agent (NMA) add-on (eks-node-monitoring-agent)
Node auto-repair configuration on managed node groups
GPU nodes (need NMA for GPU failure detection)
Current node conditions

How to check:

Describe addon eks-node-monitoring-agent
List node groups → describe each → check nodeRepairConfig
List nodes → check for nvidia.com/gpu in capacity (GPU nodes)
List nodes → inspect conditions for MemoryPressure, DiskPressure, etc.
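
The node-level steps above can be sketched as three small filters. This assumes the node data is the parsed JSON from `kubectl get nodes -o json` and that each node group record is the `nodegroup` object from an EKS DescribeNodegroup response; the function names and sample data are hypothetical.

```python
# Standard kubelet conditions where status "True" signals a problem.
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def gpu_node_names(node_list):
    """Names of nodes that advertise nvidia.com/gpu in status.capacity."""
    names = []
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings in the Kubernetes API ("1", "8", ...).
        if capacity.get("nvidia.com/gpu", "0") != "0":
            names.append(node["metadata"]["name"])
    return names


def unhealthy_conditions(node_list):
    """Map node name -> condition types that indicate a problem."""
    problems = {}
    for node in node_list.get("items", []):
        bad = []
        for cond in node.get("status", {}).get("conditions", []):
            if cond["type"] == "Ready" and cond["status"] != "True":
                bad.append("NotReady")
            elif cond["type"] in PRESSURE_TYPES and cond["status"] == "True":
                bad.append(cond["type"])
        if bad:
            problems[node["metadata"]["name"]] = bad
    return problems


def nodegroups_without_auto_repair(nodegroups):
    """Names of node groups whose nodeRepairConfig is absent or disabled."""
    return [
        ng["nodegroupName"]
        for ng in nodegroups
        if not ng.get("nodeRepairConfig", {}).get("enabled", False)
    ]


# Hypothetical sample data.
sample_nodes = {"items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"capacity": {"cpu": "8", "nvidia.com/gpu": "1"},
                "conditions": [{"type": "Ready", "status": "True"},
                               {"type": "MemoryPressure", "status": "False"}]}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"capacity": {"cpu": "4"},
                "conditions": [{"type": "Ready", "status": "True"},
                               {"type": "DiskPressure", "status": "True"}]}},
]}
sample_nodegroups = [
    {"nodegroupName": "general", "nodeRepairConfig": {"enabled": True}},
    {"nodegroupName": "gpu"},
]

print(gpu_node_names(sample_nodes))
print(unhealthy_conditions(sample_nodes))
print(nodegroups_without_auto_repair(sample_nodegroups))
```

The condition check is deliberately limited to the standard kubelet condition types, since a monitoring agent may publish additional conditions with different semantics.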

Rating:

🟢 GREEN: NMA installed and node auto-repair enabled
🟡 AMBER: No NMA but node conditions monitored, or NMA without auto-repair
🔴 RED: No node health monitoring beyond basic Kubernetes conditions, especially with GPU workloads
⬜ UNKNOWN: Not expected when live cluster access is available

10.3 — EKS Cluster Insights Reviewed

What to check:

All cluster insights with status
Count by status (PASSING, WARNING, ERROR)
Details on any ERROR or WARNING insights

How to check:

Get EKS Insights for the cluster
For any non-PASSING insights → get detailed description and recommendation
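
Counting by status and picking out the insights that need follow-up might look like the sketch below. It assumes the input is the `insights` list from an EKS ListInsights response, where each entry carries `insightStatus.status`; the function name and sample entries are hypothetical.

```python
from collections import Counter


def summarize_insights(insights):
    """Return (counts by status, list of non-PASSING insights)."""
    counts = Counter(i["insightStatus"]["status"] for i in insights)
    non_passing = [i for i in insights if i["insightStatus"]["status"] != "PASSING"]
    return counts, non_passing


# Hypothetical sample entries; real responses include more fields.
sample_insights = [
    {"id": "a1", "name": "Deprecated APIs removed in Kubernetes v1.32",
     "insightStatus": {"status": "PASSING"}},
    {"id": "b2", "name": "kube-proxy version skew",
     "insightStatus": {"status": "WARNING"}},
    {"id": "c3", "name": "EKS add-on version compatibility",
     "insightStatus": {"status": "ERROR"}},
]

counts, non_passing = summarize_insights(sample_insights)
print(dict(counts))
print([i["id"] for i in non_passing])
```

Each ID in the non-passing list is a candidate for a detailed lookup to retrieve its description and recommendation.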

Rating:

🟢 GREEN: Insights reviewed, no ERROR/WARNING, or all addressed
🟡 AMBER: WARNING insights unaddressed
🔴 RED: ERROR insights ignored
⬜ UNKNOWN: Insights API not accessible
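
As a rough illustration, the rubric above could be applied like this. The `addressed_ids` parameter is an assumption introduced here to represent insights the team has already reviewed and dealt with; whether an insight counts as addressed is a judgment the reviewer supplies, not something the API reports.

```python
def rate_insights(insights, addressed_ids=frozenset()):
    """Map cluster insights to a rating string.

    insights: list shaped like the EKS ListInsights `insights` field,
      or None when the Insights API is not accessible.
    addressed_ids: hypothetical set of insight IDs already addressed.
    """
    if insights is None:
        return "UNKNOWN"

    # Only insights that have not been addressed affect the rating.
    open_statuses = {
        i["insightStatus"]["status"]
        for i in insights
        if i["id"] not in addressed_ids
    }
    if "ERROR" in open_statuses:
        return "RED"
    if "WARNING" in open_statuses:
        return "AMBER"
    return "GREEN"


# Hypothetical sample: one WARNING and one ERROR insight.
open_insights = [
    {"id": "b2", "insightStatus": {"status": "WARNING"}},
    {"id": "c3", "insightStatus": {"status": "ERROR"}},
]
print(rate_insights(open_insights))
print(rate_insights(open_insights, addressed_ids={"c3"}))
print(rate_insights(open_insights, addressed_ids={"b2", "c3"}))
```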
