Source

This page is generated from skills/eks-operation-review/references/observability.md. Edit the source, not this page.

Vendored skill

This skill is sourced from eks-operation-review, also maintained by the APEX team.

Observability

Purpose

Assess observability across three layers: control plane, data plane (nodes), and workloads — covering metrics, logs, and alerting.

Checks to Execute

4.1 — EKS Control Plane Logging

What to check:

Which of the 5 log types are enabled (api, audit, authenticator, controllerManager, scheduler)
CloudWatch log group existence and retention policy

How to check:

Describe cluster → logging.clusterLogging → check each entry for enabled: true and which types
Use CloudWatch tools to check log group /aws/eks/{cluster-name}/cluster retention (recommend >= 30 days for audit logs; no retention policy = logs kept forever at cost)

Rating:

🟢 GREEN: All 5 log types enabled with defined retention policy
🟡 AMBER: Some log types enabled (especially if audit is on), or no retention policy
🔴 RED: Control plane logging completely disabled, or audit logs specifically disabled
⬜ UNKNOWN: Should not happen with live access

Key talking point: EKS control plane logging is OFF by default. The audit log is your security camera for every API call.

4.2 — Metrics Collection & Dashboards

What to check:

CloudWatch Container Insights add-on (amazon-cloudwatch-observability)
Prometheus pods (labels: app.kubernetes.io/name=prometheus or app=prometheus)
Grafana pods
kube-state-metrics deployment (critical for cluster state visibility)
node-exporter DaemonSet
ADOT add-on
Third-party monitoring DaemonSets (Datadog, New Relic, Dynatrace)

How to check:

Describe addon amazon-cloudwatch-observability
List pods with label app.kubernetes.io/name=prometheus across all namespaces
List pods with label app.kubernetes.io/name=grafana
List pods with label app.kubernetes.io/name=kube-state-metrics
List DaemonSets across all namespaces (catches node-exporter and third-party agents)

Rating:

🟢 GREEN: Metrics collection + kube-state-metrics + dashboards (Container Insights or Prometheus+Grafana or third-party)
🟡 AMBER: Partial stack (e.g., Container Insights but no kube-state-metrics, or Prometheus without Grafana)
🔴 RED: No metrics collection at all
⬜ UNKNOWN: Cannot determine if dashboards are actively used

4.3 — Centralized Log Aggregation for Workloads

What to check:

Fluent Bit DaemonSet (labels: app.kubernetes.io/name=fluent-bit or k8s-app=fluent-bit)
Fluentd DaemonSet
CloudWatch agent DaemonSet in amazon-cloudwatch namespace
Application log groups in CloudWatch

How to check:

List DaemonSets with Fluent Bit labels across all namespaces
List DaemonSets in amazon-cloudwatch namespace
Use CloudWatch tools to check for log groups with prefix /aws/eks/{cluster-name}

Rating:

🟢 GREEN: Log shipper deployed, logs centralized with retention policy, structured logging
🟡 AMBER: Log shipper exists but no retention policy, or unstructured logging
🔴 RED: No centralized log collection — teams rely on kubectl logs
⬜ UNKNOWN: Cannot determine log format (structured vs unstructured) without sampling

4.4 — Alerting Defined and Actionable

What to check:

CloudWatch Alarms related to EKS/ContainerInsights
Prometheus Alertmanager pods
PrometheusRule resources (alert definitions)

How to check:

List pods with label app.kubernetes.io/name=alertmanager
List PrometheusRule resources. If 404/NotFound (CRD not installed) → Prometheus Operator not deployed, rate alerting based on CloudWatch only. If 403/Forbidden → mark UNKNOWN.
Use CloudWatch tools to list alarms with ContainerInsights namespace

Rating:

🟢 GREEN: Alerts cover critical scenarios (node, pod, capacity), routed to on-call
🟡 AMBER: Some alerts exist but incomplete coverage, or no runbooks linked
🔴 RED: No alerting configured
⬜ UNKNOWN: Cannot determine if alerts have runbooks or if on-call monitors them — suggest user investigate

Minimum viable alert set: NodeNotReady, PodCrashLooping, PodPendingTooLong, HighAPIServerLatency, IPExhaustion.

Purpose​

Checks to Execute​

4.1 — EKS Control Plane Logging​

4.2 — Metrics Collection & Dashboards​

4.3 — Centralized Log Aggregation for Workloads​

4.4 — Alerting Defined and Actionable​

Purpose

Checks to Execute

4.1 — EKS Control Plane Logging

4.2 — Metrics Collection & Dashboards

4.3 — Centralized Log Aggregation for Workloads

4.4 — Alerting Defined and Actionable