This page is generated from skills/eks-best-practices/references/observability.md. Edit the source, not this page.

EKS Observability Best Practices

Part of: eks-best-practices
Purpose: Monitoring, logging, tracing, alerting, network performance monitoring, GPU observability, and detective controls for Amazon EKS clusters


Table of Contents

  1. Observability Strategy
  2. CloudWatch Container Insights
  3. CloudWatch Application Signals
  4. Prometheus and Grafana
  5. Control Plane Monitoring
  6. Network Performance Monitoring
  7. Logging Architecture
  8. Distributed Tracing
  9. GPU and AI/ML Observability
  10. Detective Controls
  11. Alerting Patterns
  12. Monitoring High Availability
  13. Multi-Tenant Observability Isolation
  14. Tiered Log Retention Architecture

Observability Strategy

Three Pillars for EKS

Pillar | AWS-Managed Option | Open Source Option
Metrics | CloudWatch Container Insights | Amazon Managed Prometheus (AMP) + Grafana
Logs | CloudWatch Logs | OpenSearch, Loki
Traces | AWS X-Ray / Application Signals | OpenTelemetry + Jaeger/Tempo

Decision Matrix

Factor | CloudWatch Native | Managed Prometheus + Grafana
Setup effort | Low (EKS add-on) | Medium (AMP workspace + ADOT)
Custom metrics | Limited (Container Insights) | Full PromQL
APM | Application Signals (auto-instrument) | Manual instrumentation
Dashboarding | CloudWatch dashboards | Grafana (rich ecosystem)
Cost model | Per metric, per log GB | Per metric sample ingested, plus storage and queries
Multi-cluster | Per-account aggregation | Central AMP workspace
Alerting | CloudWatch Alarms | Prometheus Alertmanager
Recommendation | Simple setups, AWS-native | Production, complex monitoring

Key Supporting Components

Component | Purpose | Deploy As
kube-state-metrics (KSM) | Kubernetes object state (deployments, pods, nodes) | Deployment
Metrics Server | CPU/memory for HPA and kubectl top | Deployment
Fluent Bit | Log collection and forwarding | DaemonSet
ADOT Collector | Metrics, traces, logs collection (OpenTelemetry) | DaemonSet or sidecar
DCGM Exporter | GPU metrics (NVIDIA) | DaemonSet

CloudWatch Container Insights

Enable Container Insights

# Enable via EKS add-on (recommended -- enables both Container Insights and Application Signals)
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability

Key Container Insights Metrics

Metric | Level | Alert On
node_cpu_utilization | Node | > 80% sustained
node_memory_utilization | Node | > 85% sustained
pod_cpu_utilization | Pod | > 90% of request
pod_memory_utilization | Pod | > 85% of limit
pod_number_of_container_restarts | Pod | > 3 in 5 minutes
node_filesystem_utilization | Node | > 80%
cluster_failed_node_count | Cluster | > 0

Enhanced Observability

Enhanced observability (EKS add-on v1.5+) provides:

  • Automatic pod-level Prometheus metrics collection
  • EKS control plane metrics
  • GPU metrics for ML workloads (via DCGM Exporter integration)
  • Integration with CloudWatch Application Signals for APM

CloudWatch Application Signals

Application Signals provides auto-instrumentation APM for EKS workloads -- it automatically collects metrics, traces, and service maps without code changes.

Supported Languages

Language | Auto-Instrumentation | Notes
Java | Yes | Broadest coverage
Python | Yes | --
Node.js | Yes | ESM modules require special setup
.NET | Yes | --
What It Provides

Capability | Detail
Service map | Auto-discovered dependency graph of all instrumented services
SLO monitoring | Define and track service-level objectives (latency, availability)
Correlated traces | Automatic trace correlation with metrics and logs
Pre-built dashboards | Latency, error rate, throughput per service -- no setup required

How to Enable

Enable via the CloudWatch Observability EKS add-on (same add-on as Container Insights), then annotate workloads:

# Annotate the Deployment's pod template to enable auto-instrumentation
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true" # or inject-python, inject-nodejs, inject-dotnet

Or enable via the AWS console for selected namespaces/workloads.

Application Signals is the simplest path to APM on EKS -- if you're already using the CloudWatch Observability add-on, it's one annotation per workload.


Prometheus and Grafana

Amazon Managed Prometheus (AMP) Setup

# Create AMP workspace
aws amp create-workspace --alias my-cluster-metrics

# Deploy ADOT collector for metric collection
helm install adot-collector \
  open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set config.exporters.prometheusremotewrite.endpoint=<AMP_REMOTE_WRITE_URL> \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<IRSA_ROLE_ARN>

Key Prometheus Metrics for EKS

Cluster health:

# Nodes not ready
kube_node_status_condition{condition="Ready",status="true"} == 0

# Pod restart rate (per namespace)
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace) > 5

# Pending pods (scheduling issues)
kube_pod_status_phase{phase="Pending"} > 0 # for > 5 minutes

# PVC pending (storage issues)
kube_persistentvolumeclaim_status_phase{phase="Pending"} > 0

Resource efficiency:

# CPU request vs actual usage (over-provisioning)
1 - (
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
  /
  sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
)

# Memory utilization vs requests
sum(container_memory_working_set_bytes{container!=""}) by (namespace)
/
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace)
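
As a worked instance of the over-provisioning expression above, the following Python sketch applies the same arithmetic (1 minus usage divided by requests) to made-up numbers. The function name and the sample values are illustrative assumptions and are not part of any AWS or Prometheus API.

```python
def overprovisioning_ratio(cpu_used_cores: float, cpu_requested_cores: float) -> float:
    """Fraction of requested CPU that is not being used (0.0 means fully used)."""
    if cpu_requested_cores <= 0:
        raise ValueError("requested cores must be positive")
    return 1 - cpu_used_cores / cpu_requested_cores


# Hypothetical namespace: 8 cores requested, 2 cores used on average.
ratio = overprovisioning_ratio(cpu_used_cores=2.0, cpu_requested_cores=8.0)
print(f"{ratio:.0%} of requested CPU is idle")
```

A value near 0 means requests track real usage; a negative value means the namespace uses more CPU than it requests.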

Amazon Managed Grafana (AMG)

Recommended dashboards:

  • Kubernetes cluster overview (ID: 3119)
  • Node Exporter (ID: 1860)
  • Karpenter dashboard (ID: 20398)
  • CoreDNS dashboard (ID: 15762)
  • API server troubleshooter: Troubleshooting Dashboards

Control Plane Monitoring

EKS exposes Prometheus metrics for API server and etcd. Effective control plane monitoring helps distinguish between API server bottlenecks and downstream issues (etcd, webhooks, controllers).

API Server Metrics

Metric | What It Tells You
apiserver_request_duration_seconds | Request latency by verb and resource
apiserver_request_total | Request volume and error rates (watch for 429s and 5xx)
apiserver_flowcontrol_nominal_limit_seats | APF priority group capacity
apiserver_flowcontrol_current_inqueue_requests | Requests queued per APF bucket (non-zero = saturation)
apiserver_flowcontrol_rejected_requests_total | Requests dropped per APF bucket
apiserver_admission_controller_admission_duration_seconds | Admission webhook latency

API vs etcd Latency

When API server latency is high, check if etcd is the bottleneck:

# API server request latency heatmap (use max, not avg, across API servers)
max(increase(apiserver_request_duration_seconds_bucket{subresource!="status",subresource!="token",verb!="WATCH"}[$__rate_interval])) by (le)

# etcd request latency
etcd_request_duration_seconds

If 15 seconds of etcd latency occurs alongside 20 seconds of API latency, etcd accounts for most of the delay, so the root cause is etcd, not the API server. Always check the whole chain before tuning one component.
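
The attribution in that example is simple arithmetic, sketched below in Python. The function name is hypothetical, and the sketch assumes both latencies were measured for the same requests so that the etcd time is contained in the API server time.

```python
def latency_breakdown(api_latency_s: float, etcd_latency_s: float) -> dict:
    """Split API server latency into the etcd part and everything else."""
    if api_latency_s <= 0 or not 0 <= etcd_latency_s <= api_latency_s:
        raise ValueError("expected 0 <= etcd latency <= API latency, API latency > 0")
    return {
        "etcd_share": etcd_latency_s / api_latency_s,
        "non_etcd_s": api_latency_s - etcd_latency_s,
    }


# The figures from the text: 20 s observed at the API server, 15 s spent in etcd.
breakdown = latency_breakdown(api_latency_s=20.0, etcd_latency_s=15.0)
print(breakdown)
```

Here etcd explains 75% of the delay, leaving 5 seconds for the API server, webhooks, and everything else in the chain.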

Asymmetric Traffic

EKS runs 2+ API servers. Never average metrics across them -- one server may be overloaded while others are idle. Use max or break out by instance.
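
A small Python sketch with invented numbers shows why averaging hides an overloaded instance. The instance names and latency values are assumptions chosen for illustration only.

```python
from statistics import mean


def summarize_latency(p99_by_server: dict) -> dict:
    """Compare the cross-server average with the worst single server."""
    worst = max(p99_by_server, key=p99_by_server.get)
    return {
        "mean": mean(p99_by_server.values()),
        "max": p99_by_server[worst],
        "worst_instance": worst,
    }


# Hypothetical p99 latencies in seconds: two idle servers, one overloaded.
samples = {"apiserver-a": 0.2, "apiserver-b": 0.3, "apiserver-c": 9.5}
summary = summarize_latency(samples)
print(summary)
```

The average (about 3.3 s) looks merely slow, while the maximum (9.5 s on apiserver-c) shows that one instance is in trouble.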

Finding Noisy Controllers

Use CloudWatch Logs Insights to identify controllers making excessive LIST calls:

fields @timestamp, @message
| filter @logStream like "kube-apiserver-audit"
| filter verb = "list"
| parse requestReceivedTimestamp /\d+-\d+-(?<StartDay>\d+)T(?<StartHour>\d+):(?<StartMinute>\d+):(?<StartSec>\d+).(?<StartMsec>\d+)Z/
| parse stageTimestamp /\d+-\d+-(?<EndDay>\d+)T(?<EndHour>\d+):(?<EndMinute>\d+):(?<EndSec>\d+).(?<EndMsec>\d+)Z/
| fields (StartHour * 3600 + StartMinute * 60 + StartSec + StartMsec / 1000000) as StartTime, (EndHour * 3600 + EndMinute * 60 + EndSec + EndMsec / 1000000) as EndTime, (EndTime - StartTime) as DeltaTime
| stats avg(DeltaTime) as AverageDeltaTime, count(*) as CountTime by requestURI, userAgent
| filter CountTime >=50
| sort AverageDeltaTime desc

See also: Scalability -- Control Plane Scaling for APF tuning and burst limit guidance


Network Performance Monitoring

Network performance issues (DNS throttling, connection tracking exhaustion, bandwidth limits) are among the hardest EKS problems to diagnose because packet drops occur within seconds and are not captured by VPC flow logs.

ENA Driver Metrics

The Elastic Network Adapter (ENA) driver exposes metrics that reveal network-level bottlenecks. Every _exceeded counter should be zero in a healthy cluster; conntrack_allowance_available is a gauge of remaining capacity and should stay well above zero:

Metric | What It Means | Impact
linklocal_allowance_exceeded | PPS limit hit for local proxy services (DNS, IMDS, NTP) | DNS lookup failures, metadata timeouts
conntrack_allowance_exceeded | Connection tracking table full -- no new connections | Service-to-service connectivity failures
conntrack_allowance_available | Remaining tracked connections before limit | Early warning for conntrack exhaustion
bw_in_allowance_exceeded | Inbound bandwidth limit hit | Packet queuing/drops
bw_out_allowance_exceeded | Outbound bandwidth limit hit | Packet queuing/drops
pps_allowance_exceeded | Bidirectional PPS limit hit | Packet queuing/drops

Collecting ENA Metrics

Deploy Prometheus Node Exporter with the ethtool collector:

# Quote each --set argument so the shell does not try to expand the brackets
helm upgrade -i prometheus-node-exporter prometheus-community/prometheus-node-exporter \
  --set 'extraArgs[0]=--collector.ethtool' \
  --set 'extraArgs[1]=--collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+' \
  --set 'extraArgs[2]=--collector.ethtool.metrics-include=.*'

Then scrape with ADOT or Prometheus and store in AMP. Set alerts for any non-zero _exceeded metric.

DNS Throttling Detection

DNS queries are throttled at the ENI level (1024 PPS limit). Throttled queries don't appear in query logging or flow logs. The only reliable signal is linklocal_allowance_exceeded.

Remediation:

  • Increase CoreDNS replicas (anti-affinity spreads them across ENIs)
  • Deploy NodeLocal DNSCache
  • Lower ndots to reduce query volume (see Scalability -- CoreDNS)
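
To see why lowering ndots reduces query volume, the Python sketch below counts the DNS queries a client issues for one external hostname. It is a simplified model under stated assumptions: the name only resolves as an absolute name, every search-domain attempt misses, the pod has four search domains, and each lookup sends both an A and an AAAA query. The function and its defaults are hypothetical.

```python
def dns_queries_for_external_name(name: str, ndots: int = 5,
                                  search_domains: int = 4,
                                  record_types: int = 2) -> int:
    """Estimate queries sent by the resolver for one external hostname."""
    if name.endswith(".") or name.count(".") >= ndots:
        # Fully qualified, or enough dots: tried as an absolute name first.
        lookups = 1
    else:
        # Each search domain is tried (and misses) before the absolute name.
        lookups = search_domains + 1
    return lookups * record_types


default_cost = dns_queries_for_external_name("api.example.com", ndots=5)
tuned_cost = dns_queries_for_external_name("api.example.com", ndots=2)
print(default_cost, tuned_cost)
```

Under these assumptions the same lookup drops from 10 queries to 2, and a trailing dot (api.example.com.) has the same effect without changing ndots.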

Logging Architecture

Log Types and Destinations

Log Type | Source | Recommended Destination
Control plane logs | EKS API server, audit, scheduler, controller manager, authenticator | CloudWatch Logs (/aws/eks/<cluster>/cluster)
Application logs | Container stdout/stderr | CloudWatch Logs or OpenSearch
Node logs | kubelet, containerd | CloudWatch Logs via agent
Data plane logs | VPC CNI, kube-proxy | CloudWatch Logs

Fluent Bit Configuration

# Deploy as EKS add-on (recommended)
aws eks create-addon --cluster-name my-cluster --addon-name aws-for-fluent-bit

Scaling Fluent Bit for large clusters:

Setting | Purpose
Use_Kubelet: On | Fetch pod metadata from local kubelet instead of API server -- critical at scale
Kube_Meta_Cache_TTL: 60 | Cache metadata for 60+ seconds to reduce API calls
Buffer_Chunk_Size / Buffer_Max_Size | Tune for log volume to prevent backpressure

For Fargate pods, use the built-in Fluent Bit log router (no sidecar is needed) -- configure it with a ConfigMap named aws-logging in the aws-observability namespace to send logs to CloudWatch Logs.

Log Retention Strategy

Log Type | Retention | Reason
Audit logs | 90-365 days | Compliance, forensics
Application logs | 14-30 days | Debugging
Control plane logs | 30-90 days | Troubleshooting
Access logs (ALB) | 90 days | Security review

Cost optimization -- hot/warm/cold architecture:

  • Hot (0-30 days): CloudWatch Logs -- fast queries with Logs Insights
  • Warm (30-90 days): Export to S3 Standard/IA via subscription filters
  • Cold (90+ days): S3 Glacier for compliance retention

Structured Logging

DO:

  • Output logs as JSON for structured parsing
  • Include request ID, trace ID, and user context in every log line
  • Log at appropriate levels (ERROR, WARN, INFO)
  • Use Kubernetes labels/annotations to add metadata

DON'T:

  • Log sensitive data (tokens, passwords, PII)
  • Use unstructured text logs in production
  • Log at DEBUG level in production (volume + cost)
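
The guidelines above can be sketched with Python's standard logging module. This is a minimal illustration, not a prescribed library or schema: the field names (request_id, trace_id), the logger name, and the in-memory stream are assumptions, and a real service would write to stdout so the node's log collector picks the lines up.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation fields are passed via `extra=`; None if absent.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # DEBUG is dropped, as recommended above
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for sys.stdout in this sketch
log = build_logger(buffer)
log.info("order placed", extra={"request_id": "req-123", "trace_id": "trace-abc"})
log.debug("cart contents dumped")  # below INFO, so never emitted
line = json.loads(buffer.getvalue().strip())
print(line["level"], line["message"], line["trace_id"])
```

Only the INFO line is written, and it carries the request and trace IDs as top-level JSON keys that Logs Insights or OpenSearch can filter on.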

Distributed Tracing

AWS Distro for OpenTelemetry (ADOT)

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      awsxray:
        region: us-east-1
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]

X-Ray vs OpenTelemetry

Factor | AWS X-Ray | OpenTelemetry + Jaeger/Tempo
Setup | Simple with ADOT | More configuration
AWS integration | Native (Lambda, API GW, etc.) | Manual
Vendor lock-in | AWS-specific | Vendor-neutral
Querying | X-Ray console, CloudWatch | Grafana (richer)
Cost | Per trace recorded | Storage-dependent

Strategic Sampling

Not all traces need the same sampling rate. Configure higher rates for critical paths and lower rates for high-volume, low-value routes:

Traffic Type | Suggested Rate | Rationale
Critical user paths (checkout, login) | 100% or high | Full visibility for business-critical flows
Health checks, readiness probes | 0-1% | High volume, low diagnostic value
Internal service-to-service | 5-10% | Balance cost with troubleshooting needs

Use X-Ray sampling rules or OpenTelemetry tail-based sampling in the ADOT collector to implement this.
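
The idea of per-route rates can be sketched in Python as a deterministic, head-based sampling decision. This is not X-Ray sampling rule syntax or the collector's tail-sampling configuration; the route prefixes, rates, and function names are assumptions chosen to mirror the table above.

```python
import hashlib

# Hypothetical route prefixes and rates, following the suggested ranges.
SAMPLE_RATES = [
    ("/healthz", 0.0),
    ("/readyz", 0.0),
    ("/checkout", 1.0),
    ("/login", 1.0),
]
DEFAULT_RATE = 0.05  # internal service-to-service traffic


def sample_rate_for(route: str) -> float:
    for prefix, rate in SAMPLE_RATES:
        if route.startswith(prefix):
            return rate
    return DEFAULT_RATE


def should_sample(trace_id: str, route: str) -> bool:
    """Hash the trace ID into [0, 1) so every service makes the same choice."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate_for(route)


print(should_sample("trace-1", "/checkout/pay"))  # critical path: always kept
print(should_sample("trace-1", "/healthz"))       # health check: never kept
```

Hashing the trace ID instead of calling a random generator keeps the decision consistent across services, so a sampled trace is not missing spans from the middle of the call chain.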


GPU and AI/ML Observability

For clusters running GPU workloads (training, inference), standard CPU/memory metrics are insufficient. GPU utilization, memory, power, and SM activity need dedicated monitoring.

GPU Metrics

Metric | Source | What It Tells You
DCGM_FI_DEV_GPU_UTIL | DCGM Exporter | GPU utilization % (time executing any kernel)
DCGM_FI_DEV_MEM_COPY_UTIL | DCGM Exporter | Memory controller utilization
DCGM_FI_DEV_POWER_USAGE | DCGM Exporter | Power draw -- best proxy for actual GPU engagement
DCGM_FI_PROF_SM_ACTIVE | DCGM Exporter | Streaming multiprocessor activity -- true parallelism
DCGM_FI_DEV_XID_ERRORS | DCGM Exporter | GPU error codes -- non-zero needs investigation

GPU utilization alone is misleading -- 100% can mean a single lightweight kernel or a fully parallel workload. Compare power draw against Thermal Design Power (TDP) to spot real underutilization.
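
The power-versus-TDP comparison can be sketched in Python as follows. The 40% power ratio and 90% utilization cut-offs are illustrative assumptions, not NVIDIA or AWS guidance, and the TDP value must come from the data sheet of the GPU in use.

```python
def gpu_engagement(power_watts: float, tdp_watts: float, gpu_util_pct: float,
                   low_power_ratio: float = 0.4, busy_util_pct: float = 90.0):
    """Return (power/TDP ratio, flag for 'busy but barely engaged')."""
    if tdp_watts <= 0:
        raise ValueError("TDP must be positive")
    ratio = power_watts / tdp_watts
    # High reported utilization with low power draw suggests light kernels.
    suspicious = gpu_util_pct >= busy_util_pct and ratio < low_power_ratio
    return ratio, suspicious


# Hypothetical reading: 100% utilization but only 120 W on a 400 W part.
ratio, suspicious = gpu_engagement(power_watts=120, tdp_watts=400, gpu_util_pct=100)
print(f"power/TDP = {ratio:.2f}, likely underutilized = {suspicious}")
```

In practice the inputs would come from DCGM_FI_DEV_POWER_USAGE and DCGM_FI_DEV_GPU_UTIL for the same GPU over the same time window.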

Collecting GPU Metrics

The CloudWatch Observability add-on auto-deploys DCGM Exporter on GPU nodes. Alternatively, deploy DCGM Exporter manually with Prometheus:

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace

Inference Framework Metrics

Framework | Native Metrics | Key Signals
vLLM | Yes | Request latency, memory usage, token throughput
Ray Serve | Yes | Task execution time, resource utilization, autoscaling state
Hugging Face TGI | Yes | Inference latency, batch size, queue depth

These frameworks expose Prometheus endpoints -- scrape them alongside DCGM Exporter for full-stack GPU observability.


Detective Controls

EKS Audit Logging

# Enable all control plane log types
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{
    "clusterLogging": [{
      "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
      "enabled": true
    }]
  }'

Key audit queries (CloudWatch Logs Insights):

# Who created/deleted resources in last 24h
fields @timestamp, user.username, verb, objectRef.resource, objectRef.name, objectRef.namespace
| filter verb in ["create", "delete", "patch"]
| filter objectRef.resource not in ["events", "leases", "endpoints"]
| sort @timestamp desc
| limit 100

# Failed API calls (potential unauthorized access)
fields @timestamp, user.username, verb, objectRef.resource, responseStatus.code
| filter responseStatus.code >= 400
| sort @timestamp desc
| limit 50

# Exec into pods (security concern)
fields @timestamp, user.username, objectRef.namespace, objectRef.name
| filter objectRef.subresource = "exec"
| sort @timestamp desc

# RBAC changes
fields @timestamp, @message
| filter objectRef.resource in ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
| filter verb in ["create", "update", "patch", "delete"]
| sort @timestamp desc

Amazon GuardDuty for EKS

Finding | Severity | Meaning
PrivilegeEscalation:Kubernetes/PrivilegedContainer | High | Privileged container launched
Persistence:Kubernetes/ContainerWithSensitiveMount | Medium | Sensitive host path mounted
Policy:Kubernetes/ExposedDashboard | Medium | K8s dashboard exposed
CredentialAccess:Kubernetes/MaliciousIPCaller | High | API call from known malicious IP
Impact:Runtime/CryptoCurrencyMiningDetected | High | Crypto mining in container

CloudTrail for EKS

All eks:* API calls are logged. Key events to monitor:

  • CreateAccessEntry / DeleteAccessEntry -- access changes
  • UpdateClusterConfig -- cluster configuration changes
  • AssociateAccessPolicy -- permission grants
  • CreateAddon / DeleteAddon -- add-on changes

Use CloudTrail Insights to automatically detect unusual API activity patterns, including from pods using IRSA.


Alerting Patterns

Critical Alerts

Alert | Condition | Severity
Node NotReady | Node condition NotReady > 5 min | Critical
Pod CrashLooping | Restarts > 5 in 10 min | High
PVC Pending | PVC pending > 15 min | High
API Server Errors | 5xx rate > 1% | Critical
Certificate Expiry | < 30 days | Warning
Disk Pressure | Node disk > 85% | Warning
OOMKilled | Any OOMKilled event | High
ENA allowance exceeded | Any _exceeded metric > 0 | High
APF requests dropped | apiserver_flowcontrol_rejected_requests_total > 0 | Warning

Prometheus Alert Rules

groups:
  - name: eks-alerts
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: NodeMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} has memory pressure"

      - alert: DNSThrottling
        expr: rate(node_ethtool_linklocal_allowance_exceeded[5m]) > 0
        for: 2m
        labels:
          severity: high
        annotations:
          summary: "DNS throttling detected on {{ $labels.instance }}"

Avoiding Alert Fatigue

  • Use multi-stage thresholds: warning before critical
  • Correlate related alerts (node pressure + pod evictions = one incident, not two)
  • Implement maintenance windows to suppress during planned changes
  • Include runbook links and context (cluster name, namespace, pod) in every alert
  • Track false positive rates and refine thresholds quarterly
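
Alert correlation, as recommended in the list above, can be sketched in Python by grouping alerts that hit the same node within a short window. The alert fields, the 5-minute window, and the grouping key are assumptions for illustration; Alertmanager and most paging tools provide their own grouping features that would normally be used instead.

```python
def correlate(alerts: list, window_s: int = 300) -> list:
    """Group alerts for the same (cluster, node) that arrive close together."""
    incidents = []
    open_by_key = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["cluster"], alert["node"])
        incident = open_by_key.get(key)
        if incident and alert["ts"] - incident["last_ts"] <= window_s:
            # Same node, still inside the window: extend the open incident.
            incident["alerts"].append(alert["name"])
            incident["last_ts"] = alert["ts"]
        else:
            incident = {"key": key, "alerts": [alert["name"]], "last_ts": alert["ts"]}
            incidents.append(incident)
            open_by_key[key] = incident
    return incidents


# Hypothetical alert stream (timestamps in seconds).
alerts = [
    {"name": "NodeMemoryPressure", "cluster": "prod", "node": "ip-10-0-1-5", "ts": 1000},
    {"name": "PodEvicted", "cluster": "prod", "node": "ip-10-0-1-5", "ts": 1090},
    {"name": "NodeNotReady", "cluster": "prod", "node": "ip-10-0-2-9", "ts": 1100},
]
incidents = correlate(alerts)
for incident in incidents:
    print(incident["key"], incident["alerts"])
```

The memory pressure and eviction alerts collapse into one incident for the first node, while the unrelated NotReady alert on the second node stays separate.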

Monitoring High Availability

Architecture Principles

Principle | Implementation
Cross-AZ redundancy | Deploy monitoring components across multiple AZs
Use managed services | AMP, AMG, CloudWatch eliminate self-managed HA concerns
Monitor the monitors | Deploy a secondary lightweight system to alert on monitoring failures
Redundant alerting | Multiple notification channels (SNS + Slack + PagerDuty)
Dedicated compute | Run monitoring workloads on dedicated node groups to avoid contention

Self-Managed Prometheus HA

If running self-managed Prometheus (not AMP), pair it with Thanos or Cortex for:

  • Long-term storage (S3-backed)
  • Query federation across replicas
  • Deduplication of metrics from HA pairs

AMP eliminates this complexity -- it handles replication, storage, and HA automatically.


Multi-Tenant Observability Isolation

In multi-tenant EKS platforms, each tenant needs isolated observability data to prevent cross-tenant visibility and enable accurate cost attribution.

OTEL Routing Processor

Use the OpenTelemetry routing processor to direct telemetry to tenant-specific backends based on resource attributes. The routing processor inspects an attribute on incoming telemetry (typically k8s.namespace.name) and routes matching data to the appropriate exporter. Each tenant maps to a dedicated AMP workspace (or other backend), while unmatched data falls through to a default platform exporter.

Routing configuration pattern:

Routing Attribute | Match Pattern | Target Backend | Purpose
k8s.namespace.name | team-a-* | AMP workspace for Team A | Isolate Team A metrics
k8s.namespace.name | team-b-* | AMP workspace for Team B | Isolate Team B metrics
(default) | All other namespaces | Platform AMP workspace | Platform/shared metrics
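
The routing decision in the table can be modeled in a few lines of Python. This sketch only illustrates the matching logic (first matching pattern wins, otherwise the default); it is not OpenTelemetry Collector configuration, and the backend names are hypothetical labels for the AMP workspaces.

```python
from fnmatch import fnmatchcase

# Ordered routing table: (namespace pattern, backend label).
ROUTES = [
    ("team-a-*", "amp-team-a"),
    ("team-b-*", "amp-team-b"),
]
DEFAULT_BACKEND = "amp-platform"


def backend_for(namespace: str) -> str:
    """Return the backend for a k8s.namespace.name value."""
    for pattern, backend in ROUTES:
        if fnmatchcase(namespace, pattern):
            return backend
    return DEFAULT_BACKEND


for ns in ("team-a-payments", "team-b-search", "kube-system"):
    print(ns, "->", backend_for(ns))
```

A namespace naming convention with a tenant prefix is what makes this kind of pattern routing workable; namespaces that do not follow it fall through to the platform workspace.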

Per-Tenant CloudWatch Isolation

Data Type | Isolation Method | Naming Convention
Log groups | Separate log group per tenant | /eks/<cluster>/<tenant>/application
Metric namespaces | Metric dimensions by tenant | EKS/Tenant/<tenant-name>
Dashboards | Grafana workspace per tenant or folder-based RBAC | <tenant>-overview, <tenant>-alerts
Alerts | Per-tenant SNS topics | eks-<tenant>-critical, eks-<tenant>-warning

Isolated Grafana Dashboards

Option A: Amazon Managed Grafana with workspace-per-tenant

  • Each tenant gets their own AMG workspace
  • IAM Identity Center groups control access
  • Highest isolation but highest cost

Option B: Single AMG workspace with folder-based RBAC

  • One workspace with per-tenant folders
  • Grafana Teams map to IAM Identity Center groups
  • Team permissions scoped to their folder only
  • Lower cost, moderate isolation

Tiered Log Retention Architecture

For cost-effective log management, implement a tiered retention strategy that balances query performance with storage costs.

Tier Architecture

Application Logs
      |
      v
CloudWatch Logs (Hot Tier)
  |-- Retention: 7-14 days
  |-- Use: Real-time debugging, recent incident investigation
  +-- Cost: ~$0.50/GB ingestion + $0.03/GB/month storage
      |
      v (Subscription Filter)
Kinesis Data Firehose
      |
      v
S3 Bucket (Warm Tier)
  |-- Storage class: S3 Intelligent-Tiering
  |-- Retention: 30-90 days
  |-- Use: Historical analysis, compliance queries via Athena
  +-- Cost: ~$0.023/GB/month (Frequent) -> $0.0125/GB/month (Infrequent)
      |
      v (Lifecycle Rule)
S3 Glacier (Cold Tier)
  |-- Storage class: Glacier Flexible Retrieval
  |-- Retention: 90 days - 7 years (per compliance)
  |-- Use: Compliance archive, audit, legal hold
  +-- Cost: ~$0.004/GB/month

CloudWatch Subscription Filter

To stream logs from CloudWatch to S3 for archival, create a subscription filter on each log group. The filter forwards matching log events to a Kinesis Data Firehose delivery stream, which batches and writes them to an S3 bucket. Use an empty filter pattern to forward all events, or specify a pattern to selectively archive (e.g., only ERROR-level logs).
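
As a sketch of that setup, the Python helper below builds the arguments for the CloudWatch Logs PutSubscriptionFilter call. The helper name, the filter naming scheme, and the ARNs are assumptions for illustration; the IAM role must already allow CloudWatch Logs to write to the delivery stream.

```python
def subscription_filter_args(log_group: str, firehose_arn: str, role_arn: str,
                             pattern: str = "") -> dict:
    """Build keyword arguments for a CloudWatch Logs subscription filter."""
    return {
        "logGroupName": log_group,
        # Hypothetical naming scheme derived from the log group path.
        "filterName": f"archive-{log_group.strip('/').replace('/', '-')}",
        "filterPattern": pattern,  # empty pattern forwards every event
        "destinationArn": firehose_arn,
        "roleArn": role_arn,
    }


# Placeholder account ID and resource names.
args = subscription_filter_args(
    log_group="/eks/my-cluster/team-a/application",
    firehose_arn="arn:aws:firehose:us-east-1:111122223333:deliverystream/logs-to-s3",
    role_arn="arn:aws:iam::111122223333:role/cwl-to-firehose",
)
errors_only = subscription_filter_args(
    args["logGroupName"], args["destinationArn"], args["roleArn"], pattern="ERROR"
)
print(args["filterName"], repr(args["filterPattern"]), repr(errors_only["filterPattern"]))
```

With boto3 installed and credentials configured, the dictionary can be passed as `boto3.client("logs").put_subscription_filter(**args)`; repeat the call for each log group that should be archived.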

Retention Policy by Log Type

Log Type | Hot (CloudWatch) | Warm (S3 Standard) | Cold (Glacier) | Total Retention
Application logs | 7 days | 23 days | 335 days | 1 year
Audit logs | 30 days | 60 days | 275 days | 1 year
Security logs | 30 days | 60 days | 6+ years | 7 years (compliance)
Control plane logs | 14 days | 76 days | -- | 90 days
Access logs (ALB) | 14 days | 76 days | -- | 90 days
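
To make the trade-off concrete, the Python sketch below places a log event in a tier by age and estimates steady-state monthly cost from the approximate prices in the tier diagram. It is a rough model under stated assumptions: a constant daily volume, no compression, a 30-day month, and no Firehose, request, or retrieval charges. Function names are hypothetical and prices vary by region.

```python
# Approximate prices from the tier diagram (USD).
PRICES = {"ingest_per_gb": 0.50, "hot": 0.03, "warm": 0.023, "cold": 0.004}


def tier_for_age(age_days: int, hot_days: int, warm_days: int, cold_days: int) -> str:
    """Return which tier holds a log event of the given age."""
    if age_days < hot_days:
        return "hot"
    if age_days < hot_days + warm_days:
        return "warm"
    if age_days < hot_days + warm_days + cold_days:
        return "cold"
    return "expired"


def monthly_cost(daily_gb: float, hot_days: int, warm_days: int, cold_days: int) -> float:
    """Steady-state monthly cost: 30 days of ingestion plus data held per tier."""
    ingest = daily_gb * 30 * PRICES["ingest_per_gb"]
    storage = daily_gb * (hot_days * PRICES["hot"]
                          + warm_days * PRICES["warm"]
                          + cold_days * PRICES["cold"])
    return ingest + storage


# Application logs row (7 / 23 / 335 days) at a hypothetical 10 GB per day.
tiered = monthly_cost(10, hot_days=7, warm_days=23, cold_days=335)
all_hot = monthly_cost(10, hot_days=365, warm_days=0, cold_days=0)
print(f"tiered: ${tiered:.2f}/month, CloudWatch only: ${all_hot:.2f}/month")
print(tier_for_age(10, 7, 23, 335))
```

Under these assumptions ingestion dominates both totals ($150 of roughly $170.79 tiered versus $259.50 kept entirely in CloudWatch), so tiering trims storage cost while reducing log volume is the larger lever.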

Querying Archived Logs

For warm-tier logs stored in S3, use Amazon Athena with partition projection for efficient queries. Create an Athena table partitioned by year, month, day, and optionally tenant namespace. Athena can then query the archived logs using standard SQL, filtering by time range, namespace, log level, and other fields. With partition projection, Athena computes partition locations from the table properties instead of reading them from the catalog, so there is no need to run MSCK REPAIR TABLE when new partitions arrive.
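
A table definition along those lines can be generated with a short Python helper. This is a sketch under assumptions: the column list, JSON SerDe, bucket name, and prefix are hypothetical and must match how the delivery stream actually writes objects (including any record transformation applied before delivery).

```python
def athena_ddl(table: str, bucket: str, prefix: str,
               first_year: int, last_year: int) -> str:
    """Build CREATE TABLE DDL with year/month/day partition projection."""
    location = f"s3://{bucket}/{prefix}/"
    return f"""CREATE EXTERNAL TABLE {table} (
  `timestamp` string,
  level string,
  namespace string,
  message string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '{location}'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '{first_year},{last_year}',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'storage.location.template' = '{location}${{year}}/${{month}}/${{day}}/'
)"""


# Hypothetical names; objects are assumed to land under YYYY/MM/DD/ prefixes.
ddl = athena_ddl("eks_app_logs", "my-log-archive", "eks/application", 2024, 2030)
print(ddl)
```

A query such as `SELECT * FROM eks_app_logs WHERE year = 2025 AND month = 3 AND day = 14 AND level = 'ERROR'` would then scan only that day's prefix.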

