Source

This page is generated from skills/eks-best-practices/references/observability.md. Edit the source, not this page.

EKS Observability Best Practices

Part of: eks-best-practices
Purpose: Monitoring, logging, tracing, alerting, network performance monitoring, GPU observability, and detective controls for Amazon EKS clusters

Observability Strategy
CloudWatch Container Insights
CloudWatch Application Signals
Prometheus and Grafana
Control Plane Monitoring
Network Performance Monitoring
Logging Architecture
Distributed Tracing
GPU and AI/ML Observability
Detective Controls
Alerting Patterns
Monitoring High Availability
Multi-Tenant Observability Isolation
Tiered Log Retention Architecture

Observability Strategy

Three Pillars for EKS

Pillar	AWS-Managed Option	Open Source Option
Metrics	CloudWatch Container Insights	Amazon Managed Service for Prometheus (AMP) + Grafana
Logs	CloudWatch Logs	OpenSearch, Loki
Traces	AWS X-Ray / Application Signals	OpenTelemetry + Jaeger/Tempo

Decision Matrix

Factor	CloudWatch Native	Managed Prometheus + Grafana
Setup effort	Low (EKS add-on)	Medium (AMP workspace + ADOT)
Custom metrics	Limited (Container Insights)	Full PromQL
APM	Application Signals (auto-instrument)	Manual instrumentation
Dashboarding	CloudWatch dashboards	Grafana (rich ecosystem)
Cost model	Per metric, per log GB	Per metric sample ingested, plus storage and query charges
Multi-cluster	Per-account aggregation	Central AMP workspace
Alerting	CloudWatch Alarms	Prometheus Alertmanager
Recommendation	Simple setups, AWS-native	Production, complex monitoring

Key Supporting Components

Component	Purpose	Deploy As
kube-state-metrics (KSM)	Kubernetes object state (deployments, pods, nodes)	Deployment
Metrics Server	CPU/memory for HPA and `kubectl top`	Deployment
Fluent Bit	Log collection and forwarding	DaemonSet
ADOT Collector	Metrics, traces, logs collection (OpenTelemetry)	DaemonSet or sidecar
DCGM Exporter	GPU metrics (NVIDIA)	DaemonSet

CloudWatch Container Insights

Enable Container Insights

# Enable via EKS add-on (recommended -- enables both Container Insights and Application Signals)
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability

Key Container Insights Metrics

Metric	Level	Alert On
`node_cpu_utilization`	Node	> 80% sustained
`node_memory_utilization`	Node	> 85% sustained
`pod_cpu_utilization_over_pod_limit`	Pod	> 90% of limit
`pod_memory_utilization_over_pod_limit`	Pod	> 85% of limit
`pod_number_of_container_restarts`	Pod	> 3 in 5 minutes
`node_filesystem_utilization`	Node	> 80%
`cluster_failed_node_count`	Cluster	> 0

Enhanced Observability

Enhanced observability (EKS add-on v1.5+) provides:

Automatic pod-level Prometheus metrics collection
EKS control plane metrics
GPU metrics for ML workloads (via DCGM Exporter integration)
Integration with CloudWatch Application Signals for APM

CloudWatch Application Signals

Application Signals provides auto-instrumentation APM for EKS workloads -- it automatically collects metrics, traces, and service maps without code changes.

Supported Languages

Language	Auto-Instrumentation	Notes
Java	Yes	Broadest coverage
Python	Yes	--
Node.js	Yes	ES modules (ESM) require special setup
.NET	Yes	--

What It Provides

Capability	Detail
Service map	Auto-discovered dependency graph of all instrumented services
SLO monitoring	Define and track service-level objectives (latency, availability)
Correlated traces	Automatic trace correlation with metrics and logs
Pre-built dashboards	Latency, error rate, throughput per service -- no setup required

How to Enable

Enable via the CloudWatch Observability EKS add-on (same add-on as Container Insights), then annotate workloads:

# Annotate the pod template (not the Deployment's own metadata) to enable auto-instrumentation
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"  # or inject-python, inject-nodejs, inject-dotnet

Or enable via the AWS console for selected namespaces/workloads.

Application Signals is the simplest path to APM on EKS -- if you're already using the CloudWatch Observability add-on, it's one annotation per workload.

Prometheus and Grafana

Amazon Managed Prometheus (AMP) Setup

# Create AMP workspace
aws amp create-workspace --alias my-cluster-metrics

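# Deploy ADOT collector for metric collection.
# The chart requires an explicit mode, and AMP remote write requests must be
# SigV4-signed (for example via the collector's sigv4auth extension), so a
# fuller values file is usually needed beyond the flags shown here.
helm install adot-collector \
  open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=deployment \
  --set config.exporters.prometheusremotewrite.endpoint=<AMP_REMOTE_WRITE_URL> \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<IRSA_ROLE_ARN>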

Key Prometheus Metrics for EKS

Cluster health:

# Nodes not ready
kube_node_status_condition{condition="Ready",status="true"} == 0

# Pod restart rate (per namespace)
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace) > 5

# Pending pods (scheduling issues)
kube_pod_status_phase{phase="Pending"} > 0  # for > 5 minutes

# PVC pending (storage issues)
kube_persistentvolumeclaim_status_phase{phase="Pending"} > 0

Resource efficiency:

# CPU request vs actual usage (over-provisioning)
1 - (
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
  /
  sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
)

# Memory utilization vs requests
sum(container_memory_working_set_bytes{container!=""}) by (namespace)
/
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace)
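
As a worked example of the two ratios above, the following Python sketch applies the same arithmetic to made-up per-namespace figures. The function names and sample numbers are illustrative assumptions, not values from a real cluster.

```python
def cpu_overprovisioning(usage_cores: float, requested_cores: float) -> float:
    """Fraction of requested CPU that sits idle: 1 - usage / requests."""
    if requested_cores <= 0:
        raise ValueError("requested_cores must be positive")
    return 1 - usage_cores / requested_cores


def memory_utilization(working_set_bytes: float, requested_bytes: float) -> float:
    """Working-set memory as a fraction of requested memory."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return working_set_bytes / requested_bytes


# Hypothetical namespace: 2 cores used of 8 requested, 6 GiB used of 8 GiB requested.
GIB = 1024 ** 3
print(f"CPU over-provisioning: {cpu_overprovisioning(2.0, 8.0):.0%}")
print(f"Memory utilization:    {memory_utilization(6 * GIB, 8 * GIB):.0%}")
```

A high over-provisioning fraction suggests requests can be lowered; a memory ratio above 1 means pods are using more than they requested.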

Amazon Managed Grafana (AMG)

Recommended dashboards:

Kubernetes cluster overview (ID: 3119)
Node Exporter (ID: 1860)
Karpenter dashboard (ID: 20398)
CoreDNS dashboard (ID: 15762)
API server troubleshooter: Troubleshooting Dashboards

Control Plane Monitoring

EKS exposes Prometheus metrics for API server and etcd. Effective control plane monitoring helps distinguish between API server bottlenecks and downstream issues (etcd, webhooks, controllers).

API Server Metrics

Metric	What It Tells You
`apiserver_request_duration_seconds`	Request latency by verb and resource
`apiserver_request_total`	Request volume and error rates (watch for 429s and 5xx)
`apiserver_flowcontrol_nominal_limit_seats`	APF priority group capacity
`apiserver_flowcontrol_current_inqueue_requests`	Requests queued per APF bucket (non-zero = saturation)
`apiserver_flowcontrol_rejected_requests_total`	Requests dropped per APF bucket
`apiserver_admission_controller_admission_duration_seconds`	Admission controller latency (for webhooks specifically, see `apiserver_admission_webhook_admission_duration_seconds`)

API vs etcd Latency

When API server latency is high, check if etcd is the bottleneck:

# API server request latency heatmap (use max, not avg, across API servers)
max(increase(apiserver_request_duration_seconds_bucket{subresource!="status",subresource!="token",verb!="WATCH"}[$__rate_interval])) by (le)

# etcd request latency
etcd_request_duration_seconds

If etcd latency is 15 seconds while API latency is 20 seconds, most of the delay originates in etcd, not the API server. Always check the whole chain before tuning one component.

Asymmetric Traffic

EKS runs 2+ API servers. Never average metrics across them -- one server may be overloaded while others are idle. Use max or break out by instance.
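
A small Python sketch of why averaging hides the problem. The instance names and latency values are hypothetical; only the avg-versus-max comparison matters.

```python
from statistics import mean

# Hypothetical p99 request latency (seconds) per API server instance.
latency_by_instance = {
    "apiserver-a": 0.2,
    "apiserver-b": 0.3,
    "apiserver-c": 4.1,  # the overloaded instance
}


def summarize(latencies: dict[str, float]) -> dict[str, object]:
    """Return the average, the maximum, and which instance is worst."""
    worst = max(latencies, key=latencies.get)
    return {
        "avg": mean(latencies.values()),
        "max": latencies[worst],
        "worst_instance": worst,
    }


summary = summarize(latency_by_instance)
print(f"avg={summary['avg']:.2f}s max={summary['max']:.2f}s worst={summary['worst_instance']}")
```

The average lands well below the worst instance's latency, which is why the max or a per-instance breakdown is the safer view.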

Finding Noisy Controllers

Use CloudWatch Logs Insights to identify controllers making excessive LIST calls:

fields @timestamp, @message
| filter @logStream like "kube-apiserver-audit"
| filter verb = "list"
| parse requestReceivedTimestamp /\d+-\d+-(?<StartDay>\d+)T(?<StartHour>\d+):(?<StartMinute>\d+):(?<StartSec>\d+).(?<StartMsec>\d+)Z/
| parse stageTimestamp /\d+-\d+-(?<EndDay>\d+)T(?<EndHour>\d+):(?<EndMinute>\d+):(?<EndSec>\d+).(?<EndMsec>\d+)Z/
| fields (StartHour * 3600 + StartMinute * 60 + StartSec + StartMsec / 1000000) as StartTime, (EndHour * 3600 + EndMinute * 60 + EndSec + EndMsec / 1000000) as EndTime, (EndTime - StartTime) as DeltaTime
| stats avg(DeltaTime) as AverageDeltaTime, count(*) as CountTime by requestURI, userAgent
| filter CountTime >=50
| sort AverageDeltaTime desc
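
The query above derives request duration from the hour, minute, second, and fractional fields only, so a request that spans midnight would be miscomputed. For offline analysis of exported audit events, a minimal Python sketch of the same delta using full timestamps could look like this. It assumes the events are already parsed into dicts and that both timestamps carry fractional seconds; the function name is illustrative.

```python
from datetime import datetime

# Audit timestamps look like 2024-01-01T23:59:59.500000Z (UTC, fractional seconds).
AUDIT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def audit_request_seconds(event: dict) -> float:
    """Seconds between requestReceivedTimestamp and stageTimestamp."""
    start = datetime.strptime(event["requestReceivedTimestamp"], AUDIT_TIME_FORMAT)
    end = datetime.strptime(event["stageTimestamp"], AUDIT_TIME_FORMAT)
    return (end - start).total_seconds()


# Hypothetical LIST request that crosses midnight.
event = {
    "verb": "list",
    "requestReceivedTimestamp": "2024-01-01T23:59:59.500000Z",
    "stageTimestamp": "2024-01-02T00:00:01.250000Z",
}
print(audit_request_seconds(event))
```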

See also: Scalability -- Control Plane Scaling for APF tuning and burst limit guidance

Network Performance Monitoring

Network performance issues (DNS throttling, connection tracking exhaustion, bandwidth limits) are among the hardest EKS problems to diagnose because the packet drops last only seconds and aren't captured by flow logs.

ENA Driver Metrics

The Elastic Network Adapter (ENA) driver exposes metrics that reveal network-level bottlenecks. In a healthy cluster, every _exceeded counter should stay at zero, while conntrack_allowance_available should stay well above zero:

Metric	What It Means	Impact
`linklocal_allowance_exceeded`	PPS limit hit for local proxy services (DNS, IMDS, NTP)	DNS lookup failures, metadata timeouts
`conntrack_allowance_exceeded`	Connection tracking table full -- no new connections	Service-to-service connectivity failures
`conntrack_allowance_available`	Remaining tracked connections before limit	Early warning for conntrack exhaustion
`bw_in_allowance_exceeded`	Inbound bandwidth limit hit	Packet queuing/drops
`bw_out_allowance_exceeded`	Outbound bandwidth limit hit	Packet queuing/drops
`pps_allowance_exceeded`	Bidirectional PPS limit hit	Packet queuing/drops

Collecting ENA Metrics

Deploy Prometheus Node Exporter with the ethtool collector:

helm upgrade -i prometheus-node-exporter prometheus-community/prometheus-node-exporter \
  --set 'extraArgs[0]=--collector.ethtool' \
  --set 'extraArgs[1]=--collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+' \
  --set 'extraArgs[2]=--collector.ethtool.metrics-include=.*'

Then scrape with ADOT or Prometheus and store in AMP. These are cumulative counters, so alert when any _exceeded metric increases rather than on its absolute value.
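
To illustrate the counter semantics, here is a minimal Python sketch that compares two samples of ENA statistics and reports which _exceeded counters grew. The sample values are hypothetical, and the dict-of-counters input is an assumption about how you have already parsed the ethtool output.

```python
EXCEEDED_SUFFIX = "_allowance_exceeded"


def exceeded_increases(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return the _allowance_exceeded counters that grew between two samples."""
    increases = {}
    for name, value in current.items():
        if not name.endswith(EXCEEDED_SUFFIX):
            continue  # skip gauges such as conntrack_allowance_available
        delta = value - previous.get(name, 0)
        if delta > 0:  # a negative delta means the counter reset (e.g. reboot)
            increases[name] = delta
    return increases


# Two hypothetical samples taken a minute apart on one node.
before = {"linklocal_allowance_exceeded": 10, "bw_in_allowance_exceeded": 0,
          "conntrack_allowance_available": 5000}
after = {"linklocal_allowance_exceeded": 25, "bw_in_allowance_exceeded": 0,
         "conntrack_allowance_available": 4000}
print(exceeded_increases(before, after))
```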

DNS Throttling Detection

DNS queries are throttled at the ENI level (1024 PPS limit). Throttled queries don't appear in query logging or flow logs. The only reliable signal is linklocal_allowance_exceeded.

Remediation:

Increase CoreDNS replicas (anti-affinity spreads them across ENIs)
Deploy NodeLocal DNSCache
Lower ndots to reduce query volume (see Scalability -- CoreDNS)
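
To see why lowering ndots reduces query volume, the Python sketch below models glibc-style search-list expansion. The search list mirrors the Kubernetes default for a pod in a hypothetical shop namespace, and the function name is illustrative; other resolver implementations may behave differently.

```python
# Default-style search list for a pod in a hypothetical "shop" namespace.
SEARCH = ["shop.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_queries(name: str, search: list[str], ndots: int) -> list[str]:
    """Names a glibc-style resolver tries, in order, until one resolves."""
    if name.endswith("."):
        return [name]  # already fully qualified, no expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]  # too few dots: walk the search list first


for ndots in (5, 2):
    names = candidate_queries("api.example.com", SEARCH, ndots)
    position = names.index("api.example.com.") + 1
    print(f"ndots={ndots}: absolute name tried at position {position} of {len(names)}")
```

With ndots set to 5, the external name is only tried after three search-domain misses; with ndots set to 2 it is tried first. Resolvers commonly issue both A and AAAA lookups per candidate, which doubles the difference.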

Logging Architecture

Log Types and Destinations

Log Type	Source	Recommended Destination
Control plane logs	EKS API server, audit, scheduler, controller manager, authenticator	CloudWatch Logs (`/aws/eks/<cluster>/cluster`)
Application logs	Container stdout/stderr	CloudWatch Logs or OpenSearch
Node logs	kubelet, containerd	CloudWatch Logs via agent
Data plane logs	VPC CNI, kube-proxy	CloudWatch Logs

Fluent Bit Configuration

# Deploy as EKS add-on (recommended); confirm the add-on name offered for your
# cluster version with `aws eks describe-addon-versions`
aws eks create-addon --cluster-name my-cluster --addon-name aws-for-fluent-bit

Scaling Fluent Bit for large clusters:

Setting	Purpose
`Use_Kubelet: On`	Fetch pod metadata from local kubelet instead of API server -- critical at scale
`Kube_Meta_Cache_TTL: 60`	Cache metadata for 60+ seconds to reduce API calls
`Buffer_Chunk_Size` / `Buffer_Max_Size`	Tune for log volume to prevent backpressure

For Fargate pods, use the built-in Fluent Bit log router -- configure it with the aws-logging ConfigMap in the aws-observability namespace to send logs to CloudWatch Logs.

Log Retention Strategy

Log Type	Retention	Reason
Audit logs	90-365 days	Compliance, forensics
Application logs	14-30 days	Debugging
Control plane logs	30-90 days	Troubleshooting
Access logs (ALB)	90 days	Security review

Cost optimization -- hot/warm/cold architecture:

Hot (0-30 days): CloudWatch Logs -- fast queries with Logs Insights
Warm (30-90 days): Export to S3 Standard/IA via subscription filters
Cold (90+ days): S3 Glacier for compliance retention

Structured Logging

DO:

Output logs as JSON for structured parsing
Include request ID, trace ID, and user context in every log line
Log at appropriate levels (ERROR, WARN, INFO)
Use Kubernetes labels/annotations to add metadata

DON'T:

Log sensitive data (tokens, passwords, PII)
Use unstructured text logs in production
Log at DEBUG level in production (volume + cost)
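
A minimal Python sketch of the DO list, using only the standard library. The context field names (request_id, trace_id, user_id) and the logger name are assumptions for illustration; adapt them to whatever your services already propagate.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    # Hypothetical context fields, supplied through the `extra` argument.
    CONTEXT_FIELDS = ("request_id", "trace_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(name: str = "app") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # keep DEBUG off in production
    logger.propagate = False
    return logger


log = build_logger()
log.info("order created", extra={"request_id": "req-123", "trace_id": "abc", "user_id": "u-42"})
log.debug("dropped because the level is INFO")
```

Because every line is a JSON object, Fluent Bit and Logs Insights can filter on request_id or trace_id without regex parsing.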

Distributed Tracing

AWS Distro for OpenTelemetry (ADOT)

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      awsxray:
        region: us-east-1
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]

X-Ray vs OpenTelemetry

Factor	AWS X-Ray	OpenTelemetry + Jaeger/Tempo
Setup	Simple with ADOT	More configuration
AWS integration	Native (Lambda, API GW, etc.)	Manual
Vendor lock-in	AWS-specific	Vendor-neutral
Querying	X-Ray console, CloudWatch	Grafana (richer)
Cost	Per trace recorded	Storage-dependent

Strategic Sampling

Not all traces need the same sampling rate. Configure higher rates for critical paths and lower rates for high-volume, low-value routes:

Traffic Type	Suggested Rate	Rationale
Critical user paths (checkout, login)	100% or high	Full visibility for business-critical flows
Health checks, readiness probes	0-1%	High volume, low diagnostic value
Internal service-to-service	5-10%	Balance cost with troubleshooting needs

Use X-Ray sampling rules or OpenTelemetry tail-based sampling in the ADOT collector to implement this.
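
The per-route idea can be sketched in Python as deterministic head sampling. Note this is a simpler technique than the tail-based sampling mentioned above, chosen because it fits in a few lines; the route prefixes, rates, and function names are hypothetical.

```python
import hashlib

# Hypothetical route prefixes and rates, loosely following the table above.
SAMPLING_RULES = [
    ("/healthz", 0.0),   # health checks: drop
    ("/checkout", 1.0),  # critical path: keep everything
    ("/login", 1.0),
]
DEFAULT_RATE = 0.05  # internal service-to-service traffic


def sampling_rate(route: str) -> float:
    """First matching prefix wins; otherwise fall back to the default rate."""
    for prefix, rate in SAMPLING_RULES:
        if route.startswith(prefix):
            return rate
    return DEFAULT_RATE


def should_sample(trace_id: str, route: str) -> bool:
    """Hash the trace ID so every service makes the same decision for a trace."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sampling_rate(route)


kept = sum(should_sample(f"trace-{i}", "/api/items") for i in range(10_000))
print(f"kept {kept} of 10000 default-rate traces")
```

Hashing the trace ID instead of calling a random generator keeps the decision consistent across services that see the same trace.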

GPU and AI/ML Observability

For clusters running GPU workloads (training, inference), standard CPU/memory metrics are insufficient. GPU utilization, memory, power, and SM activity need dedicated monitoring.

GPU Metrics

Metric	Source	What It Tells You
`DCGM_FI_DEV_GPU_UTIL`	DCGM Exporter	GPU utilization % (time executing any kernel)
`DCGM_FI_DEV_MEM_COPY_UTIL`	DCGM Exporter	Memory controller utilization
`DCGM_FI_DEV_POWER_USAGE`	DCGM Exporter	Power draw -- best proxy for actual GPU engagement
`DCGM_FI_PROF_SM_ACTIVE`	DCGM Exporter	Streaming multiprocessor activity -- true parallelism
`DCGM_FI_DEV_XID_ERRORS`	DCGM Exporter	GPU error codes -- non-zero needs investigation

GPU Utilization alone is misleading -- 100% can mean one lightweight kernel or full parallel workloads. Compare power draw against Thermal Design Power (TDP) to spot real underutilization.
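
A rough Python sketch of that comparison. The 90% utilization and 50% power-ratio thresholds are illustrative assumptions, not vendor guidance, and the TDP value must come from your GPU's specification.

```python
def gpu_engagement(power_watts: float, tdp_watts: float) -> float:
    """Power draw as a fraction of Thermal Design Power."""
    if tdp_watts <= 0:
        raise ValueError("tdp_watts must be positive")
    return power_watts / tdp_watts


def looks_underutilized(
    gpu_util_pct: float,
    power_watts: float,
    tdp_watts: float,
    util_threshold: float = 90.0,        # assumed cut-off for "busy" utilization
    power_ratio_threshold: float = 0.5,  # assumed cut-off for "low" power draw
) -> bool:
    """Flag GPUs that report high utilization but draw little power."""
    busy = gpu_util_pct >= util_threshold
    low_power = gpu_engagement(power_watts, tdp_watts) < power_ratio_threshold
    return busy and low_power


# Hypothetical readings for a 400 W TDP GPU.
print(looks_underutilized(gpu_util_pct=100, power_watts=120, tdp_watts=400))
print(looks_underutilized(gpu_util_pct=100, power_watts=380, tdp_watts=400))
```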

Collecting GPU Metrics

The CloudWatch Observability add-on auto-deploys DCGM Exporter on GPU nodes. Alternatively, deploy DCGM Exporter manually with Prometheus:

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace

Inference Framework Metrics

Framework	Native Metrics	Key Signals
vLLM	Yes	Request latency, memory usage, token throughput
Ray Serve	Yes	Task execution time, resource utilization, autoscaling state
Hugging Face TGI	Yes	Inference latency, batch size, queue depth

These frameworks expose Prometheus endpoints -- scrape them alongside DCGM Exporter for full-stack GPU observability.

Detective Controls

EKS Audit Logging

# Enable all control plane log types
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{
    "clusterLogging": [{
      "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
      "enabled": true
    }]
  }'

Key audit queries (CloudWatch Logs Insights):

# Who created, deleted, or patched resources (set the query time range to the last 24h)
fields @timestamp, user.username, verb, objectRef.resource, objectRef.name, objectRef.namespace
| filter verb in ["create", "delete", "patch"]
| filter objectRef.resource not in ["events", "leases", "endpoints"]
| sort @timestamp desc
| limit 100

# Failed API calls (potential unauthorized access)
fields @timestamp, user.username, verb, objectRef.resource, responseStatus.code
| filter responseStatus.code >= 400
| sort @timestamp desc
| limit 50

# Exec into pods (security concern)
fields @timestamp, user.username, objectRef.namespace, objectRef.name
| filter objectRef.subresource = "exec"
| sort @timestamp desc

# RBAC changes
fields @timestamp, @message
| filter objectRef.resource in ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
| filter verb in ["create", "update", "patch", "delete"]
| sort @timestamp desc

Amazon GuardDuty for EKS

Finding	Severity	Meaning
`PrivilegeEscalation:Kubernetes/PrivilegedContainer`	Medium	Privileged container launched
`Persistence:Kubernetes/ContainerWithSensitiveMount`	Medium	Sensitive host path mounted
`Policy:Kubernetes/ExposedDashboard`	Medium	K8s dashboard exposed
`CredentialAccess:Kubernetes/MaliciousIPCaller`	High	API call from known malicious IP
`Impact:Runtime/CryptoMinerExecuted`	High	Crypto mining in container

CloudTrail for EKS

All eks:* API calls are logged. Key events to monitor:

CreateAccessEntry / DeleteAccessEntry -- access changes
UpdateClusterConfig -- cluster configuration changes
AssociateAccessPolicy -- permission grants
CreateAddon / DeleteAddon -- add-on changes

Use CloudTrail Insights to automatically detect unusual API activity patterns, including from pods using IRSA.

Alerting Patterns

Critical Alerts

Alert	Condition	Severity
Node NotReady	Node condition NotReady > 5 min	Critical
Pod CrashLooping	Restarts > 5 in 10 min	High
PVC Pending	PVC pending > 15 min	High
API Server Errors	5xx rate > 1%	Critical
Certificate Expiry	< 30 days	Warning
Disk Pressure	Node disk > 85%	Warning
OOMKilled	Any OOMKilled event	High
ENA allowance exceeded	Any `_exceeded` counter increasing	High
APF requests dropped	`apiserver_flowcontrol_rejected_requests_total` increasing	Warning

Prometheus Alert Rules

groups:
- name: eks-alerts
  rules:
  - alert: PodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

  - alert: NodeMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} has memory pressure"

  - alert: DNSThrottling
    expr: rate(node_ethtool_linklocal_allowance_exceeded[5m]) > 0
    for: 2m
    labels:
      severity: high
    annotations:
      summary: "DNS throttling detected on {{ $labels.instance }}"

Avoiding Alert Fatigue

Use multi-stage thresholds: warning before critical
Correlate related alerts (node pressure + pod evictions = one incident, not two)
Implement maintenance windows to suppress during planned changes
Include runbook links and context (cluster name, namespace, pod) in every alert
Track false positive rates and refine thresholds quarterly
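
As one way to picture the correlation step, this Python sketch groups alerts that share a node and fire close together into a single incident. The alert shape (name, node, timestamp in seconds) and the 5-minute window are assumptions for illustration; Alertmanager grouping or an incident tool would normally do this for you.

```python
from collections import defaultdict


def group_into_incidents(alerts: list[dict], window_seconds: int = 300) -> list[list[dict]]:
    """Group alerts on the same node when each fires within window_seconds of the previous one."""
    by_node: dict[str, list[dict]] = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        by_node[alert["node"]].append(alert)

    incidents = []
    for node_alerts in by_node.values():
        current = [node_alerts[0]]
        for alert in node_alerts[1:]:
            if alert["timestamp"] - current[-1]["timestamp"] <= window_seconds:
                current.append(alert)  # same incident
            else:
                incidents.append(current)  # gap too large: start a new incident
                current = [alert]
        incidents.append(current)
    return incidents


# Hypothetical alert stream.
alerts = [
    {"name": "NodeMemoryPressure", "node": "node-1", "timestamp": 0},
    {"name": "PodEvicted", "node": "node-1", "timestamp": 60},
    {"name": "NodeMemoryPressure", "node": "node-2", "timestamp": 30},
    {"name": "PodEvicted", "node": "node-1", "timestamp": 1000},
]
for incident in group_into_incidents(alerts):
    print([a["name"] for a in incident], "on", incident[0]["node"])
```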

Monitoring High Availability

Architecture Principles

Principle	Implementation
Cross-AZ redundancy	Deploy monitoring components across multiple AZs
Use managed services	AMP, AMG, CloudWatch eliminate self-managed HA concerns
Monitor the monitors	Deploy a secondary lightweight system to alert on monitoring failures
Redundant alerting	Multiple notification channels (SNS + Slack + PagerDuty)
Dedicated compute	Run monitoring workloads on dedicated node groups to avoid contention

Self-Managed Prometheus HA

If running self-managed Prometheus (not AMP), pair it with Thanos or Cortex for:

Long-term storage (S3-backed)
Query federation across replicas
Deduplication of metrics from HA pairs

AMP eliminates this complexity -- it handles replication, storage, and HA automatically.

Multi-Tenant Observability Isolation

In multi-tenant EKS platforms, each tenant needs isolated observability data to prevent cross-tenant visibility and enable accurate cost attribution.

OTEL Routing Processor

Use the OpenTelemetry routing processor (in recent Collector releases this role is filled by the routing connector) to direct telemetry to tenant-specific backends based on resource attributes. The routing processor inspects an attribute on incoming telemetry (typically k8s.namespace.name) and routes matching data to the appropriate exporter. Each tenant maps to a dedicated AMP workspace (or other backend), while unmatched data falls through to a default platform exporter.

Routing configuration pattern:

Routing Attribute	Match Pattern	Target Backend	Purpose
`k8s.namespace.name`	`team-a-*`	AMP workspace for Team A	Isolate Team A metrics
`k8s.namespace.name`	`team-b-*`	AMP workspace for Team B	Isolate Team B metrics
(default)	All other namespaces	Platform AMP workspace	Platform/shared metrics
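
The routing decision in the table can be modelled in a few lines of Python. This is only a sketch of the matching logic, not collector configuration; the backend names are hypothetical placeholders for AMP workspaces.

```python
from fnmatch import fnmatchcase

# (namespace pattern, backend) pairs, checked in order. Backend names are placeholders.
ROUTES = [
    ("team-a-*", "amp-team-a"),
    ("team-b-*", "amp-team-b"),
]
DEFAULT_BACKEND = "amp-platform"


def route_backend(resource_attributes: dict[str, str]) -> str:
    """Pick a backend from the k8s.namespace.name resource attribute."""
    namespace = resource_attributes.get("k8s.namespace.name", "")
    for pattern, backend in ROUTES:
        if fnmatchcase(namespace, pattern):
            return backend
    return DEFAULT_BACKEND  # unmatched or missing namespace


for ns in ("team-a-payments", "team-b-search", "kube-system"):
    print(ns, "->", route_backend({"k8s.namespace.name": ns}))
```

Telemetry without the attribute falls through to the platform backend, so make sure namespace enrichment runs before routing.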

Per-Tenant CloudWatch Isolation

Data Type	Isolation Method	Naming Convention
Log groups	Separate log group per tenant	`/eks/<cluster>/<tenant>/application`
Metric namespaces	Metric dimensions by tenant	`EKS/Tenant/<tenant-name>`
Dashboards	Grafana workspace per tenant or folder-based RBAC	`<tenant>-overview`, `<tenant>-alerts`
Alerts	Per-tenant SNS topics	`eks-<tenant>-critical`, `eks-<tenant>-warning`

Isolated Grafana Dashboards

Option A: Amazon Managed Grafana with workspace-per-tenant

Each tenant gets their own AMG workspace
IAM Identity Center groups control access
Highest isolation but highest cost

Option B: Single AMG workspace with folder-based RBAC

One workspace with per-tenant folders
Grafana Teams map to IAM Identity Center groups
Team permissions scoped to their folder only
Lower cost, moderate isolation

Tiered Log Retention Architecture

For cost-effective log management, implement a tiered retention strategy that balances query performance with storage costs.

Tier Architecture

Application Logs
     |
     v
CloudWatch Logs (Hot Tier)
  |-- Retention: 7-14 days
  |-- Use: Real-time debugging, recent incident investigation
  +-- Cost: ~$0.50/GB ingestion + $0.03/GB/month storage
     |
     v (Subscription Filter)
Kinesis Data Firehose
     |
     v
S3 Bucket (Warm Tier)
  |-- Storage class: S3 Intelligent-Tiering
  |-- Retention: 30-90 days
  |-- Use: Historical analysis, compliance queries via Athena
  +-- Cost: ~$0.023/GB/month (Frequent) -> $0.0125/GB/month (Infrequent)
     |
     v (Lifecycle Rule)
S3 Glacier (Cold Tier)
  |-- Storage class: Glacier Flexible Retrieval
  |-- Retention: 90 days - 7 years (per compliance)
  |-- Use: Compliance archive, audit, legal hold
  +-- Cost: ~$0.004/GB/month
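
To make the trade-off concrete, the Python sketch below estimates the lifetime cost of one GB of logs using the approximate prices in the diagram. It is a deliberately rough model: it ignores Firehose charges, request costs, compression, the cheaper infrequent-access rate, and minimum storage durations, and it assumes a 30-day month.

```python
# Approximate prices quoted in the diagram above (USD).
INGEST_PER_GB = 0.50
HOT_PER_GB_MONTH = 0.03
WARM_PER_GB_MONTH = 0.023
COLD_PER_GB_MONTH = 0.004
DAYS_PER_MONTH = 30  # simplifying assumption


def lifetime_cost_per_gb(hot_days: int, warm_days: int, cold_days: int) -> float:
    """Ingestion plus storage for one GB that moves through all three tiers."""
    storage = (
        HOT_PER_GB_MONTH * hot_days
        + WARM_PER_GB_MONTH * warm_days
        + COLD_PER_GB_MONTH * cold_days
    ) / DAYS_PER_MONTH
    return INGEST_PER_GB + storage


def hot_only_cost_per_gb(total_days: int) -> float:
    """Same GB kept in CloudWatch Logs for the whole retention period."""
    return INGEST_PER_GB + HOT_PER_GB_MONTH * total_days / DAYS_PER_MONTH


tiered = lifetime_cost_per_gb(hot_days=7, warm_days=23, cold_days=335)
hot_only = hot_only_cost_per_gb(365)
print(f"tiered: ${tiered:.3f}/GB  hot-only: ${hot_only:.3f}/GB")
```

In this simplified model ingestion is the largest term either way, so reducing log volume matters at least as much as tiering.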

CloudWatch Subscription Filter

To stream logs from CloudWatch to S3 for archival, create a subscription filter on each log group. The filter forwards matching log events to an Amazon Data Firehose (formerly Kinesis Data Firehose) delivery stream, which batches and writes them to an S3 bucket. Use an empty filter pattern to forward all events, or specify a pattern to selectively archive (e.g., only ERROR-level logs).
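
Records delivered through a subscription filter arrive gzip-compressed, wrapping the log events in a JSON envelope. If you need to inspect or transform them downstream, a minimal Python sketch of the decoding step looks like this; the function name and the flattened output shape are illustrative choices.

```python
import gzip
import json


def decode_subscription_payload(data: bytes) -> list[dict]:
    """Decompress one subscription-filter record and flatten its log events."""
    payload = json.loads(gzip.decompress(data))
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # control messages carry no log events
    return [
        {
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
            "timestamp": event["timestamp"],
            "message": event["message"],
        }
        for event in payload["logEvents"]
    ]


# Hypothetical record built locally to exercise the decoder.
sample = gzip.compress(json.dumps({
    "messageType": "DATA_MESSAGE",
    "logGroup": "/eks/my-cluster/team-a/application",
    "logStream": "pod-abc",
    "logEvents": [{"id": "1", "timestamp": 1700000000000, "message": "ERROR payment failed"}],
}).encode())
print(decode_subscription_payload(sample))
```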

Retention Policy by Log Type

Log Type	Hot (CloudWatch)	Warm (S3)	Cold (Glacier)	Total Retention
Application logs	7 days	23 days	335 days	1 year
Audit logs	30 days	60 days	275 days	1 year
Security logs	30 days	60 days	6+ years	7 years (compliance)
Control plane logs	14 days	76 days	—	90 days
Access logs (ALB)	14 days	76 days	—	90 days
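
The table translates directly into a lookup. This Python sketch returns the tier a log record should be in at a given age; the day counts come from the table, while the 7-year security figure is expressed as 2,555 days total, which is an assumption about how "6+ years" is rounded.

```python
# (hot, warm, cold) days per log type, taken from the table above.
RETENTION_DAYS = {
    "application": (7, 23, 335),
    "audit": (30, 60, 275),
    "security": (30, 60, 2555 - 90),  # assumes 7 years = 2,555 days in total
    "control-plane": (14, 76, 0),     # no cold tier
    "alb-access": (14, 76, 0),        # no cold tier
}


def tier_for(log_type: str, age_days: int) -> str:
    """Return 'hot', 'warm', 'cold', or 'expired' for a record of the given age."""
    hot, warm, cold = RETENTION_DAYS[log_type]
    if age_days < hot:
        return "hot"
    if age_days < hot + warm:
        return "warm"
    if age_days < hot + warm + cold:
        return "cold"
    return "expired"


for log_type, age in [("application", 3), ("application", 100), ("control-plane", 90)]:
    print(log_type, age, "->", tier_for(log_type, age))
```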

Querying Archived Logs

For warm-tier logs stored in S3, use Amazon Athena with partition projection for efficient queries. Create an Athena table partitioned by year, month, day, and optionally tenant namespace. Athena can then query the archived logs using standard SQL, filtering by time range, namespace, log level, and other fields. With partition projection, Athena computes partition locations from the table properties, so new partitions become queryable without running MSCK REPAIR TABLE.
