Source

This page is generated from skills/eks-recon/references/observability.md. Edit the source, not this page.

Module: Observability

Part of: eks-recon Purpose: Detect observability stack - metrics, logging, tracing configuration

Prerequisites
Detection Strategy
Detection Commands
Output Schema
Edge Cases
Recommendations Based on Findings

Prerequisites

Cluster name required: Yes
MCP tools used: describe_eks_resource, list_k8s_resources
CLI fallback: aws eks, kubectl, aws logs

Detection Strategy

Observability has three pillars (plus control plane logging):

Metrics          -> Container Insights, Prometheus, Datadog, etc.
Logging          -> CloudWatch, FluentBit, OpenSearch, etc.
Tracing          -> X-Ray, ADOT, Jaeger, etc.
Control Plane    -> API server, audit, authenticator logs

Why detect each pillar:

Pillar	Why It Matters
Metrics	Understand resource utilization, HPA scaling decisions, capacity planning
Logging	Debug application issues, audit security events, compliance requirements
Tracing	Diagnose latency in distributed systems, identify service dependencies
Control Plane	Investigate API failures, audit access, debug networking issues

Detection Commands

1. Metrics Collection

Start with metrics detection to understand how the cluster tracks resource usage and supports autoscaling. Most clusters have at least one metrics solution.

Container Insights (CloudWatch):

Use Container Insights when you need AWS-native monitoring with automatic CloudWatch integration. This is the simplest option for teams already using AWS observability tools.

MCP:

describe_eks_resource(
  resource_type="addon",
  cluster_name="<cluster-name>",
  resource_name="amazon-cloudwatch-observability"
)

CLI:

# Check for CloudWatch add-on
aws eks describe-addon --cluster-name <cluster-name> \
  --addon-name amazon-cloudwatch-observability 2>/dev/null

# Alternative: Check for CloudWatch agent DaemonSet
kubectl get daemonset -n amazon-cloudwatch cloudwatch-agent 2>/dev/null

# Check for Fluent Bit (CloudWatch integration)
kubectl get daemonset -n amazon-cloudwatch fluent-bit 2>/dev/null

Example output (add-on installed):

{
  "addon": {
    "addonName": "amazon-cloudwatch-observability",
    "clusterName": "prod-cluster",
    "status": "ACTIVE",
    "addonVersion": "v1.5.0-eksbuild.1"
  }
}

Example output (not installed):

An error occurred (ResourceNotFoundException) when calling the DescribeAddon operation

Prometheus (Self-Managed):

Use Prometheus when you need flexible metrics collection with PromQL queries, custom recording rules, or Grafana dashboards. Common in teams with existing Prometheus expertise.

# Check for Prometheus deployment
kubectl get deploy -n prometheus prometheus-server 2>/dev/null || \
kubectl get deploy -n monitoring prometheus-server 2>/dev/null || \
kubectl get statefulset -n prometheus prometheus-server 2>/dev/null

# Check for kube-prometheus-stack (Helm)
helm list -A --filter "prometheus\|kube-prometheus" 2>/dev/null

# Check for Prometheus Operator
kubectl get deploy -A -l "app.kubernetes.io/name=prometheus-operator" 2>/dev/null

Example output (Prometheus detected):

NAME                READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-server   1/1     1            1           45d

Amazon Managed Prometheus (AMP):

Check for AMP when you need managed Prometheus with automatic scaling and AWS integration. Look for aps-workspaces URLs in remote write configurations.

# Check for ADOT or Prometheus remote write config
kubectl get configmap -A -o json | jq -r '
  .items[] |
  select(.data | to_entries | .[] | .value | contains("aps-workspaces")) |
  {namespace: .metadata.namespace, name: .metadata.name}'

Example output (AMP configured):

{
  "namespace": "prometheus",
  "name": "prometheus-config"
}

Grafana:

# Check for Grafana deployment
kubectl get deploy -A -l "app.kubernetes.io/name=grafana" 2>/dev/null

# Check for Amazon Managed Grafana (external, check workspace)
# Note: AMG workspaces are external to cluster

Other Metrics Tools:

Third-party tools like Datadog and New Relic provide unified observability platforms. Check for these when the team uses a commercial APM solution.

# Datadog
kubectl get daemonset -n datadog datadog-agent 2>/dev/null

# New Relic
kubectl get daemonset -A -l "app.kubernetes.io/name=nri-bundle" 2>/dev/null

# Metrics Server (required for HPA - almost always present)
kubectl get deploy -n kube-system metrics-server 2>/dev/null

Example output (Metrics Server):

NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           120d

2. Logging Configuration

Detect logging configuration to understand how application and cluster logs are collected and where they are sent. Control plane logging is critical for debugging and compliance.

Control Plane Logging:

Always check control plane logging first. Missing audit logs is a security/compliance gap.

# Check which control plane logs are enabled
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.logging.clusterLogging[*].{types:types,enabled:enabled}'

Example output (all logs enabled):

[
  {
    "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
    "enabled": true
  }
]

Example output (no logs enabled - flag this):

[
  {
    "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
    "enabled": false
  }
]

Fluent Bit / Fluentd:

Fluent Bit (lightweight) and Fluentd (feature-rich) are the most common log forwarders. Check their ConfigMaps to determine where logs are being sent.

# Fluent Bit DaemonSet
kubectl get daemonset -A -l "app.kubernetes.io/name=fluent-bit" 2>/dev/null

# Fluentd DaemonSet
kubectl get daemonset -A -l "app=fluentd" 2>/dev/null

# Check Fluent Bit ConfigMap for destinations
kubectl get configmap -n amazon-cloudwatch fluent-bit-config -o yaml 2>/dev/null | \
  grep -E "cloudwatch|opensearch|s3|kinesis" || true

Example output (Fluent Bit detected):

NAMESPACE          NAME         DESIRED   CURRENT   READY   AGE
amazon-cloudwatch  fluent-bit   3         3         3       60d

OpenSearch / Elasticsearch:

# Check for OpenSearch endpoint in configs
kubectl get configmap -A -o json | jq -r '
  .items[] |
  select(.data | to_entries | .[] | .value | contains("opensearch") or contains("elasticsearch")) |
  {namespace: .metadata.namespace, name: .metadata.name}'

Loki:

# Check for Loki deployment
kubectl get deploy -A -l "app.kubernetes.io/name=loki" 2>/dev/null
kubectl get statefulset -A -l "app.kubernetes.io/name=loki" 2>/dev/null

3. Tracing

Tracing is essential for debugging latency in microservices architectures. Without tracing, diagnosing cross-service issues requires correlating logs manually.

AWS X-Ray / ADOT:

ADOT (AWS Distro for OpenTelemetry) is the AWS-recommended approach for tracing. It can send traces to X-Ray, Jaeger, or other backends.

# Check for ADOT collector
kubectl get deploy -A -l "app.kubernetes.io/name=aws-otel-collector" 2>/dev/null

# Check for X-Ray daemon
kubectl get daemonset -A -l "app=xray-daemon" 2>/dev/null

# Check ADOT add-on
aws eks describe-addon --cluster-name <cluster-name> --addon-name adot 2>/dev/null

Example output (ADOT add-on installed):

{
  "addon": {
    "addonName": "adot",
    "clusterName": "prod-cluster",
    "status": "ACTIVE",
    "addonVersion": "v0.88.0-eksbuild.1"
  }
}

Jaeger:

Jaeger is a popular open-source tracing backend. Check for it when the team uses a self-managed tracing solution.

# Check for Jaeger
kubectl get deploy -A -l "app.kubernetes.io/name=jaeger" 2>/dev/null
kubectl get deploy -A -l "app=jaeger" 2>/dev/null

Tempo:

Grafana Tempo is often used with Grafana and Loki for a unified observability stack.

# Check for Grafana Tempo
kubectl get deploy -A -l "app.kubernetes.io/name=tempo" 2>/dev/null
kubectl get statefulset -A -l "app.kubernetes.io/name=tempo" 2>/dev/null

4. Application Signals (APM)

Application Signals provides automatic instrumentation for common frameworks. Check for this when the team wants APM without modifying application code.

# Check for CloudWatch Application Signals
kubectl get deploy -n amazon-cloudwatch cloudwatch-agent-operator 2>/dev/null

# Check for auto-instrumentation
kubectl get instrumentations.opentelemetry.io -A 2>/dev/null

Example output (auto-instrumentation configured):

NAMESPACE   NAME       AGE   ENDPOINT
default     java-app   30d   http://adot-collector:4317

Output Schema

observability:
  metrics:
    container_insights:
      enabled: bool
      addon_version: string
      
    prometheus:
      detected: bool
      type: string         # self-managed | amp | operator
      namespace: string
      version: string
      
    grafana:
      detected: bool
      type: string         # self-managed | amg
      namespace: string
      
    metrics_server:
      detected: bool
      version: string
      
    other_tools: list      # datadog, newrelic, etc.
    
  logging:
    control_plane:
      enabled: bool
      log_types: list      # api, audit, authenticator, controllerManager, scheduler
      
    application:
      tool: string         # fluent-bit | fluentd | promtail | none
      destination: string  # cloudwatch | opensearch | loki | s3
      namespace: string
      
    log_destinations:
      cloudwatch: bool
      opensearch: bool
      s3: bool
      loki: bool
      
  tracing:
    tool: string           # xray | adot | jaeger | tempo | none
    adot:
      detected: bool
      version: string
    xray:
      detected: bool
    jaeger:
      detected: bool
    tempo:
      detected: bool
      
  apm:
    application_signals:
      enabled: bool
    auto_instrumentation:
      enabled: bool
      namespaces: list

Edge Cases

Multiple Metrics Solutions

Common to have:

Metrics Server (for HPA)
Prometheus (for detailed metrics)
Container Insights (for AWS integration)

Note all and their purposes.

Log Aggregation Outside Cluster

Logs may go to:

External CloudWatch in different account
Third-party SaaS (Datadog, Splunk)
Self-managed OpenSearch/ELK

Check Fluent Bit/Fluentd configs for destinations.

Control Plane Logging Not Enabled

# Check if any logs are enabled
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.logging.clusterLogging[?enabled==`true`].types'

If empty, flag as security/compliance gap.

ADOT vs Self-Managed Collectors

# Check if using ADOT add-on or self-managed
aws eks describe-addon --cluster-name <cluster-name> --addon-name adot 2>/dev/null
# vs
kubectl get deploy -A -l "app=opentelemetry-collector" 2>/dev/null

Recommendations Based on Findings

Finding	Recommendation
No metrics solution	Enable Container Insights or deploy Prometheus
No control plane logs	Enable all log types for debugging/audit
No tracing	Consider ADOT for distributed tracing
Multiple overlapping tools	Consolidate to reduce overhead
No metrics server	Deploy for HPA functionality
Application Signals not enabled	Consider for APM capabilities

Table of Contents​

Prerequisites​

Detection Strategy​

Detection Commands​

1. Metrics Collection​

2. Logging Configuration​

3. Tracing​

4. Application Signals (APM)​

Output Schema​

Edge Cases​

Multiple Metrics Solutions​

Log Aggregation Outside Cluster​

Control Plane Logging Not Enabled​

ADOT vs Self-Managed Collectors​

Recommendations Based on Findings​

Table of Contents