This page is generated from skills/eks-recon/references/observability.md. Edit the source, not this page.

Module: Observability

Part of: eks-recon
Purpose: Detect observability stack - metrics, logging, tracing configuration

Prerequisites

  • Cluster name required: Yes
  • MCP tools used: describe_eks_resource, list_k8s_resources
  • CLI fallback: aws eks, kubectl, aws logs

Detection Strategy

Observability has three pillars (plus control plane logging):

1. Metrics -> Container Insights, Prometheus, Datadog, etc.
2. Logging -> CloudWatch, FluentBit, OpenSearch, etc.
3. Tracing -> X-Ray, ADOT, Jaeger, etc.
4. Control Plane -> API server, audit, authenticator logs

Why detect each pillar:

| Pillar | Why It Matters |
|---|---|
| Metrics | Understand resource utilization, HPA scaling decisions, capacity planning |
| Logging | Debug application issues, audit security events, compliance requirements |
| Tracing | Diagnose latency in distributed systems, identify service dependencies |
| Control Plane | Investigate API failures, audit access, debug networking issues |

Detection Commands

1. Metrics Collection

Start with metrics detection to understand how the cluster tracks resource usage and supports autoscaling. Most clusters have at least one metrics solution.

Container Insights (CloudWatch):

Use Container Insights when you need AWS-native monitoring with automatic CloudWatch integration. This is the simplest option for teams already using AWS observability tools.

MCP:

describe_eks_resource(
    resource_type="addon",
    cluster_name="<cluster-name>",
    resource_name="amazon-cloudwatch-observability"
)

CLI:

# Check for CloudWatch add-on
aws eks describe-addon --cluster-name <cluster-name> \
--addon-name amazon-cloudwatch-observability 2>/dev/null

# Alternative: Check for CloudWatch agent DaemonSet
kubectl get daemonset -n amazon-cloudwatch cloudwatch-agent 2>/dev/null

# Check for Fluent Bit (CloudWatch integration)
kubectl get daemonset -n amazon-cloudwatch fluent-bit 2>/dev/null

Example output (add-on installed):

{
  "addon": {
    "addonName": "amazon-cloudwatch-observability",
    "clusterName": "prod-cluster",
    "status": "ACTIVE",
    "addonVersion": "v1.5.0-eksbuild.1"
  }
}

Example output (not installed):

An error occurred (ResourceNotFoundException) when calling the DescribeAddon operation
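
The two example outputs above can be told apart in a script by asking the CLI for just the status field. The following is a minimal sketch, not part of eks-recon: the function name `addon_status` and the `NOT_INSTALLED` sentinel are assumptions made for illustration. It treats any CLI failure as "not installed", so credential or permission errors would need separate handling.

```shell
# Hypothetical helper: print the add-on status (for example ACTIVE or DEGRADED),
# or NOT_INSTALLED when describe-addon fails (for example ResourceNotFoundException).
# Note: any CLI failure, including auth errors, is reported as NOT_INSTALLED.
addon_status() {
  cluster="$1"
  addon="$2"
  if status=$(aws eks describe-addon --cluster-name "$cluster" \
      --addon-name "$addon" --query 'addon.status' --output text 2>/dev/null); then
    printf '%s\n' "$status"
  else
    printf 'NOT_INSTALLED\n'
  fi
}

# Usage (requires AWS credentials):
#   addon_status <cluster-name> amazon-cloudwatch-observability
```

The same helper works for any add-on name, so it can be reused for the `adot` check in the tracing section.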

Prometheus (Self-Managed):

Use Prometheus when you need flexible metrics collection with PromQL queries, custom recording rules, or Grafana dashboards. Common in teams with existing Prometheus expertise.

# Check for Prometheus deployment
kubectl get deploy -n prometheus prometheus-server 2>/dev/null || \
kubectl get deploy -n monitoring prometheus-server 2>/dev/null || \
kubectl get statefulset -n prometheus prometheus-server 2>/dev/null

# Check for kube-prometheus-stack (Helm)
# --filter takes a regular expression, so alternation is a plain "|"
helm list -A --filter "prometheus|kube-prometheus" 2>/dev/null

# Check for Prometheus Operator
kubectl get deploy -A -l "app.kubernetes.io/name=prometheus-operator" 2>/dev/null

Example output (Prometheus detected):

NAME READY UP-TO-DATE AVAILABLE AGE
prometheus-server 1/1 1 1 45d

Amazon Managed Prometheus (AMP):

Check for AMP when you need managed Prometheus with automatic scaling and AWS integration. Look for aps-workspaces URLs in remote write configurations.

# Check for ADOT or Prometheus remote write config
# (.data // {} skips ConfigMaps without data; any() reports each ConfigMap once)
kubectl get configmap -A -o json | jq -r '
.items[] |
select(any(.data // {} | .[]; contains("aps-workspaces"))) |
{namespace: .metadata.namespace, name: .metadata.name}'

Example output (AMP configured):

{
  "namespace": "prometheus",
  "name": "prometheus-config"
}
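
Once a ConfigMap containing an `aps-workspaces` URL is found, the region and workspace ID can be pulled out of the remote write endpoint. This sketch assumes the standard AMP endpoint shape `https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write`; the function name `amp_workspaces` is hypothetical, and the workspace ID in the usage comment stays a placeholder.

```shell
# Hypothetical helper: read config text on stdin and print one
# "<region> <workspace-id>" line per distinct AMP workspace referenced.
amp_workspaces() {
  grep -oE 'aps-workspaces\.[a-z0-9-]+\.amazonaws\.com/workspaces/ws-[A-Za-z0-9-]+' |
    sed -E 's#aps-workspaces\.([a-z0-9-]+)\.amazonaws\.com/workspaces/(ws-[A-Za-z0-9-]+)#\1 \2#' |
    sort -u
}

# Usage (namespace and name taken from the detection output above):
#   kubectl get configmap -n prometheus prometheus-config -o yaml | amp_workspaces
```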

Grafana:

# Check for Grafana deployment
kubectl get deploy -A -l "app.kubernetes.io/name=grafana" 2>/dev/null

# Check for Amazon Managed Grafana (external, check workspace)
# Note: AMG workspaces are external to cluster

Other Metrics Tools:

Third-party tools like Datadog and New Relic provide unified observability platforms. Check for these when the team uses a commercial APM solution.

# Datadog
kubectl get daemonset -n datadog datadog-agent 2>/dev/null

# New Relic
kubectl get daemonset -A -l "app.kubernetes.io/name=nri-bundle" 2>/dev/null

# Metrics Server (required for CPU/memory-based HPA; usually present)
kubectl get deploy -n kube-system metrics-server 2>/dev/null

Example output (Metrics Server):

NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 1/1 1 1 120d

2. Logging Configuration

Detect logging configuration to understand how application and cluster logs are collected and where they are sent. Control plane logging is critical for debugging and compliance.

Control Plane Logging:

Always check control plane logging first. Missing audit logs are a security/compliance gap.

# Check which control plane logs are enabled
aws eks describe-cluster --name <cluster-name> \
--query 'cluster.logging.clusterLogging[*].{types:types,enabled:enabled}'

Example output (all logs enabled):

[
  {
    "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
    "enabled": true
  }
]

Example output (no logs enabled - flag this):

[
  {
    "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
    "enabled": false
  }
]
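
A cluster can also have only some log types enabled, which neither example above shows. The sketch below lists the types that are missing so a partial configuration can be flagged. The function name `missing_log_types` is an assumption for illustration; it expects the enabled types as whitespace-separated words, which is what `--output text` produces for a flattened list.

```shell
# Hypothetical helper: print the control plane log types that are NOT enabled.
# Arguments: the enabled types, as separate words or one tab-separated string.
missing_log_types() {
  enabled=" $(printf '%s' "$*" | tr '\t\n' '  ') "
  missing=""
  for t in api audit authenticator controllerManager scheduler; do
    case "$enabled" in
      *" $t "*) ;;                      # type is enabled
      *) missing="$missing $t" ;;       # type is missing
    esac
  done
  printf '%s\n' "${missing# }"
}

# Usage (requires AWS credentials):
#   enabled=$(aws eks describe-cluster --name <cluster-name> \
#     --query 'cluster.logging.clusterLogging[?enabled==`true`].types[]' --output text)
#   missing_log_types "$enabled"
```

An empty result means all five log types are enabled; if `audit` appears in the output, treat it as the compliance gap described above.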

Fluent Bit / Fluentd:

Fluent Bit (lightweight) and Fluentd (feature-rich) are the most common log forwarders. Check their ConfigMaps to determine where logs are being sent.

# Fluent Bit DaemonSet
kubectl get daemonset -A -l "app.kubernetes.io/name=fluent-bit" 2>/dev/null

# Fluentd DaemonSet
kubectl get daemonset -A -l "app=fluentd" 2>/dev/null

# Check Fluent Bit ConfigMap for destinations
kubectl get configmap -n amazon-cloudwatch fluent-bit-config -o yaml 2>/dev/null | \
grep -E "cloudwatch|opensearch|s3|kinesis" || true

Example output (Fluent Bit detected):

NAMESPACE NAME DESIRED CURRENT READY AGE
amazon-cloudwatch fluent-bit 3 3 3 60d
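
The keyword grep above matches a destination name anywhere in the ConfigMap, including comments. A somewhat stricter sketch, assuming the ConfigMap uses Fluent Bit's classic (INI-style) configuration format, reads only the `Name` key of each `[OUTPUT]` section. The function name `fluent_bit_outputs` is hypothetical, and YAML-format Fluent Bit configs are not handled.

```shell
# Hypothetical helper: read a classic-format Fluent Bit config on stdin and
# print the distinct output plugin names (for example cloudwatch_logs, s3).
fluent_bit_outputs() {
  awk '
    # A section header starts a new block; remember whether it is [OUTPUT].
    /^[ \t]*\[/ { in_output = (toupper($1) == "[OUTPUT]"); next }
    # Inside an [OUTPUT] block, the Name key holds the plugin name.
    in_output && tolower($1) == "name" { print $2 }
  ' | sort -u
}

# Usage:
#   kubectl get configmap -n amazon-cloudwatch fluent-bit-config -o yaml 2>/dev/null |
#     fluent_bit_outputs
```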

OpenSearch / Elasticsearch:

# Check for OpenSearch endpoint in configs
# (.data // {} skips ConfigMaps without data; any() reports each ConfigMap once)
kubectl get configmap -A -o json | jq -r '
.items[] |
select(any(.data // {} | .[]; contains("opensearch") or contains("elasticsearch"))) |
{namespace: .metadata.namespace, name: .metadata.name}'

Loki:

# Check for Loki deployment
kubectl get deploy -A -l "app.kubernetes.io/name=loki" 2>/dev/null
kubectl get statefulset -A -l "app.kubernetes.io/name=loki" 2>/dev/null

3. Tracing

Tracing is essential for debugging latency in microservices architectures. Without tracing, diagnosing cross-service issues requires correlating logs manually.

AWS X-Ray / ADOT:

ADOT (AWS Distro for OpenTelemetry) is the AWS-recommended approach for tracing. It can send traces to X-Ray, Jaeger, or other backends.

# Check for ADOT collector
kubectl get deploy -A -l "app.kubernetes.io/name=aws-otel-collector" 2>/dev/null

# Check for X-Ray daemon
kubectl get daemonset -A -l "app=xray-daemon" 2>/dev/null

# Check ADOT add-on
aws eks describe-addon --cluster-name <cluster-name> --addon-name adot 2>/dev/null

Example output (ADOT add-on installed):

{
  "addon": {
    "addonName": "adot",
    "clusterName": "prod-cluster",
    "status": "ACTIVE",
    "addonVersion": "v0.88.0-eksbuild.1"
  }
}

Jaeger:

Jaeger is a popular open-source tracing backend. Check for it when the team uses a self-managed tracing solution.

# Check for Jaeger
kubectl get deploy -A -l "app.kubernetes.io/name=jaeger" 2>/dev/null
kubectl get deploy -A -l "app=jaeger" 2>/dev/null

Tempo:

Grafana Tempo is often used with Grafana and Loki for a unified observability stack.

# Check for Grafana Tempo
kubectl get deploy -A -l "app.kubernetes.io/name=tempo" 2>/dev/null
kubectl get statefulset -A -l "app.kubernetes.io/name=tempo" 2>/dev/null

4. Application Signals (APM)

Application Signals provides automatic instrumentation for common frameworks. Check for this when the team wants APM without modifying application code.

# Check for CloudWatch Application Signals
kubectl get deploy -n amazon-cloudwatch cloudwatch-agent-operator 2>/dev/null

# Check for auto-instrumentation
kubectl get instrumentations.opentelemetry.io -A 2>/dev/null

Example output (auto-instrumentation configured):

NAMESPACE NAME AGE ENDPOINT
default java-app 30d http://adot-collector:4317
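
To fill a list of instrumented namespaces from this output, the first column can be reduced to distinct values. A minimal sketch, with the hypothetical function name `instrumented_namespaces`, assuming the input comes from `kubectl get ... -A --no-headers` so that the first column is the namespace:

```shell
# Hypothetical helper: read "kubectl get instrumentations -A --no-headers"
# output on stdin and print each namespace once.
instrumented_namespaces() {
  awk 'NF { print $1 }' | sort -u
}

# Usage:
#   kubectl get instrumentations.opentelemetry.io -A --no-headers 2>/dev/null |
#     instrumented_namespaces
```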

Output Schema

observability:
  metrics:
    container_insights:
      enabled: bool
      addon_version: string

    prometheus:
      detected: bool
      type: string # self-managed | amp | operator
      namespace: string
      version: string

    grafana:
      detected: bool
      type: string # self-managed | amg
      namespace: string

    metrics_server:
      detected: bool
      version: string

    other_tools: list # datadog, newrelic, etc.

  logging:
    control_plane:
      enabled: bool
      log_types: list # api, audit, authenticator, controllerManager, scheduler

    application:
      tool: string # fluent-bit | fluentd | promtail | none
      destination: string # cloudwatch | opensearch | loki | s3
      namespace: string

    log_destinations:
      cloudwatch: bool
      opensearch: bool
      s3: bool
      loki: bool

  tracing:
    tool: string # xray | adot | jaeger | tempo | none
    adot:
      detected: bool
      version: string
    xray:
      detected: bool
    jaeger:
      detected: bool
    tempo:
      detected: bool

  apm:
    application_signals:
      enabled: bool
    auto_instrumentation:
      enabled: bool
      namespaces: list
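
As an illustration of how detection results might be turned into this schema, the sketch below prints the control plane logging fragment from a list of enabled log types. The function name `emit_control_plane` is hypothetical, and treating "enabled" as "at least one log type is on" is an assumption; the schema does not say how a partially enabled cluster should be reported.

```shell
# Hypothetical helper: print the logging.control_plane fragment of the schema.
# Arguments: the enabled log types (none means logging is disabled).
emit_control_plane() {
  if [ "$#" -gt 0 ]; then enabled=true; else enabled=false; fi
  types=$(printf '%s, ' "$@")   # "api, audit, " (or ", " with no arguments)
  types=${types%, }             # drop the trailing separator
  printf 'logging:\n  control_plane:\n    enabled: %s\n    log_types: [%s]\n' \
    "$enabled" "$types"
}

# Usage:
#   emit_control_plane api audit
```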

Edge Cases

Multiple Metrics Solutions

It is common for a cluster to run several at once:

  • Metrics Server (for HPA)
  • Prometheus (for detailed metrics)
  • Container Insights (for AWS integration)

Record every solution found and the purpose it serves.

Log Aggregation Outside Cluster

Logs may go to:

  • CloudWatch in a different AWS account
  • Third-party SaaS (Datadog, Splunk)
  • Self-managed OpenSearch/ELK

Check Fluent Bit/Fluentd configs for destinations.

Control Plane Logging Not Enabled

# Check if any logs are enabled
aws eks describe-cluster --name <cluster-name> \
--query 'cluster.logging.clusterLogging[?enabled==`true`].types'

If the result is empty, flag it as a security/compliance gap.

ADOT vs Self-Managed Collectors

# Check if using ADOT add-on or self-managed
aws eks describe-addon --cluster-name <cluster-name> --addon-name adot 2>/dev/null
# vs
kubectl get deploy -A -l "app=opentelemetry-collector" 2>/dev/null

Recommendations Based on Findings

| Finding | Recommendation |
|---|---|
| No metrics solution | Enable Container Insights or deploy Prometheus |
| No control plane logs | Enable all log types for debugging/audit |
| No tracing | Consider ADOT for distributed tracing |
| Multiple overlapping tools | Consolidate to reduce overhead |
| No metrics server | Deploy for HPA functionality |
| Application Signals not enabled | Consider for APM capabilities |
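
The findings table can be encoded as a simple lookup so that a report step prints the matching recommendation. This is a sketch only: the function name `recommend` and the finding keys (`no_metrics`, `no_tracing`, and so on) are hypothetical labels, while the recommendation text is taken from the table.

```shell
# Hypothetical helper: map a finding key to the recommendation from the table.
# Returns non-zero for keys the table does not cover.
recommend() {
  case "$1" in
    no_metrics)             echo "Enable Container Insights or deploy Prometheus" ;;
    no_control_plane_logs)  echo "Enable all log types for debugging/audit" ;;
    no_tracing)             echo "Consider ADOT for distributed tracing" ;;
    overlapping_tools)      echo "Consolidate to reduce overhead" ;;
    no_metrics_server)      echo "Deploy for HPA functionality" ;;
    no_application_signals) echo "Consider for APM capabilities" ;;
    *) echo "No recommendation defined for finding: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   for finding in no_tracing no_metrics_server; do recommend "$finding"; done
```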