Observability Agent with Amazon Bedrock AgentCore

An AI-powered observability agent that helps SREs investigate incidents and reduce Mean Time to Resolution (MTTR) using Amazon Bedrock AgentCore, Amazon OpenSearch Serverless, and Amazon Managed Service for Prometheus.


Overview

When an incident triggers alerts, SREs typically jump between multiple dashboards, write ad hoc queries, and correlate logs with traces to find the root cause. This process is largely manual and creates significant cognitive load.

This project implements an observability agent that automates incident investigation by querying logs, traces, and metrics, then providing root cause analysis with actionable recommendations.

Based on the architecture from "Reduce Mean Time to Resolution with an observability agent" on the AWS Big Data Blog.

Architecture

[Figure: Observability Agent Architecture]

Components

| Component | Purpose |
|---|---|
| Amazon Bedrock AgentCore Runtime | Hosts and executes the AI agent |
| Amazon Bedrock AgentCore Memory | Maintains conversation context across sessions |
| Amazon OpenSearch Serverless | Stores logs and distributed traces |
| Amazon Managed Service for Prometheus | Stores infrastructure metrics |
| Anthropic Claude Sonnet 4.5 | Provides reasoning capabilities |

Agent Tools

| Tool | Data Source | Description |
|---|---|---|
| get_red_metrics | OpenSearch (traces) | Rate, Error, Duration metrics aggregated by service |
| search_logs | OpenSearch (logs) | Search application logs by service and severity |
| get_spans | OpenSearch (traces) | Search distributed trace spans across services |
| query_metrics | Amazon Managed Service for Prometheus | Query infrastructure metrics using PromQL |
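
To make the RED (Rate, Error, Duration) aggregation concrete, here is a minimal in-memory sketch of the arithmetic. It is not the agent's implementation, which presumably aggregates inside OpenSearch; the span field names (`service`, `status`, `duration_ms`) and the nearest-rank p95 are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def red_metrics(spans, window_seconds):
    """Aggregate request rate, error rate, and p95 duration per service."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span)

    summary = {}
    for service, items in grouped.items():
        durations = sorted(s["duration_ms"] for s in items)
        errors = sum(1 for s in items if s["status"] == "ERROR")
        # Nearest-rank percentile: smallest value with >= 95% of samples at or below it.
        rank = max(1, math.ceil(0.95 * len(durations)))
        summary[service] = {
            "rate_per_s": len(items) / window_seconds,
            "error_rate": errors / len(items),
            "p95_ms": durations[rank - 1],
        }
    return summary


# Hypothetical spans observed during a 60-second window.
sample = [
    {"service": "payment", "status": "ERROR", "duration_ms": 900},
    {"service": "payment", "status": "OK", "duration_ms": 120},
    {"service": "cart", "status": "OK", "duration_ms": 40},
]
print(red_metrics(sample, window_seconds=60))
```

Grouping by service first keeps each of the three signals comparable across services, which is what lets a health overview rank the worst offender.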

Choose Your Path

Option A: Quick Start (Build from Scratch)

Best for: Learning, POC, testing the agent with sample data.

Time: ~15 minutes
Cost: ~$5/day for OpenSearch Serverless

Option B: Integrate with Existing Infrastructure

Best for: Production use with your existing observability stack.

Time: ~10 minutes
Prerequisites: Existing OpenSearch + Prometheus

Quick Start (Build from Scratch)

Prerequisites

  • AWS account with appropriate permissions
  • Python 3.11+
  • AWS CLI configured with credentials

Step 1: Clone and Setup

git clone https://github.com/aws-samples/sample-observability-agent-bedrock-agentcore.git
cd sample-observability-agent-bedrock-agentcore

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Step 2: Create OpenSearch Serverless Collection

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-east-1

# Create encryption policy
aws opensearchserverless create-security-policy \
  --name observability-enc --type encryption \
  --policy '{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]}],"AWSOwnedKey":true}'

# Create network policy
aws opensearchserverless create-security-policy \
  --name observability-net --type network \
  --policy '[{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]},{"ResourceType":"dashboard","Resource":["collection/observability-agent"]}],"AllowFromPublic":true}]'

# Create data access policy
aws opensearchserverless create-access-policy \
  --name observability-access --type data \
  --policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\"]}]"

# Create collection
aws opensearchserverless create-collection \
  --name observability-agent --type SEARCH

Wait for the collection to become ACTIVE (~2-3 minutes):

aws opensearchserverless batch-get-collection --names observability-agent
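
If you would rather not re-run the status command by hand, the wait can be scripted. The sketch below is one possible polling loop, not part of the sample; `wait_until_active` and its parameters are hypothetical names, and the status source is injected so the loop can be exercised without AWS access.

```python
import time


def wait_until_active(fetch_status, timeout_s=300, interval_s=10, sleep=time.sleep):
    """Poll fetch_status() until it returns 'ACTIVE'; return the seconds waited."""
    waited = 0
    while True:
        status = fetch_status()
        if status == "ACTIVE":
            return waited
        if status == "FAILED":
            raise RuntimeError("collection creation failed")
        if waited >= timeout_s:
            raise TimeoutError(f"collection still {status} after {waited}s")
        sleep(interval_s)
        waited += interval_s


# Possible wiring with boto3 (assumes boto3 is installed and credentials are configured):
#   import boto3
#   client = boto3.client("opensearchserverless")
#   fetch = lambda: client.batch_get_collection(
#       names=["observability-agent"])["collectionDetails"][0]["status"]
#   wait_until_active(fetch)
```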

Step 3: Configure Environment

export OPENSEARCH_HOST=$(aws opensearchserverless batch-get-collection \
  --names observability-agent \
  --query 'collectionDetails[0].collectionEndpoint' \
  --output text | sed 's|https://||')
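
The `sed` call strips the `https://` scheme so that `OPENSEARCH_HOST` holds a bare hostname. A rough Python equivalent is sketched below for cases where the endpoint arrives from another source; `normalize_host` is a hypothetical helper, not a function from the repository. It also rejects the literal `None` that the AWS CLI prints in text mode when a field is null, which can happen if the collection is not yet ACTIVE.

```python
from urllib.parse import urlparse


def normalize_host(endpoint):
    """Return the bare hostname from an endpoint given with or without a scheme."""
    endpoint = endpoint.strip()
    if endpoint in ("", "None"):
        raise ValueError("endpoint is empty; is the collection ACTIVE yet?")
    # Prefix '//' so urlparse treats a scheme-less value as a network location.
    parsed = urlparse(endpoint if "://" in endpoint else f"//{endpoint}")
    if not parsed.hostname:
        raise ValueError(f"not a valid endpoint: {endpoint!r}")
    return parsed.hostname


print(normalize_host("https://abc123.us-east-1.aoss.amazonaws.com"))
```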

Step 4: Generate Test Data

python scripts/generate_test_data.py

This creates sample logs and traces simulating a payment service failure with ~40% error rate.
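
For intuition about what such data looks like, here is a small sketch that produces spans with roughly a 40% failure rate. It is an illustration only and does not reproduce `scripts/generate_test_data.py`; the field names, duration ranges, and seed are all assumptions.

```python
import random


def make_spans(n, error_rate=0.4, seed=7):
    """Generate n synthetic payment-service spans with the given failure probability."""
    rng = random.Random(seed)  # Seeded so repeated runs produce the same data.
    spans = []
    for _ in range(n):
        failed = rng.random() < error_rate
        spans.append({
            "traceId": f"{rng.getrandbits(128):032x}",
            "service": "payment",
            "status": "ERROR" if failed else "OK",
            # Failed requests are made slower to mimic timeouts.
            "duration_ms": rng.randint(800, 3000) if failed else rng.randint(20, 200),
        })
    return spans


spans = make_spans(1000)
observed = sum(s["status"] == "ERROR" for s in spans) / len(spans)
print(f"observed error rate: {observed:.1%}")
```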

Step 5: Deploy the Agent

pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy

Step 6: Grant Permissions

The AgentCore execution role needs access to OpenSearch Serverless. Get the role name from the deploy output.

export AGENTCORE_ROLE=AmazonBedrockAgentCoreSDKRuntime-us-east-1-XXXXXX
export COLLECTION_ID=$(aws opensearchserverless batch-get-collection \
  --names observability-agent \
  --query 'collectionDetails[0].id' --output text)

# Add OpenSearch permissions
aws iam put-role-policy \
  --role-name $AGENTCORE_ROLE \
  --policy-name OpenSearchServerlessAccess \
  --policy-document "{
    \"Version\": \"2012-10-17\",
    \"Statement\": [{
      \"Effect\": \"Allow\",
      \"Action\": [\"aoss:APIAccessAll\"],
      \"Resource\": \"arn:aws:aoss:${AWS_REGION}:${AWS_ACCOUNT_ID}:collection/${COLLECTION_ID}\"
    }]
  }"

# Update data access policy to include the role
POLICY_VERSION=$(aws opensearchserverless get-access-policy \
  --name observability-access --type data \
  --query 'accessPolicyDetail.policyVersion' --output text)

aws opensearchserverless update-access-policy \
  --name observability-access --type data \
  --policy-version $POLICY_VERSION \
  --policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\",\"arn:aws:iam::${AWS_ACCOUNT_ID}:role/${AGENTCORE_ROLE}\"]}]"
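
The escaped JSON in the `--policy` arguments is easy to break when editing by hand. One alternative, sketched here under the assumption that you are comfortable generating the document in Python, is to build the same structure with `json.dumps`. The helper name, the account ID `123456789012`, and the role suffix are placeholders.

```python
import json


def data_access_policy(collection, principals):
    """Build the data access policy document used in Steps 2 and 6 as compact JSON."""
    return json.dumps([{
        "Rules": [
            {"ResourceType": "collection",
             "Resource": [f"collection/{collection}"],
             "Permission": ["aoss:*"]},
            {"ResourceType": "index",
             "Resource": [f"index/{collection}/*"],
             "Permission": ["aoss:*"]},
        ],
        "Principal": list(principals),
    }], separators=(",", ":"))


# Placeholder principals; substitute your own account ID and role name.
print(data_access_policy("observability-agent", [
    "arn:aws:iam::123456789012:root",
    "arn:aws:iam::123456789012:role/AmazonBedrockAgentCoreSDKRuntime-us-east-1-XXXXXX",
]))
```

The printed string can then be passed to `--policy` through command substitution, which avoids the backslash escaping entirely.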

Step 7: Test the Agent

sleep 30  # Wait for IAM propagation

# Health check — uses get_red_metrics to show Rate/Error/Duration per service
agentcore invoke '{"prompt": "Give me a health overview of all my services"}'

# Error investigation — uses search_logs + get_red_metrics to find errors
agentcore invoke '{"prompt": "Are there any errors in my application?"}'

# Service deep dive — uses get_spans + search_logs to trace failures
agentcore invoke '{"prompt": "What is wrong with the payment service? Show me the error traces"}'

# Trace correlation — uses get_spans + search_logs to follow a request across services
agentcore invoke '{"prompt": "Find a failed trace and show me the full request flow across all services"}'

# Metrics query — uses query_metrics to check infrastructure health
agentcore invoke '{"prompt": "Query prometheus for CPU and memory metrics"}'

# Root cause analysis — agent combines all tools to investigate
agentcore invoke '{"prompt": "Our checkout is failing for some users. Investigate the root cause and suggest fixes"}'
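
Prompts that contain apostrophes or double quotes will break the single-quoted JSON shown above. A minimal sketch of one way around this, assuming you drive the CLI from a script, is to let `json.dumps` and `shlex.quote` handle the escaping; `invoke_command` is a hypothetical helper.

```python
import json
import shlex


def invoke_command(prompt):
    """Return a shell-safe `agentcore invoke` command line for the given prompt."""
    payload = json.dumps({"prompt": prompt})
    return "agentcore invoke " + shlex.quote(payload)


print(invoke_command("Why can't users check out?"))
```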

Integrate with Existing Infrastructure

Step 1: Clone and Setup

git clone https://github.com/aws-samples/sample-observability-agent-bedrock-agentcore.git
cd sample-observability-agent-bedrock-agentcore

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Step 2: Configure Your Endpoints

export OPENSEARCH_HOST=your-collection-id.us-east-1.aoss.amazonaws.com
export AMP_WORKSPACE_ID=ws-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  # Optional
export AWS_REGION=us-east-1
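
As a sketch of how these three variables might be consumed, the snippet below loads them into a small settings object and derives the Prometheus query URL. The `Settings` class, `load_settings`, and the `us-east-1` default are assumptions for illustration and may not match `agent/main.py`; the URL follows the standard Amazon Managed Service for Prometheus query endpoint layout.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    opensearch_host: str
    region: str
    amp_workspace_id: Optional[str] = None

    @property
    def amp_query_url(self):
        """Prometheus-compatible query endpoint, or None when no workspace is set."""
        if not self.amp_workspace_id:
            return None
        return (f"https://aps-workspaces.{self.region}.amazonaws.com"
                f"/workspaces/{self.amp_workspace_id}/api/v1/query")


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    host = env.get("OPENSEARCH_HOST", "").strip()
    if not host:
        raise ValueError("OPENSEARCH_HOST is required")
    if "://" in host:
        raise ValueError("OPENSEARCH_HOST must be a bare hostname without a scheme")
    # The region default is an assumption made for this sketch.
    return Settings(host, env.get("AWS_REGION", "us-east-1"),
                    env.get("AMP_WORKSPACE_ID") or None)
```

Treating a missing `AMP_WORKSPACE_ID` as `None` mirrors its optional status: logs and traces remain queryable even when no metrics workspace is configured.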

Step 3: Customize Index Patterns (if needed)

If your indices use different naming conventions, update the index patterns in agent/main.py:

# Default patterns (OpenTelemetry standard)
"otel-v1-apm-span-*"  # For traces
"otel-logs-*"          # For logs

Step 4: Deploy and Grant Permissions

pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy

Then add the AgentCore role to your OpenSearch data access policy (see Step 6 in Quick Start).

Step 5: Test

agentcore invoke '{"prompt": "Show me the health of my services"}'

Security

This sample follows AWS security best practices:

  • No hardcoded credentials — Uses IAM roles for all authentication
  • TLS everywhere — All connections use HTTPS with certificate verification
  • Input validation — All tool inputs are validated and sanitized
  • Least privilege — IAM policies grant minimal required permissions
  • Query limits — Result sizes and query lengths are capped
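
The input validation and query limit points can be pictured with a short sketch. The limits, the allowed character set, and the function names below are assumptions chosen for illustration; the actual checks in the sample may differ.

```python
import re

MAX_RESULTS = 100        # Assumed cap on documents returned per tool call.
MAX_QUERY_LENGTH = 500   # Assumed cap on PromQL query length in characters.
_SERVICE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,63}")


def clean_service_name(value):
    """Accept only short names built from letters, digits, dot, underscore, hyphen."""
    if not isinstance(value, str) or not _SERVICE_NAME.fullmatch(value):
        raise ValueError("invalid service name")
    return value


def clamp_size(requested, default=20):
    """Coerce a requested result size into the range 1..MAX_RESULTS."""
    try:
        n = int(requested)
    except (TypeError, ValueError):
        return default
    return max(1, min(n, MAX_RESULTS))


def check_promql(query):
    """Reject empty or overly long PromQL strings before sending them upstream."""
    query = query.strip()
    if not query or len(query) > MAX_QUERY_LENGTH:
        raise ValueError("PromQL query is empty or too long")
    return query
```

An allow-list for names is stricter than escaping, but it is simpler to reason about when the inputs come from a language model rather than a trusted caller.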

Clean Up

OpenSearch Serverless incurs charges while active (~$5/day). Delete resources when done.

agentcore destroy

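# COLLECTION_ID is the value exported in Step 6 of the Quick Start; re-run that lookup in a new shell session
aws opensearchserverless delete-collection --id "$COLLECTION_ID"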
aws opensearchserverless delete-security-policy --name observability-enc --type encryption
aws opensearchserverless delete-security-policy --name observability-net --type network
aws opensearchserverless delete-access-policy --name observability-access --type data

About

This project is maintained by AWS Samples and licensed under the MIT-0 License.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Copyright © Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the MIT-0 License.