Evaluation
Evaluation pipeline
This document describes how the platform evaluates agent performance and uses that feedback to improve over time. It aligns with the design principle that the system should be easy to observe and improve. The evaluation pipeline is a future enhancement; the MVP relies on manual inspection of task outcomes and logs.
Purpose
- Measure agent quality — How well does the agent follow instructions, avoid reasoning errors, and produce correct, testable outcomes?
- Learn from failures — Categorize why tasks fail (timeout, missing tests, wrong approach, tool errors) and feed that back into prompts or memory so future runs avoid the same mistakes.
- Improve over time — Use evaluation results to tune system prompts, context hydration, and (future) model or tool selection.
What to evaluate
The plans call for automated trace analysis and failure categorization:
- Reasoning errors — Agent went down a wrong path, misunderstood the task, or made incorrect assumptions.
- Failure to follow instructions — Task spec or issue was clear but the agent did not comply (e.g. skipped tests, changed the wrong scope).
- Missing testing or verification — Agent did not run tests, did not run linters, or did not document how to verify the change.
- Running out of time — Task hit the 8-hour or idle timeout before completing; partial work may still be on the branch.
- Tool or environment failures — GitHub API errors, clone failures, build failures that the agent could not recover from.
Evaluation can be manual (human review of PRs and logs) or automated (scripts or ML that analyze traces, PR content, and task outcomes). The pipeline is the place where automated analysis runs and writes structured results.
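As a minimal sketch of the automated side, the categories above could be represented as a union type with a rule-based first pass. The `TaskOutcome` shape, the `ranTests` flag, and the keyword pattern are assumptions made for illustration; reasoning errors and instruction-following failures are listed in the type but left to human or LLM-based review, since simple rules cannot detect them.

```typescript
// Hypothetical shapes: `status` and `error_message` mirror names used in this
// document, but the exact schema is an assumption.
type TaskStatus = "COMPLETED" | "FAILED" | "TIMED_OUT";

type FailureCategory =
  | "reasoning_error"
  | "instruction_not_followed"
  | "missing_verification"
  | "timeout"
  | "tool_or_environment"
  | "uncategorized";

interface TaskOutcome {
  status: TaskStatus;
  error_message?: string;
  ranTests: boolean; // assumed to be derived from trace or log analysis
}

// Illustrative keyword pattern only; a real pipeline would tune this.
const TOOL_ERROR_PATTERN = /(github api|clone failed|build failed)/i;

// Returns null when no issue is detected by these simple rules.
function categorizeFailure(outcome: TaskOutcome): FailureCategory | null {
  if (outcome.status === "TIMED_OUT") return "timeout";
  if (outcome.status === "FAILED") {
    if (outcome.error_message && TOOL_ERROR_PATTERN.test(outcome.error_message)) {
      return "tool_or_environment";
    }
    return "uncategorized"; // left for human or LLM-based review
  }
  // COMPLETED tasks can still have verification gaps.
  return outcome.ranTests ? null : "missing_verification";
}
```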
Data sources
- Task outcomes — Status (COMPLETED, FAILED, TIMED_OUT), `error_message`, `pr_url`, branch state.
- TaskEvents — Audit log of what happened (agent_started, pr_created, task_completed, task_failed, etc.).
- Agent logs and traces — CloudWatch logs from the AgentCore Runtime session; future: OpenTelemetry traces, reasoning steps, tool calls (if captured and stored).
- Code artifacts — PR description, commits, diff; links to repo, branch, and issue (code attribution).
- PR outcome signals — Whether the PR was merged, revised, or rejected. Tracked via GitHub webhooks for `pull_request.closed` events (checking the `merged` flag). A merged PR is a positive signal on the task episode; a PR closed without merge is a negative signal. Over time, these outcome signals enable the evaluation pipeline to identify which approaches succeed and which fail for a given repo, and to correlate outcomes with prompt versions, memory state, and context hydration quality. See MEMORY.md (PR outcome signals).
- Review feedback — PR review comments captured via the review feedback memory loop (see MEMORY.md). Reviewer comments, requested changes, and approval/rejection status are structured evaluation data: they encode what the agent got wrong and what the team expects.
These are the same data that observability and code attribution capture. Evaluation consumes them to produce scores, categories, or recommendations.
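The PR outcome signal described above can be derived from a small subset of the GitHub `pull_request` webhook payload. The sketch below assumes the handler has already verified and parsed the webhook; `outcomeSignal` and `OutcomeSignal` are hypothetical names, while `action`, `pull_request.merged`, and `pull_request.html_url` are fields of the GitHub payload.

```typescript
// Minimal subset of a GitHub `pull_request` webhook payload.
interface PullRequestEvent {
  action: string;
  pull_request: { html_url: string; merged: boolean };
}

type OutcomeSignal = "positive" | "negative";

// Returns null for events that carry no outcome signal (opened, edited, ...).
function outcomeSignal(event: PullRequestEvent): OutcomeSignal | null {
  if (event.action !== "closed") return null;
  // Merged means positive; closed without merge means negative.
  return event.pull_request.merged ? "positive" : "negative";
}
```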
Outputs and feedback loop
- Structured evaluation results — Per task: success/failure, category, suggested prompt or memory updates.
- Feedback into memory — Insights (e.g. “this repo’s tests require env X”) or failure summaries written to AgentCore Memory so they can be retrieved during context hydration for future tasks.
- Feedback into prompts — System prompt or hydration templates updated to avoid known failure modes (e.g. “always run tests before opening PR” or “for repo X, run lint with `--fix` first”).
See MEMORY.md for how insights and evaluation feedback are stored and used. See OBSERVABILITY.md for the “Future: evaluation pipeline” section and how observability data feeds evaluation.
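One possible shape for a structured evaluation result, and for the step that turns it into memory insights, is sketched below. Every field name here is an assumption for illustration; the actual schema is defined by the memory and orchestrator designs, not by this example.

```typescript
// Hypothetical result shape; field names are illustrative.
interface EvaluationResult {
  taskId: string;
  repo: string;
  success: boolean;
  category?: string; // failure category, when success is false
  suggestedMemoryUpdates: string[];
  suggestedPromptUpdates: string[];
}

// Hypothetical record written to memory for later context hydration.
interface MemoryInsight {
  repo: string;
  text: string;
  sourceTaskId: string;
}

// Turns the memory suggestions of one result into insight records.
function toMemoryInsights(result: EvaluationResult): MemoryInsight[] {
  return result.suggestedMemoryUpdates.map((text) => ({
    repo: result.repo,
    text,
    sourceTaskId: result.taskId,
  }));
}
```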
Agent self-feedback
At the end of each task, the platform explicitly prompts the agent to report what context it lacked. In practice, the agent can often identify missing context that affected execution quality. This is a lightweight, high-value signal source.
- Mechanism — After the agent completes its work (success or failure) but before the session ends, the orchestrator (or agent harness) sends a follow-up prompt: “What information, context, or instructions were missing that would have helped you complete this task more effectively?” The agent’s response is captured as a structured insight.
- Storage — The response is persisted in long-term memory (see MEMORY.md) with metadata: `task_id`, `repo`, `insight_type: "agent_self_feedback"`, `timestamp`. This enables retrieval during context hydration for future tasks on the same repo.
- Feedback loop — Over time, recurring themes in agent self-feedback (e.g. “I needed to know that this repo uses a custom linter”) can be surfaced in evaluation dashboards and used to update per-repo system prompts or onboarding artifacts. The evaluation pipeline can aggregate self-feedback by repo and extract patterns.
- Cost — The follow-up prompt is a single additional turn (minimal token cost). The value of the signal justifies the cost.
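The storage step above could look roughly like this. The metadata keys come from the list above; the `content` field, the function name, and the choice to skip empty responses are assumptions, and the actual write to AgentCore Memory is left out.

```typescript
// Follow-up prompt quoted from the mechanism described above.
const SELF_FEEDBACK_PROMPT =
  "What information, context, or instructions were missing that would have helped you complete this task more effectively?";

interface SelfFeedbackInsight {
  task_id: string;
  repo: string;
  insight_type: "agent_self_feedback";
  timestamp: string; // ISO 8601
  content: string; // assumed field holding the agent's answer
}

// Builds the record to persist; returns null when the agent gave no answer.
function buildSelfFeedbackInsight(
  taskId: string,
  repo: string,
  agentResponse: string,
  now: Date = new Date(),
): SelfFeedbackInsight | null {
  const content = agentResponse.trim();
  if (content.length === 0) return null;
  return {
    task_id: taskId,
    repo,
    insight_type: "agent_self_feedback",
    timestamp: now.toISOString(),
    content,
  };
}
```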
Prompt versioning and A/B evaluation
System prompts (platform default + per-repo overrides) should be treated as versioned, testable artifacts, not opaque strings. Static, version-controlled prompts are generally more evaluable than ad hoc prompt assembly.
- Prompt versioning — Each system prompt variant is stored with a version identifier (hash or semantic version). When a task is created, the `prompt_version` is recorded in the task record (see ORCHESTRATOR.md data model). This enables correlation: “did merge rates improve after prompt version X was deployed for repo Y?”
- A/B comparison (future) — A framework for running the same task type with two prompt variants and comparing outcomes (merge rate, failure rate, token usage, duration). This requires: (a) a way to assign tasks to prompt variants (e.g. random split or deterministic by task ID), (b) outcome tracking per variant, and (c) a comparison dashboard. Deferred to Iteration 5; the versioning and correlation capability (Iteration 3b) is the foundation.
- Prompt change tracking — Prompt diffs between versions should be reviewable (like code diffs). Store prompt versions in a versioned store (e.g. DynamoDB with version history, or as files in the repo’s onboarding config). This supports audit and rollback.
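The deterministic assignment and the per-version correlation mentioned above can be sketched as follows. The FNV-1a hash is one illustrative choice of a stable hash, not something the design prescribes, and `TaskRecord`, `assignVariant`, and `mergeRateByPromptVersion` are hypothetical names.

```typescript
// 32-bit FNV-1a over UTF-16 code units; any stable hash would work here.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// The same task ID always maps to the same prompt variant.
function assignVariant(taskId: string, variants: string[]): string {
  const variant = variants[fnv1a(taskId) % variants.length];
  if (variant === undefined) throw new Error("variants must be non-empty");
  return variant;
}

// Hypothetical task record subset used for correlation.
interface TaskRecord {
  task_id: string;
  prompt_version: string;
  merged: boolean;
}

// Merge rate (0..1) per prompt version.
function mergeRateByPromptVersion(tasks: TaskRecord[]): Map<string, number> {
  const totals = new Map<string, { merged: number; total: number }>();
  for (const task of tasks) {
    const entry = totals.get(task.prompt_version) ?? { merged: 0, total: 0 };
    entry.total += 1;
    if (task.merged) entry.merged += 1;
    totals.set(task.prompt_version, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach((entry, version) => rates.set(version, entry.merged / entry.total));
  return rates;
}
```

Hash-based assignment keeps retries of the same task on the same variant, which a random split would not guarantee.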
Memory effectiveness metrics
The primary measure of memory’s value is: does the agent produce better PRs over time? These metrics track that:
| Metric | How to measure | What improvement looks like |
|---|---|---|
| First-review merge rate | % of PRs merged without revision requests | Increases over time on the same repo |
| Revision cycles | Average number of review rounds before merge | Decreases over time |
| CI pass rate on first push | % of PRs where CI passes on the initial push | Increases as the agent learns repo-specific build quirks |
| Review comment density | Number of reviewer comments per PR | Decreases as the agent internalizes review patterns |
| Repeated mistakes | Same reviewer comment appearing across multiple PRs | Should drop to zero after the feedback loop captures the rule |
| Time to PR | Duration from task submission to PR creation | May decrease as the agent reuses past approaches |
The most telling metric is repeated mistakes. If a reviewer says “don’t use `any` types” on PR #10 and the agent uses `any` types again on PR #15, the review feedback memory has failed. This metric requires tracking review comments across PRs and detecting semantic duplicates.
Semantic similarity dependency: Detecting repeated mistakes requires embedding-based similarity between review comments; simple string matching is insufficient (“don’t use `any`” vs. “please use proper TypeScript types instead of `any`” are the same feedback). Implementation approach:
- The review feedback extraction prompt (see MEMORY.md, Extraction prompts) should normalize comments into canonical rule forms (e.g. “Rule: use explicit TypeScript types, not `any`”) to make downstream deduplication easier.
- New review comments are compared against the history of stored rules using embedding similarity (Bedrock embedding model or AgentCore’s built-in semantic search). A similarity score above a threshold (e.g. 0.85) indicates a repeated mistake.
- This is a lightweight ML task that runs as part of the evaluation pipeline, not a separate system.
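The comparison step can be illustrated with plain cosine similarity over precomputed embedding vectors. How the embeddings are obtained (Bedrock or AgentCore search) is outside this sketch, and `StoredRule` and `findRepeatedMistake` are hypothetical names; the 0.85 default is the example threshold from the list above.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must have equal, non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as dissimilar
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical stored rule with its precomputed embedding.
interface StoredRule {
  rule: string;
  embedding: number[];
}

// Returns the most similar stored rule scoring above the threshold, or null.
function findRepeatedMistake(
  commentEmbedding: number[],
  history: StoredRule[],
  threshold = 0.85,
): StoredRule | null {
  let best: StoredRule | null = null;
  let bestScore = threshold;
  for (const candidate of history) {
    const score = cosineSimilarity(commentEmbedding, candidate.embedding);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```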
These metrics should be surfaced in the evaluation dashboard (Iteration 4/5) and broken down by repo, user, and prompt version. Correlating metrics with prompt versions (see Prompt versioning above) enables data-driven prompt improvement.
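As a rough sketch, the rate-style metrics from the table could be computed from per-PR records like this. The `PrRecord` fields are assumptions about what PR outcome tracking would store, and rates are returned as fractions between 0 and 1 rather than percentages.

```typescript
// Hypothetical per-PR record assembled from webhooks and CI results.
interface PrRecord {
  merged: boolean;
  revisionRounds: number; // rounds in which changes were requested
  ciPassedOnFirstPush: boolean;
  reviewerComments: number;
}

interface MemoryMetrics {
  firstReviewMergeRate: number; // merged with zero revision rounds / all PRs
  averageRevisionCycles: number; // averaged over merged PRs only
  ciFirstPushPassRate: number;
  reviewCommentDensity: number; // comments per PR
}

// Returns null when there is nothing to measure.
function computeMetrics(prs: PrRecord[]): MemoryMetrics | null {
  if (prs.length === 0) return null;
  const merged = prs.filter((p) => p.merged);
  const firstReview = merged.filter((p) => p.revisionRounds === 0).length;
  const totalRounds = merged.reduce((sum, p) => sum + p.revisionRounds, 0);
  const ciPasses = prs.filter((p) => p.ciPassedOnFirstPush).length;
  const comments = prs.reduce((sum, p) => sum + p.reviewerComments, 0);
  return {
    firstReviewMergeRate: firstReview / prs.length,
    averageRevisionCycles: merged.length === 0 ? 0 : totalRounds / merged.length,
    ciFirstPushPassRate: ciPasses / prs.length,
    reviewCommentDensity: comments / prs.length,
  };
}
```

Grouping the input by repo, user, or prompt version before calling the function would give the breakdowns described above.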
Tiered validation pipeline
The platform validates agent-created content through three sequential tiers before a PR is finalized. Each tier targets a different class of defect, from concrete tool failures to structural quality issues to cross-codebase impact. The tiers run as post-agent steps in the blueprint execution framework (see REPO_ONBOARDING.md).
Tier 1 — Tool validation (build, test, lint)
What it checks: Deterministic, binary pass/fail signals from the repo’s own tooling.
- Test suites (`npm test`, `pytest`, `go test`, etc.)
- Linters and formatters (`eslint`, `ruff`, `prettier`, etc.)
- Type checkers (`tsc --noEmit`, `mypy`, `pyright`)
- SAST scanners (e.g. `semgrep`, `bandit`, custom scripts)
- Build verification (`npm run build`, `cargo build`)
Implementation: The orchestrator invokes a post-agent Lambda (or runs commands inside the agent session before finalization) that executes the repo’s configured validation commands. Validation commands are discovered during onboarding (from package.json scripts, Makefile targets, CI config) or explicitly configured in the blueprint’s custom_steps.
On failure: Tool output (test failures, lint errors) is fed back to the agent for a fix cycle (up to 2 retries). If the agent cannot fix the issues, the PR is created with the failures documented in the validation report.
Status: Partially implemented — the system prompt already instructs the agent to run tests and fix errors (in-session retry, option (c) from ORCHESTRATOR.md Q6). The orchestrator-driven post-agent step (option (b)) is the Iteration 3c enhancement.
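The fix cycle described under “On failure” amounts to a bounded retry loop. The sketch below injects the validation and fix steps as functions so that it stays independent of how commands are actually run; `runTier1`, `CommandResult`, and `Tier1Outcome` are hypothetical names, and the callbacks are synchronous only for brevity (a real orchestrator step would presumably be asynchronous).

```typescript
interface CommandResult {
  command: string;
  passed: boolean;
  output: string; // tool output fed back to the agent on failure
}

interface Tier1Outcome {
  passed: boolean;
  fixCyclesUsed: number;
  failures: CommandResult[]; // documented in the validation report when non-empty
}

// Runs validation, asking the agent to fix failures up to maxFixCycles times.
function runTier1(
  runValidation: () => CommandResult[],
  requestFix: (failures: CommandResult[]) => void,
  maxFixCycles = 2,
): Tier1Outcome {
  let fixCyclesUsed = 0;
  while (true) {
    const failures = runValidation().filter((r) => !r.passed);
    if (failures.length === 0) {
      return { passed: true, fixCyclesUsed, failures: [] };
    }
    if (fixCyclesUsed >= maxFixCycles) {
      // Retries exhausted: the PR is still created, with failures reported.
      return { passed: false, fixCyclesUsed, failures };
    }
    requestFix(failures);
    fixCyclesUsed += 1;
  }
}
```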
Tier 2 — Code quality analysis
What it checks: Structural and design quality of the agent’s diff, beyond what linters catch.
| Quality dimension | What to detect | Example finding |
|---|---|---|
| DRY violations | Duplicated or near-duplicated code blocks introduced by the agent | “Lines 45–62 in auth.ts duplicate the logic in session.ts:30–47. Extract a shared helper.” |
| SOLID violations | Single responsibility breaches, interface segregation issues, dependency inversion gaps | “Class TaskHandler now handles both validation and persistence — consider splitting.” |
| Design pattern adherence | Deviations from patterns established in the codebase (factory, strategy, repository, etc.) | “Existing services use the repository pattern, but the new UserService queries DynamoDB directly.” |
| Complexity | Cyclomatic complexity, cognitive complexity, deeply nested control flow | “Function processTask has cyclomatic complexity 18 (threshold: 10).” |
| Naming and conventions | Inconsistent naming, casing, file organization relative to existing code | “get_data uses snake_case but the codebase convention is camelCase.” |
| Repo-specific rules | Custom rules from onboarding config (e.g. “no any types”, “all API handlers must validate input”) | “TypeScript any type used in handler.ts:23 — repo policy requires explicit types.” |
Implementation: A combination of:
- Static analysis tools — Complexity metrics (e.g. `eslint-plugin-complexity`, `radon`), duplication detection (e.g. `jscpd`), custom lint rules. These run as Lambda-invoked scripts.
- LLM-based review — An LLM (invoked via Bedrock) reviews the diff against the quality dimensions above. The review prompt includes: the diff, the repo’s conventions (from onboarding config / system prompt overrides), and a structured output schema. This catches semantic issues that static tools miss (SOLID violations, pattern adherence).
Output format: Structured findings:
```typescript
interface QualityFinding {
  tier: 'code-quality';
  severity: 'info' | 'warning' | 'error'; // error = blocking, warning/info = advisory
  rule: string; // e.g. "DRY", "SRP", "complexity"
  file: string;
  line?: number;
  message: string;
  suggestion?: string; // actionable fix suggestion
}
```
On failure: Findings with severity error trigger a fix cycle (agent receives the findings and attempts to address them). Findings with severity warning or info are included in the PR validation report as review comments but do not block finalization. The severity threshold for blocking vs. advisory is configurable per repo in the blueprint config.
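The blocking versus advisory split can be sketched as a simple partition over findings. The block restates the `QualityFinding` interface so that it is self-contained; `partitionFindings` and `SEVERITY_RANK` are hypothetical names, and the `blockingSeverity` parameter mirrors the per-repo setting described in the Configuration section.

```typescript
type Severity = "info" | "warning" | "error";

interface QualityFinding {
  tier: "code-quality";
  severity: Severity;
  rule: string;
  file: string;
  line?: number;
  message: string;
  suggestion?: string;
}

// Higher rank means more severe.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, error: 2 };

// Findings at or above blockingSeverity trigger a fix cycle; the rest are advisory.
function partitionFindings(
  findings: QualityFinding[],
  blockingSeverity: "error" | "warning" = "error",
): { blocking: QualityFinding[]; advisory: QualityFinding[] } {
  const threshold = SEVERITY_RANK[blockingSeverity];
  return {
    blocking: findings.filter((f) => SEVERITY_RANK[f.severity] >= threshold),
    advisory: findings.filter((f) => SEVERITY_RANK[f.severity] < threshold),
  };
}
```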
Tier 3 — Risk and blast radius analysis
What it checks: The scope, impact, and regression risk of the agent’s changes on the broader codebase.
Analysis dimensions:
| Dimension | Method | Output |
|---|---|---|
| Change surface area | Count files, lines added/removed/modified, modules touched | Quantitative metrics included in the risk report |
| Dependency graph impact | Analyze imports/exports, call graphs, and type references to identify downstream consumers of changed code | List of affected modules and their distance from the change |
| Public API changes | Detect modifications to exported functions, types, interfaces, class signatures, REST endpoints, or database schemas | Flag breaking vs. non-breaking changes |
| Shared infrastructure | Detect changes to shared utilities, base classes, configuration files, CI/CD pipelines, or infrastructure code | Elevated risk flag |
| Test coverage of affected area | Cross-reference changed code and its downstream dependents with existing test coverage (if coverage data is available from Tier 1) | Coverage gaps flagged as risk factors |
| New external dependencies | Detect additions to package.json, requirements.txt, go.mod, etc. | Flag new dependencies with license, maintenance, and security metadata |
Implementation: An LLM-based analysis step that receives:
- The full diff (`git diff` output)
- A dependency/import graph of the changed files (generated by a pre-analysis script or extracted during the agent session)
- The repo’s module structure (from onboarding artifacts or a quick `find`/`tree` snapshot)
- Test coverage data (if available from Tier 1 output)
The LLM produces a structured risk assessment following a defined output schema.
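The change surface area dimension does not need an LLM; it can be computed by a pre-analysis script. The sketch below parses the output of `git diff --numstat` (tab-separated added lines, removed lines, and path, with `-` for binary files). Using the top-level directory as the “module” is an assumption for illustration, rename entries are not specially handled, and `parseNumstat` is a hypothetical name.

```typescript
interface ChangeSurface {
  filesChanged: number;
  linesAdded: number;
  linesRemoved: number;
  modulesTouched: string[]; // top-level directories, as a rough proxy
}

// Parses `git diff --numstat` output into change surface metrics.
function parseNumstat(numstat: string): ChangeSurface {
  const surface: ChangeSurface = {
    filesChanged: 0,
    linesAdded: 0,
    linesRemoved: 0,
    modulesTouched: [],
  };
  for (const line of numstat.split("\n")) {
    const parts = line.split("\t");
    if (parts.length < 3) continue; // skip blank or malformed lines
    const added = parseInt(parts[0] ?? "", 10);
    const removed = parseInt(parts[1] ?? "", 10);
    const path = parts.slice(2).join("\t");
    surface.filesChanged += 1;
    // Binary files report "-" for both counts; count them as zero lines.
    surface.linesAdded += isNaN(added) ? 0 : added;
    surface.linesRemoved += isNaN(removed) ? 0 : removed;
    const slash = path.indexOf("/");
    const moduleName = slash === -1 ? "." : path.slice(0, slash);
    if (surface.modulesTouched.indexOf(moduleName) === -1) {
      surface.modulesTouched.push(moduleName);
    }
  }
  return surface;
}
```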
PR risk level
Every agent-created PR receives a computed risk level based on Tier 3 analysis:
| Risk level | Criteria | PR behavior |
|---|---|---|
| Low | Small change, no public API changes, high test coverage, no shared infrastructure touched | PR created normally with risk:low label |
| Medium | Moderate change surface, some downstream dependents, or partial test coverage | PR created with risk:medium label and risk summary in validation report |
| High | Large change surface, public API changes, shared infrastructure touched, low test coverage of affected area, or new external dependencies | PR created with risk:high label, detailed blast radius report, and recommendation for thorough review |
| Critical | Breaking API changes, database schema modifications, CI/CD pipeline changes, or security-sensitive code touched | PR created with risk:critical label and optional hold for human approval (foundation for HITL approval mode in Iteration 6) |
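To make the table concrete, here is one way the criteria could be turned into a deterministic rule over structured Tier 3 signals. The document describes the assessment as LLM-produced, so this rule-based mapping, the `RiskSignals` shape, and every numeric threshold (500 and 100 changed lines, 0.5 and 0.8 coverage) are assumptions for illustration only.

```typescript
type RiskLevel = "low" | "medium" | "high" | "critical";

// Hypothetical structured signals extracted from the Tier 3 analysis.
interface RiskSignals {
  linesChanged: number;
  downstreamDependents: number;
  publicApiChange: "none" | "non-breaking" | "breaking";
  sharedInfrastructureTouched: boolean;
  schemaPipelineOrSecurityTouched: boolean;
  newExternalDependencies: number;
  coverageOfAffectedArea: number; // fraction between 0 and 1
}

// Checks the most severe criteria first; all thresholds are illustrative.
function computeRiskLevel(s: RiskSignals): RiskLevel {
  if (s.publicApiChange === "breaking" || s.schemaPipelineOrSecurityTouched) {
    return "critical";
  }
  if (
    s.linesChanged > 500 ||
    s.publicApiChange === "non-breaking" ||
    s.sharedInfrastructureTouched ||
    s.coverageOfAffectedArea < 0.5 ||
    s.newExternalDependencies > 0
  ) {
    return "high";
  }
  if (s.linesChanged > 100 || s.downstreamDependents > 0 || s.coverageOfAffectedArea < 0.8) {
    return "medium";
  }
  return "low";
}

const RISK_RANK: Record<RiskLevel, number> = { low: 0, medium: 1, high: 2, critical: 3 };

// Mirrors the optional hold for human approval at or above a configured level.
function shouldHoldForApproval(
  level: RiskLevel,
  threshold: "high" | "critical" = "critical",
): boolean {
  return RISK_RANK[level] >= RISK_RANK[threshold];
}
```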
Risk level persistence: The computed risk level is stored in the task record (risk_level field) and emitted as a TaskEvent (validation_completed with risk metadata). This enables:
- Evaluation trending: track risk distribution over time, per repo, per agent prompt version
- Correlation: do high-risk PRs get rejected more often? Do they take longer to review?
- Alerting: notify team leads when a critical-risk PR is created
Validation report format: The combined output of all three tiers is posted to the PR as a structured comment (or GitHub Check Run):
```markdown
## Validation Report

### Tier 1 — Tool Validation
- Tests: PASS (42 passed, 0 failed)
- Lint: PASS (0 errors, 2 warnings)
- Type check: PASS

### Tier 2 — Code Quality
- 0 errors, 1 warning, 2 info
- ⚠️ Cognitive complexity of `processTask()` is 14 (threshold: 10)
- ℹ️ Consider extracting shared validation logic (DRY)
- ℹ️ New utility function follows existing naming conventions ✓

### Tier 3 — Risk Assessment
- **Risk level: Medium** 🟡
- Files changed: 4 | Lines: +87 / -12
- Downstream dependents: 3 modules import from changed files
- Public API changes: None
- New dependencies: None
- Test coverage of affected area: 78%
```
Configuration
Validation tiers are configured per repo in the blueprint config (stored in DynamoDB during onboarding):
```typescript
interface ValidationConfig {
  tier1?: {
    enabled: boolean; // default: true
    commands?: string[]; // override auto-discovered commands
    timeoutSeconds?: number; // default: 300
  };
  tier2?: {
    enabled: boolean; // default: true
    blockingSeverity: 'error' | 'warning'; // default: 'error'
    customRules?: string[]; // repo-specific quality rules (from onboarding)
    timeoutSeconds?: number; // default: 120
  };
  tier3?: {
    enabled: boolean; // default: true
    riskThresholdForHold?: 'high' | 'critical'; // default: 'critical' (future HITL integration)
    timeoutSeconds?: number; // default: 120
  };
  maxFixCyclesPerTier?: number; // default: 2
}
```
Phasing
- Iteration 3c (initial): Tier 1 as orchestrator-driven post-agent step (upgrading from in-session prompt-based validation). Tier 2 and Tier 3 as LLM-based analysis steps. PR risk level labeling and validation report.
- Iteration 5 (advanced): Tier 2 enhanced with per-repo learned rules from evaluation and memory feedback loops. Tier 3 enhanced with historical risk correlation (do repos with pattern X produce more rejected PRs?). Risk trending dashboards in the control panel.
Scope and phasing
- MVP — No automated evaluation pipeline. Operators and users inspect task status, PRs, and CloudWatch logs. Improvement is manual.
- Iteration 3b — Agent self-feedback after each task. Prompt versioning (store prompt hash with task records). These are lightweight and provide immediate value.
- Iteration 3c — Tiered validation pipeline (Tier 1: tool validation, Tier 2: code quality analysis, Tier 3: risk/blast radius analysis). PR risk level computation and labeling. Validation report posted to PRs. Risk level persisted in task records for trending.
- Iteration 3d — Review feedback memory loop. PR outcome tracking. Basic evaluation pipeline: failure categorization, memory effectiveness metrics (first-review merge rate, revision cycles, repeated mistakes). Requires new webhook infrastructure.
- Iteration 5 — Advanced evaluation: ML-based or LLM-based trace analysis (not just rules), A/B prompt comparison framework, automated feedback into prompt templates. Tier 2 enhanced with learned rules from memory. Tier 3 enhanced with historical risk correlation. Risk trending dashboards. AgentCore has a built-in Evaluations service; the platform should evaluate whether it meets these needs before building custom tooling.
Requirements (future)
- Ingest task lifecycle and, when available, agent traces and logs.
- Support at least: failure categorization, simple success/failure and timeout metrics.
- Write evaluation-derived insights or labels into memory (or a dedicated store) for retrieval during context hydration.
- Capture agent self-feedback at end of each task and persist as searchable insights.
- Track prompt versions per task and support correlation between prompt changes and outcome metrics.
- Optionally drive prompt or template updates from evaluation results (e.g. per-repo or global rules).
- Integrate with observability (same data sources, shared dashboards or alarms).
- Run tiered validation (tool, code quality, risk/blast radius) as post-agent steps and persist results.
- Compute and persist PR risk level (`low`/`medium`/`high`/`critical`) in the task record.
- Post structured validation reports to PRs (comment or Check Run) summarizing all three tiers.
- Track risk level distribution over time per repo, user, and prompt version for trending and correlation.