Orchestrator
Overview
The orchestrator is the component that executes the task lifecycle from submission to completion. It is the runtime engine for blueprints: it takes a task definition (the blueprint), runs each step in sequence, manages state transitions, handles failures and timeouts, and ensures that every task reaches a terminal state with proper cleanup.
The orchestrator does not run the agent. The agent runs inside an isolated compute session (see COMPUTE.md); the orchestrator starts that session, monitors it, and acts on its outcome. The orchestrator runs the deterministic parts of the pipeline (admission control, context hydration, session start, result inference, cleanup) and delegates the non-deterministic part (the agent workload) to the compute environment. This separation is deliberate: deterministic steps are cheap, predictable, and testable; the agent step is expensive, long-running, and unpredictable. The orchestrator wraps the unpredictable part with predictable bookkeeping.
Why a separate design document? The architecture document (see ARCHITECTURE.md) defines the blueprint model and the high-level step sequence (deterministic–agentic–deterministic sandwich). Other documents define individual components: INPUT_GATEWAY.md covers how tasks enter the system, COMPUTE.md covers the session runtime, MEMORY.md covers context sources. No existing document defines: the task state machine with formal states and transitions, the execution model for each blueprint step in detail, failure modes and recovery, concurrency management, or the implementation strategy for the orchestrator itself. This document fills that gap.
At a glance
- Use this doc for: task state machine, admission/finalization flow, cancellation behavior, and failure recovery.
- Most important sections for readers: Responsibilities, State machine, Admission control, and Cancellation.
- Scope: orchestrator behavior only; API surface and security policy are defined in their dedicated docs.
Relationship to blueprints. The orchestrator is a framework that enforces platform invariants (the task state machine, event emission, concurrency management, and cancellation handling) and delegates variable work to blueprint-defined step implementations. A blueprint defines which steps run, in what order, and how each step is implemented (built-in strategy, Lambda-backed custom step, or custom sequence). The default blueprint is defined in this document (see Blueprint execution model). Per-repo customization (see REPO_ONBOARDING.md) changes the steps the orchestrator executes, not the framework guarantees it enforces. The orchestrator wraps every step with state transitions, event emission, and cancellation checks, regardless of whether the step is a built-in or a custom Lambda.
Iteration 1 vs. target state
In Iteration 1 (current), the orchestrator does not exist as a distinct component. The client calls invoke_agent_runtime synchronously, the agent runs to completion inside the AgentCore Runtime MicroVM, and the caller infers the result from the response. There is no durable state, no task management, no concurrency control, and no recovery. If the caller disconnects, the session is orphaned.
The target state (Iteration 2 and beyond) introduces a durable orchestrator that manages the full task lifecycle. This document designs for the target state; where Iteration 1 constraints apply, they are called out explicitly.
Responsibilities
What the orchestrator owns
| Responsibility | Description |
|---|---|
| Task lifecycle | Accept tasks from the input gateway, drive them through the state machine to a terminal state, persist state at each transition. |
| Admission control | Validate that a task can be accepted: repo onboarded, user within concurrency limits, rate limits, idempotency. |
| Context hydration | Assemble the agent prompt from multiple sources (user message, GitHub issue, memory, repo config, system prompt template). |
| Session start | Invoke the compute runtime (AgentCore invoke_agent_runtime) with the hydrated payload. Map the task ID to the runtime session ID. |
| Session monitoring | Track whether the session is still running, detect completion, enforce timeouts (idle and absolute). |
| Result inference | After the session ends, determine success or failure by inspecting GitHub state (branch, PR, commits) and/or the session response. |
| Finalization and cleanup | Update task status, emit events, release concurrency counters, persist audit records, emit notifications. |
| Cancellation | Accept cancel requests at any point in the lifecycle and drive the task to CANCELLED, including stopping the runtime session if running. |
| Concurrency management | Track how many tasks are running per user and system-wide; enforce limits at admission and release counters at finalization. |
What the orchestrator does NOT own
| Component | Owner | Reference |
|---|---|---|
| Request authentication and normalization | Input gateway | INPUT_GATEWAY.md |
| Agent logic (clone, code, test, PR) | Agent harness inside compute | AGENT_HARNESS.md |
| Compute session lifecycle (VM creation, /ping, image pull) | AgentCore Runtime | COMPUTE.md |
| Memory storage and retrieval APIs | AgentCore Memory / MemoryStore | MEMORY.md |
| Repository onboarding and per-repo configuration | Onboarding pipeline | REPO_ONBOARDING.md |
| Outbound notification rendering and delivery | Notification adapters (input gateway outbound) | INPUT_GATEWAY.md |
| Evaluation and feedback | Evaluation pipeline | EVALUATION.md |
Task state machine
States
| State | Description | Typical duration |
|---|---|---|
| `SUBMITTED` | Task accepted by the input gateway, persisted, awaiting orchestration. | Milliseconds |
| `HYDRATING` | Context hydration in progress (fetching GitHub issue, querying memory, assembling prompt). | Seconds |
| `RUNNING` | Agent session is active inside the compute environment. | Minutes to hours (up to 8h) |
| `FINALIZING` | Session ended; orchestrator is performing result inference, build verification, PR check, cleanup. | Seconds |
| `COMPLETED` | Terminal. Task finished successfully (PR created, or work committed). | n/a |
| `FAILED` | Terminal. Task could not be completed (agent error, session crash, hydration failure, etc.). | n/a |
| `CANCELLED` | Terminal. Task was cancelled by the user or system. | n/a |
| `TIMED_OUT` | Terminal. Task exceeded the maximum allowed duration or was killed by an idle timeout without recovery. | n/a |
State transition diagram
                          +-----------+
                          | SUBMITTED |
                          +-----+-----+
                                |
                    admission control passes
                                |
                         +------+------+
                         |  HYDRATING  |
                         +------+------+
                                |
             session started (invoke_agent_runtime)
                                |
                         +------+------+
                         |   RUNNING   |
                         +------+------+
                                |
           +-------------+------+------+-------------+
           |             |             |             |
      session end     timeout     cancel req       crash
           |             |             |             |
    +------+-----+ +-----+-----+ +-----+-----+  +----+---+
    | FINALIZING | | TIMED_OUT | | CANCELLED |  | FAILED |
    +------+-----+ +-----------+ +-----------+  +--------+
           |
           +-- success -------> COMPLETED
           +-- failure -------> FAILED
           +-- idle timeout --> TIMED_OUT

The diagram shows the main path only. Failure and cancellation exits from SUBMITTED and HYDRATING are listed in the transition table.
Transition table
| From | To | Trigger | Guard / condition |
|---|---|---|---|
| `SUBMITTED` | `HYDRATING` | Admission passes, slot available | Concurrency counter incremented |
| `SUBMITTED` | `FAILED` | Admission rejected | Repo not onboarded, rate limit, validation failure |
| `SUBMITTED` | `CANCELLED` | User cancels | Cancel request received |
| `HYDRATING` | `RUNNING` | Hydration complete, session invoked | invoke_agent_runtime returns session ID |
| `HYDRATING` | `FAILED` | Hydration error | GitHub API failure, memory failure, prompt assembly error |
| `HYDRATING` | `CANCELLED` | User cancels during hydration | Cancel request received |
| `RUNNING` | `FINALIZING` | Session ends (response received or session status = terminated) | n/a |
| `RUNNING` | `CANCELLED` | User cancels | stop_runtime_session called, then transition |
| `RUNNING` | `TIMED_OUT` | Max duration exceeded | Wall-clock timer fires (configurable, default 8h matching AgentCore max) |
| `RUNNING` | `FAILED` | Session crash detected (runtime error, unrecoverable) | Session status indicates failure |
| `FINALIZING` | `COMPLETED` | Result inference determines success | PR exists or commits on branch |
| `FINALIZING` | `FAILED` | Result inference determines failure | No commits, no PR, or agent reported error |
| `FINALIZING` | `TIMED_OUT` | Finalization discovers the session ended due to idle timeout | Session metadata indicates idle timeout termination |
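The transition table can be encoded as a lookup that the framework consults before persisting any state change. The following is a minimal sketch, not the orchestrator's actual code; the names `ALLOWED_TRANSITIONS`, `canTransition`, `isTerminal`, and `assertTransition` are hypothetical.

```typescript
// Task states from the States table.
type TaskState =
  | "SUBMITTED" | "HYDRATING" | "RUNNING" | "FINALIZING"
  | "COMPLETED" | "FAILED" | "CANCELLED" | "TIMED_OUT";

// Allowed targets per source state, copied from the transition table.
// Terminal states have no outgoing transitions.
const ALLOWED_TRANSITIONS: Record<TaskState, readonly TaskState[]> = {
  SUBMITTED: ["HYDRATING", "FAILED", "CANCELLED"],
  HYDRATING: ["RUNNING", "FAILED", "CANCELLED"],
  RUNNING: ["FINALIZING", "CANCELLED", "TIMED_OUT", "FAILED"],
  FINALIZING: ["COMPLETED", "FAILED", "TIMED_OUT"],
  COMPLETED: [],
  FAILED: [],
  CANCELLED: [],
  TIMED_OUT: [],
};

function isTerminal(state: TaskState): boolean {
  return ALLOWED_TRANSITIONS[state].length === 0;
}

function canTransition(from: TaskState, to: TaskState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Hypothetical guard: returns the target state, or throws on an illegal move.
function assertTransition(from: TaskState, to: TaskState): TaskState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data makes it straightforward to check a requested transition in one place, whichever step asked for it.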
Cancellation behavior by state
| State when cancel arrives | Action |
|---|---|
| `SUBMITTED` | Transition directly to CANCELLED. No resources to clean up. |
| `HYDRATING` | Abort hydration (best-effort), transition to CANCELLED. Release concurrency counter. |
| `RUNNING` | Call stop_runtime_session to terminate the agent session. Wait for confirmation. Transition to CANCELLED. Release concurrency counter. Partial work (branch, commits) remains on GitHub for the user to inspect or delete. |
| `FINALIZING` | Let finalization complete (it is fast). Mark as CANCELLED only if the cancel was received before the terminal state was written. |
| Terminal states | Reject cancel request (task already done). |
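As a rough illustration, the cancellation table can be read as a function from the current state to a plan of actions. This is a sketch under assumptions: `CancelPlan`, its fields, and `planCancellation` are hypothetical names, and the `FINALIZING` case is modelled as deferring the decision to finalization.

```typescript
type LifecycleState =
  | "SUBMITTED" | "HYDRATING" | "RUNNING" | "FINALIZING"
  | "COMPLETED" | "FAILED" | "CANCELLED" | "TIMED_OUT";

// Hypothetical description of what a cancel request triggers.
interface CancelPlan {
  accepted: boolean;            // false when the task is already terminal
  stopSession: boolean;         // call stop_runtime_session first
  releaseSlot: boolean;         // release the concurrency counter
  deferToFinalization: boolean; // let finalization write the terminal state
}

function planCancellation(state: LifecycleState): CancelPlan {
  switch (state) {
    case "SUBMITTED":
      // No slot acquired yet, nothing to clean up.
      return { accepted: true, stopSession: false, releaseSlot: false, deferToFinalization: false };
    case "HYDRATING":
      return { accepted: true, stopSession: false, releaseSlot: true, deferToFinalization: false };
    case "RUNNING":
      return { accepted: true, stopSession: true, releaseSlot: true, deferToFinalization: false };
    case "FINALIZING":
      // Finalization is fast; it marks CANCELLED only if it has not
      // already written a terminal state.
      return { accepted: true, stopSession: false, releaseSlot: false, deferToFinalization: true };
    default:
      // COMPLETED, FAILED, CANCELLED, TIMED_OUT: reject the request.
      return { accepted: false, stopSession: false, releaseSlot: false, deferToFinalization: false };
  }
}
```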
Timeout behavior
| Timeout type | Value | Source | Effect |
|---|---|---|---|
| Max session duration | 8 hours | AgentCore Runtime hard limit | AgentCore terminates the session. Orchestrator detects session end, transitions to TIMED_OUT. |
| Idle timeout | 15 minutes | AgentCore Runtime inactivity threshold | If the agent is idle for 15 min, AgentCore terminates the session. See Session management section for mitigation. |
| Orchestrator max duration | Configurable (default: 8h) | Orchestrator timer | Orchestrator calls stop_runtime_session if its own timer fires. Safety net if AgentCore’s timeout fails or if the orchestrator wants a shorter limit. |
| Max turns / iterations | Configurable per task (default: 100, range 1–500) | API max_turns field / agent harness | Limits the number of agent loop iterations (tool calls or reasoning turns) per session. Complements time-based limits with a cost-oriented bound. Capping turns prevents runaway sessions that burn tokens without progress. The platform default (100) is applied when no per-task value is specified. Users can override via the API (max_turns field on POST /v1/tasks) or CLI (--max-turns). The value is persisted in the task record, included in the orchestrator payload, and consumed by the agent’s server.py -> ClaudeAgentOptions(max_turns=...). The MAX_TURNS env var on the AgentCore Runtime provides a defense-in-depth fallback. Per-repo overrides via blueprint_config are supported. |
| Max cost budget | Configurable per task ($0.01–$100) | API max_budget_usd field / agent harness | Limits the total cost in USD for a single agent session. When the budget is reached, the agent stops regardless of remaining turns. Users can set via the API (max_budget_usd field on POST /v1/tasks) or CLI (--max-budget). Per-repo defaults can be configured via blueprint_config.max_budget_usd. If neither the task nor the Blueprint specifies a value, no budget limit is applied (turn limit and session timeout still apply). The value is persisted in the task record, resolved via a 2-tier override (task → Blueprint, absent = unlimited), and consumed by the agent’s server.py → ClaudeAgentOptions(max_budget_usd=...). |
| Hydration timeout | Configurable (default: 2 min) | Orchestrator timer | If context hydration takes too long (e.g. GitHub API slow), fail the task. |
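The turn and budget limits are resolved from several tiers before the session starts. The sketch below illustrates one way to do that. It assumes a per-task value takes precedence over a per-repo `blueprint_config` value for `max_turns` (the table states this order only for the budget), and that out-of-range values are rejected rather than clamped. `LimitInputs`, `ResolvedLimits`, and `resolveLimits` are hypothetical names.

```typescript
// Inputs to limit resolution. Field names are illustrative, not the real schema.
interface LimitInputs {
  taskMaxTurns?: number;          // max_turns from the API or --max-turns
  blueprintMaxTurns?: number;     // per-repo override from blueprint_config
  taskMaxBudgetUsd?: number;      // max_budget_usd from the API or --max-budget
  blueprintMaxBudgetUsd?: number; // blueprint_config.max_budget_usd
}

interface ResolvedLimits {
  maxTurns: number;
  maxBudgetUsd: number | null; // null means no budget limit is applied
}

const DEFAULT_MAX_TURNS = 100; // platform default from the table

function checkRange(label: string, value: number, min: number, max: number): number {
  if (!(value >= min && value <= max)) {
    throw new RangeError(`${label} must be between ${min} and ${max}, got ${value}`);
  }
  return value;
}

function resolveLimits(input: LimitInputs): ResolvedLimits {
  // Assumed precedence: task value, then blueprint value, then platform default.
  const turns = input.taskMaxTurns ?? input.blueprintMaxTurns ?? DEFAULT_MAX_TURNS;
  if (Math.floor(turns) !== turns) {
    throw new RangeError("max_turns must be an integer");
  }
  // 2-tier override for the budget: task, then blueprint, absent = unlimited.
  const budget = input.taskMaxBudgetUsd ?? input.blueprintMaxBudgetUsd;
  return {
    maxTurns: checkRange("max_turns", turns, 1, 500),
    maxBudgetUsd: budget === undefined ? null : checkRange("max_budget_usd", budget, 0.01, 100),
  };
}
```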
Blueprint execution model
The default blueprint
The default blueprint is the “deterministic–agentic–deterministic sandwich” described in ARCHITECTURE.md. Every task follows this blueprint unless per-repo customization overrides specific steps.
Step 1: Admission control (deterministic)
See the Admission control section for details. Validates that the task is allowed to run: repo is onboarded, user is within limits, request is not a duplicate. On success, the orchestrator acquires a concurrency slot and transitions the task to HYDRATING.
Step 2: Context hydration (deterministic)
See the Context hydration section for details. Assembles the agent’s prompt from multiple sources: user message, GitHub issue (title, body, comments), memory (relevant past context), repo configuration (system prompt template, rules), and platform defaults. The output is a fully assembled prompt and system prompt, ready to pass to the compute session.
Step 3: Session start and agent execution (deterministic start + agentic execution)
The orchestrator calls invoke_agent_runtime with the assembled payload and receives a session ID. It records the mapping (task ID → session ID) and transitions the task to RUNNING. From this point, the agent runs autonomously inside the MicroVM (see AGENT_HARNESS.md and COMPUTE.md). The orchestrator monitors the session but does not influence the agent’s behavior.
Invocation model. In Iteration 1, invoke_agent_runtime is called synchronously: the call blocks until the agent finishes and returns the response. In the target state, the orchestrator uses AgentCore’s asynchronous invocation model (see Runtime async docs): the agent receives the payload, starts the coding task in a background thread, and returns an acknowledgment immediately. The orchestrator then polls for completion by re-invoking on the same session (sticky routing — see Session management for details). This frees the orchestrator to manage other tasks concurrently and eliminates the need for a blocking call that spans hours.
Step 4: Result inference and finalization (deterministic)
See the Result inference and finalization section for details. After the session ends, the orchestrator inspects the outcome: checks GitHub for a PR on the agent’s branch, verifies the build, examines the session response for errors. Based on this, it transitions the task to COMPLETED, FAILED, or TIMED_OUT. It then runs cleanup: releases the concurrency counter, emits task events, sends notifications, and persists the final task record.
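A simplified view of result inference, following the `FINALIZING` rows of the transition table. This sketch assumes a precedence the document does not spell out: idle timeout is checked first, then an agent-reported error, then evidence of work on GitHub. `SessionEvidence` and `inferFinalState` are hypothetical names, and build verification is omitted.

```typescript
type FinalState = "COMPLETED" | "FAILED" | "TIMED_OUT";

// Hypothetical summary of what finalization has gathered.
interface SessionEvidence {
  idleTimeout: boolean;        // session metadata indicates idle-timeout termination
  agentReportedError: boolean; // session response carried an error
  prExists: boolean;           // a PR exists for the agent's branch
  commitsOnBranch: number;     // commits found on the agent's branch
}

function inferFinalState(evidence: SessionEvidence): FinalState {
  if (evidence.idleTimeout) return "TIMED_OUT";
  if (evidence.agentReportedError) return "FAILED";
  if (evidence.prExists || evidence.commitsOnBranch > 0) return "COMPLETED";
  // No PR and no commits: nothing was produced.
  return "FAILED";
}
```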
Step execution contract
Each step in the blueprint is executed as a function with these properties:
- Idempotent. If the orchestrator retries a step (e.g. after a crash or transient failure), the step produces the same result or safely detects that it already ran. For example, context hydration produces the same prompt for the same inputs; session start is idempotent if the session ID is pre-generated and reused on retry.
- Timeout-bounded. Each step has a configurable timeout so a stuck step does not block the pipeline indefinitely.
- Failure-aware. Each step returns a success/failure signal via `StepOutput.status`. On explicit failure (`status === 'failed'`), the orchestrator transitions the task to `FAILED` without retry. On infrastructure-level failures (Lambda timeout, throttle, transient errors), the framework retries with exponential backoff (default: 2 retries, base 1s, max 10s). See REPO_ONBOARDING.md for the full retry policy.
- Least-privilege input. Each step receives a filtered `blueprintConfig` containing only the fields it needs. Custom Lambda steps receive a sanitized config with credential ARNs stripped. See REPO_ONBOARDING.md for the config filtering policy.
- Bounded output. `StepOutput.metadata` is limited to 10KB serialized per step. `previousStepResults` is pruned to the last 5 steps to keep durable execution checkpoints within the 256KB limit.
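The failure-aware property distinguishes two cases: an explicit `failed` status is returned as-is, while an infrastructure-level error is retried with backoff. The sketch below illustrates that split. It assumes a doubling factor and no jitter (the full policy lives in REPO_ONBOARDING.md), models infrastructure failures as thrown errors, and is synchronous for brevity. `RetryPolicy`, `backoffDelayMs`, and `runWithRetry` are hypothetical names, and the `StepOutput` shape is reduced to the fields mentioned above.

```typescript
type StepStatus = "success" | "failed";

interface StepOutput {
  status: StepStatus;
  metadata?: Record<string, unknown>;
}

interface RetryPolicy {
  maxRetries: number; // retries after the first attempt
  baseMs: number;
  maxMs: number;
}

// Defaults from the step execution contract: 2 retries, base 1s, max 10s.
const DEFAULT_RETRY: RetryPolicy = { maxRetries: 2, baseMs: 1000, maxMs: 10000 };

// Delay before retry number `retryIndex` (0-based), doubling each time and capped.
function backoffDelayMs(retryIndex: number, policy: RetryPolicy): number {
  return Math.min(policy.baseMs * Math.pow(2, retryIndex), policy.maxMs);
}

// `wait` is injected so a caller can plug in a real sleep or a durable timer.
function runWithRetry(
  step: () => StepOutput,
  wait: (ms: number) => void,
  policy: RetryPolicy = DEFAULT_RETRY,
): StepOutput {
  let retryIndex = 0;
  for (;;) {
    try {
      // An explicit { status: "failed" } is returned to the caller, never retried.
      return step();
    } catch (err) {
      // Thrown errors stand in for Lambda timeouts, throttles, transient errors.
      if (retryIndex >= policy.maxRetries) throw err;
      wait(backoffDelayMs(retryIndex, policy));
      retryIndex += 1;
    }
  }
}
```

With the defaults, the delays before the two retries are 1000 ms and 2000 ms; the 10 s cap only matters for policies with more retries.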
Extension points: the 3-layer customization model
The orchestrator is a framework that enforces platform invariants and delegates variable work to blueprint-defined step implementations. Per REPO_ONBOARDING.md, blueprints customize execution through three layers:
Layer 1: Parameterized built-in strategies. Select and configure built-in step implementations without writing code. Examples: compute.type: 'agentcore' selects AgentCore Runtime as the compute provider; compute.type: 'ecs' selects ECS Fargate. Each strategy exposes its own configuration surface (e.g. runtime_arn for agentcore, taskDefinitionArn for ECS). The orchestrator resolves the strategy by compute_type key, instantiates it with the provided config, and delegates step execution.
Layer 2: Lambda-backed custom steps. Inject custom logic at specific pipeline phases by providing a Lambda ARN. Each custom step declares a phase (pre-agent or post-agent), a name, an optional timeoutSeconds, and optional config. The orchestrator invokes the Lambda with a StepInput payload and expects a StepOutput response (see REPO_ONBOARDING.md for the contracts). Examples: SAST scan before the agent, custom lint after the agent, notification webhook on finalization.
Layer 3: Custom step sequences. Override the default step order entirely. A step_sequence is an ordered list of StepRef entries, each referencing either a built-in step (by name) or a custom step (by CustomStepConfig.name). The orchestrator iterates the sequence, resolving each reference to a built-in implementation or Lambda invocation. This enables inserting custom steps between built-in steps or reordering the pipeline. If step_sequence is absent, the default sequence applies.
What the framework enforces (regardless of customization):
- State transitions: every step runs within a state machine transition — the task cannot skip states.
- Event emission: step start/end events are emitted automatically.
- Cancellation: the framework checks for cancellation between steps and aborts if a cancel request is pending.
- Concurrency: slot acquisition and release are managed by the framework, not by individual steps.
- Timeouts: each step is bounded by a configurable timeout.
Step resolution
When the orchestrator loads a task’s `blueprint_config`, it resolves the step pipeline:
- Load `RepoConfig` from the `RepoTable` by `repo` (PK). Merge with platform defaults (see REPO_ONBOARDING.md for default values and override precedence).
- Resolve compute strategy from `compute_type` (default: `agentcore`). The strategy implements the `ComputeStrategy` interface (see REPO_ONBOARDING.md).
- Build step list. If `step_sequence` is provided, use it; otherwise use the default sequence (`admission-control` → `hydrate-context` → `start-session` → `await-agent-completion` → `finalize`). For each entry, resolve to a built-in step function or a Lambda invocation wrapper.
- Inject custom steps. If `custom_steps` are defined and no explicit `step_sequence` is provided, insert them at their declared `phase` position (pre-agent steps before `start-session`, post-agent steps after `await-agent-completion`).
- Validate. Check that required steps are present and correctly ordered (see step sequence validation). If invalid, fail the task with `INVALID_STEP_SEQUENCE`.
- Execute. Iterate the resolved list. For each step: check cancellation, filter `blueprintConfig` to only the fields that step needs (stripping credential ARNs for custom Lambda steps), execute with retry policy, enforce the `StepOutput.metadata` size budget (10KB), prune `previousStepResults` to the last 5 steps, emit events. Built-in steps that need durable waits (e.g. `await-agent-completion`) receive the `DurableContext` and `ComputeStrategy` so they can call `waitForCondition` and `computeStrategy.pollSession()` internally; the framework loop needs no name-based special-casing.
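The "build step list" and "inject custom steps" rules above can be sketched as a pure function over step names. This is an illustration only: the real contracts (`StepRef`, `CustomStepConfig`) are defined in REPO_ONBOARDING.md and carry more fields than shown here, `buildStepList` is a hypothetical name, and validation is left out.

```typescript
type Phase = "pre-agent" | "post-agent";

// Reduced to the two fields this sketch needs.
interface CustomStepConfig {
  name: string;
  phase: Phase;
}

const DEFAULT_SEQUENCE: readonly string[] = [
  "admission-control",
  "hydrate-context",
  "start-session",
  "await-agent-completion",
  "finalize",
];

function namesForPhase(steps: CustomStepConfig[], phase: Phase): string[] {
  return steps.filter((s) => s.phase === phase).map((s) => s.name);
}

function buildStepList(
  stepSequence: string[] | undefined,
  customSteps: CustomStepConfig[],
): string[] {
  // An explicit step_sequence is used as given; custom steps are not injected.
  if (stepSequence !== undefined && stepSequence.length > 0) {
    return stepSequence.slice();
  }
  let resolved: string[] = [];
  for (const builtIn of DEFAULT_SEQUENCE) {
    if (builtIn === "start-session") {
      // Pre-agent steps run immediately before the session starts.
      resolved = resolved.concat(namesForPhase(customSteps, "pre-agent"));
    }
    resolved.push(builtIn);
    if (builtIn === "await-agent-completion") {
      // Post-agent steps run once the agent has finished.
      resolved = resolved.concat(namesForPhase(customSteps, "post-agent"));
    }
  }
  return resolved;
}
```

For example, one pre-agent step named `sast-scan` and one post-agent step named `custom-lint` would resolve to `admission-control`, `hydrate-context`, `sast-scan`, `start-session`, `await-agent-completion`, `custom-lint`, `finalize`.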
Admission control
Admission control runs immediately after the input gateway dispatches a “create task” message. It is the first step of the blueprint. Its purpose is to reject tasks that should not run, before any compute resources are consumed.
Checks (in order)
- Repo onboarding check (Iteration 3+). Is the target repository registered with the platform? If not, reject with an error. In Iteration 1–2, this check is skipped (any repo the credentials can access is allowed). In Iteration 3+, this check is performed at the API handler level (`createTaskCore`) rather than in the orchestrator, for faster rejection (no orphan `SUBMITTED` tasks). The handler does a `GetItem` on the `RepoTable` by `repo` (PK). If not found or `status !== 'active'`, the request is rejected with 422 `REPO_NOT_ONBOARDED`. The orchestrator’s admission control step can optionally re-check as defense-in-depth. See REPO_ONBOARDING.md for the `RepoConfig` schema and blueprint contract.
- User concurrency limit. How many tasks is this user currently running? If the count equals or exceeds the per-user limit (configurable, e.g. 3), the task is rejected. A `UserConcurrency` counter is checked atomically. If below the limit, the counter is incremented and the task proceeds to hydration. If at the limit, the task is rejected with a concurrency limit error.
- System-wide concurrency limit. Is the system at capacity? The total number of `RUNNING` + `HYDRATING` tasks is compared to the system-wide limit (bounded by AgentCore quotas, e.g. concurrent session limit per account). If at capacity, the task is rejected even if the user has room.
- Rate limiting. A per-user rate limit (e.g. 10 tasks per hour) prevents abuse. Implemented as a sliding window counter (e.g. in DynamoDB with TTL). Tasks that exceed the rate are rejected, not queued.
- Idempotency check. If the task request includes an idempotency key (e.g. client-supplied header), check whether a task with that key already exists. If so, return the existing task ID and status without creating a duplicate. Idempotency keys are stored with a TTL (e.g. 24 hours).
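The concurrency and rate-limit checks can be illustrated with an in-memory model. This is a sketch only: the real counters would live in a shared store with atomic conditional updates, which a single-process class does not provide. It also assumes that only admitted tasks count toward the sliding window. `AdmissionController`, `AdmissionLimits`, and the reason strings are hypothetical.

```typescript
interface AdmissionLimits {
  perUserConcurrency: number; // e.g. 3
  systemConcurrency: number;  // bounded by runtime quotas
  ratePerHour: number;        // e.g. 10
}

type AdmissionResult =
  | { admitted: true }
  | { admitted: false; reason: string };

const HOUR_MS = 60 * 60 * 1000;

class AdmissionController {
  private active: Record<string, number> = {};
  private totalActive = 0;
  private admittedAt: Record<string, number[]> = {};

  constructor(private readonly limits: AdmissionLimits) {}

  admit(userId: string, nowMs: number): AdmissionResult {
    const userActive = this.active[userId] || 0;
    if (userActive >= this.limits.perUserConcurrency) {
      return { admitted: false, reason: "USER_CONCURRENCY_LIMIT" };
    }
    if (this.totalActive >= this.limits.systemConcurrency) {
      return { admitted: false, reason: "SYSTEM_CAPACITY" };
    }
    // Sliding window: keep only admissions from the last hour.
    const recent = (this.admittedAt[userId] || []).filter((t) => t > nowMs - HOUR_MS);
    this.admittedAt[userId] = recent;
    if (recent.length >= this.limits.ratePerHour) {
      return { admitted: false, reason: "RATE_LIMIT" };
    }
    // All checks passed: record the admission and take a slot.
    recent.push(nowMs);
    this.active[userId] = userActive + 1;
    this.totalActive += 1;
    return { admitted: true };
  }

  // Called at finalization; safe to call when the user holds no slot.
  release(userId: string): void {
    const userActive = this.active[userId] || 0;
    if (userActive === 0) return;
    this.active[userId] = userActive - 1;
    this.totalActive -= 1;
  }
}
```

All checks run before any counter changes, so a rejection leaves the counters untouched, matching the "No counter change" rule for rejected tasks.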
Admission result
- Accepted. Task transitions to `HYDRATING`. Concurrency counter incremented.
- Rejected. Task transitions to `FAILED` with a reason (repo not onboarded, rate limit exceeded, concurrency limit, validation error). No counter change.
- Deduplicated. Existing task ID returned. No new task created.
Context hydration
Context hydration assembles the agent’s user prompt from multiple sources. It runs as a deterministic step in the orchestrator Lambda after admission control and before session start. The goal is to perform I/O-bound work (GitHub API calls, Secrets Manager lookups) before expensive agent compute is consumed, enabling fast failure when external APIs are unavailable.
Current implementation (Iteration 3a)
The orchestrator’s `hydrateAndTransition()` function calls `hydrateContext()` (`src/handlers/shared/context-hydration.ts`) which:
- Resolves the GitHub token from Secrets Manager (if `GITHUB_TOKEN_SECRET_ARN` is configured). The token is cached in a module-level variable with a 5-minute TTL for Lambda execution context reuse.
- Fetches the GitHub issue (title, body, comments) via the GitHub REST API using native `fetch()`, if `issue_number` is present on the task and a token is available.
- Enforces a token budget on the combined context. Uses a character-based heuristic (~4 chars per token). Default budget: 100K tokens (configurable via the `USER_PROMPT_TOKEN_BUDGET` environment variable). When the budget is exceeded, oldest comments are removed first. The `truncated` flag is set in the result.
- Assembles the user prompt: a structured markdown document containing Task ID, Repository, GitHub Issue section (title, body, comments), and Task section (user’s task description or default instruction). The format mirrors the Python `assemble_prompt()` in `agent/entrypoint.py` for cross-language consistency.
- Returns a `HydratedContext` object containing `version`, `user_prompt`, `issue`, `sources`, `token_estimate`, and `truncated`.
The hydrated context is passed to the agent as a new hydrated_context field in the invocation payload, alongside the existing legacy fields (repo_url, task_id, branch_name, issue_number, prompt). The agent checks for hydrated_context with version == 1; if present, it uses the pre-assembled user_prompt directly and skips in-container GitHub fetching and prompt assembly. If absent (e.g. during a deployment rollout or when the secret ARN isn’t configured), the agent falls back to its existing behavior.
Graceful degradation: If any step fails (Secrets Manager unavailable, GitHub API error, network timeout), the orchestrator proceeds with whatever context is available. The worst case is a minimal prompt with just the task ID and repository — the agent can still attempt its own GitHub fetch as a fallback via the legacy issue_number field.
Hydration events
The orchestrator emits two task events during hydration:
- `hydration_started`: emitted when the task transitions to `HYDRATING`.
- `hydration_complete`: emitted after context assembly, with metadata: `sources` (array of context sources used, e.g. `["issue", "task_description"]`), `token_estimate` (estimated token count of the assembled prompt), `truncated` (whether the token budget was exceeded).
AgentCore Gateway — evaluated and deferred
We evaluated routing GitHub API calls through AgentCore Gateway (with the GitHub MCP server or GitHub REST API as an OpenAPI target). Conclusion: not needed for this iteration. The core agent operations (git clone, commit, push) are git-protocol operations that cannot go through the MCP server, so the agent must keep its direct PAT regardless. The Gateway would only abstract the read-only operations (issue fetching) used in hydration, adding infrastructure complexity for minimal benefit over direct API calls. If AgentCore Gateway is introduced later (e.g. for multi-provider git support or centralized credential management), the hydration code’s fetchGitHubIssue function can be swapped to call the Gateway endpoint without changing the pipeline’s structure.
Sources (in assembly order)
- System prompt template. The platform’s default system prompt (see `agent/system_prompt.py`). Stays in the agent container because the template has a `{setup_notes}` placeholder that depends on `setup_repo()` running inside the container. In future, this template may be overridden per-repo via onboarding config.
- Repo configuration (Iteration 3+). Per-repo rules, instructions, or context loaded from the onboarding store. This can include static artifacts discovered during onboarding (e.g. content from `.cursor/rules`, `CLAUDE.md`, `CONTRIBUTING.md`) and dynamic artifacts generated by the onboarding pipeline (e.g. codebase summaries, dependency graphs). See REPO_ONBOARDING.md.
- GitHub issue context. If the task references a GitHub issue: fetch the issue title, body, and comments via the GitHub REST API. Now done in the orchestrator (`fetchGitHubIssue` in `src/handlers/shared/context-hydration.ts`), not in the agent container.
- User message. The free-text task description provided by the user (via CLI `--task` flag or equivalent). May supplement or replace the issue context.
- Memory context (Iteration 3+). Query long-term memory (e.g. AgentCore Memory) for relevant past context: insights from previous tasks on this repo, failure summaries, learned patterns. See MEMORY.md for how insights and code attribution feed into hydration. Not yet implemented.
- Attachments. Images or files provided by the user (multi-modal input). Passed through to the agent prompt as base64 or URLs.
Prompt assembly
The orchestrator assembles one artifact during hydration:
- User prompt. Assembled by `assembleUserPrompt()` in the orchestrator from the issue context and user message. Format: `Task ID: {id}\nRepository: {repo}\n\n## GitHub Issue #{n}: {title}\n...\n\n## Task\n\n{description}`. This mirrors the Python `assemble_prompt()` function.
The system prompt is not assembled in the orchestrator — it remains in the agent container because it depends on setup_repo() output ({setup_notes} placeholder). In the target state, additional sections may be injected: repo-specific rules, memory-derived insights.
Payload contract
Legacy: `{ repo_url, task_id, branch_name, issue_number?, prompt? }`

Current: `{ repo_url, task_id, branch_name, issue_number?, prompt?, hydrated_context }`

Where `hydrated_context`:

    {
      "version": 1,
      "user_prompt": "Task ID: ...\nRepository: ...\n\n## GitHub Issue #42: ...",
      "issue": { "number": 42, "title": "...", "body": "...", "comments": [...] },
      "sources": ["issue", "task_description"],
      "token_estimate": 1250,
      "truncated": false
    }

Token budget
The orchestrator enforces a token budget on the user prompt before assembly:
- Estimation heuristic: `Math.ceil(text.length / 4)` (~4 characters per token).
- Default budget: 100,000 tokens (configurable via the `USER_PROMPT_TOKEN_BUDGET` CDK prop / environment variable).
- Truncation strategy: When the combined estimated token count (issue body + comments + task description) exceeds the budget, oldest comments are removed first. If still over budget after removing all comments, the issue body and task description are kept as-is (they are assumed to be essential). The `truncated` flag is set in the hydrated context metadata.
- The agent harness handles its own context compaction during the run for multi-turn conversations.
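The truncation strategy can be sketched as follows. This is a minimal illustration, not the contents of `context-hydration.ts`: it assumes comments arrive oldest first as plain strings and that the estimate is summed per piece rather than computed on the concatenated prompt. `BudgetInput`, `BudgetResult`, and `applyTokenBudget` are hypothetical names.

```typescript
// Character-based heuristic from the list above: ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface BudgetInput {
  issueBody: string;
  comments: string[];      // assumed to be ordered oldest first
  taskDescription: string;
}

interface BudgetResult {
  comments: string[];      // comments that survived truncation
  tokenEstimate: number;
  truncated: boolean;      // true whenever the budget was exceeded
}

function applyTokenBudget(input: BudgetInput, budget: number = 100000): BudgetResult {
  const kept = input.comments.slice();
  let total = estimateTokens(input.issueBody) + estimateTokens(input.taskDescription);
  for (const comment of kept) {
    total += estimateTokens(comment);
  }
  const truncated = total > budget;
  // Drop the oldest comments until the estimate fits or none remain.
  while (total > budget && kept.length > 0) {
    const oldest = kept.shift() as string;
    total -= estimateTokens(oldest);
  }
  // The issue body and task description are never cut, so the result
  // may still exceed the budget once all comments are gone.
  return { comments: kept, tokenEstimate: total, truncated };
}
```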
Session management
Starting a session
The orchestrator invokes invoke_agent_runtime (AgentCore API) with:
- `agentRuntimeArn`: the ARN of the deployed runtime (from CDK stack output).
- `runtimeSessionId`: a pre-generated UUID tied to the task. Pre-generating the session ID is important for idempotency: if the orchestrator retries after a crash, it reuses the same session ID. If the session was already started, AgentCore either returns the existing session or rejects the duplicate.
- `payload`: the hydrated prompt and configuration (repo, max turns, model).
The orchestrator records the (task_id, session_id) mapping in the task record immediately before the invocation call. This ensures that even if the orchestrator crashes after the call succeeds, the session ID is recoverable.
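The ordering described above (persist the session ID, then invoke) is what makes a retry safe. The sketch below illustrates it with an in-memory record. `TaskRecord`, `ensureSessionId`, and `startSession` are hypothetical names, `invoke` stands in for the real `invoke_agent_runtime` call, and `generateId` would be a UUID generator in practice.

```typescript
// Minimal stand-in for the persisted task record.
interface TaskRecord {
  taskId: string;
  sessionId?: string;
}

// Returns the session ID for the task, generating and storing it only once.
function ensureSessionId(record: TaskRecord, generateId: () => string): string {
  let id = record.sessionId;
  if (id === undefined) {
    id = generateId();
    record.sessionId = id; // persisted before the runtime is invoked
  }
  return id;
}

function startSession(
  record: TaskRecord,
  generateId: () => string,
  invoke: (sessionId: string) => void,
): string {
  const sessionId = ensureSessionId(record, generateId);
  // If the process crashes during or after this call, the ID is already
  // on the record, so a retry targets the same session.
  invoke(sessionId);
  return sessionId;
}
```

A retry after a crash calls `startSession` again with the same record and reaches the runtime with the same `runtimeSessionId`, leaving duplicate detection to the runtime.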
Invocation model: synchronous vs. asynchronous
Iteration 1 (current). invoke_agent_runtime is called synchronously with a long read timeout. The call blocks until the agent finishes. This is simple but limits concurrency: one orchestrator process per task.
Target state. The orchestrator uses AgentCore’s asynchronous processing model (Runtime async docs). The key capabilities:
- Non-blocking invocation. The agent’s `@app.entrypoint` handler receives the payload and starts the coding task in a background thread (using the SDK’s `add_async_task` / `complete_async_task` API for task tracking). It returns an acknowledgment immediately. The `invoke_agent_runtime` call completes in seconds, not hours.
- Sticky routing on session. Subsequent calls to `invoke_agent_runtime` with the same `runtimeSessionId` are routed to the same instance. This enables a poll pattern: the orchestrator re-invokes on the same session to ask for status, and the agent responds with its current state (running, completed, failed) and, on completion, the result payload (PR URL, cost, error, etc.).
- Health status via `/ping`. The agent’s `/ping` endpoint reports processing status: `{"status": "HealthyBusy"}` while the background task is running, `{"status": "Healthy"}` when idle. AgentCore polls `/ping` automatically; the 15-minute idle timeout starts only when the status is `Healthy` (idle). As long as the agent reports `HealthyBusy`, the session stays alive.
Agent-side contract. The agent entrypoint must:
- Start the coding task in a separate thread (so
/pingremains responsive). - Call
app.add_async_task(...)when work begins andapp.complete_async_task(...)when work ends. - On subsequent invocations (poll requests), return the current status and, if complete, the result.
This model eliminates the need for a wrapper Lambda or Fargate task to hold a blocking call. The orchestrator’s poll is a lightweight, fast invoke_agent_runtime call that returns immediately.
Liveness monitoring
The orchestrator needs to know whether the session is still running. Two complementary mechanisms:
- /ping health status. AgentCore automatically polls the agent’s /ping endpoint. The agent reports HealthyBusy while the coding task is active and Healthy when idle. The orchestrator does not call /ping directly; AgentCore does. However, the /ping status drives the session lifecycle: a session in Healthy (idle) state for 15 minutes is automatically terminated. As long as the agent reports HealthyBusy, the session stays alive, up to the 8-hour hard cap.
- Re-invocation on the same session (target state). The orchestrator calls invoke_agent_runtime with the same runtimeSessionId. Sticky routing ensures the request reaches the same instance. The agent’s entrypoint can detect this is a poll (e.g., via a poll: true field in the payload or by tracking the initial task) and return the current status without starting a new task. This is a fast, lightweight call that returns immediately.
Iteration 1. The invoke_agent_runtime call blocks; when it returns, the session is over. No explicit liveness check needed.
Fallback: DynamoDB heartbeat (optional enhancement). As defense in depth, the agent can write a heartbeat timestamp to DynamoDB every N minutes. The orchestrator reads it during its poll cycle. A missing heartbeat (e.g. none in the last 10 minutes while /ping reports HealthyBusy) could indicate the agent is stuck but not idle — triggering investigation or forced termination.
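As a rough illustration of the heartbeat check, the sketch below decides whether a running session has been silent for too long. The function name, the parameters, and the 10-minute threshold in the usage lines are assumptions for illustration; the session start time is used as the baseline until the first heartbeat arrives.

```typescript
// Returns true when no heartbeat (or session start) has been seen within maxSilenceMs.
// All timestamps are epoch milliseconds.
function isHeartbeatStale(
  sessionStartedMs: number,
  lastHeartbeatMs: number | null, // null until the agent writes its first heartbeat
  nowMs: number,
  maxSilenceMs: number,
): boolean {
  const lastSignalMs = lastHeartbeatMs ?? sessionStartedMs;
  return nowMs - lastSignalMs > maxSilenceMs;
}

// Usage with a hypothetical 10-minute threshold.
const TEN_MINUTES_MS = 10 * 60_000;
const startedAt = 0;
console.log(isHeartbeatStale(startedAt, 9 * 60_000, 15 * 60_000, TEN_MINUTES_MS)); // false (6 minutes silent)
console.log(isHeartbeatStale(startedAt, 2 * 60_000, 15 * 60_000, TEN_MINUTES_MS)); // true (13 minutes silent)
```

A stale result is only a hint that the agent may be stuck; whether it triggers investigation or forced termination is a policy decision outside this sketch.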
The 15-minute idle timeout problem
AgentCore Runtime terminates sessions after 15 minutes of inactivity (no /ping response or no invocations). This is a critical constraint for coding tasks: the agent may take several minutes between tool calls (e.g. during a long build or a complex reasoning step).
Mitigation (async model). In the target state, the agent uses the AgentCore SDK’s async task management: add_async_task registers a background task, and the SDK automatically reports HealthyBusy via /ping while any async task is active. AgentCore polls /ping and sees the agent is busy, preventing idle termination. When the agent calls complete_async_task, the status reverts to Healthy. The /ping endpoint runs on the main thread (or async event loop) while the coding task runs in a separate thread, so /ping remains responsive.
Mitigation (Iteration 1 / current). The agent container’s FastAPI server defines /ping as a separate async endpoint. Because the agent task runs in a threadpool worker (not in the asyncio event loop), the /ping endpoint remains responsive while the agent works. AgentCore calls /ping periodically and the server responds, preventing idle timeout.
Risk. If the agent’s computation blocks the entire process (not just a thread), for example because a subprocess consumes all available resources or the server becomes unresponsive, the /ping response may be delayed, triggering idle termination. This risk applies to both models. The defense is to ensure the coding task runs in a separate thread or process and does not starve the main thread.
Session completion detection
When the session ends (agent finishes, crashes, or is terminated), the orchestrator detects this:
- Iteration 1: The invoke_agent_runtime call returns (it blocks). The response body contains the agent’s output (status, PR URL, cost, etc.).
- Target state: The orchestrator polls the agent via re-invocation on the same session (see Invocation model above). Completion is detected when: (a) the agent responds with a “completed” or “failed” status in the poll response, or (b) the re-invocation fails because the session was terminated (idle timeout, crash, or 8-hour limit reached). In the durable orchestrator, a waitForCondition evaluates the poll result at each interval and resumes the pipeline when the condition is met. See the session monitoring pattern in the Implementation options section.
External termination (cancellation)
When the user cancels a task in RUNNING state, the orchestrator calls stop_runtime_session. The orchestrator must:
- Call stop_runtime_session.
- Wait for confirmation (the call succeeds or the session is already terminated).
- Transition the task to CANCELLED.
- Run partial finalization: release concurrency counter, emit events, persist state. Do not attempt result inference (the session was intentionally killed).
Result inference and finalization
How the orchestrator determines success or failure
After the session ends, the orchestrator examines multiple signals:
- Session response. If the invoke_agent_runtime call returns a response body (as in Iteration 1), parse it for the agent’s self-reported status (success, error, end_turn), PR URL, cost, and error message.
- GitHub state inspection. Regardless of the agent’s self-report, verify against GitHub:
  - Branch exists? Check if the agent’s branch (bgagent/{task_id}/{slug}) was pushed to the remote.
  - PR exists? Query the GitHub API for a PR from the agent’s branch.
  - Commit count. How many commits are on the branch beyond main? Zero commits with no PR likely means the agent did nothing useful.
- Decision matrix.

| Agent self-report | PR exists | Commits on branch | Outcome |
|---|---|---|---|
| success / end_turn | Yes | > 0 | COMPLETED |
| success / end_turn | Yes | > 0 (build failed) | COMPLETED (with warning: build failed post-agent) |
| success / end_turn | No | > 0 | COMPLETED (partial: work committed but no PR; orchestrator may attempt PR creation as a post-hook) |
| success / end_turn | No | 0 | FAILED (agent reported success but did nothing) |
| error | Yes | > 0 | COMPLETED (with warning: agent reported error but PR exists) |
| error | No | > 0 | FAILED (partial work on branch, no PR) |
| error | No | 0 | FAILED |
| unknown / no response | n/a | n/a | FAILED (session ended unexpectedly) |
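The decision matrix can be expressed as a small pure function, which makes it easy to unit test. The sketch below is one possible encoding, not the actual implementation: the type and function names are hypothetical, and combinations the matrix does not list (a PR with zero commits) are treated as FAILED as an explicit assumption.

```typescript
type SelfReport = "success" | "end_turn" | "error" | "unknown";

interface InferenceSignals {
  selfReport: SelfReport;
  prExists: boolean;
  commitCount: number;
  buildFailed?: boolean; // post-agent build check, when available
}

interface InferredOutcome {
  state: "COMPLETED" | "FAILED";
  note?: string;
}

function inferOutcome(signals: InferenceSignals): InferredOutcome {
  const { selfReport, prExists, commitCount, buildFailed } = signals;
  const hasCommits = commitCount > 0;

  if (selfReport === "unknown") {
    return { state: "FAILED", note: "session ended unexpectedly" };
  }

  if (selfReport === "error") {
    if (prExists && hasCommits) {
      return { state: "COMPLETED", note: "agent reported error but PR exists" };
    }
    return hasCommits
      ? { state: "FAILED", note: "partial work on branch, no PR" }
      : { state: "FAILED" };
  }

  // selfReport is "success" or "end_turn" from here on.
  if (prExists && hasCommits) {
    return buildFailed
      ? { state: "COMPLETED", note: "build failed post-agent" }
      : { state: "COMPLETED" };
  }
  if (!prExists && hasCommits) {
    return { state: "COMPLETED", note: "partial: work committed but no PR" };
  }
  if (!prExists) {
    return { state: "FAILED", note: "agent reported success but did nothing" };
  }
  // Assumption: a PR without commits is not covered by the matrix; treated as FAILED.
  return { state: "FAILED", note: "PR exists but branch has no commits" };
}

console.log(inferOutcome({ selfReport: "end_turn", prExists: false, commitCount: 0 }).state); // FAILED
```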
Fragility of GitHub-based inference and proposed improvements
Relying solely on GitHub state to determine task outcome is fragile:
- Race condition. The agent may have pushed commits but not yet created the PR when the session was terminated (timeout or crash). The orchestrator sees commits but no PR.
- GitHub API availability. If the GitHub API is down when finalization runs, the orchestrator cannot determine the outcome. It must retry or mark as FAILED with an infrastructure-error reason.
- Ambiguity. Commits exist but no PR: is this a failure or partial success?
Proposed improvement: explicit completion signal. In the target state, the agent should write a completion record to an external store (e.g. DynamoDB or AgentCore Memory) before the session ends. This record would contain: task_id, status (success/failure), pr_url (if any), error_message (if any), branch_name, commit_count. The orchestrator reads this record during finalization. GitHub inspection becomes a fallback, not the primary signal.
This is more reliable because the agent writes the record as the last step before exiting (deterministic, under its control), and the orchestrator reads it from DynamoDB (fast, highly available, independent of GitHub). If the record is missing (crash before write), the orchestrator falls back to GitHub inspection.
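A minimal sketch of the record-first, GitHub-second lookup described above follows. The CompletionRecord fields mirror the list in the text; the Resolution type, the resolveOutcome function, and the injected inspectGitHub callback are hypothetical names used only for illustration. The fallback is passed as a function so that GitHub is contacted only when the record is missing.

```typescript
interface CompletionRecord {
  task_id: string;
  status: "success" | "failure";
  pr_url?: string;
  error_message?: string;
  branch_name: string;
  commit_count: number;
}

type TerminalState = "COMPLETED" | "FAILED";

interface Resolution {
  state: TerminalState;
  source: "completion-record" | "github-fallback";
}

// Prefers the explicit completion record; falls back to GitHub inspection
// only when the record is missing (e.g. the agent crashed before writing it).
function resolveOutcome(
  record: CompletionRecord | undefined,
  inspectGitHub: () => TerminalState,
): Resolution {
  if (record !== undefined) {
    return {
      state: record.status === "success" ? "COMPLETED" : "FAILED",
      source: "completion-record",
    };
  }
  return { state: inspectGitHub(), source: "github-fallback" };
}

const fromRecord = resolveOutcome(
  { task_id: "task-123", status: "success", branch_name: "bgagent/task-123/fix", commit_count: 2 },
  () => "FAILED",
);
console.log(fromRecord.source); // completion-record
```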
Cleanup
After determining the outcome, the orchestrator:
- Updates task status in the Tasks table (terminal state + metadata: PR URL, error, duration, cost).
- Stamps TTL for data retention. When the task reaches a terminal state, a ttl attribute is set on the task record (current time + taskRetentionDays, default 90 days). DynamoDB automatically deletes the record after the TTL expires. If the agent wrote the terminal status directly (e.g. COMPLETED), the orchestrator retroactively stamps the TTL during finalization. All task events also carry a TTL set at creation time.
- Emits task events to the TaskEvents audit log (e.g. task_completed, task_failed).
- Releases concurrency counter. Decrements the user’s UserConcurrency counter. If this fails (e.g. DynamoDB error), the counter drifts; a reconciliation job detects and corrects drift (see OBSERVABILITY.md).
- Emits notifications. Sends an internal notification (per INPUT_GATEWAY.md outbound schema) so channel adapters can inform the user.
- Future: queue processing. Reserved for future implementation of task queuing when capacity is at limit.
- Persists code attribution data (Iteration 3+). Writes task metadata (task_id, repo, branch, commits, PR URL, outcome) to memory for future retrieval. See MEMORY.md and OBSERVABILITY.md.
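The TTL stamp in the cleanup list is simple arithmetic, sketched below. DynamoDB TTL attributes are expressed in Unix epoch seconds, so the millisecond clock value is converted first. The function name is hypothetical; the 90-day default comes from the text.

```typescript
const SECONDS_PER_DAY = 86_400;

// Computes the value for the ttl attribute: current time plus the retention
// period, in Unix epoch seconds.
function computeTtlEpochSeconds(nowMs: number, taskRetentionDays: number = 90): number {
  return Math.floor(nowMs / 1000) + taskRetentionDays * SECONDS_PER_DAY;
}

// Usage: a task finalized at epoch 0 with the default retention expires
// 90 * 86,400 = 7,776,000 seconds later.
console.log(computeTtlEpochSeconds(0)); // 7776000
```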
Failure modes and recovery
This section uses an FMEA (Failure Mode and Effects Analysis) approach: for each component and step, what can go wrong, what is the impact, and what the orchestrator does.
Admission control failures
| Failure mode | Impact | Recovery |
|---|---|---|
| DynamoDB unavailable (cannot read repo config or concurrency counters) | Task cannot be admitted | Retry with backoff (up to 3 attempts). If still failing, reject the task with a transient error. |
| Concurrency counter has drifted (shows higher than actual) | Legitimate tasks are rejected unnecessarily | Reconciliation job runs periodically (e.g. every 5 min) and corrects counter based on actual RUNNING task count. |
Context hydration failures
| Failure mode | Impact | Recovery |
|---|---|---|
| GitHub API unavailable or rate limited | Cannot fetch issue context | Retry with backoff. If the issue is essential (issue-based task), fail the task. If the user also provided a task description, proceed with degraded context (no issue body). |
| Memory service unavailable | Cannot retrieve past insights | Proceed without memory context (memory is an enrichment, not required for MVP). Log warning. |
| Prompt exceeds token budget | Agent may lose coherence or fail to start | Truncate lower-priority sources (old comments, memory) to fit budget. |
Session start failures
| Failure mode | Impact | Recovery |
|---|---|---|
| invoke_agent_runtime returns error (e.g. throttled at the 25 TPS limit) | Session not started | Retry with exponential backoff. If repeatedly failing, transition task to FAILED with reason “session start failed”. |
| invoke_agent_runtime returns but session crashes immediately | Session starts then dies | Orchestrator detects session end (from the blocking call returning or from polling). Result inference finds no commits, no PR. Task transitions to FAILED. |
| AgentCore Runtime is unavailable (service outage) | No sessions can start | All tasks in HYDRATING that attempt session start will fail. Queue new tasks. Alert operators (see OBSERVABILITY.md). |
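The tables in this section refer to retrying with exponential backoff. A minimal sketch of the delay calculation is shown here, using the common "full jitter" variant so that throttled callers do not retry in lockstep. The base delay, the cap, and the function name are illustrative assumptions, not values taken from the implementation.

```typescript
interface BackoffOptions {
  baseMs: number;          // delay ceiling for the first retry
  capMs: number;           // upper bound on the ceiling
  random?: () => number;   // injectable for tests; defaults to Math.random
}

// Returns the delay before retry number `attempt` (0-based).
// The ceiling doubles per attempt up to capMs; the actual delay is a random
// fraction of that ceiling (full jitter).
function backoffDelayMs(attempt: number, options: BackoffOptions): number {
  const ceiling = Math.min(options.capMs, options.baseMs * 2 ** attempt);
  const random = options.random ?? Math.random;
  return Math.floor(random() * ceiling);
}

// Usage with a fixed "random" value so the output is reproducible.
const fixedHalf = (): number => 0.5;
const delays = [0, 1, 2, 3].map((attempt) =>
  backoffDelayMs(attempt, { baseMs: 500, capMs: 8_000, random: fixedHalf }),
);
console.log(delays); // [ 250, 500, 1000, 2000 ]
```

After the configured number of attempts is exhausted, the caller applies the recovery action from the table (for example, transitioning the task to FAILED).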
Agent execution failures (during RUNNING)
| Failure mode | Impact | Recovery |
|---|---|---|
| Agent crashes mid-task (unhandled exception) | Partial branch may exist on GitHub | Orchestrator detects session end. Finalization inspects GitHub state. If commits exist, may mark as partial completion. Task transitions to FAILED or COMPLETED with partial flag. |
| Agent runs out of turns (max_turns limit) | Agent stopped by SDK, not by crash | Session ends normally with status end_turn. Orchestrator finalizes; if PR exists, task is COMPLETED. |
| Agent exceeds cost budget (max_budget_usd limit) | Agent stopped by SDK when budget is reached | Session ends normally. Orchestrator finalizes; if PR exists, task is COMPLETED. |
| Agent is idle for 15 min (AgentCore kills session) | Work in progress may be lost if not committed | Task transitions to TIMED_OUT. Partial work may be on the branch if the agent committed before going idle. |
| Agent exceeds 8-hour max session duration | AgentCore terminates session | Task transitions to TIMED_OUT. Partial work may be on the branch. |
Result inference failures
| Failure mode | Impact | Recovery |
|---|---|---|
| GitHub API unavailable during finalization | Cannot determine outcome | Retry finalization after a delay (e.g. 1 min, up to 3 retries). If still failing, mark task as FAILED with reason “finalization failed — could not verify GitHub state”. |
| Explicit completion signal missing and GitHub shows ambiguous state | Outcome uncertain | Apply decision matrix. When truly ambiguous, mark as FAILED with the ambiguity reason and let the user inspect the branch. |
Orchestrator failures
| Failure mode | Impact | Recovery |
|---|---|---|
| Orchestrator crashes during HYDRATING | Task stuck in HYDRATING | Durable execution (Lambda Durable Functions) automatically replays from the last checkpoint, skipping completed steps. Without durable orchestration, a recovery process detects stuck tasks (in HYDRATING for > N minutes) and restarts them. |
| Orchestrator crashes during RUNNING | Task stuck in RUNNING, session may still be alive | Recovery process detects task is in RUNNING but orchestrator is not managing it. It resumes monitoring the session (using the stored session ID). When the session ends, it runs finalization. |
| Orchestrator crashes during FINALIZING | Task stuck in FINALIZING | Recovery process detects and restarts finalization. Finalization steps must be idempotent (decrementing a counter twice should be detected and handled). |
| DynamoDB unavailable during state transition | State not persisted | Retry with backoff. If the state transition cannot be persisted, the orchestrator must not proceed (risk of inconsistency). After retries are exhausted, alert operators. |
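The idempotency requirement for finalization (a repeated counter decrement must be detected) can be sketched with a per-task "slot released" marker. Everything below is an in-memory illustration with hypothetical names; in DynamoDB the marker check and the decrement would need to happen atomically (for example through a conditional or transactional write), which is not shown.

```typescript
interface FinalizableTask {
  taskId: string;
  userId: string;
  slotReleased: boolean; // set once the concurrency slot has been given back
}

// Releases the user's concurrency slot at most once per task.
// Returns true if this call performed the release, false if it was a repeat.
function releaseSlotOnce(task: FinalizableTask, activeCounts: Map<string, number>): boolean {
  if (task.slotReleased) {
    return false; // finalization was replayed; do not decrement again
  }
  const current = activeCounts.get(task.userId) ?? 0;
  activeCounts.set(task.userId, Math.max(0, current - 1));
  task.slotReleased = true;
  return true;
}

// Usage: a replayed finalization leaves the counter unchanged.
const activeByUser = new Map<string, number>([["user-1", 2]]);
const finishedTask: FinalizableTask = { taskId: "task-123", userId: "user-1", slotReleased: false };
releaseSlotOnce(finishedTask, activeByUser);
releaseSlotOnce(finishedTask, activeByUser);
console.log(activeByUser.get("user-1")); // 1
```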
Recovery mechanisms summary
- Durable execution. The orchestrator uses a durable execution model (see Implementation options) that survives process crashes. State is checkpointed at each transition.
- Idempotent operations. All steps and transitions are designed to be safely retried.
- Stuck-task detection. A periodic process (e.g. CloudWatch Events + Lambda) scans for tasks stuck in non-terminal states beyond expected durations and either resumes or fails them.
- Counter reconciliation. A periodic process compares concurrency counters to actual running task counts and corrects drift.
- Dead-letter queue. Tasks that fail all retries are sent to a DLQ for manual investigation.
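Stuck-task detection reduces to comparing how long a task has been in its current non-terminal state against an expected maximum. The sketch below shows that filter over an in-memory list; the threshold values and all names are assumptions chosen for illustration (the RUNNING threshold mirrors the 8.5-hour safety net used elsewhere in this document).

```typescript
type NonTerminalState = "HYDRATING" | "RUNNING" | "FINALIZING";

interface TrackedTask {
  taskId: string;
  state: NonTerminalState;
  stateEnteredMs: number; // epoch ms when the task entered its current state
}

// Hypothetical maximum expected time in each state.
const MAX_STATE_AGE_MS: Record<NonTerminalState, number> = {
  HYDRATING: 10 * 60_000,
  RUNNING: 8.5 * 3_600_000,
  FINALIZING: 15 * 60_000,
};

// Returns the tasks that have exceeded the expected duration for their state.
function findStuckTasks(tasks: TrackedTask[], nowMs: number): TrackedTask[] {
  return tasks.filter((t) => nowMs - t.stateEnteredMs > MAX_STATE_AGE_MS[t.state]);
}

// Usage: evaluated 9 hours after time zero.
const scanTime = 9 * 3_600_000;
const stuck = findStuckTasks(
  [
    { taskId: "a", state: "HYDRATING", stateEnteredMs: scanTime - 11 * 60_000 },
    { taskId: "b", state: "RUNNING", stateEnteredMs: scanTime - 2 * 3_600_000 },
    { taskId: "c", state: "RUNNING", stateEnteredMs: 0 },
  ],
  scanTime,
);
console.log(stuck.map((t) => t.taskId)); // [ 'a', 'c' ]
```

Whether a stuck task is resumed or failed depends on its state and on whether a session ID was recorded; that decision is outside this sketch.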
Concurrency and scaling
How multiple tasks run in parallel
Each task runs in its own isolated AgentCore Runtime session. The orchestrator manages multiple tasks concurrently. There is no shared mutable state between tasks at the compute layer; the orchestrator’s concurrency management is purely at the coordination layer (counters, state transitions, queue processing).
Capacity limits
| Limit | Value | Source |
|---|---|---|
| invoke_agent_runtime TPS | 25 per agent, per account | AgentCore quota (adjustable) |
| Concurrent sessions | Account-level limit (check AgentCore quotas) | AgentCore quota |
| Per-user concurrency | Configurable (recommended default: 3–5) | Platform config |
| System-wide max concurrent tasks | Configurable (bounded by AgentCore session limit) | Platform config |
Queue design
In the original design, tasks that could not start immediately (user or system at capacity) were placed in a queue.
Note: Task queuing (QUEUED state) was removed from the implementation in Iteration 3bis. Tasks that exceed the concurrency limit are rejected immediately rather than queued. If queuing is needed in the future, a DynamoDB-based queue design can be added back.
If queuing is reinstated, the queue processor would be triggered by:
- Task finalization (when a slot opens) via EventBridge or DynamoDB Streams
- A periodic sweep (e.g. every 30 seconds via CloudWatch Events) to catch missed triggers
Counter management
Concurrency is tracked using atomic counters:
- UserConcurrency. A DynamoDB item per user: { user_id, active_count }. Incremented atomically (conditional update: active_count < max) during admission. Decremented during finalization.
- SystemConcurrency. A single DynamoDB item: { pk: "SYSTEM", active_count }. Same pattern.
Counter drift. If the orchestrator crashes after starting a session but before persisting the session-to-task mapping, or after a session ends but before decrementing the counter, the counter drifts. Mitigation:
- Always persist the task state transition before taking the action (write-ahead pattern). For example, persist the task as RUNNING and record the session ID before calling invoke_agent_runtime.
- Run a reconciliation Lambda every 5 minutes (EventBridge schedule): query the Tasks table for tasks in RUNNING + HYDRATING state per user (GSI on user_id + status), compare the count to UserConcurrency.active_count, and correct via UpdateItem if different. The Lambda emits a counter_drift_corrected CloudWatch metric (dimensions: user_id, drift_amount) when it corrects a value, and a counter_reconciliation_run metric on every execution for health monitoring.
- Emit a CloudWatch alarm when drift is detected (see OBSERVABILITY.md). If automated reconciliation fails (e.g. Lambda error), escalate to operator via SNS notification.
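The counter semantics described above can be modeled in a few lines. The sketch below is an in-memory stand-in, not DynamoDB code: tryAcquireSlot mimics the conditional update (increment only while active_count < max), and reconcileCounter mimics the reconciliation pass that overwrites the counter with the observed number of active tasks. Both function names are hypothetical.

```typescript
interface ConcurrencyCounter {
  active_count: number;
}

// Mimics the conditional update used during admission:
// increment only if active_count < max, otherwise reject.
function tryAcquireSlot(counter: ConcurrencyCounter, max: number): boolean {
  if (counter.active_count >= max) {
    return false;
  }
  counter.active_count += 1;
  return true;
}

// Mimics the reconciliation pass: compares the counter to the number of tasks
// actually in RUNNING or HYDRATING state and corrects it if they differ.
// Returns the drift that was corrected (0 when the counter was accurate).
function reconcileCounter(counter: ConcurrencyCounter, actualActiveTasks: number): number {
  const drift = counter.active_count - actualActiveTasks;
  if (drift !== 0) {
    counter.active_count = actualActiveTasks;
  }
  return drift;
}

// Usage: the third admission is rejected at max = 2; a later reconciliation
// finds only one task actually active and corrects a drift of 1.
const userCounter: ConcurrencyCounter = { active_count: 0 };
const admissions = [1, 2, 3].map(() => tryAcquireSlot(userCounter, 2));
console.log(admissions); // [ true, true, false ]
console.log(reconcileCounter(userCounter, 1)); // 1
```

In the real system the check and increment are a single atomic conditional write, so two concurrent admissions cannot both take the last slot; the in-memory version only shows the decision logic.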
Implementation options
Option A: Lambda Durable Functions
How it works. The orchestrator is a single Lambda function using the Lambda Durable Execution SDK (available for TypeScript and Python). The blueprint is written as sequential code with durable operations (step, wait, waitForCallback, waitForCondition). Each operation creates a checkpoint; if the function is interrupted or needs to wait, it suspends without compute charges. On resumption, the SDK replays from the beginning, skipping completed checkpoints using stored results.
Conceptual orchestrator code (TypeScript):

```typescript
export const handler = withDurableExecution(
  async (event: TaskEvent, context: DurableContext) => {
    // --- Framework: load blueprint, validate, and resolve step pipeline ---
    const blueprint = await context.step('load-blueprint', async () => {
      const repoConfig = await loadRepoConfig(event.repo);
      const merged = mergeWithDefaults(repoConfig);
      const pipeline = resolveStepPipeline(merged);
      validateStepSequence(pipeline.steps); // Throws INVALID_STEP_SEQUENCE if invalid
      return pipeline; // Returns: { steps: ResolvedStep[], computeStrategy, config }
    });

    // --- Framework: iterate steps with invariant enforcement ---
    let pipelineState: PipelineState = { event };

    for (const step of blueprint.steps) {
      // Framework: check for cancellation between steps
      await context.step(`cancel-check-${step.name}`, async () => {
        const task = await getTask(event.taskId);
        if (task.cancelRequested) throw new CancellationError();
      });

      // Framework: filter config per step (least-privilege)
      const filteredConfig = filterConfigForStep(step, blueprint.config);

      // Framework: build step input with pruned previous results
      const input: StepInput = {
        taskId: event.taskId,
        repo: event.repo,
        blueprintConfig: filteredConfig,
        previousStepResults: pruneResults(pipelineState, /* keepLast */ 5),
      };

      // Framework: emit step-start event, execute step, emit step-end event
      const stepResult = await context.step(step.name, async () => {
        await emitEvent(event.taskId, `${step.name}_started`);
        try {
          let result: StepOutput;
          if (step.type === 'builtin') {
            // Built-in step: call the registered step function.
            // Built-in steps that need durable waits (e.g. await-agent-completion)
            // receive the DurableContext and ComputeStrategy so they can call
            // waitForCondition + computeStrategy.pollSession() internally.
            result = await step.execute(input, {
              durableContext: context,
              computeStrategy: blueprint.computeStrategy,
            });
          } else {
            // Custom Lambda step: invoke with retry policy
            result = await invokeCustomStepWithRetry(
              step.functionArn,
              input,
              step.timeoutSeconds,
              step.maxRetries ?? 2, // default: 2 retries
            );
          }

          enforceMetadataSize(result, /* maxBytes */ 10_240);
          await emitEvent(event.taskId, `${step.name}_completed`, result.metadata);
          return result;
        } catch (err) {
          await emitEvent(event.taskId, `${step.name}_failed`, { error: String(err) });
          throw err;
        }
      });

      pipelineState[step.name] = stepResult;
    }

    return pipelineState['finalize'];
  },
);

// --- Built-in step: await-agent-completion ---
// Polling is delegated to the ComputeStrategy, not hardcoded by step name.
async function awaitAgentCompletion(
  input: StepInput,
  opts: { durableContext: DurableContext; computeStrategy: ComputeStrategy },
): Promise<StepOutput> {
  const sessionHandle = input.previousStepResults['start-session']?.metadata?.sessionHandle;
  const pollIntervalMs = input.blueprintConfig.poll_interval_ms ?? 30_000;

  const sessionResult = await opts.durableContext.waitForCondition(
    'agent-completion-poll',
    async () => {
      const status = await opts.computeStrategy.pollSession(sessionHandle);
      return status.status !== 'running' ? status : undefined;
    },
    {
      interval: { seconds: pollIntervalMs / 1000 },
      timeout: { hours: 8, minutes: 30 },
    },
  );

  return {
    status: sessionResult.status === 'completed' ? 'success' : 'failed',
    metadata: { sessionResult },
    error: sessionResult.status === 'failed' ? sessionResult.error : undefined,
  };
}
```

Pros:
- Durable execution natively in Lambda. Checkpoint/replay mechanism survives interruptions. State is automatically persisted at each durable operation. No separate orchestration service needed.
- Sequential code, not a DSL. The blueprint is standard TypeScript/Python: no Amazon States Language, no JSON state machine definitions. Easier to read, test, debug, and refactor. The orchestrator logic lives in the same codebase and language as the CDK infrastructure.
- No compute charges during waits. When the orchestrator waits for the agent session to finish (hours), it suspends between poll intervals via waitForCondition. No Lambda compute is billed during suspension. Charges apply only to actual processing (admission, hydration, poll calls, finalization).
- Execution duration up to 1 year. Far exceeds the 8-hour agent session limit. No risk of the orchestrator timing out before the agent finishes.
- Condition-based polling for session completion. The waitForCondition primitive evaluates a condition function at configurable intervals (e.g. every 30 seconds). Combined with AgentCore’s async invocation model and sticky routing, the orchestrator re-invokes the same session to check status (a fast, lightweight call). This cleanly solves the “how does the orchestrator know the session is done” problem without a blocking wrapper, Fargate sidecar, or external callback infrastructure.
- Built-in retry with checkpointing. Steps support configurable retry strategies and at-most-once / at-least-once execution semantics. Failed steps can retry without re-executing already-completed work.
- Parallel execution. context.parallel() and context.map() enable concurrent operations (e.g. parallel hydration sources, parallel post-agent checks).
- Operational simplicity. Serverless, auto-scaling, scale-to-zero. No Step Functions state machines to deploy and manage separately.
- Same development toolchain. Standard Lambda development: CDK, SAM, IDE, unit tests, LLM agents for code generation. No separate visual designer or DSL required.
Cons:
- New service (launched 2025). Lambda Durable Functions is relatively new. Less battle-tested than Step Functions. Documentation and community examples are still growing.
- Determinism requirement. Code outside durable operations must be deterministic (same result on replay). Non-deterministic operations (UUID generation, timestamps, API calls) must be wrapped in a step operation. This is a programming discipline requirement that developers must understand.
- Checkpoint size limit. 256 KB per checkpoint. Step results larger than this require child contexts and re-execution during replay. For this orchestrator, step results (task metadata, hydrated prompt references) are small, so this is not expected to be an issue.
- No visual workflow editor. Unlike Step Functions, there is no drag-and-drop visual designer or built-in execution graph view. Debugging relies on CloudWatch logs, execution history API, and code-level tracing.
- Less mature cross-service integration. Step Functions has 220+ native service integrations. Durable Functions operates within Lambda — external service calls go through the SDK in steps. For this orchestrator (which calls DynamoDB, AgentCore, GitHub), this is not a limitation since all calls are made via SDKs anyway.
Option B: AWS Step Functions (Standard Workflows)
How it works. Each task triggers a Step Functions state machine execution. The state machine defines the blueprint steps as states: admission control (Lambda), hydration (Lambda), session start (Lambda + wait), session monitor (Lambda + wait loop), finalization (Lambda). State is automatically persisted at each transition.
Pros:
- Mature, battle-tested service with extensive documentation.
- Visual workflow in the AWS console for debugging.
- Native support for wait states (up to 1 year), retries with backoff, parallel branches.
- 220+ native AWS service integrations.
- Pay per state transition, not per second of wait time.
Cons:
- Workflow defined in ASL/DSL, not code. The blueprint must be translated to Amazon States Language or CDK Step Functions constructs. This is a separate abstraction from the application code, harder to test as a unit, and requires context-switching between code and state machine definitions.
- Session monitoring requires a Wait+Poll state machine loop. With the async invocation model, invoke_agent_runtime returns immediately, so the 15-minute Lambda limit is no longer a problem. However, the poll loop must be modeled as a Wait state + Lambda task + Choice state cycle in the state machine definition (ASL), which is verbose compared to a single waitForCondition call in code.
- Separate infrastructure to manage. The state machine is a separate deployed resource. Changes to the orchestration logic require redeploying the state machine, not just a Lambda function.
- Cost per state transition. $0.025 per 1,000 transitions. For ~50 transitions per task, ~$0.00125 per task — negligible but non-zero.
Option C: Lambda + DynamoDB (manual orchestration)
How it works. A coordinator Lambda is triggered by task creation. It reads the task record, runs admission control, performs hydration, starts the session, and writes state back to DynamoDB. A separate Lambda on a schedule checks for tasks in RUNNING state. Finalization is triggered when session completion is detected.
Pros:
- Full control over the implementation.
- No dependency on durable execution framework.
Cons:
- Must implement state persistence, retry logic, error handling, timeout management, and crash recovery manually. This is error-prone and the core value proposition of durable execution frameworks.
- Lambda 15-minute max execution time means the monitoring loop must be periodic invocations.
- No built-in checkpoint/replay — all durability is hand-rolled.
- Idempotency and exactly-once semantics are the developer’s responsibility.
Option D: EventBridge + Lambda (event-driven)
How it works. Each state transition emits an EventBridge event. Lambda functions trigger on events and perform the next step.
Pros:
- Loosely coupled; easy to add new steps or side-effects.
- EventBridge provides retry, DLQ, and filtering.
Cons:
- No centralized view of workflow state.
- Debugging distributed event chains is harder.
- Session monitoring does not fit naturally into an event-driven model.
- All durability is the developer’s responsibility.
Recommendation: Lambda Durable Functions
Lambda Durable Functions is the recommended implementation. Rationale:
- Durable execution is the core requirement. Tasks run for hours. The orchestrator must survive crashes, resume from checkpoints, and handle retries. Durable Functions provides this natively with checkpoint/replay.
- The blueprint maps to sequential code. The blueprint’s step sequence (admission → hydration → session start → wait for completion → finalize) is naturally expressed as sequential code with durable operations. No DSL translation, no state machine abstraction — the code is the orchestrator.
- Condition-based polling solves the session-monitoring problem cleanly. The waitForCondition primitive suspends the orchestrator between poll intervals (no compute charges). Combined with AgentCore’s async invocation model (non-blocking start, sticky routing for status polls), the orchestrator detects session completion without a blocking wrapper Lambda, Fargate sidecar, or external callback infrastructure, which is the key technical challenge that makes Step Functions awkward for this use case.
- Cost-efficient for long-running waits. The orchestrator incurs no compute charges while it is suspended during the hours the agent runs. Charges apply only to the seconds of actual processing (admission, hydration, poll calls, finalization).
- Same language, same codebase. The orchestrator is TypeScript (or Python), co-located with the CDK infrastructure code and the agent code. Standard development toolchain: IDE, unit tests, code review, CDK deploy.
- Simpler operational model. One Lambda function, not a Lambda + Step Functions state machine + optional Fargate task. Fewer moving parts to deploy, monitor, and debug.
Trade-off acknowledged: Lambda Durable Functions is newer than Step Functions. If the team encounters maturity issues (bugs, missing features, insufficient documentation), Step Functions (Option B) is the fallback. The blueprint step contract (idempotent, timeout-bounded, failure-aware) is implementation-agnostic — switching from Durable Functions to Step Functions requires re-wiring the orchestrator, not redesigning the blueprint.
Session monitoring pattern (async invocation + poll)
Section titled “Session monitoring pattern (async invocation + poll)”The key architectural pattern that makes Lambda Durable Functions work for this use case leverages AgentCore’s asynchronous processing model and sticky session routing:
- Orchestrator starts the session via context.step('start-session', ...). The invoke_agent_runtime call sends the hydrated payload. The agent receives it, starts the coding task in a background thread (registering via add_async_task), and returns an acknowledgment immediately. The step completes in seconds.
- Orchestrator polls for completion via context.waitForCondition(...). At configurable intervals (e.g. every 30 seconds), the condition function re-invokes invoke_agent_runtime on the same runtimeSessionId. Sticky routing ensures the request reaches the same instance. The agent’s entrypoint detects this is a status poll and returns the current state:
  - { status: "running" }: task still in progress. The condition returns undefined, and the orchestrator suspends until the next interval (no compute charges during the wait).
  - { status: "completed", pr_url: "...", cost_usd: ... }: task finished. The condition returns the result, and the orchestrator resumes to finalization.
  - { status: "failed", error: "..." }: task failed. Same as above, with an error payload.
- Session termination detection. If the session is terminated externally (idle timeout, 8-hour limit, crash, or user cancellation), the re-invocation call either fails (session not found) or AgentCore starts a new session for that ID. The orchestrator detects this (e.g. by checking if the response is an unexpected acknowledgment rather than a status) and proceeds to finalization using GitHub-based result inference as a fallback.
- Timeout safety net. The waitForCondition has a timeout (e.g. 8.5 hours, slightly beyond the AgentCore 8-hour max). If no completion is detected within this window, the orchestrator resumes with a timeout error and runs finalization.
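The decision logic of the poll condition can be sketched in a few lines. This is a minimal illustration, not the orchestrator's actual code: check_session, wait_for_condition, the session_lost marker, and the infer_from_github flag are hypothetical names, and the loop is a plain in-process stand-in for the durable wait.

```python
from typing import Iterable, Optional

# Status values taken from the poll responses described above.
TERMINAL_STATUSES = {"completed", "failed"}


def check_session(response: Optional[dict]) -> Optional[dict]:
    """Condition function for one poll.

    Returns None to keep waiting, or a result dict to resume finalization.
    `response` is None when the re-invocation itself failed (session not found).
    """
    if response is None:
        # Session is gone: finalization falls back to GitHub-based inference.
        return {"status": "session_lost", "infer_from_github": True}

    status = response.get("status")
    if status == "running":
        return None  # suspend until the next poll interval
    if status in TERMINAL_STATUSES:
        return response  # completed or failed: hand the payload to finalization

    # Anything else (e.g. a fresh acknowledgment from a new session on the
    # same ID) is treated as an externally terminated session.
    return {"status": "session_lost", "infer_from_github": True}


def wait_for_condition(responses: Iterable[Optional[dict]], max_polls: int) -> dict:
    """Toy stand-in for the durable wait: one loop turn per poll interval."""
    for polls, response in enumerate(responses, start=1):
        result = check_session(response)
        if result is not None:
            return result
        if polls >= max_polls:
            break
    return {"status": "timed_out"}  # timeout safety net
```

With a 30-second interval and an 8.5-hour timeout, max_polls would be 1,020 in this toy model.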
Why this pattern works:
- No blocking call. Each invoke_agent_runtime call (initial and polls) completes in seconds. No Lambda, Fargate task, or wrapper needs to hold a connection for 8 hours.
- No external callback infrastructure. The orchestrator detects completion by polling: no need for the agent to call SendDurableExecutionCallbackSuccess, no EventBridge subscription, no sidecar.
- No compute charges during waits. The durable execution suspends between poll intervals. At 30-second intervals over an 8-hour session, the orchestrator performs ~960 lightweight polls. Each poll is a fast Lambda invocation (sub-second). Total orchestrator compute is minutes, not hours.
- Resilient to agent crashes. If the agent crashes, the next poll detects the session is gone. The orchestrator does not hang waiting for a callback that will never arrive.
Poll interval cost analysis at scale:
| Concurrent tasks | Polls/day (30s interval, 8hr avg) | Lambda invocations/day | invoke_agent_runtime TPS (peak) | Lambda cost/month |
|---|---|---|---|---|
| 10 | ~9,600 | ~9,600 | ~0.3 | ~$0.002 |
| 50 | ~48,000 | ~48,000 | ~1.7 | ~$0.01 |
| 200 | ~192,000 | ~192,000 | ~6.7 | ~$0.04 |
| 500 | ~480,000 | ~480,000 | ~16.7 | ~$0.10 |
The invoke_agent_runtime quota is 25 TPS per agent per account (adjustable). At 500 concurrent tasks with 30-second polls, peak TPS is ~16.7 — within quota. Lambda cost is negligible at all projected scales. The first bottleneck is the AgentCore concurrent session quota, not the poll mechanism.
Tuning: The 30-second interval is suitable for typical tasks (1–2 hours). For longer tasks (4+ hours), a 60-second or adaptive interval halves poll invocations with minimal impact on status update latency. The poll interval should be configurable per blueprint (via blueprint_config.poll_interval_ms).
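The poll counts and peak TPS figures follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative, and it assumes, as the table does, that each concurrent task issues exactly one poll per interval.

```python
QUOTA_TPS = 25  # invoke_agent_runtime quota quoted in the text (adjustable)


def polls_per_task(task_duration_s: int, interval_s: int) -> int:
    """Number of status polls issued over one task's lifetime."""
    return task_duration_s // interval_s


def peak_tps(concurrent_tasks: int, interval_s: int) -> float:
    """Poll rate when every concurrent task polls once per interval."""
    return concurrent_tasks / interval_s


eight_hours_s = 8 * 60 * 60
for interval_s in (30, 60):
    polls = polls_per_task(eight_hours_s, interval_s)
    tps = peak_tps(500, interval_s)
    print(f"interval={interval_s}s polls/task={polls} "
          f"tps@500={tps:.1f} within_quota={tps <= QUOTA_TPS}")
```

Doubling the interval from 30 to 60 seconds halves both the per-task poll count (960 to 480 for an 8-hour task) and the peak TPS.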
Agent-side contract for the poll pattern:
The agent’s entrypoint must distinguish between an initial task invocation and a status poll. Recommended approach:
- The initial invocation payload contains the full task context (prompt, repo, etc.) and a type: "task" field.
- Poll invocations contain type: "poll" (or simply an empty/minimal payload that the agent interprets as a status check).
- The agent maintains task state in memory (or a local store) and responds to polls with the current status.
- On completion, the agent writes a completion record to an external store (e.g. DynamoDB) as a durable backup — so even if the next poll fails, the orchestrator can query DynamoDB during finalization.
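A minimal sketch of this contract, using only the standard library, might look as follows. It does not use the AgentCore SDK: AgentState, handle_invocation, the run_task callback, and the "idle" and "accepted" values are hypothetical names chosen for illustration, and the durable completion record is left as a comment.

```python
import threading
from typing import Callable, Optional


class AgentState:
    """In-memory task state shared between the worker thread and poll handlers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: dict = {"status": "idle"}
        self.worker: Optional[threading.Thread] = None

    def get(self) -> dict:
        with self._lock:
            return dict(self._status)

    def set(self, **fields) -> None:
        with self._lock:
            self._status = fields


def handle_invocation(payload: dict, state: AgentState,
                      run_task: Callable[[dict], dict]) -> dict:
    """Start work for a task payload; report current status for anything else."""
    if payload.get("type") == "task":
        state.set(status="running")

        def worker() -> None:
            try:
                result = run_task(payload)  # long-running agent work
                state.set(status="completed", **result)
            except Exception as exc:  # surface any failure to the next poll
                state.set(status="failed", error=str(exc))
            # A durable completion record (e.g. DynamoDB) would be written here.

        state.worker = threading.Thread(target=worker, daemon=True)
        state.worker.start()
        return {"status": "accepted"}  # immediate acknowledgment

    # type == "poll", or an empty/minimal payload: report the current status.
    return state.get()
```

Keeping the state behind a lock means a poll that arrives while the worker is finishing sees either the running status or the complete result, never a partial update.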
Data model (conceptual)
Tasks table
The primary table for task state. DynamoDB.
| Field | Type | Description |
|---|---|---|
| task_id (PK) | String (ULID) | Unique task identifier. ULID provides sortable, unique IDs. |
| user_id | String | Cognito sub or mapped platform user ID. |
| status | String | Current state (see state machine). |
| repo | String | GitHub owner/repo (e.g. org/myapp). |
| issue_number | Number (optional) | GitHub issue number, if task is issue-based. |
| task_description | String (optional) | Free-text task description. |
| branch_name | String | Agent branch: bgagent/{task_id}/{slug}. |
| session_id | String (optional) | AgentCore runtime session ID, set when session is started. |
| execution_id | String (optional) | Lambda durable execution ID, set when the orchestrator starts. |
| pr_url | String (optional) | Pull request URL, set during finalization. |
| error_message | String (optional) | Error reason if FAILED. |
| error_code | String (optional) | Machine-readable error code if FAILED (e.g. INVALID_STEP_SEQUENCE, SESSION_START_FAILED, TIMEOUT). Used for failure categorization in the evaluation pipeline and surfaced via GET /v1/tasks/{id}. |
| idempotency_key | String (optional) | Client-supplied idempotency key. |
| channel_source | String | Originating channel (cli, slack, web, etc.). |
| channel_metadata | Map (optional) | Channel-specific routing data (Slack channel+thread, CLI request ID). |
| created_at | String (ISO 8601) | Task creation timestamp. |
| updated_at | String (ISO 8601) | Last state transition timestamp. |
| started_at | String (optional) | When the session was started (entered RUNNING). |
| completed_at | String (optional) | When the task reached a terminal state. |
| cost_usd | Number (optional) | Agent cost from the SDK result. |
| duration_s | Number (optional) | Total task duration in seconds. |
| build_passed | Boolean (optional) | Post-agent build verification result. |
| max_turns | Number (optional) | Maximum agent turns for this task. Set during task creation: either the user-specified value (1–500) or the platform default (100). Included in the orchestrator payload and consumed by the agent SDK’s ClaudeAgentOptions(max_turns=...). |
| max_budget_usd | Number (optional) | Maximum cost budget in USD for this task. Set during task creation: either the user-specified value ($0.01–$100) or the per-repo Blueprint default. When reached, the agent stops regardless of remaining turns. If neither the task nor the Blueprint specifies a value, no budget limit is applied (turn limit and session timeout still apply). Included in the orchestrator payload and consumed by the agent SDK’s ClaudeAgentOptions(max_budget_usd=...). |
| blueprint_config | Map (optional) | Snapshot of the RepoConfig record at task creation time (or a reference to it). This ensures tasks are not affected by mid-flight config changes. The schema follows the RepoConfig interface defined in REPO_ONBOARDING.md. Includes compute_type, runtime_arn, model_id, max_turns, system_prompt_overrides, github_token_secret_arn, poll_interval_ms, custom_steps, step_sequence, and egress_allowlist. The max_turns value from blueprint_config serves as the per-repo default; per-task max_turns (from the API request) takes higher priority. max_budget_usd follows the same 2-tier override pattern: per-task value takes priority over blueprint_config.max_budget_usd; if neither is specified, no budget limit is applied. |
| prompt_version | String | Hash or version identifier of the system prompt used for this task. Required for prompt versioning (Iteration 3b). Enables correlation between prompt changes and task outcomes in the evaluation pipeline. |
| model_id | String (optional) | Foundation model ID used for this task (e.g. anthropic.claude-sonnet-4-20250514). Defaults to the platform default; overridden by blueprint_config.model_id from onboarding. Stored for cost attribution and evaluation correlation. |
| ttl | Number (optional) | DynamoDB TTL epoch (seconds). Set when the task reaches a terminal state. DynamoDB automatically deletes the record after this timestamp. Configurable via taskRetentionDays (default 90 days). |
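The override order for max_turns and max_budget_usd described in the table can be sketched as a small resolver. This is an illustration under stated assumptions: resolve_limits is a hypothetical helper, and applying the range checks only to per-task (user-specified) values is a guess at where validation happens.

```python
from typing import Optional

PLATFORM_DEFAULT_MAX_TURNS = 100  # platform default stated in the table


def resolve_limits(task_request: dict, blueprint_config: dict) -> dict:
    """Apply per-task values first, then blueprint defaults, then platform defaults."""
    max_turns: Optional[int] = task_request.get("max_turns")
    if max_turns is not None:
        if not 1 <= max_turns <= 500:
            raise ValueError("max_turns must be between 1 and 500")
    else:
        max_turns = blueprint_config.get("max_turns")
        if max_turns is None:
            max_turns = PLATFORM_DEFAULT_MAX_TURNS

    budget: Optional[float] = task_request.get("max_budget_usd")
    if budget is not None:
        if not 0.01 <= budget <= 100:
            raise ValueError("max_budget_usd must be between 0.01 and 100")
    else:
        # May remain None: no budget limit is applied in that case.
        budget = blueprint_config.get("max_budget_usd")

    return {"max_turns": max_turns, "max_budget_usd": budget}
```

A task submitted with no limits against a blueprint that sets only max_turns would resolve to the blueprint's turn limit and no budget limit.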
Global Secondary Indexes:
| GSI | Key schema | Purpose |
|---|---|---|
| UserStatusIndex | PK: user_id, SK: status#created_at | List tasks by user, filtered by status. Powers “my tasks” and queue processing. |
| StatusIndex | PK: status, SK: created_at | List tasks by status. Powers system-wide queue processing and monitoring dashboards. |
| IdempotencyIndex | PK: idempotency_key | Idempotency check during admission. Sparse index (only tasks with a key). |
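The composite status#created_at sort key works because ISO 8601 timestamps in a single fixed format sort lexicographically in chronological order. The sketch below emulates a begins_with key condition in plain Python; the helper names and the QUEUED status value are hypothetical, and it assumes all timestamps are UTC in the same format.

```python
from typing import List


def user_status_sort_key(status: str, created_at: str) -> str:
    """Build the UserStatusIndex sort key: status, '#', ISO 8601 timestamp."""
    return f"{status}#{created_at}"


def tasks_with_status(sort_keys: List[str], status: str) -> List[str]:
    """Emulate a begins_with(SK, 'STATUS#') key condition in ascending order."""
    prefix = f"{status}#"
    return sorted(key for key in sort_keys if key.startswith(prefix))


keys = [
    user_status_sort_key("RUNNING", "2025-03-02T10:00:00Z"),
    user_status_sort_key("QUEUED", "2025-03-01T09:00:00Z"),
    user_status_sort_key("RUNNING", "2025-03-01T08:00:00Z"),
]
oldest_running_first = tasks_with_status(keys, "RUNNING")
```

Because the oldest matching task comes first, the same key shape can serve a per-user FIFO queue.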
TaskEvents table
Append-only audit log. See OBSERVABILITY.md for the event list.
| Field | Type | Description |
|---|---|---|
| task_id (PK) | String | Task identifier. |
| event_id (SK) | String (ULID) | Unique, sortable event ID. |
| event_type | String | E.g. task_created, admission_passed, hydration_complete, session_started, session_ended, pr_created, task_completed, task_failed, task_cancelled, task_timed_out. |
| timestamp | String (ISO 8601) | When the event occurred. |
| metadata | Map (optional) | Event-specific data (e.g. error message, PR URL, session ID). |
| ttl | Number | DynamoDB TTL epoch (seconds). Set at event creation time. DynamoDB automatically deletes the record after this timestamp. Configurable via taskRetentionDays (default 90 days). |
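The ttl attribute on both tables is an epoch timestamp in seconds. A minimal way to derive it from an ISO 8601 reference time and taskRetentionDays is sketched below; ttl_epoch is a hypothetical helper, and treating timestamps without an offset as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch(reference_time_iso: str, task_retention_days: int = 90) -> int:
    """Return the DynamoDB TTL value: reference time plus retention, in epoch seconds."""
    reference = datetime.fromisoformat(reference_time_iso)
    if reference.tzinfo is None:
        # Assumption: naive timestamps are UTC.
        reference = reference.replace(tzinfo=timezone.utc)
    expires_at = reference + timedelta(days=task_retention_days)
    return int(expires_at.timestamp())


# For a task, the reference is the terminal-state time; for an event, its creation time.
task_ttl = ttl_epoch("2025-01-01T00:00:00+00:00")
```

Note that datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 onward, so the example uses an explicit +00:00 offset.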
UserConcurrency table
Atomic counters for per-user concurrency management.
| Field | Type | Description |
|---|---|---|
| user_id (PK) | String | User identifier. |
| active_count | Number | Number of currently running tasks for this user. |
| updated_at | String (ISO 8601) | Last update timestamp. |
Operations:
- Increment: UpdateItem with SET active_count = active_count + 1 and ConditionExpression: active_count < :max.
- Decrement: UpdateItem with SET active_count = active_count - 1 and ConditionExpression: active_count > 0.
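The two operations can be expressed as request parameters for the low-level DynamoDB UpdateItem call. The sketch below only builds the parameter dictionaries and makes no AWS call. The builder names and the table name are hypothetical, and it assumes the user's row already exists with active_count initialized to 0, since the expressions as written fail on a missing attribute. Numeric literals are passed through placeholders (:one, :zero) because DynamoDB expressions do not accept inline values.

```python
def increment_request(table_name: str, user_id: str, max_concurrent: int) -> dict:
    """UpdateItem parameters that add one, only while under the per-user limit."""
    return {
        "TableName": table_name,
        "Key": {"user_id": {"S": user_id}},
        "UpdateExpression": "SET active_count = active_count + :one",
        "ConditionExpression": "active_count < :max",
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":max": {"N": str(max_concurrent)},
        },
    }


def decrement_request(table_name: str, user_id: str) -> dict:
    """UpdateItem parameters that subtract one, never going below zero."""
    return {
        "TableName": table_name,
        "Key": {"user_id": {"S": user_id}},
        "UpdateExpression": "SET active_count = active_count - :one",
        "ConditionExpression": "active_count > :zero",
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":zero": {"N": "0"},
        },
    }


request = increment_request("UserConcurrency", "user-123", max_concurrent=3)
```

With boto3, such a dictionary could be passed as client.update_item(**request); a failed condition surfaces as a ConditionalCheckFailedException, which admission control would treat as "limit reached" rather than as an error.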
Session mapping
The session ID → task ID mapping is stored as a field on the Tasks table (session_id). No separate table is needed. To look up a task by session ID (e.g. when processing a session completion event), a GSI on session_id can be added if needed.
Open questions
These are design decisions not yet resolved. Each is framed as a question with options and trade-offs.
Q1: Session completion signaling — RESOLVED
Question: Given that invoke_agent_runtime blocks until the session ends (up to 8 hours), how does the durable orchestrator detect session completion without burning compute?
Resolution: This question is resolved by AgentCore’s asynchronous invocation model. invoke_agent_runtime does not need to block for hours. The agent starts work in a background thread and returns immediately. The orchestrator uses waitForCondition to poll the session via re-invocation (sticky routing) at 30-second intervals. Each poll is a fast, non-blocking call. The orchestrator suspends between polls (no compute charges). See the session monitoring pattern in the Implementation options section.
The original options (a) wrapper Lambda/Fargate and (c) agent calls callback directly are no longer needed. The poll-based approach (originally option b) is the natural fit now that the invocation itself is non-blocking.
Q2: Session status API availability — RESOLVED
Question: Does AgentCore provide a way to query session status (running, completed, failed) without blocking?
Resolution: Yes, via two mechanisms:
- Re-invocation on the same session (sticky routing). Calling invoke_agent_runtime with the same runtimeSessionId routes to the same instance. The agent responds with its current status. This is the primary status mechanism.
- /ping health endpoint. The agent reports HealthyBusy (processing) or Healthy (idle) via the /ping endpoint. AgentCore uses this for session lifecycle management (idle timeout). The orchestrator does not call /ping directly but benefits from it keeping the session alive.
No separate GetRuntimeSessionStatus API is needed — the re-invocation pattern provides equivalent functionality.
Q3: Completion signal mechanism — RESOLVED
Question: How should the agent signal task completion to the orchestrator?
Resolution: The agent signals completion via the re-invocation poll response. When the orchestrator re-invokes on the same session, the agent returns { status: "completed", ... } or { status: "failed", ... }. This is the primary signal.
Layered reliability:
| Layer | Mechanism | Purpose |
|---|---|---|
| Primary | Re-invocation poll response | Agent returns status directly to the orchestrator’s poll call. Fast, reliable, in-band. |
| Secondary | DynamoDB completion record | Agent writes a completion record (task_id, status, pr_url, error) to DynamoDB before exiting. The orchestrator checks this during finalization or if the poll detects session termination without a clean status response. |
| Fallback | GitHub state inspection | If both the poll and DynamoDB record are unavailable (agent crash before writing), the orchestrator falls back to GitHub-based result inference (branch exists? PR exists? commits?). |
Recommendation: Implement the primary (poll) and secondary (DynamoDB record) signals in Iteration 2. GitHub inspection remains the fallback as it is today.
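The three layers form a simple fallback chain during finalization. The sketch below shows one way to order them; resolve_outcome and the source field are hypothetical, and the GitHub inference rule (a PR means success, otherwise failure) is a deliberately simplified assumption rather than the platform's actual inference logic.

```python
from typing import Optional

TERMINAL_STATUSES = ("completed", "failed")


def resolve_outcome(poll_result: Optional[dict],
                    completion_record: Optional[dict],
                    github_state: dict) -> dict:
    """Pick the task outcome from the most trustworthy signal available."""
    # Primary: a clean terminal status returned by the re-invocation poll.
    if poll_result and poll_result.get("status") in TERMINAL_STATUSES:
        return {**poll_result, "source": "poll"}

    # Secondary: the completion record the agent wrote before exiting.
    if completion_record is not None:
        return {**completion_record, "source": "dynamodb"}

    # Fallback: infer from GitHub state (simplified rule, see lead-in).
    if github_state.get("pr_url"):
        return {"status": "completed", "pr_url": github_state["pr_url"],
                "source": "github"}
    return {"status": "failed", "error": "no completion signal and no PR found",
            "source": "github"}
```

Recording which layer produced the outcome makes it possible to measure how often the fallbacks are exercised.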
Q4: Queue priority
Question: Should the task queue support priority levels?
Recommendation: Start without priority (strict FIFO per user). Add priority if a concrete need arises.
Q5: Token budget management — RESOLVED
Question: Should the orchestrator enforce a token budget during context hydration, or should the agent harness manage its own context window?
Resolution: Both. The orchestrator enforces a character-based token budget (~4 chars/token, default 100K tokens) during context hydration, truncating oldest issue comments first when the budget is exceeded. The agent harness handles its own context compaction during multi-turn conversations. See the Context hydration section for implementation details.
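The character-based budget can be illustrated with a short truncation routine. This is a sketch only: fit_comments_to_budget and its parameters are hypothetical, and counting the non-comment context (issue body, task description) against the same budget is an assumption.

```python
from typing import List

CHARS_PER_TOKEN = 4            # approximation stated in the resolution
DEFAULT_TOKEN_BUDGET = 100_000  # default stated in the resolution


def fit_comments_to_budget(fixed_context: str, comments: List[str],
                           token_budget: int = DEFAULT_TOKEN_BUDGET) -> List[str]:
    """Drop the oldest comments until the hydrated context fits the budget.

    `comments` is ordered oldest first; the returned list keeps that order.
    """
    char_budget = token_budget * CHARS_PER_TOKEN
    remaining = char_budget - len(fixed_context)
    kept = list(comments)
    total = sum(len(comment) for comment in kept)
    while kept and total > remaining:
        total -= len(kept.pop(0))  # discard the oldest comment first
    return kept
```

With the defaults, the budget is 400,000 characters, so truncation only triggers on unusually long issue threads.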
Q6: Post-agent validation and retry cycles
Question: When a post-agent validation step fails (e.g. build fails), should the orchestrator restart the agent for a fix cycle?
| Option | Description | Trade-off |
|---|---|---|
| (a) No retry | Agent gets one shot. Failure reported in PR. | Simplest; cheapest. |
| (b) Orchestrator retry (up to N) | New session with failure context. | Adds cost and complexity; doubles compute for each retry. |
| (c) In-session retry | Agent harness includes a “verify and fix” loop via system prompt. | No orchestrator changes; relies on agent following instructions. |
Recommendation: Option (c) for MVP (the current system prompt already instructs the agent to run tests and fix errors). Option (b) for Iteration 3+ when deterministic validation is introduced.
Q7: Orchestrator crash recovery
Question: What if a durable execution itself gets stuck or fails to resume?
Recommendation: Lambda Durable Functions handles most crash recovery via checkpoint/replay. As defense in depth, add a periodic Lambda scanner that checks for tasks stuck in non-terminal states beyond their expected duration (e.g. RUNNING for > 9 hours when the max session is 8 hours). The scanner can trigger finalization or mark tasks as TIMED_OUT. Accept the risk for Iteration 1 (no durable orchestrator).
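The scanner's core check is a comparison between the time spent in a state and a per-state threshold. A minimal sketch follows; find_stuck_tasks and the STUCK_AFTER mapping are hypothetical, only the RUNNING threshold comes from the text, and timestamps are assumed to carry an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# RUNNING for more than 9 hours is suspicious when the max session is 8 hours.
STUCK_AFTER: Dict[str, timedelta] = {"RUNNING": timedelta(hours=9)}


def find_stuck_tasks(tasks: List[dict], now: datetime) -> List[str]:
    """Return IDs of tasks that exceeded the expected duration for their state."""
    stuck = []
    for task in tasks:
        limit = STUCK_AFTER.get(task["status"])
        if limit is None:
            continue  # terminal or untracked state
        entered = datetime.fromisoformat(task["started_at"])
        if now - entered > limit:
            stuck.append(task["task_id"])
    return stuck
```

In practice the candidate list would come from a query on the StatusIndex GSI, and each returned task would be finalized or marked TIMED_OUT.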
Q8: Branch name pre-generation
Question: Should the orchestrator pre-generate the branch name, or should the agent generate it inside the session?
Current behavior: The agent entrypoint generates the branch name from task ID and issue title.
Recommendation: Pre-generate in the orchestrator. The branch name follows a deterministic pattern (bgagent/{task_id}/{slug}) so it can be computed from task metadata. This enables the orchestrator to store the branch name in the task record before the session starts, simplifying result inference.
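Because the pattern is deterministic, the branch name can be computed from task metadata alone. The sketch below shows one possible derivation; the slug rules (lowercase, hyphen-separated, 40-character cap, "task" as the empty fallback) are assumptions, since the source fixes only the bgagent/{task_id}/{slug} pattern.

```python
import re


def slugify(title: str, max_len: int = 40) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:max_len].rstrip("-") or "task"


def branch_name(task_id: str, title: str) -> str:
    """Compose the agent branch name from the task ID and a title slug."""
    return f"bgagent/{task_id}/{slugify(title)}"


example = branch_name("01HXYZ", "Fix login bug (#42)!")
```

Computing the name before the session starts lets the orchestrator persist branch_name in the task record and check for that exact branch during result inference.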
Q9: DynamoDB single-table vs. multi-table
Question: Should Tasks, TaskEvents, and UserConcurrency share one DynamoDB table or use separate tables?
Recommendation: Start with separate tables (simpler, clearer access patterns). Consolidate later if the operational burden becomes an issue.
Q10: Notification timing
Question: When should the orchestrator emit user notifications?
Recommendation: Notify on task accepted, task running, and terminal states (completed/failed/cancelled/timed_out) in Iteration 2. Add configurable per-user preferences in later iterations.