Orchestrator

The orchestrator drives the task lifecycle from submission to completion. It runs every deterministic step (admission, context hydration, session start, result inference, cleanup) and delegates the non-deterministic step (the agent workload) to an isolated compute session. This separation keeps bookkeeping cheap and predictable while containing the expensive, unpredictable agent work inside the compute environment.

The orchestrator is implemented as a Lambda Durable Function. Durable execution provides checkpoint/replay across process restarts, suspension without compute charges during long waits, and condition-based polling for session completion. See the Implementation section for details.

  • Use this doc for: task state machine, admission/finalization flow, cancellation behavior, failure recovery, and concurrency management.
  • Related docs: ARCHITECTURE.md for the high-level blueprint model, COMPUTE.md for the session runtime, MEMORY.md for context sources, REPO_ONBOARDING.md for per-repo customization.

The orchestrator sits between the API layer and the agent runtime. Changes to task submission, the CLI, or the container image touch these boundaries, so knowing where each contract lives avoids drift.

| Concern | Location | Notes |
| --- | --- | --- |
| REST request/response types | cdk/src/handlers/shared/types.ts | Mirror in cli/src/types.ts |
| HTTP handlers and orchestration | cdk/src/handlers/ | Tests under cdk/test/handlers/ |
| Agent runtime | agent/src/ (pipeline.py, runner.py, config.py, hooks.py, prompts/) | See agent/README.md for env vars and local run |

The orchestrator is deliberately scoped. It handles coordination and bookkeeping but never touches agent logic, compute infrastructure, or memory storage. This clear boundary means a crashed agent does not leave orphaned state, and platform invariants (concurrency limits, event audit, cancellation) cannot be bypassed by agent code.

| Responsibility | Description |
| --- | --- |
| Task lifecycle | Accept tasks, drive them through the state machine to a terminal state, persist state at each transition |
| Admission control | Validate repo onboarding, concurrency limits, rate limits, idempotency |
| Context hydration | Assemble the agent prompt from user input, GitHub data, memory, and repo config |
| Session management | Start the compute session, monitor liveness via heartbeat, detect completion |
| Result inference | Determine success or failure from agent response, DynamoDB record, and GitHub state |
| Finalization | Update status, emit events, release concurrency, persist audit records |
| Cancellation | Stop the session and drive the task to CANCELLED at any point |
| Concurrency | Track per-user and system-wide running task counts with atomic counters |

Out of scope (owned elsewhere):

| Component | Owner | Reference |
| --- | --- | --- |
| Request authentication | Input gateway | INPUT_GATEWAY.md |
| Agent logic (clone, code, test, PR) | Agent runtime | COMPUTE.md |
| Compute session lifecycle (VM, image pull) | AgentCore Runtime | COMPUTE.md |
| Memory storage and retrieval | AgentCore Memory | MEMORY.md |
| Repository onboarding | Blueprint construct | REPO_ONBOARDING.md |

Every task moves through a finite set of states from creation to a terminal outcome. The state machine is the backbone of the orchestrator: it determines what actions are valid at each point, when resources are acquired or released, and how the platform recovers from failures. Four of the eight states are terminal, meaning the task is done and no further transitions occur.

| State | Description | Duration |
| --- | --- | --- |
| SUBMITTED | Task accepted, awaiting orchestration | Milliseconds |
| HYDRATING | Fetching GitHub data, querying memory, assembling prompt | Seconds |
| RUNNING | Agent session active in compute environment | Minutes to hours |
| FINALIZING | Result inference and cleanup in progress | Seconds |
| COMPLETED | Terminal. Task finished successfully | - |
| FAILED | Terminal. Task could not complete | - |
| CANCELLED | Terminal. Cancelled by user or system | - |
| TIMED_OUT | Terminal. Exceeded duration or idle timeout | - |

```mermaid
stateDiagram-v2
    [*] --> SUBMITTED
    SUBMITTED --> HYDRATING : Admission passes
    SUBMITTED --> FAILED : Admission rejected
    SUBMITTED --> CANCELLED : User cancels

    HYDRATING --> RUNNING : Session started
    HYDRATING --> FAILED : Hydration error
    HYDRATING --> CANCELLED : User cancels

    RUNNING --> FINALIZING : Session ends
    RUNNING --> CANCELLED : User cancels
    RUNNING --> TIMED_OUT : Duration exceeded
    RUNNING --> FAILED : Session crash

    FINALIZING --> COMPLETED : PR or commits found
    FINALIZING --> FAILED : No useful work
    FINALIZING --> TIMED_OUT : Idle timeout detected
```

| From | To | Trigger | Condition |
| --- | --- | --- | --- |
| SUBMITTED | HYDRATING | Admission passes | Concurrency slot acquired |
| SUBMITTED | FAILED | Admission rejected | Repo not onboarded, rate/concurrency limit, validation error |
| HYDRATING | RUNNING | Hydration complete | invoke_agent_runtime returns session ID |
| HYDRATING | FAILED | Hydration error | GitHub API failure, guardrail blocks content, Bedrock unavailable |
| RUNNING | FINALIZING | Session ends | Response received or session terminated |
| RUNNING | TIMED_OUT | Max duration exceeded | Wall-clock timer (default 8h, matching AgentCore max) |
| RUNNING | FAILED | Session crash | Heartbeat lost (see Liveness monitoring) |
| FINALIZING | COMPLETED | Success inferred | PR exists or commits on branch |
| FINALIZING | FAILED | Failure inferred | No commits, no PR, or agent reported error |
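
The state diagram and transition table collapse naturally into a legal-transition map that can be checked before every state write. A sketch in TypeScript; state names match this doc, but the map and helper are illustrative, not the actual implementation:

```typescript
// Legal transitions per the state diagram above (illustrative sketch).
type TaskState =
  | "SUBMITTED" | "HYDRATING" | "RUNNING" | "FINALIZING"
  | "COMPLETED" | "FAILED" | "CANCELLED" | "TIMED_OUT";

const TRANSITIONS: Record<TaskState, TaskState[]> = {
  SUBMITTED:  ["HYDRATING", "FAILED", "CANCELLED"],
  HYDRATING:  ["RUNNING", "FAILED", "CANCELLED"],
  RUNNING:    ["FINALIZING", "CANCELLED", "TIMED_OUT", "FAILED"],
  FINALIZING: ["COMPLETED", "FAILED", "TIMED_OUT"],
  // Terminal states: no further transitions.
  COMPLETED: [], FAILED: [], CANCELLED: [], TIMED_OUT: [],
};

function canTransition(from: TaskState, to: TaskState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

A guard like this is what makes "no task is left in limbo" enforceable: any write that is not in the map is rejected at the coordination layer.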

Users can cancel a task at any point. The orchestrator’s response depends on how far the task has progressed. The key guarantee: every cancel request either transitions the task to CANCELLED or is rejected because the task already reached a terminal state. No task is left in limbo.

| State when cancel arrives | Action |
| --- | --- |
| SUBMITTED | Transition to CANCELLED. No cleanup needed. |
| HYDRATING | Abort hydration, release concurrency slot, transition to CANCELLED. |
| RUNNING | Call stop_runtime_session, wait for confirmation, release concurrency, transition to CANCELLED. Partial work on GitHub remains for the user to inspect. |
| FINALIZING | Let finalization complete. Mark CANCELLED only if the terminal state was not yet written. |
| Terminal | Reject the cancel request. |
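
The table above amounts to a per-state dispatch. A minimal sketch with illustrative names (cancelActionFor is not the real handler, just an encoding of the table):

```typescript
// Maps the task's current state to the cancellation action described above.
type CancelAction =
  | "mark_cancelled"   // no cleanup needed
  | "abort_hydration"  // also releases the concurrency slot
  | "stop_session"     // stop_runtime_session, then CANCELLED
  | "let_finalize"     // mark CANCELLED only if not yet terminal
  | "reject";          // task already reached a terminal state

function cancelActionFor(state: string): CancelAction {
  switch (state) {
    case "SUBMITTED":  return "mark_cancelled";
    case "HYDRATING":  return "abort_hydration";
    case "RUNNING":    return "stop_session";
    case "FINALIZING": return "let_finalize";
    default:           return "reject";
  }
}
```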

Multiple timeout mechanisms work together to prevent runaway tasks. Time-based limits (session duration, idle) are enforced by AgentCore; cost-based limits (turns, budget) are enforced by the agent SDK. The orchestrator acts as a safety net when external timeouts fire.

| Type | Default | Effect |
| --- | --- | --- |
| Max session duration | 8 hours | AgentCore terminates session. Task transitions to TIMED_OUT. |
| Idle timeout | 15 minutes | AgentCore terminates if agent is idle. See Liveness monitoring. |
| Max turns | 100 (range 1-500) | Agent stops after N model invocations. Configurable per task or per repo. |
| Max cost budget | $0.01-$100 | Agent stops when budget is reached. Per-task or per-repo via Blueprint. |
| Hydration timeout | 2 minutes | Fail the task if context assembly takes too long. |

Every task follows a blueprint: a sequence of deterministic steps wrapping one agentic step. The default blueprint is the sequence described in ARCHITECTURE.md. Per-repo customization (see REPO_ONBOARDING.md) changes which steps run without affecting the framework guarantees.

```mermaid
flowchart LR
    A[Admission] --> B[Hydration]
    B --> C[Pre-flight]
    C --> D[Start session]
    D --> E[Await completion]
    E --> F[Finalize]
```

Admission validates the task before any compute is consumed. Checks run in order:

  1. Repo onboarding - GetItem on RepoTable. If not found or inactive, reject with REPO_NOT_ONBOARDED. This runs at the API handler level (createTaskCore) for fast rejection.
  2. User concurrency - Atomic check-and-increment on UserConcurrency counter. If at limit (default 3-5), reject.
  3. System concurrency - Compare total running + hydrating tasks to system limit (bounded by AgentCore quotas).
  4. Rate limiting - Sliding window counter (10 tasks/hour per user). Exceeded tasks are rejected, not queued.
  5. Idempotency - If the request includes an idempotency key and a task with that key exists, return the existing task.

On acceptance, the concurrency slot is acquired and the task transitions to HYDRATING.
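
The user-concurrency check (step 2) is the only admission check that mutates state. A sketch of the atomic check-and-increment expressed as DynamoDB UpdateItem parameters; the table and attribute names follow this doc, but the exact expression wording is an assumption:

```typescript
// Builds the UpdateItem parameters for acquiring a concurrency slot.
// The ConditionExpression makes check-and-increment a single atomic
// operation: the write fails if the user is already at the limit.
function acquireSlotParams(userId: string, maxConcurrent: number) {
  return {
    TableName: "UserConcurrency",
    Key: { user_id: userId },
    UpdateExpression: "SET active_count = if_not_exists(active_count, :zero) + :one",
    ConditionExpression: "attribute_not_exists(active_count) OR active_count < :max",
    ExpressionAttributeValues: { ":zero": 0, ":one": 1, ":max": maxConcurrent },
  };
}
```

A ConditionalCheckFailedException on this update maps directly to the "at limit, reject" admission outcome.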

Hydration assembles the agent’s user prompt; the implementation lives in context-hydration.ts. What it does, by task type:

new_task: Fetches the GitHub issue (title, body, comments) if issue_number is set, loads memory from past tasks, and combines everything with the user’s task description.

pr_iteration / pr_review: Fetches PR metadata, conversation comments, changed files (REST), and inline review comments (GraphQL, resolved threads filtered out) in four parallel calls. Extracts head_ref and base_ref for branch resolution.

Regardless of task type, the assembled prompt is screened through Amazon Bedrock Guardrails for prompt injection (fail-closed: unscreened content never reaches the agent). A token budget (default 100K tokens, ~4 chars/token heuristic) trims oldest comments first when exceeded.
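
The trimming heuristic can be sketched as follows. CHARS_PER_TOKEN and the function name are illustrative; the real logic in context-hydration.ts may differ in detail:

```typescript
// Token-budget trim: estimate tokens at ~4 chars each and drop the
// oldest comments first until the assembled prompt fits the budget.
const CHARS_PER_TOKEN = 4;

function trimToBudget(
  base: string,                  // issue/PR body plus task description
  commentsOldestFirst: string[], // comments in chronological order
  maxTokens: number,
): string[] {
  const estimate = (s: string) => Math.ceil(s.length / CHARS_PER_TOKEN);
  const kept = [...commentsOldestFirst];
  let total = estimate(base) + kept.reduce((n, c) => n + estimate(c), 0);
  while (kept.length > 0 && total > maxTokens) {
    total -= estimate(kept.shift()!); // drop the oldest comment first
  }
  return kept;
}
```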

A pre-flight sub-step verifies the GitHub token has sufficient permissions for the task type, catches inaccessible PRs, and confirms GitHub API reachability. This fails fast with clear errors like INSUFFICIENT_GITHUB_REPO_PERMISSIONS before compute is consumed.

The orchestrator calls invoke_agent_runtime with the hydrated payload. The agent receives it, starts the coding task in a background thread (via add_async_task), and returns an acknowledgment immediately. The orchestrator records the (task_id, session_id) mapping and transitions to RUNNING.

The session ID is pre-generated and reused on retry, making session start idempotent after a crash.

The orchestrator polls for completion using waitForCondition from the Durable Execution SDK. At configurable intervals (default 30s), it re-invokes on the same session (sticky routing). The agent responds with its current status:

  • running - Orchestrator suspends until next interval (no compute charges)
  • completed - Orchestrator resumes to finalization with the result
  • failed - Same, with error payload

If the session is terminated externally (crash, timeout, cancellation), the poll detects it and the orchestrator proceeds to finalization using GitHub-based result inference as fallback.
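
Only the response-interpretation half of the poll is deterministic enough to sketch here. The waitForCondition wiring is elided, and the PollResponse shape is an assumption:

```typescript
// Interprets one poll of the agent session. Returning false means
// "suspend until the next interval"; true means "resume to finalization".
interface PollResponse {
  status: "running" | "completed" | "failed";
  error?: string;
  pr_url?: string;
}

function isSessionDone(resp: PollResponse | null): boolean {
  // A null response stands in for an externally terminated session
  // (crash, timeout, cancellation): proceed to finalization and let
  // GitHub-based result inference decide the outcome.
  if (resp === null) return true;
  return resp.status !== "running";
}
```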

After the session ends, the orchestrator determines the outcome from multiple signals.

Completion signals (layered reliability):

| Layer | Mechanism | Purpose |
| --- | --- | --- |
| Primary | Poll response | Agent returns status directly |
| Secondary | DynamoDB completion record | Agent writes before exiting, survives poll failures |
| Fallback | GitHub inspection | Branch exists? PR exists? Commits? |

Decision matrix:

| Agent says | PR exists | Commits | Outcome |
| --- | --- | --- | --- |
| success | Yes | > 0 | COMPLETED |
| success | No | > 0 | COMPLETED (partial, no PR) |
| success | No | 0 | FAILED (nothing done) |
| error | Yes | > 0 | COMPLETED (with warning) |
| error | No | any | FAILED |
| unknown | - | - | FAILED |
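
The matrix maps directly onto a small pure function. A sketch; cells not listed in the matrix (such as success with a PR but zero commits) are conservatively treated as FAILED here, which is an assumption, not documented behavior:

```typescript
// Direct encoding of the decision matrix above.
type AgentSays = "success" | "error" | "unknown";

function inferOutcome(
  agentSays: AgentSays,
  prExists: boolean,
  commits: number,
): "COMPLETED" | "FAILED" {
  if (agentSays === "success") {
    // Success claims are trusted only if there is actual work on the branch.
    return commits > 0 ? "COMPLETED" : "FAILED";
  }
  if (agentSays === "error") {
    // An agent-reported error still counts as COMPLETED (with warning)
    // when a PR with commits exists.
    return prExists && commits > 0 ? "COMPLETED" : "FAILED";
  }
  return "FAILED"; // unknown: no trusted signal
}
```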

Cleanup: Update task status with metadata (PR URL, cost, duration). Set TTL for data retention (default 90 days). Emit task events. Release concurrency counter. Send notifications. Persist code attribution to memory.

Every step in the pipeline satisfies these properties:

  • Idempotent - Safe to retry after crashes. Context hydration produces the same prompt for the same inputs; session start reuses a pre-generated session ID.
  • Timeout-bounded - Each step has a configurable timeout to prevent blocking the pipeline.
  • Failure-aware - Each step reports success or failure explicitly. Infrastructure failures (throttles, transient errors) trigger exponential backoff retries (default: 2 retries, base 1s, max 10s); explicit failures transition the task to FAILED without retry.
  • Least-privilege input - Each step receives only the blueprintConfig fields it needs. Custom Lambda steps get credential ARNs stripped.
  • Bounded output - StepOutput.metadata is limited to 10KB. previousStepResults is pruned to the last 5 steps to stay within the 256KB checkpoint limit.
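
The retry defaults above imply this backoff schedule. A sketch with illustrative names; jitter, which a production retry loop would typically add, is omitted:

```typescript
// Exponential backoff schedule: base * 2^attempt, capped at maxMs.
// Defaults mirror the step properties above (2 retries, base 1s, max 10s).
function backoffDelaysMs(retries = 2, baseMs = 1000, maxMs = 10000): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}
```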

Per REPO_ONBOARDING.md, blueprints customize execution through three layers:

  1. Parameterized strategies - Select built-in implementations without code. Example: compute.type: 'agentcore' vs compute.type: 'ecs'.
  2. Lambda-backed custom steps - Inject custom logic at pre-agent or post-agent phases. Example: SAST scan before the agent, custom lint after.
  3. Custom step sequences - Override the default step order entirely via an ordered step_sequence list.

The framework enforces state transitions, event emission, cancellation checks, concurrency management, and timeouts regardless of customization.

Agent sessions run for minutes to hours inside isolated compute environments. The orchestrator does not control the agent’s behavior, but it needs to know whether the session is alive, healthy, and eventually done. This section covers how the orchestrator maintains that visibility without blocking or burning compute.

Liveness detection varies by compute backend. AgentCore sessions use DynamoDB heartbeats and a /ping health endpoint; ECS Fargate tasks rely on the ECS DescribeTasks API since the ECS entrypoint does not write heartbeats.

DynamoDB heartbeat (AgentCore only). The agent writes agent_heartbeat_at every 45 seconds via a daemon thread. When computeType === 'agentcore', the orchestrator applies the following checks during polling:

  • Grace period (120s) - After entering RUNNING, the orchestrator waits before expecting heartbeats (covers container startup).
  • Stale threshold (240s) - If the heartbeat exists but is older than this, the session is treated as lost.
  • Early crash - If no heartbeat is ever set after the combined window (360s), the agent died before the pipeline started.

When the session is unhealthy, the task transitions to FAILED with “Agent session lost: no recent heartbeat.”
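
The grace period, stale threshold, and early-crash window combine into a three-way verdict. A sketch with illustrative names; times are epoch milliseconds:

```typescript
// Heartbeat thresholds from the doc above.
const GRACE_MS = 120_000; // startup grace after entering RUNNING
const STALE_MS = 240_000; // heartbeat older than this = session lost

function heartbeatVerdict(
  nowMs: number,
  runningSinceMs: number,       // when the task entered RUNNING
  lastBeatMs: number | null,    // agent_heartbeat_at, or null if never written
): "waiting" | "healthy" | "lost" {
  if (lastBeatMs === null) {
    // Early crash: no heartbeat ever written after the combined 360s window.
    return nowMs - runningSinceMs > GRACE_MS + STALE_MS ? "lost" : "waiting";
  }
  return nowMs - lastBeatMs > STALE_MS ? "lost" : "healthy";
}
```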

ECS task status polling (ECS only). The orchestrator calls computeStrategy.pollSession (ECS DescribeTasks) on each poll cycle. Three failure modes are detected: container failure (immediate FAILED), container exit without DynamoDB terminal write (fail after 5 consecutive completed polls), and repeated API failures (fail after 3 consecutive errors). ECS does not have heartbeat-based hung-process detection; a hung but alive container polls for the full MAX_POLL_ATTEMPTS window (~8.5h) before timing out.

/ping health endpoint (AgentCore only). The agent’s FastAPI server responds to AgentCore’s /ping calls while the coding task runs in a separate thread. AgentCore sees HealthyBusy and keeps the session alive.

AgentCore terminates sessions after 15 minutes of inactivity. Since coding tasks may have long pauses between tool calls (builds, complex reasoning), the agent uses add_async_task to register background work. The SDK reports HealthyBusy via /ping while any async task is active, preventing idle termination.

Risk: if the agent process becomes entirely unresponsive (not just a thread), /ping may not respond, triggering termination. The defense is running the coding task in a separate thread that does not starve the main thread.

Long-running distributed systems fail. The orchestrator is designed so that every failure mode has a defined recovery path and every task eventually reaches a terminal state. The table below maps each step to its known failure modes and what the orchestrator does about them.

| Step | Failure | Recovery |
| --- | --- | --- |
| Admission | DynamoDB unavailable | Retry 3x with backoff, then reject |
| Admission | Concurrency counter drifted | Reconciliation Lambda corrects every 15 minutes |
| Hydration | GitHub API down/rate limited | Retry with backoff. Fail if issue is essential; degrade if user also provided a description |
| Hydration | Memory service unavailable | Proceed without memory (it is enrichment, not required) |
| Hydration | Guardrail blocks content | Fail the task (content is adversarial, no retry) |
| Hydration | Guardrail API unavailable | Fail the task (fail-closed: unscreened content never reaches agent) |
| Session start | invoke_agent_runtime throttled | Exponential backoff. Fail after retries exhausted. |
| Session start | Session crashes immediately | AgentCore: heartbeat never set, detected after 360s grace window. ECS: DescribeTasks reports failure on next poll. |
| Running | Agent crashes mid-task | AgentCore: heartbeat goes stale. ECS: DescribeTasks reports stopped task. Finalization inspects GitHub for partial work. |
| Running | Agent hits turn or budget limit | Session ends normally. Finalize based on what was produced. |
| Running | Idle for 15 min | AgentCore kills session. Task transitions to TIMED_OUT. |
| Finalization | GitHub API down | Retry 3x. If still failing, mark FAILED with infrastructure reason. |
| Orchestrator | Crash during any step | Durable execution replays from last checkpoint. |

Five mechanisms back these recovery paths:

  1. Durable execution - Lambda Durable Functions checkpoints at each state transition and replays after crashes.
  2. Idempotent operations - All steps are safe to retry.
  3. Stuck-task scanner - Periodic Lambda detects tasks stuck beyond expected durations and either resumes or fails them.
  4. Counter reconciliation - Lambda runs every 15 minutes, compares counters to actual running task counts, corrects drift. Emits counter_drift_corrected CloudWatch metric.
  5. Dead-letter queue - Tasks that exhaust retries go to DLQ for investigation.

Each task runs in its own isolated compute session with no shared mutable state at the compute layer. The orchestrator manages concurrency purely at the coordination layer: atomic counters track how many tasks are active per user and system-wide, and admission control enforces limits before resources are consumed.

| Limit | Value | Source |
| --- | --- | --- |
| invoke_agent_runtime TPS | 25 per agent/account | AgentCore quota (adjustable) |
| Concurrent sessions | Account-level limit | AgentCore quota |
| Per-user concurrency | Configurable (default 3-5) | Platform config |
| System-wide max tasks | Configurable | Bounded by AgentCore session limit |

  • UserConcurrency - DynamoDB item per user with active_count. Incremented atomically (active_count < max) at admission, decremented at finalization.
  • SystemConcurrency - Single DynamoDB item, same pattern.

Concurrency is always released in finalizeTask (step 6), never inside the poll loop. ECS poll failure paths call failTask with releaseConcurrency: false to transition the task to FAILED without decrementing — finalizeTask handles the single decrement after re-reading the task state. The heartbeat-detected crash path also guards against double-decrement by only releasing the counter after a successful state transition. If the transition fails (task already terminal), it re-reads and acts accordingly.

The orchestrator needed a runtime that survives hours-long waits without burning compute, recovers from crashes without losing progress, and expresses the blueprint as readable code rather than a DSL. Lambda Durable Functions fits all three requirements. The blueprint is written as sequential TypeScript with durable operations (step, wait, waitForCondition). Each operation creates a checkpoint; if the function is interrupted, it suspends without compute charges and replays from the last checkpoint on resumption.

Key properties:

  • No compute during waits. The orchestrator pays nothing while the agent runs for hours. At 30-second poll intervals over an 8-hour session, total orchestrator compute is minutes.
  • Execution duration up to 1 year. Far exceeds the 8-hour agent session limit.
  • Sequential code, not a DSL. The blueprint maps naturally to TypeScript with durable operations. No Amazon States Language or state machine abstractions.
  • Built-in retry with checkpointing. Steps support configurable retry strategies without re-executing completed work.
```mermaid
sequenceDiagram
    participant O as Orchestrator
    participant AC as AgentCore
    participant A as Agent

    O->>AC: invoke_agent_runtime (payload)
    AC->>A: Deliver payload
    A->>A: Start task in background thread
    A-->>AC: Ack (immediate)
    AC-->>O: Session started

    loop Every 30s via waitForCondition
        O->>AC: invoke_agent_runtime (same session)
        AC->>A: Route to same instance
        A-->>O: { status: "running" }
        Note over O: Suspend (no compute charges)
    end

    A->>A: Task complete
    O->>AC: invoke_agent_runtime (same session)
    A-->>O: { status: "completed", pr_url: "..." }
    O->>O: Proceed to finalization
```

| Concurrent tasks | Polls/day (30s, 8h avg) | Peak TPS | Lambda cost/month |
| --- | --- | --- | --- |
| 10 | ~9,600 | ~0.3 | ~$0.002 |
| 50 | ~48,000 | ~1.7 | ~$0.01 |
| 200 | ~192,000 | ~6.7 | ~$0.04 |
| 500 | ~480,000 | ~16.7 | ~$0.10 |

At 500 concurrent tasks, peak TPS is ~16.7 - well within the 25 TPS AgentCore quota. The bottleneck is the concurrent session quota, not the poll mechanism.
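
The table follows from simple arithmetic: each task issues one poll per 30-second interval over its ~8-hour lifetime (960 polls per task), and peak TPS is the concurrent task count divided by the poll interval. A sketch of that arithmetic (function name is illustrative):

```typescript
// Reproduces the poll-volume figures in the cost table above.
function pollStats(concurrentTasks: number, intervalS = 30, sessionH = 8) {
  const pollsPerTask = (sessionH * 3600) / intervalS; // 960 polls per session
  return {
    pollsPerDay: concurrentTasks * pollsPerTask,
    peakTps: concurrentTasks / intervalS, // one poll per task per interval
  };
}
```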

Three DynamoDB tables back the orchestrator: one for task state, one for the audit log, and one for concurrency counters. The Tasks table is the source of truth for every task; the orchestrator reads and writes it at every state transition. TaskEvents is append-only and powers the GET /v1/tasks/{id}/events API. UserConcurrency is a lightweight counter table used only during admission and finalization.

| Field | Type | Description |
| --- | --- | --- |
| task_id (PK) | String (ULID) | Unique, sortable task ID |
| user_id | String | Cognito sub |
| status | String | Current state |
| repo | String | owner/repo |
| task_type | String | new_task, pr_iteration, or pr_review |
| issue_number | Number? | GitHub issue number |
| pr_number | Number? | PR number (required for PR task types) |
| task_description | String? | Free-text description |
| branch_name | String | bgagent/{task_id}/{slug} for new tasks; PR’s head_ref for PR tasks |
| session_id | String? | AgentCore session ID |
| execution_id | String? | Durable execution ID |
| pr_url | String? | PR URL (set during finalization) |
| error_message | String? | Error reason if FAILED |
| error_code | String? | Machine-readable error code (e.g. SESSION_START_FAILED) |
| max_turns | Number? | Turn limit (per-task overrides per-repo default) |
| max_budget_usd | Number? | Cost ceiling (per-task overrides per-repo default) |
| model_id | String? | Foundation model ID |
| prompt_version | String | System prompt hash for evaluation correlation |
| blueprint_config | Map? | Snapshot of RepoConfig at task creation |
| cost_usd | Number? | Agent cost from SDK |
| duration_s | Number? | Total duration |
| ttl | Number? | DynamoDB TTL (default: created_at + 90 days) |
| created_at / updated_at | String | ISO 8601 timestamps |

Derived field: error_classification is not stored in DynamoDB. It is computed at API response time by passing error_message through the runtime error classifier (error-classifier.ts). This returns a structured object with category (auth/network/concurrency/compute/agent/guardrail/config/timeout/unknown), title, description, remedy, and retryable flag. The derived-field pattern means classifier updates take effect immediately for all existing tasks without data migration.

GSIs: UserStatusIndex (PK: user_id, SK: status#created_at), StatusIndex (PK: status, SK: created_at), IdempotencyIndex (PK: idempotency_key, sparse).

Append-only audit log. See OBSERVABILITY.md.

| Field | Type | Description |
| --- | --- | --- |
| task_id (PK) | String | Task ID |
| event_id (SK) | String (ULID) | Sortable event ID |
| event_type | String | task_created, hydration_complete, session_started, pr_created, task_completed, etc. |
| timestamp | String | ISO 8601 |
| metadata | Map? | Event-specific data |
| ttl | Number | Same retention as tasks |

The UserConcurrency table holds one counter item per user:

| Field | Type | Description |
| --- | --- | --- |
| user_id (PK) | String | User ID |
| active_count | Number | Running task count |

Increment: SET active_count = active_count + 1 with ConditionExpression: active_count < :max. Decrement: SET active_count = active_count - 1 with ConditionExpression: active_count > 0.
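
Expressed as UpdateItem parameters, the guarded decrement looks roughly like this; the exact expression wording is an assumption, but the condition is what prevents the counter from going negative on a double release:

```typescript
// Builds the UpdateItem parameters for releasing a concurrency slot.
// ConditionExpression "active_count > 0" makes a duplicate release fail
// with ConditionalCheckFailedException instead of corrupting the counter.
function releaseSlotParams(userId: string) {
  return {
    TableName: "UserConcurrency",
    Key: { user_id: userId },
    UpdateExpression: "SET active_count = active_count - :one",
    ConditionExpression: "active_count > :zero",
    ExpressionAttributeValues: { ":one": 1, ":zero": 0 },
  };
}
```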