DynamoDB

DynamoDB is used for workflow state management, not for search. It applies the One Table Design pattern to manage all state for projects, documents, workflows, segments, and processing steps in a single table.


| Item | Value |
|---|---|
| Billing | On-Demand |
| Partition Key | PK (String) |
| Sort Key | SK (String) |
| GSI1 | GSI1PK / GSI1SK |
| GSI2 | GSI2PK / GSI2SK |
| Stream | NEW_AND_OLD_IMAGES |
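
As a rough illustration, the settings above could be expressed as keyword arguments shaped for DynamoDB's CreateTable API. This is a sketch under assumptions, not the project's actual infrastructure code: the table name `workflow-state`, the index names `GSI1` and `GSI2`, the String type of the GSI key attributes, and the `ALL` projection are not stated in the source.

```python
# Hypothetical table name; the source does not state it.
TABLE_NAME = "workflow-state"


def build_create_table_params(table_name: str = TABLE_NAME) -> dict:
    """Return keyword arguments shaped for DynamoDB's CreateTable API."""

    def gsi(name: str) -> dict:
        # Assumption: index names match the key attribute prefixes (GSI1, GSI2).
        return {
            "IndexName": name,
            "KeySchema": [
                {"AttributeName": f"{name}PK", "KeyType": "HASH"},
                {"AttributeName": f"{name}SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},  # assumption
        }

    # Assumption: GSI key attributes are Strings, like PK and SK.
    key_attrs = ["PK", "SK", "GSI1PK", "GSI1SK", "GSI2PK", "GSI2SK"]
    return {
        "TableName": table_name,
        "BillingMode": "PAY_PER_REQUEST",  # on-demand billing
        "AttributeDefinitions": [
            {"AttributeName": attr, "AttributeType": "S"} for attr in key_attrs
        ],
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},
            {"AttributeName": "SK", "KeyType": "RANGE"},
        ],
        "GlobalSecondaryIndexes": [gsi("GSI1"), gsi("GSI2")],
        "StreamSpecification": {
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",
        },
    }
```

With boto3 installed, a dictionary like this could be passed as `client.create_table(**build_create_table_params())`.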

PK: PROJ#{project_id}
SK: META

| Field | Description |
|---|---|
| data.name | Project name |
| data.language | Language (default: en) |
| data.document_prompt | Custom prompt for document analysis |
| data.ocr_model | OCR model (default: pp-ocrv5) |

PK: PROJ#{project_id}
SK: DOC#{document_id}

Query documents belonging to a project with begins_with(SK, 'DOC#').
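
The `begins_with` lookup can be sketched as a request for DynamoDB's low-level Query API. The function name and the table name `workflow-state` are hypothetical; only the key formats come from this page.

```python
def project_documents_query(project_id: str, table_name: str = "workflow-state") -> dict:
    """Build Query parameters that list every DOC# item under one project."""
    return {
        "TableName": table_name,  # hypothetical table name
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"PROJ#{project_id}"},
            ":prefix": {"S": "DOC#"},
        },
    }
```

Swapping the prefix for `WF#` would give the project workflow list instead, since both link items share the same partition key.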

PK: PROJ#{project_id}
SK: WF#{workflow_id}

| Field | Description |
|---|---|
| data.file_name | File name |
| data.status | Workflow status |

PK: DOC#{document_id} (or WEB#{document_id})
SK: WF#{workflow_id}

| Field | Description |
|---|---|
| data.project_id | Parent project |
| data.file_uri | S3 path |
| data.file_name | File name |
| data.file_type | MIME type |
| data.execution_arn | Step Functions execution ARN |
| data.status | pending / in_progress / completed / failed |
| data.total_segments | Total segment count |
| data.preprocess | Per-stage preprocessing status (ocr, bda, transcribe, webcrawler) |

PK: WF#{workflow_id}
SK: STEP
GSI1PK: STEP#ANALYSIS_STATUS
GSI1SK: pending | in_progress | completed | failed

Tracks the status of each processing step in a workflow. GSI1 enables fast lookup of currently running analyses.
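
A minimal sketch of the GSI1 lookup, again as low-level Query parameters. The index name `GSI1` and the table name are assumptions; the key values are the ones listed above.

```python
VALID_STATUSES = ("pending", "in_progress", "completed", "failed")


def analysis_status_query(
    status: str = "in_progress",
    table_name: str = "workflow-state",  # hypothetical table name
    index_name: str = "GSI1",  # assumed index name
) -> dict:
    """Build Query parameters that find STEP items by analysis status."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "GSI1PK = :pk AND GSI1SK = :status",
        "ExpressionAttributeValues": {
            ":pk": {"S": "STEP#ANALYSIS_STATUS"},
            ":status": {"S": status},
        },
    }
```

Because every STEP item shares the same GSI1PK value, the status in GSI1SK is what narrows the result to running workflows.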

| Step | Description |
|---|---|
| segment_prep | Segment preparation |
| bda_processor | Bedrock Data Automation (BDA) |
| format_parser | Format parsing |
| paddleocr_processor | PaddleOCR processing |
| transcribe | Audio transcription |
| webcrawler | Web crawling |
| segment_builder | Segment construction |
| segment_analyzer | AI analysis (Claude) |
| graph_builder | Graph construction |
| document_summarizer | Document summarization |

Each step has status, label, started_at, ended_at, and error attributes.
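
The per-step attributes could be modeled roughly as follows. This is a sketch only: the helper names, the ISO 8601 timestamp format, and the rule that a step with an error ends as `failed` are assumptions, and the source does not show how steps are nested inside the STEP item.

```python
from datetime import datetime, timezone
from typing import Optional


def _now() -> str:
    # Assumption: timestamps are stored as ISO 8601 strings in UTC.
    return datetime.now(timezone.utc).isoformat()


def new_step(label: str) -> dict:
    """Create a step record with the five attributes listed above."""
    return {
        "status": "pending",
        "label": label,
        "started_at": None,
        "ended_at": None,
        "error": None,
    }


def start_step(step: dict) -> dict:
    """Return a copy of the step marked as running."""
    return {**step, "status": "in_progress", "started_at": _now()}


def finish_step(step: dict, error: Optional[str] = None) -> dict:
    """Return a copy marked completed, or failed when an error is given."""
    return {
        **step,
        "status": "failed" if error else "completed",
        "ended_at": _now(),
        "error": error,
    }
```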

PK: WF#{workflow_id}
SK: SEG#{segment_index:04d} ← 0001, 0002, ...

| Field | Description |
|---|---|
| data.segment_index | Segment index |
| data.s3_key | S3 path (segment data) |
| data.image_uri | Image URI |
| data.image_analysis | Image analysis results array |
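
The zero padding in the segment sort key matters because DynamoDB orders String sort keys lexicographically: without padding, `SEG#10` would sort before `SEG#2`. A small sketch of the key format (the function name and the range check are illustrative additions):

```python
def segment_sk(segment_index: int) -> str:
    """Format a segment sort key with a four-digit, zero-padded index."""
    # Four digits keep lexicographic and numeric order aligned up to 9999.
    if not 0 <= segment_index <= 9999:
        raise ValueError("segment_index must fit in four digits")
    return f"SEG#{segment_index:04d}"


padded = [segment_sk(i) for i in (1, 2, 10, 3000)]
unpadded = [f"SEG#{i}" for i in (1, 2, 10, 3000)]

# Padded keys are already in sorted order; unpadded keys are not.
print(sorted(padded) == padded)
print(sorted(unpadded) == unpadded)
```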

| Query | Index | Key Condition |
|---|---|---|
| Project document list | Primary | PK=PROJ#{proj_id}, SK begins_with DOC# |
| Project workflow list | Primary | PK=PROJ#{proj_id}, SK begins_with WF# |
| Workflow metadata | Primary | PK=DOC#{doc_id}, SK=WF#{wf_id} |
| Step progress | Primary | PK=WF#{wf_id}, SK=STEP |
| Segment list | Primary | PK=WF#{wf_id}, SK begins_with SEG# |
| Specific segment | Primary | PK=WF#{wf_id}, SK=SEG#{index} |
| In-progress analysis | GSI1 | GSI1PK=STEP#ANALYSIS_STATUS, GSI1SK=in_progress |

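  • Grouped writes: Workflow metadata and step status are created together in one batch_write call. If batch_write wraps DynamoDB's BatchWriteItem, the batch is not all-or-nothing across items; strict atomicity requires TransactWriteItems
  • Efficient queries: All documents or workflows for a project are retrieved with a single query
  • Cost reduction: A single table keeps operational complexity, and with it operating cost, low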
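
Where all-or-nothing creation of the workflow metadata and step items is required, DynamoDB's TransactWriteItems API provides it. The sketch below builds such a request; it is not the project's code. The function name, the table name, storing `data` as a map attribute, the initial `pending` values, and the `attribute_not_exists` guard are all assumptions.

```python
def create_workflow_transaction(
    document_id: str,
    workflow_id: str,
    project_id: str,
    table_name: str = "workflow-state",  # hypothetical table name
) -> dict:
    """Build TransactWriteItems parameters for the metadata and STEP items."""
    metadata_item = {
        "PK": {"S": f"DOC#{document_id}"},
        "SK": {"S": f"WF#{workflow_id}"},
        # Assumption: the data.* fields live in a map attribute named "data".
        "data": {"M": {
            "project_id": {"S": project_id},
            "status": {"S": "pending"},
        }},
    }
    step_item = {
        "PK": {"S": f"WF#{workflow_id}"},
        "SK": {"S": "STEP"},
        "GSI1PK": {"S": "STEP#ANALYSIS_STATUS"},
        "GSI1SK": {"S": "pending"},
    }

    def put(item: dict) -> dict:
        return {"Put": {
            "TableName": table_name,
            "Item": item,
            # Assumption: refuse to overwrite an existing item.
            "ConditionExpression": "attribute_not_exists(PK)",
        }}

    # Both puts succeed together or fail together.
    return {"TransactItems": [put(metadata_item), put(step_item)]}
```

With boto3, the result could be passed as `client.transact_write_items(**params)`.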

DynamoDB stores only state and metadata, while actual data (segment content, analysis results) is stored in S3.

DynamoDB                          S3
├─ Workflow status                ├─ Segment raw data
├─ Step progress                  ├─ Analysis results (JSON)
├─ Segment metadata (s3_key)      ├─ Entity extraction results
└─ WebSocket connections          └─ Document summaries

Because Step Functions limits the payload passed between states to 256 KB, DynamoDB serves as intermediate storage. The workflow passes only segment indices from state to state, which allows documents with 3000+ pages to be processed.
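
To see why passing indices stays within the limit, the sketch below builds an index-only payload and measures its serialized size. The payload field names (`workflow_id`, `segment_indices`) and the zero-based indices are hypothetical; the source does not show the actual state input.

```python
import json

STEP_FUNCTIONS_PAYLOAD_LIMIT = 256 * 1024  # bytes


def payload_size(payload: dict) -> int:
    """Return the size of the JSON-serialized payload in bytes."""
    return len(json.dumps(payload).encode("utf-8"))


def build_segment_payload(workflow_id: str, total_segments: int) -> dict:
    """Build a state input that carries segment indices, not segment content."""
    payload = {
        "workflow_id": workflow_id,  # hypothetical field name
        "segment_indices": list(range(total_segments)),  # hypothetical field name
    }
    if payload_size(payload) > STEP_FUNCTIONS_PAYLOAD_LIMIT:
        raise ValueError("payload exceeds the Step Functions limit")
    return payload


payload = build_segment_payload("wf-123", 3000)
print(payload_size(payload), "bytes of", STEP_FUNCTIONS_PAYLOAD_LIMIT)
```

Each state can then load the segment it needs from DynamoDB and S3 by index, so the payload grows with the number of segments rather than with their content.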