Skip to content

AI Analysis Pipeline

The Segment Analyzer is a ReAct (Reasoning + Acting) agent built on Strands SDK. It combines upstream preprocessing results (OCR, BDA, PDF text, Transcribe) with AI tools to iteratively analyze documents and videos. The agent autonomously generates questions, invokes tools to obtain answers, and synthesizes findings through multiple passes for in-depth analysis.


Analysis Pipeline

Upstream Preprocessing (parallel)
├─ PaddleOCR (SageMaker) ── text extraction (automatic)
├─ Bedrock Data Automation ── document structure analysis (optional)
├─ Format Parser ── PDF text layer extraction (automatic, PDF only)
└─ AWS Transcribe ── audio/video transcription (automatic)
Segment Builder (merge all results → S3 segment JSON)
Segment Analyzer (ReAct Agent)
├─ Documents/Images → Claude Sonnet 4.6 + image analysis tools
└─ Videos/Audio → Claude Sonnet 4.6 + Pegasus video analysis tool
Analysis Finalizer → SQS → LanceDB Writer
Document Summarizer (Claude Haiku 4.5)

Before the Segment Analyzer begins, multiple preprocessors extract information from the original files. For detailed preprocessing flows by file type, see Preprocessing Pipeline.

Extracts text from documents and images. See PaddleOCR on SageMaker for details.

Uses AWS BDA to analyze document structure (tables, layouts, etc.) in markdown format. Can be enabled/disabled in project settings.

Converts speech from audio/video files to text. Generates timestamped segment-level transcripts.

ItemValue
TargetPDF files only (application/pdf)
Librarypypdf
ActionPage-by-page text layer extraction
PurposeObtain accurate text from digital PDFs without OCR

Digital PDFs contain text layers that provide more accurate text than OCR. These results are provided as context to the Segment Analyzer alongside OCR/BDA results.


Merges all preprocessing results into a single segment JSON.

{
"segment_index": 0,
"segment_type": "PAGE",
"image_uri": "s3://.../preprocessed/page_0000.png",
"paddleocr": "Text extracted by OCR...",
"bda_indexer": "BDA markdown result...",
"format_parser": "PDF text layer...",
"ai_analysis": []
}
{
"segment_index": 0,
"segment_type": "CHAPTER",
"file_uri": "s3://.../video.mp4",
"start_timecode_smpte": "00:00:00:00",
"end_timecode_smpte": "00:05:30:00",
"transcribe": "Full transcript...",
"transcribe_segments": [
{"id": 0, "transcript": "...", "start_time": "0.0", "end_time": "5.2"}
],
"bda_indexer": "Chapter summary...",
"ai_analysis": []
}

The Segment Analyzer uses an iterative question-answer approach. After reviewing the context (OCR, BDA, PDF text, etc.), the agent autonomously generates targeted questions, invokes tools, collects responses, and synthesizes them. This process repeats multiple times for thorough analysis.

Agent reviews context
→ "I need to identify the document type"
→ analyze_image("What is the type and structure of this document?")
→ Claude Sonnet 4.6 examines the image and responds
→ "There's a table that needs detailed analysis"
→ analyze_image("Extract the table structure and data")
→ Claude Sonnet 4.6 responds
→ "I should check the technical drawing dimensions"
→ analyze_image("What are the dimensions and specifications shown?")
→ Claude Sonnet 4.6 responds
→ Synthesize all results into final analysis

The agent automatically adjusts analysis depth based on content complexity.

ComplexityTool CallsExamples
Minimal1Blank pages, simple text
Normal2-3Standard document pages
Deep4+Technical drawings, complex tables, diagrams

ModelPurpose
Claude Sonnet 4.6ReAct agent (reasoning + tool call decisions)
Claude Sonnet 4.6 (Vision)Image analysis tool (processes image + question internally)

Analyzes document images by asking specific questions. Uses Claude Sonnet 4.6’s Vision capability to directly examine images and provide answers.

@tool
def analyze_image(question: str) -> str:
"""Analyze the document image with a specific question.
Ask targeted questions about text content, visual elements,
diagrams, tables, or any other details you need to understand.
"""

Example questions:

  • “Describe the type and overall structure of this document”
  • “Extract all data contained in the table”
  • “Read the dimensions and specifications shown in the technical drawing”
  • “Analyze the data points and trends in the chart”

Corrects orientation when document images are rotated.

@tool
def rotate_image(degrees: int) -> str:
"""Rotate the current document image by specified degrees.
Use this tool when text appears upside down, sideways, or at an angle.
"""
Input: Image + OCR text + BDA result (optional) + PDF text (PDF only)
[Step 1] Orientation check → rotate_image if needed
[Step 2] Document overview → analyze_image("What type of document is this?")
[Step 3] Text extraction → analyze_image("Extract all text content")
[Step 4] Tables/charts → analyze_image("Analyze tables and charts")
[Step 5] Details → analyze_image("Extract technical specs and dimensions")
Final synthesis → Consolidate all tool responses into structured analysis

ModelPurpose
Claude Sonnet 4.6ReAct agent (reasoning + tool call decisions)
TwelveLabs Pegasus 1.2Video analysis tool (analyzes video directly internally)

Analyzes video segments by asking specific questions. The TwelveLabs Pegasus 1.2 model directly watches and analyzes videos from S3.

@tool
def analyze_video(question: str) -> str:
"""Analyze the video segment with a specific question.
Ask targeted questions about visual content, actions, scenes,
objects, people, text overlays, or any other details you need to understand.
"""

Example questions:

  • “What actions are being performed in this video?”
  • “Describe the main objects and people on screen”
  • “Read any text displayed on screen”
  • “What are the key events in this segment?”
{
"inputPrompt": "What actions are being performed in this video segment?",
"mediaSource": {
"s3Location": {
"uri": "s3://bucket/projects/p1/documents/d1/video.mp4",
"bucketOwner": "123456789012"
}
}
}

Pegasus analyzes video files directly from S3. Segments are split based on chapter information (timecodes) extracted by BDA.

Input: Video URI + Transcribe result + BDA chapter summary (optional) + timecodes
[Step 1] Content overview → analyze_video("Describe the main content")
[Step 2] Visual elements → analyze_video("What actions and objects are visible?")
[Step 3] Audio content → analyze_video("Summarize the spoken content")
[Step 4] Key events → analyze_video("What are the key events?")
Final synthesis → Combine Transcribe + Pegasus responses into timeline-based analysis

AspectDocuments/ImagesVideos/Audio
Segment TypePAGECHAPTER, VIDEO, AUDIO
Input DataImage URIVideo URI + timecodes
Preprocessing DataOCR + BDA (optional) + PDF textTranscribe + BDA (optional)
Agent ModelClaude Sonnet 4.6Claude Sonnet 4.6
Analysis Tool ModelClaude Sonnet 4.6 (Vision)TwelveLabs Pegasus 1.2
Toolsanalyze_image, rotate_imageanalyze_video
Analysis FocusText, tables, diagrams, layoutActions, scenes, speech, visual events

After all segment analyses are complete, the Document Summarizer generates an overall document summary.

ItemValue
ModelClaude Haiku 4.5 (claude-4-5-haiku)
InputAll segment AI analysis results (max 50,000 chars)
OutputStructured document summary
Summary structure:
1. Document overview (1-2 sentences)
2. Key findings (3-5 items)
3. Important data points
4. Conclusion

Why Claude Haiku: Deep analysis is already completed by the Segment Analyzer, so a fast and cost-effective model is sufficient for the summarization step.


Each tool call result is stored sequentially in the ai_analysis array.

{
"ai_analysis": [
{
"analysis_query": "What is the type and structure of this document?",
"content": "This is a technical specification document..."
},
{
"analysis_query": "Extract the table data",
"content": "The table contains the following items..."
}
]
}

After analysis completes, the Analysis Finalizer generates content_combined (all analysis results merged) and sends it to LanceDB Writer via SQS. LanceDB Writer performs vector embedding with Nova Embed and stores it in LanceDB.


The Segment Analyzer generates analysis results in the project’s configured language. Supports Korean, English, Japanese, and Chinese.


Previously analyzed segments can be reanalyzed with custom instructions. During reanalysis, user-specified instructions are applied instead of the project default prompt.