
Preprocessing Pipeline

When a document is uploaded, the Type Detection Lambda detects the file type and distributes the necessary preprocessing tasks asynchronously via SQS queues. Once preprocessing is complete, the Step Functions workflow merges results and passes them to the AI analysis stage.

S3 Upload
↓ [EventBridge]
Type Detection Lambda
├─ OCR Queue → PaddleOCR (Lambda/SageMaker)
├─ BDA Queue → Bedrock Data Automation
├─ Transcribe Queue → AWS Transcribe
├─ WebCrawler Queue → Bedrock Agent Core
└─ Workflow Queue → Step Functions
   ├─ Segment Prep (segment creation)
   ├─ Check Preprocess Status (polling)
   ├─ Format Parser (text extraction)
   ├─ Segment Builder (result merging)
   └─ → AI Analysis Pipeline

For details on the AI analysis pipeline (Segment Analyzer, Document Summarizer), see AI Analysis Pipeline.


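| File Type | Extensions | OCR | BDA | Transcribe | Format Parser | WebCrawler |
|---|---|---|---|---|---|---|
| PDF | .pdf | O | O | - | A | - |
| Image | .png .jpg .jpeg .gif .tiff .tif .webp | O | O | - | - | - |
| Video | .mp4 .mov .avi .mkv .webm | - | O | O | - | - |
| Audio | .mp3 .wav .flac .m4a | - | O | O | - | - |
| Word Document | .docx .doc | - | - | - | A | - |
| Presentation | .pptx .ppt | - | - | - | A | - |
| Spreadsheet | .xlsx .xls .csv | - | - | - | A | - |
| Text | .txt .md | - | - | - | A | - |
| Web | .webreq | - | - | - | - | A |
| CAD | .dxf | - | - | - | A | - |

  • A (Automatic): Enabled by default (runs automatically)
  • O (Optional): User enables per document at upload time
  • - : Not applicable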

OCR (use_ocr), BDA (use_bda), and Transcribe (use_transcribe) can all be selectively enabled per document at upload time.
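The routing implied by the matrix and the three flags can be sketched as follows. This is a minimal illustration, not the actual Type Detection Lambda: the function names, the short queue labels, and the `"other"` bucket are assumptions made for the example. Only the extensions, the flag names, and the O/A cells come from the table above.

```python
from pathlib import PurePosixPath

# Extension groups copied from the support matrix.
EXTENSIONS = {
    "pdf": {".pdf"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".tiff", ".tif", ".webp"},
    "video": {".mp4", ".mov", ".avi", ".mkv", ".webm"},
    "audio": {".mp3", ".wav", ".flac", ".m4a"},
    "web": {".webreq"},
}

# Optional preprocessors each file type may request (the "O" cells).
OPTIONAL = {
    "pdf": {"ocr", "bda"},
    "image": {"ocr", "bda"},
    "video": {"bda", "transcribe"},
    "audio": {"bda", "transcribe"},
}


def detect_type(key: str) -> str:
    """Classify an object key by its extension (case-insensitive)."""
    ext = PurePosixPath(key).suffix.lower()
    for file_type, exts in EXTENSIONS.items():
        if ext in exts:
            return file_type
    # Hypothetical bucket: office documents, text and DXF need no async queue.
    return "other"


def select_queues(key, use_ocr=False, use_bda=False, use_transcribe=False):
    """Return the queues a document would be sent to, workflow queue last."""
    file_type = detect_type(key)
    requested = {"ocr": use_ocr, "bda": use_bda, "transcribe": use_transcribe}
    allowed = OPTIONAL.get(file_type, set())
    # A flag is honored only when the matrix marks it as optional for the type.
    queues = [name for name, on in requested.items() if on and name in allowed]
    if file_type == "web":
        queues.append("webcrawler")  # automatic for .webreq files
    queues.append("workflow")        # every document enters the workflow
    return queues
```

For example, `select_queues("report.pdf", use_ocr=True, use_transcribe=True)` drops the Transcribe request because the matrix marks it as not applicable to PDFs.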


Extracts text from PDFs and images. Supports dual backends: Lambda (CPU) or SageMaker (GPU).

| Item | Value |
|---|---|
| Target | PDF, Images (excluding DXF) |
| Lambda Models | pp-ocrv5, pp-structurev3 (CPU) |
| SageMaker Models | paddleocr-vl (GPU) |
| Output | paddleocr/result.json (per-page text + block coordinates) |

OCR language is automatically mapped based on the project language setting (Korean → korean, Japanese → japan, etc.).
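The mapping can be pictured as a simple lookup. In the sketch below only the two pairs named above are taken from the text. The key format of the project language setting, the function name, and the fallback value are assumptions for illustration.

```python
# Only these two pairs are documented; a real table would list more languages.
OCR_LANGUAGE = {
    "Korean": "korean",
    "Japanese": "japan",
}

# Assumed fallback for unmapped languages (not stated in the documentation).
DEFAULT_OCR_LANGUAGE = "en"


def ocr_language(project_language: str) -> str:
    """Map a project language setting to a PaddleOCR language code."""
    return OCR_LANGUAGE.get(project_language, DEFAULT_OCR_LANGUAGE)
```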

For details, see PaddleOCR on SageMaker.

Uses AWS Bedrock Data Automation to analyze document structure (tables, layouts, images) and output the result as markdown. For videos, it performs chapter splitting and summarization.

| Item | Value |
|---|---|
| Target | PDF, Images, Video, Audio (excluding office documents/spreadsheets/DXF/web) |
| Activation | use_bda=true (selected at document upload) |
| Output | bda-output/ (markdown, images, metadata) |

Converts speech from audio and video files to text. Generates timestamped segment-level transcripts.

| Item | Value |
|---|---|
| Target | Video (MP4, MOV, AVI, MKV, WebM), Audio (MP3, WAV, FLAC, M4A) |
| Activation | use_transcribe=true (selected at document upload) |
| Output | transcribe/{workflow_id}-{timestamp}.json |

Extracts text using various libraries depending on file type. Runs synchronously within the Step Functions workflow.

| File Type | Library | Action |
|---|---|---|
| PDF | pypdf | Per-page text layer extraction (graphics stripping) |
| DOCX/DOC | LibreOffice → pypdf + pypdfium2 | Convert to PDF, then per-page text + PNG image generation |
| PPTX/PPT | python-pptx + LibreOffice → pypdfium2 | Per-slide text + PNG image generation |
| XLSX | openpyxl | Per-sheet markdown table conversion |
| XLS | xlrd | Per-sheet markdown table conversion |
| CSV | csv (stdlib) | Single-sheet markdown table conversion |
| TXT/MD | Direct read | Text chunking (15,000 chars, 500 char overlap) |
| DXF | ezdxf + matplotlib | Per-layout text extraction + PNG rendering |

Output: format-parser/result.json

A web crawling agent powered by Bedrock Agent Core that crawls URLs specified in .webreq files.

| Item | Value |
|---|---|
| Target | .webreq files |
| Input | JSON ({"url": "...", "instruction": "..."}) |
| Output | webcrawler/pages/page_XXXX.json (multi-page) or webcrawler/content.md (legacy) |
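Reading a .webreq file amounts to parsing a small JSON object. The sketch below assumes the two documented fields and adds a basic URL check. The validation rules, the function name, and the default for a missing `instruction` are illustrative assumptions rather than the crawler's actual behavior.

```python
import json


def parse_webreq(raw: str) -> dict:
    """Parse the body of a .webreq file into a url/instruction pair."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("webreq body must be a JSON object")
    url = data.get("url", "")
    # Assumed rule: only absolute http(s) URLs are accepted.
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("webreq file needs an http(s) 'url' field")
    # Assumed default: a missing instruction becomes an empty string.
    return {"url": url, "instruction": data.get("instruction", "")}
```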

PDF Upload
Type Detection
├─ OCR Queue → PaddleOCR → paddleocr/result.json (per-page text, optional)
├─ BDA Queue → BDA → bda-output/ (markdown, optional)
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Render each page to PNG (pypdfium2, 150 DPI)
   ├─ Check Preprocess Status: Poll for OCR/BDA completion
   ├─ Format Parser: Extract text layers with pypdf
   └─ Segment Builder: Merge OCR + BDA + Format Parser

| Item | Value |
|---|---|
| Segment Type | PAGE (one per page) |
| Segment Count | Number of PDF pages |
| Images | Per-page PNG (preprocessed/page_XXXX.png) |
| Automatic Preprocessing | Format Parser |
| Optional Preprocessing | OCR, BDA |

Image Upload (PNG, JPG, TIFF, etc.)
Type Detection
├─ OCR Queue → PaddleOCR → paddleocr/result.json (single text, optional)
├─ BDA Queue → BDA → bda-output/ (markdown, optional)
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Use original image (no copy)
   ├─ Check Preprocess Status: Poll for OCR/BDA completion
   └─ Segment Builder: Merge OCR + BDA

| Item | Value |
|---|---|
| Segment Type | PAGE (1) |
| Segment Count | 1 |
| Images | Original file URI used directly |
| Optional Preprocessing | OCR, BDA |

Video Upload
Type Detection
├─ BDA Queue → BDA → Chapter splitting + summaries (optional)
├─ Transcribe Queue → AWS Transcribe → Transcript (optional)
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 VIDEO segment
   ├─ Check Preprocess Status: Poll for BDA/Transcribe completion
   └─ Segment Builder: Merge BDA chapters + Transcribe

| Item | Value |
|---|---|
| Segment Type | VIDEO (without BDA) or CHAPTER (with BDA chapter splitting) |
| Segment Count | 1 (without BDA) or number of chapters (with BDA) |
| Images | None |
| Optional Preprocessing | BDA, Transcribe |

Audio Upload
Type Detection
├─ BDA Queue → BDA (optional)
├─ Transcribe Queue → AWS Transcribe → Transcript (optional)
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 AUDIO segment
   ├─ Check Preprocess Status: Poll for BDA/Transcribe completion
   └─ Segment Builder: Merge Transcribe results

| Item | Value |
|---|---|
| Segment Type | AUDIO |
| Segment Count | 1 |
| Images | None |
| Optional Preprocessing | BDA, Transcribe |

DOCX/DOC Upload
Type Detection
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 placeholder
   ├─ Format Parser: Convert to PDF via LibreOffice → per-page text + PNG
   └─ Segment Builder: Override segments with Format Parser results

| Item | Value |
|---|---|
| Segment Type | PAGE (one per page) |
| Segment Count | Number of PDF pages after LibreOffice conversion |
| Images | Per-page PNG (format-parser/slides/slide_XXXX.png, preprocessed/page_XXXX.png) |
| Automatic Preprocessing | Format Parser |
| Async Preprocessing | None (all preprocessing skipped) |

PPTX/PPT Upload
Type Detection
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 placeholder
   ├─ Format Parser: python-pptx text + LibreOffice PDF conversion → PNG
   └─ Segment Builder: Override segments with Format Parser results

| Item | Value |
|---|---|
| Segment Type | PAGE (one per slide) |
| Segment Count | Number of slides |
| Images | Per-slide PNG (format-parser/slides/slide_XXXX.png, preprocessed/page_XXXX.png) |
| Automatic Preprocessing | Format Parser |
| Text Extraction | Slide text + tables + speaker notes |

XLSX/XLS/CSV Upload
Type Detection
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 placeholder
   ├─ Format Parser: Per-sheet markdown table conversion
   └─ Segment Builder: Override segments with Format Parser results

| Item | Value |
|---|---|
| Segment Type | TEXT (one per sheet) |
| Segment Count | Number of sheets (1 for CSV) |
| Images | None |
| Automatic Preprocessing | Format Parser |
| Output Format | Markdown table (## Sheet: {name}\n\| col1 \| col2 \|...) |

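For the CSV case, the conversion to the `## Sheet: {name}` layout can be sketched with the standard library alone. The function names, the default sheet name, the separator row, and the escaping of pipe characters are assumptions for illustration. The actual Format Parser may differ in these details.

```python
import csv
import io


def rows_to_markdown(sheet_name: str, rows: list) -> str:
    """Render rows (first row = header) as a markdown table under a sheet heading."""
    if not rows:
        return f"## Sheet: {sheet_name}\n"
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [
        f"## Sheet: {sheet_name}",
        fmt(header),
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += [fmt(row) for row in body]
    return "\n".join(lines) + "\n"


def csv_to_markdown(text: str, sheet_name: str = "Sheet1") -> str:
    """Parse CSV text and render it as a single-sheet markdown table."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows_to_markdown(sheet_name, rows)
```

The same `rows_to_markdown` helper could be fed rows read from openpyxl or xlrd, one call per sheet.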
TXT/MD Upload
Type Detection
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 placeholder
   ├─ Format Parser: Read text → chunk splitting
   └─ Segment Builder: Override segments with Format Parser results

| Item | Value |
|---|---|
| Segment Type | TEXT (one per chunk) |
| Segment Count | Determined by text length |
| Images | None |
| Automatic Preprocessing | Format Parser |
| Chunking Config | 15,000 chars per chunk, 500 char overlap, sentence boundary preferred |

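One way to realize this chunking configuration is a sliding window that backs up to the nearest sentence end. The sketch below uses the documented 15,000 and 500 character values. The set of sentence terminators, the rule of only searching the second half of the window, and the function name are assumptions, so the actual splitter may cut at different positions.

```python
def chunk_text(text: str, size: int = 15_000, overlap: int = 500) -> list:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    # Keeping overlap below half the window guarantees forward progress.
    if not 0 <= overlap < size // 2:
        raise ValueError("overlap must be non-negative and below size // 2")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Look for the last sentence end in the second half of the window.
            cut = max(
                text.rfind(mark, start + size // 2, end)
                for mark in (". ", "! ", "? ", "\n")
            )
            if cut != -1:
                end = cut + 1  # keep the terminator in the current chunk
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # the next chunk repeats the last `overlap` chars
    return chunks
```

Each chunk after the first begins with the final 500 characters of its predecessor, so a sentence that straddles a cut is still readable in full in one of the two segments.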
.webreq File Upload ({"url": "...", "instruction": "..."})
Type Detection
├─ WebCrawler Queue → Bedrock Agent Core → Web crawling
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 WEB placeholder
   ├─ Check Preprocess Status: Poll for WebCrawler completion
   └─ Segment Builder: Override segments with WebCrawler results

| Item | Value |
|---|---|
| Segment Type | WEB (one per page) |
| Segment Count | Number of crawled pages |
| Images | None |
| Automatic Preprocessing | WebCrawler |
| Output Fields | webcrawler_content, source_url, page_title |

DXF File Upload
Type Detection
└─ Workflow Queue → Step Functions
   ├─ Segment Prep: Create 1 placeholder
   ├─ Format Parser: ezdxf text extraction + matplotlib PNG rendering
   └─ Segment Builder: Override segments with Format Parser results

| Item | Value |
|---|---|
| Segment Type | PAGE (one per layout) |
| Segment Count | Number of DXF layouts (Model Space + Paper Space) |
| Images | Per-layout PNG (format-parser/slides/layout_XXXX.png) |
| Automatic Preprocessing | Format Parser |
| Extracted Entities | TEXT, MTEXT, ATTRIB, DIMENSION + layer/block metadata |

The Segment Builder merges all preprocessing results into a single segment JSON.

1. Base structure: Segment Prep (preprocessed/metadata.json)
2. Merge OCR: paddleocr/result.json → paddleocr, paddleocr_blocks
3. Merge BDA: bda-output/ → bda_indexer, bda_image_uri
4. Merge Format Parser: format-parser/result.json → format_parser, image_uri
5. Merge Transcribe: transcribe/*.json → transcribe, transcribe_segments
6. Merge WebCrawler: webcrawler/pages/*.json → webcrawler_content, source_url
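The merge order above can be sketched as a loop over (source, target fields) pairs applied to a copy of the base segment. The field names are the ones listed in the steps. The shape of the `results` argument (one dict per preprocessor, already keyed by target field) and the function name are assumptions made to keep the example self-contained.

```python
# (source, target fields) in the documented merge order.
MERGE_PLAN = [
    ("ocr", ("paddleocr", "paddleocr_blocks")),
    ("bda", ("bda_indexer", "bda_image_uri")),
    ("format_parser", ("format_parser", "image_uri")),
    ("transcribe", ("transcribe", "transcribe_segments")),
    ("webcrawler", ("webcrawler_content", "source_url")),
]


def build_segment(base: dict, results: dict) -> dict:
    """Merge per-preprocessor results for one segment into a single dict."""
    segment = dict(base)  # start from the Segment Prep structure, leave it untouched
    for source, fields in MERGE_PLAN:
        payload = results.get(source) or {}  # missing or skipped preprocessors add nothing
        for field in fields:
            if payload.get(field) is not None:
                segment[field] = payload[field]
    return segment
```

Because skipped preprocessors contribute no keys, a segment only carries the fields that were actually produced for its file type.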

Segment count is determined from different sources depending on file type:

| File Type | Segment Count Source |
|---|---|
| PDF | Segment Prep (number of PDF pages) |
| Image | Segment Prep (always 1) |
| DOCX/DOC, PPTX/PPT, DXF | Format Parser (pages/slides/layouts after conversion) |
| XLSX/XLS/CSV | Format Parser (number of sheets) |
| TXT/MD | Format Parser (number of chunks) |
| Video | Segment Prep (1) or BDA (number of chapters) |
| Audio | Segment Prep (always 1) |
| Web | WebCrawler (number of crawled pages) |

Segment Prep creates a placeholder, then Segment Builder adjusts the segment count based on actual results and updates total_segments.
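The table can be read as a small decision function. In the sketch below the file type labels, the parameter names, and the fallback to the placeholder count when a result is missing are all assumptions. Only the choice of source per file type comes from the table.

```python
PREP_TYPES = {"pdf", "image", "audio"}
PARSER_TYPES = {"docx", "doc", "pptx", "ppt", "dxf", "xlsx", "xls", "csv", "txt", "md"}


def resolve_total_segments(file_type, prep_count, parser_count=None,
                           bda_chapters=None, crawled_pages=None):
    """Pick the value for total_segments from the source that owns it."""
    if file_type in PREP_TYPES:
        return prep_count
    if file_type == "video":
        # BDA chapters replace the single VIDEO segment when available.
        return bda_chapters or prep_count
    if file_type == "web":
        return crawled_pages or prep_count
    if file_type in PARSER_TYPES:
        # Assumed fallback: keep the placeholder count if the parser returned nothing.
        return parser_count or prep_count
    raise ValueError(f"unsupported file type: {file_type}")
```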


s3://bucket/projects/{project_id}/documents/{document_id}/
├─ {original_file}                        # Original uploaded file
├─ preprocessed/
│  ├─ metadata.json                       # Segment Prep metadata
│  ├─ page_0000.png                       # Page images (PDF, DOCX, PPTX, DXF)
│  ├─ page_0001.png
│  └─ ...
├─ paddleocr/
│  └─ result.json                         # OCR results (per-page text + blocks)
├─ bda-output/
│  └─ {job_id}/
│     ├─ job_metadata.json                # BDA job metadata
│     ├─ standard_output/
│     │  ├─ 0/result.json                 # BDA analysis results (markdown)
│     │  └─ 0/assets/                     # BDA extracted images
│     └─ ...
├─ format-parser/
│  ├─ result.json                         # Text extraction results
│  └─ slides/                             # PPTX/DOCX/DXF images
│     ├─ slide_0000.png
│     └─ ...
├─ transcribe/
│  └─ {workflow_id}-{timestamp}.json      # Transcribe results
├─ webcrawler/
│  ├─ metadata.json                       # Crawling metadata
│  └─ pages/
│     ├─ page_0000.json                   # Crawled page content
│     └─ ...
└─ analysis/
   ├─ segment_0000.json                   # Merged segment data
   ├─ segment_0001.json
   └─ ...
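Object keys in this layout follow a fixed prefix plus a zero-padded, zero-based index. The helpers below are a hypothetical sketch of how such keys could be built. The function names are assumptions, and the four-digit padding is inferred from names such as page_0000.png.

```python
def document_prefix(project_id: str, document_id: str) -> str:
    """Key prefix shared by every artifact of one document."""
    return f"projects/{project_id}/documents/{document_id}/"


def page_image_key(prefix: str, index: int) -> str:
    """Key of a rendered page image, e.g. preprocessed/page_0003.png."""
    return f"{prefix}preprocessed/page_{index:04d}.png"


def crawled_page_key(prefix: str, index: int) -> str:
    """Key of one crawled page produced by the WebCrawler."""
    return f"{prefix}webcrawler/pages/page_{index:04d}.json"


def segment_key(prefix: str, index: int) -> str:
    """Key of the merged segment JSON written by the Segment Builder."""
    return f"{prefix}analysis/segment_{index:04d}.json"
```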

Asynchronous Preprocessing Status Management


Preprocessing status is managed in the preprocess field of the DynamoDB workflow record.

{
  "preprocess": {
    "ocr": {"required": true, "status": "completed"},
    "bda": {"required": false, "status": "skipped"},
    "transcribe": {"required": false, "status": "skipped"},
    "webcrawler": {"required": false, "status": "skipped"}
  }
}

The CheckPreprocessStatus Lambda in the Step Functions workflow periodically polls to verify all required preprocessing is complete. Once all required preprocessors reach completed or skipped status, the workflow proceeds to the next stage (Format Parser → Segment Builder).
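The completion check described here reduces to a predicate over the preprocess field. The sketch below assumes the record shape shown in the JSON example. The function names are hypothetical, and any status other than completed or skipped is simply treated as "not done yet".

```python
DONE_STATUSES = {"completed", "skipped"}


def pending_preprocessors(preprocess: dict) -> list:
    """Names of required preprocessors that have not finished yet."""
    return [
        name
        for name, entry in preprocess.items()
        if entry.get("required") and entry.get("status") not in DONE_STATUSES
    ]


def preprocess_complete(preprocess: dict) -> bool:
    """True once every required preprocessor is completed or skipped."""
    return not pending_preprocessors(preprocess)
```

A polling step could call `preprocess_complete` on each pass and let the state machine wait and retry while it returns False.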