Vector Database

This project uses LanceDB as the vector database instead of Amazon OpenSearch Service. LanceDB is an open-source, serverless vector database that stores data directly on S3, eliminating the need for dedicated cluster infrastructure. Combined with Lindera and ICU4X for multilingual tokenization, it enables hybrid search (vector + full-text) across documents in multiple languages.

| Language | Semantic Search (Vector) | Full-Text Search (FTS) | Search Mode |
|---|---|---|---|
| Korean | O | O | Hybrid (Vector + FTS) |
| Japanese | O | O | Hybrid (Vector + FTS) |
| Chinese | O | O | Hybrid (Vector + FTS) |
| English and others | O | O | Hybrid (Vector + FTS) |

Lindera provides dictionary-based tokenization for CJK languages (Korean, Japanese, Chinese), while ICU4X handles word segmentation for other languages. This enables accurate FTS keyword extraction across all supported languages.

This project is a PoC/prototype, and cost efficiency is a key factor.

| Factor | OpenSearch Service | LanceDB (S3) |
|---|---|---|
| Infrastructure | Dedicated cluster (minimum 2-3 nodes) | No cluster needed (serverless) |
| Idle cost | Charges even when unused | S3 storage only |
| Setup complexity | Domain config, VPC, access policies | S3 bucket + DynamoDB lock table |
| Scaling | Node scaling required | Scales with S3 automatically |
| Estimated monthly cost (PoC) | $200-500+ (t3.medium x2 minimum) | $1-10 (S3 + DDB on-demand) |

```
Write Path:
  Analysis Finalizer → SQS (Write Queue) → LanceDB Writer Lambda
    → LanceDB Service Lambda (Rust)
       ├─ Toka Lambda (Rust): keyword extraction (Lindera / ICU4X)
       ├─ Bedrock Nova: vector embedding (1024d)
       └─ LanceDB: store to S3 Express One Zone

Read Path:
  MCP Search Tool Lambda
    → LanceDB Service Lambda (Rust): hybrid search (vector + FTS)
    → Bedrock Claude Haiku: summarize search results

Delete Path:
  Backend API (project deletion)
    → LanceDB Service Lambda: drop_table
```

```
S3 Express One Zone (Directory Bucket)
└─ idp-v2/
   ├─ {project_id_1}/          ← one LanceDB table per project
   │  ├─ data/
   │  └─ indices/
   └─ {project_id_2}/
      ├─ data/
      └─ indices/

DynamoDB (Lock Table)
  PK: base_uri | SK: version
  └─ Manages concurrent access to LanceDB tables
```

The LanceDB Service Lambda is the core vector DB service. It is implemented in Rust using cargo-lambda-cdk for optimized memory usage and cold-start performance.

| Item | Value |
|---|---|
| Function Name | idp-v2-lance-service |
| Runtime | Rust (cargo-lambda-cdk) |
| Architecture | ARM64 |
| Memory | 1024 MB |
| Timeout | 5 min |

Supported Actions:

| Action | Description |
|---|---|
| add_record | Add a QA record (keyword extraction + embedding + store) |
| delete_record | Delete by QA ID or segment ID |
| get_segments_by_document_id | Retrieve all segments for a document |
| get_by_segment_ids | Retrieve content by segment ID list (used by Graph MCP) |
| hybrid_search | Hybrid search (vector + FTS, query_type='hybrid') |
| list_tables | List all project tables |
| count | Count records in a project table |
| delete_by_workflow | Delete all records for a workflow |
| drop_table | Drop an entire project table |
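Callers reach these actions through a single action-dispatch interface on the service Lambda. A minimal sketch of how a client might build such a request payload — the envelope field names (`action`, `payload`) are illustrative assumptions, not the project's confirmed wire format:

```python
import json

SUPPORTED_ACTIONS = {
    "add_record", "delete_record", "get_segments_by_document_id",
    "get_by_segment_ids", "hybrid_search", "list_tables",
    "count", "delete_by_workflow", "drop_table",
}

def build_request(action: str, **params) -> str:
    """Build a JSON request for the LanceDB Service Lambda.

    NOTE: the {"action": ..., "payload": ...} envelope is a hypothetical
    illustration of an action-dispatch API; the real shape is
    project-specific.
    """
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return json.dumps({"action": action, "payload": params})

# Example: a hybrid search scoped to one project table
req = build_request("hybrid_search", project_id="proj-123",
                    query="문서 분석 결과 조회", top_k=5)
```

In practice this JSON string would be passed as the `Payload` of a direct Lambda invocation of `idp-v2-lance-service`.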

Why Rust Lambda:

Rust provides significantly lower memory usage and faster cold starts compared to the previous Docker Python Lambda approach, which is critical for a serverless vector DB service that may scale to zero.

Toka is a multilingual tokenizer service used by the LanceDB Service for FTS keyword extraction.

| Item | Value |
|---|---|
| Function Name | idp-v2-toka |
| Runtime | Rust (cargo-lambda-cdk) |
| Architecture | ARM64 |
| Memory | 1024 MB |
| Tokenizers | Lindera (CJK dictionary-based), ICU4X (Unicode word segmentation) |

The LanceDB Writer Lambda is an SQS consumer that receives write requests from the analysis pipeline and delegates them to the LanceDB Service.

| Item | Value |
|---|---|
| Function Name | idp-v2-lancedb-writer |
| Runtime | Python 3.14 |
| Memory | 256 MB |
| Timeout | 5 min |
| Trigger | SQS (idp-v2-lancedb-write-queue) |
| Concurrency | 1 (sequential processing) |

Concurrency is set to 1 to prevent concurrent write conflicts on LanceDB tables.
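The writer's handler shape can be sketched as follows. This is an illustrative sketch, not the project's actual handler: the message body fields are assumptions, and the Lambda-invoke call is injected as a plain function so the sequential delegation pattern is visible on its own. With reserved concurrency set to 1, SQS batches are processed one at a time, so no two writes race on the same LanceDB table.

```python
import json

def handle_sqs_event(event: dict, invoke_service) -> list:
    """Process SQS write-queue records sequentially, delegating each one
    to the LanceDB Service (hypothetical sketch).

    `invoke_service` stands in for a boto3 Lambda invoke of
    idp-v2-lance-service; injecting it keeps this sketch testable.
    """
    results = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        # Each queue message is assumed to carry one add_record request.
        results.append(invoke_service({"action": "add_record", "payload": body}))
    return results
```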

The Agent’s MCP tool invokes the LanceDB Service Lambda directly to perform document retrieval during AI chat.

```
User Query → Bedrock Agent Core → MCP Gateway
  → Search Tool Lambda → LanceDB Service Lambda (hybrid_search)
  → Bedrock Claude Haiku: summarize search results → Response
```
| Item | Value |
|---|---|
| Stack | McpStack |
| Runtime | Node.js 22.x (ARM64) |
| Timeout | 30s |
| Environment | LANCEDB_FUNCTION_ARN (via SSM) |

Each QA analysis result is stored as a record. Since a single segment (page) can have multiple QAs, records are created per QA unit:

```python
from datetime import datetime
from typing import Optional

from lancedb.pydantic import LanceModel, Vector

class DocumentRecord(LanceModel):
    workflow_id: str            # Workflow ID
    document_id: str            # Document ID
    segment_id: str             # "{workflow_id}_{segment_index:04d}"
    qa_id: str                  # "{workflow_id}_{segment_index:04d}_{qa_index:02d}"
    segment_index: int          # Segment page/chapter number
    qa_index: int               # QA number (starting from 0)
    question: str               # AI-generated question
    content: str                # content_combined (SourceField for embedding)
    vector: Vector(1024)        # Bedrock Nova embedding (VectorField)
    keywords: str               # Tokenized keywords (FTS indexed)
    file_uri: str               # Original file S3 URI
    file_type: str              # MIME type
    image_uri: Optional[str]    # Segment image S3 URI
    created_at: datetime        # Timestamp
```
  • One table per project: Table name = project_id
  • Per-QA storage: Multiple QAs per segment are stored as independent records (uniquely identified by qa_id)
  • content: Merged text from all preprocessing (OCR + BDA + PDF text + AI analysis)
  • vector: Auto-generated by LanceDB’s embedding function (Bedrock Nova, 1024 dimensions)
  • keywords: Lindera/ICU4X-extracted tokens for FTS index. Lindera handles CJK languages with dictionary-based tokenization, ICU4X handles other languages with Unicode word segmentation
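The composite IDs follow the patterns shown in the schema comments above; a small helper makes the construction explicit (the helper functions themselves are illustrative, not part of the codebase):

```python
def make_segment_id(workflow_id: str, segment_index: int) -> str:
    """Segment ID pattern: "{workflow_id}_{segment_index:04d}"."""
    return f"{workflow_id}_{segment_index:04d}"

def make_qa_id(workflow_id: str, segment_index: int, qa_index: int) -> str:
    """QA ID pattern: "{workflow_id}_{segment_index:04d}_{qa_index:02d}".

    Because multiple QAs share a segment, the qa_index suffix is what
    makes each per-QA record unique.
    """
    return f"{make_segment_id(workflow_id, segment_index)}_{qa_index:02d}"

# e.g. make_qa_id("wf-abc", 3, 0) → "wf-abc_0003_00"
```

Zero-padding keeps lexicographic order equal to numeric order, so prefix scans over a segment's QAs return them in sequence.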

Toka is a Rust-based multilingual tokenizer Lambda that combines Lindera and ICU4X for accurate keyword extraction across languages.

LanceDB’s built-in FTS tokenizer does not support CJK languages well. CJK languages (Korean, Japanese, Chinese) are agglutinative or lack word boundaries, so simple space-based tokenization is insufficient. For example:

```
Korean input:   "인공지능 기반 문서 분석 시스템을 구축했습니다."
Toka output:    ["인공지능", "기반", "분석", "시스템", "구축", "했", "."]

Japanese input: "文書分析システムを構築しました"
Toka output:    ["文書", "分析", "システム", "構築", "し"]
```
| Language | Tokenizer | Method |
|---|---|---|
| Korean | Lindera (lindera-ko-dic) | Dictionary-based morphological analysis |
| Japanese | Lindera (lindera-ipadic) | Dictionary-based morphological analysis |
| Chinese | Lindera (lindera-cc-cedict) | Dictionary-based segmentation |
| Others | ICU4X | Unicode word segmentation |

All searches are processed by the LanceDB Service Lambda. It uses LanceDB’s native query_type='hybrid' to combine vector search and full-text search.

```
Search Query: "문서 분석 결과 조회"
├─ [1] Toka keyword extraction (via LanceDB Service Lambda)
│      → ["문서", "분석", "결과", "조회"]
├─ [2] LanceDB native hybrid search
│      → table.search(query=keywords, query_type='hybrid')
│      → Vector search (Nova embedding) + FTS auto-merged
│      → Top-K results with _relevance_score
└─ [3] Result summarization (MCP Search Tool Lambda)
       → Bedrock Claude Haiku generates answer from search results
```
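LanceDB merges the vector and FTS hit lists internally when `query_type='hybrid'` is used; conceptually this resembles reciprocal rank fusion, LanceDB's default hybrid reranker. A standalone sketch of the idea (not the library's internal code):

```python
def rrf_merge(vector_hits: list[str], fts_hits: list[str],
              k: int = 60, top_k: int = 5) -> list[str]:
    """Reciprocal rank fusion over two ranked lists of record IDs.

    Each ID scores sum(1 / (k + rank)) across the lists it appears in,
    so records ranked highly by either the vector search or the FTS
    search surface first, and records in both get a boost.
    """
    scores: dict[str, float] = {}
    for hits in (vector_hits, fts_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In the real service the fused `_relevance_score` comes back directly from `table.search(..., query_type='hybrid')`; this helper only illustrates why a record matching both signals outranks one matching a single signal.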

```typescript
// StorageStack
const expressStorage = new CfnDirectoryBucket(this, 'LanceDbExpressStorage', {
  bucketName: `idp-v2-lancedb--use1-az4--x-s3`,
  dataRedundancy: 'SingleAvailabilityZone',
  locationName: 'use1-az4',
});
```

S3 Express One Zone provides single-digit millisecond latency, optimized for frequent read/write patterns like vector search operations.

```typescript
// StorageStack
const lockTable = new Table(this, 'LanceDbLockTable', {
  partitionKey: { name: 'base_uri', type: AttributeType.STRING },
  sortKey: { name: 'version', type: AttributeType.NUMBER },
  billingMode: BillingMode.PAY_PER_REQUEST,
});
```

Manages distributed locking when multiple Lambda functions access the same dataset concurrently.
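LanceDB's S3 support exposes this DynamoDB commit store through the `s3+ddb://` URI scheme, with the lock table passed as a `ddbTableName` query parameter (the table's `base_uri`/`version` key schema above is what that commit store expects). A sketch of building the connection URI; the helper function is illustrative, and the bucket/table names would come from the SSM parameters listed below:

```python
def lancedb_uri(bucket: str, prefix: str, lock_table: str) -> str:
    """Build an s3+ddb:// connection URI so LanceDB coordinates table
    commits through DynamoDB, preventing concurrent writers from
    corrupting a table's version history."""
    return f"s3+ddb://{bucket}/{prefix}?ddbTableName={lock_table}"

# Usage sketch (requires AWS credentials, so not run here):
# import lancedb
# db = lancedb.connect(lancedb_uri(bucket_name, "idp-v2", lock_table_name))
```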

| Key | Description |
|---|---|
| /idp-v2/lancedb/lock/table-name | DynamoDB lock table name |
| /idp-v2/lancedb/express/bucket-name | S3 Express bucket name |
| /idp-v2/lancedb/express/az-id | S3 Express availability zone ID |
| /idp-v2/lancedb/function-arn | LanceDB Service Lambda function ARN |

The following diagram shows all components that depend on LanceDB:

```mermaid
graph TB
    subgraph Write["Write Path"]
        Writer["LanceDB Writer"]
        QA["QA Regenerator"]
    end

    subgraph Read["Read Path"]
        MCP["MCP Search Tool<br/>(Agent)"]
    end

    subgraph Delete["Delete Path"]
        Backend["Backend API<br/>(Project Deletion)"]
    end

    subgraph Core["Core Service"]
        Service["LanceDB Service<br/>(Rust Lambda)"]
        Toka["Toka<br/>(Rust Lambda)"]
    end

    subgraph Storage["Storage Layer"]
        S3["S3 Express One Zone"]
        DDB["DynamoDB Lock Table"]
    end

    Writer -->|invoke| Service
    QA -->|invoke| Service
    MCP -->|invoke<br/>hybrid_search| Service
    Backend -->|invoke<br/>drop_table| Service

    Service --> S3 & DDB
    Service -->|invoke| Toka

    style Storage fill:#fff3e0,stroke:#ff9900
    style Core fill:#e8f5e9,stroke:#2ea043
    style Write fill:#fce4ec,stroke:#e91e63
    style Read fill:#e3f2fd,stroke:#1976d2
    style Delete fill:#f3e5f5,stroke:#7b1fa2
```
| Component | Stack | Access Type | Description |
|---|---|---|---|
| LanceDB Service | LanceServiceStack | Read/Write | Core DB service (Rust Lambda) |
| Toka | LanceServiceStack | Tokenization | Multilingual tokenizer (Rust Lambda) |
| LanceDB Writer | WorkflowStack | Write (via Service) | SQS consumer, delegates to Service |
| Analysis Finalizer | WorkflowStack | Write (via SQS/Service) | Sends segments to write queue, deletes on reanalysis |
| QA Regenerator | WorkflowStack | Write (via Service) | Updates Q&A segments |
| MCP Search Tool | McpStack | Read (direct Service invoke) | Agent tool for document retrieval |
| Backend API | ApplicationStack | Delete (via Service) | Invokes drop_table on project deletion |

If migrating to Amazon OpenSearch Service for production, the following components need modification:

| Component | Current (LanceDB) | Target (OpenSearch) | Scope |
|---|---|---|---|
| LanceDB Service Lambda | Rust Lambda + LanceDB | OpenSearch client (CRUD + search) | Replace entirely |
| LanceDB Writer Lambda | SQS → invoke LanceDB Service | SQS → write to OpenSearch index | Replace invoke target |
| MCP Search Tool | Lambda invoke → LanceDB Service | Lambda invoke → OpenSearch search | Replace invoke target |
| StorageStack | S3 Express + DDB lock table | OpenSearch domain (VPC) | Replace resources |
The following components require no changes:

| Component | Reason |
|---|---|
| Analysis Finalizer | Only sends messages to SQS (queue interface unchanged) |
| Frontend | No direct DB access |
| Step Functions Workflow | No direct LanceDB dependency |
```
Phase 1: Replace Storage Layer
  - Create OpenSearch domain in VPC
  - Replace StorageStack resources (remove S3 Express + DDB lock)
  - Configure Nori analyzer for Korean tokenization (replaces Toka/Lindera)

Phase 2: Replace Write Path
  - Modify LanceDB Service → OpenSearch indexing service
  - Update document schema (OpenSearch index mapping)
  - Add OpenSearch neural ingest pipeline for embeddings

Phase 3: Replace Read Path
  - Update MCP Search Tool Lambda invoke target to OpenSearch search service
  - Remove Toka dependency (Nori handles Korean tokenization)

Phase 4: Remove LanceDB Dependencies
  - Remove Rust Lambda functions (LanceDB Service, Toka)
  - Remove S3 Express bucket and DDB lock table
```
| Item | Notes |
|---|---|
| Korean tokenization | OpenSearch includes the Nori analyzer for Korean, so Toka/Lindera can be removed |
| Vector search | OpenSearch k-NN plugin (HNSW/IVF) replaces LanceDB vector search |
| Embedding | OpenSearch neural search can auto-embed via ingest pipelines, or use pre-computed embeddings |
| Cost | OpenSearch requires a running cluster; minimum 2-node cluster for HA |
| SQS interface | The SQS write queue pattern can be preserved; only the consumer logic changes |