
Vector Database

This project uses LanceDB as the vector database instead of Amazon OpenSearch Service. LanceDB is an open-source, serverless vector database that stores data directly on S3, eliminating the need for dedicated cluster infrastructure. Combined with Kiwi, a Korean morphological analyzer, it enables hybrid search (vector + full-text) for Korean documents.

| Language | Semantic Search (Vector) | Full-Text Search (FTS) | Search Mode |
| --- | --- | --- | --- |
| Korean | O | O | Hybrid (Vector + FTS) |
| English and others | O | X | Semantic search only |

Kiwi is a Korean-specific morphological analyzer, so FTS keywords are accurately extracted only from Korean documents. Documents in English and other languages are searched via semantic search using Bedrock Nova vector embeddings. Since vector search operates on meaning regardless of language, documents in all languages can be retrieved.
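The per-language routing above can be sketched as a simple heuristic. Note the Hangul-regex check below is an illustrative assumption, not the project's actual language-detection logic:

```python
import re

# Any Hangul syllable signals a Korean query (assumption: the project
# may detect language differently, e.g. per-document metadata).
HANGUL = re.compile(r"[가-힣]")

def search_mode(text: str) -> str:
    """Pick the search mode per the table above."""
    return "hybrid" if HANGUL.search(text) else "semantic"
```

A Korean query such as `"문서 분석 시스템"` routes to hybrid search, while `"document analysis"` falls back to semantic-only.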

This project is a PoC/prototype, so cost efficiency is a key factor in the choice of database:

| Factor | OpenSearch Service | LanceDB (S3) |
| --- | --- | --- |
| Infrastructure | Dedicated cluster (minimum 2-3 nodes) | No cluster needed (serverless) |
| Idle cost | Charges even when unused | S3 storage only |
| Setup complexity | Domain config, VPC, access policies | S3 bucket + DynamoDB lock table |
| Scaling | Node scaling required | Scales with S3 automatically |
| Estimated monthly cost (PoC) | $200-500+ (t3.medium x2 minimum) | $1-10 (S3 + DDB on-demand) |

```
Write Path:
  Analysis Finalizer → SQS (Write Queue) → LanceDB Writer Lambda
    → LanceDB Service Lambda (Container)
        ├─ Kiwi: keyword extraction
        ├─ Bedrock Nova: vector embedding (1024d)
        └─ LanceDB: store to S3 Express One Zone

Read Path:
  MCP Search Tool Lambda
    → LanceDB Service Lambda (Container): hybrid search (vector + FTS)
    → Bedrock Claude Haiku: summarize search results

Delete Path:
  Backend API (project deletion)
    → LanceDB Service Lambda: drop_table
```

```
S3 Express One Zone (Directory Bucket)
└─ idp-v2/
   ├─ {project_id_1}/   ← one LanceDB table per project
   │  ├─ data/
   │  └─ indices/
   └─ {project_id_2}/
      ├─ data/
      └─ indices/

DynamoDB (Lock Table)
  PK: base_uri | SK: version
  └─ Manages concurrent access to LanceDB tables
```

1. LanceDB Service Lambda (Container Image)


The core vector DB service. It uses a Docker container image because lancedb and kiwipiepy together exceed Lambda's 250 MB (unzipped) deployment package limit.

| Item | Value |
| --- | --- |
| Function Name | idp-v2-lancedb-service |
| Runtime | Python 3.12 (Container Image) |
| Memory | 2048 MB |
| Timeout | 5 min |
| Base Image | public.ecr.aws/lambda/python:3.12 |
| Dependencies | `lancedb>=0.26.0`, `kiwipiepy>=0.22.0`, `boto3` |

Supported Actions:

| Action | Description |
| --- | --- |
| add_record | Add a QA record (keyword extraction + embedding + store) |
| delete_record | Delete by QA ID or segment ID |
| get_segments | Retrieve all segments for a workflow |
| get_by_segment_ids | Retrieve content by segment ID list (used by Graph MCP) |
| hybrid_search | Hybrid search (vector + FTS, query_type='hybrid') |
| list_tables | List all project tables |
| count | Count records in a project table |
| delete_by_workflow | Delete all records for a workflow |
| drop_table | Drop an entire project table |

Why Container Lambda:

Kiwi's Korean language model files and LanceDB's native binaries total several hundred MB, exceeding Lambda's 250 MB unzipped package limit. A Docker container image (up to 10 GB) removes this constraint.

2. LanceDB Writer Lambda

An SQS consumer that receives write requests from the analysis pipeline and delegates them to the LanceDB Service.

| Item | Value |
| --- | --- |
| Function Name | idp-v2-lancedb-writer |
| Runtime | Python 3.14 |
| Memory | 256 MB |
| Timeout | 5 min |
| Trigger | SQS (idp-v2-lancedb-write-queue) |
| Concurrency | 1 (sequential processing) |

Concurrency is set to 1 to prevent concurrent write conflicts on LanceDB tables.
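The writer's delegation step might look like the following sketch. The `action`/`payload` envelope and the synchronous invocation type are assumptions; the Lambda client is passed in as a parameter (in production a `boto3.client("lambda")`) so the sketch stays self-contained:

```python
import json

def handle_sqs_batch(event, lambda_client, service_arn):
    """Forward each SQS message body to the LanceDB Service Lambda.

    With the function's concurrency fixed at 1, records are processed
    strictly in order, avoiding concurrent writes to the same table.
    """
    for record in event["Records"]:
        body = json.loads(record["body"])
        lambda_client.invoke(
            FunctionName=service_arn,
            InvocationType="RequestResponse",  # wait for each write to finish
            Payload=json.dumps({"action": "add_record", "payload": body}).encode(),
        )
    return len(event["Records"])
```

Synchronous (`RequestResponse`) invocation matters here: firing writes asynchronously would reintroduce the concurrency the queue setting is designed to prevent.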

3. MCP Search Tool Lambda

The Agent's MCP tool invokes the LanceDB Service Lambda directly to perform document retrieval during AI chat.

```
User Query → Bedrock Agent Core → MCP Gateway
  → Search Tool Lambda → LanceDB Service Lambda (hybrid_search)
  → Bedrock Claude Haiku: summarize search results → Response
```

| Item | Value |
| --- | --- |
| Stack | McpStack |
| Runtime | Node.js 22.x (ARM64) |
| Timeout | 30s |
| Environment | LANCEDB_FUNCTION_ARN (via SSM) |

Each QA analysis result is stored as a record. Since a single segment (page) can have multiple QAs, records are created per QA unit:

```python
from datetime import datetime
from typing import Optional

from lancedb.pydantic import LanceModel, Vector


class DocumentRecord(LanceModel):
    workflow_id: str            # Workflow ID
    document_id: str            # Document ID
    segment_id: str             # "{workflow_id}_{segment_index:04d}"
    qa_id: str                  # "{workflow_id}_{segment_index:04d}_{qa_index:02d}"
    segment_index: int          # Segment page/chapter number
    qa_index: int               # QA number (starting from 0)
    question: str               # AI-generated question
    content: str                # content_combined (SourceField for embedding)
    vector: Vector(1024)        # Bedrock Nova embedding (VectorField)
    keywords: str               # Kiwi-extracted keywords (FTS indexed)
    file_uri: str               # Original file S3 URI
    file_type: str              # MIME type
    image_uri: Optional[str]    # Segment image S3 URI
    created_at: datetime        # Timestamp
```
  • One table per project: Table name = project_id
  • Per-QA storage: Multiple QAs per segment are stored as independent records (uniquely identified by qa_id)
  • content: Merged text from all preprocessing (OCR + BDA + PDF text + AI analysis)
  • vector: Auto-generated by LanceDB’s embedding function (Bedrock Nova, 1024 dimensions)
  • keywords: Kiwi-extracted Korean morphemes for FTS index. Non-Korean languages use space-based tokenization
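The ID formats in the schema comments can be expressed as small helpers (a sketch; the helper names are illustrative, and the project builds these strings wherever records are created):

```python
def make_segment_id(workflow_id: str, segment_index: int) -> str:
    # "{workflow_id}_{segment_index:04d}"
    return f"{workflow_id}_{segment_index:04d}"

def make_qa_id(workflow_id: str, segment_index: int, qa_index: int) -> str:
    # "{workflow_id}_{segment_index:04d}_{qa_index:02d}" — uniquely
    # identifies each QA record within a segment.
    return f"{make_segment_id(workflow_id, segment_index)}_{qa_index:02d}"
```

For example, the first QA on segment 3 of workflow `wf-abc` gets the ID `wf-abc_0003_00`, so a prefix scan on `segment_id` retrieves all QAs for that segment.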

Kiwi (Korean Intelligent Word Identifier) is an open-source Korean morphological analyzer written in C++ with Python bindings (kiwipiepy).

LanceDB's built-in FTS tokenizer does not support Korean. Korean is an agglutinative language: particles and endings attach directly to word stems, so meaningful tokens cannot be recovered by splitting on spaces alone. For example:

```
Input: "인공지능 기반 문서 분석 시스템을 구축했습니다"
Kiwi:  "인공 지능 기반 문서 분석 시스템 구축" (nouns extracted)
```

Without morphological analysis, searching for “시스템” would miss documents containing “시스템을” or “시스템에서”.

| POS Tag | Description | Example |
| --- | --- | --- |
| NNG | Common noun | 문서, 분석, 시스템 |
| NNP | Proper noun | AWS, Bedrock |
| NR | Numeral | 하나, 둘 |
| NP | Pronoun | 이것, 그것 |
| SL | Foreign word | Lambda, Python |
| SN | Number | 1024, 3.5 |
| SH | Chinese character | |
| XSN | Suffix | Attached to previous token |

Filters:

  • Single-character Korean stop words: 것, 수, 등, 때, 곳
  • Single-character foreign words, numbers, and Chinese characters are preserved
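The extraction rules above can be sketched as a filter over kiwipiepy tokens. In production the pairs would come from `Kiwi().tokenize()`, whose `Token` objects expose `.form` and `.tag`; the sketch takes plain `(form, tag)` pairs so it stays self-contained:

```python
# POS tags kept for the FTS keyword field (from the table above).
KEEP_TAGS = {"NNG", "NNP", "NR", "NP", "SL", "SN", "SH", "XSN"}
# Single-character Korean stop words to drop; single-character foreign
# words (SL), numbers (SN), and Chinese characters (SH) are kept.
STOP_WORDS = {"것", "수", "등", "때", "곳"}

def extract_keywords(tokens):
    """tokens: iterable of (form, tag) pairs, e.g. from Kiwi().tokenize()."""
    return " ".join(
        form for form, tag in tokens
        if tag in KEEP_TAGS and form not in STOP_WORDS
    )
```

For example, tokens for "인공지능 기반 ... 것" reduce to the space-joined noun string that is stored in the `keywords` column and FTS-indexed.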

All searches are processed by the LanceDB Service Lambda. It uses LanceDB’s native query_type='hybrid' to combine vector search and full-text search.

```
Search Query: "문서 분석 결과 조회"
  ├─ [1] Kiwi keyword extraction (LanceDB Service Lambda)
  │      → "문서 분석 결과 조회"
  ├─ [2] LanceDB native hybrid search
  │      → table.search(query=keywords, query_type='hybrid')
  │      → Vector search (Nova embedding) + FTS auto-merged
  │      → Top-K results with _relevance_score
  └─ [3] Result summarization (MCP Search Tool Lambda)
         → Bedrock Claude Haiku generates answer from search results
```

```ts
// StorageStack
const expressStorage = new CfnDirectoryBucket(this, 'LanceDbExpressStorage', {
  bucketName: `idp-v2-lancedb--use1-az4--x-s3`,
  dataRedundancy: 'SingleAvailabilityZone',
  locationName: 'use1-az4',
});
```

S3 Express One Zone provides single-digit millisecond latency, optimized for frequent read/write patterns like vector search operations.

```ts
// StorageStack
const lockTable = new Table(this, 'LanceDbLockTable', {
  partitionKey: { name: 'base_uri', type: AttributeType.STRING },
  sortKey: { name: 'version', type: AttributeType.NUMBER },
  billingMode: BillingMode.PAY_PER_REQUEST,
});
```

Manages distributed locking when multiple Lambda functions access the same dataset concurrently.
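LanceDB supports an `s3+ddb://` connection scheme in which S3 holds the table data and a DynamoDB table acts as the commit store, serializing writes from concurrent clients. A sketch of building that connection URI (the bucket, prefix, and table names below are placeholders):

```python
def lancedb_uri(bucket: str, prefix: str, ddb_table: str) -> str:
    # S3 stores the Lance data files; the DynamoDB table serializes
    # commits so concurrent Lambdas cannot corrupt a table version.
    return f"s3+ddb://{bucket}/{prefix}?ddbTableName={ddb_table}"

# In the service Lambda (sketch; requires the lancedb package):
#   db = lancedb.connect(lancedb_uri(bucket_name, "idp-v2", lock_table_name))
#   table = db.open_table(project_id)
```

The bucket and lock-table names would be resolved at runtime from the SSM parameters listed below.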

| Key | Description |
| --- | --- |
| /idp-v2/lancedb/lock/table-name | DynamoDB lock table name |
| /idp-v2/lancedb/express/bucket-name | S3 Express bucket name |
| /idp-v2/lancedb/express/az-id | S3 Express availability zone ID |
| /idp-v2/lancedb/function-arn | LanceDB Service Lambda function ARN |

The following diagram shows all components that depend on LanceDB:

```mermaid
graph TB
    subgraph Write["Write Path"]
        Writer["LanceDB Writer"]
        QA["QA Regenerator"]
    end

    subgraph Read["Read Path"]
        MCP["MCP Search Tool<br/>(Agent)"]
    end

    subgraph Delete["Delete Path"]
        Backend["Backend API<br/>(Project Deletion)"]
    end

    subgraph Core["Core Service"]
        Service["LanceDB Service<br/>(Container Lambda)"]
    end

    subgraph Storage["Storage Layer"]
        S3["S3 Express One Zone"]
        DDB["DynamoDB Lock Table"]
    end

    Writer -->|invoke| Service
    QA -->|invoke| Service
    MCP -->|invoke<br/>hybrid_search| Service
    Backend -->|invoke<br/>drop_table| Service

    Service --> S3 & DDB

    style Storage fill:#fff3e0,stroke:#ff9900
    style Core fill:#e8f5e9,stroke:#2ea043
    style Write fill:#fce4ec,stroke:#e91e63
    style Read fill:#e3f2fd,stroke:#1976d2
    style Delete fill:#f3e5f5,stroke:#7b1fa2
```
| Component | Stack | Access Type | Description |
| --- | --- | --- | --- |
| LanceDB Service | WorkflowStack | Read/Write | Core DB service (Container Lambda) |
| LanceDB Writer | WorkflowStack | Write (via Service) | SQS consumer, delegates to Service |
| Analysis Finalizer | WorkflowStack | Write (via SQS/Service) | Sends segments to write queue, deletes on reanalysis |
| QA Regenerator | WorkflowStack | Write (via Service) | Updates Q&A segments |
| MCP Search Tool | McpStack | Read (direct Service invoke) | Agent tool for document retrieval |
| Backend API | ApplicationStack | Delete (via Service) | Invokes drop_table on project deletion |

If migrating to Amazon OpenSearch Service for production, the following components need modification:

| Component | Current (LanceDB) | Target (OpenSearch) | Scope |
| --- | --- | --- | --- |
| LanceDB Service Lambda | Container Lambda + LanceDB | OpenSearch client (CRUD + search) | Replace entirely |
| LanceDB Writer Lambda | SQS → invoke LanceDB Service | SQS → write to OpenSearch index | Replace invoke target |
| MCP Search Tool | Lambda invoke → LanceDB Service | Lambda invoke → OpenSearch search | Replace invoke target |
| StorageStack | S3 Express + DDB lock table | OpenSearch domain (VPC) | Replace resources |

The following components require no changes:

| Component | Reason |
| --- | --- |
| Analysis Finalizer | Only sends messages to SQS (queue interface unchanged) |
| Frontend | No direct DB access |
| Step Functions Workflow | No direct LanceDB dependency |
Phase 1: Replace Storage Layer

- Create OpenSearch domain in VPC
- Replace StorageStack resources (remove S3 Express + DDB lock)
- Configure Nori analyzer for Korean tokenization

Phase 2: Replace Write Path

- Modify LanceDB Service → OpenSearch indexing service
- Update document schema (OpenSearch index mapping)
- Add OpenSearch neural ingest pipeline for embeddings

Phase 3: Replace Read Path

- Update MCP Search Tool Lambda invoke target to OpenSearch search service
- Remove Kiwi dependency (Nori handles Korean tokenization)

Phase 4: Remove LanceDB Dependencies

- Remove lancedb, kiwipiepy from requirements
- Remove Container Lambda (standard Lambda may suffice)
- Remove S3 Express bucket and DDB lock table
| Item | Notes |
| --- | --- |
| Korean tokenization | OpenSearch includes the Nori analyzer for Korean; Kiwi can be removed |
| Vector search | OpenSearch k-NN plugin (HNSW/IVF) replaces LanceDB vector search |
| Embedding | OpenSearch neural search can auto-embed via ingest pipelines, or use pre-computed embeddings |
| Cost | OpenSearch requires a running cluster; minimum 2-node cluster for HA |
| SQS interface | The SQS write queue pattern can be preserved; only the consumer logic changes |
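As a rough sketch of the migration target, an OpenSearch index body could combine the Nori analyzer with a `knn_vector` field. Field names mirror the LanceDB schema above; all analyzer and k-NN settings here are assumptions to be tuned, not a tested configuration:

```python
# Hypothetical OpenSearch index body for the migrated document records.
INDEX_BODY = {
    "settings": {
        "index": {"knn": True},  # enable the k-NN plugin for this index
        "analysis": {
            "analyzer": {
                # Nori replaces Kiwi for Korean morphological analysis
                "korean": {"type": "custom", "tokenizer": "nori_tokenizer"}
            }
        },
    },
    "mappings": {
        "properties": {
            "qa_id": {"type": "keyword"},
            "question": {"type": "text", "analyzer": "korean"},
            "content": {"type": "text", "analyzer": "korean"},
            # Same dimensionality as the Bedrock Nova embeddings (1024d)
            "vector": {"type": "knn_vector", "dimension": 1024},
        }
    },
}
```

Because Nori tokenizes `content` directly at index time, the separate `keywords` column from the LanceDB schema would no longer be needed.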