Database Overview
Background
Section titled “Background”This project analyzes documents page by page. Each page is separated into independent segments, processed through AI analysis, and the results are stored as vector embeddings. With this structure, there are problems that vector search alone cannot solve.
Disconnected Pages
Section titled “Disconnected Pages”Imagine a 100-page engineering drawing. If a single drawing spans 2 pages, vector search treats each page independently. Searching for “Valve V-101 specifications” might return page 42, but the continuation on page 43 could be missed because its content differs enough to drop below the similarity threshold.
With vector search only:
Search: "Valve V-101 specifications" → Page 42 (V-101 drawing part 1) ✓ Found → Page 43 (V-101 drawing part 2) ✗ Missed — different content → Page 78 (V-101 maintenance record) ✗ Missed — low similarityThis is manageable with a few dozen documents, but with hundreds or thousands of documents, vector similarity alone cannot reliably find all related pages.
Limitations of Graph-Only Search
Section titled “Limitations of Graph-Only Search”We also considered searching with only a graph database. By extracting entities (V-101, valve, maintenance, etc.) and relationships, we could discover all related pages by following entity connections.
With graph only:
Search: "Valve V-101 specifications" → Entity "V-101" traversal → Page 42 (MENTIONED_IN) ✓ → Page 43 (MENTIONED_IN) ✓ → Page 78 (RELATES_TO → "maintenance" → MENTIONED_IN) ✓Connected pages are found well, but there were issues:
- No semantic search: Weak against natural language queries like “valve specs” that don’t match exact entity names
- Depends on extraction quality: If the LLM misses an entity, the graph has no connection
- No FTS: Graph databases don’t support full-text search
Vector + Graph Combined
Section titled “Vector + Graph Combined”We combine the strengths of both approaches:
Combined search:
[Step 1] Vector search (LanceDB) Search: "Valve V-101 specifications" → Page 42 (score: 0.92) ✓ → Page 15 (score: 0.71) ✓ ← Found via semantic similarity
[Step 2] Graph traversal (Neptune) Starting point: QA IDs from page 42 → Entity "V-101" traversal → Page 43 (V-101 → MENTIONED_IN) ✓ ← Continuation page discovered → Page 78 (V-101 → RELATES_TO → "maintenance" → MENTIONED_IN) ✓ → Page 42 excluded (deduplication)Vector search finds semantically related pages, and graph traversal follows entity connections to supplement pages that vector search missed.
Search Architecture
Section titled “Search Architecture”The agent uses both databases sequentially via MCP tools.
User question: "Tell me about Valve V-101 specs and maintenance history" │ ▼[1] MCP Search Tool (summarize) │ → LanceDB hybrid_search (vector + FTS) │ → Haiku summarization │ → Result: Page 42, Page 15 (with qa_ids) │ ▼[2] MCP Graph Tool (graph_search) │ → Input: qa_ids (QA IDs from vector search) │ → Neptune: QA ID → Analysis → Entity → RELATES_TO → Entity → Analysis → Segment │ → LanceDB: Fetch content for discovered segments (get_by_segment_ids) │ → Haiku summarization │ → Result: Page 43, Page 78 (pages not in vector search) │ ▼[3] Agent synthesizes both results into final responseThe Key Link: Entity
Section titled “The Key Link: Entity”The key that connects both databases is the Entity. Related pages are discovered through Analysis nodes that share the same entity.
- QA ID (
{workflow_id}_{segment_index}_{qa_index}): An identifier for the analysis result of each document segment - LanceDB: Stores QA analysis results with
qa_id, returnsqa_idin search results - Neptune: Uses the same
qa_idas Analysis nodeid, with Entities connected viaMENTIONED_IN
LanceDB (qa_id: wf_abc_0042_00) ↕ Same IDNeptune (Analysis {id: wf_abc_0042_00}) ── MENTIONED_IN → Entity ("V-101", EQUIPMENT) ← MENTIONED_IN ── Analysis {id: wf_abc_0078_00} ↕ Same ID LanceDB (qa_id: wf_abc_0078_00)The same entity “V-101” is mentioned in multiple Analysis nodes, so related pages are discovered simply by sharing the same entity.
Role Division
Section titled “Role Division”| LanceDB (Vector DB) | Neptune (Graph DB) | |
|---|---|---|
| Stores | QA analysis text + vector embeddings | Entities, relationships, document structure |
| Search method | Hybrid (vector + FTS) | Graph traversal (openCypher) |
| Strength | Natural language queries, semantic similarity | Entity connections, relationship traversal |
| Weakness | Cannot recognize cross-page connections | No semantic search, no FTS |
| Search order | Step 1 (starting point) | Step 2 (expansion) |
| Storage | S3 Express One Zone | Neptune Serverless (VPC) |
| Cost model | S3 pricing only (serverless) | NCU-based (min 1, max 2.5) |
Data Flow
Section titled “Data Flow”When a document is uploaded, data is built into both databases simultaneously.
Step Functions Workflow │ ├─ Map (parallel per segment, max 30) │ ├─ SegmentAnalyzer: AI analysis (Claude Sonnet 4.5) │ └─ AnalysisFinalizer: │ ├─ SQS → LanceDB Writer → LanceDB Service │ │ → Keyword extraction (Kiwi) + Vector embedding (Nova) + Store │ └─ Entity/relationship extraction (Strands Agent) → Save to S3 │ ├─ GraphBuilder: │ └─ Collect entities from S3 → Deduplicate → GraphService → Store in Neptune │ └─ DocumentSummarizer: Generate document summaryVector embedding and entity extraction run in parallel per segment within AnalysisFinalizer, enabling efficient processing of large documents. GraphBuilder runs after the Map completes to collect all entities, deduplicate them, and store them in Neptune.
Sub-pages
Section titled “Sub-pages”- Vector Database — LanceDB, S3 Express One Zone, Kiwi Korean morphological analysis, hybrid search
- Graph Database — Neptune DB Serverless, openCypher, entity extraction, graph traversal
- DynamoDB — One Table Design, workflow state management, segment metadata