Graph Database

This project uses Amazon Neptune DB Serverless as its graph database. Core entities extracted and normalized during document analysis are stored in a knowledge graph, enabling entity-connection-based traversal that is difficult to achieve with vector search alone.

| Aspect | Vector Search (LanceDB) | Graph Traversal (Neptune) | Keyword Graph (LanceDB + Neptune) |
|---|---|---|---|
| Search method | Semantic similarity | Entity relationship traversal | Keyword embedding similarity + graph traversal |
| Strength | Finding “similar content” | Discovering “connected content” from search results | Discovering pages by concept keyword |
| Input | User query | QA IDs from search results | Keyword string |
| Data | content_combined + vector embeddings | Core entity nodes + MENTIONED_IN edges | Graph keywords (name + embedding) + Neptune entities |

These search methods are used together by the agent via Search MCP tools:

  • search___summarize — hybrid search on documents
  • search___graph_traverse — graph traversal from search result QA IDs
  • search___graph_keyword — keyword similarity search via LanceDB graph keywords

Step Functions Workflow
→ Distributed Map (max 30 concurrency)
    → SegmentAnalyzer
    → Parallel:
        ├─ AnalysisFinalizer (SQS → LanceDB)
        ├─ PageDescriptionGenerator (Haiku)
        └─ EntityExtractor (Haiku) → S3 (graph_entities)
→ GraphBuilder Lambda:
    1. Collect entities from all segments (S3)
    2. Deduplicate (exact name match)
    3. Normalize → Core entities (Sonnet via Strands structured output)
    4. Store core entity names in LanceDB (add_graph_keywords)
    5. Save work files to S3 (entities.json, analyses.json)
→ GraphBatchSender (Map) → GraphService Lambda (VPC) → Neptune

Graph Traverse (Read Path — from search results)

Agent → MCP Gateway → Search MCP Lambda (graph_traverse)
→ GraphService Lambda (VPC): search_graph (entity traversal from qa_ids)
→ LanceDB Service Lambda: Segment content retrieval
→ Bedrock Claude Haiku: Result summarization

Keyword Graph Search (Read Path — from keyword)

Agent → MCP Gateway → Search MCP Lambda (graph_keyword)
→ LanceDB Service: search_graph_keywords (embedding similarity)
→ SHA256 hash entity names → Neptune entity ~id
→ GraphService Lambda (VPC): raw_query (find connected qa_ids)
→ LanceDB Service: get_by_qa_ids (content retrieval)
→ Bedrock Claude Haiku: Result summarization

Frontend → Backend API → GraphService Lambda (VPC)
→ get_entity_graph: Project-wide entity graph
→ get_document_graph: Document-level detailed graph

The node and relationship structure stored in Neptune is described below. Queries are written in openCypher.

| Node | Description | Key Properties |
|---|---|---|
| Document | Document | id, project_id, workflow_id, file_name, file_type |
| Segment | Document page/section | id, project_id, workflow_id, document_id, segment_index |
| Analysis | QA analysis result | id, project_id, workflow_id, document_id, segment_index, qa_index, question |
| Entity | Core entity (normalized) | id, project_id, name |

| Relationship | Direction | Description |
|---|---|---|
| BELONGS_TO | Segment → Document | Segment belongs to document |
| BELONGS_TO | Analysis → Segment | Analysis belongs to segment |
| NEXT | Segment → Segment | Page order (next segment) |
| MENTIONED_IN | Entity → Analysis | Entity mentioned in a specific QA (confidence, context) |
| RELATED_TO | Document → Document | Manual document-to-document link (reason, label) |

Neptune does not support user-defined secondary indexes, so this design uses the node’s ~id property as its direct lookup key. Each node type’s ID is a meaningful composite key, which allows fast lookups without additional indexes.

| Node | ID Format | Example |
|---|---|---|
| Document | {document_id} | doc_abc123 |
| Segment | {workflow_id}_{segment_index:04d} | wf_abc123_0042 |
| Analysis | {workflow_id}_{segment_index:04d}_{qa_index:02d} | wf_abc123_0042_00 |
| Entity | First 16 chars of SHA256({project_id}:{name}) | a1b2c3d4e5f60718 |

  • Segment/Analysis: Composed of workflow ID + segment index (+ QA index), so the parent relationship can be inferred from the ID alone
  • Entity: Uses a hash of project ID + normalized name, so the same entity extracted from multiple segments is naturally merged (MERGE) into a single node

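The ID formats in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the entity hash assumes UTF-8 input and a hexadecimal digest, which the text above does not state explicitly.

```python
import hashlib


def segment_id(workflow_id: str, segment_index: int) -> str:
    """Segment ~id: workflow ID plus a zero-padded 4-digit segment index."""
    return f"{workflow_id}_{segment_index:04d}"


def analysis_id(workflow_id: str, segment_index: int, qa_index: int) -> str:
    """Analysis ~id: the parent segment ID plus a zero-padded 2-digit QA index."""
    return f"{segment_id(workflow_id, segment_index)}_{qa_index:02d}"


def parent_segment_id(analysis_node_id: str) -> str:
    """Recover the parent Segment ~id by dropping the trailing QA index."""
    return analysis_node_id.rsplit("_", 1)[0]


def entity_id(project_id: str, name: str) -> str:
    """Entity ~id: first 16 chars of SHA256 over "{project_id}:{name}".

    Assumption: UTF-8 encoding and a hex digest.
    """
    digest = hashlib.sha256(f"{project_id}:{name}".encode("utf-8")).hexdigest()
    return digest[:16]
```

Because `entity_id` is deterministic, two segments that both mention the same normalized name in the same project produce the same ~id, which is what lets a MERGE collapse them into one node.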
Document (report.pdf)
├── Segment (page 0) ──NEXT──→ Segment (page 1) ──NEXT──→ ...
│   ├── Analysis (QA 0) ←──MENTIONED_IN── Entity ("Prototyping")
│   └── Analysis (QA 1) ←──MENTIONED_IN── Entity ("AWS")
└── Segment (page 1)
    ├── Analysis (QA 0) ←──MENTIONED_IN── Entity ("Prototyping")
    └── Analysis (QA 0) ←──MENTIONED_IN── Entity ("Innovation Flywheel")

Core entity “Prototyping” connects pages 0 and 1 because it was normalized from “Prototype” (page 0) and “AWS Prototyping” (page 1).


| Item | Value |
|---|---|
| Cluster ID | idp-v2-neptune |
| Engine Version | 1.4.1.0 |
| Instance Class | db.serverless |
| Capacity | min: 1 NCU, max: 2.5 NCU |
| Subnet | Private Isolated |
| Authentication | IAM Auth (SigV4) |
| Port | 8182 |
| Query Language | openCypher |

GraphService is a gateway Lambda that communicates directly with Neptune. It is deployed inside the VPC (Private Isolated Subnet) so that it can reach the Neptune endpoint.

| Item | Value |
|---|---|
| Function Name | idp-v2-graph-service |
| Runtime | Python 3.14 |
| Timeout | 5 min |
| VPC | Private Isolated Subnet |
| Authentication | IAM SigV4 (neptune-db) |

Supported Actions:

| Category | Action | Description |
|---|---|---|
| Write | add_segment_links | Create Document + Segment nodes, BELONGS_TO/NEXT relationships |
| Write | add_analyses | Create Analysis nodes, BELONGS_TO to Segment |
| Write | add_entities | MERGE Entity nodes, MENTIONED_IN to Analysis |
| Write | link_documents | Create bidirectional RELATED_TO between Documents |
| Write | unlink_documents | Delete RELATED_TO between Documents |
| Write | delete_analysis | Delete Analysis node + cleanup orphaned Entities |
| Write | delete_by_workflow | Delete all graph data for a workflow |
| Read | search_graph | QA ID-based graph traversal (Entity → MENTIONED_IN → related Segments) |
| Read | raw_query | Execute arbitrary openCypher query with parameters |
| Read | get_entity_graph | Project-wide entity graph query (visualization) |
| Read | get_document_graph | Document-level detailed graph query (visualization) |
| Read | get_linked_documents | Query document link relationships |
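One common way to expose a set of named actions from a single Lambda is a small dispatch table. The sketch below is an assumption about how such routing could look, not the GraphService implementation: the `action` and `params` payload keys, the `ACTIONS` registry, the response shape, and the placeholder handler body are all hypothetical.

```python
from typing import Any, Callable

# Hypothetical registry: action name -> handler function.
ACTIONS: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {}


def action(name: str):
    """Decorator that registers a handler under an action name."""
    def register(fn):
        ACTIONS[name] = fn
        return fn
    return register


@action("link_documents")
def link_documents(params: dict[str, Any]) -> dict[str, Any]:
    # Placeholder body: a real handler would run an openCypher statement
    # against Neptune. Here it only echoes the two document IDs.
    return {"linked": [params["source_id"], params["target_id"]]}


def handler(event: dict[str, Any], context: Any = None) -> dict[str, Any]:
    """Route an invoke payload to the handler registered for its action."""
    name = event.get("action")
    if name not in ACTIONS:
        return {"ok": False, "error": f"unknown action: {name}"}
    return {"ok": True, "result": ACTIONS[name](event.get("params", {}))}
```

A caller outside the VPC would then send a payload such as `{"action": "link_documents", "params": {...}}` through a Lambda invoke.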

3. EntityExtractor Lambda (Step Functions)

Runs in parallel with AnalysisFinalizer and PageDescriptionGenerator inside the Distributed Map.

| Item | Value |
|---|---|
| Function Name | idp-v2-entity-extractor |
| Runtime | Python 3.14 |
| Timeout | 5 min |
| Model | Bedrock Haiku 4.5 |
| Output | Structured (Pydantic model) |
| Stack | WorkflowStack |

Features:

  • Extracts entities from AI analysis results using structured output
  • Supports test mode (mode: "test") that returns entities without saving to S3 (for prompt tuning)
  • Saves graph_entities to S3 segment data

GraphBuilder runs after the Distributed Map completes and before DocumentSummarizer.

| Item | Value |
|---|---|
| Function Name | idp-v2-graph-builder |
| Runtime | Python 3.14 |
| Timeout | 15 min |
| Stack | WorkflowStack |

Processing Flow:

  1. Create Document + Segment structure: create document/segment nodes and BELONGS_TO, NEXT relationships in Neptune
  2. Load segment analysis results from S3: collect analysis data from all segments
  3. Create Analysis nodes: batch create Analysis nodes per QA pair
  4. Collect entities: gather graph_entities already extracted per segment by EntityExtractor
  5. Deduplicate: merge identical entities by name (case-insensitive)
  6. Normalize → Core entities: the LLM groups related entities loosely (notation variants, morphological variants, conceptual containment). One entity can belong to multiple core entity groups. Core entities absorb members’ mentioned_in lists.
  7. Store core entity names in LanceDB: add_graph_keywords for cross-document keyword search
  8. Save work files to S3: entities.json and analyses.json for GraphBatchSender
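The deduplication step (step 5) can be illustrated with a short sketch. It assumes each entity is a dict with `name` and `mentioned_in` keys, as in the JSON example on this page. The choice to keep the first spelling seen and to strip surrounding whitespace is an assumption, as is the function name.

```python
def deduplicate(entities: list[dict]) -> list[dict]:
    """Merge entities whose names match case-insensitively.

    The first spelling seen is kept as the display name, and the
    mentioned_in lists of later duplicates are appended to it.
    """
    merged: dict[str, dict] = {}
    for entity in entities:
        key = entity["name"].strip().casefold()
        if key not in merged:
            merged[key] = {"name": entity["name"], "mentioned_in": []}
        merged[key]["mentioned_in"].extend(entity["mentioned_in"])
    return list(merged.values())
```

Merging the `mentioned_in` lists at this stage means the later normalization step only has to reason about distinct names.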

The graph search tools used by the AI agent are integrated into the Search MCP Lambda.

| Item | Value |
|---|---|
| Stack | McpStack |
| Runtime | Node.js 22.x (ARM64) |
| Timeout | 5 min |

Tools:

| MCP Tool | Description |
|---|---|
| graph_traverse | Traverse the graph using search result QA IDs as starting points to discover related pages |
| graph_keyword | Search core entities by keyword similarity in LanceDB, then find connected pages via Neptune |

graph_traverse Flow:

1. Receive qa_ids from search___summarize results
2. QA ID → Analysis node → MENTIONED_IN ← Entity node (all entities, no limit)
3. Entity → MENTIONED_IN → Other Analysis → Segment (single UNWIND query)
4. Exclude source segments, filter by document_id
5. Fetch segment content from LanceDB (get_by_segment_ids)
6. Summarize with Bedrock Claude Haiku
7. Filter sources to only Haiku-cited segments
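Steps 2 to 4 of the flow above can be mimicked on plain dictionaries to show the traversal logic without a database. This is an in-memory stand-in, not the openCypher query used by search_graph: `mentioned_in` maps an entity to the Analysis IDs it is mentioned in, `segment_document` maps a Segment ID to its document, and all names and sample IDs are hypothetical.

```python
def parent_segment(analysis_id: str) -> str:
    """Derive the Segment ID from an Analysis ID by dropping the QA index."""
    return analysis_id.rsplit("_", 1)[0]


def related_segments(qa_ids, mentioned_in, segment_document, document_id=None):
    """Return segments reachable from qa_ids through shared entities."""
    seeds = set(qa_ids)
    sources = {parent_segment(qa) for qa in qa_ids}

    # Step 2: every entity mentioned in at least one starting analysis.
    entities = [e for e, analyses in mentioned_in.items() if seeds & set(analyses)]

    # Step 3: follow MENTIONED_IN to the other analyses, then up to their segments.
    found = {parent_segment(a) for e in entities for a in mentioned_in[e]}

    # Step 4: drop the source segments and optionally restrict to one document.
    found -= sources
    if document_id is not None:
        found = {s for s in found if segment_document.get(s) == document_id}
    return sorted(found)
```

The resulting segment IDs would then be passed to the content retrieval and summarization steps (5 to 7).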

graph_keyword Flow:

1. Receive keyword query
2. Search LanceDB graph_keywords by embedding similarity (top 3)
3. Hash matched entity names → Neptune entity ~id (SHA256)
4. Query Neptune: Entity → MENTIONED_IN → Analysis (get qa_ids)
5. Fetch content from LanceDB (get_by_qa_ids)
6. Summarize with Bedrock Claude Haiku
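Steps 2 and 3 of the keyword flow can be sketched as a top-k similarity ranking followed by hashing. The example below uses an in-memory list in place of the LanceDB graph_keywords table and assumes cosine similarity, UTF-8 input, and a hex digest. None of these details are confirmed by the text above, and the function names are hypothetical.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_keywords(query_vec, keywords, k=3):
    """Rank (name, embedding) pairs by similarity and keep the top k names."""
    ranked = sorted(keywords, key=lambda kw: cosine(query_vec, kw[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def entity_ids(project_id, names):
    """Hash matched names into Neptune entity ~id values (first 16 hex chars)."""
    return [
        hashlib.sha256(f"{project_id}:{name}".encode("utf-8")).hexdigest()[:16]
        for name in names
    ]
```

Because the entity ~id is derived from the name, the matched keywords translate directly into node IDs, so no name lookup in Neptune is needed before step 4.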

Entity extraction runs in the EntityExtractor Lambda, parallelized per segment alongside AnalysisFinalizer and PageDescriptionGenerator. Since it runs inside Step Functions’ Distributed Map, up to 30 segments extract entities concurrently.

Uses Strands Agent with Pydantic structured output for reliable JSON responses.

| Item | Value |
|---|---|
| Model | Bedrock Haiku 4.5 |
| Framework | Strands SDK (Agent + structured_output_model) |
| Input | Segment AI analysis results + page description |
| Output | entities[] (Pydantic EntityExtractionResult) |
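The shape of the extraction output can be sketched as typed records. The example below uses standard-library dataclasses as a stand-in for the Pydantic model so that it runs without extra dependencies. The field names follow the JSON example on this page, while the `Mention` and `Entity` class names and the `from_json` helper are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Mention:
    segment_index: int
    qa_index: int
    context: str


@dataclass
class Entity:
    name: str
    mentioned_in: list[Mention]


@dataclass
class EntityExtractionResult:
    entities: list[Entity]

    @classmethod
    def from_json(cls, raw: str) -> "EntityExtractionResult":
        """Build the typed result from a JSON string of the documented shape."""
        data = json.loads(raw)
        return cls(
            entities=[
                Entity(
                    name=e["name"],
                    mentioned_in=[Mention(**m) for m in e["mentioned_in"]],
                )
                for e in data["entities"]
            ]
        )
```

A typed model like this is what makes the structured output reliable: a response with a missing or misspelled field fails at parse time instead of propagating into the graph.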

After all segments are processed, GraphBuilder normalizes entities using an LLM:

| Item | Value |
|---|---|
| Model | Bedrock Sonnet 4.6 (1M context) |
| Framework | Strands SDK (Agent + structured_output_model) |
| Input | All deduplicated entities with contexts + existing LanceDB keywords |
| Output | Core entity groups (NormalizationResult) |

Normalization rules:

  • Group liberally — over-connecting is better than missing connections
  • Notation variants (spacing, punctuation, abbreviations)
  • Morphological variants (singular/plural, verb/noun forms)
  • Conceptual containment (a specific term contains a broader concept)
  • Cross-language variants
  • One entity can belong to multiple core groups
  • Core entity name uses member name or well-known standard term

{
  "entities": [
    {
      "name": "AWS Prototyping",
      "mentioned_in": [
        {
          "segment_index": 1,
          "qa_index": 0,
          "context": "AWS prototyping program and methodology"
        }
      ]
    }
  ]
}

Input entities: Prototype (page 0), AWS Prototyping (page 1), AWS (page 0), Amazon Web Services (page 1)
Core entities:
- "Prototyping" → [Prototype, AWS Prototyping] → connected to pages 0, 1
- "AWS" → [AWS, Amazon Web Services, AWS Prototyping] → connected to pages 0, 1
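The way core entities absorb their members' mentions can be reproduced with the example above. In this sketch the grouping itself is taken as given input, since in the pipeline it comes from the LLM. The function names and the `qa_index` values in the usage data are hypothetical.

```python
def build_core_entities(entities, groups):
    """Collect each core entity's mentions from all of its members.

    entities: member name -> list of mention dicts
    groups:   core entity name -> list of member names
    """
    core = {}
    for core_name, members in groups.items():
        mentions = []
        for member in members:
            for mention in entities.get(member, []):
                if mention not in mentions:  # skip mentions already absorbed
                    mentions.append(mention)
        core[core_name] = mentions
    return core


def connected_pages(mentions):
    """Distinct segment indexes a core entity is connected to."""
    return sorted({m["segment_index"] for m in mentions})
```

Note that "AWS Prototyping" contributes to both groups, which is how one entity ends up linking two core entities to the same page.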

// Assumes the CDK v2 L1 constructs from aws-cdk-lib.
import * as neptune from 'aws-cdk-lib/aws-neptune';

// Neptune DB Serverless Cluster
const cluster = new neptune.CfnDBCluster(this, 'NeptuneCluster', {
  dbClusterIdentifier: 'idp-v2-neptune',
  engineVersion: '1.4.1.0',
  iamAuthEnabled: true,
  serverlessScalingConfiguration: {
    minCapacity: 1,
    maxCapacity: 2.5,
  },
});

// Serverless Instance
// cluster.ref resolves to the cluster identifier and also makes the
// instance depend on the cluster at deploy time.
const instance = new neptune.CfnDBInstance(this, 'NeptuneInstance', {
  dbInstanceClass: 'db.serverless',
  dbClusterIdentifier: cluster.ref,
});

VPC (10.0.0.0/16)
└─ Private Isolated Subnet
   ├─ Neptune DB Serverless (port 8182)
   └─ GraphService Lambda (SG: VPC CIDR → 8182 allowed)

Only the GraphService Lambda is deployed in the VPC. GraphBuilder Lambda and Search MCP Lambda call GraphService via Lambda invoke from outside the VPC.

| Key | Description |
|---|---|
| /idp-v2/neptune/cluster-endpoint | Neptune cluster endpoint |
| /idp-v2/neptune/cluster-port | Neptune cluster port |
| /idp-v2/neptune/cluster-resource-id | Neptune cluster resource ID |
| /idp-v2/neptune/security-group-id | Neptune security group ID |
| /idp-v2/graph-service/function-arn | GraphService Lambda function ARN |

graph TB
    subgraph Build["Graph Build (Step Functions)"]
        EE["EntityExtractor<br/>(Entity Extraction)"]
        GB["GraphBuilder<br/>(Normalization + Core Entities)"]
    end

    subgraph Search["Graph Search (Agent)"]
        GT["graph_traverse<br/>(Search MCP)"]
        GK["graph_keyword<br/>(Search MCP)"]
    end

    subgraph Viz["Graph Visualization (Frontend)"]
        BE["Backend API"]
    end

    subgraph Core["Core Service (VPC)"]
        GS["GraphService Lambda"]
    end

    subgraph Storage["Storage Layer"]
        Neptune["Neptune DB Serverless"]
        LanceDB["LanceDB (graph_keywords)"]
    end

    EE -->|"S3<br/>(graph_entities)"| GB
    GB -->|"invoke<br/>(add_entities)"| GS
    GB -->|"invoke<br/>(add_graph_keywords)"| LanceDB
    GT -->|"invoke<br/>(search_graph)"| GS
    GK -->|"invoke<br/>(search_graph_keywords)"| LanceDB
    GK -->|"invoke<br/>(raw_query)"| GS
    BE -->|"invoke<br/>(get_entity_graph, get_document_graph)"| GS

    GS -->|"openCypher<br/>(IAM SigV4)"| Neptune

    style Storage fill:#fff3e0,stroke:#ff9900
    style Core fill:#e8f5e9,stroke:#2ea043
    style Build fill:#fce4ec,stroke:#e91e63
    style Search fill:#e3f2fd,stroke:#1976d2
    style Viz fill:#f3e5f5,stroke:#7b1fa2

| Component | Stack | Access Type | Description |
|---|---|---|---|
| GraphService | WorkflowStack | Read/Write | Core Neptune gateway (inside VPC) |
| EntityExtractor | WorkflowStack | Write (S3) | Per-segment entity extraction (parallel) |
| GraphBuilder | WorkflowStack | Write (via GraphService + LanceDB) | Core entity normalization + graph construction |
| graph_traverse | McpStack | Read (via GraphService + LanceDB) | Agent graph traversal from search results |
| graph_keyword | McpStack | Read (via LanceDB + GraphService) | Agent keyword-based graph search |
| Backend API | ApplicationStack | Read (via GraphService) | Frontend graph visualization |