
Graph Database

This project uses Amazon Neptune DB Serverless as its graph database. Entities (people, organizations, concepts, technologies, etc.) and relationships extracted during document analysis are built into a knowledge graph, enabling entity-connection-based traversal that is difficult to achieve with vector search alone.

| Aspect | Vector Search (LanceDB) | Graph Traversal (Neptune) |
| --- | --- | --- |
| Search method | Semantic similarity | Entity relationship graph traversal |
| Strength | Finding “similar content” | Discovering “connected content” |
| Example | Search “AI analysis” → segments with similar content | From pages mentioning “AWS Bedrock” → discover other pages where related entities appear |
| Data | content_combined + vector embeddings | Entity, relationship, and segment nodes |

These two search methods are used together by the agent via MCP Search Tool + MCP Graph Tool. Vector search provides initial results, and graph traversal discovers additional related pages.


Step Functions Workflow
  → Map(SegmentAnalyzer + AnalysisFinalizer)
    → AnalysisFinalizer: Entity/relationship extraction via Strands Agent (parallel per segment)
  → GraphBuilder Lambda: Collection + deduplication
    → GraphService Lambda (VPC): openCypher query execution
      → Neptune DB Serverless

Agent → MCP Gateway → Graph MCP Lambda
  → GraphService Lambda (VPC): search_graph (entity traversal)
  → LanceDB Service Lambda: Segment content retrieval
  → Bedrock Claude Haiku: Result summarization

Frontend → Backend API → GraphService Lambda (VPC)
  → get_entity_graph: Project-wide entity graph
  → get_document_graph: Document-level detailed graph

The following node and relationship structure is stored in Neptune and queried with openCypher.

| Node | Description | Key Properties |
| --- | --- | --- |
| Document | Document | id, project_id, workflow_id, file_name, file_type |
| Segment | Document page/section | id, project_id, workflow_id, document_id, segment_index |
| Analysis | QA analysis result | id, project_id, workflow_id, document_id, segment_index, qa_index, question |
| Entity | Extracted entity | id, project_id, name, type |

| Relationship | Direction | Description |
| --- | --- | --- |
| BELONGS_TO | Segment → Document | Segment belongs to document |
| BELONGS_TO | Analysis → Segment | Analysis belongs to segment |
| NEXT | Segment → Segment | Page order (next segment) |
| MENTIONED_IN | Entity → Analysis | Entity mentioned in a specific QA (confidence, context) |
| RELATES_TO | Entity → Entity | Relationship between entities (relationship, source) |
| RELATED_TO | Document → Document | Manual document-to-document link (reason, label) |

Document (report.pdf)
├── Segment (page 0) ──NEXT──→ Segment (page 1) ──NEXT──→ ...
│   └── Analysis (QA 1) ←──MENTIONED_IN── Entity ("AWS Bedrock", TECH)
│   └── Analysis (QA 2) ←──MENTIONED_IN── Entity ("Claude", PRODUCT)
│                                              │
│                                          RELATES_TO
│                                              ▼
│                                         Entity ("Anthropic", ORG)
└── Segment (page 1)
    └── Analysis (QA 1) ←──MENTIONED_IN── Entity ("Anthropic", ORG)

| Item | Value |
| --- | --- |
| Cluster ID | idp-v2-neptune |
| Engine Version | 1.4.1.0 |
| Instance Class | db.serverless |
| Capacity | min: 1 NCU, max: 2.5 NCU |
| Subnet | Private Isolated |
| Authentication | IAM Auth (SigV4) |
| Port | 8182 |
| Query Language | openCypher |

Neptune DB Serverless scales capacity automatically with usage and drops to the minimum capacity (1 NCU) when idle, which keeps idle cost low.

The GraphService Lambda is a gateway that communicates directly with Neptune. It is deployed inside the VPC (Private Isolated Subnet) so that it can reach the Neptune endpoint.

| Item | Value |
| --- | --- |
| Function Name | idp-v2-graph-service |
| Runtime | Python 3.14 |
| Timeout | 5 min |
| VPC | Private Isolated Subnet |
| Authentication | IAM SigV4 (neptune-db) |

Supported Actions:

| Category | Action | Description |
| --- | --- | --- |
| Write | add_segment_links | Create Document + Segment nodes, BELONGS_TO/NEXT relationships |
| Write | add_analyses | Create Analysis nodes, BELONGS_TO to Segment |
| Write | add_entities | MERGE Entity nodes, MENTIONED_IN to Analysis |
| Write | add_relationships | Create RELATES_TO relationships between Entities |
| Write | link_documents | Create bidirectional RELATED_TO between Documents |
| Write | unlink_documents | Delete RELATED_TO between Documents |
| Write | delete_analysis | Delete Analysis node + cleanup orphaned Entities |
| Write | delete_by_workflow | Delete all graph data for a workflow |
| Read | search_graph | QA ID-based graph traversal (Entity → RELATES_TO → related Segments) |
| Read | traverse | N-hop graph traversal |
| Read | find_related_segments | Find related segments by entity IDs |
| Read | get_entity_graph | Project-wide entity graph query (visualization) |
| Read | get_document_graph | Document-level detailed graph query (visualization) |
| Read | get_linked_documents | Query document link relationships |
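
Callers outside the VPC reach these actions through a Lambda invoke. The following is a minimal sketch of such a call, not the project's actual client. It assumes the request is a JSON object with an `action` field and a `params` object (the real payload shape is not specified here) and that `lambda_client` behaves like `boto3.client("lambda")`. The helper names are hypothetical.

```python
import json

# Action names come from the table above; the payload envelope is an assumption.
WRITE_ACTIONS = {
    "add_segment_links", "add_analyses", "add_entities", "add_relationships",
    "link_documents", "unlink_documents", "delete_analysis", "delete_by_workflow",
}
READ_ACTIONS = {
    "search_graph", "traverse", "find_related_segments",
    "get_entity_graph", "get_document_graph", "get_linked_documents",
}


def build_request(action: str, params: dict) -> bytes:
    """Serialize a GraphService request, rejecting unknown actions early."""
    if action not in WRITE_ACTIONS | READ_ACTIONS:
        raise ValueError(f"unsupported GraphService action: {action}")
    return json.dumps({"action": action, "params": params}).encode("utf-8")


def invoke_graph_service(lambda_client, function_name: str, action: str, params: dict) -> dict:
    """Synchronously invoke GraphService and decode its JSON response.

    `lambda_client` is expected to behave like boto3.client("lambda").
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=build_request(action, params),
    )
    return json.loads(response["Payload"].read())
```

Passing the client in as an argument keeps the sketch testable without AWS credentials. Validating the action name before the invoke avoids a network round trip for a request that GraphService would reject anyway.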

The GraphBuilder Lambda runs after Map(SegmentAnalyzer) completes and before DocumentSummarizer in the Step Functions workflow.

| Item | Value |
| --- | --- |
| Function Name | idp-v2-graph-builder |
| Runtime | Python 3.14 |
| Timeout | 15 min |
| Stack | WorkflowStack |

Processing Flow:

  1. Create Document + Segment structure — Create document/segment nodes and BELONGS_TO, NEXT relationships in Neptune
  2. Load segment analysis results from S3 — Collect analysis data from all segments
  3. Create Analysis nodes — Batch create Analysis nodes per QA pair (200 per batch)
  4. Collect Entities/Relationships — Gather entities and relationships already extracted per segment by AnalysisFinalizer
  5. Entity deduplication — Merge identical entities by name + type
  6. Batch save to Neptune — Save Entities and Relationships in batches of 50, up to 10 parallel workers
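
Steps 5 and 6 can be sketched as follows. This is an illustration under stated assumptions rather than the GraphBuilder source: entities are treated as identical only when `name` and `type` match exactly (any case or whitespace normalization is not described here), and `save_batch` is a hypothetical callable standing in for the GraphService invoke. The batch size of 50 and the limit of 10 workers come from the list above.

```python
from concurrent.futures import ThreadPoolExecutor

ENTITY_BATCH_SIZE = 50  # batch size from step 6
MAX_WORKERS = 10        # parallel worker limit from step 6


def dedupe_entities(entities: list[dict]) -> list[dict]:
    """Merge entities that share the same (name, type), combining their mentions."""
    merged: dict[tuple[str, str], dict] = {}
    for entity in entities:
        key = (entity["name"], entity["type"])
        if key not in merged:
            merged[key] = {"name": entity["name"], "type": entity["type"], "mentioned_in": []}
        merged[key]["mentioned_in"].extend(entity.get("mentioned_in", []))
    return list(merged.values())


def chunked(items: list, size: int) -> list[list]:
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def save_in_batches(entities, save_batch, batch_size=ENTITY_BATCH_SIZE, max_workers=MAX_WORKERS) -> int:
    """Send batches to `save_batch` in parallel and return the number of batches sent."""
    batches = chunked(entities, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consuming the iterator re-raises any exception from a worker.
        list(pool.map(save_batch, batches))
    return len(batches)
```

Merging the `mentioned_in` lists during deduplication keeps every QA reference for an entity that appears in several segments, so a single Entity node can carry all of its MENTIONED_IN edges.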

The Graph MCP Lambda exposes the MCP tools that the AI agent uses to perform graph traversal.

| Item | Value |
| --- | --- |
| Stack | McpStack |
| Runtime | Node.js 22.x (ARM64) |
| Timeout | 30s |

Tools:

| MCP Tool | Description |
| --- | --- |
| graph_search | Traverse the graph using vector search QA IDs as starting points to discover related pages |
| link_documents | Create manual document-to-document links (with reason) |
| unlink_documents | Delete document-to-document links |
| get_linked_documents | Query document link relationships |

graph_search Flow:

1. Use QA IDs from vector search results as starting points
2. QA ID → Analysis node → MENTIONED_IN ← Entity node
3. Entity → RELATES_TO → Related Entity → MENTIONED_IN → Other Analysis
4. Fetch segment content from LanceDB for discovered segments
5. Summarize results with Bedrock Claude Haiku
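
Steps 2 and 3 of this flow can be illustrated with a small in-memory model. The sketch below is not the openCypher query that GraphService runs. It replaces Neptune with plain edge lists, uses hypothetical analysis IDs such as `seg0-qa2`, and assumes RELATES_TO may be followed in either direction.

```python
from collections import defaultdict


def graph_search(qa_ids, mentioned_in, relates_to):
    """Return Analysis IDs reachable from the starting QA IDs via related entities.

    mentioned_in: list of (entity, analysis_id) edges
    relates_to:   list of (entity, entity) edges, followed in both directions here
    """
    entities_by_qa = defaultdict(set)
    qas_by_entity = defaultdict(set)
    for entity, qa in mentioned_in:
        entities_by_qa[qa].add(entity)
        qas_by_entity[entity].add(qa)

    neighbors = defaultdict(set)
    for source, target in relates_to:
        neighbors[source].add(target)
        neighbors[target].add(source)

    start = set(qa_ids)
    found = set()
    for qa in start:
        for entity in entities_by_qa[qa]:        # step 2: QA -> entities mentioned in it
            for related in neighbors[entity]:    # step 3: entity -> related entity
                found |= qas_by_entity[related]  # related entity -> other analyses
    return sorted(found - start)


# Edges modeled on the example graph: "Claude" relates to "Anthropic".
mentioned_in = [("Claude", "seg0-qa2"), ("Anthropic", "seg1-qa1")]
relates_to = [("Claude", "Anthropic")]
print(graph_search(["seg0-qa2"], mentioned_in, relates_to))  # ['seg1-qa1']
```

Starting from the QA on page 0 that mentions "Claude", the traversal reaches the QA on page 1 through the related entity "Anthropic", a page that similarity search on the original query would not necessarily return.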

Entity extraction runs in the AnalysisFinalizer Lambda, parallelized per segment. Since it runs inside Step Functions’ Distributed Map, up to 30 segments extract entities concurrently.

A Strands Agent performs the LLM-based entity and relationship extraction.

| Item | Value |
| --- | --- |
| Model | Bedrock (configurable) |
| Framework | Strands SDK (Agent) |
| Input | Segment AI analysis results + page description |
| Output | entities[] + relationships[] (JSON) |

  • Entity names use canonical forms (e.g., “the transformer model” → “Transformer”)
  • Generic references are excluded (e.g., “Figure 1”, “Table 2”, “the author”)
  • Entity types in uppercase English (e.g., PERSON, ORG, CONCEPT, TECH, PRODUCT)
  • Entity names, context, and relationship labels are written in the document language
  • Every QA pair is guaranteed to have at least one entity connection

{
  "entities": [
    {
      "name": "Amazon Bedrock",
      "type": "TECH",
      "mentioned_in": [
        {
          "segment_index": 0,
          "qa_index": 0,
          "confidence": 0.95,
          "context": "Used as AI model hosting platform"
        }
      ]
    }
  ],
  "relationships": [
    {
      "source": "Amazon Bedrock",
      "source_type": "TECH",
      "target": "Claude",
      "target_type": "PRODUCT",
      "relationship": "hosts"
    }
  ]
}
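
Some of the extraction rules above can be checked mechanically on output of this shape. The following sketch is a hypothetical validator, not part of AnalysisFinalizer. The regular expression for generic references covers only the three examples listed and is an assumption, and rules that need language judgment (canonical names, document language) are not checked.

```python
import re

# Illustrative pattern only: matches "Figure 1", "Table 2", "the author".
GENERIC_REFERENCE = re.compile(r"^(figure|table)\s+\d+$|^the author$", re.IGNORECASE)


def validate_extraction(result: dict, qa_pairs: set[tuple[int, int]]) -> list[str]:
    """Return a list of rule violations for one extraction result (empty if clean).

    qa_pairs: every (segment_index, qa_index) pair that must be covered by an entity.
    """
    problems = []
    covered = set()
    for entity in result.get("entities", []):
        name, entity_type = entity["name"], entity["type"]
        if GENERIC_REFERENCE.match(name.strip()):
            problems.append(f"generic reference used as entity: {name!r}")
        if entity_type != entity_type.upper():
            problems.append(f"entity type is not uppercase: {entity_type!r}")
        for mention in entity.get("mentioned_in", []):
            covered.add((mention["segment_index"], mention["qa_index"]))
    for segment_index, qa_index in sorted(qa_pairs - covered):
        problems.append(f"QA pair ({segment_index}, {qa_index}) has no entity")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller log every violation for a segment at once, or feed them back to the agent for a retry.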

import * as neptune from 'aws-cdk-lib/aws-neptune';

// Neptune DB Serverless Cluster
const cluster = new neptune.CfnDBCluster(this, 'NeptuneCluster', {
  dbClusterIdentifier: 'idp-v2-neptune',
  engineVersion: '1.4.1.0',
  iamAuthEnabled: true, // require SigV4-signed requests
  serverlessScalingConfiguration: {
    minCapacity: 1, // NCU
    maxCapacity: 2.5, // NCU
  },
});

// Serverless Instance
const instance = new neptune.CfnDBInstance(this, 'NeptuneInstance', {
  dbInstanceClass: 'db.serverless',
  dbClusterIdentifier: cluster.dbClusterIdentifier!,
});

VPC (10.0.0.0/16)
└─ Private Isolated Subnet
   ├─ Neptune DB Serverless (port 8182)
   └─ GraphService Lambda (SG: VPC CIDR → 8182 allowed)

Only the GraphService Lambda is deployed in the VPC. GraphBuilder Lambda and Graph MCP Lambda call GraphService via Lambda invoke from outside the VPC.

| Key | Description |
| --- | --- |
| /idp-v2/neptune/cluster-endpoint | Neptune cluster endpoint |
| /idp-v2/neptune/cluster-port | Neptune cluster port |
| /idp-v2/neptune/cluster-resource-id | Neptune cluster resource ID |
| /idp-v2/neptune/security-group-id | Neptune security group ID |
| /idp-v2/graph-service/function-arn | GraphService Lambda function ARN |
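
As an illustration of how these parameters might be consumed, the sketch below builds the Neptune openCypher HTTPS URL from the endpoint and port keys. It assumes the values are read from SSM at runtime (they may equally be injected as environment variables at deploy time) and that `ssm_client` behaves like `boto3.client("ssm")`. The function name is hypothetical.

```python
PARAM_ENDPOINT = "/idp-v2/neptune/cluster-endpoint"
PARAM_PORT = "/idp-v2/neptune/cluster-port"


def neptune_opencypher_url(ssm_client) -> str:
    """Build the Neptune openCypher HTTPS URL from the SSM parameters above.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    def read(name: str) -> str:
        return ssm_client.get_parameter(Name=name)["Parameter"]["Value"]

    # Neptune serves openCypher over HTTPS at the /openCypher path.
    return f"https://{read(PARAM_ENDPOINT)}:{read(PARAM_PORT)}/openCypher"
```

Because IAM authentication is enabled on the cluster, a request to this URL still has to be SigV4-signed for the `neptune-db` service before Neptune accepts it.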

graph TB
    subgraph Build["Graph Build (Step Functions)"]
        AF["AnalysisFinalizer<br/>(Entity Extraction)"]
        GB["GraphBuilder<br/>(Collection + Dedup)"]
    end

    subgraph Search["Graph Search (Agent)"]
        GMCP["Graph MCP Tool"]
    end

    subgraph Viz["Graph Visualization (Frontend)"]
        BE["Backend API"]
    end

    subgraph Core["Core Service (VPC)"]
        GS["GraphService Lambda"]
    end

    subgraph Storage["Storage Layer"]
        Neptune["Neptune DB Serverless"]
    end

    AF -->|"invoke<br/>(via qa-regenerator)"| GS
    GB -->|"invoke<br/>(add_entities, add_relationships)"| GS
    GMCP -->|"invoke<br/>(search_graph)"| GS
    BE -->|"invoke<br/>(get_entity_graph, get_document_graph)"| GS

    GS -->|"openCypher<br/>(IAM SigV4)"| Neptune

    style Storage fill:#fff3e0,stroke:#ff9900
    style Core fill:#e8f5e9,stroke:#2ea043
    style Build fill:#fce4ec,stroke:#e91e63
    style Search fill:#e3f2fd,stroke:#1976d2
    style Viz fill:#f3e5f5,stroke:#7b1fa2

| Component | Stack | Access Type | Description |
| --- | --- | --- | --- |
| GraphService | WorkflowStack | Read/Write | Core Neptune gateway (inside VPC) |
| GraphBuilder | WorkflowStack | Write (via GraphService) | Graph construction in Step Functions |
| AnalysisFinalizer | WorkflowStack | Write (via GraphService) | Per-segment entity extraction + graph updates on QA regeneration |
| Graph MCP Tool | McpStack | Read (via GraphService) | Agent graph traversal tool |
| Backend API | ApplicationStack | Read (via GraphService) | Frontend graph visualization |