Web Crawler Agent
Overview
Section titled “Overview”The Web Crawler Agent crawls web pages, extracts content, and connects it to the document analysis pipeline. It controls a real browser via AgentCore Browser and uses the D2Snap algorithm to compress HTML, saving LLM tokens.
User uploads .webreq file │ EventBridge ▼SQS Queue │ ▼Webcrawler Consumer Lambda │ bedrock-agentcore:InvokeAgentRuntime ▼Web Crawler Agent (AgentCore Runtime) ├─ AgentCore Browser (Playwright) ├─ D2Snap (HTML compression) └─ S3 storage (per-page JSON)Processing Flow
Section titled “Processing Flow”- User uploads a
.webreqfile (URL + instructions) to S3 - EventBridge → SQS → Lambda Consumer invokes AgentCore Runtime
- Agent downloads
.webreqfile from S3, extracts URL and instructions - Navigates pages with AgentCore Browser:
- Take screenshots (visual analysis)
- Extract HTML → D2Snap compression
- Extract content as Markdown
- Save to S3 via
save_page - Evaluate links → crawl additional pages
- Save metadata, update DynamoDB status
| Tool | Description |
|---|---|
browser | Navigate, screenshot, click, type, scroll |
get_compressed_html | Compress HTML with D2Snap and return |
save_page | Save extracted content as per-page JSON to S3 |
get_current_time | Get current time (UTC, US/Eastern, US/Pacific, Asia/Seoul) |
D2Snap (HTML Compression)
Section titled “D2Snap (HTML Compression)”D2Snap (DOM Downsampling for Static Page Analysis) removes unnecessary elements from HTML, reducing LLM token usage by 70-90%.
Compression Process
Section titled “Compression Process”Original HTML │ ├─ 1. Remove non-content: <script>, <style>, <svg>, <iframe>, comments ├─ 2. Remove hidden elements: aria-hidden, display:none ├─ 3. Simplify attributes: keep only id, class, href, src, alt, role, etc. └─ 4. Limit content: text 500 chars, lists 10 items, table rows 20 │ ▼Compressed HTML (70-90% token reduction)Analysis Strategies
Section titled “Analysis Strategies”| Strategy | Preserves | Use Case |
|---|---|---|
content_extraction | Headings, paragraphs, lists, tables | Content extraction |
browser_automation | Buttons, forms, inputs, navigation | Browser automation |
hybrid | Content + navigation + media | Web crawling (default) |
Input / Output
Section titled “Input / Output”Input (.webreq)
Section titled “Input (.webreq)”{ "url": "https://example.com/article", "instruction": "Focus on main article content (optional)"}Output (S3)
Section titled “Output (S3)”s3://bucket/projects/{project_id}/documents/{doc_id}/└── webcrawler/ ├── metadata.json └── pages/ ├── page_0000.json ← { url, title, content, crawled_at } ├── page_0001.json └── page_0002.jsonDynamoDB Status Tracking
Section titled “DynamoDB Status Tracking”Crawling progress is recorded in DynamoDB.
| PK | SK | Purpose |
|---|---|---|
WEB#{document_id} | WF#{workflow_id} | Preprocess status (status, started_at, ended_at) |
WF#{workflow_id} | STEP | Workflow step status |
Limitations
Section titled “Limitations”- Maximum ~20 pages per crawl
- Agent timeout: default 1800 seconds (30 minutes), configurable via
AGENT_TIMEOUT_SECS - Browser operations are serialized with threading locks to prevent Playwright contextvars conflicts