info

This page documents the internal workflow that Kiro follows when you run the corresponding @prompt. You don't need to read this to use the tool — it's here for transparency and for developers who want to understand or extend the workflows.

Healthcheck a Deployed Blockchain Node

Perform a comprehensive healthcheck using AWS CLI commands to query CloudWatch Logs and Metrics. Provide a detailed report with timestamps and specific values.

CRITICAL: Identify Correct Deployment

List all deployment output files: ls deploy-output-*.json
If multiple files exist, ask the user which deployment to check
Confirm the deployment before proceeding:
- Show stack name from filename
- Show instance ID from the file
- Ask: "I found deployment {stack-name} with instance {instance-id}. Is this correct?"
Wait for confirmation before proceeding
Use the confirmed deploy-output-{stack-name}.json file for all subsequent checks

Deployment Information

Read the confirmed deploy-output-{stack-name}.json file
Extract instance ID, region, and stack name
Parse protocol and configuration from stack name (format: {protocol}-{network}-{config})
OPTIONAL: If .env-{stack-name} file exists, read it for additional context (traffic shaping settings, instance type, storage configuration)
If .env file not available, extract configuration from CloudWatch logs and stack name

Node Service Status

Check systemd service logs in CloudWatch for the "node" service
Command: aws logs tail /aws/ec2/blockchain-nodes/systemd-services --log-stream-names $INSTANCE_ID --filter-pattern "node.service"
Look for recent errors, failures, or crashes
Verify the service is running and not restarting repeatedly
For EVM-compatible chains (Ethereum): also filter by "execution" and "consensus"
For non-EVM chains (Solana, etc.): use "node.service" filter

Synchronization Status

Check node sync status from logs
For Ethereum-like chains:
- Execution client: Current block height, blocks behind, throughput (Mgas/second)
- Consensus client: Current slot/epoch, sync distance, optimistic status
For Solana:
- Current slot, slots behind, catchup status
For other protocols: Check protocol-specific sync indicators
Check syncchecker logs for reported metrics

Network Connectivity

Peer count from node logs
Network connectivity issues
P2P port accessibility

System Resources — CPU

Check CloudWatch metrics for CPU utilization
Command: aws cloudwatch get-metric-statistics --namespace CWAgent --metric-name cpu_usage_idle
Report CPU idle percentage (higher is better, >20% idle is healthy)
Identify if CPU is bottleneck (consistently <10% idle)
Check for CPU throttling or saturation

System Resources — Memory

Check CloudWatch metrics for memory usage
Command: aws cloudwatch get-metric-statistics --namespace CWAgent --metric-name mem_used_percent
Report memory consumption percentage
Check for memory pressure (>90% is concerning)
Look for OOM (Out of Memory) events in logs
Report RSS (Resident Set Size) from node logs if available

Block Storage Health

Check CloudWatch metrics for ALL data volumes (DATA_VOL_1, DATA_VOL_2, etc.). All metrics are in CWAgent namespace with dimensions: InstanceId and name (device name like nvme1n1).

For EACH volume, report:

Read Latency (ms/operation):

Calculate: diskio_read_time / diskio_reads (both from CWAgent namespace)
Use Sum statistic for both metrics, then divide
Healthy: <5ms | Warning: 5-10ms | Critical: >10ms

Write Latency (ms/operation):

Calculate: diskio_write_time / diskio_writes (both from CWAgent namespace)
Use Sum statistic for both metrics, then divide
Healthy: <10ms | Warning: 10-20ms | Critical: >20ms

IOPS:

Metrics: diskio_reads, diskio_writes (CWAgent namespace)
Calculate: Sum of metric / PERIOD (e.g., Sum over 60s / 60 = IOPS)
Report current IOPS vs provisioned IOPS
Check if hitting IOPS limits (>90% utilization)

Throughput (bytes/sec):

Metrics: diskio_read_bytes, diskio_write_bytes (CWAgent namespace)
Calculate: Sum of metric / PERIOD (e.g., Sum over 60s / 60 = bytes/sec)
Report current throughput vs provisioned throughput
Check if hitting throughput limits (>90% utilization)

Queue Length:

Metric: diskio_iops_in_progress (CWAgent namespace)
Healthy: <5 | Warning: 5-10 | Critical: >10

Disk Space:

Metric: disk_used_percent (CWAgent namespace)
Dimensions: InstanceId and path (mount path like /data)
Warning: >80% full | Critical: >90% full

Storage Performance Analysis

Identify if storage is the bottleneck for sync performance
Compare current IOPS/throughput against provisioned limits
Recommend storage optimizations if needed:
- Increase IOPS (for gp3: up to 80,000 IOPS)
- Increase throughput (for gp3: up to 2,000 MB/s)
- Switch to io2 or Instance Store for lower latency
- Consider instance store for lowest latency (ephemeral)

Traffic Shaping (if enabled)

Check if traffic shaping is active
Verify net-rules service status
Check if node is falling behind due to bandwidth limits

Recent Issues

Search for error patterns in logs (last 30 minutes)
Identify any warnings or critical messages
Check for common issues: disk full, memory exhaustion, network problems, I/O bottlenecks

Overall Health Assessment

Provide a summary: HEALTHY, SYNCING, DEGRADED, or CRITICAL
List any issues found with severity (HIGH, MEDIUM, LOW)
Identify bottlenecks: CPU, Memory, Storage I/O, Network
Recommend actions if issues detected

CloudWatch Query Reference

Date Command Compatibility (works on both Linux and macOS):

START_TIME=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S 2>/dev/null || date -u -v-1H +%Y-%m-%dT%H:%M:%S)
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%S)

CloudWatch Datapoint Limits — maximum 1,440 datapoints per query:

60-second periods: Maximum 24-hour time range
300-second periods: Maximum 5-day time range
3600-second periods: Maximum 60-day time range

Storage Latency: Use 3600-second (1 hour) periods for reliable averages. 60-second periods often have zero operations, making calculations unreliable.

Log Query Tips:

Use specific time ranges with --start-time (epoch milliseconds)
Limit output with | tail -N
Query recent data first (last 30 minutes) before expanding range
Use filter patterns to reduce data transfer

Instance Store Volumes:

Device names: nvme1n1, nvme2n1, etc. (not EBS volume IDs)
No provisioned IOPS/throughput limits (hardware-limited)
Expected latency: <1ms reads, <2ms writes (NVMe performance)

Discovering Custom Metrics:

aws cloudwatch list-metrics \
  --namespace CWAgent \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --region $REGION | jq -r '.Metrics[] | select(.MetricName | startswith("c1_") or startswith("c2_")) | .MetricName' | sort -u

Disk Space Troubleshooting — if disk_used_percent returns no data:

# List available mount paths:
aws cloudwatch list-metrics --namespace CWAgent --dimensions Name=InstanceId,Value=$INSTANCE_ID --metric-name disk_used_percent --region $REGION | jq -r '.Metrics[] | .Dimensions[] | select(.Name == "path") | .Value' | sort -u

Then try longer time ranges (6-24 hours). For instance store volumes, metrics may not be immediately available.

Storage Performance Thresholds

Metric	Excellent	Good	Acceptable	Poor (bottleneck)
Read Latency	<2ms	<5ms	5-10ms	>10ms
Write Latency	<5ms	<10ms	10-20ms	>20ms
IOPS Utilization	<50%	<70%	70-90%	>90%
Throughput Utilization	<50%	<70%	70-90%	>90%
Queue Length	<2	<5	5-10	>10

Protocol-Specific Considerations

Ethereum (EVM-compatible):

Two services: execution client and consensus client
Filter by "execution" or "consensus" for specific logs
Metrics: block height, Mgas/second, slot numbers
Storage: Heavy write load during sync, read-heavy when synced

Solana:

Single validator service
Filter by "node.service" only
Metrics: slot height, catchup status, vote credits
Storage: Very high IOPS requirements (recommend io2 or instance store)

Other Protocols:

Use "node.service" as the universal filter
Check protocol-specific sync indicators in logs
Storage requirements vary by protocol

Variant Healthchecks

For a quick status check, focus only on: service running, sync status, critical errors (last 10 minutes), CPU idle %, memory used %.

For a performance analysis, focus on: sync throughput, detailed storage analysis, CPU/memory, peer connectivity, bottleneck identification.

For a storage deep dive, focus on: all data volumes (latency, IOPS, throughput), current vs provisioned limits, queue lengths, disk space, specific optimization recommendations.

For troubleshooting, focus on: all errors/warnings in the last hour, failure patterns, resource exhaustion, traffic shaping impact, detailed remediation steps.

CRITICAL: Identify Correct Deployment​

Deployment Information​

Node Service Status​

Synchronization Status​

Network Connectivity​

System Resources — CPU​

System Resources — Memory​

Block Storage Health​

Storage Performance Analysis​

Traffic Shaping (if enabled)​

Recent Issues​

Overall Health Assessment​

CloudWatch Query Reference​

Storage Performance Thresholds​

Protocol-Specific Considerations​

Variant Healthchecks​