Troubleshooting Guide
This guide helps you diagnose and resolve common issues with the Universal Blockchain Node Runner.
CRITICAL: Identify the Correct Deployment
Before troubleshooting, always identify which deployment you're working with:
# List all deployments
ls deploy-output-*.json
# Example output:
# deploy-output-solana-mainnet-beta-agave-rpc-base.json
# deploy-output-solana-mainnet-beta-agave-rpc-extended.json
# deploy-output-ethereum-mainnet-archive.json
For GenAI tools: Always ask the user which deployment to troubleshoot if multiple files exist. Confirm the stack name and instance ID before proceeding.
Extract deployment information:
# Replace {stack-name} with the actual stack name from the filename
export DEPLOY_FILE="deploy-output-{stack-name}.json"
export INSTANCE_ID=$(cat $DEPLOY_FILE | jq -r '..|.InstanceId? | select(. != null)')
export STACK_NAME=$(cat $DEPLOY_FILE | jq -r 'keys[0]')
echo "Troubleshooting deployment: $STACK_NAME"
echo "Instance ID: $INSTANCE_ID"
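When several deployment files exist, it can help to list them next to the stack name each one encodes before exporting DEPLOY_FILE. The sketch below assumes the part of the file name between "deploy-output-" and ".json" is the stack name, as the comment above suggests; the function names are hypothetical helpers and not part of the project.

```shell
# Hypothetical helper: derive the stack name embedded in a deploy-output file name.
stack_name_from_file() {
  name=$(basename "$1")          # drop any directory prefix
  name=${name#deploy-output-}    # strip the leading "deploy-output-"
  printf '%s\n' "${name%.json}"  # strip the trailing ".json"
}

# Hypothetical helper: list every deployment file in the current directory
# together with the stack name derived from it.
list_deployments() {
  for f in deploy-output-*.json; do
    [ -e "$f" ] || continue      # skip the unexpanded glob when nothing matches
    printf '%s -> %s\n' "$f" "$(stack_name_from_file "$f")"
  done
}
```

Running list_deployments in the project root prints one line per deployment, which makes it easier to confirm the intended stack before continuing.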
Quick Checks
Start with these quick diagnostic commands for common issues:
- Deployment Failed:
# Use the correct stack name from deploy-output-{stack-name}.json
aws cloudformation describe-stack-events \
  --stack-name $STACK_NAME \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
- Node Not Starting:
# View node service logs in CloudWatch for a specific instance
aws logs tail /aws/ec2/blockchain-nodes/systemd-services \
  --follow \
  --log-stream-names $INSTANCE_ID \
  --filter-pattern "node.service"
- Health Check Failing (HA deployments):
# Extract target group ARN from deployment output
export TG_ARN=$(cat $DEPLOY_FILE | jq -r '..|.TargetGroupArn? | select(. != null)')
aws elbv2 describe-target-health --target-group-arn $TG_ARN
For detailed troubleshooting, see the sections below.
Table of Contents
- Quick Checks
- Configuration Issues
- Deployment Issues
- Node Operation Issues
- Networking Issues
- Storage Issues
- Monitoring Issues
- Performance Issues
- Traffic Shaping Issues
- Security Issues
Configuration Issues
Protocol Not Found
Symptom: Error message "Protocol 'xyz' not found" or "No installed dependency declares protocol 'xyz'"
Cause: The specified protocol doesn't have an installed blueprint package, or is misspelled
Solution:
- Check available protocols (installed blueprint packages):
# List installed blueprint packages
node -e "
const pkg = require('./package.json');
Object.entries(pkg.dependencies || {}).forEach(([name, ver]) => {
  try {
    const bp = require(name + '/package.json');
    if (bp['aws-blockchain-node-runner']) {
      console.log(bp['aws-blockchain-node-runner'].BLOCKCHAIN_PROTOCOL + ' -> ' + name);
    }
  } catch (e) {}
});
"
- Verify BLOCKCHAIN_PROTOCOL in .env matches a protocol declared by an installed blueprint
- Ensure the protocol name is lowercase
- If using an external blueprint, ensure it is listed in the root package.json dependencies and npm install has been run
Example:
# Wrong
BLOCKCHAIN_PROTOCOL="Ethereum"
# Correct
BLOCKCHAIN_PROTOCOL="ethereum"
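A quick way to catch the casing mistake is to compare the configured value with its lowercase form. The helper below is a hypothetical sketch, not part of the project; it prints the lowercase name and returns a non-zero status when the input was not already lowercase.

```shell
# Hypothetical helper: print the lowercase form of a protocol name and
# return non-zero when the given value was not already lowercase.
check_protocol_case() {
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf '%s\n' "$lower"
  [ "$1" = "$lower" ]
}
```

For example, `check_protocol_case "Ethereum"` prints `ethereum` and fails, signalling that the value in .env should be corrected.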
Invalid package.json blueprint configuration
Symptom: Error parsing protocol configuration
Cause: Malformed JSON or missing "aws-blockchain-node-runner" field in the blueprint's package.json (resolved from node_modules/)
Solution:
- Validate JSON syntax of the installed blueprint:
cat node_modules/aws-bnr-blueprint-mychain/package.json | jq .
- Check for:
- Missing commas
- Trailing commas
- Unquoted strings
- Mismatched brackets
- Missing "aws-blockchain-node-runner" field
Missing Required Environment Variables
Symptom: "Required environment variable X is not set"
Cause: .env file is missing required variables
Solution:
- Check which variables are required:
# Required for all deployments
AWS_ACCOUNT_ID
AWS_REGION
BLOCKCHAIN_PROTOCOL
DEPLOYMENT_MODE
INSTANCE_TYPE
CPU_TYPE
BC_NETWORK
CLIENT_CONFIG
DATA_VOLUMES_COUNT
- Copy from sample configuration:
cp node_modules/aws-bnr-blueprint-{protocol}/samples/.env-mainnet .env
- Fill in your values
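The check for required variables can be scripted before running a deployment. This is a minimal sketch that assumes the .env file uses plain KEY="value" lines; the variable list is copied from the list above, and missing_env_vars is a hypothetical helper, not a project command.

```shell
# Variable names copied from the "required for all deployments" list.
REQUIRED_VARS="AWS_ACCOUNT_ID AWS_REGION BLOCKCHAIN_PROTOCOL DEPLOYMENT_MODE INSTANCE_TYPE CPU_TYPE BC_NETWORK CLIENT_CONFIG DATA_VOLUMES_COUNT"

# Hypothetical helper: print each required variable that has no non-empty
# KEY=value assignment in the given file; return non-zero if any are missing.
missing_env_vars() {
  env_file=$1
  missing=0
  for var in $REQUIRED_VARS; do
    # Require KEY= followed by an optional quote and at least one value character.
    if ! grep -Eq "^${var}=[\"']?[^\"'[:space:]]" "$env_file"; then
      printf '%s\n' "$var"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `missing_env_vars .env` prints nothing and succeeds when every required variable has a value.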
Invalid Storage Configuration
Symptom: "Storage configuration validation failed"
Cause: IOPS or throughput exceeds limits for volume type
Solution:
- Check volume type limits:
- gp3: 3,000-80,000 IOPS, 125-2,000 MB/s throughput
- io1: 100-64,000 IOPS
- io2: 100-64,000 IOPS
- Adjust values in .env:
DATA_VOL_1_IOPS="80000" # Within new gp3 limit
DATA_VOL_1_THROUGHPUT="2000" # Within new gp3 limit
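The limits listed above can be checked locally before deploying. The sketch below hard-codes those ranges as assumptions (they may change on the AWS side) and only checks absolute bounds; validate_volume is a hypothetical helper and does not reproduce the project's own validation.

```shell
# Hypothetical helper: check IOPS (and, for gp3, throughput in MB/s)
# against the ranges listed in this guide.
validate_volume() {
  vtype=$1
  iops=$2
  throughput=${3:-125}   # only used for gp3; defaults to the gp3 minimum
  case $vtype in
    gp3)     min_iops=3000 max_iops=80000 ;;
    io1|io2) min_iops=100  max_iops=64000 ;;
    *) echo "unknown volume type: $vtype"; return 2 ;;
  esac
  if [ "$iops" -lt "$min_iops" ] || [ "$iops" -gt "$max_iops" ]; then
    echo "IOPS $iops outside $min_iops-$max_iops for $vtype"
    return 1
  fi
  if [ "$vtype" = gp3 ] && { [ "$throughput" -lt 125 ] || [ "$throughput" -gt 2000 ]; }; then
    echo "throughput $throughput outside 125-2000 MB/s for gp3"
    return 1
  fi
  echo "ok"
}
```

For instance, `validate_volume gp3 "$DATA_VOL_1_IOPS" "$DATA_VOL_1_THROUGHPUT"` reports which value is out of range.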
HA Configuration Incomplete
Symptom: "HA configuration is incomplete"
Cause: DEPLOYMENT_MODE="ha-nodes" but HA variables not set
Solution:
- Add all required HA variables:
HA_NUMBER_OF_NODES="3"
HA_ALB_HEALTHCHECK_PORT="8545" # Use protocol's RPC port (8545 for Ethereum, 8899 for Solana)
HA_ALB_HEALTHCHECK_PATH="/health"
HA_ALB_HEALTHCHECK_GRACE_PERIOD_MIN="60"
HA_ALB_HEALTHCHECK_INTERVAL_SEC="30"
HA_ALB_HEALTHCHECK_TIMEOUT_SEC="5"
HA_ALB_HEALTHCHECK_HEALTHY_THRESHOLD="3"
HA_ALB_HEALTHCHECK_UNHEALTHY_THRESHOLD="2"
HA_NODES_HEARTBEAT_DELAY_MIN="10"
HA_ALB_DEREGISTRATION_DELAY_SEC="30"
- Or use a sample HA configuration:
cp node_modules/aws-bnr-blueprint-{protocol}/samples/.env-ha .env
Deployment Issues
CDK Bootstrap Required
Symptom: "This stack uses assets, so the toolkit stack must be deployed"
Cause: CDK not bootstrapped in the account/region
Solution:
npx cdk bootstrap aws://ACCOUNT-ID/REGION
Example:
npx cdk bootstrap aws://123456789012/us-east-1
Insufficient IAM Permissions
Symptom: "User is not authorized to perform: iam:CreateRole"
Cause: AWS credentials lack necessary permissions
Solution:
- Ensure your IAM user/role has permissions for:
- CloudFormation
- EC2
- IAM
- S3
- CloudWatch
- Auto Scaling (for HA)
- Elastic Load Balancing (for HA)
- Use AdministratorAccess for initial testing
- Create custom policy for production:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:*",
        "ec2:*",
        "iam:*",
        "s3:*",
        "cloudwatch:*",
        "autoscaling:*",
        "elasticloadbalancing:*"
      ],
      "Resource": "*"
    }
  ]
}
Region Mismatch Between .env and AWS Profile
Symptom: You expect deployment to one region but resources appear in another.
Cause: Previously, CDK would use the AWS CLI profile region (CDK_DEFAULT_REGION) instead of AWS_REGION from your .env file. This is no longer an issue — the app now enforces the .env region at startup.
Current behavior: AWS_REGION in your .env always determines the deployment region. If it differs from your CLI profile default, a note is printed at synth time:
Note: deploying to us-east-1 (from .env), AWS CLI profile default is us-west-2
If you're still seeing unexpected regions, verify AWS_REGION is correctly set in your .env file:
grep AWS_REGION .env
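To compare the two regions by hand, the small sketch below prints a note in the same format as the one shown above when they differ. It is a hypothetical helper that mirrors the message format only; it is not the app's own code.

```shell
# Hypothetical helper: print a note when the .env region differs from the
# AWS CLI profile default. Prints nothing when they match or when the
# profile has no default region.
region_note() {
  env_region=$1
  profile_region=$2
  if [ -n "$profile_region" ] && [ "$env_region" != "$profile_region" ]; then
    echo "Note: deploying to $env_region (from .env), AWS CLI profile default is $profile_region"
  fi
}
```

A typical call, assuming a KEY="value" style .env, is `region_note "$(grep '^AWS_REGION=' .env | cut -d= -f2 | tr -d '"')" "$(aws configure get region)"`.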
Stack Already Exists
Symptom: "Stack [name] already exists"
Cause: Attempting to deploy with same stack name. Stack names are automatically generated from BLOCKCHAIN_PROTOCOL, BC_NETWORK, and CLIENT_CONFIG (e.g., ethereum-mainnet-geth-1-14-0-lighthouse-2-5-1-full).
Solution:
- Update existing stack:
npx cdk deploy --json --outputs-file deploy-output.json
- Or destroy and redeploy:
npx cdk destroy
npx cdk deploy --json --outputs-file deploy-output.json
- To deploy a different configuration alongside the existing one, change BC_NETWORK or CLIENT_CONFIG in your .env file to generate a unique stack name.
Resource Limit Exceeded
Symptom: "You have exceeded the limit for X"
Cause: AWS service limits reached
Solution:
- Check current limits:
# The full quota name includes the instance families, so match on its prefix
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'Running On-Demand Standard')]"
- Request limit increase via AWS Support
- Or use different instance type/region
Instance Type Not Available in Availability Zone
Symptom: Deployment fails with an error indicating the requested instance type is not available in the selected availability zone (e.g., "Your requested instance type is not supported in your requested Availability Zone")
Cause: The automatically selected availability zone does not support the configured EC2 instance type. Not all instance types are available in every AZ within a region.
Solution:
- Check which AZs support your instance type:
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=<type> \
  --region <region>
Replace <type> with your instance type (e.g., m6a.2xlarge) and <region> with your AWS region.
- Set AWS_AZ in your .env file to an AZ from the output above:
AWS_AZ="us-east-1a"
- Redeploy:
npx cdk deploy --json --outputs-file deploy-output.json
Notes:
- AWS_AZ is only used for single-node deployments. HA deployments use the Auto Scaling Group's multi-AZ placement and ignore this setting.
- The AZ must belong to the configured AWS_REGION (e.g., us-east-1a for region us-east-1).
- See Configuration Reference for full details on the AWS_AZ variable.
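The rule that the AZ must belong to the configured region can be checked with a simple name comparison. The sketch below assumes standard AZ names (the region name followed by a single letter); az_in_region is a hypothetical helper and does not confirm that the AZ actually offers the instance type.

```shell
# Hypothetical helper: succeed when the AZ name is the region name followed
# by exactly one lowercase letter (for example us-east-1a in us-east-1).
az_in_region() {
  case $1 in
    "$2"[a-z]) return 0 ;;
    *)         return 1 ;;
  esac
}
```

For example, `az_in_region "$AWS_AZ" "$AWS_REGION" || echo "AWS_AZ does not match AWS_REGION"` flags a mismatch before deployment.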
CloudFormation Rollback
Symptom: Stack creation failed and rolled back
Cause: Various - check CloudFormation events
Solution:
- View stack events:
aws cloudformation describe-stack-events \
  --stack-name YourStackName \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
- Check specific error messages
- Fix configuration and redeploy
- Common causes:
- Invalid instance type for region
- Insufficient capacity
- Security group rule conflicts
- IAM permission issues
Node Operation Issues
Node Not Starting
Symptom: Instance launches but node service fails to start
Diagnosis:
- View node service logs in CloudWatch:
# View recent node.service logs
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --filter-pattern "node.service"
# View for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service"
- Check service status via CloudWatch Logs Insights:
Navigate to CloudWatch Logs Insights and use this query to check for service failures:
fields @timestamp, @message
| filter @message like /node.service/ and @message like /error|failed|fatal/i
| sort @timestamp desc
| limit 50
- Check user data execution:
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow --log-stream-names $INSTANCE_ID
- If CloudWatch logs are not available, connect via SSM:
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Check service status
sudo systemctl status node
# View service logs
sudo journalctl -u node -n 100 --no-pager
Common Causes:
- Missing Dependencies:
# View dependency errors in CloudWatch for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service" | grep -i "error\|failed"
# Or connect via SSM to check
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
dpkg -l | grep {package-name}
- Insufficient Disk Space:
# Connect via SSM to check disk space
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
df -h
- Port Already in Use:
# View port conflict errors in CloudWatch for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service" | grep -i "port\|address"
# Or connect via SSM to check
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo netstat -tulpn | grep {port}
Solution: Fix the specific issue and restart service:
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Restart service
sudo systemctl restart node
# Verify service started
sudo systemctl status node
Then verify in CloudWatch:
# Check for "Started" message for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service" | grep -i "started"
Node Syncing Slowly
Symptom: Block height increasing very slowly
Diagnosis:
- Check sync status:
# Protocol-specific command (example for Ethereum)
curl http://localhost:8545 -X POST \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
# For Solana (port 8899)
curl http://localhost:8899 -X POST \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"getHealth","params":[],"id":1}'
- Check CloudWatch Dashboard (single-node deployments):
  - Review the "Volume Read/Write latency (ms/op)" widgets
  - High latency (>10ms for reads, >5ms for writes) indicates storage bottleneck
  - Check "Volume Read/Write (IO/sec)" for IOPS saturation
  - Review "Disk Used (%)" to ensure sufficient free space
- Check network connectivity:
ping -c 5 8.8.8.8
- Check peer count:
# Protocol-specific command (example for Ethereum)
curl http://localhost:8545 -X POST \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}'
# For Solana (port 8899)
curl http://localhost:8899 -X POST \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"getClusterNodes","params":[],"id":1}'
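The Ethereum net_peerCount call returns the count as a hex string (for example "0x19"), which is easy to misread. The sketch below converts a response body to a decimal number; peer_count_from_response is a hypothetical helper and assumes the single-line JSON shape that the curl command above typically returns.

```shell
# Hypothetical helper: extract the hex "result" field from a net_peerCount
# JSON-RPC response and print it as a decimal number.
peer_count_from_response() {
  hex=$(printf '%s' "$1" | sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p')
  [ -n "$hex" ] || return 1      # no hex result found (error response or other shape)
  printf '%d\n' "$hex"
}
```

It can be combined with the curl call by capturing its output, for example `peer_count_from_response "$(curl -s http://localhost:8545 -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}')"`.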
Solutions:
- Enable Blockchain Snapshot: Significantly reduces sync time
SNAPSHOT_ENABLED="true"
SNAPSHOT_DOWNLOAD_URL="https://snapshots.example.com/latest.tar.gz"
- Optimize Storage (if high latency detected):
a. Switch to io2 volumes for lower latency:
DATA_VOL_1_TYPE="io2"
DATA_VOL_1_IOPS="64000"
b. Or use Instance Store for lowest latency (data is ephemeral):
DATA_VOL_1_TYPE="instance-store"
# Note: Data is lost on instance stop/termination
# Requires instance types with instance store (i3, i4i, i4g, etc.)
Note: Storage type changes require stack destruction and redeployment.
- Increase Instance Size: More CPU/memory for faster processing
INSTANCE_TYPE="m6a.4xlarge" # Upgrade from 2xlarge
- Check Peers: Ensure sufficient peer connections
  - Verify security group allows P2P ports
  - Verify network connectivity
Node Crashed
Symptom: Node service stopped unexpectedly
Diagnosis:
- View crash logs in CloudWatch:
# View recent node.service errors for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service" | grep -i "error\|failed\|stopped"
- Check for OOM (Out of Memory) events:
# Connect via SSM to check system logs
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo dmesg | grep -i "out of memory"
sudo journalctl -xe | grep -i "oom"
- Check disk space:
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
df -h
- View detailed service logs:
# View the last 200 lines of node.service logs from the past hour for specific instance
# (no --follow here: a followed stream never ends, so tail -200 would never print)
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --since 1h --log-stream-names $INSTANCE_ID --filter-pattern "node.service" | tail -200
Solutions:
- Out of Memory: Increase instance size
INSTANCE_TYPE="m6a.4xlarge" # More memory
- Disk Full: Increase volume size
# Modify volume size
aws ec2 modify-volume --volume-id vol-xxxxx --size 4000
# Connect via SSM to extend filesystem
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo resize2fs /dev/xvdg
- Corrupted Data: Restore from blockchain snapshot or resync
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo systemctl stop node
sudo rm -rf /data/blockchain/chaindata
# Re-download blockchain snapshot or resync
- Verify service restarted:
# Check CloudWatch logs for "Started" message for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service" | grep -i "started"
Networking Issues
Cannot Connect to RPC Endpoint
Symptom: Connection refused when accessing RPC endpoint
Diagnosis:
- Check if service is listening:
# Check for RPC port (varies by protocol)
sudo netstat -tulpn | grep LISTEN
# Ethereum: 8545, Solana: 8899, Bitcoin: 8332
- Check security group rules:
aws ec2 describe-security-groups \
  --group-ids sg-xxxxx \
  --query 'SecurityGroups[0].IpPermissions'
- Test locally on instance:
# Use protocol-specific RPC port
curl http://localhost:{rpc-port}
# Ethereum: 8545, Solana: 8899, Bitcoin: 8332
Solutions:
- Service Not Running: Start the service
sudo systemctl start node
- Security Group: Verify port is open
  - Check requiredPorts in the protocol's package.json "aws-blockchain-node-runner" field
  - Ensure security group includes the port
  - For testing, temporarily allow from your IP
- Binding Address: Ensure RPC service binds to internal IP address (security best practice)
# Check configuration
cat /data/blockchain/config/* | grep -i "listen\|bind\|rpc"
# RPC should bind to internal IP for security
# Correct: listen_addr = "172.31.x.x:8545" (internal IP)
# Incorrect: listen_addr = "0.0.0.0:8545" (all interfaces - security risk)
# Get internal IP
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
EC2_INTERNAL_IP=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/local-ipv4)
echo "Internal IP: $EC2_INTERNAL_IP"
# Update configuration to use internal IP
# Example: sed -i "s/0.0.0.0:8545/$EC2_INTERNAL_IP:8545/g" /data/blockchain/config/config.toml
Security Note:
- RPC endpoints: Should bind to internal IP (e.g., 172.31.x.x:8545)
- P2P endpoints: Can bind to 0.0.0.0 (needs external connectivity)
- Access control is managed via Security Groups, not binding addresses
- Binding to internal IP provides defense-in-depth security
Health Check Failing (HA)
Symptom: ALB marks targets as unhealthy
Diagnosis:
- Check target health:
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...
- Test health check endpoint:
# Use protocol-specific health check port and path
curl http://instance-ip:{health-port}{health-path}
# Example: curl http://instance-ip:8545/health (Ethereum)
# Example: curl http://instance-ip:8899/health (Solana)
- Check ALB logs (if enabled)
Solutions:
- Wrong Health Check Path: Update configuration
HA_ALB_HEALTHCHECK_PATH="/health" # Correct path
- Node Not Ready: Increase grace period
HA_ALB_HEALTHCHECK_GRACE_PERIOD_MIN="90" # More time to initialize
- Health Check Too Strict: Adjust thresholds
HA_ALB_HEALTHCHECK_HEALTHY_THRESHOLD="2" # Reduce from 3
HA_ALB_HEALTHCHECK_INTERVAL_SEC="60" # Increase interval
- Port Mismatch: Verify health check port matches protocol
# Set to protocol's RPC port
HA_ALB_HEALTHCHECK_PORT="8545" # Ethereum
HA_ALB_HEALTHCHECK_PORT="8899" # Solana
HA_ALB_HEALTHCHECK_PORT="8332" # Bitcoin
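When tuning the interval and thresholds, it helps to estimate how long a target needs before its state changes. As a rough sketch, that time is the number of consecutive checks required multiplied by the interval; the helper below is hypothetical and ignores the per-check timeout and the grace period.

```shell
# Hypothetical helper: approximate seconds until an ALB changes a target's
# state, computed as interval between checks times consecutive checks needed.
seconds_to_state_change() {
  interval_sec=$1
  threshold=$2
  echo $(( interval_sec * threshold ))
}

seconds_to_state_change 30 3   # interval 30 s, healthy threshold 3
seconds_to_state_change 30 2   # interval 30 s, unhealthy threshold 2
```

Under this approximation, an interval of 60 with a healthy threshold of 2 takes about as long or longer to mark a target healthy than 30 with a threshold of 3, but it also tolerates a longer run of failed checks before marking it unhealthy.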
Peer Connection Issues
Symptom: Node has no peers or very few peers
Diagnosis:
- Check peer count:
# Protocol-specific command
- Check P2P port accessibility:
# Check for P2P ports (varies by protocol)
sudo netstat -tulpn | grep LISTEN
# Ethereum: 30303, Solana: 8001-8020 range, Bitcoin: 8333
- Verify security group allows P2P ports
Solutions:
- Security Group: Ensure P2P ports are open
  - Check both TCP and UDP
  - Allow from 0.0.0.0/0 for P2P ports
- Network Configuration: Check node configuration
  - Verify external IP is correct
  - Check NAT traversal settings
Storage Issues
Disk Full
Symptom: "No space left on device"
Diagnosis:
df -h
du -sh /data/* | sort -h
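To spot nearly full filesystems quickly, the df output can be filtered by usage percentage. The sketch below reads POSIX-format df output (df -P) on standard input; mounts_over is a hypothetical helper and assumes mount points contain no spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above the given percentage.
mounts_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                      # strip the percent sign
    if (use + 0 >= limit + 0) print $6, $5 # mount point, usage
  }'
}
```

For example, `df -P | mounts_over 90` lists only the filesystems that are at least 90% full.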
Solutions:
- Increase Volume Size:
# Modify volume
aws ec2 modify-volume --volume-id vol-xxxxx --size 4000
# Wait for modification to complete
aws ec2 describe-volumes-modifications --volume-ids vol-xxxxx
# Extend filesystem
sudo resize2fs /dev/xvdg # For ext4
# OR
sudo xfs_growfs /data # For xfs
- Clean Up Old Data:
# Protocol-specific cleanup commands
# Be careful - may require resync
- Add Additional Volume:
  - Update .env with new volume
  - Redeploy stack
  - Migrate data if needed
Disk Fills During Snapshot Download
Symptom: Disk fills to 100% during snapshot download or extraction, node never starts. CloudWatch logs show download stopping at ~60-70% or extraction failing with "No space left on device".
Cause: The compressed snapshot archive and extracted data both reside on the same /data volume. Peak disk usage = compressed_archive_size + extracted_data_size, which exceeds available space for large snapshots.
Diagnosis:
# Connect via SSM
export INSTANCE_ID=$(cat $DEPLOY_FILE | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Check disk usage
df -h /data
# Check if snapshot archive exists alongside extracted data
ls -lh /data/snapshot-archive 2>/dev/null || ls -lh /data/snapshot.tar.zst 2>/dev/null
Solution: Configure a snapshot staging volume to hold the compressed archive on a separate temporary EBS volume:
- Destroy the failed stack:
npx cdk destroy
- Add staging volume to .env (set to ~1.1x the compressed archive size):
# Example for Base mainnet op-reth (~4.86 TB archive)
SNAPSHOT_STAGING_VOL_SIZE="5000"
# Example for BNB mainnet bsc-reth (~9.7 TB archive)
SNAPSHOT_STAGING_VOL_SIZE="10000"
- Redeploy:
npx cdk deploy --json --outputs-file deploy-output-$STACK_NAME.json
The staging volume is a temporary gp3 EBS volume that is automatically deleted after successful extraction. Cost is minimal (~$29 for a 5 TB volume over 2 days) compared to the cost of a failed deployment.
See Snapshot Staging Guide for detailed volume sizing guidance per protocol.
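The ~1.1x sizing rule can be worked out from the archive size in bytes. The sketch below assumes SNAPSHOT_STAGING_VOL_SIZE is expressed in GiB (the unit EBS uses for volume sizes) and that the archive size is known in bytes; both are assumptions, and staging_vol_size_gib is a hypothetical helper. Check the Snapshot Staging Guide for the authoritative sizing.

```shell
# Hypothetical helper: given a compressed archive size in bytes, print a
# staging volume size in GiB equal to 1.1x the archive, rounded up.
staging_vol_size_gib() {
  bytes=$1
  denom=$(( 10 * 1024 * 1024 * 1024 ))   # 10 GiB, so bytes*11/denom is 1.1x in GiB
  echo $(( (bytes * 11 + denom - 1) / denom ))
}
```

Under these assumptions, a 4.86 TB archive (4860000000000 bytes) gives 4979, which is in line with the 5000 used in the example above.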
Orphaned Snapshot Staging Volume
Symptom: A gp3 EBS volume tagged Purpose=snapshot-staging remains in the account (and keeps incurring cost) after a deployment, even though the snapshot finished downloading.
Cause: The in-instance cleanup could not confirm the staging volume was deleted — for example a missing ec2:DetachVolume/ec2:DeleteVolume permission, a stalled detach, an unreachable metadata service, or the volume ID being lost after a mid-download reboot. Cleanup now logs this rather than swallowing it.
Diagnosis:
# Look for the cleanup error in cloud-init-output for the instance
export INSTANCE_ID=$(cat $DEPLOY_FILE | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output \
--log-stream-names $INSTANCE_ID \
--filter-pattern "staging cleanup"
# List any staging volumes still present in the region
aws ec2 describe-volumes \
--filters "Name=tag:Purpose,Values=snapshot-staging" \
--query 'Volumes[].{Id:VolumeId,State:State,AZ:AvailabilityZone}' \
--output table
Solution:
- If the stack is still deployed, npx cdk destroy removes the volume via CloudFormation (RemovalPolicy.DESTROY).
- If the volume is orphaned (its instance/stack is gone), delete it manually after confirming it is not in use:
aws ec2 detach-volume --volume-id vol-xxxxxxxx 2>/dev/null || true
aws ec2 wait volume-available --volume-ids vol-xxxxxxxx
aws ec2 delete-volume --volume-id vol-xxxxxxxx
- If cleanup failed due to missing permissions, confirm the instance role grants ec2:DetachVolume and ec2:DeleteVolume (single-node) or the HA self-management actions, then redeploy.
To validate the staging cleanup lifecycle cheaply, use the dummy debug path documented in Snapshot Staging Guide and look for the STAGING DEBUG: PASS line in cloud-init-output.
Volume Not Mounting
Symptom: Volume exists but not mounted
Diagnosis:
lsblk
sudo blkid
mount | grep /data
Solutions:
- Check /etc/fstab:
cat /etc/fstab
- Mount Manually:
sudo mount /dev/xvdg /data
- Check setup-storage.sh Logs:
sudo cat /var/log/cloud-init-output.log | grep -A 20 "setup-storage"
- Verify Device Name:
# Device names may differ
lsblk
# Update mount command accordingly
Poor I/O Performance
Symptom: High disk latency, slow read/write
Diagnosis:
- Check I/O metrics:
iostat -x 5
- Check CloudWatch metrics:
- VolumeReadOps
- VolumeWriteOps
- VolumeThroughputPercentage
- VolumeQueueLength
Solutions:
- Increase IOPS:
DATA_VOL_1_IOPS="80000" # New gp3 maximum
- Increase Throughput (gp3 only):
DATA_VOL_1_THROUGHPUT="2000" # New gp3 maximum
- Use io2 Volumes:
DATA_VOL_1_TYPE="io2"
DATA_VOL_1_IOPS="64000"
- Use Instance Store (if available):
DATA_VOL_1_TYPE="instance-store"
# Note: Data is ephemeral
- Verify Instance Store Volume Selection:
# List all NVMe devices
lsblk | grep nvme
# Check which volumes are mounted
df -h | grep nvme
# View instance store setup logs
sudo cat /var/log/cloud-init-output.log | grep -A 30 "setup-storage"
Monitoring Issues
CloudWatch Log Groups
The CloudWatch agent is configured to send the following logs to CloudWatch Logs:
| Log Group | Description | Retention | Source |
|---|---|---|---|
| /aws/ec2/blockchain-nodes/cloud-init-output | Cloud-init output | 7 days | /var/log/cloud-init-output.log |
| /aws/ec2/blockchain-nodes/systemd-services | Systemd service logs | 7 days | /var/log/syslog |
Note: Ubuntu's rsyslog automatically forwards all systemd service logs to /var/log/syslog, which is then collected by the CloudWatch agent.
Viewing Logs:
# View cloud-init output (most useful for troubleshooting deployment)
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow
# View systemd service logs (node.service, syncchecker.service, net-rules.service)
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow
# View logs for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow --log-stream-names $INSTANCE_ID
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID
# Filter logs by service name for specific instance
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service"
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service"
CloudWatch Logs Insights Queries:
Use CloudWatch Logs Insights for advanced log analysis. Example query to check node.service errors:
# View node.service errors
fields @timestamp, @message
| filter @message like /node.service/ and @message like /error|failed|fatal/i
| sort @timestamp desc
| limit 50
Accessing Logs via Console:
- Open the CloudWatch Logs Console
- Navigate to log group: /aws/ec2/blockchain-nodes/systemd-services
- Select the log stream for your instance (instance ID)
- Use the filter box to search for specific services:
  - node.service - Main blockchain node service
  - syncchecker.service - Sync checker and traffic shaping control
  - net-rules.service - Traffic shaping network rules
- Click "Actions" → "View in Logs Insights" for advanced queries
Metrics Not Appearing
Symptom: CloudWatch dashboard shows no data
Diagnosis:
- Check CloudWatch agent status:
sudo systemctl status amazon-cloudwatch-agent
- Check agent logs:
sudo cat /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
- Check agent logs in CloudWatch (if available):
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "cloudwatch-agent"
- Or check agent logs directly on instance:
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo cat /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
- Verify IAM permissions:
aws sts get-caller-identity
Solutions:
- Restart CloudWatch Agent:
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo systemctl restart amazon-cloudwatch-agent
- Check IAM Role: Ensure instance has CloudWatch permissions
  - CloudWatchAgentServerPolicy
  - Custom metrics permissions
- Verify Configuration:
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
sudo cat /opt/aws/amazon-cloudwatch-agent/etc/custom-amazon-cloudwatch-agent.json
- Check Region: Ensure metrics sent to correct region
Dashboard Not Created
Symptom: CloudWatch dashboard doesn't exist after deployment
Diagnosis:
- Check CloudFormation stack outputs
- Check CDK synthesis output
Solutions:
- Single-Node Only: Dashboards only created for single-node deployments
- HA deployments don't include default dashboard
- Create custom dashboard for HA
Performance Issues
High CPU Usage
Symptom: CPU consistently above 80%
Diagnosis:
top
htop # If installed
Solutions:
- Increase Instance Size:
INSTANCE_TYPE="m6a.4xlarge" # More vCPUs
- Optimize Node Configuration:
  - Reduce cache size
  - Adjust thread count
  - Disable unnecessary features
- Check for Runaway Processes:
ps aux --sort=-%cpu | head -10
High Memory Usage
Symptom: Memory consistently above 80%, potential OOM
Diagnosis:
free -h
sudo dmesg | grep -i "out of memory"
Solutions:
- Increase Instance Size:
INSTANCE_TYPE="m6a.4xlarge" # More memory
- Optimize Node Configuration:
  - Reduce cache size
  - Adjust memory limits
  - Enable swap (temporary solution)
- Add Swap (temporary only as it puts more pressure on storage):
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Slow Deployment
Symptom: CDK deployment takes very long
Causes:
- Slow instance initialization
- Snapshot download
Solutions:
- Faster Instance: Use larger instance type temporarily
- Optimize Snapshot: Use compressed snapshots
- Parallel Deployment: Deploy multiple stacks in parallel (only if stacks deploy different protocols)
Traffic Shaping Issues
Traffic Shaping Not Working
Symptom: Traffic shaping enabled but bandwidth not limited
Diagnosis:
- Check net-rules service status in CloudWatch:
# View net-rules.service logs for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service"
# Check for service start/stop events
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service" | grep -i "started\|stopped"
- Check sync checker status in CloudWatch:
# View syncchecker.service logs for specific instance
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
- Verify traffic shaping configuration:
# Connect via SSM to check configuration
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
cat /etc/cdk_environment | grep TRAFFIC_SHAPING
sudo systemctl status net-rules.service
sudo systemctl status syncchecker.timer
Solutions:
-
Service Not Running: Start the service
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
sudo systemctl start net-rules.service
sudo systemctl status net-rules.service
Then verify in CloudWatch:
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service" | grep -i "started"
-
Sync Checker Not Running: Start the timer
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
sudo systemctl start syncchecker.timer
sudo systemctl status syncchecker.timer
Then verify in CloudWatch:
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
-
Node Not Fully Synced: Traffic shaping only activates when node is fully synchronized
- Check node sync status using protocol-specific commands
- Wait for initial sync to complete
- Check c1_blocks_behind metric in CloudWatch
- View sync status in CloudWatch logs:
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service" | grep -i "blocks behind\|slots behind"
-
Configuration Error: Verify environment variables
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
# Should show true
echo $TRAFFIC_SHAPING_ENABLED
# Should show configured rate
echo $TRAFFIC_SHAPING_RATE_MBIT
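The configuration check above can also be run against the environment file itself, which avoids depending on whether the variables are exported in the current shell. This is a minimal sketch, assuming `/etc/cdk_environment` holds one `KEY="value"` assignment per line (the diagnosis step greps that file, but its exact format is an assumption here). The helper names `get_env_value` and `check_traffic_shaping_config` are hypothetical.

```bash
# Hypothetical helper: print the value of KEY from a KEY=value file,
# stripping an optional "export " prefix, trailing comments and
# surrounding double quotes. The last assignment wins.
get_env_value() {
  local key="$1" file="$2"
  awk -F= -v key="$key" '
    { name = $1; sub(/^export[ \t]+/, "", name) }
    name == key {
      value = substr($0, index($0, "=") + 1)
      sub(/[ \t]+#.*$/, "", value)    # drop trailing "# comment"
      gsub(/^"|"$/, "", value)        # drop surrounding quotes
      found = value
    }
    END { if (found != "") print found }
  ' "$file"
}

# Hypothetical check: traffic shaping must be enabled with a numeric rate.
check_traffic_shaping_config() {
  local file="${1:-/etc/cdk_environment}"
  local enabled rate
  enabled=$(get_env_value TRAFFIC_SHAPING_ENABLED "$file")
  rate=$(get_env_value TRAFFIC_SHAPING_RATE_MBIT "$file")
  if [ "$enabled" != "true" ]; then
    echo "TRAFFIC_SHAPING_ENABLED is '${enabled:-unset}', expected 'true'"
    return 1
  fi
  case "$rate" in
    ''|*[!0-9]*)
      echo "TRAFFIC_SHAPING_RATE_MBIT is '${rate:-unset}', expected a number"
      return 1
      ;;
  esac
  echo "Traffic shaping enabled at ${rate} Mbit"
}
```

Inside an SSM session, `check_traffic_shaping_config` with no argument reads `/etc/cdk_environment`. Pass a path to check a local `.env` file before redeploying.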
Traffic Shaping Causing Sync Issues
Symptom: Node falling behind after traffic shaping enabled
Diagnosis:
-
Check blocks behind metric in CloudWatch:
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name c1_blocks_behind \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average \
  --region $AWS_REGION
-
Check if traffic shaping is active in CloudWatch:
# View net-rules service status for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service"
-
Check sync checker logs in CloudWatch:
# View sync checker activity for specific instance
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
Solutions:
-
Rate Too Low: Increase bandwidth limit
# Update .env
TRAFFIC_SHAPING_RATE_MBIT="50" # Increase from 40
# Redeploy
npx cdk deploy --json --outputs-file deploy-output.json
-
Threshold Too High: Reduce max blocks behind threshold
# Update .env
TRAFFIC_SHAPING_MAX_BLOCKS_BEHIND="5" # Reduce from 10
# Redeploy
npx cdk deploy --json --outputs-file deploy-output.json
-
Disable Traffic Shaping: If issues persist
# Update .env
TRAFFIC_SHAPING_ENABLED="false"
# Redeploy
npx cdk deploy --json --outputs-file deploy-output.json
-
Manual Override: Temporarily disable traffic shaping
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
# Stop traffic shaping
sudo systemctl stop net-rules.service
# Stop sync checker
sudo systemctl stop syncchecker.timer
Then verify in CloudWatch:
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service" | grep -i "stopped"
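To decide whether the node is really falling behind, it can help to reduce the `c1_blocks_behind` datapoints to a single peak value and compare it with the configured threshold. The sketch below assumes the metric query is run with the AWS CLI global options `--query 'Datapoints[].Average' --output text`, which print the averages as whitespace-separated numbers. The helper names `max_value` and `lag_exceeds` are hypothetical.

```bash
# Hypothetical helper: read whitespace-separated numbers on stdin and
# print the largest one. Non-numeric tokens (for example "None") are
# ignored; nothing is printed if no number is found.
max_value() {
  awk '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^-?[0-9]+(\.[0-9]+)?$/ && (!seen || $i + 0 > max)) {
          max = $i + 0
          seen = 1
        }
    }
    END { if (seen) print max }
  '
}

# Hypothetical helper: succeed (exit 0) when peak is above threshold.
lag_exceeds() {
  local peak="$1" threshold="$2"
  awk -v p="$peak" -v t="$threshold" 'BEGIN { exit !(p + 0 > t + 0) }'
}

# Example (same query as above, reduced to the averages only):
# peak=$(aws cloudwatch get-metric-statistics --namespace CWAgent \
#   --metric-name c1_blocks_behind \
#   --dimensions Name=InstanceId,Value=$INSTANCE_ID \
#   --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
#   --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
#   --period 60 --statistics Average --region $AWS_REGION \
#   --query 'Datapoints[].Average' --output text | max_value)
# lag_exceeds "$peak" 10 && echo "Peak lag $peak is above the threshold"
```

A peak that repeatedly exceeds the value of `TRAFFIC_SHAPING_MAX_BLOCKS_BEHIND` while shaping is active points to the rate limit being too low rather than to a transient network issue.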
Traffic Shaping Metrics Not Appearing
Symptom: c1_blocks_behind metric not showing in CloudWatch
Diagnosis:
-
Check sync checker logs in CloudWatch:
# View syncchecker.service logs for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
-
Check CloudWatch agent status:
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following command inside the SSM session
sudo systemctl status amazon-cloudwatch-agent
-
Verify IAM permissions:
# Instance should have CloudWatch PutMetricData permission
aws sts get-caller-identity
Solutions:
-
Sync Checker Not Running: Start the service
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
sudo systemctl start syncchecker.timer
sudo systemctl status syncchecker.timer
Then verify in CloudWatch:
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
-
CloudWatch Agent Issue: Restart the agent
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following command inside the SSM session
sudo systemctl restart amazon-cloudwatch-agent
-
Script Error: Check for errors in sync checker
# View errors in CloudWatch for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service" | grep -i "error\|failed"
# Or run manually via SSM to see errors
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following command inside the SSM session
sudo /opt/blueprints/user-data/syncchecker.sh
-
Node Not Ready: Sync checker only runs after initial sync
- Check for /data/data/init-completed file
- Wait for node to complete initial synchronization
- View initialization progress in CloudWatch:
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow --log-stream-names $INSTANCE_ID | grep -i "init-completed"
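Waiting for the `/data/data/init-completed` marker can be automated instead of checked by hand. The following is a minimal sketch of a polling loop with a timeout. The helper name `wait_for_marker` and the default timeout and interval are assumptions for illustration only.

```bash
# Hypothetical helper: poll for a marker file until it exists or the
# timeout (in seconds) elapses. Returns 0 if found, 1 on timeout.
wait_for_marker() {
  local marker="$1" timeout="${2:-3600}" interval="${3:-30}"
  local waited=0
  while [ ! -e "$marker" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${waited}s waiting for $marker"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Found $marker after ${waited}s"
}

# Example (inside an SSM session): wait up to one hour, then check the timer.
# wait_for_marker /data/data/init-completed 3600 30 \
#   && sudo systemctl status syncchecker.timer
```

Initial synchronization can take much longer than an hour for some protocols, so a timeout here means "check again later", not that the node has failed.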
Traffic Shaping Scripts Missing
Symptom: Traffic shaping scripts not found on instance
Diagnosis:
-
Check if scripts exist:
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
ls -la /opt/network/
ls -la /opt/common/network/
-
Check asset download in CloudWatch:
# View cloud-init logs for asset download for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow --log-stream-names $INSTANCE_ID | grep -i "traffic shaping\|network"
Solutions:
-
Assets Not Downloaded: Check asset download
# Check cloud-init logs in CloudWatch for specific instance
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow --log-stream-names $INSTANCE_ID | grep -A 10 "traffic shaping"
-
Redeploy Stack: If assets missing
npx cdk destroy
npx cdk deploy --json --outputs-file deploy-output.json
-
Manual Copy: Temporarily copy scripts
# Connect via SSM
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following commands inside the SSM session
# If common assets exist but not copied
sudo mkdir -p /opt/network
sudo cp /opt/common/network/*.sh /opt/network/
sudo chmod +x /opt/network/*.sh
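After a manual copy it is worth confirming that the scripts are present and executable, since a missing execute bit produces the same "not working" symptom as a missing file. This is a minimal sketch. The helper name `check_scripts` is hypothetical, and it only inspects `*.sh` files in the directory it is given.

```bash
# Hypothetical helper: report *.sh files in a directory that are not
# executable. Returns 0 only if the directory exists, contains at least
# one script, and every script is executable.
check_scripts() {
  local dir="$1" script status=0 count=0
  if [ ! -d "$dir" ]; then
    echo "MISSING DIR: $dir"
    return 1
  fi
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue    # glob matched nothing
    count=$((count + 1))
    if [ ! -x "$script" ]; then
      echo "NOT EXECUTABLE: $script"
      status=1
    fi
  done
  if [ "$count" -eq 0 ]; then
    echo "NO SCRIPTS: $dir"
    return 1
  fi
  return "$status"
}

# Example (inside an SSM session):
# check_scripts /opt/network && echo "Traffic shaping scripts look fine"
```

The function prints nothing on success, so it can be used directly in a conditional.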
Security Issues
Cannot Access Instance
Symptom: Cannot connect via SSM Session Manager
Diagnosis:
- Verify IAM role has SSM permissions
- Check VPC endpoints (if using private subnets)
Solutions:
-
Check IAM Role: Ensure the AmazonSSMManagedInstanceCore policy is attached to the instance role
-
VPC Endpoints: Create SSM endpoints for private subnets
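Before changing IAM or networking, it can help to ask Systems Manager whether it knows about the instance at all. The sketch below assumes the status is fetched with `aws ssm describe-instance-information` and reduced to the `PingStatus` field. The helper name `explain_ping_status` and its messages are illustrative, not part of the node runner.

```bash
# Hypothetical helper: map an SSM PingStatus value to a next step.
# "None" is what the AWS CLI prints in text mode when the instance is
# not in the managed-instance list at all.
explain_ping_status() {
  case "$1" in
    Online)
      echo "SSM agent is registered and reachable"
      ;;
    ConnectionLost|Inactive)
      echo "SSM agent registered earlier but is not reporting: check instance state and network path"
      ;;
    None|"")
      echo "Instance is not registered with SSM: check the IAM role and VPC endpoints"
      ;;
    *)
      echo "Unexpected status: $1"
      ;;
  esac
}

# Example:
# status=$(aws ssm describe-instance-information \
#   --filters "Key=InstanceIds,Values=$INSTANCE_ID" \
#   --query 'InstanceInformationList[0].PingStatus' \
#   --output text --region $AWS_REGION)
# explain_ping_status "$status"
```

An instance that never registered usually points to the two causes listed above, while one that registered and then dropped off is more likely stopped, overloaded, or cut off from the SSM endpoints.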
Secrets Not Accessible
Symptom: Cannot retrieve secrets from Secrets Manager
Diagnosis:
aws secretsmanager get-secret-value --secret-id my-secret
Solutions:
-
Verify Secret ARN: Ensure ARN is correct in configuration
-
Check Region: Secret must be in same region as deployment
-
Check Secret Exists: Verify the secret was created
aws secretsmanager describe-secret --secret-id my-secret
-
Test from Instance: Connect to instance and test access
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION
# Run the following command inside the SSM session
aws secretsmanager get-secret-value --secret-id my-secret --region $AWS_REGION
Note: The default IAM role includes secretsmanager:GetSecretValue and secretsmanager:DescribeSecret permissions for all secrets. If you need to restrict access to specific secrets, you can modify the IAM role after deployment.
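The region requirement can be checked directly from the secret's ARN, because the region is the fourth colon-separated field of any ARN. This is a minimal sketch. The helper names `arn_region` and `secret_region_matches` are hypothetical, and the ARN in the example is made up.

```bash
# Hypothetical helper: print the region field of an ARN
# (arn:partition:service:region:account-id:resource).
arn_region() {
  printf '%s\n' "$1" | cut -d: -f4
}

# Hypothetical check: compare the secret's region with the deployment region.
secret_region_matches() {
  local arn="$1" region="$2" arn_reg
  arn_reg=$(arn_region "$arn")
  if [ "$arn_reg" = "$region" ]; then
    echo "OK: secret is in $region"
  else
    echo "MISMATCH: secret is in '${arn_reg}', deployment uses '${region}'"
    return 1
  fi
}

# Example (illustrative ARN):
# secret_region_matches \
#   "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret-AbCdEf" \
#   "$AWS_REGION"
```

If only the secret name is configured rather than the full ARN, `aws secretsmanager describe-secret` in the deployment region returns the ARN, or fails if the secret lives elsewhere.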
Getting Additional Help
Collect Diagnostic Information
Before requesting help, collect:
-
Configuration:
cat .env | grep -v "SECRET\|PASSWORD" > .env-support # Redact sensitive data
# Inspect the installed blueprint's package.json
cat node_modules/aws-bnr-blueprint-{protocol}/package.json | jq '."aws-blockchain-node-runner"' > protocol-config-support.json
-
Logs:
sudo cat /var/log/cloud-init-output.log
sudo journalctl -u node -n 200
-
System Info:
uname -a
df -h
free -h
-
CloudFormation Events:
aws cloudformation describe-stack-events --stack-name YourStack
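Dropping sensitive lines with `grep -v` also hides which settings exist, which can slow down support. An alternative is to keep the key names and mask only the values. The following is a minimal sketch using the same `SECRET` and `PASSWORD` patterns as the command above. The helper name `redact_env` is hypothetical, and any other sensitive key names in your `.env` would need to be added to the pattern.

```bash
# Hypothetical helper: mask the values of keys whose names contain
# SECRET or PASSWORD, leaving all other lines unchanged.
# Reads from stdin and writes to stdout.
redact_env() {
  sed -E 's/^([A-Za-z0-9_]*(SECRET|PASSWORD)[A-Za-z0-9_]*)=.*/\1=<redacted>/'
}

# Example:
# redact_env < .env > .env-support
```

Review the resulting file before sharing it, since values such as API keys or tokens may use names that the pattern does not cover.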
Support Channels
- GitHub Issues: Report bugs and request features
- Documentation: Check docs/ for guides
- AWS Support: For AWS-specific issues
Useful Commands Reference
# CDK Commands
npx cdk synth # Synthesize CloudFormation template
npx cdk deploy --json --outputs-file deploy-output.json # Deploy stack
npx cdk destroy # Destroy stack
npx cdk diff # Show differences
# Get Instance ID from deployment outputs
export INSTANCE_ID=$(cat deploy-output.json | jq -r '..|.InstanceId? | select(. != null)')
echo "INSTANCE_ID=$INSTANCE_ID"
# AWS CLI Commands
aws ssm start-session --target $INSTANCE_ID --region $AWS_REGION # Connect to instance
aws logs tail /aws/ec2/blockchain-nodes/cloud-init-output --follow --log-stream-names $INSTANCE_ID # View deployment logs
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID # View systemd service logs
aws cloudformation describe-stacks --stack-name YourStack # Stack info
# CloudWatch Logs Commands
# View specific service logs for specific instance
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "node.service"
# Check on Ethereum execution client like Geth, Reth, Erigon, or Hyperledger Besu
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "execution"
# Check on Ethereum consensus client like Lighthouse, Prysm, or Teku
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "consensus"
# Check on Syncchecker
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "syncchecker.service"
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID --filter-pattern "net-rules.service"
# View logs for specific instance (alternative without filter)
aws logs tail /aws/ec2/blockchain-nodes/systemd-services --follow --log-stream-names $INSTANCE_ID
# Instance Commands (via SSM)
sudo systemctl status node # Check service status
sudo journalctl -u node -f # Follow service logs (if CloudWatch not available)
df -h # Disk usage
free -h # Memory usage
top # Process monitor
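Most of the log commands in this guide differ only in the filter pattern, so a small wrapper can cut down on retyping. The sketch below is one way to do that. The function name `tail_service_logs` and the `DRY_RUN` switch are hypothetical conveniences, while the log group and the `aws logs tail` options are the ones used throughout this guide.

```bash
# Log group used by the commands in this guide; override if yours differs.
LOG_GROUP="${LOG_GROUP:-/aws/ec2/blockchain-nodes/systemd-services}"

# Hypothetical wrapper: tail one service's logs for one instance.
#   $1 = filter pattern (for example node.service); empty for no filter
#   $2 = instance ID (defaults to $INSTANCE_ID)
# With DRY_RUN=1 the command is printed instead of executed.
tail_service_logs() {
  local service="$1" instance_id="${2:-$INSTANCE_ID}"
  set -- aws logs tail "$LOG_GROUP" --follow --log-stream-names "$instance_id"
  if [ -n "$service" ]; then
    set -- "$@" --filter-pattern "$service"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Examples:
# tail_service_logs node.service
# tail_service_logs syncchecker.service
# tail_service_logs net-rules.service
```

Setting `DRY_RUN=1` first is a quick way to confirm that `INSTANCE_ID` is populated before starting a long-running tail.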
Prevention Best Practices
-
Use Sample Configurations: Start with provided sample .env files
-
Monitor from Day One: Set up CloudWatch alarms immediately
-
Document Changes: Keep notes on configuration changes
-
Stay Updated: Keep protocol clients and CDK updated
-
Review Logs: Regularly check logs for warnings
-
Capacity Planning: Monitor growth and plan for scaling
See Also
- Configuration Reference - Complete configuration documentation
- Deployment Guide - Deployment best practices
- Snapshot Staging - Staging volume for large snapshot downloads
- Adding New Protocols - Protocol addition guide
- Design Document - System architecture and design decisions