Spot Workloads on EKS Auto Mode
Prerequisites
Cluster deployed and kubectl configured per Quick Start.
Overview
Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity at steep discounts. Key benefits include:
💰 Cost Optimization
- Up to 90% cost savings compared to On-Demand instances
- Ideal for fault-tolerant, flexible workloads
- Pay only for what you use
⚡ Scalability
- Access to large-scale compute capacity
- Perfect for batch processing and stateless applications
- Automatic capacity rebalancing
🔄 Flexibility
- Mix of instance types and sizes
- Automatic instance selection based on availability
- Graceful interruption handling
Architecture
This example demonstrates how to run workloads on Spot instances in EKS Auto Mode using Karpenter's spot instance management capabilities.
Key Components:
📄 NodePool Template
- Defines Spot instance requirements
- Available at ../../nodepools/spot-nodepool.yaml
- Supports c, m, and r instance families
- ARM64 architecture for cost efficiency
🔄 Load Balancer
- Application Load Balancer (ALB)
- Exposes the application to external traffic
🎮 Sample Application
- 2048 game (sliding tile puzzle)
- Stateless application ideal for spot instances
Implementation Steps
1. Deploy Spot NodePool
Deploy the NodePool that will manage our Spot instances:
kubectl apply -f ../../nodepools/spot-nodepool.yaml
⚠️ The Spot NodePool applies the following taint to ensure workloads are spot-aware:
taints:
  - key: "spot"
    value: "true"
    effect: "NoSchedule" # Prevents non-spot-aware pods from scheduling
Any pods that need to run on Spot nodes must include matching tolerations in their specifications. This ensures workloads are designed to handle Spot instance interruptions.
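To confirm that the taint is actually present on the deployed NodePool, something like the following sketch may help. It assumes the NodePool is named spot-nodepool (the name used in the Troubleshooting section) and that taints live under spec.template.spec.taints, as in the Karpenter v1 NodePool schema. The function names are hypothetical helpers, not part of this repository.

```shell
# Hypothetical helper: prints each taint on the NodePool template as key=value:effect.
# Assumption: the NodePool is named "spot-nodepool" unless another name is passed.
nodepool_taints() {
  kubectl get nodepool "${1:-spot-nodepool}" \
    -o jsonpath='{range .spec.template.spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Hypothetical helper: succeeds only if the spot=true:NoSchedule taint is configured.
nodepool_has_spot_taint() {
  nodepool_taints "$1" | grep -qx 'spot=true:NoSchedule'
}
```

Running `nodepool_has_spot_taint && echo "taint present"` after step 1 should print the message if the NodePool was applied as described.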
2. Deploy the 2048 Game
Deploy our spot-compatible 2048 game application:
kubectl apply -f game-2048.yaml
✅ The 2048 game deployment includes the required configuration for Spot instances:
tolerations:
  - key: "spot" # Matches the Spot node taint
    value: "true"
    effect: "NoSchedule" # Allows scheduling on tainted nodes
nodeSelector:
  karpenter.sh/capacity-type: spot # Ensures pods run on spot instances
This configuration ensures the pods can run on Spot instances and are scheduled appropriately.
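One way to check that the pods really landed on Spot capacity is to look up the capacity-type label of the node each pod runs on. The sketch below assumes the application namespace is game-2048-spot (the one used in the port-forward step). The helper name is hypothetical.

```shell
# Hypothetical helper: prints "<pod> <node> <capacity-type>" for every pod in the namespace.
# Assumption: namespace defaults to game-2048-spot.
pod_capacity_types() (
  ns="${1:-game-2048-spot}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
  while read -r pod node; do
    # Pending pods have no node assigned yet.
    [ -n "$node" ] || { echo "$pod <unscheduled> -"; continue; }
    # Read the node's karpenter.sh/capacity-type label (dots in the key are escaped).
    ct=$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.karpenter\.sh/capacity-type}' </dev/null)
    echo "$pod $node ${ct:-unknown}"
  done
)
```

Calling `pod_capacity_types` should list spot in the last column for every scheduled pod. A pod shown as unscheduled may simply be waiting for a node to be provisioned.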
3. Configure Load Balancer
Set up the Application Load Balancer using Ingress:
kubectl apply -f 2048-ingress.yaml
4. Access the Application
By default, this example exposes its UI through an internal ALB that is reachable only from inside the VPC. To access it from your laptop, use kubectl port-forward:
kubectl port-forward -n game-2048-spot svc/service-2048 8080:80
# then open http://localhost:8080
If you want to inspect the ALB DNS name directly (e.g. from a bastion or VPN):
kubectl get ingress ingress-2048 \
-o jsonpath='{.status.loadBalancer.ingress[0].hostname}' \
-n game-2048-spot
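The hostname field can be empty for a while after the Ingress is created, since the ALB takes time to provision. A minimal polling sketch around the command above could look like this. The function name, attempt count, and delay are illustrative assumptions.

```shell
# Hypothetical helper: polls the Ingress until the ALB hostname is published.
# Usage: wait_for_alb_hostname [attempts] [delay_seconds]  (defaults: 30 attempts, 10s apart)
wait_for_alb_hostname() (
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    host=$(kubectl get ingress ingress-2048 -n game-2048-spot \
      -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null) || host=""
    if [ -n "$host" ]; then
      echo "$host"
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "ALB hostname not available after $attempts attempts" >&2
  exit 1
)
```

For example, `wait_for_alb_hostname 12 5` would check every 5 seconds for up to a minute and print the hostname once it appears.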
To expose the UI publicly over HTTPS, deploy the Terraform stack with var.base_domain set to a public Route53 zone you own (see top-level README). The example will be reachable at https://2048-spot.<full_domain> once external-dns publishes the record. 🎮
Cleanup
🧹 Follow these steps to clean up all resources:
Remove the application and node pool:
kubectl delete -f 2048-ingress.yaml
kubectl delete -f game-2048.yaml
kubectl delete -f ../../nodepools/spot-nodepool.yaml
Troubleshooting
🔧 Common issues and their solutions:
🎯 Spot Instance Issues
- Capacity Unavailability
  - Monitor instance capacity with AWS CLI:
    aws ec2 describe-spot-instance-requests \
      --filters "Name=status-code,Values=capacity-not-available"
  - Check NodePool events:
    kubectl describe nodepool spot-nodepool
- Instance Interruptions
  - Monitor interruption events:
    kubectl get events --field-selector reason=SpotInterruption
  - Review pod eviction status:
    kubectl get pods -n game-2048-spot -o wide
🔄 Load Balancer Issues
- ALB Configuration
  # Check ALB controller logs
  kubectl logs -n kube-system \
    deployment/aws-load-balancer-controller
- Ingress Status
  # Check ingress status
  kubectl describe ingress ingress-2048 -n game-2048-spot
💡 Tip: Use kubectl get events to monitor Spot instance lifecycle events and pod rescheduling.