Pod Autoscaling on EKS Auto Mode

Prerequisites
Overview
Part 1: Horizontal Pod Autoscaler (HPA)
Part 2: KEDA Event-Driven Autoscaling
Troubleshooting

Prerequisites

A cluster deployed and kubectl configured as described in the Quick Start.

Overview

Pod autoscaling adjusts the number of running pods to match demand, keeping applications responsive under load without paying for idle capacity. This example demonstrates two complementary approaches:

📊 Horizontal Pod Autoscaler (HPA)

CPU and memory-based scaling
Built-in Kubernetes functionality
Ideal for traditional web applications

🎯 KEDA (Kubernetes Event-Driven Autoscaling)

Event-driven scaling based on external metrics
Supports 60+ scalers (SQS, Kafka, Redis, etc.)
Perfect for event-driven and batch workloads

⚡ Key Benefits

Automatic scaling based on demand
Cost optimization through right-sizing
Improved application performance and availability

Part 1: Horizontal Pod Autoscaler (HPA)

HPA Architecture

The HPA automatically scales the number of pods in a deployment based on observed CPU utilization, memory usage, or custom metrics.

How it works:

📈 Metrics Collection: Metrics Server collects resource usage from pods
🔍 Evaluation: HPA controller evaluates metrics against target thresholds
⚖️ Scaling Decision: Calculates desired replica count based on current vs target metrics
🔄 Pod Adjustment: Updates deployment replica count to match demand
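
The scaling decision follows the formula documented for the Kubernetes HPA: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below reproduces that arithmetic in shell. The helper name hpa_desired_replicas is hypothetical (it is not part of kubectl), and the sketch ignores the controller's tolerance band and the min/max replica clamp.

```shell
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Hypothetical helper illustrating the documented HPA formula:
#   desired = ceil(currentReplicas * currentMetric / targetMetric)
# Integer ceiling division keeps this free of bc/awk.
hpa_desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# One replica at 305% CPU utilization against a 50% target
hpa_desired_replicas 1 305 50   # prints 7
```

As load spreads across the new replicas, per-pod utilization drops, so the controller re-evaluates and may settle on a slightly different count.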

Key Components:

Metrics Server: Collects resource metrics from kubelets
HPA Controller: Makes scaling decisions based on metrics
Target Deployment: The workload being scaled
Load Generator: Simulates traffic to trigger scaling

HPA Implementation Steps

1. Install Metrics Server

The HPA requires the Metrics Server to collect resource metrics from pods:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

📘 Note: The Metrics Server collects resource metrics from kubelets and exposes them through the Kubernetes API. It may take a minute or two to become ready.

Verify the Metrics Server is running:

kubectl get pods -n kube-system | grep metrics-server
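
If you would rather block until the Metrics Server is ready than poll by hand, a helper along these lines may be useful. It is only a sketch: the function name is hypothetical, and it assumes the manifest above creates a Deployment named metrics-server in the kube-system namespace.

```shell
# Hypothetical helper: wait for the Metrics Server rollout, then confirm
# that the metrics API answers.
# Assumes a Deployment named "metrics-server" in kube-system.
wait_for_metrics_server() {
  kubectl rollout status deployment/metrics-server \
    --namespace kube-system --timeout=120s || return 1
  # "kubectl top" only succeeds once metrics are actually being served,
  # which can lag the rollout by a short while; rerun if it fails at first.
  kubectl top nodes
}
```

Run wait_for_metrics_server after applying the manifest; a non-zero exit status means the rollout timed out or metrics are not yet available.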

2. Deploy the Sample Application

Deploy a PHP Apache server that will serve as our scaling target:

kubectl apply -f hpa/php-apache.yaml

✅ Application Details: The php-apache deployment includes:

CPU Request: 200m (200 millicores)

CPU Limit: 500m (500 millicores)

Container: registry.k8s.io/hpa-example - a simple PHP server

Service: Exposes the application on port 80

3. Create the HorizontalPodAutoscaler

Create an HPA that maintains between 1 and 10 replicas based on CPU utilization:

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

Alternatively, you can use the declarative approach:

kubectl apply -f hpa/hpa.yaml
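
The contents of hpa/hpa.yaml are not reproduced here. As a rough guide, a manifest equivalent to the kubectl autoscale command above could look like the following sketch. It is written to a hypothetical file named hpa-sketch.yaml so that it does not overwrite the repository's own file.

```shell
# Write an autoscaling/v2 manifest roughly equivalent to:
#   kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
# The file name hpa-sketch.yaml is an assumption for illustration only.
cat > hpa-sketch.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # percent of the pod's CPU request
EOF
```

The declarative form is easier to review and version than the imperative command, and it leaves room for additional metrics later.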

4. Verify HPA Status

Check the current status of the HPA:

kubectl get hpa

Expected output:

NAME         REFERENCE                     TARGET    MINPODS   MAXPODS   REPLICAS   AGE
php-apache   Deployment/php-apache/scale   0% / 50%  1         10        1          18s

📘 Note: The current CPU consumption shows 0% because there's no load on the server yet.

5. Generate Load to Trigger Scaling

Start a load generator to increase CPU utilization:

# Run this in a separate terminal
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"

6. Watch the Scaling in Action

In another terminal, monitor the HPA scaling behavior:

# Press Ctrl+C to stop watching when ready
kubectl get hpa php-apache --watch

Within a minute, you should see increased CPU load:

NAME         REFERENCE                     TARGET      MINPODS   MAXPODS   REPLICAS   AGE
php-apache   Deployment/php-apache/scale   305% / 50%  1         10        1          3m

Then watch the replica count increase (it typically stabilizes at 6 to 8 replicas):

NAME         REFERENCE                     TARGET      MINPODS   MAXPODS   REPLICAS   AGE
php-apache   Deployment/php-apache/scale   305% / 50%  1         10        8          3m

Verify the deployment scaling:

kubectl get deployment php-apache

Expected output:

NAME         READY   UP-TO-DATE   AVAILABLE   AGE
php-apache   8/8     8            8           19m

7. Stop Load Generation and Observe Scale-Down

Stop the load generator by pressing Ctrl+C in the load generator terminal.

Monitor the scale-down process:

kubectl get hpa php-apache --watch

After a few minutes, you'll see the CPU utilization drop and replicas scale back down:

NAME         REFERENCE                     TARGET       MINPODS   MAXPODS   REPLICAS   AGE
php-apache   Deployment/php-apache/scale   0% / 50%     1         10        1          11m

⏱️ Scaling Timing:

Scale-up: Typically occurs within 1-2 minutes of increased load

Scale-down: Takes 5-10 minutes, because the HPA waits out a 5-minute stabilization window (the default) before reducing replicas
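
The scale-down delay comes from the HPA's scaleDown stabilization window, which defaults to 300 seconds. For a demo you may prefer a shorter window. The sketch below writes a patch fragment that sets it; the file name hpa-behavior-patch.yaml and the 60-second value are assumptions chosen for illustration.

```shell
# Patch fragment that shortens the scale-down stabilization window.
# 60 seconds is an illustrative value; the Kubernetes default is 300.
cat > hpa-behavior-patch.yaml <<'EOF'
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
EOF
```

On recent kubectl versions, something like `kubectl patch hpa php-apache --type merge --patch-file hpa-behavior-patch.yaml` should apply it. A shorter window makes scale-down faster at the cost of more replica churn when load fluctuates.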

HPA Cleanup

Remove the HPA resources:

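# Remove the load generator pod if it still exists (--rm normally deletes it on exit)
kubectl delete pod load-generator --ignore-not-found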

# Remove the HPA
kubectl delete hpa php-apache

# Remove the application
kubectl delete -f hpa/php-apache.yaml

📚 Attribution: This HPA demo was adapted from the Kubernetes HorizontalPodAutoscaler Walkthrough.

Part 2: KEDA Event-Driven Autoscaling

KEDA Architecture

KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes with event-driven autoscaling capabilities beyond traditional CPU/memory metrics. Key benefits include:

🎯 Event-Driven Scaling

Scale based on external metrics (SQS queue depth, Kafka lag, etc.)
Support for 60+ scalers including AWS services
Zero-to-N and N-to-zero scaling capabilities

🚀 Advanced Capabilities

Custom metrics from external systems
Integration with Horizontal Pod Autoscaler
Seamless integration with Karpenter for node-level scaling

⚡ Perfect for Modern Workloads

Batch processing jobs
Event-driven microservices
AI/ML inference workloads
Queue-based processing systems

This example demonstrates KEDA scaling a GPU-based AI model inference workload based on Amazon SQS queue depth.

How it works:

📨 Message Queue: SQS receives inference requests
📊 KEDA Monitoring: ScaledObject monitors queue depth
🔄 Scaling Decision: KEDA scales pods based on queue metrics
🤖 GPU Processing: Scaled pods process inference requests
📉 Scale Down: Pods scale to zero when queue is empty

Key Components:

KEDA Controller: Manages event-driven scaling
ScaledObject: Defines scaling behavior and triggers
SQS Queue: Message queue for inference requests
GPU Inference Pods: AI model serving containers
Karpenter Integration: Automatic node provisioning for GPU workloads

KEDA Implementation Steps

⚠️ Prerequisites:

GPU Instance Availability: Ensure your AWS account has sufficient service quota for GPU instances

Helm: Required for KEDA installation

1. Setup AWS Infrastructure

Deploy the required AWS resources (SQS queue, IAM roles) using Terraform:

cd keda/terraform

terraform init
terraform apply -auto-approve

📦 AWS Resources Created:

SQS Queue: For inference request messages

IAM Roles: Roles assumed by the KEDA and SQS reader service accounts

IAM Policies: Permissions for queue operations

2. Configure Kubernetes Resources

Set up the necessary namespaces and service accounts:

cd ..

kubectl apply -f namespace.yaml
kubectl apply -f keda-service-account.yaml
kubectl apply -f vllm-qwen3/namespace.yaml
kubectl apply -f sqs-reader-service-account.yaml

✅ Service Account Details:

KEDA Service Account: Allows KEDA to read SQS metrics

SQS Reader Service Account: Enables pods to consume SQS messages

IAM Role Annotations: Links Kubernetes service accounts to AWS IAM roles
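
The service account manifests themselves are not shown here. For orientation, an annotated service account of the kind described above might look like the following sketch. It assumes the IAM Roles for Service Accounts annotation (eks.amazonaws.com/role-arn); the service account name, account ID, role name, and output file name are all placeholders, not values from this repository.

```shell
# Sketch of a service account linked to an IAM role by annotation.
# Every name and the ARN below are placeholders for illustration.
cat > service-account-sketch.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: keda-operator          # placeholder name
  namespace: keda
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/EXAMPLE-KEDA-ROLE
EOF
```

In practice the role ARN would come from the Terraform outputs of step 1 rather than being typed by hand.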

3. Install KEDA with Helm

Deploy the KEDA controller with custom configuration:

# Add KEDA Helm repository
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA with custom values
helm install keda kedacore/keda \
  --namespace keda \
  --version 2.17.0 \
  --values keda-helm-values.yaml

Verify KEDA installation:

kubectl get pods -n keda

Expected output:

NAME                                      READY   STATUS    RESTARTS   AGE
keda-admission-webhooks-xxx               1/1     Running   0          2m
keda-operator-xxx                         1/1     Running   0          2m
keda-operator-metrics-apiserver-xxx       1/1     Running   0          2m

📘 KEDA Components:

Operator: Main KEDA controller managing ScaledObjects

Metrics API Server: Exposes external metrics (such as queue depth) to the HPA

Admission Webhooks: Validates KEDA resource configurations

4. Deploy GPU NodePool

Ensure GPU nodes are available for the AI workload:

# Deploy GPU-enabled NodePool
kubectl apply -f ../../../nodepools/gpu-nodepool.yaml

⚠️ GPU Node Configuration: The NodePool includes:

Instance Types: G5, G6, or G6e instances optimized for ML workloads

Taints: nvidia.com/gpu=true:NoSchedule to ensure only GPU workloads are scheduled

Labels: Proper GPU node identification for workload placement
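
Because of the taint above, a pod lands on these nodes only if it carries a matching toleration. The fragment below sketches what that could look like in a Deployment's pod template; the container name, the single-GPU limit, and the file name are assumptions, and the actual workload manifest in step 5 may differ.

```shell
# Pod-template fragment tolerating the nvidia.com/gpu=true:NoSchedule taint.
# Container name and GPU count are illustrative assumptions.
cat > gpu-toleration-sketch.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: inference        # placeholder name
        resources:
          limits:
            nvidia.com/gpu: 1  # request one GPU per pod
EOF
```

The toleration only permits scheduling on tainted nodes; it is the GPU resource limit that actually requires a GPU node.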

5. Deploy AI Model with SQS Consumer

Deploy the GPU-based inference workload that will be scaled by KEDA:

kubectl apply -f vllm-qwen3/model-qwen3-4b-fp8-with-sqs.yaml

🤖 Model Details:

Model: Qwen3-4B-FP8 optimized for GPU inference

SQS Integration: Built-in consumer for processing queue messages

GPU Tolerations: Configured to run on GPU-tainted nodes

Resource Requests: Optimized for efficient GPU utilization

Verify the deployment:

kubectl get pods -n vllm
kubectl get deployments -n vllm

Initially, you should see 0 replicas since there are no messages in the queue.

6. Deploy KEDA ScaledObject

Create the ScaledObject that defines the scaling behavior:

kubectl apply -f scaledObject.yaml

📊 Scaling Configuration:

Trigger: SQS queue depth

Target: 5 messages per pod

Min Replicas: 0 (scale to zero when idle)

Max Replicas: 10 (adjust based on your needs)

Cooldown: Prevents rapid scaling oscillations
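
The repository's scaledObject.yaml is not reproduced here. A ScaledObject matching the configuration listed above might look roughly like the sketch below. The queue URL, region, cooldown value, and identityOwner setting are assumptions, and the file is written under a hypothetical name so it does not clash with the real manifest.

```shell
# Sketch of a ScaledObject using KEDA's aws-sqs-queue trigger.
# queueURL, awsRegion, cooldownPeriod, and identityOwner are assumed values.
cat > scaledobject-sketch.yaml <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-qwen3-4b-fp8-scaler
  namespace: vllm
spec:
  scaleTargetRef:
    name: model-qwen3-4b-fp8
  minReplicaCount: 0           # scale to zero when idle
  maxReplicaCount: 10
  cooldownPeriod: 300          # seconds to wait before scaling to zero
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-west-2.amazonaws.com/111122223333/EXAMPLE-QUEUE
      queueLength: "5"         # target messages per pod
      awsRegion: us-west-2
      identityOwner: operator  # use the KEDA operator's IAM identity
EOF
```

The names in metadata and scaleTargetRef match the expected kubectl output for this step; everything under triggers is illustrative.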

Verify the ScaledObject:

kubectl get scaledobject -n vllm

Expected output:

NAME                        SCALETARGETKIND      SCALETARGETNAME      MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED    TRIGGERS        AUTHENTICATIONS   AGE
model-qwen3-4b-fp8-scaler   apps/v1.Deployment   model-qwen3-4b-fp8   0     10    True    False    Unknown    Unknown   aws-sqs-queue                     6s

7. Test the Scaling Behavior

Generate test messages to trigger scaling:

# Deploy job that generates 50 inference requests
kubectl apply -f prompt-generator-job.yaml

🧪 Test Scenario: The prompt generator creates 50 sample inference requests in the SQS queue, simulating real-world load.
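
To anticipate how far this test should scale, divide the queue depth by the per-pod target and clamp the result to the configured bounds. The sketch below does that arithmetic; the helper name keda_desired_replicas is hypothetical, and the calculation is a simplification of what KEDA and the HPA do together.

```shell
# keda_desired_replicas QUEUE_DEPTH TARGET_PER_POD MIN_REPLICAS MAX_REPLICAS
# Hypothetical helper: ceil(depth / target), clamped to [min, max].
keda_desired_replicas() {
  desired=$(( ($1 + $2 - 1) / $2 ))
  if [ "$desired" -lt "$3" ]; then desired=$3; fi
  if [ "$desired" -gt "$4" ]; then desired=$4; fi
  echo "$desired"
}

# 50 queued messages, 5 messages per pod, bounds 0..10
keda_desired_replicas 50 5 0 10   # prints 10
```

With 50 messages and a target of 5 per pod, the workload is expected to reach the 10-replica maximum, provided enough GPU capacity can be provisioned.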

8. Monitor the Scaling Process

Watch the scaling behavior in real-time:

Check SQS Queue Depth (wait for the job to finish sending messages first):

# -chdir reads the Terraform output without leaving the keda directory
QUEUE_URL=$(terraform -chdir=terraform output -raw sqs_url)
aws sqs get-queue-attributes \
  --queue-url "$QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages
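
The command above returns a JSON document. If you only want the number, for example to use in a script, the AWS CLI's --query and --output options can extract it. The wrapper below is a sketch and the function name sqs_queue_depth is hypothetical.

```shell
# Hypothetical helper: print only the visible-message count for a queue.
# Usage: sqs_queue_depth QUEUE_URL
sqs_queue_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}
```

Calling sqs_queue_depth "$QUEUE_URL" prints a single integer, which SQS documents as an approximate count.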

Monitor ScaledObject Status:

kubectl describe scaledobject model-qwen3-4b-fp8-scaler -n vllm

Watch Deployment Scaling:

kubectl get deployment model-qwen3-4b-fp8 -n vllm --watch

Monitor Pod Creation:

kubectl get pods -n vllm --watch

⏱️ Expected Timeline:

0-1 min: Messages appear in SQS queue (50 messages)

1-2 min: KEDA detects queue depth and triggers scaling

2-5 min: GPU nodes provision and pods start (model download begins)

5-6 min: Pods become ready and start consuming messages

6-7 min: All messages processed, queue becomes empty

7-8 min: Pods scale down to zero after cooldown period

9. Observe the Processing

Monitor the actual inference processing:

Check Pod Logs (model initialization):

kubectl logs -n vllm deployment/model-qwen3-4b-fp8 -f

Monitor Message Processing:

# Watch queue depth decrease as messages are processed
watch "aws sqs get-queue-attributes --queue-url $QUEUE_URL --attribute-names ApproximateNumberOfMessages"

🎯 Success Indicators:

Queue depth increases to 50, then decreases to 0

Deployment scales from 0 to multiple replicas, then back to 0

Pod logs show model loading and inference processing

Processing completes within expected timeframe

KEDA Cleanup

🧹 Follow these steps to clean up all KEDA resources:

1. Remove Application Resources

# Remove test job and scaling resources
kubectl delete job prompt-generator -n keda --ignore-not-found
kubectl delete -f scaledObject.yaml --ignore-not-found
kubectl delete -f vllm-qwen3/model-qwen3-4b-fp8-with-sqs.yaml --ignore-not-found

# Remove GPU NodePool created in step 4
kubectl delete -f ../../../nodepools/gpu-nodepool.yaml --ignore-not-found

2. Uninstall KEDA

# Remove KEDA Helm installation
helm uninstall keda -n keda

3. Remove Kubernetes Resources

# Clean up service accounts and namespaces
kubectl delete -f sqs-reader-service-account.yaml --ignore-not-found
kubectl delete -f keda-service-account.yaml --ignore-not-found
kubectl delete namespace keda --ignore-not-found
kubectl delete namespace vllm --ignore-not-found

4. Destroy AWS Infrastructure

# Remove AWS resources (SQS, IAM roles)
cd terraform
terraform destroy -auto-approve

⚠️ Warning: This will remove all AWS resources created for the KEDA demo, including the SQS queue and IAM roles.
