GPU Workloads on EKS Auto Mode

Prerequisites
Overview
Architecture
Implementation Steps
Clean up
Troubleshooting

Prerequisites

A cluster deployed and kubectl configured as described in the Quick Start.

Overview

NVIDIA GPU instances on Amazon EC2 accelerate compute-intensive workloads such as model training and inference. Key benefits include:

🚀 High Performance Computing

GPU accelerators from g3, g4, g5, g6, p3, and p4 families
Optimized for machine learning and graphics workloads
Ideal for running large language models

🤖 AI/ML Capabilities

Perfect for GenAI model deployment
Supports complex deep learning tasks
Accelerated model inference

⚙️ Flexible Configuration

Customizable instance types
Scalable GPU resources
EKS Auto Mode integration

This example demonstrates deploying a GenAI model (Qwen 3 32b fp8) on EKS Auto Mode.

⚠️ Prerequisites:

You must have a Hugging Face account with an access token!

GPU Instance Availability: Many AWS accounts have a default service quota of 0 for p* and g* GPU instance types. You may need to request a quota increase through the AWS Service Quotas console before deploying GPU workloads. This process can take 24-48 hours for approval.
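
One way to check your current GPU quotas before deploying is with the AWS CLI. The sketch below assumes AWS CLI v2 is installed and configured for the target account and Region, and that the relevant quotas are the ones named "Running On-Demand G and VT instances" and "Running On-Demand P instances" (their values are expressed in vCPUs). The function name show_gpu_quotas is illustrative and not part of this example.

```shell
# Illustrative helper (not part of this repo): list the EC2 On-Demand quotas
# for the G and P instance families. Assumes AWS CLI v2 with valid credentials.
show_gpu_quotas() {
  aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'On-Demand G') || contains(QuotaName, 'On-Demand P')].[QuotaName, Value]" \
    --output table
}
```

Run show_gpu_quotas after defining it. A value of 0 suggests you need to request an increase before any GPU node can launch.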

Architecture

This example showcases GPU-accelerated workloads on EKS Auto Mode using the following components:

🖥️ Instance Types

Default: G5, G6 or G6e instances (optimized for ML workloads)
Customization: Available in gpu-nodepool.yaml.tpl

🔧 Key Components

📦 Infrastructure

NodePool and NodeClass for GPU workload management
Application Load Balancer (Ingress) for HTTP access. The scheme is internal by default; an internet-facing ALB with HTTPS is opt-in via var.base_domain

🧠 AI Components

Hugging Face model deployment (Qwen 3 32b fp8)
Interactive Web UI for model interaction

Implementation Steps

1. Get Hugging Face Access Token

Create a Hugging Face account and generate a fine-grained access token.

2. Deploy GPU NodePool

Deploy the NodePool that will manage our GPU instances:

kubectl apply -f ../../nodepools/gpu-nodepool.yaml

⚠️ The GPU NodePool applies the following taint to ensure only GPU-compatible workloads are scheduled on these nodes:
taints:
  - key: "nvidia.com/gpu"
    value: "true"
    effect: "NoSchedule"   # Prevents non-GPU pods from scheduling
Any pods that need to run on GPU nodes must include matching tolerations in their specifications.
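
To confirm the taint on running nodes, you can list each node together with its taint keys. This is a minimal sketch; the function name show_node_taints is illustrative. GPU nodes are created on demand, so the list may show no GPU nodes until a pod that requests a GPU is pending.

```shell
# Illustrative helper: print every node with the keys of its taints.
# GPU nodes from this NodePool should show nvidia.com/gpu in the second column.
show_node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINT_KEYS:.spec.taints[*].key'
}
```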

3. Configure Namespace and Secrets

Create Namespace:

kubectl apply -f namespace.yaml

Add Hugging Face Token:

# Replace <your_actual_hugging_face_token> with your token
kubectl create secret generic hf-secret \
  -n vllm-inference \
  --from-literal=hf_api_token=<your_actual_hugging_face_token>
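
If you prefer not to type the token on the command line, a small wrapper can read it from an environment variable instead. This is a sketch under the assumption that the token is held in a variable named HF_TOKEN (a name chosen here for illustration); the Secret name, namespace, and key are the ones used above.

```shell
# Illustrative helper: create hf-secret from the HF_TOKEN environment variable.
# Fails early if the variable is empty so no empty Secret is created.
create_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  kubectl create secret generic hf-secret \
    -n vllm-inference \
    --from-literal=hf_api_token="$HF_TOKEN"
}
```

In bash you could set the variable without echoing it by running read -rs HF_TOKEN, then call create_hf_secret.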

4. Deploy Model and UI

Deploy the Model: The following command deploys Qwen3 32b (fp8). A separate manifest is also available if you want to deploy DeepSeek instead.

kubectl apply -f model-qwen3-32b-fp8.yaml

✅ The model deployment includes the required toleration to run on GPU nodes:
tolerations:
  - key: "nvidia.com/gpu"     # Matches the GPU node taint
    value: "true"
    effect: "NoSchedule"      # Allows scheduling on tainted nodes
This toleration enables the pods to be scheduled on our GPU-enabled instances.
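
To see whether the toleration had the intended effect, you can list the pods in the namespace together with the node each one was scheduled on. A minimal sketch follows; the function name show_pod_placement is illustrative.

```shell
# Illustrative helper: show each pod in vllm-inference with its node and phase.
# A pod that is still Pending has no node yet, so NODE prints <none>.
show_pod_placement() {
  kubectl get pods -n vllm-inference \
    -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase'
}
```

A model pod that stays Pending for a long time usually points to a missing toleration, an unsatisfied GPU request, or insufficient quota.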

Deploy the Web UI:

kubectl apply -f open-webui.yaml

5. Deploy the Service and Ingress

Apply the ClusterIP Service + ALB Ingress that front the Web UI:

kubectl apply -f lb-service.yaml

📘 The manifest provisions a ClusterIP Service named open-webui-service (port 80 → 8080) and an Ingress named open-webui-ingress that uses the cluster-wide alb IngressClass. By default the ALB scheme is internal (VPC-only), so no public endpoint is created.

6. Access the Application

By default, this example exposes its UI through an internal ALB that is reachable only from inside the VPC. To access it from your laptop, use kubectl port-forward:

kubectl port-forward -n vllm-inference svc/open-webui-service 8080:80
# then open http://localhost:8080

To retrieve the ALB DNS name (for example, to connect from a bastion host or over a VPN):

kubectl get ingress open-webui-ingress \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' \
  -n vllm-inference

To expose the UI publicly over HTTPS, deploy the Terraform stack with var.base_domain set to a public Route 53 hosted zone you own (see the top-level README). The UI will then be reachable at https://gpu.<full_domain> once external-dns publishes the record.

⚠️ Select a Model: If you cannot select a model in the Web UI, the model is most likely still downloading and is not yet being served by the inference server pod. Refresh the page until a model appears.
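
Instead of refreshing repeatedly, you can wait on the model Deployment from the command line. This sketch assumes the Deployment is named qwen3-32b-fp8 (the name used in the Troubleshooting section) and that its readiness reflects the model being loaded, which depends on how the manifest defines its probes. The function name wait_for_model and the 30m default timeout are illustrative choices.

```shell
# Illustrative helper: block until the model Deployment reports a completed rollout.
# $1 is an optional timeout in kubectl duration syntax (default 30m, an arbitrary choice).
wait_for_model() {
  kubectl rollout status deployment/qwen3-32b-fp8 \
    -n vllm-inference \
    --timeout="${1:-30m}"
}
```

Call wait_for_model to use the default, or wait_for_model 45m to allow more time for a slow download.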

Clean up

kubectl delete -f lb-service.yaml
kubectl delete -f open-webui.yaml
kubectl delete -f model-qwen3-32b-fp8.yaml
kubectl delete -f namespace.yaml
kubectl delete -f ../../nodepools/gpu-nodepool.yaml
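
Namespace deletion can take a while, and GPU nodes keep incurring cost until they are removed. The sketch below is one way to confirm the cleanup finished; the function name confirm_cleanup is illustrative, and it assumes the namespace is vllm-inference as used throughout this example.

```shell
# Illustrative helper: report whether the namespace is gone, then list remaining nodes.
confirm_cleanup() {
  if kubectl get namespace vllm-inference >/dev/null 2>&1; then
    echo "namespace vllm-inference still exists (it may still be terminating)"
    return 1
  fi
  echo "namespace vllm-inference is gone"
  # GPU nodes should no longer appear here once they have been scaled down.
  kubectl get nodes
}
```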

Troubleshooting

🔧 Common issues and their solutions:

🎯 Model Deployment Issues

GPU Node Provisioning
- Verify nodes are properly labeled for GPU
- Check node status with kubectl get nodes
- Ensure GPU drivers are initialized
Model Initialization
- Check pod logs for startup errors:
```
kubectl logs -n vllm-inference deployment/qwen3-32b-fp8
```
- Verify Hugging Face token is valid
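
One way to check the token outside the cluster is to call the Hugging Face whoami endpoint. This is a sketch that assumes the token is in an environment variable named HF_TOKEN (an illustrative name) and that https://huggingface.co/api/whoami-v2 is reachable from your machine. An accepted token does not by itself prove that the token has permission to download a particular model.

```shell
# Illustrative helper: print the HTTP status returned for the token in HF_TOKEN.
# 200 means the token was accepted; 401 means it was rejected.
check_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer ${HF_TOKEN}" \
    https://huggingface.co/api/whoami-v2
}
```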

🔄 Load Balancer Issues

Ingress / ALB Status
- Check Ingress provisioning:
```
kubectl describe ingress open-webui-ingress -n vllm-inference
```
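- Check ALB controller logs (this applies only if the AWS Load Balancer Controller is installed in the cluster; with the load balancing built into EKS Auto Mode, this Deployment typically does not exist, so rely on the Ingress events instead):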
```
kubectl logs -n kube-system deployment/aws-load-balancer-controller
```
- Verify that the ALB hostname resolves and that the targets in its target group are healthy

💻 Resource Constraints

GPU Capacity
- Ensure sufficient GPU quota in your AWS account
- Inspect GPU capacity and allocation on a node (look for nvidia.com/gpu under Allocatable and Allocated resources):
```
kubectl describe node <node-name>
```
- Check for pod scheduling events:
```
kubectl get events -n vllm-inference
```
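
The two checks above can be narrowed down a little. The sketch below prints the allocatable GPU count per node and lists namespace events in chronological order; both function names are illustrative.

```shell
# Illustrative helper: show how many GPUs each node advertises as allocatable.
# Nodes without GPUs print <none> in the GPUS column.
show_gpu_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Illustrative helper: list events in vllm-inference, oldest first,
# so the most recent scheduling messages appear at the bottom.
show_recent_events() {
  kubectl get events -n vllm-inference --sort-by=.lastTimestamp
}
```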

💡 Tip: Always check pod logs and events first when troubleshooting deployment issues.
