Neuron Workloads on EKS Auto Mode

Prerequisites
Overview
Architecture
Implementation Steps
Clean up
Troubleshooting

Prerequisites

An EKS Auto Mode cluster is deployed and kubectl is configured against it, as described in the Quick Start.

Overview

AWS Inferentia2 accelerates machine learning inference workloads with custom-built chips. Key benefits include:

🚀 High Performance Inference

Purpose-built ML acceleration
Optimized for transformer models
Cost-effective alternative to GPUs

🎯 ML Model Support

Supports PyTorch and other popular ML frameworks through the AWS Neuron SDK
OpenAI-compatible serving via vLLM

This example demonstrates deploying DeepSeek-R1-0528-Qwen3-8B on EKS Auto Mode using vLLM on Inferentia2. The container image ships a pre-compiled Neuron artifact, so no separate compile job is needed.

The manifest is adapted from aws-samples/sample-genai-on-eks-starter-kit.

Architecture

This example showcases Inferentia2-accelerated workloads using the following components:

🖥️ Instance Types

Default: inf2 instances (optimized for ML inference)
Customization: change the instance selection in neuron-nodepool.yaml.tpl

🔧 Key Components

📦 Infrastructure

NodePool and NodeClass for Neuron workload management
Application Load Balancer (Ingress) for HTTP access. The scheme is internal by default; setting var.base_domain opts in to an internet-facing ALB with HTTPS.

🧠 ML Components

vLLM serving DeepSeek-R1-0528-Qwen3-8B across 2 Neuron cores (tensor-parallel-size=2)
OpenAI-compatible chat completions endpoint
DeepSeek-R1 reasoning parser enabled

Implementation Steps

1. Deploy Neuron NodePool

Deploy the NodePool that will manage our Inferentia2 instances:

kubectl apply -f ../../nodepools/neuron-nodepool.yaml

⚠️ The Neuron NodePool applies the following taint so that only pods that tolerate it are scheduled onto these nodes:

taints:
  - key: "aws.amazon.com/neuron"
    value: "true"
    effect: "NoSchedule"   # Repels pods that lack a matching toleration

Any pod that needs to run on a Neuron node must include a matching toleration in its spec.
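
To confirm the taint actually landed on the NodePool before deploying the model, a check along these lines may help. It is a minimal sketch: the helper names are hypothetical, and the NodePool name `neuron` is an assumption taken from the `karpenter.sh/nodepool=neuron` label used in the troubleshooting section, so pass a different name if yours differs.

```shell
# Print each taint on the NodePool as key=value:effect, one per line.
# Assumption: the NodePool is named "neuron" unless another name is passed as $1.
nodepool_taints() {
  nodepool="${1:-neuron}"
  kubectl get nodepool "$nodepool" \
    -o jsonpath='{range .spec.template.spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Succeed only when the aws.amazon.com/neuron NoSchedule taint is present.
has_neuron_taint() {
  nodepool_taints "$1" | grep -q '^aws\.amazon\.com/neuron=true:NoSchedule$'
}
```

For example, `has_neuron_taint neuron && echo "taint present"` prints the message only when the taint shown above is configured.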

2. Deploy the vLLM Model

kubectl apply -f vllm-deployment.yaml

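✅ The Deployment carries a toleration for the NodePool taint and requests aws.amazon.com/neuroncore: 2. The toleration permits scheduling onto Neuron nodes, and the resource request ensures the pod can only land on a node that exposes Neuron cores: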
tolerations:
  - key: aws.amazon.com/neuron
    operator: Exists
    effect: NoSchedule
resources:
  requests:
    aws.amazon.com/neuroncore: 2
  limits:
    aws.amazon.com/neuroncore: 2

Wait for the pod to schedule onto a Neuron node and report Running:

kubectl -n vllm-neuron get pods -w

Once the pod is Running, vLLM still needs a minute or two to load the model. Tail the logs and wait for the line "Application startup complete":

kubectl -n vllm-neuron logs -f deployment/deepseek-r1-qwen3-8b-neuron
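
If you would rather script the wait than watch the log stream, the loop below is one possible approach. It is a sketch, not part of the example's tooling: the function names, the default of 60 attempts, and the 10-second interval are all assumptions, and it relies only on the "Application startup complete" log line mentioned above.

```shell
# Succeed once the vLLM log contains the startup message.
startup_complete() {
  kubectl -n vllm-neuron logs deployment/deepseek-r1-qwen3-8b-neuron 2>/dev/null \
    | grep -q 'Application startup complete'
}

# Poll the logs up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as startup is detected, 1 if every attempt fails.
wait_for_startup() {
  attempts="${1:-60}"
  interval="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if startup_complete; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

Calling `wait_for_startup 90 10` would wait up to roughly 15 minutes before giving up.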

📘 The manifest provisions a ClusterIP Service deepseek-r1-qwen3-8b-neuron (port 80 → 8000) and an Ingress deepseek-r1-qwen3-8b-neuron using the cluster-wide alb IngressClass. By default the ALB scheme is internal (VPC-only), so no public endpoint is created.

3. Test the Endpoint

By default, this example exposes its API through an internal ALB that is reachable only from inside the VPC. To call it from your laptop, port-forward the Service:

kubectl -n vllm-neuron port-forward svc/deepseek-r1-qwen3-8b-neuron 8000:80

Then hit the OpenAI-compatible chat completions endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-r1-qwen3-8b-neuron","messages":[{"role":"user","content":"Why is the sky blue?"}]}'
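
For repeated testing, the request can be wrapped in a few small helpers. This is a sketch under stated assumptions: the helper names and the `BASE_URL` variable are hypothetical, the port-forward from the previous step must still be running, and the payload builder does no JSON escaping, so the prompt must not contain double quotes, backslashes, or newlines. `max_tokens` is a standard field of the OpenAI chat completions API; the default of 512 is arbitrary.

```shell
# Base URL of the port-forwarded Service (assumption: port-forward is running).
BASE_URL="${BASE_URL:-http://localhost:8000}"

# List the model IDs the server advertises. The "model" field of a request
# must match one of them.
list_models() {
  curl -sS "$BASE_URL/v1/models"
}

# Build a chat completions request body from model ($1), prompt ($2),
# and an optional max_tokens ($3, default 512). No JSON escaping is done.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}' \
    "$1" "$2" "${3:-512}"
}

# Send a prompt and print the raw JSON response.
chat() {
  chat_payload "$1" "$2" "${3:-512}" \
    | curl -sS "$BASE_URL/v1/chat/completions" \
        -H 'Content-Type: application/json' -d @-
}
```

For example, `chat deepseek-r1-qwen3-8b-neuron "Why is the sky blue?"` sends the same request as the curl command above. If jq is installed, piping the output through `jq -r '.choices[0].message.content'` extracts the answer text. With the reasoning parser enabled, vLLM typically returns the model's reasoning in a separate field on the message, whose name may vary by vLLM version.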

To look up the ALB DNS name, for example to call the API from a bastion host or over a VPN:

kubectl get ingress deepseek-r1-qwen3-8b-neuron -n vllm-neuron \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
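
From a host inside the VPC, the hostname can be turned into a request URL as sketched below. The function names are hypothetical, and the sketch assumes the default internal setup: a plain HTTP listener on port 80 and an Ingress rule that routes all paths to the Service.

```shell
# Print the ALB hostname recorded on the Ingress (empty while provisioning).
alb_hostname() {
  kubectl get ingress deepseek-r1-qwen3-8b-neuron -n vllm-neuron \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# Print the URL of the model listing endpoint behind the ALB,
# or fail with a message if the Ingress has no hostname yet.
alb_models_url() {
  host="$(alb_hostname)"
  if [ -z "$host" ]; then
    echo "Ingress has no hostname yet; the ALB may still be provisioning" >&2
    return 1
  fi
  printf 'http://%s/v1/models\n' "$host"
}
```

Running `curl -sS "$(alb_models_url)"` from the bastion should then return the served model list once the ALB targets are healthy.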

To expose the API publicly over HTTPS, deploy the Terraform stack with var.base_domain set to a public Route53 zone you own (see top-level README). The example will be reachable at https://neuron.<full_domain> once external-dns publishes the record.

Clean up

kubectl delete -f vllm-deployment.yaml
kubectl delete -f ../../nodepools/neuron-nodepool.yaml
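
After deleting the NodePool, it can be worth confirming that no Inferentia2 node is left behind. The helpers below are a sketch with hypothetical names; they assume nodes carry the `karpenter.sh/nodepool=neuron` label used in the troubleshooting section.

```shell
# Count nodes still labelled for the NodePool ($1, default "neuron").
neuron_node_count() {
  kubectl get nodes -l "karpenter.sh/nodepool=${1:-neuron}" --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Succeed only when no node for the NodePool remains.
neuron_nodes_gone() {
  [ "$(neuron_node_count "$1")" -eq 0 ]
}
```

Node termination is not instant, so `neuron_nodes_gone` may need to be retried for a few minutes after the delete commands return.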

Troubleshooting

🔧 Common issues and their solutions:

🎯 Pod Scheduling Issues

Pod stuck in Pending

Check scheduling events:

kubectl -n vllm-neuron describe pod -l app=deepseek-r1-qwen3-8b-neuron

Confirm a Neuron node was provisioned by the NodePool:
```
kubectl get nodes -l karpenter.sh/nodepool=neuron
```

Insufficient Neuron capacity
- Confirm that your AWS account has enough EC2 service quota for inf2 instances in the cluster's Region
- Check pod scheduling events:
```
kubectl -n vllm-neuron get events --sort-by=.lastTimestamp
```
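
Both checks can be narrowed down as sketched below. The function names are hypothetical. The quota query assumes the relevant EC2 quota names contain the string "Inf", so review the output rather than relying on the filter, and it requires the AWS CLI with credentials for the cluster's account and Region.

```shell
# Show only scheduling failures in the namespace, oldest first.
failed_scheduling_events() {
  kubectl -n vllm-neuron get events \
    --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
}

# List EC2 service quotas whose name mentions "Inf", with their current values.
# Assumption: the Inferentia instance quotas are named this way in your account.
inf_quotas() {
  aws service-quotas list-service-quotas --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Inf')].[QuotaName,Value]" --output text
}
```

The EC2 On-Demand quota values are expressed in vCPUs, so compare them against the vCPU count of the inf2 size the NodePool is allowed to launch.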

🧠 Model Serving Issues

Pod logs

Tail vLLM startup and request logs:

kubectl -n vllm-neuron logs -f deployment/deepseek-r1-qwen3-8b-neuron

Neuron runtime status

Inspect Neuron devices on the pod:

kubectl -n vllm-neuron exec -it deployment/deepseek-r1-qwen3-8b-neuron -- neuron-ls

🔄 Load Balancer Issues

Ingress / ALB Status

Check Ingress provisioning:

kubectl -n vllm-neuron describe ingress deepseek-r1-qwen3-8b-neuron

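Check ALB controller logs. On EKS Auto Mode the load balancing capability is managed by AWS and does not run as a Deployment in the cluster, so this command returns results only if you installed the AWS Load Balancer Controller yourself: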

kubectl -n kube-system logs deployment/aws-load-balancer-controller

Verify that the ALB hostname resolves and that the target group reports healthy targets.

💡 Tip: Always check pod logs and events first when troubleshooting deployment issues.
