Skip to content

Dynamo vLLM Serving

Deploy vLLM models with NVIDIA Dynamo Platform orchestration. Supports aggregated and disaggregated serving modes, KV cache routing, KV cache offloading (KVBM), and advanced parallelism strategies.

Category nvidia-platform
Official Docs NVIDIA Dynamo vLLM, vLLM
CLI Install ./cli nvidia-platform dynamo-vllm install
CLI Uninstall ./cli nvidia-platform dynamo-vllm uninstall
Namespace dynamo-system

Overview

Dynamo vLLM provides high-performance LLM serving with: - Aggregated mode: Single worker handles prefill + decode - Disaggregated mode: Separate prefill and decode workers with NIXL KV transfer - KV Cache Routing: Routes requests to workers with cached KV blocks - KV Cache Offloading (KVBM): GPU → CPU → Disk cache hierarchy - Model pre-download: Downloads to PVC before deployment - Multi-node serving: Tensor/Pipeline/Expert parallelism with pod replicas

Installation

./cli nvidia-platform dynamo-vllm install

Interactive Configuration

The installer prompts for essential configuration only:

? Enter model name: (Qwen/Qwen3-30B-A3B-Instruct-2507-FP8)
? Select deployment mode: Aggregated / Disaggregated
? Tensor Parallel Size (TP): 1
? Enable KV Router? No
? Enable KV Cache Offloading (KVBM)? No
? Worker replicas: 1
? Additional vLLM args: --gpu-memory-utilization 0.90 --block-size 128
? Deployment name: (qwen3-30b-a3b-instruct-25)
? What would you like to do? Deploy now / Review first / Save only

Auto-Configuration

No prompts for these settings (configured from config.json or auto-detection):

Setting Source
vLLM image tag config.jsondynamoPlatform.releaseVersion
Structured logging Auto-enable if monitoring installed
KV Router temperature 0.5 (Dynamo default)
KV overlap score weight 1.0 (Dynamo default)
KVBM disk directory /tmp (K8s) / /mnt/nvme/kvbm_cache (EKS)
KVBM disk offload filter true (default, protects SSD lifespan)

Deployment Modes

Aggregated (agg)

Single worker handles both prefill (prompt encoding) and decode (token generation):

Request → Frontend → VllmWorker (prefill + decode) → Response

Characteristics: - Simpler architecture - Lower latency for small batches - TP/PP/EP applied to single worker type - Better for single-turn conversations

Disaggregated (disagg)

Separate workers for prefill and decode, with NIXL for KV transfer:

Request → Frontend → VllmPrefillWorker → [NIXL KV Transfer] → VllmDecodeWorker → Response

Characteristics: - Independent scaling of prefill and decode - Separate TP/PP/EP for each worker type - Better for multi-turn conversations with KV Router - KVBM applied to Prefill workers only

Parallelism

Parameter Description Agg Disagg
Tensor Parallel (TP) Split model across GPUs Single setting Separate Prefill TP + Decode TP
Pipeline Parallel (PP) Split layers across GPUs Single setting Separate Prefill PP + Decode PP
Expert Parallel (EP) MoE expert distribution --enable-expert-parallel Per worker type
Replicas Pod-level scaling Worker replicas Prefill replicas + Decode replicas

Example: Disaggregated with TP=2 for prefill, TP=4 for decode:

Prefill: 2 replicas × TP2 = 4 GPUs
Decode:  4 replicas × TP4 = 16 GPUs
Total: 20 GPUs

KV Cache Routing

Routes requests to workers with cached KV blocks for improved TTFT (Time to First Token).

Setting ENV / Arg Default
Enable DYN_ROUTER_MODE=kv Disabled
Temperature DYN_ROUTER_TEMPERATURE 0.5
Overlap Weight DYN_KV_OVERLAP_SCORE_WEIGHT 1.0
KV Events --kv-events-config Auto when router enabled

Use Case: Multi-turn conversations with session affinity. Benchmark with AIPerf Multi-Turn mode.

KV Cache Offloading (KVBM)

Three-tier cache hierarchy for larger effective context:

GPU Memory (G1) → CPU Pinned Memory (G2) → Local SSD/NVMe (G3)
Setting ENV Description
CPU Cache DYN_KVBM_CPU_CACHE_GB CPU pinned memory size (GB)
Disk Cache DYN_KVBM_DISK_CACHE_GB SSD cache size (GB)
Disk Directory DYN_KVBM_DISK_CACHE_DIR Cache path (default: /tmp)
Disk Offload Filter DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER Frequency filter (default: enabled)
Connector --connector kvbm (agg) / --connector kvbm nixl (disagg prefill) Required

Disaggregated Mode

In disaggregated mode, KVBM is applied to Prefill workers only. Decode workers use --connector nixl for KV transfer.

Disk Offload Frequency Filter

Only offloads blocks with frequency >= 2 to disk, protecting SSD lifespan. Frequency doubles on cache hit, decays over time (600s interval).

Model Download

Models are pre-downloaded to PVC using huggingface_hub.snapshot_download:

  • Fixed path: /opt/models/<org>/<model-name>/
  • Auto-detection: Checks if *.safetensors exist before download
  • vLLM args: --model /opt/models/<org>/<model> + --served-model-name <HF ID>

Example:

Model: Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
PVC Path: /opt/models/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8/
vLLM: --model /opt/models/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --served-model-name Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

Verification

# Check deployment
kubectl get dynamographdeployment -n dynamo-system

# Check pods
kubectl get pods -n dynamo-system

# Check frontend service
kubectl get svc -n dynamo-system | grep frontend

# Port-forward for testing
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system --address 0.0.0.0 &

# Auto-detect model name
export MODEL=$(curl -s localhost:8000/v1/models | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(d[0]['id'] if d else 'NONE')")

# Test chat completion
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],
    \"max_tokens\": 50,
    \"stream\": false
  }"

Configuration

Modify config.json to change default settings:

{
  "platform": {
    "eks": {
      "dynamoPlatform": {
        "releaseVersion": "0.9.1"
      }
    }
  }
}

Advanced vLLM Arguments

Common arguments to pass via "Additional vLLM args":

# GPU memory utilization
--gpu-memory-utilization 0.90

# Block size (larger = more memory, fewer blocks)
--block-size 128

# Expert parallel (for MoE models)
--enable-expert-parallel

# Quantization
--quantization fp8

# Max model length
--max-model-len 8192

Integration with LiteLLM

Deploy Dynamo vLLM models, then configure LiteLLM to route traffic:

model_list:
  - model_name: qwen3-30b-fp8
    litellm_params:
      model: openai/qwen3-30b-fp8
      api_base: http://<deployment>-frontend.dynamo-system:8000/v1

Learn More