Dynamo vLLM Serving¶
Deploy vLLM models with NVIDIA Dynamo Platform orchestration. Supports aggregated and disaggregated serving modes, KV cache routing, KV cache offloading (KVBM), and advanced parallelism strategies.
| Category | nvidia-platform |
| Official Docs | NVIDIA Dynamo vLLM, vLLM |
| CLI Install | ./cli nvidia-platform dynamo-vllm install |
| CLI Uninstall | ./cli nvidia-platform dynamo-vllm uninstall |
| Namespace | dynamo-system |
Overview¶
Dynamo vLLM provides high-performance LLM serving with: - Aggregated mode: Single worker handles prefill + decode - Disaggregated mode: Separate prefill and decode workers with NIXL KV transfer - KV Cache Routing: Routes requests to workers with cached KV blocks - KV Cache Offloading (KVBM): GPU → CPU → Disk cache hierarchy - Model pre-download: Downloads to PVC before deployment - Multi-node serving: Tensor/Pipeline/Expert parallelism with pod replicas
Installation¶
Interactive Configuration¶
The installer prompts for essential configuration only:
? Enter model name: (Qwen/Qwen3-30B-A3B-Instruct-2507-FP8)
? Select deployment mode: Aggregated / Disaggregated
? Tensor Parallel Size (TP): 1
? Enable KV Router? No
? Enable KV Cache Offloading (KVBM)? No
? Worker replicas: 1
? Additional vLLM args: --gpu-memory-utilization 0.90 --block-size 128
? Deployment name: (qwen3-30b-a3b-instruct-25)
? What would you like to do? Deploy now / Review first / Save only
Auto-Configuration¶
No prompts for these settings (configured from config.json or auto-detection):
| Setting | Source |
|---|---|
| vLLM image tag | config.json → dynamoPlatform.releaseVersion |
| Structured logging | Auto-enable if monitoring installed |
| KV Router temperature | 0.5 (Dynamo default) |
| KV overlap score weight | 1.0 (Dynamo default) |
| KVBM disk directory | /tmp (K8s) / /mnt/nvme/kvbm_cache (EKS) |
| KVBM disk offload filter | true (default, protects SSD lifespan) |
Deployment Modes¶
Aggregated (agg)¶
Single worker handles both prefill (prompt encoding) and decode (token generation):
Characteristics: - Simpler architecture - Lower latency for small batches - TP/PP/EP applied to single worker type - Better for single-turn conversations
Disaggregated (disagg)¶
Separate workers for prefill and decode, with NIXL for KV transfer:
Characteristics: - Independent scaling of prefill and decode - Separate TP/PP/EP for each worker type - Better for multi-turn conversations with KV Router - KVBM applied to Prefill workers only
Parallelism¶
| Parameter | Description | Agg | Disagg |
|---|---|---|---|
| Tensor Parallel (TP) | Split model across GPUs | Single setting | Separate Prefill TP + Decode TP |
| Pipeline Parallel (PP) | Split layers across GPUs | Single setting | Separate Prefill PP + Decode PP |
| Expert Parallel (EP) | MoE expert distribution | --enable-expert-parallel | Per worker type |
| Replicas | Pod-level scaling | Worker replicas | Prefill replicas + Decode replicas |
Example: Disaggregated with TP=2 for prefill, TP=4 for decode:
KV Cache Routing¶
Routes requests to workers with cached KV blocks for improved TTFT (Time to First Token).
| Setting | ENV / Arg | Default |
|---|---|---|
| Enable | DYN_ROUTER_MODE=kv | Disabled |
| Temperature | DYN_ROUTER_TEMPERATURE | 0.5 |
| Overlap Weight | DYN_KV_OVERLAP_SCORE_WEIGHT | 1.0 |
| KV Events | --kv-events-config | Auto when router enabled |
Use Case: Multi-turn conversations with session affinity. Benchmark with AIPerf Multi-Turn mode.
KV Cache Offloading (KVBM)¶
Three-tier cache hierarchy for larger effective context:
| Setting | ENV | Description |
|---|---|---|
| CPU Cache | DYN_KVBM_CPU_CACHE_GB | CPU pinned memory size (GB) |
| Disk Cache | DYN_KVBM_DISK_CACHE_GB | SSD cache size (GB) |
| Disk Directory | DYN_KVBM_DISK_CACHE_DIR | Cache path (default: /tmp) |
| Disk Offload Filter | DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER | Frequency filter (default: enabled) |
| Connector | --connector kvbm (agg) / --connector kvbm nixl (disagg prefill) | Required |
Disaggregated Mode
In disaggregated mode, KVBM is applied to Prefill workers only. Decode workers use --connector nixl for KV transfer.
Disk Offload Frequency Filter
Only offloads blocks with frequency >= 2 to disk, protecting SSD lifespan. Frequency doubles on cache hit, decays over time (600s interval).
Model Download¶
Models are pre-downloaded to PVC using huggingface_hub.snapshot_download:
- Fixed path:
/opt/models/<org>/<model-name>/ - Auto-detection: Checks if
*.safetensorsexist before download - vLLM args:
--model /opt/models/<org>/<model>+--served-model-name <HF ID>
Example:
Model: Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
PVC Path: /opt/models/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8/
vLLM: --model /opt/models/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --served-model-name Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
Verification¶
# Check deployment
kubectl get dynamographdeployment -n dynamo-system
# Check pods
kubectl get pods -n dynamo-system
# Check frontend service
kubectl get svc -n dynamo-system | grep frontend
# Port-forward for testing
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system --address 0.0.0.0 &
# Auto-detect model name
export MODEL=$(curl -s localhost:8000/v1/models | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(d[0]['id'] if d else 'NONE')")
# Test chat completion
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL\",
\"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],
\"max_tokens\": 50,
\"stream\": false
}"
Configuration¶
Modify config.json to change default settings:
Advanced vLLM Arguments¶
Common arguments to pass via "Additional vLLM args":
# GPU memory utilization
--gpu-memory-utilization 0.90
# Block size (larger = more memory, fewer blocks)
--block-size 128
# Expert parallel (for MoE models)
--enable-expert-parallel
# Quantization
--quantization fp8
# Max model length
--max-model-len 8192
Integration with LiteLLM¶
Deploy Dynamo vLLM models, then configure LiteLLM to route traffic:
model_list:
- model_name: qwen3-30b-fp8
litellm_params:
model: openai/qwen3-30b-fp8
api_base: http://<deployment>-frontend.dynamo-system:8000/v1