NVIDIA Dynamo Platform
Deploy and serve LLMs with NVIDIA Dynamo on on-premises Kubernetes and on Amazon EKS. The NVIDIA Platform bundles the components needed for GPU-accelerated LLM inference, together with monitoring, benchmarking, and auto-configuration.
Architecture
                          +-----------------------+
                          |          CLI          |
                          | ./cli nvidia-platform |
                          +-----------+-----------+
                                      |
      +------------+------------+-----+------+------------+------------+
      |            |            |            |            |            |
   GPU Op       Monitor      Dynamo       Dynamo      Benchmark    AIConfig
    DCGM         Prom,      Platform       vLLM        AIPerf,    Quick Est,
                Grafana       etcd,        agg/      Pushgateway  SLA Deploy
                            Operator      disagg       Pareto
Components
| Component | Description | CLI Command |
|---|---|---|
| GPU Operator | NVIDIA GPU resource management | ./cli nvidia-platform gpu-operator install |
| Monitoring | Prometheus + Grafana + Dynamo/DCGM/KVBM/Benchmark dashboards | ./cli nvidia-platform monitoring install |
| Dynamo Platform | CRDs, Operator, etcd, NATS, Grove, KAI Scheduler | ./cli nvidia-platform dynamo-platform install |
| Dynamo vLLM Serving | vLLM model deployment (agg/disagg, KV Router, KVBM) | ./cli nvidia-platform dynamo-vllm install |
| AIPerf Benchmark | Concurrency sweep, multi-turn, seq distribution, prefix cache | ./cli nvidia-platform benchmark install |
| AIConfigurator | TP/PP recommendation (Quick Estimate) + SLA-driven profile + plan + deploy | ./cli nvidia-platform aiconfigurator install |
Installation Order
# === Standard Path ===
# 0. (K8s only) Install Ingress controller if not present
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --force-update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.service.type=NodePort --wait
# 1. Monitoring (Prometheus + Grafana) — install first so GPU Operator
# auto-detects ServiceMonitor CRDs and enables DCGM metrics scraping.
./cli nvidia-platform monitoring install
# 2. GPU Operator (detects Prometheus → creates DCGM ServiceMonitor)
./cli nvidia-platform gpu-operator install
# 3. Dynamo Platform (auto-detects Prometheus and sets prometheusEndpoint)
./cli nvidia-platform dynamo-platform install
# 4. Deploy a model
./cli nvidia-platform dynamo-vllm install
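The ordering above relies on the ServiceMonitor CRDs being present before the GPU Operator step runs. A minimal sketch of a check you could run between steps 1 and 2 follows. It assumes the monitoring step registers the Prometheus Operator CRD under its standard name, `servicemonitors.monitoring.coreos.com`; the helper name `servicemonitor_crd_present` is hypothetical and not part of the CLI.

```shell
# Hypothetical helper (not part of ./cli): succeeds if the ServiceMonitor CRD
# is registered in the cluster. Assumes the monitoring stack installs the
# Prometheus Operator CRDs under their standard name.
servicemonitor_crd_present() {
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}

# Usage between steps 1 and 2 (not executed here):
#   servicemonitor_crd_present && ./cli nvidia-platform gpu-operator install
```

If the check fails, re-running the monitoring step before installing the GPU Operator is likely the simplest fix.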
Platform Modes
The NVIDIA Platform supports two deployment modes, configured via PLATFORM in .env:
K8s Mode (On-premises)
Environment: PLATFORM=k8s
Designed for bare-metal Kubernetes clusters with pre-installed NVIDIA drivers and Container Toolkit.
Prerequisites (user must prepare):
- Kubernetes v1.33+
- NVIDIA Driver 580+
- NVIDIA Container Toolkit with CDI configured
- Fabric Manager (for H100 SXM / NVSwitch GPUs)
- StorageClass with ReadWriteMany access (e.g., NFS, CephFS)
Auto-installed by CLI:
- local-path StorageClass (if not found)
- ingress-nginx (prompted if not found)
- Prometheus endpoint auto-configuration
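The version prerequisites for K8s mode can be checked before running the CLI. The helpers below are a sketch, not part of the CLI: `driver_ok` and `k8s_ok` are hypothetical names, and the usage lines assume `nvidia-smi`, `kubectl`, and `jq` are available on the machine where you run them.

```shell
# Hypothetical preflight helpers for the K8s-mode prerequisites.

# Succeeds if a driver version string such as "580.65.06" has major >= 580.
driver_ok() {
  drv_major="${1%%.*}"
  [ "$drv_major" -ge 580 ]
}

# Succeeds if a Kubernetes version such as "v1.33.1" is at least v1.33.
k8s_ok() {
  k8s_v="${1#v}"                        # strip the leading "v"
  k8s_major="${k8s_v%%.*}"
  k8s_rest="${k8s_v#*.}"
  k8s_minor="${k8s_rest%%.*}"
  k8s_minor="${k8s_minor%%[!0-9]*}"     # drop non-numeric suffixes such as "+"
  [ "$k8s_major" -gt 1 ] || { [ "$k8s_major" -eq 1 ] && [ "$k8s_minor" -ge 33 ]; }
}

# Usage on a GPU node / with cluster access (not executed here):
#   driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
#   k8s_ok "$(kubectl version -o json | jq -r .serverVersion.gitVersion)"
#   kubectl get storageclass   # look for a class backed by an RWX-capable provisioner
```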
EKS Mode (Amazon Web Services)
Environment: PLATFORM=eks
Optimized for Amazon EKS with GPU node groups (g6e, p5, p4d, etc.).
Prerequisites (user must prepare):
- EKS Cluster v1.33+
- GPU node groups with EKS GPU AMI
- HuggingFace token for gated model access
Auto-installed by CLI:
- EFS CSI Driver (via EKS addon)
- EFS StorageClass
- ALB Ingress annotations
- Prometheus endpoint auto-configuration
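To confirm that the EFS CSI Driver addon is in place after installation, something like the following sketch may help. It assumes the CLI uses the standard EKS addon name `aws-efs-csi-driver` and that the AWS CLI is configured for the cluster's account and region; `efs_addon_present` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the EFS CSI Driver addon is listed for the
# given EKS cluster. Assumes the standard addon name "aws-efs-csi-driver".
efs_addon_present() {
  aws eks list-addons --cluster-name "$1" --query 'addons' --output text \
    | tr '\t' '\n' | grep -qx 'aws-efs-csi-driver'
}

# Usage (not executed here; replace the placeholder cluster name):
#   efs_addon_present my-cluster && kubectl get storageclass
```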
Key Features
Deployment Modes
- Aggregated: Single worker handles prefill + decode
- Disaggregated: Separate prefill and decode workers with NIXL KV transfer
Parallelism Options
- Tensor Parallel (TP): Split model across GPUs
- Pipeline Parallel (PP): Split layers across GPUs
- Expert Parallel (EP): MoE expert distribution
- Replicas: Pod-level scaling with Dynamo Frontend routing
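As a rough sizing aid, the parallelism options can be combined into a GPU count. The sketch below assumes each replica is one worker occupying TP × PP GPUs, and that expert parallelism redistributes experts over those same GPUs rather than adding to them. Both are assumptions about the serving backend, and `gpus_required` is a hypothetical helper.

```shell
# Hypothetical sizing helper: total GPUs = TP * PP * replicas.
# Assumes EP reuses the TP*PP GPUs of a worker and adds none of its own.
gpus_required() {
  tp="$1"; pp="$2"; replicas="$3"
  echo $(( tp * pp * replicas ))
}

gpus_required 4 2 3   # prints 24
```

In a disaggregated deployment the same arithmetic would presumably be applied separately to the prefill and decode worker pools and then summed.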
Advanced Features
- KV Cache Routing: Routes requests to workers with cached KV blocks
- KV Cache Offloading (KVBM): GPU → CPU → Disk cache hierarchy
- Model Download: Pre-download to PVC with auto-detection
- Structured Logging: Auto-enable if monitoring installed
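The idea behind KV cache routing can be illustrated with a toy selection rule: send a request to the worker that already holds the most of its leading KV blocks. This is a simplified sketch, not Dynamo's router. The block IDs, worker names, and both function names are made up for illustration, and a production router would typically also weigh worker load.

```shell
# Toy illustration of cache-aware routing (not Dynamo's implementation).

# Count how many leading request blocks the worker already holds.
# $1: space-separated block IDs of the request
# $2: space-separated block IDs cached on the worker
cached_prefix_len() {
  n=0
  for block in $1; do
    case " $2 " in
      *" $block "*) n=$((n + 1)) ;;   # block is cached, keep counting
      *) break ;;                      # first miss ends the reusable prefix
    esac
  done
  echo "$n"
}

# Pick the worker with the longest cached prefix (first one wins on ties).
# $1: request blocks; remaining args: "name:block block ..." entries.
pick_worker() {
  req="$1"; shift
  best=""; best_len=-1
  for entry in "$@"; do
    name="${entry%%:*}"
    len=$(cached_prefix_len "$req" "${entry#*:}")
    if [ "$len" -gt "$best_len" ]; then best="$name"; best_len="$len"; fi
  done
  echo "$best"
}

pick_worker "b1 b2 b3 b4" "worker-0:b1" "worker-1:b1 b2 b3" "worker-2:b9"   # prints worker-1
```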
Quick Test
# Port-forward the frontend service
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system --address 0.0.0.0 &
# Auto-detect model name
export MODEL=$(curl -s localhost:8000/v1/models | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(d[0]['id'] if d else 'NONE')")
# Chat completion
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello! Who are you?\"}],
    \"max_tokens\": 100,
    \"stream\": false
  }"
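Because the port-forward runs in the background, the first request can race with it and fail. A small polling helper, sketched below, waits until the frontend answers on `/v1/models` before the other requests are sent. `wait_for_frontend` is a hypothetical name, and the attempt count and two-second delay are arbitrary choices.

```shell
# Hypothetical helper: poll <base-url>/v1/models until it responds.
# $1: base URL, $2: maximum attempts (default 30), 2 s pause between attempts.
wait_for_frontend() {
  url="$1"; attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$url/v1/models" >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Usage after starting the port-forward (not executed here):
#   wait_for_frontend http://localhost:8000 || echo "frontend not reachable" >&2
```

If `MODEL` resolves to `NONE`, no model is registered yet, so the chat completion request is expected to fail until a worker has loaded one.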