NVIDIA Dynamo Platform¶

Deploy and serve LLM models with NVIDIA Dynamo on Kubernetes (on-premises) and Amazon EKS. The NVIDIA Platform provides a comprehensive suite of components for GPU-accelerated LLM inference with enterprise-grade monitoring, benchmarking, and auto-configuration.

Architecture¶

                    +-----------------------+
                    |      CLI              |
                    |  ./cli nvidia-platform |
                    +----------+------------+
                               |
    +--------+--------+--------+--------+--------+
    |        |        |        |        |        |
  GPU Op   Monitor   Dynamo   Dynamo   Benchmark  AIConfig
  DCGM     Prom,     Platform  vLLM    AIPerf,    Quick Est,
           Grafana   etcd,     agg/     Pushgateway SLA Deploy
                    Operator   disagg   Pareto

Components¶

Component	Description	CLI Command
GPU Operator	NVIDIA GPU resource management	`./cli nvidia-platform gpu-operator install`
Monitoring	Prometheus + Grafana + Dynamo/DCGM/KVBM/Benchmark dashboards	`./cli nvidia-platform monitoring install`
Dynamo Platform	CRDs, Operator, etcd, NATS, Grove, KAI Scheduler	`./cli nvidia-platform dynamo-platform install`
Dynamo vLLM Serving	vLLM model deployment (agg/disagg, KV Router, KVBM)	`./cli nvidia-platform dynamo-vllm install`
AIPerf Benchmark	Concurrency sweep, multi-turn, seq distribution, prefix cache	`./cli nvidia-platform benchmark install`
AIConfigurator	TP/PP recommendation (Quick Estimate) + SLA-driven profile + plan + deploy	`./cli nvidia-platform aiconfigurator install`

Installation Order¶

# === Standard Path ===
# 0. (K8s only) Install Ingress controller if not present
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --force-update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace --set controller.service.type=NodePort --wait

# 1. Monitoring (Prometheus + Grafana) — install first so GPU Operator
#    auto-detects ServiceMonitor CRDs and enables DCGM metrics scraping.
./cli nvidia-platform monitoring install

# 2. GPU Operator (detects Prometheus → creates DCGM ServiceMonitor)
./cli nvidia-platform gpu-operator install

# 3. Dynamo Platform (auto-detects Prometheus and sets prometheusEndpoint)
./cli nvidia-platform dynamo-platform install

# 4. Deploy a model
./cli nvidia-platform dynamo-vllm install

Platform Modes¶

The NVIDIA Platform supports two deployment modes, configured via PLATFORM in .env:

K8s Mode (On-premises)¶

Environment: PLATFORM=k8s

Designed for bare-metal Kubernetes clusters with pre-installed NVIDIA drivers and Container Toolkit.

Prerequisites (user must prepare): - Kubernetes v1.33+ - NVIDIA Driver 580+ - NVIDIA Container Toolkit with CDI configured - Fabric Manager (for H100 SXM / NVSwitch GPUs) - StorageClass with ReadWriteMany access (e.g., NFS, CephFS)

Auto-installed by CLI: - local-path StorageClass (if not found) - ingress-nginx (prompted if not found) - Prometheus endpoint auto-configuration

EKS Mode (Amazon Web Services)¶

Environment: PLATFORM=eks

Optimized for Amazon EKS with GPU node groups (g6e, p5, p4d, etc.).

Prerequisites (user must prepare): - EKS Cluster v1.33+ - GPU node groups with EKS GPU AMI - HuggingFace token for gated model access

Auto-installed by CLI: - EFS CSI Driver (via EKS addon) - EFS StorageClass - ALB Ingress annotations - Prometheus endpoint auto-configuration

Key Features¶

Deployment Modes¶

Aggregated: Single worker handles prefill + decode
Disaggregated: Separate prefill and decode workers with NIXL KV transfer

Parallelism Options¶

Tensor Parallel (TP): Split model across GPUs
Pipeline Parallel (PP): Split layers across GPUs
Expert Parallel (EP): MoE expert distribution
Replicas: Pod-level scaling with Dynamo Frontend routing

Advanced Features¶

KV Cache Routing: Routes requests to workers with cached KV blocks
KV Cache Offloading (KVBM): GPU → CPU → Disk cache hierarchy
Model Download: Pre-download to PVC with auto-detection
Structured Logging: Auto-enable if monitoring installed

Quick Test¶

# Port-forward the frontend service
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system --address 0.0.0.0 &

# Auto-detect model name
export MODEL=$(curl -s localhost:8000/v1/models | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(d[0]['id'] if d else 'NONE')")

# Chat completion
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello! Who are you?\"}],
    \"max_tokens\": 100,
    \"stream\": false
  }"