Monitoring (Prometheus + Grafana)
Comprehensive monitoring stack with Prometheus, Grafana, and pre-configured dashboards for NVIDIA Dynamo Platform, DCGM GPU metrics, KV cache benchmarks, and LLM performance analysis.
| Field | Value |
|---|---|
| Category | nvidia-platform |
| Official Docs | kube-prometheus-stack |
| CLI Install | ./cli nvidia-platform monitoring install |
| CLI Uninstall | ./cli nvidia-platform monitoring uninstall |
| Namespace | monitoring |
Overview
The monitoring component deploys kube-prometheus-stack (Prometheus + Grafana) with specialized dashboards for:
- Dynamo Frontend and Worker metrics
- DCGM GPU utilization, power, temperature
- KV cache usage and offloading (KVBM)
- Benchmark Pareto comparison (TPS/GPU, TTFT, ITL)
When monitoring is installed before Dynamo Platform, the Dynamo Platform installer detects the Prometheus endpoint and sets prometheusEndpoint automatically.
Installation
Auto-Configuration
No interactive prompts are required. Settings are read from config.json or auto-detected from the cluster:
| Setting | Default | Configured via |
|---|---|---|
| Grafana password | admin | config.json → grafanaAdminPassword |
| Retention | 7d | config.json → retention |
| Alertmanager | false | config.json → alertmanagerEnabled |
| Ingress | Auto-detect | Enabled if Ingress controller exists |
| ALB annotations (EKS) | Auto-added | When PLATFORM=eks and Ingress detected |
Verification
# Check monitoring pods
kubectl get pods -n monitoring
# Check services
kubectl get svc -n monitoring
# Check Ingress (if enabled)
kubectl get ingress -n monitoring
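The checks above show a point-in-time snapshot. As a minimal sketch, the helper below blocks until the pods report Ready by wrapping `kubectl wait`. The function name and the 300s default timeout are illustrative choices, not part of the installer, and any pod that ends in a Completed state would never become Ready and would cause the wait to time out.

```shell
# Block until every pod in the namespace reports the Ready condition.
# ASSUMPTION: the function name and the 300s default are illustrative only.
wait_for_monitoring() {
  namespace="${1:-monitoring}"
  timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pods --all \
    -n "$namespace" --timeout="$timeout"
}

# Usage:
#   wait_for_monitoring                  # monitoring namespace, 300s
#   wait_for_monitoring monitoring 120s  # shorter timeout
```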
Access
With Ingress (Recommended)
If an Ingress controller is detected during installation, Grafana and Prometheus are accessible via HTTP path routing:
K8s Mode (on-premises):
# Find the NodePort
kubectl get svc -n ingress-nginx
# Access URLs
http://<node-ip>:<node-port>/grafana
http://<node-ip>:<node-port>/prometheus
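The NodePort lookup and URL assembly for K8s mode can be scripted. The sketch below assumes the ingress controller Service is named `ingress-nginx-controller` and exposes a port named `http`; neither name is confirmed by this page, so adjust them to match your cluster. Both function names are hypothetical.

```shell
# Print the Grafana and Prometheus URLs for a given host and port.
monitoring_urls() {
  host="$1"
  port="$2"
  printf 'http://%s:%s/grafana\n' "$host" "$port"
  printf 'http://%s:%s/prometheus\n' "$host" "$port"
}

# Look up the HTTP NodePort of the ingress controller Service.
# ASSUMPTION: Service name "ingress-nginx-controller" and port name "http".
ingress_http_nodeport() {
  kubectl get svc ingress-nginx-controller -n ingress-nginx \
    -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}'
}

# Usage (replace <node-ip> with a reachable node address):
#   monitoring_urls <node-ip> "$(ingress_http_nodeport)"
```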
EKS Mode (ALB):
# Find the ALB address
kubectl get ingress -n monitoring
# Access URLs
http://<alb-url>/grafana
http://<alb-url>/prometheus
Remote Access (SSH Tunnel):
# From your local machine
ssh -N -L <local-port>:<node-ip>:<node-port> <user>@<remote-host>
# Open in browser
http://localhost:<local-port>/grafana
http://localhost:<local-port>/prometheus
Without Ingress (Port-Forward Fallback)
# Grafana
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring --address 0.0.0.0 &
# Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring --address 0.0.0.0 &
Grafana Login
| Field | Value |
|---|---|
| User | admin |
| Password | Configured during install (default: admin) |
To retrieve the current password:
kubectl get secret prometheus-grafana -n monitoring \
-o jsonpath="{.data.admin-password}" | base64 --decode; echo
Dashboards
The monitoring stack ships with four pre-configured dashboards:
| Dashboard | Description | Source |
|---|---|---|
| Dynamo Dashboard | Frontend/Worker metrics, request rates, latencies | monitoring/dashboards/dynamo-dashboard.json |
| DCGM GPU Monitoring | GPU utilization, memory, temperature, power | monitoring/dashboards/dcgm-metrics.json |
| KVBM KV Cache | KV cache usage, offloading metrics | monitoring/dashboards/kvbm.json |
| Benchmark Pareto | Benchmark comparison (TPS/GPU, TTFT, ITL vs concurrency) | monitoring/dashboards/benchmark-dashboard.json |
Dashboards are loaded automatically by the Grafana sidecar, which picks up ConfigMaps carrying the label grafana_dashboard: "1".
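The same label mechanism can load an additional dashboard. The following is a minimal sketch that wraps a dashboard JSON file in a labeled ConfigMap manifest. The function name, ConfigMap name, and file path are hypothetical, and it assumes the sidecar watches the monitoring namespace.

```shell
# Emit a ConfigMap manifest that embeds a dashboard JSON file and carries
# the grafana_dashboard: "1" label the sidecar looks for.
# ASSUMPTION: names and paths are caller-supplied examples, not shipped files.
dashboard_configmap() {
  name="$1"
  json_file="$2"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${name}
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  $(basename "$json_file"): |
EOF
  # Indent the JSON so it sits inside the YAML block scalar.
  sed 's/^/    /' "$json_file"
}

# Usage:
#   dashboard_configmap my-dashboard ./my-dashboard.json | kubectl apply -f -
```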
Benchmark Pareto Dashboard
The Benchmark dashboard provides interactive comparison of LLM serving configurations:
- TPS/GPU vs Concurrency: Throughput per GPU at different load levels
- TPS/User vs Concurrency: Per-user throughput
- TTFT P50/P99 vs Concurrency: Time to First Token latency
- ITL P50 vs Concurrency: Inter-Token Latency
- GPU Efficiency: TPS/GPU vs TPS/User scatter plot (top-right = optimal)
Use the Benchmark dropdown to select and compare multiple benchmark runs.
Configuration
Edit config.json to customize monitoring settings:
{
  "platform": {
    "monitoring": {
      "grafanaAdminPassword": "admin",
      "retention": "7d",
      "enablePersistentStorage": false,
      "prometheusStorageSize": "50Gi",
      "alertmanagerEnabled": false
    }
  }
}
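The retention value is a Prometheus duration string such as 7d or 36h. As a rough sketch, a value can be sanity-checked before it goes into config.json. The function name is hypothetical and the pattern is a simplified approximation (a number followed by one of ms, s, m, h, d, w, y, optionally repeated), not the exact rule Prometheus or the installer enforces.

```shell
# Return 0 if the argument looks like a Prometheus duration (e.g. 7d, 36h).
# ASSUMPTION: simplified pattern for illustration, not the authoritative one.
is_valid_retention() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]+(ms|y|w|d|h|m|s))+$'
}

# Usage:
#   is_valid_retention 7d && echo "retention looks valid"
```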
Prometheus Pushgateway
Pushgateway is automatically installed for collecting benchmark metrics. After each AIPerf Benchmark run, metrics are pushed to Pushgateway and scraped by Prometheus.
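To illustrate the push step, the sketch below sends a single gauge sample to Pushgateway over its HTTP push endpoint (`/metrics/job/<job>`). The metric and job names that the benchmark tooling actually uses are not listed on this page, so the names in the usage line are invented placeholders, and the function name is hypothetical.

```shell
# Push one gauge sample to Pushgateway under the given job name.
# ASSUMPTION: job and metric names below are made up for illustration.
push_metric() {
  base_url="$1"   # e.g. http://localhost:19091 after a port-forward
  job="$2"
  metric="$3"
  value="$4"
  printf '# TYPE %s gauge\n%s %s\n' "$metric" "$metric" "$value" |
    curl -s --data-binary @- "${base_url}/metrics/job/${job}"
}

# Usage (with Pushgateway forwarded to localhost:19091):
#   push_metric http://localhost:19091 demo_job demo_tps_per_gpu 123.4
```

Prometheus then picks the sample up on its next scrape of Pushgateway, and the sample stays there until it is overwritten or deleted.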
Managing Benchmark Data
To reset benchmark metrics in Grafana:
# Forward the Pushgateway service to a local port
kubectl port-forward svc/pushgateway-prometheus-pushgateway 19091:9091 -n monitoring &
sleep 2
# Delete the metrics of every job known to Pushgateway
# (this removes only groups identified by the job label alone)
curl -s http://localhost:19091/metrics | grep -o 'job="[^"]*"' | sort -u | sed 's/job="//;s/"//' | while read -r job; do
  curl -X DELETE "http://localhost:19091/metrics/job/$job"
done
# Stop the port-forward
pkill -f "port-forward.*pushgateway"
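Before deleting anything, it can help to see which jobs Pushgateway currently holds. This sketch extracts the distinct job names from the /metrics output without Perl-style grep, so it also runs where `grep -P` is unavailable. The function name is hypothetical, and a label whose name merely ends in `job` (for example `exported_job`) would also match.

```shell
# Read Pushgateway /metrics text on stdin and print each distinct job name.
list_pushgateway_jobs() {
  grep -o 'job="[^"]*"' | sed 's/^job="//; s/"$//' | sort -u
}

# Usage (with Pushgateway forwarded to localhost:19091):
#   curl -s http://localhost:19091/metrics | list_pushgateway_jobs
```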
Integration with Dynamo Platform
When monitoring is installed before Dynamo Platform:
- Dynamo Platform installer detects Prometheus endpoint
- Sets prometheusEndpoint in Helm values
- Dynamo Operator auto-creates PodMonitors for Frontend/Workers
- Worker metrics endpoint (DYN_SYSTEM_PORT=9090) auto-configured
- GPU metrics from DCGM Exporter auto-scraped via ServiceMonitor
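One way to confirm that the integration took effect is to list the PodMonitor and ServiceMonitor objects in the cluster. The helper below is a small sketch with a hypothetical function name. The resource names it should return depend on the Dynamo and DCGM Exporter releases and are not specified here.

```shell
# List the PodMonitors and ServiceMonitors that Prometheus can discover.
# ASSUMPTION: the function name is illustrative; output varies per cluster.
list_scrape_monitors() {
  kubectl get podmonitors,servicemonitors --all-namespaces
}

# Usage:
#   list_scrape_monitors
```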