Demo Walkthrough¶
This guide walks you through the demo environment deployed by ./cli demo-setup, including setup steps and usage instructions for each component.
Deployed Components¶
The demo environment includes:
AI Gateway¶
LiteLLM
- Unified API gateway for multiple LLM providers
- Request routing and load balancing
- Usage tracking and observability
- Access:
litellm.<DOMAIN>/ui
LLM Models¶
vLLM with Qwen3 Models
Two models are deployed:
| Model | Quantization | Hardware | Performance | Mode |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507-FP8 | FP8 (8-bit) | Single g6e | ~75 tokens/sec | MoE, non-thinking |
| Qwen3-32B-FP8 | FP8 (8-bit) | Single g6e | ~15 tokens/sec | Dense, thinking & non-thinking |
Model Characteristics
- Fast Model: MoE architecture optimized for speed
- Slow Model: Dense model with reasoning capabilities
Observability¶
Langfuse
- LLM observability and analytics
- Trace tracking and debugging
- Cost and performance monitoring
- Access:
langfuse.<DOMAIN>
GUI Application¶
Open WebUI
- Chat interface for LLM interaction
- Document RAG capabilities
- AI agent integration
- Function/tool marketplace
- Access:
openwebui.<DOMAIN>
Vector Database¶
Qdrant
- High-performance vector database
- Used for RAG document embeddings
- REST and gRPC API support
Embedding Models¶
Text Embedding Inference (TEI) with Qwen3-Embedding
| Model | Quantization | Hardware |
|---|---|---|
| Qwen3-Embedding-4B | BF16 (16-bit) | Single r7i |
MCP Server¶
Calculator MCP Server
- Built with FastMCP 2.0
- Basic calculator operations
- Demonstrates MCP server implementation
AI Agent¶
Strands Calculator Agent
- Built with Strands Agents framework
- Stateful calculator with memory
- Integrated with Open WebUI
Demo Setup¶
After running ./cli demo-setup, configure Open WebUI:
1. Access Open WebUI¶
Navigate to openwebui.<DOMAIN> in your browser.
2. Agent Functions (Auto-Registered)¶
Agent pipe functions are automatically registered when agents are installed:
The Strands Agents - Calculator Agent function appears in Open WebUI automatically.
3. Add Optional Functions¶
Add the Time Token Tracker function from the marketplace:
- Navigate to Functions in Open WebUI
- Search for "Time Token Tracker"
- Click install
Time Token Tracker on Open WebUI →
See Open WebUI Functions documentation for more details.
4. Configure RAG Embedding Model¶
Set up Open WebUI to use the deployed Qwen3-Embedding model:
- Navigate to Admin Panel → Settings → Documents
- Set the embedding model endpoint
- Get the API key from
.env.local(checkLITELLM_API_KEY) - Save configuration
RAG Embedding Support documentation →
Using the Demo¶
Chat with LLM Models¶
Interact with deployed models through Open WebUI:
-
Start a Conversation
- Select a model from the dropdown
- Type your message and press Enter
- Try both the fast and slow models
-
Compare Models
- Fast model (Qwen3-30B-A3B): Better for quick responses
- Slow model (Qwen3-32B): Better for complex reasoning
Model Selection
Use the fast model for general chat and the slow model when you need detailed reasoning or step-by-step thinking.
Document RAG¶
Use the RAG feature to chat with your documents:
-
Upload Documents
- Click the document icon in the chat interface
- Select files to upload (PDF, TXT, DOCX, etc.)
- Wait for embedding processing
-
Query Documents
- Ask questions about uploaded documents
- The system retrieves relevant chunks using Qdrant
- Responses are grounded in your documents
-
Manage Collections
- View and organize document collections
- Update or delete documents as needed
Calculator Agent¶
Explore the Strands Calculator Agent:
-
Select the Agent
- In Open WebUI, select
Strands Agents - Calculator Agent - The agent maintains memory across conversations
- In Open WebUI, select
-
Perform Calculations
-
Reset Calculator
Agent Features
- Memory: Continues calculations across messages
- Context Aware: Understands "that" and "the result"
- Stateful: Maintains current value
Source Code: examples/strands-agents/calculator-agent/
LiteLLM Dashboard¶
Monitor and manage API requests:
-
Access Dashboard
- Navigate to
litellm.<DOMAIN>/ui - Credentials in
.env.local:- Username:
LITELLM_UI_USERNAME - Password:
LITELLM_UI_PASSWORD
- Username:
- Navigate to
-
Features
- Request logs and metrics
- Model routing configuration
- Rate limiting and budgets
- API key management
LiteLLM Proxy Server documentation →
Request Flow
Open WebUI → LiteLLM Proxy → vLLM/Model
Langfuse Dashboard¶
Track LLM observability and analytics:
-
Access Dashboard
- Navigate to
langfuse.<DOMAIN> - Credentials in
.env.local:- Email:
LANGFUSE_USERNAME - Password:
LANGFUSE_PASSWORD
- Email:
- Navigate to
-
Features
- Trace visualization
- Cost tracking
- Performance metrics
- User analytics
- Model comparisons
-
Integration
- LiteLLM automatically logs to Langfuse
- All requests from Open WebUI are tracked
- Agent interactions are captured
Verification¶
Verify all components are running:
# Check pods
kubectl get pods -n litellm
kubectl get pods -n vllm
kubectl get pods -n langfuse
kubectl get pods -n openwebui
kubectl get pods -n qdrant
kubectl get pods -n tei
# Check services
kubectl get svc --all-namespaces
# Check ingress endpoints
kubectl get ingress --all-namespaces
All pods should show Running status.
Troubleshooting¶
Component Not Accessible¶
Check ingress and service configuration:
Model Not Loading¶
Check pod logs:
Common issues:
- Insufficient GPU memory
- Model download in progress
- Incorrect model path
RAG Not Working¶
Verify embedding model:
Check Open WebUI embedding configuration in Admin Panel.
Next Steps¶
- Explore different models and their capabilities
- Build custom agents using the Strands Agents framework
- Create MCP servers with FastMCP 2.0
- Integrate with your own applications via LiteLLM API
- Monitor costs and performance with Langfuse