Demo Walkthrough¶

This guide walks you through the demo environment deployed by ./cli demo-setup, including setup steps and usage instructions for each component.

Deployed Components¶

The demo environment includes:

AI Gateway¶

LiteLLM

Unified API gateway for multiple LLM providers
Request routing and load balancing
Usage tracking and observability
Access: litellm.<DOMAIN>/ui

LLM Models¶

vLLM with Qwen3 Models

Two models are deployed:

Model	Quantization	Hardware	Performance	Mode
Qwen3-30B-A3B-Instruct-2507-FP8	FP8 (8-bit)	Single g6e	~75 tokens/sec	MoE, non-thinking
Qwen3-32B-FP8	FP8 (8-bit)	Single g6e	~15 tokens/sec	Dense, thinking & non-thinking

Model Characteristics

Fast Model: MoE architecture optimized for speed
Slow Model: Dense model with reasoning capabilities

Observability¶

Langfuse

LLM observability and analytics
Trace tracking and debugging
Cost and performance monitoring
Access: langfuse.<DOMAIN>

GUI Application¶

Open WebUI

Chat interface for LLM interaction
Document RAG capabilities
AI agent integration
Function/tool marketplace
Access: openwebui.<DOMAIN>

Vector Database¶

Qdrant

High-performance vector database
Used for RAG document embeddings
REST and gRPC API support

Embedding Models¶

Text Embedding Inference (TEI) with Qwen3-Embedding

Model	Quantization	Hardware
Qwen3-Embedding-4B	BF16 (16-bit)	Single r7i

MCP Server¶

Calculator MCP Server

Built with FastMCP 2.0
Basic calculator operations
Demonstrates MCP server implementation

AI Agent¶

Strands Calculator Agent

Built with Strands Agents framework
Stateful calculator with memory
Integrated with Open WebUI

Demo Setup¶

After running ./cli demo-setup, configure Open WebUI:

1. Access Open WebUI¶

Navigate to openwebui.<DOMAIN> in your browser.

2. Agent Functions (Auto-Registered)¶

Agent pipe functions are automatically registered when agents are installed:

./cli strands-agents calculator-agent install

The Strands Agents - Calculator Agent function appears in Open WebUI automatically.

3. Add Optional Functions¶

Add the Time Token Tracker function from the marketplace:

Navigate to Functions in Open WebUI
Search for "Time Token Tracker"
Click install

Time Token Tracker on Open WebUI →

See Open WebUI Functions documentation for more details.

4. Configure RAG Embedding Model¶

Set up Open WebUI to use the deployed Qwen3-Embedding model:

Navigate to Admin Panel → Settings → Documents
Set the embedding model endpoint
Get the API key from .env.local (check LITELLM_API_KEY)
Save configuration

RAG Embedding Support documentation →

Using the Demo¶

Chat with LLM Models¶

Interact with deployed models through Open WebUI:

Start a Conversation
- Select a model from the dropdown
- Type your message and press Enter
- Try both the fast and slow models
Compare Models
- Fast model (Qwen3-30B-A3B): Better for quick responses
- Slow model (Qwen3-32B): Better for complex reasoning

Chat Features Overview →

Model Selection

Use the fast model for general chat and the slow model when you need detailed reasoning or step-by-step thinking.

Document RAG¶

Use the RAG feature to chat with your documents:

Upload Documents
- Click the document icon in the chat interface
- Select files to upload (PDF, TXT, DOCX, etc.)
- Wait for embedding processing
Query Documents
- Ask questions about uploaded documents
- The system retrieves relevant chunks using Qdrant
- Responses are grounded in your documents
Manage Collections
- View and organize document collections
- Update or delete documents as needed

RAG Tutorial →

Calculator Agent¶

Explore the Strands Calculator Agent:

Select the Agent
- In Open WebUI, select Strands Agents - Calculator Agent
- The agent maintains memory across conversations

Perform Calculations

You: Add 15 and 27
Agent: The result is 42

You: Multiply that by 2
Agent: The result is 84

You: What's the current total?
Agent: 84

Reset Calculator

You: Reset the calculator
Agent: Calculator has been reset

Agent Features

Memory: Continues calculations across messages
Context Aware: Understands "that" and "the result"
Stateful: Maintains current value

Source Code: examples/strands-agents/calculator-agent/

LiteLLM Dashboard¶

Monitor and manage API requests:

Access Dashboard
- Navigate to litellm.<DOMAIN>/ui
- Credentials in .env.local:
  - Username: LITELLM_UI_USERNAME
  - Password: LITELLM_UI_PASSWORD
Features
- Request logs and metrics
- Model routing configuration
- Rate limiting and budgets
- API key management

LiteLLM Proxy Server documentation →

Request Flow

Open WebUI → LiteLLM Proxy → vLLM/Model

Langfuse Dashboard¶

Track LLM observability and analytics:

Access Dashboard
- Navigate to langfuse.<DOMAIN>
- Credentials in .env.local:
  - Email: LANGFUSE_USERNAME
  - Password: LANGFUSE_PASSWORD
Features
- Trace visualization
- Cost tracking
- Performance metrics
- User analytics
- Model comparisons
Integration
- LiteLLM automatically logs to Langfuse
- All requests from Open WebUI are tracked
- Agent interactions are captured

Langfuse Features →

Verification¶

Verify all components are running:

# Check pods
kubectl get pods -n litellm
kubectl get pods -n vllm
kubectl get pods -n langfuse
kubectl get pods -n openwebui
kubectl get pods -n qdrant
kubectl get pods -n tei

# Check services
kubectl get svc --all-namespaces

# Check ingress endpoints
kubectl get ingress --all-namespaces

All pods should show Running status.

Troubleshooting¶

Component Not Accessible¶

Check ingress and service configuration:

kubectl describe ingress <name> -n <namespace>
kubectl describe svc <name> -n <namespace>

Model Not Loading¶

Check pod logs:

kubectl logs -f deploy/vllm-<model> -n vllm

Common issues:

Insufficient GPU memory
Model download in progress
Incorrect model path

RAG Not Working¶

Verify embedding model:

kubectl get pods -n tei
kubectl logs -f deploy/tei-qwen3-embedding -n tei

Check Open WebUI embedding configuration in Admin Panel.

Next Steps¶

Explore different models and their capabilities
Build custom agents using the Strands Agents framework
Create MCP servers with FastMCP 2.0
Integrate with your own applications via LiteLLM API
Monitor costs and performance with Langfuse