SGLang¶

Fast LLM serving engine with RadixAttention for automatic KV cache reuse. Optimized for multi-turn conversations and structured generation workloads.


Category	llm-model
Official Docs	SGLang Documentation
CLI Install	`./cli llm-model sglang install`
CLI Uninstall	`./cli llm-model sglang uninstall`
Namespace	`sglang`

Overview¶

SGLang (Structured Generation Language) is a high-performance LLM serving engine featuring:

- RadixAttention: Automatic KV cache reuse via radix tree structure
- Structured Generation: JSON schema-constrained outputs
- Multi-Turn Optimization: Efficient caching for conversational workloads
- Continuous Batching: Dynamic request batching for higher throughput
- Tensor Parallel: Distributed inference across multiple GPUs

Installation¶

./cli llm-model sglang install

The installer:

1. Prompts for model selection from config.json
2. Downloads model to PVC (if not cached)
3. Deploys SGLang server with RadixAttention enabled
4. Exposes OpenAI-compatible API endpoint

Model Selection¶

Pre-configured models in config.json:

{
  "llm-model": {
    "sglang": {
      "models": [
        { "name": "qwen3-30b-instruct-fp8", "deploy": true },
        { "name": "qwen3-32b-fp8", "deploy": true },
        { "name": "gpt-oss-20b", "deploy": false }
      ]
    }
  }
}
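To see which models such a configuration selects, the structure can be read with a few lines of Python. This is a minimal sketch, assuming the `deploy` flag marks the models the installer should deploy; the helper `models_to_deploy` is hypothetical and not part of the CLI.

```python
import json

# Same structure as the config.json excerpt, inlined so the example is self-contained.
CONFIG_TEXT = """
{
  "llm-model": {
    "sglang": {
      "models": [
        { "name": "qwen3-30b-instruct-fp8", "deploy": true },
        { "name": "qwen3-32b-fp8", "deploy": true },
        { "name": "gpt-oss-20b", "deploy": false }
      ]
    }
  }
}
"""


def models_to_deploy(config: dict) -> list[str]:
    """Return the names of SGLang models whose "deploy" flag is true."""
    models = config.get("llm-model", {}).get("sglang", {}).get("models", [])
    return [m["name"] for m in models if m.get("deploy", False)]


config = json.loads(CONFIG_TEXT)
print(models_to_deploy(config))  # ['qwen3-30b-instruct-fp8', 'qwen3-32b-fp8']
```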

Verification¶

# Check SGLang pods
kubectl get pods -n sglang

# Check service
kubectl get svc -n sglang

# Port-forward for testing
kubectl port-forward svc/sglang 8000:8000 -n sglang --address 0.0.0.0 &

# List models
curl http://localhost:8000/v1/models

# Test chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-instruct-fp8",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 200
  }'

Configuration¶

RadixAttention¶

RadixAttention is enabled by default and automatically caches KV states in a radix tree structure. This provides:

- Automatic prefix matching: Shared prompt prefixes reuse cached KV blocks
- Multi-turn efficiency: Conversation history cached without manual management
- Zero configuration: No need to specify cache keys or session IDs
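The prefix-matching idea can be illustrated with a toy cache. The sketch below uses a plain token trie over whitespace-split words, which is a simplification: SGLang's real radix tree compresses edges, stores KV tensors, and evicts entries, none of which is modeled here. The class name `PrefixCache` is hypothetical.

```python
class PrefixCache:
    """Toy prefix cache: a plain token trie, not SGLang's actual radix tree."""

    def __init__(self) -> None:
        self.root: dict = {}

    def match_prefix(self, tokens: list[str]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[str]) -> None:
        """Record a token sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system = "You are a helpful assistant .".split()
turn1 = system + "What is the capital of France ?".split()
turn2 = turn1 + "Paris . What is its population ?".split()

for request in (turn1, turn2):
    reused = cache.match_prefix(request)
    print(f"reused {reused} of {len(request)} tokens")
    cache.insert(request)
```

The first request finds nothing cached; the second reuses every token of the first, so only the new turn would need prefill work.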

SGLang Arguments¶

Common arguments configured via installer:

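# Tensor parallel size (number of GPUs the model is sharded across)
--tp 2

# Fraction of GPU memory reserved for static allocation (model weights + KV cache pool)
--mem-fraction-static 0.85

# Maximum context length in tokens
--context-length 8192

# Disable RadixAttention (not recommended)
--disable-radix-cache

# Maximum number of tokens per chunk when prefilling long prompts
--chunked-prefill-size 4096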

Advanced Features¶

Structured Generation¶

Constrain outputs to a JSON schema:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-instruct-fp8",
    "messages": [
      {"role": "user", "content": "Generate a person profile"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "occupation": {"type": "string"}
          },
          "required": ["name", "age"]
        }
      }
    }
  }'
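On the client side it can still be worth checking the reply before using it. The following is a minimal sketch using only the standard library; it checks required keys and basic types by hand rather than performing full JSON Schema validation. The helper `check_person` and the sample reply are hypothetical.

```python
import json

# Same schema as in the request above.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "occupation": {"type": "string"},
    },
    "required": ["name", "age"],
}

# Only the two JSON types used by this schema are handled.
TYPE_MAP = {"string": str, "integer": int}


def check_person(raw: str) -> dict:
    """Parse a reply and verify required keys and basic field types."""
    person = json.loads(raw)
    for key in PERSON_SCHEMA["required"]:
        if key not in person:
            raise ValueError(f"missing required field: {key}")
    for key, value in person.items():
        spec = PERSON_SCHEMA["properties"].get(key)
        if spec is None:
            continue  # fields not named in the schema are ignored here
        expected = TYPE_MAP[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {key!r} is not of type {spec['type']}")
    return person


# Hypothetical reply content, not real model output.
sample_reply = '{"name": "Ada", "age": 36, "occupation": "engineer"}'
print(check_person(sample_reply))
```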

Regex-Constrained Generation¶

Force outputs to match regex patterns:

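import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "text": "Generate a phone number: ",
        "sampling_params": {
            "max_new_tokens": 20,
            "regex": r"\d{3}-\d{3}-\d{4}"  # Format: 123-456-7890
        }
    },
    timeout=60,
)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json()["text"])  # the generated completion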
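A quick client-side check that a completion satisfies the constraint can look like the sketch below. It assumes only the generated completion (without the prompt) is passed in; the helper name `satisfies_regex` is hypothetical.

```python
import re

PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"  # same pattern as in the request above


def satisfies_regex(completion: str, pattern: str = PHONE_PATTERN) -> bool:
    """True if the whole completion matches the pattern, not just a substring."""
    return re.fullmatch(pattern, completion) is not None


print(satisfies_regex("123-456-7890"))       # True
print(satisfies_regex("call 123-456-7890"))  # False
```

`re.fullmatch` is used instead of `re.search` because a constrained completion should consist of the pattern and nothing else.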

Multi-Turn Conversations¶

RadixAttention automatically optimizes multi-turn conversations:

import openai

client = openai.OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# Turn 1
response = client.chat.completions.create(
    model="qwen3-30b-instruct-fp8",
    messages=messages
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Turn 2 - conversation history automatically cached
messages.append({"role": "user", "content": "What is its population?"})
response = client.chat.completions.create(
    model="qwen3-30b-instruct-fp8",
    messages=messages
)
print(response.choices[0].message.content)

Model Management¶

Add New Model¶

Edit config.json:

{
  "llm-model": {
    "sglang": {
      "models": [
        { "name": "my-new-model", "deploy": true }
      ]
    }
  }
}

Create model config at components/llm-model/sglang/models/my-new-model.yaml:

modelId: "organization/model-name"
servedModelName: "my-new-model"
tensorParallelSize: 1
memFractionStatic: 0.85
contextLength: 8192
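For orientation, these fields plausibly correspond to `sglang.launch_server` flags. The sketch below shows one possible mapping; how the installer actually translates the YAML is an assumption here, and the helper `sglang_args` is hypothetical.

```python
def sglang_args(model: dict) -> list[str]:
    """Translate model config fields into sglang.launch_server flags (assumed mapping)."""
    return [
        "--model-path", model["modelId"],
        "--served-model-name", model["servedModelName"],
        "--tp", str(model["tensorParallelSize"]),
        "--mem-fraction-static", str(model["memFractionStatic"]),
        "--context-length", str(model["contextLength"]),
    ]


# The YAML above, written as a Python dict to avoid a YAML dependency.
model = {
    "modelId": "organization/model-name",
    "servedModelName": "my-new-model",
    "tensorParallelSize": 1,
    "memFractionStatic": 0.85,
    "contextLength": 8192,
}

print(" ".join(["python", "-m", "sglang.launch_server", *sglang_args(model)]))
```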

Reinstall:
```
./cli llm-model sglang install
```

Integration with LiteLLM¶

SGLang endpoints are automatically discovered by LiteLLM:

model_list:
  - model_name: qwen3-30b-instruct-fp8-sglang
    litellm_params:
      model: openai/qwen3-30b-instruct-fp8
      api_base: http://sglang.sglang:8000/v1
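If entries ever need to be written by hand for several models, they can be generated from the served model names. This sketch assumes the naming pattern seen in the single example above (`<name>-sglang` for the LiteLLM name, `openai/<name>` for the backend model); the helper `litellm_entry` is hypothetical.

```python
import json


def litellm_entry(served_name: str,
                  api_base: str = "http://sglang.sglang:8000/v1") -> dict:
    """Build one model_list entry following the pattern of the example above."""
    return {
        "model_name": f"{served_name}-sglang",
        "litellm_params": {
            "model": f"openai/{served_name}",
            "api_base": api_base,
        },
    }


model_list = [litellm_entry(n) for n in ("qwen3-30b-instruct-fp8", "qwen3-32b-fp8")]
print(json.dumps({"model_list": model_list}, indent=2))
```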

Performance Tuning¶

Memory Optimization¶

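# Reduce static memory fraction (weights + KV cache pool) to leave more headroom
# for activations and other runtime allocations; this shrinks the KV cache pool
--mem-fraction-static 0.80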

# Smaller context length
--context-length 4096
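The effect of these two settings can be estimated with rough arithmetic. The sketch below assumes a single GPU and that the static fraction covers model weights plus the KV cache pool, as the SGLang documentation describes for `--mem-fraction-static`; it ignores allocator overhead. All model and GPU numbers are illustrative, not measurements of any listed model.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes needed for one token across all layers."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_pool_tokens(gpu_mem_gib: float, mem_fraction_static: float,
                   weights_gib: float, bytes_per_token: int) -> int:
    """Approximate number of tokens that fit in the KV cache pool."""
    static_budget_gib = gpu_mem_gib * mem_fraction_static
    pool_bytes = max(static_budget_gib - weights_gib, 0.0) * 1024**3
    return int(pool_bytes // bytes_per_token)


# Illustrative values: 48 layers, 4 KV heads, head_dim 128, 16-bit KV cache,
# an 80 GiB GPU and 30 GiB of weights.
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)
for fraction in (0.85, 0.80):
    tokens = kv_pool_tokens(80, fraction, 30, per_token)
    print(f"mem-fraction-static={fraction}: ~{tokens} cacheable tokens")
```

Under these assumptions a lower fraction yields a smaller token pool, which is the price paid for the extra runtime headroom.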

Throughput Optimization¶

# Enable chunked prefill for long prompts
--chunked-prefill-size 8192

# Larger batch size (more memory usage)
--max-running-requests 512

Latency Optimization¶

RadixAttention automatically reduces time to first token (TTFT) for prompts that share a cached prefix. For first-time prompts:

# Enable chunked prefill
--chunked-prefill-size 4096
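To make the chunk size concrete, the sketch below shows how a long prompt would be split when prefill is limited to a fixed number of tokens per step. It models only the splitting, not how the scheduler interleaves chunks with decoding; the helper `prefill_chunks` is hypothetical.

```python
def prefill_chunks(prompt_len: int, chunk_size: int) -> list[int]:
    """Split a prompt of prompt_len tokens into chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    full, rest = divmod(prompt_len, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


print(prefill_chunks(10000, 4096))  # [4096, 4096, 1808]
print(prefill_chunks(10000, 8192))  # [8192, 1808]
```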

Troubleshooting¶

OOM (Out of Memory)¶

# Reduce memory fraction
--mem-fraction-static 0.75

# Reduce context length
--context-length 4096

# Shard the model across more GPUs if available (a smaller TP size raises per-GPU memory use)
--tp 2

Cache not working¶

# Confirm the server was NOT started with this flag (it disables RadixAttention)
--disable-radix-cache

# Check logs for cache hit statistics (exact log wording depends on the SGLang version)
kubectl logs -n sglang -l app=sglang | grep -iE "cache.hit|cached-token"

Model download fails¶

# Check PVC exists and is bound
kubectl get pvc -n sglang

# Check HuggingFace token
kubectl get secret -n sglang huggingface-secret

# Manual download: open a shell in the pod, then run the download inside it
kubectl exec -it -n sglang <sglang-pod> -- bash
huggingface-cli download <model-id> --local-dir /models/<model-name>