Ollama Integration
This guide covers how to integrate EMD-deployed models with Ollama, an open-source framework for running large language models locally.
Overview
Ollama is a popular tool that allows you to run large language models locally on your own hardware. It provides a simple way to download, run, and manage various open-source models. By integrating EMD-deployed models with Ollama, you can create a hybrid setup that leverages both local models and your custom cloud-deployed models.
With Ollama integration, you can:

- Use both local models and EMD-deployed models in your applications
- Create fallback mechanisms between local and cloud models
- Compare performance between local and cloud-deployed versions
- Develop applications that work both online and offline
- Optimize for cost, performance, or privacy based on specific needs
Key Features of Ollama
- Local Model Execution: Run models on your own hardware
- Simple API: Easy-to-use REST API for model interaction (see the example after this list)
- Model Library: Access to various open-source models
- Customization: Create and customize model configurations
- Cross-platform: Available for macOS, Windows, and Linux
- Low Resource Usage: Optimized for running on consumer hardware
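
To illustrate the "Simple API" point, the minimal sketch below lists the models available to a local Ollama server by calling its `/api/tags` endpoint. It assumes Ollama is running on its default port (11434):

```python
import requests

# List the models currently available to the local Ollama server.
# Assumes Ollama is running on its default port (11434).
response = requests.get("http://localhost:11434/api/tags")
response.raise_for_status()

for model in response.json().get("models", []):
    print(model["name"])
```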
Integrating EMD Models with Ollama
There are several ways to integrate EMD-deployed models with Ollama; this guide covers two common approaches:
1. API Orchestration
You can build an orchestration layer that routes requests between Ollama's API and your EMD-deployed model's API based on specific criteria.
Prerequisites
- You have successfully deployed a model using EMD with the OpenAI Compatible API enabled
- You have installed Ollama on your local machine
- You have the base URL and API key for your deployed EMD model
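
For the examples below, it helps to keep the EMD connection details in one place. The following is a minimal sketch; the variable names and default values are placeholders, and you should substitute the base URL, API key, and model ID from your own deployment:

```python
import os

# Placeholder values - replace with the details of your EMD deployment.
EMD_BASE_URL = os.environ.get(
    "EMD_BASE_URL",
    "https://your-endpoint.execute-api.region.amazonaws.com/v1",
)
EMD_API_KEY = os.environ.get("EMD_API_KEY", "your-api-key")
EMD_MODEL_ID = os.environ.get("EMD_MODEL_ID", "your-deployed-model-id")

# Ollama's default local endpoint.
OLLAMA_BASE_URL = "http://localhost:11434"
```

Reading these values from environment variables keeps credentials out of source code; the examples below inline placeholder values for brevity.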
Implementation Example
Here's a simple Python example that routes requests between Ollama and an EMD-deployed model:
```python
import requests


def generate_text(prompt, use_local=True, max_tokens=100):
    """
    Generate text using either Ollama (local) or an EMD-deployed model (cloud).

    Args:
        prompt (str): The input prompt
        use_local (bool): Whether to use the local Ollama model
        max_tokens (int): Maximum tokens to generate

    Returns:
        str: Generated text
    """
    if use_local:
        # Use Ollama API (local). "stream": False returns a single JSON response,
        # and Ollama takes the token limit as the "num_predict" option.
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3",  # or any other model you have pulled
                "prompt": prompt,
                "stream": False,
                "options": {"num_predict": max_tokens},
            },
        )
        return response.json().get("response", "")
    else:
        # Use EMD-deployed model API (cloud, OpenAI-compatible)
        response = requests.post(
            "https://your-endpoint.execute-api.region.amazonaws.com/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer your-api-key",
            },
            json={
                "model": "your-deployed-model-id",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        )
        return (
            response.json()
            .get("choices", [{}])[0]
            .get("message", {})
            .get("content", "")
        )


# Example usage
result = generate_text("Explain quantum computing", use_local=True)
print(result)

result = generate_text("Explain quantum computing", use_local=False)
print(result)
```
2. Fallback Mechanism
You can implement a fallback mechanism that tries the local Ollama model first and falls back to the EMD-deployed model if the local model fails or produces unsatisfactory results.
```python
import requests


def generate_with_fallback(prompt, max_tokens=100):
    """
    Try the local model first, fall back to the cloud model if needed.

    Args:
        prompt (str): The input prompt
        max_tokens (int): Maximum tokens to generate

    Returns:
        dict: The source that answered ("local", "cloud", or "none") and the generated text
    """
    try:
        # Try Ollama first
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3",
                "prompt": prompt,
                "stream": False,
                "options": {"num_predict": max_tokens},
            },
            timeout=5,  # Set a timeout for the local model
        )
        if response.status_code == 200:
            result = response.json().get("response", "")
            if result and len(result) > 20:  # Simple quality check
                return {"source": "local", "text": result}
    except Exception as e:
        print(f"Local model error: {e}")

    # Fall back to the EMD-deployed model
    try:
        response = requests.post(
            "https://your-endpoint.execute-api.region.amazonaws.com/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer your-api-key",
            },
            json={
                "model": "your-deployed-model-id",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        )
        if response.status_code == 200:
            result = (
                response.json()
                .get("choices", [{}])[0]
                .get("message", {})
                .get("content", "")
            )
            return {"source": "cloud", "text": result}
    except Exception as e:
        print(f"Cloud model error: {e}")

    return {"source": "none", "text": "Failed to generate response from both local and cloud models."}
```
Example Use Cases
With your EMD models integrated with Ollama, you can build various applications:
- Hybrid AI Applications: Applications that use local models for basic tasks and cloud models for more complex tasks (see the routing sketch after this list)
- Offline-First Applications: Applications that work offline with local models but enhance capabilities when online
- Cost-Optimized Solutions: Use local models for frequent, simple queries and cloud models for important or complex queries
- Privacy-Focused Applications: Process sensitive data locally and only use cloud models for non-sensitive data
- Development and Testing: Use local models during development and testing, and cloud models in production
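
As a starting point for the hybrid and cost-optimized patterns above, the sketch below routes requests with a deliberately naive heuristic (prompt length). The threshold is arbitrary and the `generate_text` helper from the earlier example is assumed; in practice you would route on task type, data sensitivity, user tier, or observed latency instead:

```python
def route_request(prompt, max_tokens=100, length_threshold=200):
    """Send short prompts to the local model and longer ones to the cloud model.

    The length threshold is purely illustrative; replace it with routing
    criteria that match your application (task type, privacy, latency, ...).
    """
    use_local = len(prompt) < length_threshold
    return generate_text(prompt, use_local=use_local, max_tokens=max_tokens)


# Short prompt -> local Ollama model; long prompt -> EMD-deployed model.
print(route_request("Give me a one-line summary of HTTP."))
```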
Troubleshooting
If you encounter issues with the integration:
- Verify that Ollama is running locally (`ollama list` should show available models)
- Check that your EMD model is properly deployed and accessible (see the health-check sketch after this list)
- Ensure API endpoints and authentication details are correct
- Check network connectivity if using cloud models
- Monitor resource usage if local models are running slowly
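
The sketch below automates the first few checks. It assumes Ollama's `/api/tags` endpoint and an OpenAI-compatible `/v1/models` listing on the EMD side (whether your deployment exposes the latter depends on its configuration); the EMD URL and API key are placeholders:

```python
import requests

EMD_BASE_URL = "https://your-endpoint.execute-api.region.amazonaws.com/v1"  # placeholder
EMD_API_KEY = "your-api-key"  # placeholder


def check_endpoints():
    """Print a simple reachability report for the local and cloud endpoints."""
    # Local: Ollama lists its installed models at /api/tags.
    try:
        r = requests.get("http://localhost:11434/api/tags", timeout=3)
        names = [m["name"] for m in r.json().get("models", [])]
        print(f"Ollama reachable, models: {names}")
    except Exception as e:
        print(f"Ollama not reachable: {e}")

    # Cloud: many OpenAI-compatible APIs expose a model listing at /v1/models.
    try:
        r = requests.get(
            f"{EMD_BASE_URL}/models",
            headers={"Authorization": f"Bearer {EMD_API_KEY}"},
            timeout=10,
        )
        print(f"EMD endpoint responded with HTTP {r.status_code}")
    except Exception as e:
        print(f"EMD endpoint not reachable: {e}")


check_endpoints()
```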