Local Model Deployment Guide

This guide provides detailed instructions for deploying models locally using Easy Model Deployer (EMD).

Local Deployment on EC2 Instance

For deploying models using local GPU resources:

Prerequisites

It is recommended to launch an EC2 instance using the AMI "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6 (Ubuntu 22.04)".
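
If you prefer to script instance creation, the sketch below uses the AWS CLI to look up that AMI by name and launch a GPU instance from it. The instance type, key pair, and security group are placeholders chosen for illustration (g5.12xlarge is simply an example of a 4-GPU instance); pick whatever fits the model you plan to deploy.

# Find the latest AMI matching the recommended name in your region
aws ec2 describe-images --owners amazon \
  --filters "Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6 (Ubuntu 22.04)*" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text

# Launch a GPU instance from that AMI (replace the placeholders with your own values)
aws ec2 run-instances --image-id <AMI_ID_FROM_ABOVE> \
  --instance-type g5.12xlarge \
  --key-name <YOUR_KEY_PAIR> \
  --security-group-ids <YOUR_SECURITY_GROUP_ID>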

Deployment Command

Deploy with:

emd deploy --allow-local-deploy

This starts an interactive flow that prompts for the model series, model name, service type, GPU IDs, and inference engine, as shown in the example configurations below.

Command Line Deployment

You can deploy models directly with command-line parameters:

emd deploy --model-series llama --model-name llama-3.3-70b-instruct-awq --service Local --gpu-ids 0,1,2,3
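
Assuming the standard EMD CLI commands, you can check on the deployment task afterwards with:

# Lists deployment tasks and their current status
emd status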

Additional Parameters

You can provide additional parameters as a JSON string:

emd deploy --extra-params '{"engine_params":{"api_key":"YOUR_API_KEY", "default_cli_args": "--max-total-tokens 30000 --max-concurrent-requests 30"}}'

Or from a file:

emd deploy --extra-params path/to/params.json
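
For example, a params.json mirroring the inline JSON string above could be created and used like this (the api_key value is a placeholder):

# Write the extra parameters to a file, then pass its path to emd deploy
cat > params.json <<'EOF'
{
  "engine_params": {
    "api_key": "YOUR_API_KEY",
    "default_cli_args": "--max-total-tokens 30000 --max-concurrent-requests 30"
  }
}
EOF
emd deploy --extra-params params.json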

The extra parameters use the following format:

{
  "model_params": {
  },
  "service_params": {
  },
  "instance_params": {
  },
  "engine_params": {
    "cli_args": "<command line arguments for the current engine>"
  },
  "framework_params": {
    "uvicorn_log_level": "info",
    "limit_concurrency": 200
  }
}
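
As a sketch, the parameter groups can also be combined in a single deployment; the values here are taken from the examples above and are only illustrative:

emd deploy --extra-params '{"engine_params":{"default_cli_args":"--max-total-tokens 30000 --max-concurrent-requests 30"},"framework_params":{"uvicorn_log_level":"info","limit_concurrency":200}}'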

Common Model Configurations

Non-reasoning Models

Qwen2.5-72B-Instruct-AWQ

? Select the model series: qwen2.5
? Select the model name: Qwen2.5-72B-Instruct-AWQ
? Select the service for deployment: Local
? input the local gpu ids to deploy the model (e.g. 0,1,2): 0,1,2,3
? Select the inference engine to use: tgi
? (Optional) Additional deployment parameters: {"engine_params":{"api_key":"<YOUR_API_KEY>", "default_cli_args": "--max-total-tokens 30000 --max-concurrent-requests 30"}}
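
Once the model reports ready, a quick smoke test can be run against the engine's OpenAI-compatible chat endpoint (both tgi and vllm expose one). The host and port below are assumptions for a local deployment; substitute the endpoint address reported in your deployment output, the api_key you set in engine_params, and the model ID reported by the CLI.

# Hypothetical local endpoint; adjust host, port, model ID, and key to your deployment
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model": "Qwen2.5-72B-Instruct-AWQ", "messages": [{"role": "user", "content": "Hello"}]}'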

llama-3.3-70b-instruct-awq

? Select the model series: llama
? Select the model name: llama-3.3-70b-instruct-awq
? Select the service for deployment: Local
? input the local gpu ids to deploy the model (e.g. 0,1,2): 0,1,2,3
? Select the inference engine to use: tgi
? (Optional) Additional deployment parameters: {"engine_params":{"api_key":"<YOUR_API_KEY>", "default_cli_args": "--max-total-tokens 30000 --max-concurrent-requests 30"}}

Reasoning Models

DeepSeek-R1-Distill-Qwen-32B

? Select the model series: deepseek reasoning model
? Select the model name: DeepSeek-R1-Distill-Qwen-32B
? Select the service for deployment: Local
? input the local gpu ids to deploy the model (e.g. 0,1,2): 0,1,2,3
? Select the inference engine to use: vllm
? (Optional) Additional deployment parameters: {"engine_params":{"api_key":"<YOUR_API_KEY>", "default_cli_args": "--enable-reasoning --reasoning-parser deepseek_r1 --max_model_len 16000 --disable-log-stats --chat-template emd/models/chat_templates/deepseek_r1_distill.jinja --max_num_seq 20 --gpu_memory_utilization 0.9"}}
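
With --enable-reasoning and --reasoning-parser deepseek_r1, recent vLLM versions separate the model's chain of thought into a reasoning_content field alongside the final content in the chat completion response. A rough check (host, port, model ID, and key are placeholders, as above) might look like:

# Prints the reasoning trace and the final answer from a hypothetical local endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-32B", "messages": [{"role": "user", "content": "What is 17 * 24?"}]}' \
  | jq '.choices[0].message | {reasoning_content, content}'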

deepseek-r1-distill-llama-70b-awq

? Select the model series: deepseek reasoning model
? Select the model name: deepseek-r1-distill-llama-70b-awq
? Select the service for deployment: Local
? input the local gpu ids to deploy the model (e.g. 0,1,2): 0,1,2,3
? Select the inference engine to use: tgi
? (Optional) Additional deployment parameters: {"engine_params":{"api_key":"<YOUR_API_KEY>", "default_cli_args": "--max-total-tokens 30000 --max-concurrent-requests 30"}}

Tips for Local Deployment

  • When you see "Waiting for model: ...", it means the deployment task has started. You can press Ctrl+C to stop the terminal output without affecting the deployment.
  • For multi-GPU deployments, ensure all specified GPUs are available and have sufficient memory.
  • Monitor GPU usage with tools like nvidia-smi during deployment and inference; see the sketch after this list.
  • For optimal performance, consider the recommended GPU memory requirements for each model.
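
A simple way to watch per-GPU memory and utilization while the model loads and serves requests:

# Refresh GPU stats every 5 seconds; limit to the GPUs used in the deployment if desired
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv -l 5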

Advanced Options

For more detailed information on:

  • Advanced deployment parameters: See Best Deployment Practices
  • Architecture details: See Architecture
  • Supported models: See Supported Models