This repo contains a collection of examples for LLM serving on Modal. For comparison across the various serving frameworks, a benchmarking setup heavily referencing vLLM's is also provided.
Currently, the following frameworks have been deployed and tested to be working via Modal Deployments.
| Framework | GitHub Repo | Modal Script |
| --- | --- | --- |
| vLLM | https://github.com/vllm-project/vllm | script |
| Text Generation Inference (TGI) | https://github.com/huggingface/text-generation-inference | script |
| LMDeploy | https://github.com/InternLM/lmdeploy | script |
To deploy the respective examples, you can set up the environment using the following commands.
This project uses uv for dependency management. To install uv, please refer to this guide:
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# With pip.
pip install uv
# With pipx.
pipx install uv
# With Homebrew.
brew install uv
# With Pacman.
pacman -S uv
To install the required dependencies:
# create a virtual env
uv venv
# install dependencies
uv pip install -r requirements.txt # Install from a requirements.txt file.
If you are looking to contribute to the repo, you will also need to install the pre-commit hooks so that your code changes are linted and formatted accordingly:
pip install pre-commit
pre-commit install
pre-commit install --hook-type commit-msg
To deploy on Modal, simply use the CLI and deploy the respective serving framework as desired.
For example, to deploy a vLLM server:
source .venv/bin/activate
modal deploy src/vllm/server.py
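For reference, below is a minimal sketch of what such a Modal server script could look like. It is not the repo's actual src/vllm/server.py: the model name, GPU type, volume name, and paths are illustrative assumptions, and only Modal's documented App/Image/Volume and @modal.web_server APIs are used. The function names mirror the download_hf_model and serve objects shown in the deployment output below.

```python
# Illustrative sketch only -- not the repo's actual src/vllm/server.py.
# The model name, volume name, GPU type, and local paths are placeholders.
import subprocess

import modal

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model for illustration

# Container image with vLLM and the Hugging Face downloader installed.
image = modal.Image.debian_slim().pip_install("vllm", "huggingface_hub")

# Persistent volume so model weights are downloaded only once.
volume = modal.Volume.from_name("hf-model-cache", create_if_missing=True)

app = modal.App("vllm-mistral-7b-instruct")


@app.function(image=image, volumes={"/models": volume}, timeout=30 * 60)
def download_hf_model():
    """Snapshot the model weights into the shared volume."""
    from huggingface_hub import snapshot_download

    snapshot_download(MODEL_NAME, local_dir="/models/mistral-7b-instruct")
    volume.commit()


@app.function(image=image, gpu="A100", volumes={"/models": volume})
@modal.web_server(port=8000, startup_timeout=300)
def serve():
    """Launch vLLM's OpenAI-compatible API server; Modal forwards HTTP traffic to port 8000."""
    subprocess.Popen(
        [
            "python",
            "-m",
            "vllm.entrypoints.openai.api_server",
            "--model",
            "/models/mistral-7b-instruct",
            "--port",
            "8000",
        ]
    )
```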
Upon successful deployment, you should see the following (or similar) information in your terminal:
┌───────────────────
│ 📁 ~/c/modal-llm-serving master [!]
└─❯ modal deploy src/vllm/server.py
✓ Created objects.
├── 🔨 Created mount /Users/xxx/code/modal-llm-serving/template_mistral_7b_instruct.jinja
├── 🔨 Created mount /Users/xxx/code/modal-llm-serving/src/vllm/server.py
├── 🔨 Created download_hf_model.
└── 🔨 Created serve => https://xxx--vllm-mistralai--mistral-7b-instruct-v02-serve.modal.run
✓ App deployed! 🎉
View Deployment:
https://modal.com/xxx/main/apps/deployed/vllm-mistralai--mistral-7b-instruct-v02
To access the respective Swagger UI, you can either access the serve URL directly or append /docs to the URL, depending on the serving framework.
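As a quick smoke test of a vLLM deployment, you can also call the OpenAI-compatible completions endpoint directly. The snippet below is a hedged example: the base URL is the placeholder from the deployment output above, the /v1/completions route comes from vLLM's OpenAI-compatible server (TGI and LMDeploy expose different routes), and the model id must match whatever the server was launched with.

```python
# Hypothetical smoke test against a vLLM deployment's OpenAI-compatible API.
# Replace BASE_URL with your own deployment URL; the route and model id are assumptions.
import requests

BASE_URL = "https://xxx--vllm-mistralai--mistral-7b-instruct-v02-serve.modal.run"

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model id
        "prompt": "Explain what Modal is in one sentence.",
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```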
To benchmark the deployed LLM inference servers, run the benchmark script as follows:
python benchmark/benchmark_server.py --backend vllm \
--model "mistralai--mistral-7b-instruct" \
--num-request 1000 \
--request-rate 64 \
--num-benchmark-runs 3 \
--max-input-len 1024 \
--max-output-len 1024 \
--base-url "https://xxx--vllm-mistralai--mistral-7b-instruct-v02-serve.modal.run"
Important

NOTE: Replace the --base-url with your own deployment URL, as indicated upon successful deployment with modal deploy.
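For context on the --request-rate flag: vLLM-style serving benchmarks typically model request arrivals as a Poisson process, sampling exponential inter-arrival times at the target rate. The sketch below illustrates that idea; it is an assumption about how this repo's benchmark_server.py schedules requests, not its actual code.

```python
# Sketch of Poisson-process request scheduling, as used by vLLM-style benchmarks.
# Illustrates the meaning of --request-rate; the repo's script may differ.
import asyncio
import random


async def send_request(i: int) -> None:
    # Placeholder for the real HTTP call to the deployed server.
    print(f"sending request {i}")


async def run(num_requests: int, request_rate: float) -> None:
    tasks = []
    for i in range(num_requests):
        tasks.append(asyncio.create_task(send_request(i)))
        # Exponential inter-arrival times => Poisson arrivals at `request_rate` req/s.
        await asyncio.sleep(random.expovariate(request_rate))
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(run(num_requests=1000, request_rate=64.0))
```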