Note
Triton CLI is currently in BETA. Its features and functionality are likely to change as we collect feedback. We're excited to hear any thoughts you have (especially if you find the tool useful) and what features you'd like to see!
Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.
| Pre-requisites | Installation | Quickstart | Serving LLM Models | Serving a vLLM Model | Serving a TRT-LLM Model | Additional Dependencies for Custom Environments | Known Limitations |
When using Triton and related tools on your host (outside of a Triton container
image) there are a number of additional dependencies that may be required for
various workflows. Most system dependency issues can be resolved by installing
and running the CLI from within the latest corresponding tritonserver
container image, which should have all necessary system dependencies installed.
For vLLM and TRT-LLM, you can use their respective images:
nvcr.io/nvidia/tritonserver:{YY.MM}-vllm-python-py3
nvcr.io/nvidia/tritonserver:{YY.MM}-trtllm-python-py3
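As a small illustration of how the `{YY.MM}` placeholder is filled in, the sketch below builds both image names from a single release variable. The variable names are made up for this example, and `24.02` is just the release used elsewhere in this document; substitute the release you actually need.

```shell
# Hypothetical variable names; 24.02 is only an example release.
TRITON_VERSION="24.02"   # replaces the {YY.MM} placeholder
VLLM_IMAGE="nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-vllm-python-py3"
TRTLLM_IMAGE="nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-trtllm-python-py3"

# Print the resolved image names.
echo "$VLLM_IMAGE"
echo "$TRTLLM_IMAGE"

# To fetch one of them (requires Docker):
# docker pull "$VLLM_IMAGE"
```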
If you decide to run the CLI on the host or in a custom image, see the list of additional dependencies for custom environments later in this document.
Currently, Triton CLI can only be installed from source, with plans to host a pip wheel soon. When installing Triton CLI, please be aware of the versioning matrix below:
Triton CLI Version | TRT-LLM Version | Triton Container Tag |
---|---|---|
0.0.6 | v0.8.0 | 24.02 |
0.0.5 | v0.7.1 | 24.01 |
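If you script your setup, one way to keep the CLI version and container tag in step is a small lookup that mirrors the matrix above. This is only a sketch: `container_tag_for_cli` is a hypothetical helper, not part of Triton CLI, and it knows only the two rows listed here.

```shell
# Hypothetical helper: map a Triton CLI version to the container tag
# from the versioning matrix. Unknown versions return a non-zero status.
container_tag_for_cli() {
    case "$1" in
        0.0.6) echo "24.02" ;;
        0.0.5) echo "24.01" ;;
        *) echo "unknown CLI version: $1" >&2; return 1 ;;
    esac
}

CLI_VERSION="0.0.6"
TAG=$(container_tag_for_cli "$CLI_VERSION")
echo "Triton CLI ${CLI_VERSION} pairs with container tag ${TAG}"
```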
Install the latest version from the main branch:
pip install git+https://github.com/triton-inference-server/triton_cli.git
It is also possible to install from a specific branch name, commit hash, or tag name. For example, to install triton_cli with tag 0.0.6:
GIT_REF="0.0.6"
pip install git+https://github.com/triton-inference-server/triton_cli.git@${GIT_REF}
# Clone repo for development/contribution
git clone https://github.com/triton-inference-server/triton_cli.git
cd triton_cli
# Should be pointing at directory containing pyproject.toml
pip install .
The instructions below outline the process of deploying a simple gpt2 model using Triton's vLLM backend.
If you are not in an environment where the tritonserver executable is present, Triton CLI will automatically generate and run a custom image capable of serving the model. This behavior is subject to change.
Note
triton start is a blocking command and will stream server logs to the current shell. To interact with the running server, you will need to start a separate shell and, if the server is running in a container, docker exec into that container.
# Explore the commands
triton -h
# Add a vLLM model to the model repository, downloaded from HuggingFace
triton import -m gpt2
# Start server pointing at the default model repository
triton start
# Infer with CLI
triton infer -m gpt2 --prompt "machine learning is"
# Infer with curl using the generate endpoint
curl -X POST localhost:8000/v2/models/gpt2/generate -d '{"text_input": "machine learning is", "max_tokens": 128}'
# Profile model with Perf Analyzer
triton profile -m gpt2
Triton CLI is particularly well suited to simplifying the workflow for deploying and interacting with LLMs. The steps below illustrate how to serve a vLLM or TRT-LLM model from scratch in minutes.
Note
Usage of llama-2-7b requires authentication with Hugging Face, either through huggingface-cli login or by setting the HF_TOKEN environment variable.
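A quick pre-flight check can save a failed download. The sketch below only tests whether `HF_TOKEN` is present in the environment; `have_hf_token` is a hypothetical helper, and it cannot tell whether the token is valid or whether you have already logged in interactively.

```shell
# Hypothetical helper: succeed if HF_TOKEN is set to a non-empty value.
have_hf_token() {
    [ -n "${HF_TOKEN:-}" ]
}

if have_hf_token; then
    echo "HF_TOKEN is set"
else
    echo "HF_TOKEN is not set; importing llama-2-7b may fail without authentication" >&2
fi
```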
The following models have currently been tested for vLLM through the CLI:
gpt2
llama-2-7b
opt125m
mistral-7b
falcon-7b
# Generate a Triton model repository containing a vLLM model config
triton remove -m all
triton import -m gpt2 --backend vllm
# Start Triton pointing at the default model repository
triton start
# Interact with model
triton infer -m gpt2 --prompt "machine learning is"
# Profile model with Perf Analyzer
triton profile -m gpt2
Note
By default, TRT-LLM engines are generated in /tmp/engines/{model_name}, such as /tmp/engines/gpt2. They are intentionally kept outside of the model repository to improve re-usability across examples and repositories. This default location is subject to change, but can be customized with the ENGINE_DEST_PATH environment variable.
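For example, to keep engines somewhere more durable than /tmp, you could export the variable before importing a model. This is a sketch: the directory below is an arbitrary choice for illustration, and the exact layout the CLI creates underneath it is not specified here.

```shell
# Hypothetical location; any writable directory should work.
export ENGINE_DEST_PATH="$HOME/trtllm_engines"
mkdir -p "$ENGINE_DEST_PATH"
echo "TRT-LLM engines will be written under: $ENGINE_DEST_PATH"

# Then build as usual in the same shell:
# triton import -m gpt2 --backend tensorrtllm
```

If you run the CLI in a container, remember to mount this directory instead of /tmp so the engines survive across containers.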
The model configurations generated by the CLI prioritize accessibility over performance. As such, the default number of model instances for each model is set to 1. This value can be tuned manually after generation by modifying the instance_group field in each model's config.pbtxt file. Increasing the instance count may improve performance, especially for large batch sizes. For more information, see the instance groups section of the Triton model configuration documentation.
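To make the instance_group edit concrete, the sketch below writes a throwaway sample config.pbtxt and raises its instance count from 1 to 2. The sample file contents are an assumption for illustration; the config the CLI generates for your model may contain different fields, so inspect it before editing.

```shell
# Create a throwaway sample config; a real one lives in your model repository.
workdir=$(mktemp -d)
cat > "$workdir/config.pbtxt" <<'EOF'
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
EOF

# Raise the instance count from 1 to 2, writing to a new file first
# so the original is only replaced if sed succeeds.
sed 's/count: 1/count: 2/' "$workdir/config.pbtxt" > "$workdir/config.pbtxt.new" \
    && mv "$workdir/config.pbtxt.new" "$workdir/config.pbtxt"

cat "$workdir/config.pbtxt"
```

Restart the server after changing a config so the new instance count takes effect.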
(Optional) If you don't want to install TRT-LLM dependencies on the host, you can run the instructions below inside a container launched with the following command:
# NOTE: Mounting the huggingface cache is optional, but will allow saving and
# re-using downloaded huggingface models across different runs and containers.
# NOTE: Mounting /tmp is also optional, but will allow the saving and re-use of
# TRT-LLM engines across different containers. This assumes the value of
# `ENGINE_DEST_PATH` has not been modified.
docker run -ti \
--gpus all \
--network=host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
Install the TRT-LLM dependencies:
# Install TRT LLM building dependencies
pip install \
"psutil" \
"pynvml>=11.5.0" \
"torch==2.1.2" \
"tensorrt_llm==0.8.0" --extra-index-url https://pypi.nvidia.com/
The following models are currently supported for automating TRT-LLM engine builds through the CLI:
Note
Building a TRT-LLM engine for llama-2-7b will require a system with at least 64GB of RAM.
gpt2
llama-2-7b
opt125m
# Build TRT LLM engine and generate a Triton model repository pointing at it
triton remove -m all
triton import -m gpt2 --backend tensorrtllm
# Start Triton pointing at the default model repository
triton start
# Interact with model
triton infer -m gpt2 --prompt "machine learning is"
# Profile model with Perf Analyzer
triton profile -m gpt2 --backend tensorrtllm
When using Triton CLI outside of official Triton NGC containers, you may encounter the following issues, indicating additional dependencies need to be installed.
- If you encounter an error related to libb64.so from triton profile or perf_analyzer such as:
perf_analyzer: error while loading shared libraries: libb64.so.0d
Then you likely need to install this system dependency:
apt install libb64-dev
- If you encounter an error related to libcudart.so from triton profile or perf_analyzer such as:
perf_analyzer: error while loading shared libraries: libcudart.so
Then you likely need to install the CUDA toolkit or set your LD_LIBRARY_PATH
correctly. Refer to: https://developer.nvidia.com/cuda-downloads.
- To build TRT-LLM engines, you will need MPI installed in your environment. MPI should be shipped in any relevant Triton or TRT-LLM containers, but if you are building engines on the host you can install it like so:
sudo apt install libopenmpi-dev
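The shared-library errors listed above can be spotted before running a profile by inspecting the binary with `ldd` (Linux only). The helper below is a hypothetical sketch that filters `ldd` output down to the names of libraries reported as "not found".

```shell
# Hypothetical helper: read ldd output on stdin and print only the
# names of shared libraries that could not be resolved.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Usage, assuming perf_analyzer is on your PATH:
# ldd "$(command -v perf_analyzer)" | missing_libs
```

An empty result means every dependency resolved; any name printed points at a package to install or a directory to add to LD_LIBRARY_PATH.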
- Triton CLI's profile command currently only supports TRT-LLM and vLLM models.
- Models and configurations generated by Triton CLI are focused on ease-of-use, and may not be as optimized as possible for your system or use case.
- Triton CLI currently uses the TRT-LLM dependencies installed in its environment to build TRT-LLM engines, so you must take care to match the build-time and run-time versions of TRT-LLM.
- Triton CLI currently does not support launching the server as a background process.
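Until background launching is supported, a common workaround is to rely on the shell's own job control and poll the server until it is ready. The sketch below is one possible approach, not a feature of Triton CLI: `retry_until_ok` is a hypothetical helper, the log file name is arbitrary, and the commented usage assumes the server exposes Triton's standard HTTP readiness endpoint on port 8000. Behavior may differ when the CLI launches its own container.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# usage: retry_until_ok <attempts> <delay_seconds> <command...>
retry_until_ok() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage sketch (run in your own shell):
# nohup triton start > triton_server.log 2>&1 &
# retry_until_ok 30 2 curl -sf localhost:8000/v2/health/ready
# triton infer -m gpt2 --prompt "machine learning is"
```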