Example for Hosting TensorRT OSS HuggingFace Models on Triton Inference Server

Building the TensorRT (TRT) Engines

  1. Build the TensorRT 8.5 OSS container:
     bash build_trt_oss_docker.sh
  2. Launch the container:
     bash run_trt_oss_docker.sh
  3. Change directory and pip-install the HuggingFace demo requirements:
     cd demo/HuggingFace
     pip install -r requirements.txt
  4. Run build_t5_trt.py to build the T5 TRT engines and build_bart_trt.py to build the BART engines (a rough sketch of what an engine build involves follows this list).
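
These build scripts sit on top of the TensorRT OSS HuggingFace demo code in demo/HuggingFace. As a rough, generic illustration of what converting an ONNX-exported model into a TRT engine looks like with the TensorRT Python API (this is not the repo's actual build path; the file names, input tensor name, and shapes are hypothetical placeholders):

```python
# Generic ONNX -> TensorRT engine sketch. build_t5_trt.py / build_bart_trt.py
# use the TensorRT OSS HuggingFace demo utilities instead; names and shapes
# here are placeholders for illustration only.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str, max_seq_len: int = 512):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX-exported model (e.g. a T5 encoder).
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, if supported

    # Dynamic-shape profile for batch size 1 and variable sequence length;
    # "input_ids" is an assumed input name.
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids", (1, 1), (1, max_seq_len // 2), (1, max_seq_len))
    config.add_optimization_profile(profile)

    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine("t5_encoder.onnx", "t5_encoder.engine")
```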

Triton Inference

The Triton model repository is located at model_repository. Each model has an associated model.py and config.pbtxt, along with the T5/BART TRT OSS code dependencies.
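
Each model.py follows Triton's Python-backend model interface, loading the TRT engines and running generation inside execute(). The skeleton below is only a minimal sketch of that interface; the tensor names and the pass-through logic are simplified placeholders, not the repo's actual implementation:

```python
# Minimal sketch of the Triton Python-backend interface that each model.py in
# model_repository implements. The real model.py files load the T5/BART TRT
# engines and run generation; the tensor names and echo logic below are
# simplified placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_repository"] points at this model's directory;
        # the real model.py deserializes the TRT engines here.
        self.model_dir = args["model_repository"]

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT_IDS" / "OUTPUT_IDS" are placeholder names; config.pbtxt
            # defines the actual inputs and outputs.
            input_ids = pb_utils.get_input_tensor_by_name(
                request, "INPUT_IDS"
            ).as_numpy()
            output_ids = input_ids.astype(np.int32)  # stand-in for TRT generation
            out_tensor = pb_utils.Tensor("OUTPUT_IDS", output_ids)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        pass
```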

We showcase two models here: T5 and BART. TRT T5 currently supports both beam search and greedy search, while TRT BART currently supports only greedy search.

  • trt_t5_bs1_beam2 = TRT T5 model, max batch size 1, beam search with 2 beams
  • trt_bart_bs1_greedy = TRT BART model, max batch size 1, greedy search

Currently, the TensorRT engines for T5 and BART do not produce correct output for batch sizes > 1 (this bug is being worked on), so we only show batch size = 1 examples for T5 and BART here.

Steps for Triton TRT Inference

  1. Build the custom Triton container with TRT and other dependencies (the Dockerfile is docker/triton_trt.Dockerfile):
     cd docker
     bash build_triton_trt_docker.sh
     cd ..
  2. Launch the custom Triton container:
     bash run_triton_trt_docker.sh
  3. Launch JupyterLab on port 8888:
     bash start_jupyter.sh
  4. Run through 1_triton_server.ipynb to launch Triton Server.
  5. Run through 2_triton_client.ipynb to perform sample inference for the T5 and BART TRT OSS HuggingFace models using Triton Server (a minimal client sketch is shown after this list).
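
The client notebook walks through the full tokenization and decoding flow. As a minimal sketch of what a Triton HTTP client request to the trt_t5_bs1_beam2 model could look like (the tensor names, datatypes, and tokenizer choice below are assumptions; check config.pbtxt and the notebook for the actual contract):

```python
# Minimal Triton HTTP client sketch for the T5 model. Tensor names
# ("INPUT_IDS", "OUTPUT_IDS"), datatypes, and shapes are assumptions, not
# taken from the repo; see config.pbtxt and 2_triton_client.ipynb.
import numpy as np
import tritonclient.http as httpclient
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
client = httpclient.InferenceServerClient(url="localhost:8000")

text = "translate English to German: The house is wonderful."
# Batch size 1, per the current TRT engine limitation noted above.
input_ids = tokenizer(text, return_tensors="np").input_ids.astype(np.int32)

infer_input = httpclient.InferInput("INPUT_IDS", list(input_ids.shape), "INT32")
infer_input.set_data_from_numpy(input_ids)
requested_output = httpclient.InferRequestedOutput("OUTPUT_IDS")

result = client.infer(
    model_name="trt_t5_bs1_beam2",
    inputs=[infer_input],
    outputs=[requested_output],
)
output_ids = result.as_numpy("OUTPUT_IDS")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```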
