Initializer for the KServe cluster, with shell scripts and Kubernetes YAML files.
- YAML: Contains the YAML files for deploying KServe, Triton Inference Server, and other Kubernetes resources.
- Shell: Contains the shell scripts for running the installation and test operations.
- main.sh: main script for running the whole process (an illustrative dispatch sketch follows this list).
- ./main.sh run: convert checkpoints, build engines, and deploy Triton Inference Server.
- ./main.sh test: test the availability of KServe and Triton Inference Server.
- KServe/install.sh: installing KServe in Kubernetes.
- KServe/test_simple.sh: simple test of KServe's availability.
- TIS/install.sh: installing Triton Inference Server Backend in Kubernetes.
- TIS/run.sh: automatic execution at container startup.
- TIS/test_serve.sh: simple test of the inference service's availability.
- TRTLLM/upload_hf_model.sh: uploading Hugging Face weights to the PVC.
- TRTLLM/convert_weight.sh: converting Hugging Face weights to the TensorRT-LLM weight format.
- TRTLLM/build_engine.sh: building optimized TensorRT-LLM engines.
- TRTLLM/test_inference.sh: testing TensorRT-LLM engines' availability.
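main.sh is the entry point for both subcommands. The repository's actual script is not reproduced here; the following is only an illustrative sketch, assuming it simply dispatches to the sub-scripts listed above:
#!/usr/bin/env bash
# Illustrative sketch only; the real main.sh may differ.
set -euo pipefail
case "${1:-}" in
  run)
    ./TRTLLM/convert_weight.sh   # convert Hugging Face checkpoints to TensorRT-LLM weights
    ./TRTLLM/build_engine.sh     # build the optimized TensorRT-LLM engines
    ./TIS/install.sh             # deploy the Triton Inference Server backend
    ;;
  test)
    ./KServe/test_simple.sh      # check KServe availability
    ./TIS/test_serve.sh          # check the inference service availability
    ;;
  *)
    echo "Usage: $0 {run|test}" >&2
    exit 1
    ;;
esac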
- Ubuntu: 22.04
- Kubernetes cluster: v1.26.9
- containerd: v1.7.2
- runc: 1.1.12
- cni: v1.5.1
- Istio: 1.21.3
- Knative: v1.12.4
- KServe: v0.13.0
- TensorRT-LLM: release v0.10.0
- Triton Inference Server TensorRT-LLM Backend: release v0.10.0
- Container Image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
- Model: Llama-3-8B-Instruct/Llama-3-70B-Instruct
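To sanity-check that a cluster matches the versions above, a few quick queries can be run. The namespace names below are the defaults used by the Istio, Knative, and KServe installers and may differ in a customized setup:
# Host components
containerd --version
runc --version
nvidia-smi
# Kubernetes and the serving stack (default namespaces assumed)
kubectl version
kubectl get pods -n istio-system
kubectl get pods -n knative-serving
kubectl get pods -n kserve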
- Install KServe; please check the KServe sub-directory.
- Save the model weights to a PVC; please check the KServe official website (an illustrative PVC example follows this list).
- (Optional) Build the REServe image with TensorRT-LLM/Backend release v0.10.0; please check Build REServe Image.
- Use the REServe image with TensorRT-LLM/Backend release v0.10.0; please check Use REServe Image.
- Convert the Llama-3 Hugging Face weights to TensorRT-LLM weights and build the TensorRT-LLM engines; please check Convert and Build TensorRT-LLM Engines.
- Deploy Triton Inference Server with the TensorRT-LLM engines; please check Deploy Triton Inference Server.
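For the save-to-PVC step, a minimal example of creating the claim is sketched below; the claim name llama3-weights and the 200Gi size are placeholders and should match whatever TRTLLM/upload_hf_model.sh expects:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama3-weights          # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi            # placeholder size; Llama-3-70B needs considerably more
EOF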
Build REServe Image with TensorRT-LLM/Backend release v0.10.0:
- Clone the repository:
git clone https://github.com/REServeLLM/tensorrtllm_backend.git
cd tensorrtllm_backend
# Fetch the LFS objects and update the submodules
git lfs install
git submodule update --init --recursive
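A quick check that the LFS objects and submodules were actually fetched before starting the build:
git submodule status
git lfs ls-files | head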
- Build the TensorRT-LLM Backend image (contains the TensorRT-LLM and Backend components):
# Use the Dockerfile to build the backend in a container
# If the build machine needs a network proxy
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
--progress auto \
--network host \
-f dockerfile/Dockerfile.trt_llm_backend_network_proxy .
# If no network proxy is needed
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
--progress auto \
-f dockerfile/Dockerfile.trt_llm_backend .
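The build can take a long time; once it finishes, confirm the tag exists before moving on to the run step:
docker images reserve-llm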
- Run the REServe image:
docker run -it -d --network=host --runtime=nvidia \
--cap-add=SYS_PTRACE --cap-add=SYS_ADMIN \
--security-opt seccomp=unconfined \
--shm-size=16g --privileged --ulimit memlock=-1 \
--gpus=all --name=reserve \
reserve-llm:latest
docker exec -it reserve /bin/bash
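Inside the container, it is worth confirming that the GPUs are visible and that the TensorRT-LLM Python package was installed by the build; the import below assumes the image ships the tensorrt_llm wheel, as the backend Dockerfiles normally do:
# Run inside the reserve container
nvidia-smi
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"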
- Copy the latest REServe source code into the REServe container:
docker cp REServe reserve:/code
- Commit and push the REServe image to the registry:
docker commit reserve harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709
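The push itself is not shown above; assuming you are logged in to the registry (docker login harbor.act.buaa.edu.cn), it uses the tag from the commit:
docker push harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709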
We also provide a pre-built REServe image; just pull it from the registry:
docker pull harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240700
# Update the REServe Source Code
cd /code/REServe
cd Initializer
git pull
cd ../tensorrtllm_backend
git submodule update --init --recursive
git lfs install
Alternatively, you can use your own REServe image built in the previous step.
Operations in the REServe container:
cd /code/REServe/TRTLLM
./convert_weight.sh
./build_engine.sh
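After both scripts finish, the engines' availability can be checked with the test script from the directory structure above:
./test_inference.sh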