Initializer

Initializer for a KServe cluster, using shell scripts and Kubernetes YAML files.

Project Structure

  • YAML: Contains the YAML files for deploying KServe, Triton Inference Server, and other Kubernetes resources.
  • Shell: Contains the scripts for the installation and test operations.
    • main.sh: main script for running the whole process (see the usage sketch after this list).
      • ./main.sh run: convert checkpoints, build engines, and deploy Triton Inference Server.
      • ./main.sh test: test the availability of KServe and Triton Inference Server.
    • KServe/install.sh: installs KServe in Kubernetes.
    • KServe/test_simple.sh: simple test of KServe's availability.
    • TIS/install.sh: installs the Triton Inference Server backend in Kubernetes.
    • TIS/run.sh: automatic execution at container startup.
    • TIS/test_serve.sh: simple test of the inference service's availability.
    • TRTLLM/upload_hf_model.sh: uploads Hugging Face weights to the PVC.
    • TRTLLM/convert_weight.sh: converts Hugging Face weights to the TensorRT-LLM weight format.
    • TRTLLM/build_engine.sh: builds optimized TensorRT-LLM engines.
    • TRTLLM/test_inference.sh: tests the TensorRT-LLM engines' availability.
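
The whole pipeline can be driven from main.sh; a minimal usage sketch, assuming the commands are run from the Shell directory of this repository (the working directory is an assumption):

cd Shell
# Convert checkpoints, build TensorRT-LLM engines, and deploy Triton Inference Server
./main.sh run
# Check that KServe and Triton Inference Server are serving
./main.sh test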

Environment

  • Ubuntu: 22.04
  • Kubernetes cluster: v1.26.9
  • containerd: v1.7.2
  • runc: 1.1.12
  • cni: v1.5.1
  • Istio: 1.21.3
  • Knative: v1.12.4
  • KServe: v0.13.0
  • TensorRT-LLM: release v0.10.0
  • Triton Inference Server: release v0.10.0
  • Container Image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • Model: Llama-3-8B-Instruct/Llama-3-70B-Instruct
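
The cluster-side versions can be spot-checked with standard commands; a quick sketch (the namespace names assume default installations of Istio, Knative, and KServe):

kubectl version
containerd --version
runc --version
# Istio, Knative, and KServe are installed into these namespaces by default
kubectl get pods -n istio-system
kubectl get pods -n knative-serving
kubectl get pods -n kserve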

Basic Steps

  1. Install KServe; see the KServe sub-directory.
  2. Save the model weights to a PVC; see the KServe official website (a quick check is sketched after this list).
  3. (Optional) Build the REServe image with TensorRT-LLM/Backend release v0.10.0; see Build REServe Image.
  4. Use the REServe image with TensorRT-LLM/Backend release v0.10.0; see Use REServe Image.
  5. Convert the Llama-3 Hugging Face weights to TensorRT-LLM weights and build the TensorRT-LLM engines; see Convert and Build TensorRT-LLM Engines.
  6. Deploy Triton Inference Server with the TensorRT-LLM engines; see Deploy Triton Inference Server.
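
Before converting weights, it is worth confirming that the PVC holding the Hugging Face weights exists and is Bound; a quick check (the PVC name and namespace are cluster-specific):

# List PVCs across all namespaces and check that the model-weights volume is Bound
kubectl get pvc -A
# The weights themselves can be uploaded with Shell/TRTLLM/upload_hf_model.sh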

Build REServe Image

Build REServe Image with TensorRT-LLM/Backend release v0.10.0:

  1. Clone the repository:
git clone https://github.com/REServeLLM/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
  2. Build the TensorRT-LLM Backend image (contains the TensorRT-LLM and Backend components):
# Use the Dockerfile to build the backend in a container
# If you need to go through a network proxy:
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
                               --progress auto \
                               --network host \
                               -f dockerfile/Dockerfile.trt_llm_backend_network_proxy .
# If you do not need a network proxy:
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
                               --progress auto \
                               -f dockerfile/Dockerfile.trt_llm_backend .
  3. Run a container from the REServe image:
docker run -it -d --network=host --runtime=nvidia \
                  --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN \
                  --security-opt seccomp=unconfined \
                  --shm-size=16g --privileged --ulimit memlock=-1 \
                  --gpus=all --name=reserve \
                  reserve-llm:latest
                  
docker exec -it reserve /bin/bash
  4. Copy the latest REServe source code into the running container:
docker cp REServe reserve:/code
  5. Commit the container and push the resulting REServe image to the registry:
docker commit reserve harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709
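
The commit produces the tagged image; pushing it completes the step (this assumes you are logged in to the Harbor registry):

docker push harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709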

Use REServe Image

We provide a pre-built REServe image; simply pull it from the registry:

docker pull harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240700

# Update the REServe Source Code
cd /code/REServe
cd Initializer
git pull
cd ../tensorrtllm_backend
git submodule update --init --recursive
git lfs install

Alternatively, use the REServe image you built in the previous step.

Convert and Build TensorRT-LLM Engines

Run the following inside the REServe container:

cd /code/REServe/TRTLLM
./convert_weight.sh
./build_engine.sh
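
The resulting engines can be smoke-tested with the test script listed in Project Structure (its arguments, if any, are script-specific):

./test_inference.sh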

Deploy Triton Inference Server
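
This step is driven by the TIS installer script, the YAML manifests, and main.sh listed in Project Structure; a minimal sketch, assuming the Initializer repository is checked out at /code/REServe/Initializer inside the container (the paths are assumptions):

cd /code/REServe/Initializer/Shell
# Deploy the Triton Inference Server backend into the cluster
./TIS/install.sh
# Or run the whole pipeline (convert, build, deploy) in one shot
./main.sh run
# Verify that the inference service is up
./main.sh test
kubectl get inferenceservice -A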
