Poros is an AI Inference Accelerator for deep learning framework. It can provide significantly lower inference latency comparing with original model, and provide much flexibility for dynamic graphs. Poros mainly works on the TorchScript IR currently, that means it supports the models from PyTorch, ONNX, TensorFlow and any other framework that can be converted to TorchScript. also, we are planting to support more IRs in the future. Poros is designed to supports multiple hardware backends conveniently, For now, Poros supports GPU and XPU (BAIDU-Kunlun) Device, It's welcomed to add additional devices.
Figure 1 is the architecture of Poros. The central part marked by the red dotted line is Model Optimizer, the main module of Poros. IR graphs are optimized by IR lowering, op fusing, op converting and auto-tuning, and then segmented into engine related subgraph by maximize the op nums of each engine kernel and minimize the total count of engine kernels.
In order to achieve the above goals on GPU, we've rewritten hundreds of TorchScript OPs, which reduced extra subgraphs caused by unsupported op during subgraph partitioning. Dozens of lowering strategy including op fusions were employed to reduce the actual calculating load of CUDA Kernels.
Poros is developed based on PyTorch, CUDA, TensorRT (TRT Engine), CuDNN. The minimum_required (recommended) versions of these packages are listed as below:
Package | Minimum Version | Recommended Version |
---|---|---|
PyTorch | 1.9.0 | 1.12.1 |
CUDA | 10.2 | 11.3 |
TensorRT | 8.2 | 8.4 |
CuDNN | 7.6.5 | 8.4 |
Python | 3.6.5 | 3.8 |
If you want to build for GPU Inference, it's better to align the CUDA version with the version that PyTorch built on. For example, we recommend you to use CUDA 11.1+ if the installed PyTorch version is 1.11.0+cu111, or some "undefined reference CUDA...." errors may appear during building.
There is a known cuBlas related issue of CUDA 10.2. If you are using CUDA 10.2, make sure these two patches have be installed. https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
get Poros source code:
git clone https://github.com/PaddlePaddle/FastDeploy.git
cd poros
git submodule update --init --recursive --jobs 0 -f
We strongly recommend you to prepare the building environment with anaconda3:
conda create --name poros python=3.8
conda activate poros
export CMAKE_PREFIX_PATH=$CONDA_PREFIX
conda install cmake==3.22.1 pytorch==1.12.1 cudatoolkit=11.3 numpy -c pytorch
If CUDA has been installed as system driver, cudatoolkit is not necessary. And CMake version requires >= 3.21, GCC version requires >= 8.2.
Poros uses cmake to manage dependencies. It will find all dependency packages automatically as long as the packages were installed to the usual location. Otherwise, you should assign the install location of these packages manually.
export CUDAToolkit_ROOT=/cuda/install/dir/ #point CUDAToolkit_ROOT to the CUDA installation dir
export TENSORRT_ROOT=/tensorrt/install/dir/ #download from Nvidia and upack, no need to install into system
export CUDNN_ROOT=/cudnn/install/dir/ #download from Nvidia and upack, no need to install into system
Add cuda, tensorrt and cudnn into your environment variables.
export PATH=$CUDAToolkit_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$CUDAToolkit_ROOT/lib64:$TENSORRT_ROOT/lib:$CUDNN_ROOT/lib:$LD_LIBRARY_PATH
Additional dependency mkl
is needed while building with PyTorch1.11 + CUDA11.1
It can be added into cmake by installing, if not, you can try to add it by:
conda install mkl
Other packages that Poros depend on are: gflags, googletest etc. , they can be downloaded
by git submodule update --init --recursive --jobs 0 -f
cd poros
mkdir build
cd build
cmake ..
make
By default, only the shared library (libporos.so) will be built.
To build a static lib (libporos.a):
cmake -DBUILD_STATIC=on ..
make
Poros kernel
contains the framework of Poros, as well as the IR lowering strategy, the sub-graph segmentation strategy
and the engine manager without any specific engine (e.g. TensorRT). For Developers who want to use their own
engines, kernel
can be built separately with options as below:
To build a shared kernel lib (libporos-kernel.so):
cmake -DBUILD_KERNEL=on ..
make
To build a static kernel lib (libporos-kernel.a):
cmake -DBUILD_STATIC_KERNEL=on ..
make
After the libporos.so has been built, you can build the .whl
package for Python3:
cd ../python
python3 setup.py bdist_wheel
The output looks like: poros-0.1.0-cp38-cp38m-linux_x86_64.whl
. It can be installed easily with pip:
cd dist
pip3 install poros-0.1.0-cp38-cp38m-linux_x86_64.whl
or, you can use python3 setup.py develop
to create symbolic link to python
dir.
We provide an example C++ shell for users who want to build an executable binary. The main.cpp
file locates
at tools/main.cpp
, you modify the code according to your needs. The executable binary poros-tool
can be built with
this command:
mkdir build
cd build
cmake -DBUILD_TOOL=on ..
make
cmake -DUT=on ..
make
./unit_test # run unit test
import poros
import torch
from torchvision import models
original_model = models.resnet50(pretrained=True).cuda().eval() #load/download pre-trained model
option = poros.PorosOptions() #set poros option
poros_model = poros.compile(torch.jit.script(original_model), input_datas, option) #build the model
input = torch.randn(1,3,224,224, dtype=torch.float32).cuda()
poros_res = poros_model(input) # use compiled model in the same way as the original model
The complete benchmark example (resnet50) .py script is python/example/test_resnet.py
python3 python/example/test_resnet.py
If the executable binary poros-tool
is built, you can run the benchmark like this:
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=original #original PyTorch model
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=poros #poros compiled model
PyTorch has changed the packaging format of model since 1.4+, while the pretrained model of resnet50 is still using the old format (.tar). You may need to convert the format to the newer one (.zip) by your self. Convert command like this:
original_model = models.resnet50(pretrained=True).cuda().eval() torch.save(original_model, 'std_pretrained_resnet50_gpu.pt', _use_new_zipfile_serialization=False)
Take a look at the Benchmark.
Poros has been incubated for more than 2 years. In this project, NVIDIA helped us a lot (especially Gary Ji, Vincent Zhang, Jie Fang). They answered lots of technical questions about GPU and gave us many suggestions. Appreciate their great support.