Poros AI Inference Accelerator

Description

Poros is an AI inference accelerator for deep learning frameworks. It provides significantly lower inference latency than the original model and offers considerable flexibility for dynamic graphs. Poros currently works mainly on the TorchScript IR, which means it supports models from PyTorch, ONNX, TensorFlow and any other framework that can be converted to TorchScript. We also plan to support more IRs in the future. Poros is designed to support multiple hardware backends conveniently. For now, Poros supports GPU and XPU (BAIDU-Kunlun) devices, and contributions that add other devices are welcome.

How It Works

Figure 1 shows the architecture of Poros. The central part, marked by the red dotted line, is the Model Optimizer, the main module of Poros. IR graphs are optimized by IR lowering, op fusing, op converting and auto-tuning, and then segmented into engine-related subgraphs by maximizing the number of ops in each engine kernel and minimizing the total number of engine kernels.

[Figure 1: Architecture of Poros]

To achieve these goals on GPU, we rewrote hundreds of TorchScript ops, which reduces the extra subgraphs caused by unsupported ops during subgraph partitioning. Dozens of lowering strategies, including op fusions, are employed to reduce the actual computation load of CUDA kernels.
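The effect of unsupported ops on partitioning can be illustrated with a toy model. The sketch below is not Poros' actual partitioner: it assumes a purely linear op sequence, and the op names and the `partition` helper are hypothetical. It only shows why supporting one more op can merge two engine subgraphs into one.

```python
from itertools import groupby


def partition(ops, supported):
    """Split a linear op sequence into maximal runs of supported / unsupported ops."""
    segments = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        kind = "engine" if is_supported else "fallback"
        segments.append((kind, list(run)))
    return segments


def engine_count(segments):
    """Number of segments that would be handed to the engine."""
    return sum(1 for kind, _ in segments if kind == "engine")


# Hypothetical op sequence; "custom_norm" stands in for an op the engine lacks.
ops = ["conv", "relu", "custom_norm", "conv", "relu", "softmax"]

before = partition(ops, supported={"conv", "relu", "softmax"})
after = partition(ops, supported={"conv", "relu", "softmax", "custom_norm"})

print(engine_count(before), engine_count(after))
```

With `custom_norm` unsupported, the sequence breaks into two engine segments separated by a fallback segment; once it is supported, the whole sequence forms a single engine segment.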

Dependencies

Poros is built on PyTorch, CUDA, TensorRT (TRT Engine) and CuDNN. The minimum required and recommended versions of these packages are listed below:

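| Package  | Minimum Version | Recommended Version |
| -------- | --------------- | ------------------- |
| PyTorch  | 1.9.0           | 1.12.1              |
| CUDA     | 10.2            | 11.3                |
| TensorRT | 8.2             | 8.4                 |
| CuDNN    | 7.6.5           | 8.4                 |
| Python   | 3.6.5           | 3.8                 |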

If you want to build for GPU inference, it is best to align the CUDA version with the one PyTorch was built against. For example, we recommend CUDA 11.1+ if the installed PyTorch version is 1.11.0+cu111; otherwise, "undefined reference CUDA...." errors may appear during the build.
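A quick way to sanity-check versions before building is sketched below. It is an illustration only: the helper names are hypothetical, the minimums are copied from the table above, and the `+cuXYZ` parsing assumes the usual PyTorch wheel suffix convention (for example `cu111` for CUDA 11.1). With PyTorch installed, `torch.__version__` and `torch.version.cuda` provide the real values to feed in.

```python
import re

# Minimum versions copied from the dependency table.
MINIMUMS = {
    "PyTorch": "1.9.0",
    "CUDA": "10.2",
    "TensorRT": "8.2",
    "CuDNN": "7.6.5",
    "Python": "3.6.5",
}


def version_tuple(version):
    """Turn '1.11.0+cu111' into (1, 11, 0), ignoring any '+...' local suffix."""
    return tuple(int(part) for part in version.split("+")[0].split("."))


def meets_minimum(package, installed):
    """True if the installed version is at least the documented minimum."""
    return version_tuple(installed) >= version_tuple(MINIMUMS[package])


def cuda_from_torch_version(torch_version):
    """Extract the CUDA version a PyTorch wheel was built against, or None."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: last digit is the minor version ('111' -> '11.1').
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_torch_version("1.11.0+cu111"))
print(meets_minimum("PyTorch", "1.12.1"), meets_minimum("PyTorch", "1.8.1"))
```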

There is a known cuBLAS-related issue with CUDA 10.2. If you are using CUDA 10.2, make sure the two patches listed on the download page have been installed: https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

How To Build

0. Install Dependencies

Get the Poros source code:

git clone https://github.com/PaddlePaddle/FastDeploy.git
cd FastDeploy/poros
git submodule update --init --recursive --jobs 0 -f

We strongly recommend preparing the build environment with Anaconda3:

conda create --name poros python=3.8
conda activate poros
export CMAKE_PREFIX_PATH=$CONDA_PREFIX
conda install cmake==3.22.1 pytorch==1.12.1 cudatoolkit=11.3 numpy -c pytorch

If CUDA is already installed system-wide, cudatoolkit is not necessary. CMake >= 3.21 and GCC >= 8.2 are required.

Poros uses CMake to manage dependencies. It finds all dependency packages automatically as long as they are installed in the usual locations. Otherwise, you should specify their install locations manually:

export CUDAToolkit_ROOT=/cuda/install/dir/  #point CUDAToolkit_ROOT to the CUDA installation dir
export TENSORRT_ROOT=/tensorrt/install/dir/ #download from NVIDIA and unpack, no need to install into system
export CUDNN_ROOT=/cudnn/install/dir/       #download from NVIDIA and unpack, no need to install into system

Add CUDA, TensorRT and CuDNN to your environment variables:

export PATH=$CUDAToolkit_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$CUDAToolkit_ROOT/lib64:$TENSORRT_ROOT/lib:$CUDNN_ROOT/lib:$LD_LIBRARY_PATH
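Before running CMake, it can help to confirm that the three root variables point at real directories. The following is a minimal sketch; the `missing_roots` helper is hypothetical and only checks that each variable is set to an existing directory, not that the directory holds a valid installation.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_ROOTS = ("CUDAToolkit_ROOT", "TENSORRT_ROOT", "CUDNN_ROOT")


def missing_roots(env=os.environ):
    """Return the root variables that are unset or do not name a directory."""
    return [name for name in REQUIRED_ROOTS if not os.path.isdir(env.get(name, ""))]


problems = missing_roots()
if problems:
    print("Unset or invalid:", ", ".join(problems))
else:
    print("All dependency roots point at existing directories.")
```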

An additional dependency, mkl, is needed when building with PyTorch 1.11 + CUDA 11.1. CMake picks it up once it is installed; if it is missing, you can try adding it with:

conda install mkl

Other packages that Poros depends on, such as gflags and googletest, can be downloaded with git submodule update --init --recursive --jobs 0 -f

1. Build Project with CMake

cd poros
mkdir build
cd build
cmake ..
make 

By default, only the shared library (libporos.so) will be built.

To build a static lib (libporos.a):

cmake -DBUILD_STATIC=on ..
make 

The Poros kernel contains the Poros framework, the IR lowering strategy, the subgraph segmentation strategy and the engine manager, without any specific engine (e.g. TensorRT). Developers who want to use their own engines can build the kernel separately with the options below:

To build a shared kernel lib (libporos-kernel.so):

cmake -DBUILD_KERNEL=on ..
make 

To build a static kernel lib (libporos-kernel.a):

cmake -DBUILD_STATIC_KERNEL=on ..
make 
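The four library variants above differ only in one CMake flag. As a compact summary, the sketch below maps each artifact to its configure command; the flags and artifact names come from the steps above, while the `BUILD_VARIANTS` table and `cmake_command` helper are hypothetical conveniences for scripting the build.

```python
# Artifact name -> extra CMake flags, as listed in the build steps.
BUILD_VARIANTS = {
    "libporos.so": [],
    "libporos.a": ["-DBUILD_STATIC=on"],
    "libporos-kernel.so": ["-DBUILD_KERNEL=on"],
    "libporos-kernel.a": ["-DBUILD_STATIC_KERNEL=on"],
}


def cmake_command(artifact):
    """Return the configure command (run from the build directory) for an artifact."""
    return ["cmake", *BUILD_VARIANTS[artifact], ".."]


for artifact in BUILD_VARIANTS:
    print(f"{artifact}: {' '.join(cmake_command(artifact))}")
```

The returned list can be passed to `subprocess.run(..., check=True)` from inside the build directory, followed by `make`.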

2. Build Distributing Package with setuptools (Python3)

After libporos.so has been built, you can build the .whl package for Python3:

cd ../python
python3 setup.py bdist_wheel

The output looks like: poros-0.1.0-cp38-cp38m-linux_x86_64.whl. It can be installed easily with pip:

cd dist
pip3 install poros-0.1.0-cp38-cp38m-linux_x86_64.whl

Alternatively, you can run python3 setup.py develop to create a symbolic link to the python directory.
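After either installation route, a quick check that the package is visible to the interpreter can save a confusing import error later. This sketch uses only the standard library; the `is_installed` helper is hypothetical, and it only confirms that the `poros` module can be located, not that its native libraries load correctly.

```python
import importlib.util


def is_installed(module_name):
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("poros importable:", is_installed("poros"))
```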

3. Build Executable Binary

We provide an example C++ shell for users who want to build an executable binary. The main.cpp file is located at tools/main.cpp, and you can modify the code according to your needs. The executable binary poros-tool can be built with these commands:

mkdir build
cd build
cmake -DBUILD_TOOL=on ..
make 

4. Build Test

cmake -DUT=on ..
make 
./unit_test # run unit test

How To Use

1. Python Usage:

import poros
import torch
from torchvision import models

original_model = models.resnet50(pretrained=True).cuda().eval()  # load/download the pre-trained model
option = poros.PorosOptions()  # set Poros options

input = torch.randn(1, 3, 224, 224, dtype=torch.float32).cuda()
# Assumption: input_datas is a list of sample inputs used during compilation.
# See python/example/test_resnet.py for the exact form poros.compile expects.
input_datas = [input]
poros_model = poros.compile(torch.jit.script(original_model), input_datas, option)  # build the model

poros_res = poros_model(input)  # use the compiled model in the same way as the original model

The complete benchmark example script (resnet50) is python/example/test_resnet.py:

python3 python/example/test_resnet.py
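To compare the original and compiled models by hand, a small timing helper is enough. The sketch below is not taken from test_resnet.py; `measure_latency` and its parameters are hypothetical. It runs a few warm-up calls, then averages wall-clock time over repeated calls, with an optional `sync` callback because GPU work is asynchronous.

```python
import time


def measure_latency(fn, warmup=10, iters=100, sync=None):
    """Return the mean wall-clock seconds per call of fn()."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for queued work so it is included in the measurement
    return (time.perf_counter() - start) / iters


# Stand-in workload so the sketch runs without a GPU.
calls = []
mean_seconds = measure_latency(lambda: calls.append(1), warmup=2, iters=5)
print(len(calls), mean_seconds >= 0.0)
```

For the models above, one would pass something like `lambda: poros_model(input)` as `fn` and `torch.cuda.synchronize` as `sync`, then repeat with `original_model` to compare.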

2. CPP Usage:

If the executable binary poros-tool is built, you can run the benchmark like this:

./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=original #original PyTorch model
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=poros #poros compiled model

PyTorch changed its model packaging format in 1.4+, while the pretrained resnet50 model still uses the old format (.tar). You may need to convert it to the newer format (.zip) yourself, for example:

import torch
from torchvision import models
original_model = models.resnet50(pretrained=True).cuda().eval()
torch.save(original_model, 'std_pretrained_resnet50_gpu.pt', _use_new_zipfile_serialization=True)  # True writes the zip-based format
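To confirm which packaging format a saved file uses, the standard library can tell whether it is a zip archive. The sketch below is an illustration: `uses_zip_format` is a hypothetical helper, and the two temporary files are placeholders rather than real model checkpoints.

```python
import os
import tempfile
import zipfile


def uses_zip_format(path):
    """True if the file at path is a zip archive (the newer packaging format)."""
    return zipfile.is_zipfile(path)


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder for a file saved in the zip-based format.
    zip_path = os.path.join(tmp, "new_style.pt")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("archive/data.pkl", b"placeholder")

    # Placeholder for a legacy file: arbitrary bytes that are not a zip archive.
    legacy_path = os.path.join(tmp, "legacy.pt")
    with open(legacy_path, "wb") as f:
        f.write(b"\x80\x02placeholder bytes, not a zip archive")

    zip_result = uses_zip_format(zip_path)
    legacy_result = uses_zip_format(legacy_path)

print(zip_result, legacy_result)
```

Running `uses_zip_format('std_pretrained_resnet50_gpu.pt')` on the converted file shows whether the conversion produced the zip-based format.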

Benchmark

Take a look at the Benchmark.

Acknowledgement

Poros has been incubated for more than 2 years. NVIDIA helped us a lot in this project (especially Gary Ji, Vincent Zhang and Jie Fang). They answered many technical questions about GPUs and gave us many suggestions. We appreciate their great support.