Quantization-aware training (QAT) simulates quantization during training by quantizing weights and activations. This helps reduce the accuracy loss when a network trained in FP32 is converted to INT8 for faster inference. QAT introduces additional nodes into the graph that are used to learn the dynamic ranges of the weights and activations. The typical workflow for QAT is to train a model to convergence and then fine-tune it with the quantization layers enabled.
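The "fake quantization" performed by these extra nodes is simply quantize-then-dequantize with a learned range, so the rest of the network trains against rounding and clamping error. Here is a minimal, framework-free sketch of the symmetric INT8 scheme; the function name and values are illustrative, not Torch-TensorRT API:

```python
def fake_quantize(x, amax, num_bits=8):
    """Simulate symmetric INT8 quantize-dequantize as used during QAT.

    amax is the learned dynamic range (maximum absolute value)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for INT8
    scale = amax / qmax                          # quantization step size
    q = max(-qmax, min(qmax, round(x / scale)))  # quantize and clamp
    return q * scale                             # dequantize back to float

# Values inside the learned range come back with a small rounding error;
# values outside the range are clamped to +/- amax.
print(fake_quantize(0.5, amax=1.0))   # close to 0.5
print(fake_quantize(2.0, amax=1.0))   # clamped to amax
```

Learning a good `amax` per tensor is exactly what the added QAT nodes do during fine-tuning.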
For more detail, please refer to the Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT blog post.
This is a short example application that shows how to use Torch-TensorRT to perform inference on a quantization-aware-trained model.
- Download CIFAR10 Dataset Binary version (https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz)
- Train a network on CIFAR10 and perform quantization-aware training on it. Refer to examples/int8/training/vgg16/README.md for detailed instructions. Export the QAT model to TorchScript.
- Install NVIDIA's pytorch-quantization toolkit.
- TensorRT 8.0.1.6 or above.
bazel run //examples/int8/qat --compilation_mode=opt <path-to-module> <path-to-cifar10>
If you want insight into what is going on under the hood, or need debug symbols:
bazel run //examples/int8/qat --compilation_mode=dbg <path-to-module> <path-to-cifar10>
This will build a binary named qat in the bazel-out/k8-<opt|dbg>/bin/cpp/int8/qat/ directory. Optionally, you can add this directory to your $PATH environment variable to run qat from anywhere on your system.
- Download releases of LibTorch, Torch-TensorRT, and TensorRT, and unpack them in the deps directory. Ensure CUDA is installed at /usr/local/cuda; if it is not, modify the CUDA include and lib paths in the Makefile.
cd examples/torch_tensorrt_example/deps
# Download latest Torch-TensorRT release tar file (libtorch_tensorrt.tar.gz) from https://github.com/pytorch/TensorRT/releases
tar -xvzf libtorch_tensorrt.tar.gz
# Unzip the LibTorch archive downloaded from pytorch.org
unzip libtorch.zip
If TensorRT is not installed on your system or on your LD_LIBRARY_PATH, do the following as well:
cd deps
mkdir tensorrt && tar -xvzf <TensorRT TARBALL> --directory tensorrt --strip-components=1
cd ..
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/deps/torch_tensorrt/lib:$(pwd)/deps/libtorch/lib:$(pwd)/deps/tensorrt/lib:/usr/local/cuda/lib
- Build and run qat
We import the header files cifar10.h and benchmark.h from ROOT_DIR. ROOT_DIR should point to the path where Torch-TensorRT is located (<path_to_torch_tensorrt>). By default it is set to ../../../. If your Torch-TensorRT directory structure is different, please set ROOT_DIR accordingly.
cd examples/int8/qat
# This will generate a qat binary
make ROOT_DIR=<PATH> CUDA_VERSION=11.1
./qat <path-to-module> <path-to-cifar10>
Accuracy of JIT model on test set: 92.1%
Compiling and quantizing module
Accuracy of quantized model on test set: 91.0044%
Latency of JIT model FP32 (Batch Size 32): 1.73497ms
Latency of quantized model (Batch Size 32): 0.365737ms
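From the numbers above, the speedup and accuracy cost of quantization can be checked with a few lines (a quick sanity calculation, not part of the example application):

```python
# Figures reported by the qat binary at batch size 32.
fp32_ms = 1.73497      # JIT FP32 model latency
int8_ms = 0.365737     # quantized (INT8) model latency

speedup = fp32_ms / int8_ms      # roughly a 4.7x latency improvement
acc_drop = 92.1 - 91.0044        # accuracy loss in percentage points

print(f"speedup: {speedup:.2f}x, accuracy drop: {acc_drop:.2f} points")
```

So the quantized model runs several times faster while giving up only about one percentage point of test accuracy.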
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.