Commit a231c9e

[Quantization] Update quantized model deployment examples and update readme. (PaddlePaddle#377)

* Add PaddleOCR Support

* Add PaddleOCR Support

* Add PaddleOCRv3 Support

* Add PaddleOCRv3 Support

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Add PaddleOCRv3 Support

* Add PaddleOCRv3 Support

* Add PaddleOCRv3 Support

* Fix Rec diff

* Remove useless functions

* Remove useless comments

* Add PaddleOCRv2 Support

* Add PaddleOCRv3 & PaddleOCRv2 Support

* Remove useless parameters

* Add utils of sorting det boxes

* Fix code naming convention

* Fix code naming convention

* Fix code naming convention

* Fix bug in the Classify process

* Improve OCR Readme

* Fix diff in Cls model

* Update Model Download Link in Readme

* Fix diff in PPOCRv2

* Improve OCR readme

* Improve OCR readme

* Improve OCR readme

* Improve OCR readme

* Improve OCR readme

* Improve OCR readme

* Fix conflict

* Add readme for OCRResult

* Improve OCR readme

* Add OCRResult readme

* Improve OCR readme

* Improve OCR readme

* Add Model Quantization Demo

* Fix Model Quantization Readme

* Fix Model Quantization Readme

* Add the function to do PTQ quantization

* Improve quant tools readme

* Improve quant tool readme

* Improve quant tool readme

* Add PaddleInference-GPU for OCR Rec model

* Add QAT method to fastdeploy-quantization tool

* Remove examples/slim for now

* Move configs folder

* Add Quantization Support for Classification Model

* Improve ways of importing preprocess

* Upload YOLO Benchmark on readme

* Upload YOLO Benchmark on readme

* Upload YOLO Benchmark on readme

* Improve Quantization configs and readme

* Add support for multi-inputs model

* Add backends and params file for YOLOv7

* Add quantized model deployment support for YOLO series

* Fix YOLOv5 quantize readme

* Fix YOLO quantize readme

* Fix YOLO quantize readme

* Improve quantize YOLO readme

* Improve quantize YOLO readme

* Improve quantize YOLO readme

* Improve quantize YOLO readme

* Improve quantize YOLO readme

* Fix bug, change Fronted to ModelFormat

* Change Fronted to ModelFormat

* Add examples to deploy quantized paddleclas models

* Fix readme

* Add quantize Readme

* Add quantize Readme

* Add quantize Readme

* Modify readme of quantization tools

* Modify readme of quantization tools

* Improve quantization tools readme

* Improve quantization readme

* Improve PaddleClas quantized model deployment readme

* Add PPYOLOE-l quantized deployment examples

* Improve quantization tools readme

* Improve Quantize Readme

* Fix conflicts

* Fix conflicts

* Improve readme

* Improve quantization tools and readme

* Improve quantization tools and readme

* Add quantized deployment examples for PaddleSeg model

* Fix cpp readme

* Fix memory leak of reader_wrapper function

* Fix model file name in PaddleClas quantization examples

* Update Runtime and E2E benchmark

* Update Runtime and E2E benchmark

* Rename quantization tools to auto compression tools

* Remove PPYOLOE data when deployed on MKLDNN

* Fix readme

* Support PPYOLOE with OR without NMS and update readme

* Update Readme

* Update configs and readme

* Update configs and readme

* Add Paddle-TensorRT backend in quantized model deploy examples

* Support PPYOLOE+ series
yunyaoXYY authored Nov 2, 2022
1 parent 9437dec commit a231c9e
Showing 53 changed files with 1,519 additions and 526 deletions.
149 changes: 110 additions & 39 deletions docs/cn/quantize.md

Large diffs are not rendered by default.

74 changes: 71 additions & 3 deletions docs/en/quantize.md
@@ -1,11 +1,79 @@
[简体中文](../cn/quantize.md) | English

# Quantization Acceleration
Quantization is a popular model compression technique: a quantized model has a smaller footprint and faster inference speed.
Based on PaddleSlim, FastDeploy integrates a one-click model quantization tool, and it also supports deploying the quantized models, helping users accelerate inference.

## FastDeploy Supports Quantized Model Deployment on Multiple Engines and Hardware
Currently, several inference backends in FastDeploy can deploy quantized models on different hardware. The support matrix is as follows:

| Hardware / Inference Backend | ONNX Runtime | Paddle Inference | TensorRT |
| :-----------| :-------- | :--------------- | :------- |
| CPU | Supported | Supported | |
| GPU | | | Supported |
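
To make the matrix concrete, here is a minimal sketch of backend selection with FastDeploy's Python `RuntimeOption`, assuming the `fastdeploy` package and the option methods used in the examples later in this commit:

```python
import fastdeploy as fd

option = fd.RuntimeOption()

# CPU deployment: quantized models run on ONNX Runtime or Paddle Inference.
option.use_cpu()
option.use_ort_backend()  # or: option.use_paddle_backend()

# GPU deployment: quantized models run on TensorRT.
# option.use_gpu()
# option.use_trt_backend()
```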


## Model Quantization

### Quantization Methods
Based on PaddleSlim, FastDeploy currently provides two quantization methods: quantization-aware distillation training and post-training (offline) quantization. Quantization-aware distillation training obtains the quantized model through training, while post-training quantization requires no training at all. FastDeploy can deploy the quantized models produced by either method.

The two methods compare as follows:
| Quantization Method | Quantization Time | Quantized Model Accuracy | Model Size | Inference Speed |
| :-----------| :--------| :-------| :------- | :------- |
| Post-training quantization | No training needed, fast | Slightly lower than distillation training | Same for both | Same for both |
| Quantization-aware distillation training | Requires training, somewhat slower | Small loss relative to the unquantized model | Same for both | Same for both |

### Quantizing Models with FastDeploy's One-Click Model Quantization Tool
Based on PaddleSlim, FastDeploy provides a one-click model quantization tool. Please refer to the following document to quantize your model:
- [FastDeploy One-Click Model Quantization](../../tools/quantization/)
Once the quantized model is produced, it can be deployed with FastDeploy.


## Quantization Benchmarks
The quantized models currently supported by FastDeploy are shown in the tables below:

### YOLO Series
| Model | Backend | Hardware | FP32 Latency (ms) | INT8 Latency (ms) | Speedup | FP32 mAP | INT8 mAP | Method |
| ------------------- | -----------------|-----------| -------- |-------- |-------- | --------- |-------- |----- |
| [YOLOv5s](../../examples/vision/detection/yolov5/quantize/) | TensorRT | GPU | 8.79 | 5.17 | 1.70 | 37.6 | 36.6 | Quantization-aware distillation |
| [YOLOv5s](../../examples/vision/detection/yolov5/quantize/) | ONNX Runtime | CPU | 176.34 | 92.95 | 1.90 | 37.6 | 33.1 | Quantization-aware distillation |
| [YOLOv5s](../../examples/vision/detection/yolov5/quantize/) | Paddle Inference | CPU | 217.05 | 133.31 | 1.63 | 37.6 | 36.8 | Quantization-aware distillation |
| [YOLOv6s](../../examples/vision/detection/yolov6/quantize/) | TensorRT | GPU | 8.60 | 5.16 | 1.67 | 42.5 | 40.6 | Quantization-aware distillation |
| [YOLOv6s](../../examples/vision/detection/yolov6/quantize/) | ONNX Runtime | CPU | 338.60 | 128.58 | 2.60 | 42.5 | 36.1 | Quantization-aware distillation |
| [YOLOv6s](../../examples/vision/detection/yolov6/quantize/) | Paddle Inference | CPU | 356.62 | 125.72 | 2.84 | 42.5 | 41.2 | Quantization-aware distillation |
| [YOLOv7](../../examples/vision/detection/yolov7/quantize/) | TensorRT | GPU | 24.57 | 9.40 | 2.61 | 51.1 | 50.8 | Quantization-aware distillation |
| [YOLOv7](../../examples/vision/detection/yolov7/quantize/) | ONNX Runtime | CPU | 976.88 | 462.69 | 2.11 | 51.1 | 42.5 | Quantization-aware distillation |
| [YOLOv7](../../examples/vision/detection/yolov7/quantize/) | Paddle Inference | CPU | 1022.55 | 490.87 | 2.08 | 51.1 | 46.3 | Quantization-aware distillation |

The data above are Runtime inference performance measured with FastDeploy before and after quantization.
- Test data are images from the COCO2017 validation set.
- Latency is the inference latency on each Runtime, in milliseconds.
- CPU: Intel(R) Xeon(R) Gold 6271C; GPU: Tesla T4; TensorRT 8.4.15; the CPU thread count is fixed to 1 in all tests.


### PaddleDetection Series
| Model | Backend | Hardware | FP32 Latency (ms) | INT8 Latency (ms) | Speedup | FP32 mAP | INT8 mAP | Method |
| ------------------- | -----------------|-----------| -------- |-------- |-------- | --------- |-------- |----- |
| [ppyoloe_crn_l_300e_coco](../../examples/vision/detection/paddledetection/quantize) | TensorRT | GPU | 24.52 | 11.53 | 2.13 | 51.4 | 50.7 | Quantization-aware distillation |
| [ppyoloe_crn_l_300e_coco](../../examples/vision/detection/paddledetection/quantize) | ONNX Runtime | CPU | 1085.62 | 457.56 | 2.37 | 51.4 | 50.0 | Quantization-aware distillation |

The data above are Runtime inference performance measured with FastDeploy before and after quantization.
- Test images are from the COCO val2017 set.
- Latency is the inference latency on each Runtime, in milliseconds.
- CPU: Intel(R) Xeon(R) Gold 6271C; GPU: Tesla T4; TensorRT 8.4.15; the CPU thread count is fixed to 1 in all tests.


### PaddleClas Series
| Model | Backend | Hardware | FP32 Latency (ms) | INT8 Latency (ms) | Speedup | FP32 Top1 | INT8 Top1 | Method |
| ------------------- | -----------------|-----------| -------- |-------- |-------- | --------- |-------- |----- |
| [ResNet50_vd](../../examples/vision/classification/paddleclas/quantize/) | ONNX Runtime | CPU | 77.20 | 40.08 | 1.93 | 79.12 | 78.87 | Post-training quantization |
| [ResNet50_vd](../../examples/vision/classification/paddleclas/quantize/) | TensorRT | GPU | 3.70 | 1.80 | 2.06 | 79.12 | 79.06 | Post-training quantization |
| [MobileNetV1_ssld](../../examples/vision/classification/paddleclas/quantize/) | ONNX Runtime | CPU | 30.99 | 10.24 | 3.03 | 77.89 | 75.09 | Post-training quantization |
| [MobileNetV1_ssld](../../examples/vision/classification/paddleclas/quantize/) | TensorRT | GPU | 1.80 | 0.58 | 3.10 | 77.89 | 76.86 | Post-training quantization |

The data above are Runtime inference performance measured with FastDeploy before and after quantization.
- Test data are images from the ImageNet-2012 validation set.
- Latency is the inference latency on each Runtime, in milliseconds.
- CPU: Intel(R) Xeon(R) Gold 6271C; GPU: Tesla T4; TensorRT 8.4.15; the CPU thread count is fixed to 1 in all tests.
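
Putting the support matrix and the tables above together: deploying one of these quantized classification models follows the same flow as the FP32 examples. Below is a minimal Python sketch, assuming the `fastdeploy` API and the `resnet50_vd_ptq` archive used in the PaddleClas example later in this commit:

```python
import cv2
import fastdeploy as fd

# Runtime configuration: TensorRT on GPU (see the support matrix above).
option = fd.RuntimeOption()
option.use_gpu()
option.use_trt_backend()
option.set_trt_input_shape("inputs", min_shape=[1, 3, 224, 224])

# Quantized model files, following the layout of the resnet50_vd_ptq archive.
model_dir = "resnet50_vd_ptq"
model = fd.vision.classification.PaddleClasModel(
    model_dir + "/model.pdmodel",
    model_dir + "/model.pdiparams",
    model_dir + "/inference_cls.yaml",
    runtime_option=option)

# Run inference on a sample ImageNet image and print the Top-1 result.
im = cv2.imread("ILSVRC2012_val_00000010.jpeg")
result = model.predict(im)
print(result)
```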
53 changes: 38 additions & 15 deletions examples/vision/classification/paddleclas/quantize/README.md
@@ -1,25 +1,48 @@
# Deploying Quantized PaddleClas Models
FastDeploy supports deploying quantized models and provides a one-click model quantization tool.
Users can quantize a model themselves with the one-click quantization tool before deployment, or directly download and deploy the quantized models provided by FastDeploy.
FastDeploy supports deploying quantized models and provides a one-click automatic model compression tool.
Users can quantize a model themselves with the one-click automatic compression tool before deployment, or directly download and deploy the quantized models provided by FastDeploy.

## FastDeploy One-Click Model Quantization Tool
FastDeploy provides a one-click quantization tool that can quantize a model given nothing more than a configuration file.
For a detailed tutorial, see: [One-Click Model Quantization Tool](../../../../../tools/quantization/)
## FastDeploy One-Click Automatic Model Compression Tool
FastDeploy provides a one-click automatic model compression tool that can quantize a model given nothing more than a configuration file.
For a detailed tutorial, see: [One-Click Automatic Model Compression Tool](../../../../../tools/auto_compression/)
Note: inference with a quantized classification model still requires the inference_cls.yaml file from the FP32 model folder. A self-quantized model folder does not contain this yaml file; copy it from the FP32 model folder into the quantized model folder.
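
For illustration, the copy step might look like this in Python; both directory names are hypothetical placeholders:

```python
import shutil

# The self-quantized model folder lacks inference_cls.yaml; copy it over from
# the FP32 model folder. Both paths are hypothetical placeholders.
shutil.copy("ResNet50_vd_infer/inference_cls.yaml",   # FP32 model folder
            "ResNet50_vd_quant/inference_cls.yaml")   # quantized model folder
```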

## Download Quantized PaddleClas Models
Users can also directly download and deploy the quantized models in the table below.
| Model | Backend | Hardware | FP32 Latency (ms) | INT8 Latency (ms) | Speedup | FP32 Top1 | INT8 Top1 | Method |
| ------------------- | -----------------|-----------| -------- |-------- |-------- | --------- |-------- |----- |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | ONNX Runtime | CPU | 86.87 | 59.32 | 1.46 | 79.12 | 78.87 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | TensorRT | GPU | 7.85 | 5.42 | 1.45 | 79.12 | 79.06 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | ONNX Runtime | CPU | 40.32 | 16.87 | 2.39 | 77.89 | 75.09 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | TensorRT | GPU | 5.10 | 3.35 | 1.52 | 77.89 | 76.86 | Post-training quantization |

The data above are end-to-end inference performance measured with FastDeploy before and after quantization.
- Test images are from the ImageNet-2012 validation set.
- Latency is the average end-to-end latency (including pre- and post-processing), in milliseconds.
- CPU: Intel(R) Xeon(R) Gold 6271C; GPU: Tesla T4; TensorRT 8.4.15; the CPU thread count is fixed to 1 in all tests.
Notes on the benchmark tables:
- Runtime latency is the model's inference latency on each Runtime, including the CPU->GPU data copy, GPU inference, and GPU->CPU data copy, but excluding each model's pre- and post-processing.
- End-to-end latency is the model's latency in a realistic inference scenario, including pre- and post-processing.
- All reported latencies are averages over 1000 inference runs, in milliseconds.
- INT8 + FP16 means the Runtime's FP16 inference option is enabled while running the INT8 quantized model.
- INT8 + FP16 + PM means Pinned Memory is enabled in addition to INT8 and FP16, which speeds up GPU->CPU data copies.
- Max speedup is the FP32 latency divided by the fastest INT8 latency.
- When the method is quantization-aware distillation training, the quantized model is trained on a small unlabeled dataset and accuracy is verified on the full validation set, so the INT8 accuracy shown is not necessarily the best achievable.
- CPU: Intel(R) Xeon(R) Gold 6271C with the thread count fixed to 1 in all tests; GPU: Tesla T4 with TensorRT 8.4.15.
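
To make the INT8 + FP16 and Pinned Memory columns concrete, here is a hedged sketch of enabling those Runtime switches from Python; `enable_trt_fp16()` and `enable_pinned_memory()` are assumed method names that do not appear in this commit's diff:

```python
import fastdeploy as fd

option = fd.RuntimeOption()
option.use_gpu()
option.use_trt_backend()

# INT8 + FP16: also allow FP16 kernels while running the INT8 model.
option.enable_trt_fp16()

# + PM: use pinned (page-locked) host memory to speed up GPU->CPU copies.
option.enable_pinned_memory()
```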

### Runtime Benchmark
| Model | Backend | Hardware | FP32 Runtime Latency | INT8 Runtime Latency | INT8 + FP16 Runtime Latency | INT8 + FP16 + PM Runtime Latency | Max Speedup | FP32 Top1 | INT8 Top1 | Method |
| ------------------- | -----------------|-----------| -------- |-------- |-------- | --------- |-------- |----- |----- |----- |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | TensorRT | GPU | 3.55 | 0.99 | 0.98 | 1.06 | 3.62 | 79.12 | 79.06 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | Paddle-TensorRT | GPU | 3.46 | None | 0.87 | 1.03 | 3.98 | 79.12 | 79.06 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | ONNX Runtime | CPU | 76.14 | 35.43 | None | None | 2.15 | 79.12 | 78.87 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | Paddle Inference | CPU | 76.21 | 24.01 | None | None | 3.17 | 79.12 | 78.55 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | TensorRT | GPU | 0.91 | 0.43 | 0.49 | 0.54 | 2.12 | 77.89 | 76.86 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | Paddle-TensorRT | GPU | 0.88 | None | 0.49 | 0.51 | 1.80 | 77.89 | 76.86 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | ONNX Runtime | CPU | 30.53 | 9.59 | None | None | 3.18 | 77.89 | 75.09 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | Paddle Inference | CPU | 12.29 | 4.68 | None | None | 2.62 | 77.89 | 71.36 | Post-training quantization |

### End-to-End Benchmark
| Model | Backend | Hardware | FP32 End-to-End Latency | INT8 End-to-End Latency | INT8 + FP16 End-to-End Latency | INT8 + FP16 + PM End-to-End Latency | Max Speedup | FP32 Top1 | INT8 Top1 | Method |
| ------------------- | -----------------|-----------| -------- |-------- |-------- | --------- |-------- |----- |----- |----- |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | TensorRT | GPU | 4.92 | 2.28 | 2.24 | 2.23 | 2.21 | 79.12 | 79.06 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | Paddle-TensorRT | GPU | 4.48 | None | 2.09 | 2.10 | 2.14 | 79.12 | 79.06 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | ONNX Runtime | CPU | 77.43 | 41.90 | None | None | 1.85 | 79.12 | 78.87 | Post-training quantization |
| [ResNet50_vd](https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar) | Paddle Inference | CPU | 80.60 | 27.75 | None | None | 2.90 | 79.12 | 78.55 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | TensorRT | GPU | 2.19 | 1.48 | 1.57 | 1.57 | 1.48 | 77.89 | 76.86 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | Paddle-TensorRT | GPU | 2.04 | None | 1.47 | 1.45 | 1.41 | 77.89 | 76.86 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | ONNX Runtime | CPU | 34.02 | 12.97 | None | None | 2.62 | 77.89 | 75.09 | Post-training quantization |
| [MobileNetV1_ssld](https://bj.bcebos.com/paddlehub/fastdeploy/mobilenetv1_ssld_ptq.tar) | Paddle Inference | CPU | 16.31 | 7.42 | None | None | 2.20 | 77.89 | 71.36 | Post-training quantization |

## Detailed Deployment Documents

@@ -1,4 +1,4 @@
# PaddleClas Quantized Model Python Deployment Example
# PaddleClas Quantized Model C++ Deployment Example
The `infer.cc` provided in this directory helps users quickly deploy PaddleClas quantized models on CPU/GPU with accelerated inference.

## Deployment Preparation
@@ -8,7 +8,7 @@

### Preparing the Quantized Model
- 1. Users can directly deploy the quantized models provided by FastDeploy.
- 2. Users can quantize a model themselves with FastDeploy's [One-Click Model Quantization Tool](../../../../../../tools/quantization/) and deploy the resulting quantized model. (Note: inference with a quantized classification model still requires the inference_cls.yaml file from the FP32 model folder; a self-quantized model folder does not contain this yaml file, so copy it from the FP32 model folder into the quantized model folder.)
- 2. Users can quantize a model themselves with FastDeploy's [One-Click Automatic Model Compression Tool](../../../../../tools/auto_compression/) and deploy the resulting quantized model. (Note: inference with a quantized classification model still requires the inference_cls.yaml file from the FP32 model folder; a self-quantized model folder does not contain this yaml file, so copy it from the FP32 model folder into the quantized model folder.)

## Deployment Example: the Quantized ResNet50_vd Model
Run the following commands in this directory to build the demo and deploy the quantized model.
@@ -26,8 +26,10 @@ tar -xvf resnet50_vd_ptq.tar
wget https://gitee.com/paddlepaddle/PaddleClas/raw/release/2.4/deploy/images/ImageNet/ILSVRC2012_val_00000010.jpeg


# Run the quantized model on CPU with Paddle Inference
# Run the quantized model on CPU with ONNX Runtime
./infer_demo resnet50_vd_ptq ILSVRC2012_val_00000010.jpeg 0
# Run the quantized model on GPU with TensorRT
./infer_demo resnet50_vd_ptq ILSVRC2012_val_00000010.jpeg 1
# Run the quantized model on GPU with Paddle-TensorRT
./infer_demo resnet50_vd_ptq ILSVRC2012_val_00000010.jpeg 2
```
10 changes: 7 additions & 3 deletions examples/vision/classification/paddleclas/quantize/cpp/infer.cc
@@ -21,8 +21,8 @@ const char sep = '/';

void InitAndInfer(const std::string& model_dir, const std::string& image_file,
const fastdeploy::RuntimeOption& option) {
auto model_file = model_dir + sep + "inference.pdmodel";
auto params_file = model_dir + sep + "inference.pdiparams";
auto model_file = model_dir + sep + "model.pdmodel";
auto params_file = model_dir + sep + "model.pdiparams";
auto config_file = model_dir + sep + "inference_cls.yaml";

auto model = fastdeploy::vision::classification::PaddleClasModel(
@@ -67,7 +67,11 @@ int main(int argc, char* argv[]) {
option.UseGpu();
option.UseTrtBackend();
option.SetTrtInputShape("inputs",{1, 3, 224, 224});
}
} else if (flag == 2) {
option.UseGpu();
option.UseTrtBackend();
option.EnablePaddleToTrt();
}

std::string model_dir = argv[1];
std::string test_image = argv[2];
10 changes: 10 additions & 0 deletions examples/vision/classification/paddleclas/quantize/cpp/ocr.sh
@@ -0,0 +1,10 @@
rm -rf build
mkdir build

cd build

#/xieyunyao/project/FastDeploy

cmake .. -DFASTDEPLOY_INSTALL_DIR=/xieyunyao/project/FastDeploy

make -j
@@ -8,7 +8,7 @@

### Preparing the Quantized Model
- 1. Users can directly deploy the quantized models provided by FastDeploy.
- 2. Users can quantize a model themselves with FastDeploy's [One-Click Model Quantization Tool](../../../../../../tools/quantization/) and deploy the resulting quantized model. (Note: inference with a quantized classification model still requires the inference_cls.yaml file from the FP32 model folder; a self-quantized model folder does not contain this yaml file, so copy it from the FP32 model folder into the quantized model folder.)
- 2. Users can quantize a model themselves with FastDeploy's [One-Click Automatic Model Compression Tool](../../tools/auto_compression/) and deploy the resulting quantized model. (Note: inference with a quantized classification model still requires the inference_cls.yaml file from the FP32 model folder; a self-quantized model folder does not contain this yaml file, so copy it from the FP32 model folder into the quantized model folder.)


## Deployment Example: the Quantized ResNet50_vd Model
@@ -22,8 +22,10 @@ wget https://bj.bcebos.com/paddlehub/fastdeploy/resnet50_vd_ptq.tar
tar -xvf resnet50_vd_ptq.tar
wget https://gitee.com/paddlepaddle/PaddleClas/raw/release/2.4/deploy/images/ImageNet/ILSVRC2012_val_00000010.jpeg

# Run the quantized model on CPU with Paddle Inference
# Run the quantized model on CPU with ONNX Runtime
python infer.py --model resnet50_vd_ptq --image ILSVRC2012_val_00000010.jpeg --device cpu --backend ort
# Run the quantized model on GPU with TensorRT
python infer.py --model resnet50_vd_ptq --image ILSVRC2012_val_00000010.jpeg --device gpu --backend trt
# Run the quantized model on GPU with Paddle-TensorRT
python infer.py --model resnet50_vd_ptq --image ILSVRC2012_val_00000010.jpeg --device gpu --backend pptrt
```
@@ -48,6 +48,11 @@ def build_option(args):
) == "gpu", "TensorRT backend require inferences on device GPU."
option.use_trt_backend()
option.set_trt_input_shape("inputs", min_shape=[1, 3, 224, 224])
elif args.backend.lower() == "pptrt":
assert args.device.lower(
) == "gpu", "TensorRT backend require inference on device GPU."
option.use_trt_backend()
option.enable_paddle_to_trt()
elif args.backend.lower() == "ort":
option.use_ort_backend()
elif args.backend.lower() == "paddle":
