yolo-inference

C++ and Python implementations of YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, YOLOv11 inference.

Supported inference backends include Libtorch/PyTorch, ONNXRuntime, OpenCV, OpenVINO and TensorRT.

Supported task types include Classify, Detect and Segment.

Supported model types include FP32, FP16 and INT8.

Dependencies (tested):

  • CUDA version 11.8.0/12.5.0
  • OpenCV version 4.9.0/4.10.0 (built with CUDA)
  • ONNXRuntime version 1.18.1/1.20.0
  • OpenVINO version 2024.1.0/2024.4.0
  • TensorRT version 8.2.1.8/10.6.0.26
  • Torch version 2.0.0+cu118/2.5.0+cu124
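
To compare a local Python environment against the tested versions above, a minimal sketch like the following can report what is installed. The distribution names used here (`torch`, `onnxruntime`, `openvino`, `tensorrt`, `opencv-python`) are assumptions about how the packages were installed with pip. A GPU build of ONNXRuntime, for instance, is usually published as `onnxruntime-gpu`, and an OpenCV built from source with CUDA may not be registered with pip at all.

```python
from importlib import metadata

# Assumed pip distribution names; adjust them to match your installation.
PACKAGES = ["torch", "onnxruntime", "openvino", "tensorrt", "opencv-python"]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not found'}")
```

A `None` entry only means pip has no record of that distribution; the library may still be importable if it was built from source.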

You can test C++ code with:

# Windows
mkdir build ; cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
./run.bat

or

# Linux
mkdir build ; cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
./run.sh

C++ test in Docker (CPU i7-13700F, GPU RTX 4070):

Model Task Device Precision LibTorch ONNXRuntime OpenCV OpenVINO TensorRT
YOLOv5n Classify CPU FP32 11.0ms 12.0ms 14.0ms 9.8ms ×
YOLOv5n Classify GPU FP32 3.2ms 5.6ms 4.1ms ? 2.6ms
YOLOv5n Classify CPU FP16 × 17.6ms 14.2ms 9.8ms ×
YOLOv5n Classify GPU FP16 3.7ms 7.2ms 4.0ms ? 2.4ms
YOLOv5n Classify CPU INT8 × 16.8ms × ? ×
YOLOv5n Classify GPU INT8 × 30.5ms × ? 2.4ms
YOLOv5n Detect CPU FP32 20.9ms 16.5ms 45.2ms 14.1ms ×
YOLOv5n Detect GPU FP32 4.1ms 7.1ms 6.0ms ? 2.9ms
YOLOv5n Detect CPU FP16 × 31.7ms 45.3ms 14.1ms ×
YOLOv5n Detect GPU FP16 3.7ms 16.5ms 5.8ms ? 2.5ms
YOLOv5n Detect CPU INT8 × 22.5ms × 13.7ms ×
YOLOv5n Detect GPU INT8 × 44.7ms × ? 2.5ms
YOLOv5n Segment CPU FP32 27.5ms 22.9ms 61.8ms 20.2ms ×
YOLOv5n Segment GPU FP32 6.9ms 10.7ms 7.8ms ? 4.5ms
YOLOv5n Segment CPU FP16 × 43.7ms 61.5ms 20.2ms ×
YOLOv5n Segment GPU FP16 6.3ms 27.8ms 7.4ms ? 4.0ms
YOLOv5n Segment CPU INT8 × 30.4ms × ? ×
YOLOv5n Segment GPU INT8 × 62.7ms × ? ?
YOLOv6n Detect CPU FP32 ? 19.7ms 21.8ms 21.7ms ×
YOLOv6n Detect GPU FP32 ? 7.3ms 4.9ms ? 3.3ms
YOLOv6n Detect CPU FP16 × 37.7ms 21.7ms 21.8ms ×
YOLOv6n Detect GPU FP16 ? 14.5ms 4.5ms ? 2.6ms
YOLOv6n Detect CPU INT8 × 34.1ms × 18.9ms ×
YOLOv6n Detect GPU INT8 × 64.6ms × ? 2.4ms
YOLOv7t Detect CPU FP32 42.9ms 24.6ms 49.5ms 27.6ms ×
YOLOv7t Detect GPU FP32 4.9ms 7.8ms 6.5ms ? 3.5ms
YOLOv7t Detect CPU FP16 × 54.2ms 49.4ms 27.5ms ×
YOLOv7t Detect GPU FP16 ? 22.6ms 5.7ms ? 2.8ms
YOLOv7t Detect CPU INT8 × 42.4ms × 24.3ms ×
YOLOv7t Detect GPU INT8 × 77.4ms × ? 2.5ms
YOLOv8n Classify CPU FP32 3.0ms 1.7ms 2.5ms 1.5ms ×
YOLOv8n Classify GPU FP32 0.9ms 1.0ms 1.4ms ? 0.7ms
YOLOv8n Classify CPU FP16 × 3.3ms 2.5ms 1.5ms ×
YOLOv8n Classify GPU FP16 ? 1.3ms 1.5ms ? 0.6ms
YOLOv8n Classify CPU INT8 × 2.7ms × ? ×
YOLOv8n Classify GPU INT8 × 5.9ms × ? 0.6ms
YOLOv8n Detect CPU FP32 26.3ms 24.8ms 29.0ms 21.2ms ×
YOLOv8n Detect GPU FP32 3.8ms 7.7ms 5.1ms ? 3.3ms
YOLOv8n Detect CPU FP16 × 42.7ms 28.9ms 21.3ms ×
YOLOv8n Detect GPU FP16 ? 20.0ms 4.7ms ? 2.8ms
YOLOv8n Detect CPU INT8 × 32.6ms × 19.3ms ×
YOLOv8n Detect GPU INT8 × 59.7ms × ? 2.6ms
YOLOv8n Segment CPU FP32 ? 34.0ms 38.1ms 28.4ms ×
YOLOv8n Segment GPU FP32 6.3ms 10.5ms 6.8ms ? 4.9ms
YOLOv8n Segment CPU FP16 × 55.8ms 37.9ms 28.4ms ×
YOLOv8n Segment GPU FP16 ? 26.9ms 6.5ms ? 4.3ms
YOLOv8n Segment CPU INT8 × 42.8ms × ? ×
YOLOv8n Segment GPU INT8 × 81.3ms × ? ?
YOLOv9t Detect CPU FP32 36.6ms 27.0ms 35.4ms 21.8ms ×
YOLOv9t Detect GPU FP32 6.1ms 9.8ms 8.4ms ? 4.3ms
YOLOv9t Detect CPU FP16 × 44.0ms 35.5ms 21.8ms ×
YOLOv9t Detect GPU FP16 ? 19.9ms 8.7ms ? 3.7ms
YOLOv9t Detect CPU INT8 × 39.7ms × 20.4ms ×
YOLOv9t Detect GPU INT8 × 88.3ms × ? 3.6ms
YOLOv10n Detect CPU FP32 25.3ms 23.3ms × 18.6ms ×
YOLOv10n Detect GPU FP32 3.0ms 7.3ms × ? 3.0ms
YOLOv10n Detect CPU FP16 × 41.6ms × 18.6ms ×
YOLOv10n Detect GPU FP16 ? 12.0ms × ? 2.4ms
YOLOv10n Detect CPU INT8 × 35.0ms × 17.3ms ×
YOLOv10n Detect GPU INT8 × 63.7ms × ? 2.3ms
YOLOv11n Classify CPU FP32 2.9ms 1.9ms 2.9ms 1.6ms ×
YOLOv11n Classify GPU FP32 1.2ms 1.3ms × ? 0.9ms
YOLOv11n Classify CPU FP16 × 3.4ms 3.0ms 1.7ms ×
YOLOv11n Classify GPU FP16 ? 1.6ms × ? 0.7ms
YOLOv11n Classify CPU INT8 × ? × ? ×
YOLOv11n Classify GPU INT8 × ? × ? 0.7ms
YOLOv11n Detect CPU FP32 28.9ms 23.3ms 31.4ms 18.5ms ×
YOLOv11n Detect GPU FP32 4.1ms 8.2ms × ? 3.4ms
YOLOv11n Detect CPU FP16 × 46.4ms 31.1ms 18.4ms ×
YOLOv11n Detect GPU FP16 ? 17.6ms × ? 3.0ms
YOLOv11n Detect CPU INT8 × ? × 17.0ms ×
YOLOv11n Detect GPU INT8 × ? × ? 2.9ms
YOLOv11n Segment CPU FP32 × 32.3ms 40.5ms 25.7ms ×
YOLOv11n Segment GPU FP32 × 11.8ms × ? 5.0ms
YOLOv11n Segment CPU FP16 × 59.8ms 40.1ms 25.6ms ×
YOLOv11n Segment GPU FP16 × 27.1ms × ? 4.5ms
YOLOv11n Segment CPU INT8 × ? × ? ×
YOLOv11n Segment GPU INT8 × ? × ? ?

You can test Python code with:

# Windows 
pip install -r requirements.txt
./run.bat

or

# Linux
pip install -r requirements.txt
./run.sh

Python test in Docker (CPU i7-13700F, GPU RTX 4070):

Model Task Device Precision PyTorch ONNXRuntime OpenCV OpenVINO TensorRT
YOLOv5n Classify CPU FP32 19.1ms 19.3ms 23.3ms 17.6ms ×
YOLOv5n Classify GPU FP32 11.8ms 15.1ms 12.6ms ? 8.9ms
YOLOv5n Classify CPU FP16 × 24.9ms 23.5ms 17.6ms ×
YOLOv5n Classify GPU FP16 13.5ms 15.4ms 13.2ms ? 10.2ms
YOLOv5n Classify CPU INT8 × 25.2ms × ? ×
YOLOv5n Classify GPU INT8 × 39.2ms × ? 10.3ms
YOLOv5n Detect CPU FP32 22.3ms 21.0ms 47.7ms 18.2ms ×
YOLOv5n Detect GPU FP32 9.1ms 12.8ms 8.1ms ? 5.5ms
YOLOv5n Detect CPU FP16 × 32.6ms 46.9ms 18.2ms ×
YOLOv5n Detect GPU FP16 8.3ms 15.4ms 8.0ms ? 5.1ms
YOLOv5n Detect CPU INT8 × 27.9ms × 17.9ms ×
YOLOv5n Detect GPU INT8 × 54.7ms × ? 6.0ms
YOLOv5n Segment CPU FP32 154.8ms 98.1ms 129.0ms 44.3ms ×
YOLOv5n Segment GPU FP32 31.2ms 42.6ms 31.4ms ? 32.6ms
YOLOv5n Segment CPU FP16 × 119.0ms 129.3ms 44.7ms ×
YOLOv5n Segment GPU FP16 42.3ms 50.2ms 31.6ms ? 33.4ms
YOLOv5n Segment CPU INT8 × 112.5ms × ? ×
YOLOv5n Segment GPU INT8 × 166.6ms × ? ?
YOLOv6n Detect CPU FP32 ? 45.8ms 37.5ms 39.3ms ×
YOLOv6n Detect GPU FP32 ? 36.6ms 30.6ms ? 27.8ms
YOLOv6n Detect CPU FP16 × 58.6ms 39.2ms 39.4ms ×
YOLOv6n Detect GPU FP16 ? 35.3ms 29.1ms ? 24.0ms
YOLOv6n Detect CPU INT8 × 59.4ms × 36.5ms ×
YOLOv6n Detect GPU INT8 × 110.8ms × ? 22.1ms
YOLOv7t Detect CPU FP32 47.0ms 32.3ms 52.0ms 31.8ms ×
YOLOv7t Detect GPU FP32 8.0ms 12.6ms 8.9ms ? 6.1ms
YOLOv7t Detect CPU FP16 × 55.6ms 52.2ms 31.7ms ×
YOLOv7t Detect GPU FP16 ? 18.9ms 7.8ms ? 5.4ms
YOLOv7t Detect CPU INT8 × 48.6ms × 27.5ms ×
YOLOv7t Detect GPU INT8 × 90.9ms × ? 5.0ms
YOLOv8n Classify CPU FP32 3.3ms 1.9ms 2.7ms 1.4ms ×
YOLOv8n Classify GPU FP32 1.1ms 1.1ms 1.6ms ? 0.7ms
YOLOv8n Classify CPU FP16 × 3.6ms 2.6ms 1.4ms ×
YOLOv8n Classify GPU FP16 ? 1.4ms 1.6ms ? 0.6ms
YOLOv8n Classify CPU INT8 × 3.3ms × ? ×
YOLOv8n Classify GPU INT8 × 6.5ms × ? 0.6ms
YOLOv8n Detect CPU FP32 45.2ms 53.7ms 45.4ms 37.3ms ×
YOLOv8n Detect GPU FP32 28.5ms 33.9ms 28.3ms ? 25.9ms
YOLOv8n Detect CPU FP16 × 61.7ms 43.8ms 37.4ms ×
YOLOv8n Detect GPU FP16 ? 39.5ms 27.5ms ? 22.9ms
YOLOv8n Detect CPU INT8 × 59.2ms × 35.5ms ×
YOLOv8n Detect GPU INT8 × 93.3ms × ? 21.6ms
YOLOv8n Segment CPU FP32 170.8ms 144.0ms 133.8ms 87.9ms ×
YOLOv8n Segment GPU FP32 78.1ms 84.1ms 81.5ms ? 70.1ms
YOLOv8n Segment CPU FP16 × 155.6ms 132.9ms 89.1ms ×
YOLOv8n Segment GPU FP16 ? 85.8ms 76.8ms ? 76.6ms
YOLOv8n Segment CPU INT8 × 157.0ms × ? ×
YOLOv8n Segment GPU INT8 × 210.2ms × ? ?
YOLOv9t Detect CPU FP32 57.7ms 52.3ms 52.5ms 37.9ms ×
YOLOv9t Detect GPU FP32 30.4ms 39.1ms 32.6ms ? 27.6ms
YOLOv9t Detect CPU FP16 × 67.3ms 51.9ms 38.1ms ×
YOLOv9t Detect GPU FP16 ? 42.0ms 31.9ms ? 26.9ms
YOLOv9t Detect CPU INT8 × 67.5ms × 36.5ms ×
YOLOv9t Detect GPU INT8 × 122.3ms × ? 26.2ms
YOLOv10n Detect CPU FP32 29.5ms 32.8ms × 20.6ms ×
YOLOv10n Detect GPU FP32 5.5ms 11.2ms × ? 5.3ms
YOLOv10n Detect CPU FP16 × 44.2ms × 20.6ms ×
YOLOv10n Detect GPU FP16 ? 12.7ms × ? 4.7ms
YOLOv10n Detect CPU INT8 × 47.8ms × 19.5ms ×
YOLOv10n Detect GPU INT8 × 78.0ms × ? 4.6ms
YOLOv11n Classify CPU FP32 3.4ms 2.2ms 3.0ms 1.5ms ×
YOLOv11n Classify GPU FP32 1.4ms 1.4ms × ? 0.8ms
YOLOv11n Classify CPU FP16 × 4.0ms 3.0ms 1.6ms ×
YOLOv11n Classify GPU FP16 ? 1.6ms × ? 0.7ms
YOLOv11n Classify CPU INT8 × ? × ? ×
YOLOv11n Classify GPU INT8 × ? × ? 0.7ms
YOLOv11n Detect CPU FP32 48.1ms 49.1ms 46.5ms 34.7ms ×
YOLOv11n Detect GPU FP32 30.5ms 34.2ms × ? 26.9ms
YOLOv11n Detect CPU FP16 × 69.2ms 46.7ms 34.7ms ×
YOLOv11n Detect GPU FP16 ? 36.9ms × ? 23.8ms
YOLOv11n Detect CPU INT8 × ? × 33.8ms ×
YOLOv11n Detect GPU INT8 × ? × ? 22.4ms
YOLOv11n Segment CPU FP32 171.1ms 137.4ms 137.8ms 81.6ms ×
YOLOv11n Segment GPU FP32 79.1ms 81.9ms × ? 64.6ms
YOLOv11n Segment CPU FP16 × 159.2ms 137.3ms 80.8ms ×
YOLOv11n Segment GPU FP16 ? 84.5ms × ? 58.6ms
YOLOv11n Segment CPU INT8 × ? × ? ×
YOLOv11n Segment GPU INT8 × ? × ? ?
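
Because each row of the benchmark tables above is a whitespace-separated record, it can be handy to load rows into a script when comparing backends. The following is a minimal sketch, not part of the repository: the helper names `parse_row` and `fastest` are hypothetical, and the backend list follows the Python table header (the C++ table uses `LibTorch` in place of `PyTorch`). Any cell that is not a number followed by `ms`, such as `×` or `?`, is stored as `None`.

```python
BACKENDS = ["PyTorch", "ONNXRuntime", "OpenCV", "OpenVINO", "TensorRT"]


def parse_row(line, backends=BACKENDS):
    """Split one table row into its labels and per-backend latencies in ms."""
    fields = line.split()
    model, task, device, precision = fields[:4]
    latencies = {}
    for backend, cell in zip(backends, fields[4:]):
        try:
            # Only cells such as "9.1ms" carry a measurement.
            latencies[backend] = float(cell[:-2]) if cell.endswith("ms") else None
        except ValueError:
            latencies[backend] = None
    return {"model": model, "task": task, "device": device,
            "precision": precision, "latency_ms": latencies}


def fastest(row):
    """Return (backend, latency_ms) for the lowest measured latency, or None."""
    measured = {k: v for k, v in row["latency_ms"].items() if v is not None}
    if not measured:
        return None
    return min(measured.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    row = parse_row("YOLOv5n Detect GPU FP32 9.1ms 12.8ms 8.1ms ? 5.5ms")
    print(fastest(row))  # ('TensorRT', 5.5)
```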

You can get a Docker image with:

docker pull taify/yolo_inference:cuda11.8

or

docker pull taify/yolo_inference:cuda12.5

You can download some model weights from: https://pan.baidu.com/s/1L8EyTa59qu_eEb3lKRnPQA?pwd=itda

For your own model, you should transpose the output dimensions of YOLOv8, YOLOv9 and YOLOv11 detection and segmentation models. For an ONNX model, you can use a script like this:

import os
import sys

import onnx
import onnx.helper as helper


def main():
    if len(sys.argv) < 2:
        print("Usage:\n python transpose.py yolov8n.onnx")
        return 1

    file = sys.argv[1]
    if not os.path.exists(file):
        print(f"Path does not exist: {file}")
        return 1

    # Write the result next to the input, e.g. yolov8n.onnx -> yolov8n.trans.onnx
    prefix, suffix = os.path.splitext(file)
    dst = prefix + ".trans" + suffix

    model = onnx.load(file)
    # Assumes the last node in the graph produces the output to transpose.
    node = model.graph.node[-1]

    # Rename the node's output so a Transpose can be appended after it
    # while the graph output keeps its original name.
    old_output = node.output[0]
    node.output[0] = "pre_transpose"

    # Swap dims 1 and 2 in the declared shape of the matching graph output.
    for specout in model.graph.output:
        if specout.name == old_output:
            shape0 = specout.type.tensor_type.shape.dim[0]
            shape1 = specout.type.tensor_type.shape.dim[1]
            shape2 = specout.type.tensor_type.shape.dim[2]
            new_out = helper.make_tensor_value_info(
                specout.name,
                specout.type.tensor_type.elem_type,
                [0, 0, 0]
            )
            new_out.type.tensor_type.shape.dim[0].CopyFrom(shape0)
            new_out.type.tensor_type.shape.dim[2].CopyFrom(shape1)
            new_out.type.tensor_type.shape.dim[1].CopyFrom(shape2)
            specout.CopyFrom(new_out)

    # Append the Transpose node that now produces the original output name.
    model.graph.node.append(
        helper.make_node("Transpose", ["pre_transpose"], [old_output], perm=[0, 2, 1])
    )

    print(f"Model saved to {dst}")
    onnx.save(model, dst)
    return 0


if __name__ == "__main__":
    sys.exit(main())
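
To make the effect of the appended `Transpose` node concrete, the sketch below applies the same `perm=[0, 2, 1]` to an output shape. The shape `[1, 84, 8400]` is an assumption: it is what a YOLOv8n detection model exported at 640×640 with 80 classes typically produces (4 box values plus 80 class scores for each of 8400 candidate boxes), and your own model's dimensions may differ. The helper name `transposed_shape` is hypothetical and only illustrates the axis bookkeeping.

```python
def transposed_shape(shape, perm):
    """Return the shape after a Transpose with the given permutation.

    Output axis i takes its size from input axis perm[i], which is how the
    ONNX Transpose operator interprets `perm`.
    """
    if sorted(perm) != list(range(len(shape))):
        raise ValueError(f"perm {perm} is not a permutation of the axes of {shape}")
    return [shape[axis] for axis in perm]


if __name__ == "__main__":
    original = [1, 84, 8400]  # assumed YOLOv8n detection output shape
    print(original, "->", transposed_shape(original, [0, 2, 1]))
    # [1, 84, 8400] -> [1, 8400, 84]
```

After the transpose, each candidate box occupies one row of the output, which is the same `[batch, boxes, values]` layout that YOLOv5 exports use.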
