
Slow Benchmark Result from Trained Model Using SparseML. #2361

Open
WayneSkywalker opened this issue Mar 3, 2025 · 0 comments

Comments

@WayneSkywalker

Hi, I have an issue with transfer learning using SparseML, following the instructions in https://github.com/neuralmagic/sparseml/blob/main/integrations/ultralytics-yolov8/tutorials/sparse-transfer-learning.md.

More specifically, I trained:

sparseml.ultralytics.train \
  --model "zoo:cv/detection/yolov8-m/pytorch/ultralytics/coco/pruned80-none" \
  --recipe "zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80_quant-none" \
  --data "coco128.yaml" \
  --batch 2

and then exported the trained model:

sparseml.ultralytics.export_onnx \
  --model ./runs/detect/train/weights/last.pt \
  --save_dir yolov8-m

and then ran a benchmark using DeepSparse:

>> deepsparse.benchmark /home/ubuntu/code/models/trained_model.onnx
2025-03-03 03:23:56 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: /home/ubuntu/code/models/trained_model.onnx
        batch_size: 1
        num_cores: 4
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 0.0
        cpu_avx_type: avx512
        cpu_vnni: True
2025-03-03 03:23:56 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/trained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 4.1084
Latency Mean (ms/batch): 243.3896
Latency Median (ms/batch): 240.5514
Latency Std (ms/batch): 10.9256
Iterations: 42

Here are the related dependencies and training environment.
Libraries:

  • torch==2.5.1
  • sparseml==1.8.0
  • deepsparse==1.8.0
  • ultralytics==8.0.124
  • onnx==1.14.1
  • onnxruntime==1.17.0

Training Environment:

  • NVIDIA GeForce RTX 4070 Ti (12 GB VRAM)
  • Ubuntu 22.04

This is quite slow. I suspect fraction_of_supported_ops: 0.0 is related to the poor result, because when I benchmark the pretrained weights used in the training command above (from https://sparsezoo.neuralmagic.com/models/yolov8-m-coco-pruned80_quantized?hardware=deepsparse-c6i.12xlarge&comparison=yolov8-m-coco-base), the result is much faster:

>> deepsparse.benchmark /home/ubuntu/code/models/pretrained_model.onnx
2025-03-03 03:52:06 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:52:07 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: /home/ubuntu/code/models/pretrained_model.onnx
        batch_size: 1
        num_cores: 4
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 1.0
        cpu_avx_type: avx512
        cpu_vnni: True
2025-03-03 03:52:08 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:52:08 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/pretrained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 25.9231
Latency Mean (ms/batch): 38.5548
Latency Median (ms/batch): 38.2803
Latency Std (ms/batch): 1.4339
Iterations: 260

I noticed that fraction_of_supported_ops is 1.0 for the pretrained model.

I then searched for this setting, but all I found was the brief mention of the optimized runtime in https://github.com/neuralmagic/deepsparse/blob/36b92eeb730a74a787cea467c9132eaa1b78167f/src/deepsparse/engine.py#L417.

I have some questions:

  1. What exactly is fraction_of_supported_ops?
  2. What can I do to improve fraction_of_supported_ops for my trained model?
  3. How does fraction_of_supported_ops affect the benchmark result?