
Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8 #23562

Open
wants to merge 5 commits into main
Conversation

@yf711 (Contributor) commented Feb 3, 2025

Description

When building ORT on Windows with CUDA 12.8, there were compile errors, and the log prompted: To resolve this issue, either use "-rdc=true", or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off).

This PR enables RDC (CUDA separable compilation) so that the build can succeed. The original error:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): error C2220: the following warning is treated as an error [C:\Users\yifanl\Downloads\0202-new-cmake-config\Release\onnxruntime_providers_cuda.vcxproj]
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): warning C4505: '__cudaUnregisterBinaryUtil': unreferenced function with internal linkage has been removed
```

Motivation and Context

@yf711 changed the title from "enable rdc and skip error" to "Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8" Feb 3, 2025
@yf711 marked this pull request as ready for review February 3, 2025 07:29
@yf711 requested a review from snnn February 3, 2025 07:29

```cmake
# relocatable-device-code=true
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
endif()
```
A contributor commented on this change:

Please try separate compilation with link-time optimization:
https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/
That might have better performance than separate compilation alone.
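For reference, a rough CMake sketch of what device LTO could look like on top of separable compilation, assuming nvcc's `-dlto` flag and CMake's `$<DEVICE_LINK:...>` generator expression are used. This only illustrates the approach from the blog post, not what this PR currently does; `${target}` refers to the same target as in the snippet above.

```cmake
# Sketch only: separable compilation plus device link-time optimization.
# -dlto must be passed to nvcc at both compile time and device-link time.
set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_compile_options(${target} PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:-dlto>")
target_link_options(${target} PRIVATE
  "$<DEVICE_LINK:-dlto>")
```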

@snnn (Member) commented Feb 3, 2025

I still see errors like:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\cuda/std/detail/libcxx/include/cmath(1032): error #221-D: floating-point value does not fit in required floating-point type [D:\onnxruntime\b\Debug\onnxruntime_providers_cuda.vcxproj]
      if (__r >= ::nextafter(static_cast<_RealT>(_MaxVal), ((float)(1e+300))))
                                                             ^
```

@yf711 (Contributor, Author) commented Feb 4, 2025

> I still see errors like:
>
> ```
> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\cuda/std/detail/libcxx/include/cmath(1032): error #221-D: floating-point value does not fit in required floating-point type [D:\onnxruntime\b\Debug\onnxruntime_providers_cuda.vcxproj]
>       if (__r >= ::nextafter(static_cast<_RealT>(_MaxVal), ((float)(1e+300))))
>                                                              ^
> ```

I can't repro this issue in my env (sm75 GPU), but it seems the stricter diagnostics in the CUDA 12.8 header files cause this error.
I suppressed this 221 error. Please verify whether this helps on your side.
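For context, a minimal sketch of how such a diagnostic could be suppressed for CUDA sources, assuming nvcc's `--diag-suppress` option is the mechanism; the exact flag and placement in this PR may differ.

```cmake
# Sketch only: ask nvcc to suppress diagnostic #221 for CUDA translation units.
# --diag-suppress is available in nvcc 11.2 and newer.
target_compile_options(${target} PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:--diag-suppress=221>")
```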

@snnn (Member) left a comment

Thanks. I tried it. It's good.

@tianleiwu (Contributor) commented

@yf711, could you run some benchmarks to see how much performance impact separate compilation has (since the code is no longer fully optimized)?

If we look at the graph in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact is actually very large.

@yf711 (Contributor, Author) commented Feb 5, 2025

> @yf711, could you run some benchmarks to see how much performance impact separate compilation has (since the code is no longer fully optimized)?
>
> If we look at the graph in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact is actually very large.

I just ran the benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than main, but their latency diff is within 5%.

The EP Perf CI runs on Ubuntu with a Python env. I also tested on a Windows desktop with/without this PR on a few models (ResNet50, FRCNN) via onnxruntime_perf_test and saw similar results.

@snnn (Member) commented Feb 5, 2025

I tried your branch in our Windows CUDA CI pipeline, but there were some errors:

https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/1606829/logs/30

@tianleiwu (Contributor) commented Feb 5, 2025

> I just ran the benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than main, but their latency diff is within 5%.

Please make sure you test the perf of the CUDA EP instead of the TRT EP.
The TRT EP is linked against the pre-compiled TRT library, so it is less affected by this option.

@yf711 (Contributor, Author) commented Feb 5, 2025

> I just ran the benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than main, but their latency diff is within 5%.
>
> Please make sure you test the perf of the CUDA EP instead of the TRT EP. The TRT EP is linked against the pre-compiled TRT library, so it is less affected by this option.

Thanks for the comment. I just ran some perf comparisons on a Windows desktop (T1000 GPU, sm75) across the main branch, the current PR, and the PR with LTO:

| Model (`.\onnxruntime_perf_test.exe -e cuda -r 1000`, average inference time) | Main (cu126) | Cu128 + separable compilation | Cu128 + separable compilation with LTO |
| --- | --- | --- | --- |
| faster_rcnn_R_50_FPN_1x | 24.8741 ms | 24.7539 ms | 25.3358 ms |
| resnet50-v2-7 | 8.83048 ms | 8.82946 ms | 8.86747 ms |

So far, I haven't seen a perf regression on the CUDA EP, but I will find more models to test. Feel free to try this PR and let me know if you see a perf regression.

On the other hand, I am still exploring LTO, which might need broader config changes.
If there's no significant perf change with this PR, I will merge it and adopt LTO in another PR.

@tianleiwu (Contributor) commented
I did some tests using a bert-large model on H100 and Ubuntu; latency at batch size 16 and sequence length 256 increased by 1.2% after this change, so it has some negative impact on performance.

BTW, building the wheel only (no tests) does not need this change on Linux. Shall we limit the scope (e.g., Windows only, tests only)?
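If the scope were narrowed, the existing guard could be tightened along these lines; the extra condition is only illustrative, since the right ORT build option to key off (tests vs. wheel-only builds) would need to be confirmed.

```cmake
# Sketch only: keep RDC limited to configurations that actually need it,
# e.g. Windows with CUDA 12.8+ and builds that include the affected targets.
# onnxruntime_BUILD_UNIT_TESTS is used here as a placeholder condition.
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8
    AND onnxruntime_BUILD_UNIT_TESTS)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
endif()
```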

@snnn (Member) commented Feb 11, 2025

Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies by more than that. I mean, if you run the same benchmark again and again, the number varies.

@tianleiwu (Contributor) commented Feb 11, 2025

> Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies by more than that. I mean, if you run the same benchmark again and again, the number varies.

I ran it 3 times (10570 samples per benchmark). The average latency (in ms) of the baseline (main branch): 2.495, 2.502, 2.504; latency of this branch: 2.545, 2.530, 2.537. There is some variance, but the trend is the same.
