
Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8 #23562

Open
wants to merge 5 commits into main
Conversation

@yf711 (Contributor) commented Feb 3, 2025

Description

When building ORT on Windows with CUDA 12.8, there were compile errors, and the log prompted: To resolve this issue, either use "-rdc=true", or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off).

This PR enables RDC (CUDA separable compilation) so that the build can succeed. The original error:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): error C2220: the following warning is treated as an error [C:\Users\yifanl\Downloads\0202-new-cmake-config\Release\onnxruntime_providers_cuda.vcxproj]
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): warning C4505: '__cudaUnregisterBinaryUtil': unreferenced function with internal linkage has been removed
```

Motivation and Context

@yf711 changed the title from "enable rdc and skip error" to "Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8" Feb 3, 2025
@yf711 marked this pull request as ready for review February 3, 2025 07:29
@yf711 requested a review from snnn February 3, 2025 07:29

```cmake
# relocatable-device-code=true
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
endif()
```
A contributor commented on this change:

Please try separate compilation with link-time optimization:
https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/
That might have better performance than separate compilation alone.
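For reference, a rough CMake sketch of what device LTO could look like on top of separable compilation, assuming nvcc's `-dlto` flag and CMake's `$<DEVICE_LINK:...>` generator expression are used. This only illustrates the approach from the blog post, not what this PR currently does; `${target}` refers to the same target as in the snippet above.

```cmake
# Sketch only: separable compilation plus device link-time optimization.
# -dlto must be passed to nvcc at both compile time and device-link time.
set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_compile_options(${target} PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:-dlto>")
target_link_options(${target} PRIVATE
  "$<DEVICE_LINK:-dlto>")
```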

@snnn (Member) commented Feb 3, 2025

I still see errors like:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\cuda/std/detail/libcxx/include/cmath(1032): error #221-D: floating-point value does not fit in required floating-point type [D:\onnxruntime\b\Debug\onnxruntime_providers_cuda.vcxproj]
      if (__r >= ::nextafter(static_cast<_RealT>(_MaxVal), ((float)(1e+300))))
                                                             ^
```

@yf711 (Contributor, Author) commented Feb 4, 2025

> I still see errors like:
>
> ```
> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\cuda/std/detail/libcxx/include/cmath(1032): error #221-D: floating-point value does not fit in required floating-point type [D:\onnxruntime\b\Debug\onnxruntime_providers_cuda.vcxproj]
>       if (__r >= ::nextafter(static_cast<_RealT>(_MaxVal), ((float)(1e+300))))
>                                                              ^
> ```

I can't repro this issue in my env (sm75 GPU), but it seems the stricter diagnostics in the CUDA 12.8 header files cause this error.
I suppressed this 221 error. Please verify whether this helps on your side.
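For context, a minimal sketch of how such a diagnostic could be suppressed for CUDA sources, assuming nvcc's `--diag-suppress` option is the mechanism; the exact flag and placement in this PR may differ.

```cmake
# Sketch only: ask nvcc to suppress diagnostic #221 for CUDA translation units.
# --diag-suppress is available in nvcc 11.2 and newer.
target_compile_options(${target} PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:--diag-suppress=221>")
```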

@snnn (Member) left a comment

Thanks. I tried it. It's good.

@tianleiwu (Contributor) commented

@yf711, could you run some benchmarks to see how much performance impact separate compilation has (since the code is no longer fully optimized)?

If we look at the graph in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact is actually very large.

@yf711 (Contributor, Author) commented Feb 5, 2025

> @yf711, could you run some benchmarks to see how much performance impact separate compilation has (since the code is no longer fully optimized)?
>
> If we look at the graph in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact is actually very large.

I just ran the benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than main, but their latency diff is within 5%.

The EP Perf CI runs on Ubuntu with a Python env. I also tested on a Windows desktop with/without this PR on a few models (ResNet50, FRCNN) via onnxruntime_perf_test and saw similar results.

@snnn (Member) commented Feb 5, 2025

I tried your branch in our Windows CUDA CI pipeline, but there were some errors:

https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/1606829/logs/30

@tianleiwu (Contributor) commented Feb 5, 2025

> I just ran the benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than main, but their latency diff is within 5%.

Please make sure you test the perf of the CUDA EP instead of the TRT EP.
The TRT EP is linked against the pre-compiled TRT library, so it is less affected by this option.

@yf711 (Contributor, Author) commented Feb 5, 2025

> I just ran the benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than main, but their latency diff is within 5%.
>
> Please make sure you test the perf of the CUDA EP instead of the TRT EP. The TRT EP is linked against the pre-compiled TRT library, so it is less affected by this option.

Thanks for the comment. I just ran some perf comparisons on a Windows desktop (T1000 GPU, sm75) across the main branch, the current PR, and the PR with LTO:

| Model (`.\onnxruntime_perf_test.exe -e cuda -r 1000`, average inference time) | Main (cu126) | Cu128 + separable compilation | Cu128 + separable compilation with LTO |
| --- | --- | --- | --- |
| faster_rcnn_R_50_FPN_1x | 24.8741 ms | 24.7539 ms | 25.3358 ms |
| resnet50-v2-7 | 8.83048 ms | 8.82946 ms | 8.86747 ms |

So far, I haven't seen a perf regression on the CUDA EP, but I will find more models to test. Feel free to try this PR and let me know if you see a perf regression.

On the other hand, I am still exploring LTO, which might need broader config changes.
If there's no significant perf change with this PR, I will merge it and adopt LTO in another PR.

@tianleiwu (Contributor) commented
I did some tests using a bert-large model on H100 and Ubuntu; latency at batch size 16 and sequence length 256 increased by 1.2% after this change, so it has some negative impact on performance.

BTW, building the wheel only (no tests) does not need this change on Linux. Shall we limit the scope (e.g., Windows only, tests only)?
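If the scope were narrowed, the existing guard could be tightened along these lines; the extra condition is only illustrative, since the right ORT build option to key off (tests vs. wheel-only builds) would need to be confirmed.

```cmake
# Sketch only: keep RDC limited to configurations that actually need it,
# e.g. Windows with CUDA 12.8+ and builds that include the affected targets.
# onnxruntime_BUILD_UNIT_TESTS is used here as a placeholder condition.
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8
    AND onnxruntime_BUILD_UNIT_TESTS)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
endif()
```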

@snnn (Member) commented Feb 11, 2025

Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies by more than that. I mean, if you run the same benchmark again and again, the number varies.

@tianleiwu (Contributor) commented Feb 11, 2025

> Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies by more than that. I mean, if you run the same benchmark again and again, the number varies.

I ran it 3 times (10570 samples per benchmark). The average latency (in ms) of the baseline (main branch): 2.495, 2.502, 2.504; latency of this branch: 2.545, 2.530, 2.537. There is some variance, but the trend is the same.
