Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8 #23562
base: main
Conversation
# relocatable-device-code=true
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
endif()
Please try separate compilation with device link-time optimization:
https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/
That might have better performance than separate compilation alone.
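A minimal sketch of what the suggested device-LTO variant could look like in CMake, assuming nvcc as the CUDA compiler and a CMake version that supports the DEVICE_LINK generator expression; the exact plumbing in the ORT build may differ:

# Sketch only: separable compilation plus device link-time optimization (-dlto).
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
  # Emit device code as LTO intermediate representation at compile time ...
  target_compile_options(${target} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-dlto>)
  # ... and optimize across translation units at device-link time.
  target_link_options(${target} PRIVATE $<DEVICE_LINK:-dlto>)
endif()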
I still see errors like:
I can't repro this issue in my env (sm75 GPU), but it seems stricter diagnostics in the CUDA 12.8 header files cause this error.
Thanks. I tried it. It's good.
@yf711, could you run some benchmarks to see how much performance impact separate compilation has (since the code is no longer fully optimized)? If we look at the graph in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact is actually very large:
I just ran a benchmark on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than the main branch, but their latency diff is within 5%. The EP Perf CI runs on Ubuntu with a Python env. I also tested on a Windows desktop with/without this PR on a few models (ResNet50, FRCNN) via ort_perf_test and saw similar results.
I tried your branch in our Windows CUDA CI pipeline, but there were some errors:
Please make sure you test the perf of the CUDA EP instead of the TRT EP.
Thanks for the comment. I just ran some perf comparisons on a Windows desktop (T1000 GPU, sm75) with the main branch, the current PR, and the PR with LTO:
So far, I haven't seen a perf regression on the CUDA EP, but I will look for more models to test. Feel free to try this PR and let me know if you see a perf regression. On the other hand, I am still exploring LTO, which might need broader config changes.
I did some tests using a bert-large model on H100 and Ubuntu, and latency at batch size 16 and sequence length 256 increased by 1.2% after this change, so it has some negative impact on performance. BTW, building the wheel only (no tests) does not need this change on Linux. Shall we limit the scope (e.g., Windows only, test builds only)?
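A hypothetical sketch of how the scope could be narrowed, assuming a build flag such as onnxruntime_BUILD_UNIT_TESTS indicates that tests are being compiled (names are illustrative, not a concrete proposal):

# Sketch only: apply the RDC workaround just when MSVC builds the unit tests,
# leaving wheel-only Linux builds untouched.
if (MSVC AND onnxruntime_BUILD_UNIT_TESTS AND
    CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
  set_target_properties(${target} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
endif()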
Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies by more than that. I mean, if you run the same benchmark again and again, the number varies.
I ran it 3 times (10570 samples per benchmark). The average latency (in ms) of the baseline (main branch): 2.495, 2.502, 2.504; latency of this branch: 2.545, 2.530, 2.537. There is some variance, but the trend is the same.
Description
When building ORT on Windows with CUDA 12.8, there were compile errors, and the log prompted:
To resolve this issue, either use "-rdc=true", or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)
This PR enables -rdc=true (Relocatable Device Code, RDC) for MSVC builds with CUDA 12.8 and above.
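For reference, a minimal sketch of the alternative mentioned in the diagnostic, i.e. turning the new stub behavior off instead of enabling RDC (not what this PR does; ${target} and the flag plumbing are illustrative, and nvcc documents downsides of disabling it):

# Sketch only: opt out of the CUDA 12.8 behavior instead of enabling RDC.
if (MSVC AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
  target_compile_options(${target} PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-static-global-template-stub=false>)
endif()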
Motivation and Context