Error during training (Assertion input_val >= zero && input_val <= one failed.) #11
Comments
Same issue.
Same issue. My mmdet version is 2.19.0, and the error was raised during the 3rd training epoch.
You can try to clamp the value of the box area when computing the GIoU loss, e.g. TOOD/mmdet/core/bbox/iou_calculators/iou2d_calculator.py, lines 212 to 215 in 93b3a87.
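For reference, a minimal sketch of that kind of clamp, assuming the usual (x1, y1, x2, y2) box layout; the function name and eps value are illustrative and not the exact mmdet code:

```python
import torch

def clamped_area(bboxes, eps=1e-6):
    # Clamp width/height so degenerate boxes cannot yield negative areas,
    # and keep the area strictly positive to avoid a division by zero
    # later in the (G)IoU computation.
    w = (bboxes[..., 2] - bboxes[..., 0]).clamp(min=0)
    h = (bboxes[..., 3] - bboxes[..., 1]).clamp(min=0)
    return (w * h).clamp(min=eps)
```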
Hello sir, I have clamped the value of the box area as you showed, but training still crashes at the 5th epoch. My mmdet version is 2.14.0+d3e713d. Error report: /opt/conda/conda-bld/pytorch_1614378098133/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [508,0,0], thread: [26,0,0] Assertion `input_val >= zero && input_val <= one` failed. Killing subprocess 19911. Thank you for your reply.
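For context, that assertion in Loss.cu fires when binary_cross_entropy on CUDA receives predictions outside [0, 1] (including NaN). A minimal sketch of the kind of safeguard sometimes added, with illustrative names, not code from this repository:

```python
import torch
import torch.nn.functional as F

def safe_bce(pred, target, eps=1e-6):
    # binary_cross_entropy asserts 0 <= input <= 1 on CUDA; clamping keeps
    # finite predictions inside that range. It does not repair NaN values,
    # which would still trip the assertion.
    pred = pred.clamp(min=eps, max=1.0 - eps)
    return F.binary_cross_entropy(pred, target)
```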
@fcjian Thanks for the reply! It solves the CUDA error, but the model cannot converge. During training, a problem similar to the one seen with gradient clipping happened: the log shows a sudden increase of the loss, after which the loss fluctuates within a tiny range. I'll try again with the original TOOD code instead of porting it to a higher mmdet version.
I meet the same issue; my code is
Problem
Thank you for your contribution. I encountered gradient explosion while training the model tood_r50_fpn_1x_coco.
I tried to train this model with the mixed-precision training strategy, with the loss scale set to 'dynamic'. The training soon stopped and raised RuntimeError: CUDA error: device-side assert triggered.
I also retrained the model with FP32 precision, but it did not work.
A lower learning rate did not fix the gradient explosion either.
Gradient clipping helps avoid the training failure (mixed-precision training, loss scale = 512), but the model cannot converge.
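For reference, a gradient-clipping and fixed loss-scale setup in the mmdet 2.x config style might look like the sketch below; the values are illustrative, not the ones used in this report:

```python
# Illustrative mmdet 2.x config fragment (not the exact config used here).
# Clip gradients so a single exploding step does not break training.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

# Mixed-precision training with a fixed loss scale instead of 'dynamic'.
fp16 = dict(loss_scale=512.)
```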
I tried to google this issue. I do not think it is an OOM problem; it seems related to NaN values in the prediction head, which then cause the error when computing the loss. I do not know whether the environment (mmdet-1.15.0) affects the training.
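To localize where the NaN first appears, one debugging approach (a sketch, not part of the original report) is to enable PyTorch anomaly detection or to check the head outputs directly:

```python
import torch

# Surfaces the first backward op that produces NaN/Inf (slow; debug only).
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Simple forward-side check on a prediction tensor before the loss.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"Non-finite values in {name}")
```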
My modification
Environment
Error Report