Hi @16lemoing, congratulations on your paper acceptance! 🎉
I ran into some problems while reproducing your training results. I followed the instructions in the training section, but the motion loss does not seem to converge when I set world_size = 4, which matches the setting in the paper:
"DOT is trained on frames at resolution 512×512 for 500k steps with the ADAM optimizer [32] and a learning rate of 10⁻⁴ using 4 NVIDIA V100 GPUs."
Could you please provide some suggestions? Thanks!
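For reference, a minimal PyTorch sketch of the quoted training configuration (ADAM, learning rate 10⁻⁴, 512×512 frames). The model, target, and loss below are dummy stand-ins, not the actual DOT refiner or its objective:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the refiner network (the real model comes from the DOT repo).
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # ADAM, lr = 1e-4 as quoted

frames = torch.randn(1, 3, 512, 512)   # 512x512 input frames, as in the paper
target = torch.zeros(1, 2, 512, 512)   # dummy motion target

for step in range(3):                  # the paper trains for 500k steps
    optimizer.zero_grad()
    pred = model(frames)
    loss = nn.functional.l1_loss(pred, target)  # placeholder loss
    loss.backward()
    optimizer.step()
```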
Hi @wkbian, it is normal for the training loss to be a bit noisy. Can you run the evaluation on CVO to properly evaluate the performance of the final model? For example:
python test_cvo.py --split final --refiner_path checkpoints/YOUR_RUN/last.pth
I have found a bug in the distributed training code: all GPUs were sampling the same elements of the dataset simultaneously. The issue is fixed in cdee971.
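A minimal sketch of how this kind of bug is usually avoided in PyTorch DDP (not necessarily the exact fix in cdee971): give each rank its own shard of the data via DistributedSampler, and call set_epoch so all ranks shuffle with the same seed and the shards stay disjoint:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(rank: int, world_size: int, epoch: int) -> DataLoader:
    dataset = TensorDataset(torch.arange(16).float())  # toy dataset for illustration
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    sampler.set_epoch(epoch)  # same permutation on every rank, re-shuffled each epoch
    return DataLoader(dataset, batch_size=2, sampler=sampler)

# Each rank now iterates over a disjoint shard instead of the same elements.
for rank in range(4):  # world_size = 4, simulated sequentially here
    batches = [batch[0].tolist() for batch in build_loader(rank, world_size=4, epoch=0)]
    print(f"rank {rank}: {batches}")
```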
Also, setting the flag --lambda_motion_loss 1000 during training slightly improves motion prediction quality but slightly degrades visibility prediction. This is what we use in our final method.
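A hedged sketch of how a weight like --lambda_motion_loss plausibly enters the total objective (the actual loss terms in DOT may be defined differently):

```python
import torch

def total_loss(motion_loss: torch.Tensor,
               visibility_loss: torch.Tensor,
               lambda_motion: float = 1000.0) -> torch.Tensor:
    # A larger lambda_motion favours motion prediction at the expense of visibility.
    return lambda_motion * motion_loss + visibility_loss

print(total_loss(torch.tensor(0.01), torch.tensor(0.3)))
```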