
Reproducing the training results #12

Open
wkbian opened this issue Mar 4, 2024 · 1 comment

@wkbian commented Mar 4, 2024

Hi, @16lemoing,

Congratulations on your paper acceptance! 🎉

I encountered some problems while reproducing your training results. I followed the instructions in the training section, but the motion loss does not seem to converge when I set world_size = 4, which matches the setting in the paper:
"DOT is trained on frames at resolution 512×512 for 500k steps with the ADAM optimizer [32] and a learning rate of 10−4 using 4 NVIDIA V100 GPUs."
[screenshot: training curves showing the motion loss not converging]
Could you please provide some suggestions? thx~
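For reference, the optimizer setting quoted above maps to something like this minimal PyTorch sketch (the model is just a placeholder, not the repository's actual training code):

```python
import torch

# Placeholder module standing in for the DOT refiner.
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)

# ADAM optimizer with learning rate 1e-4, as described in the paper,
# intended for 500k training steps on 512x512 frames across 4 GPUs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```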

@16lemoing (Owner) commented Mar 7, 2024

Hi @wkbian, it is normal for the training loss to be a bit noisy. Could you run the evaluation on CVO to properly assess the performance of the final model? For example:

python test_cvo.py --split final --refiner_path checkpoints/YOUR_RUN/last.pth

I found a bug in the distributed training mode: all GPUs were sampling the same elements of the dataset simultaneously. The issue is fixed in cdee971.
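For anyone hitting the same symptom, the usual PyTorch pattern is to give each rank its own shard of the data with DistributedSampler and reseed it every epoch. This is a generic sketch of that pattern, not the exact change in cdee971; the dataset, world_size, and rank values are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Dummy dataset standing in for the actual training data.
dataset = TensorDataset(torch.randn(64, 3, 512, 512))

# In a real run these come from the launcher / torch.distributed;
# hard-coded here so the sketch is self-contained.
world_size, rank = 4, 0

# Each rank iterates over a disjoint shard of the dataset. Without such a
# sampler, every GPU sees the same elements in the same order.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(2):
    # Reseed per epoch so the shuffling differs across epochs.
    sampler.set_epoch(epoch)
    for (frames,) in loader:
        pass  # forward / backward / optimizer step
```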

Also, setting the flag --lambda_motion_loss 1000 during training slightly improves motion prediction quality but slightly degrades visibility prediction. This is what we use in our final method.
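To illustrate what that flag changes, the weight typically enters the objective as a scale on the motion term relative to the visibility term. The individual loss functions below are hypothetical stand-ins, not the actual definitions from the repo:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_motion, gt_motion, pred_vis_logits, gt_vis, lambda_motion=1000.0):
    # Hypothetical terms: L1 on the predicted motion, BCE on visibility.
    motion_loss = F.l1_loss(pred_motion, gt_motion)
    vis_loss = F.binary_cross_entropy_with_logits(pred_vis_logits, gt_vis)
    # A larger lambda_motion (e.g. --lambda_motion_loss 1000) favors motion
    # accuracy at a small cost in visibility accuracy.
    return lambda_motion * motion_loss + vis_loss
```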
