
Reproducing the training results #12

Open
wkbian opened this issue Mar 4, 2024 · 1 comment

@wkbian commented Mar 4, 2024

Hi, @16lemoing,

Congratulations on your paper acceptance! 🎉

I encountered some problems while reproducing your training results. I followed the instructions in the training section, but the motion loss does not seem to converge when I set world_size = 4, which matches the setting in the paper:
"DOT is trained on frames at resolution 512×512 for 500k steps with the ADAM optimizer [32] and a learning rate of 10−4 using 4 NVIDIA V100 GPUs."
[screenshot: training curves showing the motion loss not converging]
Could you please provide some suggestions? thx~
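For reference, the optimizer setting quoted above maps to something like this minimal PyTorch sketch (the model is just a placeholder, not the repository's actual training code):

```python
import torch

# Placeholder module standing in for the DOT refiner.
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)

# ADAM optimizer with learning rate 1e-4, as described in the paper,
# intended for 500k training steps on 512x512 frames across 4 GPUs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```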

@16lemoing (Owner) commented Mar 7, 2024

Hi @wkbian, it is normal for the training loss to be a bit noisy. Could you run the evaluation on CVO to properly assess the performance of the final model? For example:

python test_cvo.py --split final --refiner_path checkpoints/YOUR_RUN/last.pth

I found a bug in the distributed training mode: all GPUs were sampling the same elements of the dataset simultaneously. The issue is fixed in cdee971.
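For anyone hitting the same symptom, the usual PyTorch pattern is to give each rank its own shard of the data with DistributedSampler and reseed it every epoch. This is a generic sketch of that pattern, not the exact change in cdee971; the dataset, world_size, and rank values are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Dummy dataset standing in for the actual training data.
dataset = TensorDataset(torch.randn(64, 3, 512, 512))

# In a real run these come from the launcher / torch.distributed;
# hard-coded here so the sketch is self-contained.
world_size, rank = 4, 0

# Each rank iterates over a disjoint shard of the dataset. Without such a
# sampler, every GPU sees the same elements in the same order.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(2):
    # Reseed per epoch so the shuffling differs across epochs.
    sampler.set_epoch(epoch)
    for (frames,) in loader:
        pass  # forward / backward / optimizer step
```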

Also, setting the flag --lambda_motion_loss 1000 during training slightly improves motion prediction quality but slightly degrades visibility prediction. This is what we use in our final method.
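To illustrate what that flag changes, the weight typically enters the objective as a scale on the motion term relative to the visibility term. The individual loss functions below are hypothetical stand-ins, not the actual definitions from the repo:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_motion, gt_motion, pred_vis_logits, gt_vis, lambda_motion=1000.0):
    # Hypothetical terms: L1 on the predicted motion, BCE on visibility.
    motion_loss = F.l1_loss(pred_motion, gt_motion)
    vis_loss = F.binary_cross_entropy_with_logits(pred_vis_logits, gt_vis)
    # A larger lambda_motion (e.g. --lambda_motion_loss 1000) favors motion
    # accuracy at a small cost in visibility accuracy.
    return lambda_motion * motion_loss + vis_loss
```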
