Is your method end-2-end? #17
Comments
Hi, we unroll a single step during training (1 motion update and 1 depth update). This is end-to-end in the sense that we can backpropagate the gradient on the depth output back through the motion module. Due to memory constraints, we only compute the depth for a single frame in the video during training. However, having the depth for a single frame is sufficient as input to the motion module; this corresponds to "Keyframe Pose Optimization", Sec. 3.2 in our paper. Our network is trained in this setting. At inference time, you can run DeepV2D in "global" mode.
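As a rough illustration of this single unrolled step, here is a minimal PyTorch sketch; `MotionModule`, `DepthModule`, the tensor shapes, and the loss are toy placeholders rather than the actual DeepV2D code. The point is only that the keyframe depth loss backpropagates through the motion estimates.

```python
# Hypothetical sketch of one unrolled training step:
# one motion update, then one depth update for the keyframe only.
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * 3 * 64 * 64, 6)     # toy stand-in

    def forward(self, keyframe, frame):
        x = torch.cat([keyframe, frame], dim=1).flatten(1)
        return self.net(x)                            # 6-DoF pose update

class DepthModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3 + 6, 1, kernel_size=3, padding=1)

    def forward(self, keyframe, poses):
        # broadcast the mean pose over the image as a crude conditioning signal
        b, _, h, w = keyframe.shape
        p = poses.mean(dim=0, keepdim=True).view(1, 6, 1, 1).expand(b, 6, h, w)
        return self.net(torch.cat([keyframe, p], dim=1))

motion, depth_net = MotionModule(), DepthModule()
opt = torch.optim.Adam(list(motion.parameters()) + list(depth_net.parameters()), lr=1e-4)

frames = torch.randn(5, 3, 64, 64)        # N = 5 video frames
gt_depth = torch.rand(1, 1, 64, 64)       # ground-truth depth for the keyframe only
keyframe = frames[:1]

# motion update: pose of every other frame relative to the keyframe
poses = torch.stack([motion(keyframe, frames[i:i + 1]).squeeze(0) for i in range(1, 5)])
# depth update: predict depth for the keyframe only
pred_depth = depth_net(keyframe, poses)

opt.zero_grad()
loss = (pred_depth - gt_depth).abs().mean()
loss.backward()                           # gradients also flow into MotionModule
opt.step()
```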
This picture might help. During training we operate in the local mode, where only the depth for a single keyframe is estimated; this is sufficient to estimate the pose of all frames. During inference, we can operate in global mode, which estimates the depth for all frames. This introduces redundant constraints, which gives some improvement in performance. Each edge in the graph corresponds to estimating the optical flow between pairs of frames. Keyframes can have both outgoing and incoming edges, while one-way frames (without depth) can have only incoming edges. This graph is used to define the objective function in Eq. 5, where pairs (i, j) \in C correspond to edges in the graph.
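As a small sketch of the two graph modes (the function and its names are illustrative, not from the DeepV2D code), the pair set C can be built like this: in local mode every edge starts at a keyframe, while global mode connects all ordered pairs and adds the redundant constraints mentioned above.

```python
def build_edges(num_frames, keyframes, mode="local"):
    """Return the pair set C for the pose objective: an edge (i, j) means
    'estimate optical flow from frame i (which has depth) to frame j'."""
    if mode == "local":
        # only keyframes have depth, so only they get outgoing edges
        return [(i, j) for i in keyframes for j in range(num_frames) if i != j]
    if mode == "global":
        # every frame has depth, so edges run between all ordered pairs
        return [(i, j) for i in range(num_frames) for j in range(num_frames) if i != j]
    raise ValueError(f"unknown mode: {mode}")

print(build_edges(4, keyframes=[0], mode="local"))        # [(0, 1), (0, 2), (0, 3)]
print(len(build_edges(4, keyframes=[0], mode="global")))  # 12 ordered pairs
```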
Hi,
Hi, yes during training we only predict the depth for a keyframe (taken to be the first frame in the sequence). However, with more GPU memory, or a smaller batch size, it would certainly be possible with the code to use 2 or more keyframes. But we don't concatenate depth with the images. Instead, the motion module estimates the optical flow between the keyframe and each of the other frames. The optical flow and depth are then used as input to a least-squares optimization layer, which solves for the pose update.
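A minimal numpy sketch of such a least-squares layer, assuming per-pixel flow residuals r and a Jacobian J of the induced flow with respect to the 6-DoF pose (both of which depend on the predicted depth); the Jacobian construction is omitted and the names are hypothetical, so this only shows the solve for the pose increment.

```python
import numpy as np

def pose_update(J, r, weights=None, damping=1e-4):
    """J: (num_pixels*2, 6) flow-vs-pose Jacobian, r: (num_pixels*2,) flow residuals."""
    if weights is not None:
        # optional per-residual confidence weighting
        J = J * weights[:, None]
        r = r * weights
    H = J.T @ J + damping * np.eye(6)     # damped Gauss-Newton normal equations
    g = J.T @ r
    return np.linalg.solve(H, g)          # 6-vector pose increment

# toy example: 100 pixels, 2 flow components each
J = np.random.randn(200, 6)
r = np.random.randn(200)
delta_pose = pose_update(J, r)
print(delta_pose.shape)                   # (6,)
```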
Hello,
I have read your paper! Thanks for uploading the code.
However, I would like to ask if your method can be trained end-to-end.
As I understand it, the Depth module builds a cost volume around the keyframe and then uses a 3D CNN to predict the depth of that keyframe. In the Motion module, images and depths are required as input to predict the relative poses.
If you have N = 5 input images, does that mean you have to run your Depth module N times to get all N depth maps as input to the Motion module?
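For what it's worth, here is a hedged, simplified plane-sweep sketch of the cost-volume idea described above; the intrinsics `K`, pose `(R, t)`, feature maps, and depth hypotheses are toy placeholders, not the DeepV2D implementation.

```python
import torch
import torch.nn.functional as F

def plane_sweep_volume(feat_key, feat_other, K, R, t, depths):
    """For each depth hypothesis, backproject keyframe pixels, move them into the
    other view, sample that view's features, and stack the matching costs."""
    _, c, h, w = feat_key.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    rays = torch.linalg.inv(K) @ pix                      # unit-depth rays
    costs = []
    for d in depths:
        pts = R @ (rays * d) + t[:, None]                 # 3D points in the other camera
        proj = K @ pts
        uv = proj[:2] / proj[2:].clamp(min=1e-6)          # perspective divide
        grid = torch.stack([2 * uv[0] / (w - 1) - 1,      # normalize to [-1, 1]
                            2 * uv[1] / (h - 1) - 1], dim=-1).view(1, h, w, 2)
        warped = F.grid_sample(feat_other, grid, align_corners=True)
        costs.append((feat_key - warped).abs().mean(dim=1))   # per-pixel matching cost
    return torch.stack(costs, dim=1)                      # [B, D, H, W] volume for a 3D CNN

# toy inputs
feat_key = torch.randn(1, 8, 32, 32)
feat_other = torch.randn(1, 8, 32, 32)
K = torch.tensor([[30.0, 0.0, 16.0], [0.0, 30.0, 16.0], [0.0, 0.0, 1.0]])
R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])
volume = plane_sweep_volume(feat_key, feat_other, K, R, t, depths=[1.0, 2.0, 4.0])
print(volume.shape)   # torch.Size([1, 3, 32, 32])
```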