Is your method end-2-end? #17

Open · phongnhhn92 opened this issue Jun 22, 2020 · 4 comments

phongnhhn92 commented Jun 22, 2020

Hello,
I have read your paper! Thanks for uploading the code.
However, I would like to ask if your method can be trained end-2-end.
As I understand it, the Depth module builds a cost volume around the keyframe and then uses a 3D CNN to predict the depth of that keyframe. In the Motion module, images and depths are required as input to predict the relative poses.
If you have N = 5 input images, does that mean you have to run your Depth module N times to get all N depth maps as input to the Motion module?

zachteed (Collaborator) commented

Hi, we unroll a single step during training (1 motion update and 1 depth update). This is end-to-end in the sense that we can backpropagate the gradient on the depth output back through the motion module.
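
For intuition, here is a minimal PyTorch-style sketch of one unrolled step. The module internals and tensor shapes are made up for illustration; only the update ordering and the gradient path reflect the description above.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real networks; architectures and shapes are hypothetical.
class MotionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 6)  # pretend pose update: 6-DoF per frame

    def forward(self, feats, depth):
        # the pose update is conditioned on the current depth estimate
        return self.net(feats * depth.mean())

class DepthModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 1)  # pretend depth head (scalar depth per frame)

    def forward(self, feats, poses):
        return self.net(feats + poses.mean())

motion, depth_net = MotionModule(), DepthModule()
feats = torch.randn(4, 16)  # features for 4 frames (toy)
depth = torch.ones(4, 1)    # initial depth guess

# One unrolled step: 1 motion update, then 1 depth update.
poses = motion(feats, depth)     # motion update uses the current depth
depth = depth_net(feats, poses)  # depth update uses the new poses

# A loss on the depth output backpropagates through BOTH modules,
# which is what makes the single unrolled step end-to-end.
loss = depth.square().mean()
loss.backward()
print(all(p.grad is not None for p in motion.parameters()))  # True
```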

Due to memory constraints, we only compute the depth for a single frame in the video during training. However, having the depth for a single frame is sufficient as input to the motion module; this corresponds to the "Keyframe Pose Optimization" setting (Sec. 3.2 in our paper). Our network is trained in this setting.

At inference time, you can run DeepV2D in "global" mode (--mode=global), where the depth for all frames is computed as input to the motion module. This is done in a single forward pass by automatically batching the frames.
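
Conceptually, the difference between the two modes is just which frames get a depth prediction. A toy Python sketch (not the actual TensorFlow code, and with a dummy stand-in for the depth network):

```python
import torch

def compute_depths(depth_net, frames, mode="keyframe"):
    """Toy illustration of keyframe vs. global depth inference.
    frames: (N, C, H, W) tensor of video frames (hypothetical layout)."""
    if mode == "keyframe":
        # local mode: depth is predicted only for the keyframe (frame 0)
        return depth_net(frames[:1])
    # global mode: one batched forward pass yields depth for all N frames
    return depth_net(frames)

# dummy stand-in for the cost-volume / 3D-CNN depth network
depth_net = lambda batch: batch.mean(dim=1, keepdim=True)
frames = torch.randn(5, 3, 8, 8)
print(compute_depths(depth_net, frames, "keyframe").shape)  # torch.Size([1, 1, 8, 8])
print(compute_depths(depth_net, frames, "global").shape)    # torch.Size([5, 1, 8, 8])
```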

zachteed (Collaborator) commented

[figure: KeyframeGraph]

This picture might help. During training we operate in the local mode, where only the depth for a single keyframe is estimated; this is sufficient to estimate the pose of all frames. During inference, we can operate in global mode, which estimates the depth for all frames. This introduces redundant constraints, which gives some improvement in performance. Each edge in the graph corresponds to estimating the optical flow between a pair of frames. Keyframes can have both outgoing and incoming edges, while one-way frames (without depth) can have only incoming edges.

This graph is used to define the objective function in Eq. 5, where the pairs (i, j) ∈ C correspond to edges in the graph.
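
In code, the edge set C for the two modes could be built like this (a sketch, under the assumption stated above that edges originate at frames with depth, i.e. keyframes):

```python
def build_edges(num_frames, keyframes):
    """Edges (i, j) of the frame graph: flow is estimated from keyframe i
    to every other frame j. Local mode: keyframes = [0].
    Global mode: keyframes = range(num_frames)."""
    return [(i, j) for i in keyframes
                   for j in range(num_frames) if i != j]

# local mode: only the keyframe has outgoing edges; the one-way
# frames (no depth) only receive edges
print(build_edges(5, [0]))            # [(0, 1), (0, 2), (0, 3), (0, 4)]

# global mode: every frame is a keyframe, giving the redundant
# two-way constraints mentioned above
print(len(build_edges(5, range(5))))  # 20 edges
```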

phongnhhn92 (Author) commented

Hi,
Thanks for your reply!
So do you mean that during training the Depth module will only predict the depth map of the keyframe, and that this depth is concatenated with the images from different timesteps in the Motion module? I am sorry if my questions are a bit too much.

zachteed (Collaborator) commented

Hi, yes, during training we only predict the depth for a keyframe (taken to be the first frame in the sequence). However, with more GPU memory or a smaller batch size, it would certainly be possible with the code to use 2 or more keyframes.

But we don't concatenate the depth with the images. Instead, the motion module estimates the optical flow between the keyframe and each of the other frames. The optical flow and depth are then used as input to a least-squares optimization layer, which uses the flow and depth to solve for the pose update.
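
As a rough illustration of what such a least-squares layer solves, here is a heavily simplified NumPy sketch: one Gauss-Newton step that finds the pose increment whose induced reprojection of the keyframe's depth best matches a predicted flow field. The intrinsics, the first-order rotation model, and the numerical Jacobian are all simplifications for the sketch; this is not the DeepV2D implementation.

```python
import numpy as np

def se3_apply(xi, X):
    """Apply a small SE(3) increment xi = (wx, wy, wz, tx, ty, tz) to
    points X (N, 3), using the first-order approximation R ≈ I + [w]_x."""
    w, t = xi[:3], xi[3:]
    Wx = np.array([[0, -w[2], w[1]],
                   [w[2], 0, -w[0]],
                   [-w[1], w[0], 0]])
    return X @ (np.eye(3) + Wx).T + t

def project(X, f=100.0, c=64.0):
    """Pinhole projection with made-up focal length f and principal point c."""
    return np.stack([f * X[:, 0] / X[:, 2] + c,
                     f * X[:, 1] / X[:, 2] + c], axis=1)

def pose_update(pix, depth, flow, f=100.0, c=64.0, eps=1e-6):
    """One Gauss-Newton step: the pose increment whose induced
    reprojection best matches the predicted flow (in the least-squares sense)."""
    # backproject keyframe pixels using the estimated depth
    X = depth[:, None] * np.stack([(pix[:, 0] - c) / f,
                                   (pix[:, 1] - c) / f,
                                   np.ones(len(pix))], axis=1)
    target = pix + flow  # where the flow says each pixel should land

    def residual(xi):
        return (project(se3_apply(xi, X), f, c) - target).ravel()

    # numerical Jacobian of the residual w.r.t. the 6 pose parameters
    r0 = residual(np.zeros(6))
    J = np.stack([(residual(eps * np.eye(6)[k]) - r0) / eps
                  for k in range(6)], axis=1)
    # solve the normal equations J^T J xi = -J^T r0
    return np.linalg.solve(J.T @ J, -J.T @ r0)

# toy check: generate flow from a known small motion, then recover it
rng = np.random.default_rng(0)
pix = rng.uniform(10, 118, size=(50, 2))
depth = rng.uniform(2.0, 5.0, size=50)
xi_true = np.array([0.01, -0.02, 0.005, 0.05, 0.0, -0.03])
X = depth[:, None] * np.stack([(pix[:, 0] - 64) / 100,
                               (pix[:, 1] - 64) / 100,
                               np.ones(50)], axis=1)
flow = project(se3_apply(xi_true, X)) - pix
print(np.round(pose_update(pix, depth, flow), 3))  # ≈ xi_true
```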
