Is your method end-2-end? #17

Open · phongnhhn92 opened this issue Jun 22, 2020 · 4 comments

phongnhhn92 commented Jun 22, 2020

Hello,
I have read your paper! Thanks for uploading the code.
However, I would like to ask if your method can be trained end-2-end.
As I understand it, the Depth module builds a cost volume around the keyframe and then uses a 3D CNN to predict the depth of that keyframe. In the Motion module, images and depths are required as input to predict the relative poses.
If you have N = 5 input images, does that mean you have to run your Depth module N times to get all N depth maps as input to the Motion module?

zachteed (Collaborator) commented

Hi, we unroll a single step during training (1 motion update and 1 depth update). This is end-to-end in the sense that we can backpropagate the gradient on the depth output back through the motion module.
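
For intuition, here is a minimal PyTorch-style sketch of one unrolled step. The module internals and tensor shapes are made up for illustration; only the update ordering and the gradient path reflect the description above.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real networks; architectures and shapes are hypothetical.
class MotionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 6)  # pretend pose update: 6-DoF per frame

    def forward(self, feats, depth):
        # the pose update is conditioned on the current depth estimate
        return self.net(feats * depth.mean())

class DepthModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 1)  # pretend depth head (scalar depth per frame)

    def forward(self, feats, poses):
        return self.net(feats + poses.mean())

motion, depth_net = MotionModule(), DepthModule()
feats = torch.randn(4, 16)  # features for 4 frames (toy)
depth = torch.ones(4, 1)    # initial depth guess

# One unrolled step: 1 motion update, then 1 depth update.
poses = motion(feats, depth)     # motion update uses the current depth
depth = depth_net(feats, poses)  # depth update uses the new poses

# A loss on the depth output backpropagates through BOTH modules,
# which is what makes the single unrolled step end-to-end.
loss = depth.square().mean()
loss.backward()
print(all(p.grad is not None for p in motion.parameters()))  # True
```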

Due to memory constraints, we only compute the depth for a single frame in the video during training. However, having the depth for a single frame is sufficient as input to the motion module; this corresponds to the "Keyframe Pose Optimization" setting (Sec. 3.2 in our paper). Our network is trained in this setting.

At inference time, you can run DeepV2D in "global" mode (--mode=global), where the depth for all frames is computed as input to the motion module. This is done in a single forward pass by automatically batching the frames.
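
Conceptually, the difference between the two modes is just which frames get a depth prediction. A toy Python sketch (not the actual TensorFlow code, and with a dummy stand-in for the depth network):

```python
import torch

def compute_depths(depth_net, frames, mode="keyframe"):
    """Toy illustration of keyframe vs. global depth inference.
    frames: (N, C, H, W) tensor of video frames (hypothetical layout)."""
    if mode == "keyframe":
        # local mode: depth is predicted only for the keyframe (frame 0)
        return depth_net(frames[:1])
    # global mode: one batched forward pass yields depth for all N frames
    return depth_net(frames)

# dummy stand-in for the cost-volume / 3D-CNN depth network
depth_net = lambda batch: batch.mean(dim=1, keepdim=True)
frames = torch.randn(5, 3, 8, 8)
print(compute_depths(depth_net, frames, "keyframe").shape)  # torch.Size([1, 1, 8, 8])
print(compute_depths(depth_net, frames, "global").shape)    # torch.Size([5, 1, 8, 8])
```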

zachteed (Collaborator) commented

[figure: KeyframeGraph]

This picture might help. During training we operate in the local mode, where only the depth for a single keyframe is estimated; this is sufficient to estimate the pose of all frames. During inference, we can operate in global mode, which estimates the depth for all frames. This introduces redundant constraints, which gives some improvement in performance. Each edge in the graph corresponds to estimating the optical flow between a pair of frames. Keyframes can have both outgoing and incoming edges, while one-way frames (without depth) can have only incoming edges.

This graph is used to define the objective function in Eq. 5, where the pairs (i, j) ∈ C correspond to edges in the graph.
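
In code, the edge set C for the two modes could be built like this (a sketch, under the assumption stated above that edges originate at frames with depth, i.e. keyframes):

```python
def build_edges(num_frames, keyframes):
    """Edges (i, j) of the frame graph: flow is estimated from keyframe i
    to every other frame j. Local mode: keyframes = [0].
    Global mode: keyframes = range(num_frames)."""
    return [(i, j) for i in keyframes
                   for j in range(num_frames) if i != j]

# local mode: only the keyframe has outgoing edges; the one-way
# frames (no depth) only receive edges
print(build_edges(5, [0]))            # [(0, 1), (0, 2), (0, 3), (0, 4)]

# global mode: every frame is a keyframe, giving the redundant
# two-way constraints mentioned above
print(len(build_edges(5, range(5))))  # 20 edges
```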

phongnhhn92 (Author) commented

Hi,
Thanks for your reply!
So do you mean that during training the Depth module will only predict the depth map of the keyframe, and that this depth is concatenated with the images from different timesteps in the Motion module? I am sorry if my questions are a bit too much.

zachteed (Collaborator) commented

Hi, yes, during training we only predict the depth for a keyframe (taken to be the first frame in the sequence). However, with more GPU memory or a smaller batch size, it would certainly be possible with the code to use 2 or more keyframes.

But we don't concatenate the depth with the images. Instead, the motion module estimates the optical flow between the keyframe and each of the other frames. The optical flow and depth are then used as input to a least-squares optimization layer, which uses the flow and depth to solve for the pose update.
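
As a rough illustration of what such a least-squares layer solves, here is a heavily simplified NumPy sketch: one Gauss-Newton step that finds the pose increment whose induced reprojection of the keyframe's depth best matches a predicted flow field. The intrinsics, the first-order rotation model, and the numerical Jacobian are all simplifications for the sketch; this is not the DeepV2D implementation.

```python
import numpy as np

def se3_apply(xi, X):
    """Apply a small SE(3) increment xi = (wx, wy, wz, tx, ty, tz) to
    points X (N, 3), using the first-order approximation R ≈ I + [w]_x."""
    w, t = xi[:3], xi[3:]
    Wx = np.array([[0, -w[2], w[1]],
                   [w[2], 0, -w[0]],
                   [-w[1], w[0], 0]])
    return X @ (np.eye(3) + Wx).T + t

def project(X, f=100.0, c=64.0):
    """Pinhole projection with made-up focal length f and principal point c."""
    return np.stack([f * X[:, 0] / X[:, 2] + c,
                     f * X[:, 1] / X[:, 2] + c], axis=1)

def pose_update(pix, depth, flow, f=100.0, c=64.0, eps=1e-6):
    """One Gauss-Newton step: the pose increment whose induced
    reprojection best matches the predicted flow (in the least-squares sense)."""
    # backproject keyframe pixels using the estimated depth
    X = depth[:, None] * np.stack([(pix[:, 0] - c) / f,
                                   (pix[:, 1] - c) / f,
                                   np.ones(len(pix))], axis=1)
    target = pix + flow  # where the flow says each pixel should land

    def residual(xi):
        return (project(se3_apply(xi, X), f, c) - target).ravel()

    # numerical Jacobian of the residual w.r.t. the 6 pose parameters
    r0 = residual(np.zeros(6))
    J = np.stack([(residual(eps * np.eye(6)[k]) - r0) / eps
                  for k in range(6)], axis=1)
    # solve the normal equations J^T J xi = -J^T r0
    return np.linalg.solve(J.T @ J, -J.T @ r0)

# toy check: generate flow from a known small motion, then recover it
rng = np.random.default_rng(0)
pix = rng.uniform(10, 118, size=(50, 2))
depth = rng.uniform(2.0, 5.0, size=50)
xi_true = np.array([0.01, -0.02, 0.005, 0.05, 0.0, -0.03])
X = depth[:, None] * np.stack([(pix[:, 0] - 64) / 100,
                               (pix[:, 1] - 64) / 100,
                               np.ones(50)], axis=1)
flow = project(se3_apply(xi_true, X)) - pix
print(np.round(pose_update(pix, depth, flow), 3))  # ≈ xi_true
```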
