This project demonstrates how to use the PyTorchVideo video decoder to load frames from an Objectron dataset video of an object, and how to use those frames to train a NeRF [1] model with PyTorch3D. Instead of decoding and storing all the video frames as images, PyTorchVideo provides an easy way to load and access frames on the fly. We will be using the NeRF implementation from PyTorch3D.
Install PyTorch3D
# Create new conda environment
conda create -n 3ddemo
conda activate 3ddemo
# Install PyTorch3D
conda install -c pytorch pytorch=1.7.1 torchvision cudatoolkit=10.1
conda install -c conda-forge -c fvcore -c iopath fvcore iopath
conda install pytorch3d -c pytorch3d-nightly
Install PyTorchVideo if you haven't installed it already (assuming you have cloned the repo locally):
cd pytorchvideo
python -m pip install -e .
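To quickly check that both packages are importable:
python -c "import pytorch3d, pytorchvideo; print('ok')"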
Install some extra libraries needed for NeRF:
pip install visdom Pillow matplotlib tqdm plotly
pip install hydra-core --upgrade
We will be using the PyTorch3D NeRF implementation. We have already installed the PyTorch3D conda packages, so now we only need to clone the NeRF implementation code:
cd pytorchvideo/tutorials/video_nerf
git clone https://github.com/facebookresearch/pytorch3d.git
cp -r pytorch3d/projects/nerf .
# Remove the rest of the PyTorch3D repo
rm -r pytorch3d
The Objectron repo contains helper functions for reading the metadata files. Clone it to the path pytorchvideo/tutorials/video_nerf/Objectron:
git clone https://github.com/google-research-datasets/Objectron.git
# Also install protobuf for parsing the metadata
pip install protobuf
For this demo we will be using a short video of a chair from the Objectron dataset. Each video is accompanied by metadata with the camera parameters for each frame. You can download an example video for a chair and the associated metadata by running the following script:
python download_objectron_data.py
The data files will be downloaded to the path pytorchvideo/tutorials/video_nerf/nerf/data/objectron. Within the script you can change the index of the video to obtain a different chair video. A random train/val/test split will be created and saved when the video is first loaded by the NeRF model training script.
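As a rough illustration of how such a split over frame indices might be created (a hypothetical helper, not the tutorial's actual code):

import json
import random

def make_random_split(num_frames, val_frac=0.1, test_frac=0.1, path="splits.json"):
    # Shuffle all frame indices and carve off validation and test subsets.
    indices = list(range(num_frames))
    random.shuffle(indices)
    n_val, n_test = int(num_frames * val_frac), int(num_frames * test_frac)
    split = {
        "val": indices[:n_val],
        "test": indices[n_val:n_val + n_test],
        "train": indices[n_val + n_test:],
    }
    # Save the split so the same frames are reused across runs.
    with open(path, "w") as f:
        json.dump(split, f)
    return split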
Most of the videos are recorded in landscape mode with image size (H, W) = [1440, 1920].
For this dataset we need a new config file and data loader to use it with the PyTorch3D NeRF implementation. Copy the relevant dataset and config files into the nerf folder and replace the original files:
# Make sure you are at the path: pytorchvideo/tutorials/video_nerf
# Rename the current dataset file
mv nerf/nerf/dataset.py nerf/nerf/nerf_dataset.py
# Move the new objectron specific files into the nerf folder
mv dataset.py nerf/nerf/dataset.py
mv dataset_utils.py nerf/nerf/dataset_utils.py
mv objectron.yaml nerf/configs
In the new dataset.py file we use the PyTorchVideo EncodedVideo class to load the video .MOV file, decode it into frames, and access frames by index.
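A minimal sketch of this pattern (the path and frame index below are illustrative, not the tutorial's exact code):

from pytorchvideo.data.encoded_video import EncodedVideo

# Load the video container; frames are decoded on the fly rather than stored as images.
video = EncodedVideo.from_path("nerf/data/objectron/video.MOV")  # illustrative path

# Decode the clip spanning the whole video; clip["video"] is a (C, T, H, W) tensor of frames.
clip = video.get_clip(start_sec=0.0, end_sec=float(video.duration))
frames = clip["video"]

# Access an individual frame by its index along the time dimension.
frame = frames[:, 10]  # shape (C, H, W)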
Run the model training:
cd nerf
python ./train_nerf.py --config-name objectron
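Since the configs are managed with Hydra, options can also be overridden on the command line; for example, assuming the data.image_size key shown in the test command below also applies at training time:
python ./train_nerf.py --config-name objectron data.image_size="[96,128]"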
Predictions and metrics will be logged to Visdom. Before training starts, launch the Visdom server:
python -m visdom.server
Navigate to http://localhost:8097 to view the logs and visualizations.
After training, you can generate predictions on the test set:
python test_nerf.py --config-name objectron test.mode='export_video' data.image_size="[96,128]"
For a higher resolution video you can increase the image size to e.g. [192, 256] (note that this will slow down inference).
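For example:
python test_nerf.py --config-name objectron test.mode='export_video' data.image_size="[192,256]"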
You will need to specify the scene_center for the video in the objectron.yaml file. This is already set for the demo video specified in download_objectron_data.py. For a different video you can calculate the scene center inside eval_video_utils.py. After line 99, you can add the following code to compute the center:
# traj is the circular camera trajectory on the camera mean plane.
# We want the camera to always point towards the center of this trajectory.
x_center = traj[..., 0].mean().item()
z_center = traj[..., 2].mean().item()
# Use the height (y) of the first camera position as the center height.
y_center = traj[0, ..., 1].item()
scene_center = [x_center, y_center, z_center]
You can also point the camera down/up relative to the camera mean plane, e.g. by setting y_center -= 0.5.
Here is an example of a video reconstruction generated using a trained NeRF model. NOTE: the quality of the reconstruction is highly dependent on the range of camera poses and the accuracy of the annotations; try training a model for a few different chairs in the dataset to see which one gives the best results.
[1] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020.