Unofficial implementation of NeRF-W (NeRF in the Wild) using PyTorch (pytorch-lightning). I try to reproduce (some of) the results on the lego dataset (Section D). Training on real Phototourism images (the main content of the paper) also works. Please read the following sections for the results.
The code is largely based on the NeRF implementation (see the master or dev branch); the main differences are the model structure and the rendering process, which can be found in the two files under models/.
- OS: Ubuntu 18.04
- NVIDIA GPU with CUDA>=10.2 (tested with 1 RTX2080Ti)
- Clone this repo by
git clone https://github.com/kwea123/nerf_pl
- Python>=3.6 (installation via anaconda is recommended; use
conda create -n nerf_pl python=3.6
to create a conda environment and activate it by
conda activate nerf_pl
)
- Python libraries
  - Install core requirements by
  pip install -r requirements.txt
Steps
Download nerf_synthetic.zip
from here
All random seeds are fixed to reproduce the same perturbations every time. For detailed implementation, see blender.py.
- Color perturbations: Uses the same parameters as in the paper.
- Occlusions: The square has size 200x200 (should be the same as in the paper); its position is randomly sampled inside the central 400x400 area; the 10 colors are random.
- Combined: First perturb the color, then add the square.
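The occluder sampling described above can be sketched roughly as follows. This is a minimal illustration, not the code in blender.py: the function name `sample_occluder`, the 800x800 image size, the choice of sampling the top-left corner in [200, 400), and the layout of the 10 colors as vertical stripes are all assumptions made for the example.

```python
import random

def sample_occluder(seed, square=200, region_lo=200, region_hi=400, n_colors=10):
    """Return one occluding square as a list of (box, rgb) stripes.

    Assumptions (hypothetical, for illustration only):
    - the image is 800x800, so the central 400x400 area spans pixels 200..600;
    - the top-left corner is drawn from [region_lo, region_hi), which keeps
      the whole square inside that central area;
    - the 10 random colors are laid out as equal-width vertical stripes.
    box is (left, top, right, bottom) in pixels.
    """
    rng = random.Random(seed)  # fixed seed: the same occluder on every run
    left = rng.randrange(region_lo, region_hi)
    top = rng.randrange(region_lo, region_hi)
    stripe_w = square // n_colors
    stripes = []
    for i in range(n_colors):
        color = tuple(rng.randrange(256) for _ in range(3))
        box = (left + i * stripe_w, top, left + (i + 1) * stripe_w, top + square)
        stripes.append((box, color))
    return stripes
```

Each returned box could then be filled with its color using any image library; seeding per image is what makes the perturbation reproducible across runs.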
Base:
python train.py \
--dataset_name blender \
--root_dir $BLENDER_DIR \
--N_importance 64 --img_wh 400 400 --noise_std 0 \
--num_epochs 20 --batch_size 1024 \
--optimizer adam --lr 5e-4 --lr_scheduler cosine \
--exp_name exp
Add --encode_a for appearance embedding, --encode_t for transient embedding.
Add --data_perturb color occ to perturb the dataset.
Example:
python train.py \
--dataset_name blender \
--root_dir $BLENDER_DIR \
--N_importance 64 --img_wh 400 400 --noise_std 0 \
--num_epochs 20 --batch_size 1024 \
--optimizer adam --lr 5e-4 --lr_scheduler cosine \
--exp_name exp \
--data_perturb occ \
--encode_t --beta_min 0.1
This trains NeRF-U on occluders (Table 3, bottom left).
See opt.py for all configurations.
You can monitor the training process by running tensorboard --logdir logs/ and going to localhost:6006 in your browser.
Example training loss evolution (NeRF-U on occluders):
Steps
Download the scenes you want from here (train/test splits are only provided for "Brandenburg Gate", "Sacre Coeur" and "Trevi Fountain"; if you want to train on other scenes, you need to clean the data (Section C) and split the data yourself)
Download the train/test split from the "Additional links" here and put it under each scene's folder (at the same level as the "dense" folder)
(Optional but highly recommended) Run python prepare_phototourism.py --root_dir $ROOT_DIR --img_downscale {an integer, e.g. 2 means half the image sizes}
to prepare the training data and save it to disk first, if you want to run multiple experiments or run on multiple GPUs. This greatly reduces the data preparation time before training.
Take a look at phototourism_visualization.ipynb, a quick visualization of the data (scene geometry, camera poses, rays and bounds), to assure yourself that my data conversion works correctly.
Run (example)
python train.py \
--root_dir /home/ubuntu/data/IMC-PT/brandenburg_gate/ --dataset_name phototourism \
--img_downscale 8 --use_cache --N_importance 64 --N_samples 64 \
--encode_a --encode_t --beta_min 0.03 --N_vocab 1500 \
--num_epochs 20 --batch_size 1024 \
--optimizer adam --lr 5e-4 --lr_scheduler cosine \
--exp_name brandenburg_scale8_nerfw
The --encode_a and --encode_t options are both required to maximize NeRF-W performance.
--N_vocab should be set to an integer larger than the number of images (which depends on the scene). For example, "brandenburg_gate" has 1363 images in total (under dense/images/), so any number larger than 1363 works (there is no need to set it to exactly that number). Attention! If you forget to set this number, or set it smaller than the number of images, the program will fail with RuntimeError: CUDA error: device-side assert triggered (which comes from torch.nn.Embedding).
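To avoid hitting that opaque CUDA assert, the value can be checked before training starts. The sketch below is a hypothetical pre-flight helper, not part of this repo: the names `count_images` and `check_n_vocab` and the set of accepted file suffixes are assumptions; only the dense/images/ location and the "larger than the number of images" rule come from the text above.

```python
from pathlib import Path

# Assumed set of image file suffixes; adjust to the scene's actual files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def count_images(scene_dir):
    """Count image files under <scene_dir>/dense/images/."""
    image_dir = Path(scene_dir) / "dense" / "images"
    return sum(1 for p in image_dir.iterdir()
               if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)

def check_n_vocab(n_vocab, n_images):
    """Raise a readable error if the embedding table would be too small."""
    if n_vocab <= n_images:
        raise ValueError(
            f"--N_vocab={n_vocab} must be larger than the number of "
            f"images ({n_images})"
        )
    return n_vocab
```

For example, `check_n_vocab(1500, count_images(root_dir))` would pass for a scene with 1363 images, while a value of 1000 would raise immediately with a clear message.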
Download the pretrained models and training logs in release.
Use eval.py to create the whole sequence of moving views.
It will create the folder results/{dataset_name}/{scene_name}, run inference on all test data, and finally create a gif out of the results.
All my experiments are done with image size 200x200, so theoretically PSNR is expected to be lower.
- test_nerfa_color shows that NeRF-A is able to capture image-dependent color variations.
Left: NeRF, PSNR=23.17 (paper=23.38). Right: pretrained NeRF-A, PSNR=28.20 (paper=30.66).
- test_nerfu_occ shows that NeRF-U is able to decompose the scene into static and transient components when the scene has random occluders.
Left: NeRF, PSNR=21.94 (paper=19.35). Right: pretrained NeRF-U, PSNR=28.60 (paper=23.47).
- test_nerfw_all shows that NeRF-W is able to both handle color variation and decompose the scene into static and transient components (color variation is not that well learnt though, maybe adding more layers in the static rgb head will help).
Left: NeRF, PSNR=18.83 (paper=15.73). Right: pretrained NeRF-W, PSNR=24.86 (paper=22.19).
- Reference: Original NeRF (without --encode_a and --encode_t) trained on unperturbed data.
See test_phototourism.ipynb for some paper results' reproduction.
Use eval.py (example) to create a flythrough video. You might need to design a camera path to make it look cooler!
- Network structure (nerf.py):
  - My base MLP uses 8 layers of 256 units as in the original NeRF, while NeRF-W uses 512 units each.
  - The static rgb head uses 1 layer as in the original NeRF, while NeRF-W uses 4 layers. Empirically I found that more layers overfit when there is data perturbation, as the network tends to explain the color change by the view change as well.
  - I use softplus activation for sigma (reason explained here) while NeRF-W uses relu.
  - I apply +beta_min all the way at the end of compositing all raw betas (see results['beta'] in rendering.py). The paper adds beta_min to the raw betas first and then composites them. I think my implementation is the correct way because initially the network outputs low sigmas, in which case the composited beta (if beta_min is added first) will be low too. Therefore not only can values lower than beta_min be output, but sometimes the composited beta will be zero if all sigmas are zero, which causes a problem in loss computation (division by zero). I'm not totally sure about this part; if anyone finds a better implementation please tell me.
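The difference between the two orderings of beta_min can be seen in a small numeric sketch. This is a simplified illustration under stated assumptions, not the code in rendering.py: it uses a single density per sample and standard volume-rendering weights, and the function names are hypothetical.

```python
import math

def composite_weights(sigmas, deltas):
    """Standard volume-rendering weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights = []
    transmittance = 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def beta_min_added_last(raw_betas, weights, beta_min):
    """Composite the raw betas, then add beta_min (the ordering used here)."""
    return sum(w * b for w, b in zip(weights, raw_betas)) + beta_min

def beta_min_added_first(raw_betas, weights, beta_min):
    """Add beta_min to each raw beta, then composite (the ordering in the paper)."""
    return sum(w * (b + beta_min) for w, b in zip(weights, raw_betas))
```

With all sigmas equal to zero, every weight is zero: adding beta_min last still yields exactly beta_min, whereas adding it first yields 0, which is the division-by-zero case mentioned above.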
- Training hyperparameters
  - I find that a larger (but not too large) beta_min achieves better results, so my default beta_min is 0.1 instead of 0.03 in the paper.
  - I add 3 to beta_loss (equation 13) to make it positive empirically.
  - When there is no transient head (NeRF-A), the loss is the average MSE of the coarse and fine models (not specified in the paper).
  - Other hyperparameters differ quite a lot from the paper (although many are not specified; they say that they use grid search to find the best). Please check each pretrained model in the release.
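As a rough illustration of why a constant offset of 3 keeps the beta term positive, here is a minimal sketch. It assumes the uncertainty term is the mean of log(beta) over rays (the (1/2) log(beta^2) term of equation 13 reduces to log(beta) for positive beta); the function name and signature are hypothetical and not taken from this repo.

```python
import math

def beta_loss(betas, offset=3.0):
    """Mean log-uncertainty term plus a constant offset.

    The offset does not change gradients; it only shifts the logged value.
    """
    return offset + sum(math.log(b) for b in betas) / len(betas)
```

Since beta is never smaller than beta_min, the term is bounded below by 3 + log(beta_min): about 0.70 for beta_min = 0.1, but about -0.51 for beta_min = 0.03, so with the paper's value the offset alone would not guarantee a positive number.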
- Phototourism evaluation
  - To evaluate the results on the testing set, they train on the left half of the image and evaluate on the right half (to train the embedding of the test images). I didn't perform this additional training; I only evaluated on the training images. It should be easy to implement this.
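One possible starting point for that left/right protocol is to split the pixel (ray) indices of each test image by column. The helper below is a hypothetical sketch, not part of this repo; it assumes rays are stored in row-major order, so that pixel (x, y) has flat index y * width + x.

```python
def split_left_right(width, height):
    """Return (left, right) lists of flat pixel indices for one image.

    Assumes row-major ray ordering: index = y * width + x.
    For odd widths the middle column goes to the right half.
    """
    half = width // 2
    left, right = [], []
    for y in range(height):
        for x in range(width):
            idx = y * width + x
            (left if x < half else right).append(idx)
    return left, right
```

Under this scheme the rays in `left` would be used to optimize the embedding of a test image, and the metrics would be computed only on the rays in `right`.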