Ongoing research training gaussian splatting at scale by distributed system

MyForking/Grendel-GS

 
 

Grendel-GS

Gaussian Splatting at Scale with Distributed Training System

Click Here to Download Pre-trained Models behind the above visualizations

Overview

We design and implement Grendel-GS, a distributed implementation of 3D Gaussian Splatting (3DGS) training. We aim to help 3DGS realize its scaling laws with distributed system support, just as the achievements of current LLMs rely on distributed systems.

With Grendel, your 3DGS training can use multiple GPUs to train significantly faster, hold substantially more Gaussians in GPU memory, and ultimately reconstruct larger-area, higher-resolution scenes with better PSNR. Grendel-GS retains the original algorithm, making it a direct and safe replacement for the original 3DGS implementation in any Gaussian Splatting workflow or application.

For example, with 4 GPUs, Grendel-GS allows you to:

  • Train Mip360 more than 3.5 times faster.
  • Train large-scale 4K scenes (Mega-NeRF Rubble) directly with more than 40 million Gaussians without running out of memory (OOM).
  • Train the Tanks&Temples truck scene to a test PSNR of 23.79 in about 45 seconds (7,000 iterations).

📢 News

  • 7.15.2024 - We now support gsplat as the CUDA backend during training!

🌟 Follow us for future updates! Interested in collaborating or contributing? Email us!

Why use Grendel-GS

The diagram below shows why you may need distributed Gaussian Splatting training techniques such as those in Grendel-GS:

[Figure: whydistributed]

How to use Grendel-GS

This repo and its dependency, our customized distributed version of the CUDA rendering code (diff-gaussian-rasterization), are both forks of the original 3DGS implementation. Usage is therefore very similar to the original 3DGS.

The two main differences are:

  1. We support training on multiple GPUs, using the torchrun command-line utility provided by PyTorch to launch jobs.
  2. We support batch sizes greater than 1, with the --bsz argument flag used to specify the batch size.
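
As a rough illustration of the first point: torchrun starts one process per GPU and tells each process who it is through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below is not taken from train.py. The helper name get_dist_info is hypothetical, and the fallback values simply mimic a plain `python train.py` launch, where those variables are absent.

```python
import os


def get_dist_info(env=None):
    """Read the process identity that torchrun exports to each worker.

    Hypothetical helper for illustration only; not part of Grendel-GS.
    """
    env = os.environ if env is None else env
    # torchrun sets these variables; fall back to a single-process run
    # when they are missing (e.g. when launched with plain `python`).
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


if __name__ == "__main__":
    rank, local_rank, world_size = get_dist_info()
    print(f"rank {rank} (local {local_rank}) of {world_size}")
```

Launched with `torchrun --standalone --nnodes=1 --nproc-per-node=4`, each of the four processes would see a different rank and a world size of 4.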

Setup

Cloning the Repository

The repository contains submodules, so please clone it with

git clone git@github.com:nyu-systems/Grendel-GS.git --recursive

PyTorch Environment

As prerequisites, ensure you have Conda and a GPU with a compatible driver and CUDA environment installed on your machine. Then install the essential packages: PyTorch, Torchvision, Plyfile, and tqdm. PyTorch must be version 1.10 or later so that torchrun is available for distributed training. Finally, compile and install the two dependent CUDA repositories, diff-gaussian-rasterization and simple-knn, which contain our customized CUDA kernels for rendering and related operations.

We provide a yml file for easy environment setup. However, you should choose the versions to match your local running environment.

conda env create --file environment.yml
conda activate gaussian_splatting

NOTE: We kept additional dependencies minimal compared to the original 3DGS. For environment setup issues, consider checking the issues section of the original 3DGS repo first.

Dataset

We load datasets in COLMAP format. Please download and unzip COLMAP datasets before training, for example the Mip360 dataset and the 4 scenes from Tanks&Temples and DeepBlending.

Training

For single-GPU non-distributed training with batch size of 1:

python train.py -s <path to COLMAP dataset> --eval

For 4 GPU distributed training and batch size of 4:

torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py --bsz 4 -s <path to COLMAP dataset> --eval

Command Line Arguments for train.py

--source_path / -s

Path to the source directory containing a COLMAP data set.

--model_path / -m

Path where the trained model and logs should be stored (/tmp/gaussian_splatting by default).

--eval

Add this flag to use a MipNeRF360-style training/test split for evaluation.

--bsz

The batch size (the number of camera views) used in a single training step. 1 by default.

--backend

The CUDA backend to use in training. Valid options include diff (diff-gaussian-rasterization) and gsplat. diff by default.

--lr_scale_mode

The mode of scaling learning rate given larger batch size. sqrt by default.

--preload_dataset_to_gpu

Keep all ground-truth images from the dataset in GPU memory instead of loading each image on the fly at every training step. For large datasets this will cause OOM; for small datasets it can speed up training slightly by avoiding some CPU-GPU communication.

--iterations

Number of total iterations to train for, 30_000 by default.

--test_iterations

Space-separated iterations at which the training script computes L1 and PSNR over test set, 7000 30000 by default.

--save_iterations

Space-separated iterations at which the training script saves the Gaussian model, 7000 30000 <iterations> by default.

--checkpoint_iterations

Space-separated iterations at which to store a checkpoint for continuing later, saved in the model directory.

--start_checkpoint

Path to a saved checkpoint to continue training from.

--white_background / -w

Add this flag to use white background instead of black (default), e.g., for evaluation of NeRF Synthetic dataset.

--sh_degree

Order of spherical harmonics to be used (no larger than 3). 3 by default.

--feature_lr

Spherical harmonics features learning rate, 0.0025 by default.

--opacity_lr

Opacity learning rate, 0.05 by default.

--scaling_lr

Scaling learning rate, 0.005 by default.

--rotation_lr

Rotation learning rate, 0.001 by default.

--position_lr_max_steps

Number of steps (from 0) where position learning rate goes from initial to final. 30_000 by default.

--position_lr_init

Initial 3D position learning rate, 0.00016 by default.

--position_lr_final

Final 3D position learning rate, 0.0000016 by default.

--position_lr_delay_mult

Position learning rate multiplier (cf. Plenoxels), 0.01 by default.

--densify_from_iter

Iteration where densification starts, 500 by default.

--densify_until_iter

Iteration where densification stops, 15_000 by default.

--densify_grad_threshold

Limit that decides if points should be densified based on 2D position gradient, 0.0002 by default.

--densification_interval

How frequently to densify, 100 (every 100 iterations) by default.

--opacity_reset_interval

How frequently to reset opacity, 3_000 by default.

--lambda_dssim

Influence of SSIM on total loss from 0 to 1, 0.2 by default.

--percent_dense

Percentage of scene extent (0--1) a point must exceed to be forcibly densified, 0.01 by default.
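
To make --lr_scale_mode more concrete, here is a minimal sketch of square-root learning rate scaling. It assumes the base rates listed above are defined for batch size 1 and that the sqrt mode multiplies them by the square root of --bsz. The function name scale_lr and the "none" mode are illustrative assumptions, not the flag's confirmed option set; check train.py for the actual behaviour.

```python
import math


def scale_lr(base_lr, bsz, mode="sqrt"):
    """Scale a batch-size-1 learning rate for a larger batch.

    Hypothetical helper: "sqrt" follows the default named in this README,
    while "none" is an assumed pass-through option for comparison.
    """
    if bsz < 1:
        raise ValueError("batch size must be at least 1")
    if mode == "sqrt":
        # Square-root rule: the rate grows with sqrt(batch size).
        return base_lr * math.sqrt(bsz)
    if mode == "none":
        return base_lr
    raise ValueError(f"unknown lr scale mode: {mode}")


if __name__ == "__main__":
    # Example: the default opacity learning rate (0.05) at --bsz 4.
    print(scale_lr(0.05, 4))
```

Under these assumptions, a batch size of 4 doubles each base learning rate.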


Rendering

python render.py -s <path to COLMAP dataset> --model_path <path to folder of saving model>

Command Line Arguments for render.py

--model_path / -m

Path to the trained model directory you want to create renderings for.

--skip_train

Flag to skip rendering the training set.

--skip_test

Flag to skip rendering the test set.

--distributed_load

Set this flag to load all point cloud files if the model was saved in a distributed manner (split across GPUs) during training.

--quiet

Flag to omit any text written to standard out pipe.

The below parameters will be read automatically from the model path, based on what was used for training. However, you may override them by providing them explicitly on the command line.

--source_path / -s

Path to the source directory containing a COLMAP or Synthetic NeRF data set.

--images / -i

Alternative subdirectory for COLMAP images (images by default).

--eval

Add this flag to use a MipNeRF360-style training/test split for evaluation.

--llffhold

The training/test split ratio in the whole dataset for evaluation. llffhold=8 means 1/8 is used as test set and others are used as train set.

--white_background / -w

Add this flag to use white background instead of black (default), e.g., for evaluation of NeRF Synthetic dataset.
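
The --llffhold split can be sketched as follows, assuming it follows the every-N-th-image convention of the original 3DGS loader (images sorted by name, with every llffhold-th one held out for testing). The function name split_train_test and the example file names are hypothetical.

```python
def split_train_test(image_names, llffhold=8):
    """Hold out every `llffhold`-th image (by sorted name) as the test set.

    Hypothetical helper illustrating the 1/llffhold test split;
    the real loader may differ in detail.
    """
    ordered = sorted(image_names)
    train = [name for i, name in enumerate(ordered) if i % llffhold != 0]
    test = [name for i, name in enumerate(ordered) if i % llffhold == 0]
    return train, test


# Made-up file names, used only to show the split sizes.
names = [f"img_{i:03d}.jpg" for i in range(16)]
train, test = split_train_test(names)
print(len(train), len(test))
```

With 16 images and llffhold=8, two images go to the test set and the remaining fourteen are used for training.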

For interactive rendering, please refer to GaussFusion, which also supports rendering two checkpoints with interactive controls.

Evaluating metrics

python metrics.py --model_path <path to folder of saving model>

Command Line Arguments for metrics.py

--model_paths / -m

Space-separated list of model paths for which metrics should be computed.


Migrating from original 3DGS codebase

If you currently train with the original 3DGS codebase in your application, you can switch to our codebase with little effort because we have not made any algorithmic changes. This lets you train faster, and train larger, higher-precision scenes within a reasonable time frame without running out of memory (OOM).

Note that we only support the training functionality; this repository does not include the interactive viewer, network viewer, or COLMAP features from the original 3DGS. We are actively working on more features. Please let us know your needs or contribute directly to the project. Thank you!


Benefits and Examples

Significantly Faster Training Without Compromising Reconstruction Quality On Mip360 Dataset

Training Time

| 30k Train Time (min) | stump | bicycle | kitchen | room | counter | garden | bonsai |
|---|---|---|---|---|---|---|---|
| 1 GPU + Batch Size=1 | 24.03 | 30.18 | 25.58 | 22.45 | 21.6 | 30.15 | 19.18 |
| 4 GPU + Batch Size=1 | 9.07 | 11.67 | 9.53 | 8.93 | 8.82 | 10.85 | 8.03 |
| 4 GPU + Batch Size=4 | 5.22 | 6.47 | 6.98 | 6.18 | 5.98 | 6.48 | 5.28 |
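
As a quick sanity check on the training times above, the following sketch computes the per-scene speedup of each 4 GPU configuration over the 1 GPU baseline. The numbers are copied from the table; the variable and function names are illustrative.

```python
# 30k-iteration training times in minutes, copied from the table above.
scenes = ["stump", "bicycle", "kitchen", "room", "counter", "garden", "bonsai"]
one_gpu_bs1 = [24.03, 30.18, 25.58, 22.45, 21.6, 30.15, 19.18]
four_gpu_bs1 = [9.07, 11.67, 9.53, 8.93, 8.82, 10.85, 8.03]
four_gpu_bs4 = [5.22, 6.47, 6.98, 6.18, 5.98, 6.48, 5.28]


def speedups(baseline, faster):
    """Return baseline time divided by faster time, scene by scene."""
    return [b / f for b, f in zip(baseline, faster)]


bs1 = speedups(one_gpu_bs1, four_gpu_bs1)
bs4 = speedups(one_gpu_bs1, four_gpu_bs4)
for scene, s1, s4 in zip(scenes, bs1, bs4):
    print(f"{scene:8s} 4 GPU bsz=1: {s1:.2f}x   4 GPU bsz=4: {s4:.2f}x")
```

With these figures, every scene comes out above 3.5x for the 4 GPU, batch size 4 configuration, consistent with the headline claim in the overview.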

Test PSNR

| 30k Test PSNR | stump | bicycle | kitchen | room | counter | garden | bonsai |
|---|---|---|---|---|---|---|---|
| 1 GPU + Batch Size=1 | 26.61 | 25.21 | 31.4 | 31.4 | 28.93 | 27.27 | 32.01 |
| 4 GPU + Batch Size=1 | 26.65 | 25.19 | 31.41 | 31.38 | 28.98 | 27.28 | 31.92 |
| 4 GPU + Batch Size=4 | 26.59 | 25.17 | 31.37 | 31.32 | 28.98 | 27.2 | 31.94 |

Reproduction Instructions

  1. Download and unzip the Mip360 dataset.
  2. Activate the appropriate conda/python environment.
  3. To execute all experiments and generate this table, run the following command:
    bash examples/mip360/eval_all_mip360.sh <path_to_save_experiment_results> <path_to_mip360_dataset>

Significantly Speed up and Reduce per-GPU memory usage on Mip360 at 4K Resolution

| Configuration | 50k Training Time | Memory Per GPU (GB) | PSNR |
|---|---|---|---|
| bicycle + 1 GPU + Batch Size=1 | 2h 38min | 37.18 | 23.78 |
| bicycle + 4 GPU + Batch Size=1 | 0h 50min | 10.39 | 23.79 |
| garden + 1 GPU + Batch Size=1 | 2h 49min | 29.87 | 26.06 |
| garden + 4 GPU + Batch Size=1 | 0h 50min | 7.88 | 26.06 |

Instead of downsampling the Mip360 dataset by a factor of four before training, as is typical, our system trains directly at full resolution. The bicycle and garden images have resolutions of 4946x3286 and 5187x3361, respectively. In the table above, 4 GPUs cut training time by roughly 3x and per-GPU memory usage by roughly 3.6x to 3.8x, with no loss in PSNR.

Reproduction Instructions

Set up the dataset and Python environment as outlined previously, then execute the following:

   bash examples/mip360_4k/eval_mip360_4k.sh <path_to_save_experiment_results> <path_to_mip360_dataset>

Train in 45 Seconds on Tanks&Temples at 1K Resolution

| Configuration | 7k Training Time | 7k Test PSNR | 30k Training Time | 30k Test PSNR |
|---|---|---|---|---|
| train + 4 GPU + Batch Size=8 | 44s | 19.37 | 3min 30s | 21.87 |
| truck + 4 GPU + Batch Size=8 | 45s | 23.79 | 3min 39s | 25.35 |

The Tanks&Temples dataset includes the train and truck scenes, with resolutions of 980x545 and 979x546, respectively. Using 4 GPUs, we train these small scenes to reasonable quality in about 45 seconds (7k iterations). The original 3D Gaussian Splatting paper reports test PSNRs of 18.892 and 23.506 at 7k iterations on train and truck, respectively; our results are comparable.

Reproduction Instructions

Set up the Tanks&Temples and DeepBlending dataset and the Python environment as outlined previously, then execute the following:

   bash examples/train_truck_1k/eval_train_truck_1k.sh <path_to_save_experiment_results> <path_to_tandb_dataset>

(TODO: check these scripts have no side-effects)

Experimental Setup for all experiments statistics above

  • Hardware: 4x 40GB NVIDIA A100 GPUs
  • Interconnect: Fully connected, bidirectional 25 GB/s NVLink

New features [Please check regularly!]

  • We will soon release our optimized CUDA kernels for Gaussian Splatting for further speedup.
  • gsplat is now supported as an alternative CUDA kernel backend; select it with --backend gsplat.

Paper and Citation

Our system design, analysis of large-batch training dynamics, and insights from scaling up are all documented in the paper below:

On Scaling Up 3D Gaussian Splatting Training
Hexu Zhao¹, Haoyang Weng¹*, Daohan Lu¹*, Ang Li², Jinyang Li¹, Aurojit Panda¹, Saining Xie¹ (* co-second authors)
¹New York University, ²Pacific Northwest National Laboratory

BibTeX

@misc{zhao2024scaling3dgaussiansplatting,
      title={On Scaling Up 3D Gaussian Splatting Training}, 
      author={Hexu Zhao and Haoyang Weng and Daohan Lu and Ang Li and Jinyang Li and Aurojit Panda and Saining Xie},
      year={2024},
      eprint={2406.18533},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.18533}, 
}

Code Specification

Please use "black" with default settings to format the code if you want to contribute.

Setup

conda install black==24.4.2

License

Distributed under the Apache License, Version 2.0. See LICENSE.txt for more information.

Reference

  1. Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, July 2023. URL: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.
