DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
This repository contains the official implementation of the paper: DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation, which is accepted to ICLR 2025. DiffSplat is a generative framework to synthesize 3D Gaussian Splats from text prompts & single-view images in 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.
Feel free to contact me ([email protected]) or open an issue if you have any questions or suggestions.
- 2025-02-02: Text-conditioned inference instructions are provided.
- 2025-01-29: The source code and pretrained models are released. Happy Chinese New Year!
- 2025-01-22: DiffSplat is accepted to ICLR 2025.
- Provide detailed instructions for text-conditioned inference.
- Provide detailed instructions for image-conditioned inference and training.
- Implement a Gradio demo.
You may need to modify the specific version of torch in settings/setup.sh according to your CUDA version. There are no restrictions on the torch version; feel free to use your preferred one.
git clone https://github.com/chenguolin/DiffSplat.git
cd DiffSplat
bash settings/setup.sh
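For example, if the torch build installed by setup.sh does not match your CUDA toolkit, the install line could be adapted along these lines (the index URL and the cu118 tag below are only an illustration; pick the command matching your CUDA version from the official PyTorch instructions):
# Illustrative only: swap the torch install in settings/setup.sh for the wheel matching your CUDA version
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118  # e.g., CUDA 11.8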
- We use G-Objaverse with about 265K 3D objects and 10.6M rendered images (265K x 40 views, including RGB, normal, and depth maps) for GSRecon and GSVAE training. Its subset of about 83K 3D objects, provided by LGM, is used for DiffSplat training. Text descriptions are provided by the latest version of Cap3D (i.e., refined by DiffuRank).
- We find that this filtering is crucial for the generation quality of DiffSplat, and that a larger dataset is beneficial for the performance of GSRecon and GSVAE.
- We store the dataset in an internal HDFS cluster in this project, so the training code can NOT be run directly on your local machine. Please implement your own data-loading logic, referring to our provided dataset & dataloader code.
All pretrained models are available on HuggingFace.
Model Name | Fine-tuned From | #Param. | Link | Note |
---|---|---|---|---|
ElevEst | dinov2_vitb14_reg | 86M | elevest_gobj265k_b_C25 | (Optional) Single-image elevation estimation |
GSRecon | From scratch | 42M | gsrecon_gobj265k_cnp_even4 | Feed-forward reconstruction of per-pixel 3DGS from (RGB, normal, point) maps |
GSVAE (SD) | SD1.5 VAE | 84M | gsvae_gobj265k_sd | |
GSVAE (SDXL) | SDXL fp16 VAE | 84M | gsvae_gobj265k_sdxl_fp16 | fp16-fixed SDXL VAE is more robust |
GSVAE (SD3) | SD3 VAE | 84M | gsvae_gobj265k_sd3 | |
DiffSplat (SD1.5) | SD1.5 | 0.86B | Text-cond: gsdiff_gobj83k_sd15__render; Image-cond: gsdiff_gobj83k_sd15_image__render | Best efficiency |
DiffSplat (PixArt-Sigma) | PixArt-Sigma | 0.61B | Text-cond: gsdiff_gobj83k_pas_fp16__render; Image-cond: gsdiff_gobj83k_pas_fp16_image__render | Best trade-off |
DiffSplat (SD3.5m) | SD3.5 medium | 2.24B | Text-cond: gsdiff_gobj83k_sd35m__render; Image-cond: gsdiff_gobj83k_sd35m_image__render | Best performance |
DiffSplat ControlNet (SD1.5) | From scratch | 361M | Depth: gsdiff_gobj83k_sd15__render__depth; Normal: gsdiff_gobj83k_sd15__render__normal; Canny: gsdiff_gobj83k_sd15__render__canny | |
Note that:
- Pretrained weights will be downloaded from HuggingFace and stored in ./out.
- Other pretrained models (such as CLIP, T5, image VAE, etc.) will be downloaded automatically and stored in your HuggingFace cache directory.
- If you have trouble reaching HuggingFace Hub, you can try setting the environment variable export HF_ENDPOINT=https://hf-mirror.com.
python3 ./download_ckpt.py --model_type [MODEL_TYPE] [--image_cond]
# `MODEL_TYPE`: choose from "sd15", "pas", "sd35m", "depth", "normal", "canny"
# `--image_cond`: add this flag for downloading image-conditioned models
For example, to download the text-conditioned SD1.5-based DiffSplat:
python3 ./download_ckpt.py --model_type sd15
To download the image-conditioned PixArt-Sigma-based DiffSplat:
python3 ./download_ckpt.py --model_type pas --image_cond
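The ControlNet weights listed in the table above should follow the same pattern; for example, the depth-conditioned model:
# Depth ControlNet for the SD1.5-based DiffSplat (same script, different model_type)
python3 ./download_ckpt.py --model_type depth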
Note that:
- Model differences may not be significant for simple text prompts. We recommend DiffSplat (SD1.5) for the best efficiency, DiffSplat (SD3.5m) for the best performance, and DiffSplat (PixArt-Sigma) for the best trade-off.
- By default, export HF_HOME=~/.cache/huggingface and export TORCH_HOME=~/.cache/torch. You can change these paths in scripts/infer.sh. SD3-related models require a HuggingFace token for downloading, which is expected to be stored in HF_HOME (see the sketch after this list).
- Outputs will be stored in ./out/<MODEL_NAME>/inference.
- The prompt is specified by --prompt (e.g., a_toy_robot). Please separate words with _; underscores will be replaced by spaces in the code automatically.
- If "gif" is in --output_video_type, the output will be a .gif file; otherwise, it will be a .mp4 file. If "fancy" is in --output_video_type, the output video will be in a fancy style in which the 3DGS scales gradually increase while rotating.
- --seed sets the random seed. --gpu_id specifies the GPU device.
- Use --half_precision for BF16 half-precision inference. It reduces memory usage but may slightly affect quality.
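As a quick reference, the environment setup mentioned above might look like the sketch below (the paths are the defaults; the mirror endpoint and the login step are optional and only needed in the situations described):
# Optional environment setup before inference; adjust paths to your machine.
export HF_HOME=~/.cache/huggingface      # HuggingFace cache (default location)
export TORCH_HOME=~/.cache/torch         # Torch hub cache (default location)
# export HF_ENDPOINT=https://hf-mirror.com   # only if HuggingFace Hub is hard to reach
# huggingface-cli login                      # one way to store the token needed for SD3-related models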
# DiffSplat (SD1.5)
bash scripts/infer.sh src/infer_gsdiff_sd.py configs/gsdiff_sd15.yaml gsdiff_gobj83k_sd15__render \
--prompt a_toy_robot --output_video_type gif \
--gpu_id 0 --seed 0 [--half_precision]
# DiffSplat (PixArt-Sigma)
bash scripts/infer.sh src/infer_gsdiff_pas.py configs/gsdiff_pas.yaml gsdiff_gobj83k_pas_fp16__render \
--prompt a_toy_robot --output_video_type gif \
--gpu_id 0 --seed 0 [--half_precision]
# DiffSplat (SD3.5m)
bash scripts/infer.sh src/infer_gsdiff_sd3.py configs/gsdiff_sd35m_80g.yaml gsdiff_gobj83k_sd35m__render \
--prompt a_toy_robot --output_video_type gif \
--gpu_id 0 --seed 0 [--half_precision]
You will get:
(Outputs of the commands above from DiffSplat (SD1.5), DiffSplat (PixArt-Sigma), and DiffSplat (SD3.5m), shown side by side.)
More Advanced Arguments:
- --prompt_file: instead of using --prompt, read prompts line by line from a .txt file.
- Diffusion configurations:
    - --scheduler_type: choose from ddim, dpmsolver++, sde-dpmsolver++, etc.
    - --num_inference_timesteps: the number of diffusion steps.
    - --guidance_scale: classifier-free guidance (CFG) scale; 1.0 means no CFG.
    - --eta: specific to the DDIM scheduler; the weight of the added noise in diffusion steps.
- Instant3D tricks:
    - --init_std, --init_noise_strength, --init_bg: initial noise settings, cf. Instant3D Sec. 3.1; NOT used by default, as we found they are not that helpful in our case.
- Others:
    - --elevation: elevation for viewing and rendering; not necessary for text-conditioned generation; set to 10 by default (measured from the xz-plane towards the +y axis).
    - --negative_prompt: empty prompt ("") by default; used with CFG for better visual quality (e.g., more vibrant colors), but we found it leads to lower metric values (such as ImageReward).
    - --save_ply: save the generated 3DGS as a .ply file; used with --opacity_threshold_ply to filter out low-opacity splats for a much smaller .ply file size.
    - --eval_text_cond: automatically evaluate text-conditioned generation.
- ...
Please refer to infer_gsdiff_sd.py, infer_gsdiff_pas.py, and infer_gsdiff_sd3.py for more argument details; a combined example is sketched below.
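For instance, a run combining several of the flags above might look like the following (prompts.txt is a hypothetical prompt file with one prompt per line, and the numeric values are illustrative choices, not recommended defaults):
# Illustrative combination of advanced flags; the prompt file name and values are examples only.
bash scripts/infer.sh src/infer_gsdiff_sd.py configs/gsdiff_sd15.yaml gsdiff_gobj83k_sd15__render \
    --prompt_file prompts.txt --output_video_type gif \
    --scheduler_type dpmsolver++ --num_inference_timesteps 30 --guidance_scale 7.5 \
    --save_ply --opacity_threshold_ply 0.05 \
    --gpu_id 0 --seed 0 --half_precision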
Note that:
- Most of the arguments are the same as for text-conditioned generation; the only difference is that you need to specify an image path as the condition. Our method supports text and image conditions simultaneously.
Instructions for image-conditioned generation will be provided soon.
Instructions for ControlNet-based generation will be provided soon.
Please refer to train_gsrecon.py.
Instructions for GSRecon training will be provided soon.
Please refer to train_gsvae.py.
Instructions for GSVAE training will be provided soon.
Please refer to train_gsdiff_sd.py, train_gsdiff_pas.py, and train_gsdiff_sd3.py.
Instructions for DiffSplat training will be provided soon.
Please refer to train_gsdiff_sd_controlnet.py and infer_gsdiff_sd.py.
Instructions for ControlNet training and inference will be provided soon.
We would like to thank the authors of LGM, GRM, and Wonder3D for their great work and for generously providing their source code, which inspired our work and helped us a lot with the implementation.
If you find our work helpful, please consider citing:
@inproceedings{lin2025diffsplat,
title={DiffSplat: Repurposing Image Diffusion Models for Scalable 3D Gaussian Splat Generation},
author={Lin, Chenguo and Pan, Panwang and Yang, Bangbang and Li, Zeming and Mu, Yadong},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}