This is the official implementation for Multi3DRefer: Grounding Text Description to Multiple 3D Objects.
This repo contains CUDA code, so please make sure your GPU has compute capability 3.0 or higher.
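The requirement above can be checked before building the CUDA extensions. The following is a minimal sketch, not part of this repo: the helper names `meets_min_capability` and `detect_capability` are hypothetical, and the detection step assumes PyTorch is already installed (it falls back to `None` otherwise).

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (3, 0)  # minimum compute capability stated in this README


def meets_min_capability(capability: Tuple[int, int],
                         minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Query the first CUDA device via PyTorch; None if PyTorch or a GPU is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        status = "OK" if meets_min_capability(cap) else "too old"
        print(f"Compute capability {cap[0]}.{cap[1]}: {status}")
```

Tuple comparison handles the major/minor ordering, so `(3, 5)` passes and `(2, 1)` does not.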
We report the maximum compute resource usage with batch size 4:
Resource | Training | Inference |
GPU mem usage | 15.2 GB | 11.3 GB |
We recommend the use of miniconda to manage system dependencies.
# create and activate the conda environment
conda create -n m3drefclip python=3.10
conda activate m3drefclip
# install PyTorch 2.0.1
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia
# install PyTorch3D with dependencies
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install pytorch3d -c pytorch3d
# install MinkowskiEngine with dependencies
conda install -c anaconda openblas
pip install -U git+ -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"
# install Python libraries
pip install .
# install CUDA extensions
cd m3drefclip/common_ops
pip install .
Note: Setting up with pip (no conda) requires OpenBLAS to be pre-installed on your system.
# create and activate the virtual environment
virtualenv env
source env/bin/activate
# install PyTorch 2.0.1
pip install torch==2.0.1 torchvision==0.15.2
# install PyTorch3D
pip install pytorch3d
# install MinkowskiEngine
pip install MinkowskiEngine
# install Python libraries
pip install .
# install CUDA extensions
cd m3drefclip/common_ops
pip install .
Note: Both the ScanRefer and Nr3D datasets require the ScanNet v2 dataset. Please preprocess it first.
Download the ScanNet v2 dataset (train/val/test); refer to ScanNet's instructions for more details. The raw dataset files should be organized as follows:
m3drefclip # project root
├── dataset
│   ├── scannetv2
│   │   ├── scans
│   │   │   ├── [scene_id]
│   │   │   │   ├── [scene_id]_vh_clean_2.ply
│   │   │   │   ├── [scene_id]_vh_clean_2.0.010000.segs.json
│   │   │   │   ├── [scene_id].aggregation.json
│   │   │   │   ├── [scene_id].txt
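Before preprocessing, it can help to confirm that every scene folder matches the layout above. This is a small sketch under stated assumptions, not a script shipped with this repo: the function name `find_missing_files` is hypothetical, and the file suffixes are taken directly from the tree above.

```python
from pathlib import Path
from typing import Dict, List

# File suffixes expected inside each scans/[scene_id] folder, per the layout above.
REQUIRED_SUFFIXES = (
    "_vh_clean_2.ply",
    "_vh_clean_2.0.010000.segs.json",
    ".aggregation.json",
    ".txt",
)


def find_missing_files(scans_dir: Path) -> Dict[str, List[str]]:
    """Map each scene_id to the list of expected files that are missing."""
    missing: Dict[str, List[str]] = {}
    for scene_dir in sorted(p for p in Path(scans_dir).iterdir() if p.is_dir()):
        scene_id = scene_dir.name
        absent = [scene_id + suffix for suffix in REQUIRED_SUFFIXES
                  if not (scene_dir / (scene_id + suffix)).is_file()]
        if absent:
            missing[scene_id] = absent
    return missing


if __name__ == "__main__":
    scans = Path("dataset/scannetv2/scans")  # relative to the project root
    if scans.is_dir():
        for scene_id, files in find_missing_files(scans).items():
            print(scene_id, "is missing:", ", ".join(files))
    else:
        print(f"{scans} not found; run this from the project root.")
```

An empty result means every scene folder contains the four expected files.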
Pre-process the data. This step converts the original meshes and annotations into the preprocessed data files used by this repo:
python dataset/scannetv2/ data=scannetv2 +workers={cpu_count}
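The `{cpu_count}` placeholder in the command above stands for the number of worker processes. One way to pick a value is sketched below; the helper `workers_override` and its `reserve` parameter are hypothetical and only build the override string, they do not run the preprocessing script.

```python
import os


def workers_override(reserve: int = 0) -> str:
    """Build the `+workers=N` override, keeping `reserve` cores free (at least 1 worker)."""
    total = os.cpu_count() or 1  # os.cpu_count() may return None
    return f"+workers={max(1, total - reserve)}"


print(workers_override())
```

Passing `reserve=1` or `reserve=2` leaves some cores free for other work on a shared machine.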
Pre-process the multi-view features from ENet: refer to the 5th instruction in D3Net's repo to generate the feature file, or download it directly, then put it under m3drefclip/dataset/scannetv2.
Download the ScanRefer dataset (train/val). Also download the test set. The raw dataset files should be organized as follows:
m3drefclip # project root
├── dataset
│   ├── scanrefer
│   │   ├── metadata
│   │   │   ├── ScanRefer_filtered_train.json
│   │   │   ├── ScanRefer_filtered_val.json
│   │   │   ├── ScanRefer_filtered_test.json
Pre-process the data. This adds "unique/multiple" labels to the raw files for evaluation purposes:
python dataset/scanrefer/ data=scanrefer
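To illustrate what the "unique/multiple" labels mean: in ScanRefer, a description is "unique" when its target is the only object of its class in the scene, and "multiple" otherwise. The sketch below shows that rule only. The function name, the `scene_objects` mapping, and the `eval_type` key are assumptions for illustration; the repo's preprocessing script may use different field names and a different class mapping.

```python
from collections import Counter
from typing import Dict, List


def label_unique_multiple(scene_objects: Dict[int, str],
                          descriptions: List[dict]) -> List[dict]:
    """Tag each description of one scene as "unique" or "multiple".

    scene_objects: object_id -> class name (hypothetical structure).
    descriptions:  dicts that contain at least an "object_id" key.
    """
    class_counts = Counter(scene_objects.values())
    labeled = []
    for desc in descriptions:
        target_class = scene_objects[desc["object_id"]]
        tag = "unique" if class_counts[target_class] == 1 else "multiple"
        # "eval_type" is an illustrative key name, not necessarily the repo's.
        labeled.append({**desc, "eval_type": tag})
    return labeled


if __name__ == "__main__":
    objects = {0: "chair", 1: "chair", 2: "table"}
    for row in label_unique_multiple(objects, [{"object_id": 0}, {"object_id": 2}]):
        print(row)
```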
Download the Nr3D dataset (train/test). The raw dataset files should be organized as follows:
m3drefclip # project root
├── dataset
│   ├── nr3d
│   │   ├── metadata
│   │   │   ├── nr3d_train.csv
│   │   │   ├── nr3d_test.csv
Pre-process the data. This adds "easy/hard/view-dep/view-indep" labels to the raw files for evaluation purposes:
python dataset/nr3d/ data=nr3d
Download the Multi3DRefer dataset (train/val). The raw dataset files should be organized as follows:
m3drefclip # project root
├── dataset
│   ├── multi3drefer
│   │   ├── metadata
│   │   │   ├── multi3drefer_train.json
│   │   │   ├── multi3drefer_val.json
We pre-trained PointGroup implemented in MINSU3D on ScanNet v2 and use it as the detector. We use coordinates + colors + multi-view features as inputs.
Download the pre-trained detector. The detector checkpoint file should be organized as follows:
m3drefclip # project root
├── checkpoints
│   ├── PointGroup_ScanNet.ckpt
Note: Configuration files are managed by Hydra. You can add or override any configuration attribute by passing it as a command-line argument.
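The commands in this README use two Hydra override forms: `key=value` changes an attribute that already exists in the configuration, while `+key=value` adds one that does not. The sketch below mimics those two rules on a plain nested dictionary to make the difference concrete. It is an illustration only, not Hydra's implementation: `apply_overrides` is a hypothetical helper, values stay as strings, and other Hydra syntax (such as `++` or `~`) is not handled.

```python
import copy
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply `key=value` / `+key=value` strings to a copy of a nested dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, value = item.partition("=")
        adding = key.startswith("+")
        parts = (key[1:] if adding else key).split(".")
        node = result
        for part in parts[:-1]:
            # When adding, create intermediate levels; otherwise they must exist.
            node = node.setdefault(part, {}) if adding else node[part]
        leaf = parts[-1]
        if not adding and leaf not in node:
            raise KeyError(f"'{key}' is not in the config; prefix it with '+' to add it")
        node[leaf] = value  # values are kept as strings in this sketch
    return result


if __name__ == "__main__":
    base = {"data": {"inference": {"split": "val"}}, "experiment_name": "demo"}
    print(apply_overrides(base, ["data.inference.split=test",
                                 "+ckpt_path=model.ckpt"]))
```

This is why `+detector_path=...` and `+ckpt_path=...` carry a `+` in the commands here while `experiment_name=...` does not.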
# log in to WandB
wandb login
# train a model with the pre-trained detector, using predicted object proposals
python data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt
# train a model with the pre-trained detector, using GT object proposals
python data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt
# train a model from a checkpoint; this restores all hyperparameters stored in the .ckpt file
python data={scanrefer/nr3d/multi3drefer} experiment_name={checkpoint_experiment_name} +ckpt_path={ckpt_file_path}
# test a model from a checkpoint and save its predictions
python data={scanrefer/nr3d/multi3drefer} data.inference.split={train/val/test} +ckpt_path={ckpt_file_path} pred_path={predictions_path}
# evaluate predictions
python data={scanrefer/nr3d/multi3drefer} pred_path={predictions_path} data.evaluation.split={train/val/test}
ScanRefer:
Split | IoU | Unique | Multiple | Overall |
Val | 0.25 | 85.3 | 43.8 | 51.9 |
Val | 0.5 | 77.2 | 36.8 | 44.7 |
Test | 0.25 | 79.8 | 46.9 | 54.3 |
Test | 0.5 | 70.9 | 38.1 | 45.5 |
Nr3D:
Split | Easy | Hard | View-dep | View-indep | Overall |
Test | 55.6 | 43.4 | 42.3 | 52.9 | 49.4 |
Multi3DRefer (ZT: zero target, ST: single target, MT: multiple targets, D: distractors):
Split | IoU | ZT w/ D | ZT w/o D | ST w/ D | ST w/o D | MT | Overall |
Val | 0.25 | 39.4 | 81.8 | 34.6 | 53.5 | 43.6 | 42.8 |
Val | 0.5 | 39.4 | 81.8 | 30.6 | 47.8 | 37.9 | 38.4 |
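The IoU column in the tables above is the overlap threshold (0.25 or 0.5) a predicted box must reach to count as correct. The sketch below shows the single-target case as an illustration. It assumes axis-aligned boxes given as (min corner, max corner) and counts a prediction as a hit when its IoU is at least the threshold. The function names are hypothetical, this is not the repo's evaluation code, and scoring with several targets per description requires matching predictions to targets, which is not covered here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (min_xyz, max_xyz), axis-aligned


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[1][axis] - box[0][axis])
    return volume


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[0][axis], b[0][axis])
        high = min(a[1][axis], b[1][axis])
        if high <= low:  # no overlap along this axis
            return 0.0
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(pairs: List[Tuple[Box, Box]], threshold: float) -> float:
    """Percentage of (predicted, ground-truth) pairs with IoU >= threshold."""
    hits = sum(box_iou_3d(pred, gt) >= threshold for pred, gt in pairs)
    return 100.0 * hits / len(pairs)


if __name__ == "__main__":
    gt = ((0, 0, 0), (2, 2, 2))
    shifted = ((1, 0, 0), (3, 2, 2))  # overlaps half of gt along x
    pairs = [(shifted, gt), (gt, gt)]
    for thr in (0.25, 0.5):
        print(f"threshold {thr}: {accuracy_at_iou(pairs, thr):.1f}%")
```

A stricter threshold can only lower the score, which is consistent with the 0.5 rows being at or below the 0.25 rows in the tables.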
Convert M3DRef-CLIP predictions to the ScanRefer benchmark format:
python dataset/scanrefer/ data=scanrefer +pred_path={predictions_path} +output_path={output_file_path}
Please refer to the ReferIt3D benchmark to report results.