EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Tai Wang*, Xiaohan Mao*, Chenming Zhu*, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen,
Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang
Shanghai AI Laboratory, Shanghai Jiao Tong University, The University of Hong Kong,
The Chinese University of Hong Kong, Tsinghua University

🤖 Demo

demo

📋 Contents

  1. About
  2. News
  3. Getting Started
  4. Model and Benchmark
  5. TODO List
  6. Citation
  7. License
  8. Acknowledgements

🏠 About

Dialogue_Teaser
In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories. Building upon this database, we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and in the wild.

🔥 News

  • [2023-12] We release the paper of EmbodiedScan. Please check the webpage and view our demos!

📚 Getting Started

Installation

We test our code in the following environment:

  • Ubuntu 20.04
  • NVIDIA Driver: 525.147.05
  • CUDA 12.0
  • Python 3.8.18
  • PyTorch 1.11.0+cu113
  • PyTorch3D 0.7.2
  1. Clone this repository.
git clone https://github.com/OpenRobotLab/EmbodiedScan.git
cd EmbodiedScan
  2. Install PyTorch3D.
conda create -n embodiedscan python=3.8 -y  # pytorch3d needs python>3.7
conda activate embodiedscan
# We recommend installing pytorch3d with pre-compiled packages
# For example, to install for Python 3.8, PyTorch 1.11.0 and CUDA 11.3
# For more information, please refer to https://github.com/facebookresearch/pytorch3d/blob/main/INSTALL.md#2-install-wheels-for-linux
pip install --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py38_cu113_pyt1110/download.html
  3. Install EmbodiedScan.
# We plan to make EmbodiedScan easier to install by "pip install EmbodiedScan".
# Please stay tuned for the future official release.
# Make sure you are under ./EmbodiedScan/
pip install -e .
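
After installation, a quick sanity check (a minimal sketch, not part of the official instructions) can confirm that PyTorch, CUDA and PyTorch3D are importable from the new environment:

# Sanity check: run inside the embodiedscan conda environment.
# The versions in the comments reflect the tested setup above; yours may differ.
import torch
import pytorch3d

print("PyTorch:", torch.__version__)        # e.g. 1.11.0+cu113
print("CUDA available:", torch.cuda.is_available())
print("PyTorch3D:", pytorch3d.__version__)  # e.g. 0.7.2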

Data Preparation

Please download ScanNet, 3RScan, and Matterport3D from their official websites.

We will release the demo data, re-organized file structure, post-processing script and annotation files in the near future. Please stay tuned.

Tutorial

We provide a simple tutorial here as a guide to basic analysis and visualization of our dataset. Feel free to try it out and post your suggestions!
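
As a rough illustration of the kind of analysis the tutorial walks through, the sketch below loads an annotation file and tallies per-category box counts. The file path and dictionary keys (data_list, instances, bbox_label_3d) are assumptions for illustration only; please refer to the tutorial notebook for the actual file layout.

# Minimal sketch of inspecting EmbodiedScan annotations.
# The path and keys below are illustrative assumptions, not the official format.
import pickle
from collections import Counter

with open("data/embodiedscan_infos_train.pkl", "rb") as f:  # hypothetical path
    infos = pickle.load(f)

# Count how many oriented 3D boxes fall into each category label.
label_counter = Counter()
for scan in infos.get("data_list", []):                 # assumed key
    for inst in scan.get("instances", []):              # assumed key
        label_counter[inst.get("bbox_label_3d")] += 1   # assumed key

print("Top 10 categories:", label_counter.most_common(10))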

📦 Model and Benchmark

We will release the code for model training and benchmarking, together with pretrained checkpoints, in Q1 2024.

Model Overview

Embodied Perceptron accepts an RGB-D sequence with an arbitrary number of views, along with text, as multi-modal input. It uses classical encoders to extract features for each modality and adopts dense and isomorphic sparse fusion with corresponding decoders for different predictions. The 3D features, integrated with the text feature, can further be used for language-grounded understanding.
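
As a structural sketch only (placeholder modules, not the released Embodied Perceptron code), the design described above roughly corresponds to the following PyTorch skeleton: per-modality encoders, a fusion step, and task-specific heads.

# Structural sketch of a multi-modal perception model in the spirit of
# Embodied Perceptron: per-modality encoders -> fusion -> task decoders.
# All modules, dimensions, and heads here are simplified placeholders.
import torch
import torch.nn as nn

class MultiModalPerceiver(nn.Module):
    def __init__(self, img_dim=256, pts_dim=256, txt_dim=256, fused_dim=256):
        super().__init__()
        self.img_encoder = nn.Conv2d(3, img_dim, kernel_size=3, padding=1)  # stands in for a 2D image backbone
        self.pts_encoder = nn.Linear(3, pts_dim)                            # stands in for a sparse 3D backbone
        self.txt_encoder = nn.Embedding(10000, txt_dim)                     # stands in for a text encoder
        self.fuse = nn.Linear(img_dim + pts_dim, fused_dim)                 # dense/sparse feature fusion
        self.box_head = nn.Linear(fused_dim, 9)                             # oriented 3D box parameters
        self.occ_head = nn.Linear(fused_dim, 81)                            # occupancy logits (assumed 80 classes + free)
        self.ground_head = nn.Linear(fused_dim + txt_dim, 1)                # language-grounding score

    def forward(self, images, points, text_tokens):
        # images: (V, 3, H, W) RGB views; points: (N, 3); text_tokens: (T,) token ids (long)
        img_feat = self.img_encoder(images).mean(dim=(0, 2, 3))  # pool over views and pixels
        pts_feat = self.pts_encoder(points).mean(dim=0)          # pool over points
        txt_feat = self.txt_encoder(text_tokens).mean(dim=0)     # pool over tokens
        fused = self.fuse(torch.cat([img_feat, pts_feat], dim=-1))
        boxes = self.box_head(fused)
        occupancy = self.occ_head(fused)
        grounding = self.ground_head(torch.cat([fused, txt_feat], dim=-1))
        return boxes, occupancy, grounding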

Benchmark

Please see the paper for details of our two benchmarks: fundamental 3D perception and language-grounded tasks. The dataset is still scaling up, and the benchmarks are being polished and extended. Please stay tuned for updates.

📝 TODO List

  • Paper and partial code release.
  • Release EmbodiedScan annotation files.
  • Polish dataset APIs and related codes.
  • Release Embodied Perceptron pretrained models.
  • Release codes for baselines and benchmarks.
  • Full release and further updates.

🔗 Citation

If you find our work helpful, please cite:

@article{wang2023embodiedscan,
  author={Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao},
  title={EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI},
  journal={arXiv preprint},
  year={2023}
}

If you use our dataset and benchmark, please also cite the original datasets involved in our work. BibTeX entries are provided below.

Dataset BibTeX
@inproceedings{dai2017scannet,
  title={ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes},
  author={Dai, Angela and Chang, Angel X. and Savva, Manolis and Halber, Maciej and Funkhouser, Thomas and Nie{\ss}ner, Matthias},
  booktitle = {Proceedings IEEE Computer Vision and Pattern Recognition (CVPR)},
  year = {2017}
}
@inproceedings{Wald2019RIO,
  title={RIO: 3D Object Instance Re-Localization in Changing Indoor Environments},
  author={Johanna Wald and Armen Avetisyan and Nassir Navab and Federico Tombari and Matthias Niessner},
  booktitle={Proceedings IEEE International Conference on Computer Vision (ICCV)},
  year = {2019}
}
@article{Matterport3D,
  title={{Matterport3D}: Learning from {RGB-D} Data in Indoor Environments},
  author={Chang, Angel and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda},
  journal={International Conference on 3D Vision (3DV)},
  year={2017}
}

📄 License

Creative Commons License
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

👏 Acknowledgements

  • OpenMMLab: Our dataset code uses MMEngine and our model is built upon MMDetection3D.
  • PyTorch3D: We use some functions supported in PyTorch3D for efficient computations on fundamental 3D data structures.
  • ScanNet, 3RScan, Matterport3D: Our dataset uses the raw data from these datasets.
  • ReferIt3D: We follow SR3D's approach to obtain the language prompt annotations.
  • SUSTechPOINTS: Our annotation tool is developed based on the open-source framework used by SUSTechPOINTS.
