This repository is the official implementation of the paper:
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
Abdelrhman Werby*, Chenguang Huang*, Martin Büchner*, Abhinav Valada, and Wolfram Burgard.
*Equal contribution.
arXiv preprint arXiv:2403.17846, 2024
(Accepted for Robotics: Science and Systems (RSS), Delft, Netherlands, 2024.)
- [29 Aug 2024] We added hm3dsem_walks dataset generation and hierarchical scene graph evaluation code. Please review the updated code structure and newly added dependencies for dataset construction.
- [01 Jul 2024] Initial release of HOV-SG including mapping and graph construction engine.
- Clone and set up the HOV-SG repository
git clone https://github.com/hovsg/HOV-SG.git
cd HOV-SG
# set up virtual environment and install habitat-sim afterwards separately to avoid errors.
conda env create -f environment.yaml
conda activate hovsg
conda install habitat-sim -c conda-forge -c aihabitat
# set up the HOV-SG python package
pip install -e .
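After installation, a quick import check can confirm that the environment is usable. The following is a minimal sketch, not part of the repository; the module names `hovsg`, `habitat_sim`, `open_clip`, and `segment_anything` are assumptions based on the packages this README installs and uses.

```python
import importlib.util

# Assumed top-level module names; adjust them if your environment differs.
REQUIRED_MODULES = ["hovsg", "habitat_sim", "open_clip", "segment_anything"]


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

Run it inside the activated hovsg environment; any module it reports as missing points to an incomplete installation step.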
HOV-SG uses the Open CLIP model to extract features from RGB-D frames. To download the Open CLIP model checkpoint CLIP-ViT-H-14-laion2B-s32B-b79K, please refer to Open CLIP.
mkdir checkpoints
wget "https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/resolve/main/open_clip_pytorch_model.bin?download=true" -O checkpoints/temp_open_clip_pytorch_model.bin && mv checkpoints/temp_open_clip_pytorch_model.bin checkpoints/laion2b_s32b_b79k.bin
Another option is to use the OVSeg fine-tuned Open CLIP model, which is available here:
pip install gdown
gdown --fuzzy https://drive.google.com/file/d/17C9ACGcN7Rk4UT4pYD_7hn3ytTa3pFb5/view -O checkpoints/ovseg_clip.pth
HOV-SG uses SAM to generate class-agnostic masks for the RGB-D frames. To download the SAM model checkpoint sam_v2, execute the following:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -O checkpoints/sam_vit_h_4b8939.pth
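Interrupted downloads can leave missing or empty checkpoint files. The sketch below checks the checkpoints folder for the file names used in the commands above; the helper `missing_checkpoints` is hypothetical and not part of HOV-SG, and treating the OVSeg checkpoint as optional is an assumption.

```python
from pathlib import Path

# File names taken from the download commands in this README.
REQUIRED_CHECKPOINTS = ["laion2b_s32b_b79k.bin", "sam_vit_h_4b8939.pth"]
# Assumption: the OVSeg checkpoint is only needed if you choose that option.
OPTIONAL_CHECKPOINTS = ["ovseg_clip.pth"]


def missing_checkpoints(checkpoint_dir, names):
    """Return the names that are absent or empty inside checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for name in names:
        path = root / name
        # An empty file usually indicates a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing required:", missing_checkpoints("checkpoints", REQUIRED_CHECKPOINTS))
    print("Missing optional:", missing_checkpoints("checkpoints", OPTIONAL_CHECKPOINTS))
```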
HOV-SG takes posed RGB-D sequences as input. In order to produce hierarchical multi-story scenes we make use of the Habitat 3D Semantics dataset (HM3DSem).
- Download the Habitat Matterport 3D Semantics dataset.
Make sure that the raw HM3D dataset has the following structure:
├── hm3d
│   ├── hm3d_annotated_basis.scene_dataset_config.json # this file is necessary
│   ├── val
│   │   └── 00824-Dd4bFSTQ8gi
│   │       ├── Dd4bFSTQ8gi.basis.glb
│   │       ├── Dd4bFSTQ8gi.basis.navmesh
│   │       ├── Dd4bFSTQ8gi.glb
│   │       ├── Dd4bFSTQ8gi.semantic.glb
│   │       └── Dd4bFSTQ8gi.semantic.txt
│   │   ...
│   ...
...
We used the following scenes from the Habitat Matterport 3D Semantics dataset in our evaluation:
Show Scenes ID
00824-Dd4bFSTQ8gi
00829-QaLdnwvtxbs
00843-DYehNKdT76V
00861-GLAQ4DNUx5U
00862-LT9Jq6dN3Ea
00873-bxsVRursffK
00877-4ok3usBNeis
00890-6s7QHgap2fW
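To verify that the raw HM3D download matches the expected structure for the evaluation scenes, a small check like the following may help. It is a sketch under the assumption that every scene folder contains the same five files shown for 00824-Dd4bFSTQ8gi; the helper name `missing_hm3d_files` is hypothetical.

```python
from pathlib import Path

# Scene IDs listed in this README.
EVAL_SCENES = [
    "00824-Dd4bFSTQ8gi", "00829-QaLdnwvtxbs", "00843-DYehNKdT76V",
    "00861-GLAQ4DNUx5U", "00862-LT9Jq6dN3Ea", "00873-bxsVRursffK",
    "00877-4ok3usBNeis", "00890-6s7QHgap2fW",
]
# Assumption: each scene ships the same file types as the example scene.
SCENE_SUFFIXES = [".basis.glb", ".basis.navmesh", ".glb", ".semantic.glb", ".semantic.txt"]
CONFIG_NAME = "hm3d_annotated_basis.scene_dataset_config.json"


def missing_hm3d_files(hm3d_root, scenes, split="val"):
    """Return relative paths of expected files that are not present."""
    root = Path(hm3d_root)
    missing = []
    if not (root / CONFIG_NAME).is_file():
        missing.append(CONFIG_NAME)
    for scene in scenes:
        # "00824-Dd4bFSTQ8gi" -> "Dd4bFSTQ8gi", the prefix of the file names.
        scene_hash = scene.split("-", 1)[1]
        for suffix in SCENE_SUFFIXES:
            rel = Path(split) / scene / f"{scene_hash}{suffix}"
            if not (root / rel).is_file():
                missing.append(rel.as_posix())
    return missing


if __name__ == "__main__":
    problems = missing_hm3d_files("data/hm3dsem", EVAL_SCENES)
    print(f"{len(problems)} expected file(s) missing")
    for rel_path in problems:
        print("  ", rel_path)
```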
- Our method requires posed input data, so we recorded trajectories for each sequence we evaluate on. We provide a script (hovsg/data/hm3dsem/gen_hm3dsem_walks_from_poses.py) that turns a set of camera poses (hovsg/data/hm3dsem/metadata/poses) into a sequence of RGB-D observations using the habitat-sim simulator. The output includes RGB, depth, poses, and frame-wise semantic/panoptic ground truth:
# python data/habitat/gen_hm3dsem_from_poses.py --dataset_dir <hm3dsem_dir> --save_dir data/hm3dsem_walks/
# hm3dsem
python hovsg/data/hm3dsem/gen_hm3dsem_walks_from_poses.py --dataset_dir data/hm3dsem --save_dir data/hm3dsem_walks/ --pose_dir hovsg/data/hm3dsem/metadata/poses
- Secondly, we construct a new hierarchical graph-structured dataset called hm3dsem_walks that includes ground truth based on all recorded observations. To produce this ground-truth data, first define the config paths main.package_path, main.dataset_path, main.raw_data_path, and main.save_path under config/create_graph.yaml. For each scene, define main.scene_id and main.split. Next, execute the following to obtain floor-, region-, and object-level ground truth data per scene. We use every recorded frame without skipping (see parameter dataset.hm3dsem.gt_skip_frames) and recommend 128 GB of RAM to compile this, as the scenes differ in size:
cd HOV-SG
python hovsg/data/hm3dsem/create_hm3dsem_walks_gt.py
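Before starting the memory-intensive ground-truth generation, it can be worth checking that all config entries named above are set. The sketch below works on a plain nested dictionary, assuming the dotted names map to nested keys in config/create_graph.yaml; the helpers `get_dotted` and `unset_keys` and the example values are hypothetical.

```python
# Config entries named in this README for ground-truth generation.
REQUIRED_KEYS = [
    "main.package_path",
    "main.dataset_path",
    "main.raw_data_path",
    "main.save_path",
    "main.scene_id",
    "main.split",
]


def get_dotted(cfg, dotted_key):
    """Look up 'a.b.c' in a nested dict; return None if any level is missing."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def unset_keys(cfg, keys=REQUIRED_KEYS):
    """Return the dotted keys that are missing, None, or empty strings."""
    return [key for key in keys if get_dotted(cfg, key) in (None, "")]


if __name__ == "__main__":
    # Hypothetical example values; replace them with your own paths.
    example_cfg = {
        "main": {
            "package_path": "/path/to/HOV-SG",
            "dataset_path": "data/hm3dsem_walks",
            "raw_data_path": "data/hm3dsem",
            "save_path": "",
            "scene_id": "00824-Dd4bFSTQ8gi",
            "split": "val",
        }
    }
    print("Unset keys:", unset_keys(example_cfg))
```

If you load the YAML file yourself (for example with a YAML parser of your choice), the resulting dictionary can be passed to `unset_keys` directly.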
To evaluate semantic segmentation capabilities, we used ScanNet and Replica.
To get an RGBD sequence for ScanNet, download the ScanNet dataset from the official website. The dataset contains RGB-D frames compressed as .sens files. To extract the frames, use the SensReader/python. We used the following scenes from the ScanNet dataset:
Show Scenes ID
scene0011_00
scene0050_00
scene0231_00
scene0378_00
scene0518_00
To get an RGB-D sequence for Replica, download the scanned RGB-D trajectories of the Replica dataset provided by Nice-SLAM instead of the original Replica dataset. They contain trajectories rendered from the mesh models of the original Replica dataset. Download them using the download script in Nice-SLAM:
wget https://cvg-data.inf.ethz.ch/nice-slam/data/Replica.zip -O data/Replica.zip && unzip data/Replica.zip -d data/Replica_RGBD && rm data/Replica.zip
To evaluate against the ground-truth semantic labels, you also need to download the original Replica dataset, as it contains the ground-truth semantic labels as .ply files.
git clone https://github.com/facebookresearch/Replica-Dataset.git data/Replica-Dataset
chmod +x data/Replica-Dataset/download.sh && data/Replica-Dataset/download.sh data/Replica_original
We only used the following scenes from the Replica dataset:
Show Scenes ID
office0
office1
office2
office3
office4
room0
room1
room2
The Data folder should have the following structure:
Show data folder structure
├── hm3dsem_walks
│   ├── val
│   │   ├── 00824-Dd4bFSTQ8gi
│   │   │   ├── depth
│   │   │   │   ├── Dd4bFSTQ8gi-000000.png
│   │   │   │   ├── ...
│   │   │   ├── rgb
│   │   │   │   ├── Dd4bFSTQ8gi-000000.png
│   │   │   │   ├── ...
│   │   │   ├── semantic
│   │   │   │   ├── Dd4bFSTQ8gi-000000.png
│   │   │   │   ├── ...
│   │   │   ├── pose
│   │   │   │   ├── Dd4bFSTQ8gi-000000.png
│   │   │   │   ├── ...
│   │   ├── 00829-QaLdnwvtxbs
│   │   ├── ..
├── Replica
│   ├── office0
│   │   ├── results
│   │   │   ├── depth0000.png
│   │   │   ├── ...
│   │   │   ├── rgb0000.png
│   │   │   ├── ...
│   │   ├── traj.txt
│   ├── office1
│   ├── ...
├── ScanNet
│   ├── scans
│   │   ├── scene0011_00
│   │   │   ├── color
│   │   │   │   ├── 0.jpg
│   │   │   │   ├── ...
│   │   │   ├── depth
│   │   │   │   ├── 0.png
│   │   │   │   ├── ...
│   │   │   ├── poses
│   │   │   │   ├── 0.txt
│   │   │   │   ├── ...
│   │   │   ├── internsics
│   │   │   │   ├── intrinsics_color.txt
│   │   │   │   ├── intrinsics_depth.txt
│   │   ├── ..
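Since HOV-SG consumes posed RGB-D sequences, every frame of an hm3dsem_walks scene should be present in each modality folder. The following sketch compares the frame identifiers (file names without extension) across the rgb, depth, semantic, and pose folders shown above. It assumes that matching frames share the same file stem; the helper names are hypothetical.

```python
from pathlib import Path

# Modality folders of an hm3dsem_walks scene, as listed in the structure above.
MODALITIES = ["rgb", "depth", "semantic", "pose"]


def frame_ids(scene_dir, modality):
    """Return the set of file stems found in one modality folder."""
    folder = Path(scene_dir) / modality
    if not folder.is_dir():
        return set()
    return {path.stem for path in folder.iterdir() if path.is_file()}


def missing_frames(scene_dir, modalities=MODALITIES):
    """Map each modality to the frame IDs it lacks relative to the union of all."""
    per_modality = {m: frame_ids(scene_dir, m) for m in modalities}
    all_ids = set().union(*per_modality.values())
    return {
        m: sorted(all_ids - ids)
        for m, ids in per_modality.items()
        if all_ids - ids
    }


if __name__ == "__main__":
    scene = Path("data/hm3dsem_walks/val/00824-Dd4bFSTQ8gi")
    if not scene.is_dir():
        print(f"Scene folder not found: {scene}")
    else:
        gaps = missing_frames(scene)
        print("All modalities are consistent." if not gaps else gaps)
```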
# python application/create_graph.py main.dataset=hm3dsem main.dataset_path=data/hm3dsem_walks/val/00824-Dd4bFSTQ8gi/ main.save_path=data/scene_graphs/00824-Dd4bFSTQ8gi
python application/create_graph.py main.dataset=hm3dsem main.dataset_path=hovsg/data/hm3dsem_walks/val/00824-Dd4bFSTQ8gi/ main.save_path=hovsg/data/scene_graphs/
This will generate a scene graph for the specified RGB-D sequence and save it. The following files are generated:
├── graph
│   ├── floors
│   │   ├── 0.json
│   │   ├── 0.ply
│   │   ├── 1.json
│   │   ├── ...
│   ├── rooms
│   │   ├── 0_0.json
│   │   ├── 0_0.ply
│   │   ├── 0_1.json
│   │   ├── ...
│   ├── objects
│   │   ├── 0_0_0.json
│   │   ├── 0_0_0.ply
│   │   ├── 0_0_1.json
│   │   ├── ...
│   ├── nav_graph
├── tmp
    ├── full_feats.pt
    ├── mask_feats.pt
    ├── full_pcd.ply
    ├── masked_pcd.ply
The graph folder contains the generated scene graph hierarchy. The first number in a file name is the floor number, the second is the room number, and the third is the object number. The tmp folder holds intermediate results obtained throughout graph construction. The files full_feats.pt and mask_feats.pt contain the features extracted from the RGB-D frames using the Open CLIP and SAM models: the former contains per-point features and the latter contains the features for the object masks. The files full_pcd.ply and masked_pcd.ply contain the point cloud representation of the RGB-D frames and the instance masks of all objects, respectively.
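The file naming convention in the graph folder can be decoded with a few lines of code. The sketch below maps a file name such as 0_1_2.json to its hierarchy level and indices and derives the parent node; `parse_node_id` and `parent_id` are hypothetical helpers, not functions from the HOV-SG package.

```python
from pathlib import Path

# Number of indices in the file stem -> hierarchy level.
LEVELS = {1: "floor", 2: "room", 3: "object"}


def parse_node_id(file_name):
    """Turn e.g. '0_1_2.json' into ('object', (0, 1, 2))."""
    stem = Path(file_name).stem
    indices = tuple(int(part) for part in stem.split("_"))
    if len(indices) not in LEVELS:
        raise ValueError(f"Unexpected node file name: {file_name}")
    return LEVELS[len(indices)], indices


def parent_id(indices):
    """Return the indices of the parent node, or None for a floor."""
    return indices[:-1] or None


if __name__ == "__main__":
    for name in ["1.json", "0_1.ply", "0_0_1.json"]:
        level, indices = parse_node_id(name)
        print(name, "->", level, indices, "parent:", parent_id(indices))
```

For example, the object file 0_0_1.json belongs to room (0, 0), which in turn belongs to floor (0,).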
# python application/visualize_graph.py graph_path=data/scene_graphs/hm3dsem/00824-Dd4bFSTQ8gi/graph
python application/visualize_graph.py graph_path=hovsg/data/scene_graphs/hm3dsem/00824-Dd4bFSTQ8gi/graph
To test graph queries with HOV-SG, you need to set up an OpenAI API account with the following steps:
- Sign up for an OpenAI account, log in, and add at least one payment method.
- Create an OpenAI API key and copy it.
- Open your ~/.bashrc file, add a new line export OPENAI_KEY=<your copied key>, save the file, and source it with the command source ~/.bashrc. Alternatively, run export OPENAI_KEY=<your copied key> in the terminal where you want to run the query code.
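A missing key is easier to diagnose before launching the query code than from an API error afterwards. The following sketch checks that the OPENAI_KEY environment variable from the steps above is set; the helper `get_openai_key` is hypothetical and not part of the repository.

```python
import os


def get_openai_key(environ=None):
    """Return the value of OPENAI_KEY, or raise if it is unset or blank."""
    env = os.environ if environ is None else environ
    key = env.get("OPENAI_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_KEY is not set; export it before running the query code."
        )
    return key


if __name__ == "__main__":
    try:
        get_openai_key()
        print("OPENAI_KEY is set.")
    except RuntimeError as err:
        print(err)
```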
python application/visualize_query_graph.py main.graph_path=hovsg/data/scene_graphs/hm3dsem/00824-Dd4bFSTQ8gi/graph
After launching the code, you will be asked to input a hierarchical query. An example is: chair in the living room on floor 0. The visualization then shows the top 5 target objects and the room they lie in.
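To illustrate what "hierarchical" means for such a query, the sketch below splits the example into its object, room, and floor parts with a regular expression. This is only an illustration of the query structure: it is not how HOV-SG parses queries (that is not described in this README, and the OpenAI requirement suggests a language model is involved). The pattern and the helper `split_query` are hypothetical.

```python
import re

# Illustrative pattern: "<object> [in [the] <room>] [on [the] floor <n>]".
QUERY_PATTERN = re.compile(
    r"^\s*(?P<object>.+?)"
    r"(?:\s+in\s+(?:the\s+)?(?P<room>.+?))?"
    r"(?:\s+on\s+(?:the\s+)?floor\s+(?P<floor>\d+))?\s*$",
    re.IGNORECASE,
)


def split_query(query):
    """Split a query into object, room, and floor; missing parts are None."""
    match = QUERY_PATTERN.match(query)
    if match is None:
        raise ValueError(f"Could not parse query: {query!r}")
    floor = match.group("floor")
    return {
        "object": match.group("object"),
        "room": match.group("room"),
        "floor": int(floor) if floor is not None else None,
    }


if __name__ == "__main__":
    print(split_query("chair in the living room on floor 0"))
```

Each part corresponds to one level of the scene graph hierarchy: the floor narrows the search to one story, the room to a region on that floor, and the object to candidates within that room.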
python application/semantic_segmentation.py main.dataset=replica main.scene_id=office0 main.dataset_path=hovsg/data/Replica_RGBD/Replica/office0 main.save_path=hovsg/data/sem_seg/Replica/office0
python application/eval/evaluate_sem_seg.py main.dataset=replica main.scene_name=office_0 main.feature_map_path=hovsg/data/sem_seg/Replica/office0
- Define the scene identifiers and the paths of the ground truth and the predicted scene graph in config/eval_graph.yaml.
- Run the graph evaluation method:
python application/eval/evaluate_graph.py
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded indoor robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-story environments.
If you find our work useful, please consider citing our paper:
@article{werby23hovsg,
Author = {Abdelrhman Werby and Chenguang Huang and Martin Büchner and Abhinav Valada and Wolfram Burgard},
Title = {Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation},
Year = {2024},
journal = {Robotics: Science and Systems},
}
For academic usage, the code is released under the MIT license. For any commercial purpose, please contact the authors.
This work was funded by the German Research Foundation (DFG) Emmy Noether Program grant number 468878300, the BrainLinks-BrainTools Center of the University of Freiburg, and an academic grant from NVIDIA.