
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Rong Li¹   Shijie Li²   Lingdong Kong³   Xulei Yang²   Junwei Liang¹,⁴
¹AI Thrust, HKUST(Guangzhou)   ²I2R, A*STAR   ³National University of Singapore   ⁴CSE, HKUST

Paper PDF   Project Page  

3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability.
To overcome these limitations, we introduce SeeGround 👁️, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We represent 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and the input formats of 2D VLMs.
SeeGround comprises two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization.
Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, it also exceeds weakly supervised methods and rivals some fully supervised ones, improving over the previous state of the art by 7.7% on ScanRefer and 7.1% on Nr3D.

📝 Update

  • [2025.01] The code and model checkpoints have been fully released. Feel free to try it out! 🤗
  • [2024.12] Introducing SeeGround 👁️, a new framework for zero-shot 3D visual grounding. For more details, please refer to our Project Page and Preprint. 🚀

Table of Contents

  • 0. Framework Overview
  • 1. Environment Setup
  • 2. Download Model Weights
  • 3. Download Datasets
  • 4. Data Processing
  • 5. Inference
  • 6. Reproduction
  • 7. License
  • 8. Citation
  • 9. Acknowledgments

0. Framework Overview

Overview of the SeeGround 👁️ framework.
We first use a 2D-VLM to interpret the query, identifying both the target object (e.g., "laptop") and a context-providing anchor (e.g., "chair with floral pattern"). A dynamic viewpoint is then selected based on the anchor’s position, enabling the capture of a 2D rendered image that aligns with the query’s spatial requirements. Using the Object Lookup Table (OLT), we retrieve the 3D bounding boxes of relevant objects, project them onto the 2D image, and apply visual prompts to mark visible objects, filtering out occlusions. The prompted image, together with the spatial descriptions and the query, is then fed to the 2D-VLM for precise localization of the target object. Finally, the 2D-VLM outputs the target object’s ID, and we retrieve its 3D bounding box from the OLT to provide the final, accurate 3D position in the scene.
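The "project them onto the 2D image" step above is a standard pinhole camera projection. A minimal sketch (the intrinsics, extrinsics, and point below are made-up placeholders, not values from the released code):

import numpy as np

def project_points(points_world, extrinsic, intrinsic):
    """Project Nx3 world-space points into pixel coordinates.
    extrinsic: 4x4 world-to-camera matrix; intrinsic: 3x3 camera matrix.
    Returns Nx2 pixel coordinates and a mask of points in front of the camera."""
    homo = np.hstack([points_world, np.ones((points_world.shape[0], 1))])  # Nx4 homogeneous
    cam = (extrinsic @ homo.T).T[:, :3]                                    # Nx3 camera-space
    in_front = cam[:, 2] > 1e-6                                            # keep positive depth only
    pix = (intrinsic @ cam.T).T
    return pix[:, :2] / pix[:, 2:3], in_front                              # perspective divide

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])  # toy intrinsics
uv, visible = project_points(np.array([[0.5, 0.2, 2.0]]), np.eye(4), K)
print(uv, visible)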

1. Environment Setup

We recommend using the official Docker image for environment setup:

docker pull qwenllm/qwenvl

2. Download Model Weights

You can download the Qwen2-VL model weights from either of the following sources:
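For example, if you pull the weights from the Hugging Face Hub (one common source; the repo id and target directory below are assumptions, not taken from this README):

from huggingface_hub import snapshot_download

# Download the full checkpoint; point --model at this directory when launching vLLM later.
snapshot_download(
    repo_id="Qwen/Qwen2-VL-72B-Instruct",   # assumed Hub repo id
    local_dir="weights/Qwen2-VL-72B-Instruct",
)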

3. Download Datasets

3.1. ScanRefer

Download the ScanRefer dataset from the official repo and place it at the following path:

data/ScanRefer/ScanRefer_filtered_val.json
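A quick sanity check after placing the file (the keys printed are whatever the JSON actually contains; the typical ScanRefer fields named in the comment are an assumption):

import json

with open("data/ScanRefer/ScanRefer_filtered_val.json") as f:
    refs = json.load(f)

print(f"{len(refs)} referring expressions loaded")
print(sorted(refs[0].keys()))  # usually scene_id, object_id, object_name, description, ...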

3.2. Nr3D

Download the Nr3D dataset from the official repo and place it at the following path:

data/Nr3D/Nr3D.jsonl
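The file is in JSON Lines format (one query per line), so a quick way to inspect it:

import json

with open("data/Nr3D/Nr3D.jsonl") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"{len(samples)} Nr3D queries loaded")
print(sorted(samples[0].keys()))  # inspect the available fields before running inference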

3.3. Vil3dref Preprocessed Data

Download the preprocessed Vil3dref data from vil3dref.

The expected structure should look like this:

referit3d/
├── annotations
│   ├── meta_data
│   │   ├── cat2glove42b.json
│   │   ├── scannetv2-labels.combined.tsv
│   │   └── scannetv2_raw_categories.json
│   └── ...
├── ...
└── scan_data
    ├── ...
    ├── instance_id_to_name
    └── pcd_with_global_alignment
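A small check that the download landed where the scripts expect it (the location of referit3d/ relative to the repo root is an assumption; adjust the path to your layout):

from pathlib import Path

root = Path("referit3d")  # assumed location; change to wherever you unpacked the data
expected = [
    "annotations/meta_data/cat2glove42b.json",
    "annotations/meta_data/scannetv2-labels.combined.tsv",
    "annotations/meta_data/scannetv2_raw_categories.json",
    "scan_data/instance_id_to_name",
    "scan_data/pcd_with_global_alignment",
]
missing = [p for p in expected if not (root / p).exists()]
print("missing:", missing or "none")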

4. Data Processing

Download the Mask3D predictions first.

  • ScanRefer
python prepare_data/object_lookup_table_scanrefer.py
  • Nr3D
python prepare_data/process_feat_3d.py

python prepare_data/object_lookup_table_nr3d.py

Alternatively, you can download the preprocessed Object Lookup Table.
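Conceptually, the Object Lookup Table maps each detected instance to its class label and 3D bounding box, so that the object ID returned by the 2D-VLM can be converted back into a 3D box. The snippet below only illustrates that idea; the field names and on-disk format produced by the scripts above may differ:

# Illustrative OLT record, keyed by (scene_id, object_id); field names are hypothetical.
object_lookup_table = {
    ("scene0011_00", 3): {
        "label": "chair",
        "bbox_center": [1.20, 0.45, 0.55],  # metres, scene coordinates
        "bbox_size": [0.60, 0.60, 0.95],
    },
}

def lookup_box(scene_id, object_id):
    """Map a VLM-selected object ID back to its 3D bounding box."""
    return object_lookup_table[(scene_id, object_id)]

print(lookup_box("scene0011_00", 3))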

5. Inference

5.1. Deploying VLM Service

We use vLLM to deploy the VLM. It is recommended to run the following command in a tmux session on your server:

python -m vllm.entrypoints.openai.api_server --model /your/qwen2-vl-model/path  --served-model-name Qwen2-VL-72B-Instruct --tensor_parallel_size=8

The --tensor_parallel_size flag controls the number of GPUs used. Adjust it according to your available GPU memory.
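Once the server is up, you can sanity-check it through the OpenAI-compatible endpoint that vLLM exposes (port 8000 is vLLM's default; adjust the URL if you changed it):

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2-VL-72B-Instruct",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Reply with 'ready' if you can read this."}],
)
print(resp.choices[0].message.content)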

5.2. Generating Anchors & Targets

  • ScanRefer
python parse_query/generate_query_data_scanrefer.py
  • Nr3D
python parse_query/generate_query_data_nr3d.py

5.3. Predictions

  • ScanRefer
python inference/inference_scanrefer.py
  • Nr3D
python inference/inference_nr3d.py

5.4. Evaluations

  • ScanRefer
python eval/eval_scanrefer.py
  • Nr3D
python eval/eval_nr3d.py
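For reference, ScanRefer reports accuracy at 3D IoU thresholds of 0.25 and 0.5 between the predicted and ground-truth boxes. A minimal axis-aligned 3D IoU sketch, assuming boxes are given as center + size (check the eval scripts for the exact format they expect):

import numpy as np

def aabb_iou(center_a, size_a, center_b, size_b):
    """IoU of two axis-aligned 3D boxes given as (center, size) triples."""
    a_min, a_max = np.subtract(center_a, np.divide(size_a, 2)), np.add(center_a, np.divide(size_a, 2))
    b_min, b_max = np.subtract(center_b, np.divide(size_b, 2)), np.add(center_b, np.divide(size_b, 2))
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    return inter / (np.prod(size_a) + np.prod(size_b) - inter)

iou = aabb_iou([0, 0, 0], [1, 1, 1], [0.1, 0, 0], [1, 1, 1])
print(iou, iou >= 0.25, iou >= 0.5)  # a prediction counts as correct at the given threshold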

6. Reproduction

7. License

This work is released under the Apache 2.0 license.

8. Citation

If you find this work and code repository helpful, please consider starring it and citing the following paper:

@article{li2024seeground,
  title   = {SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding},
  author  = {Rong Li and Shijie Li and Lingdong Kong and Xulei Yang and Junwei Liang},
  journal = {arXiv preprint arXiv:2412.04383},
  year    = {2024},
}

9. Acknowledgments

We would like to thank the following repositories for their contributions:
