- 🌟 We will release the GeoGround demo, code, and datasets as soon as possible. 🌟
Remote sensing (RS) visual grounding aims to use natural language expressions to locate specific objects (in the form of a bounding box or segmentation mask) in RS images, enhancing human interaction with intelligent RS interpretation systems. Early research in this area was primarily based on horizontal bounding boxes (HBBs), but as more diverse RS datasets have become available, tasks involving oriented bounding boxes (OBBs) and segmentation masks have emerged. In practical applications, different targets call for different grounding types: an HBB localizes an object's position, an OBB additionally provides its orientation, and a mask depicts its shape. However, existing specialized methods are typically tailored to a single type of RS visual grounding task and generalize poorly across tasks. In contrast, large vision-language models (VLMs) exhibit powerful multi-task learning capabilities but struggle to handle dense prediction tasks such as segmentation. This paper proposes GeoGround, a novel framework that unifies support for HBB, OBB, and mask RS visual grounding tasks, allowing flexible selection of the output format. Rather than customizing the VLM architecture, our work elegantly supports pixel-level visual grounding output through the Text-Mask technique. We define prompt-assisted and geometry-guided learning to enhance consistency across the different signals. To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs. Experimental results show that GeoGround achieves strong performance across four RS visual grounding tasks, matching or surpassing specialized methods on multiple benchmarks.
- **Framework.** We propose GeoGround, a novel VLM framework that unifies box-level and pixel-level RS visual grounding tasks while maintaining the model's inherent dialogue and image understanding capabilities.
- **Dataset.** We introduce refGeo, the largest RS visual grounding instruction-following dataset, consisting of 161k image-text pairs and 80k RS images, including a new 3D-aware aerial vehicle visual grounding dataset.
- **Benchmark.** We conduct extensive experiments on various RS visual grounding tasks, providing valuable insights for future RS VLM research and opening new avenues for RS visual grounding.
We propose the Text-Mask paradigm, which distills and compresses the information embedded in the mask into a compact text sequence that can be efficiently learned by VLMs. Additionally, we introduce hybrid supervision, which incorporates prompt-assisted learning (PAL) and geometry-guided learning (GGL) to fine-tune the model using three types of signals, ensuring output consistency and enhancing the model’s understanding of the relationships between different grounding types.
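To make the idea concrete, below is a minimal, hypothetical sketch of what a Text-Mask-style encoding could look like: the mask boundary is subsampled into a fixed number of vertices and quantized into a compact coordinate string that a VLM can emit as ordinary text tokens. The contour extraction, vertex count, 0-999 coordinate binning, and `<mask>` delimiters are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical Text-Mask-style codec: compress a binary mask into a short
# text sequence of quantized polygon vertices, and decode it back.
# The specific encoding details here are assumptions for illustration.
import re

import cv2
import numpy as np


def mask_to_text(mask: np.ndarray, n_points: int = 16, bins: int = 1000) -> str:
    """Encode the largest region of a binary mask as quantized vertices."""
    h, w = mask.shape
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    contour = max(contours, key=cv2.contourArea).squeeze(1)  # (N, 2) xy points
    # Uniformly subsample the boundary to a fixed-length vertex list.
    idx = np.linspace(0, len(contour) - 1, n_points).astype(int)
    pts = contour[idx].astype(float)
    # Quantize coordinates into [0, bins) so each vertex is two short tokens.
    xs = np.clip(pts[:, 0] / w * bins, 0, bins - 1).astype(int)
    ys = np.clip(pts[:, 1] / h * bins, 0, bins - 1).astype(int)
    return "<mask>" + "".join(f"({x},{y})" for x, y in zip(xs, ys)) + "</mask>"


def text_to_mask(text: str, h: int, w: int, bins: int = 1000) -> np.ndarray:
    """Decode the text sequence back into a filled binary mask."""
    pts = [(int(x) / bins * w, int(y) / bins * h)
           for x, y in re.findall(r"\((\d+),(\d+)\)", text)]
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillPoly(mask, [np.array(pts, dtype=np.int32)], 1)
    return mask
```

Under hybrid supervision, the same referred object would additionally be serialized as HBB and OBB text, so the model learns a consistent mapping among the three signal types.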
We introduce refGeo, a large-scale RS visual grounding instruction-following dataset. It consolidates four existing RS visual grounding datasets and introduces a new aerial vehicle visual grounding dataset (AVVG). AVVG extends traditional 2D visual grounding to a 3D context, enabling VLMs to perceive 3D space from 2D aerial imagery. For each referred object, we provide an HBB, an OBB, and a mask, with the mask automatically generated by SAM.
GeoGround achieves the best performance across all REC benchmarks, surpassing the specialized model on the DIOR-RSVG test set. Benefiting from the wide range of image resolutions and ground sample distances (GSDs) in refGeo, the fine-tuned model shows significant performance improvements on datasets with a high proportion of small objects, such as RSVG and AVVG.
The results demonstrate GeoGround's clear advantage on OBB-based RS visual grounding tasks, further validating the effectiveness of our hybrid supervision approach.
Unlike other VLMs, GeoGround does not require an additional mask decoder, as it inherently possesses segmentation capabilities. Moreover, we use SAM to refine the coarse masks generated by GeoGround, allowing it to match the performance of the best RS referring segmentation models.
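As a rough illustration of this refinement step, the sketch below prompts SAM with the bounding box of GeoGround's coarse mask via the official `segment_anything` API. The box-only prompting strategy and checkpoint name are our assumptions; the paper may combine prompts differently.

```python
# Hypothetical SAM refinement sketch: derive a box prompt from the coarse
# mask and let SAM produce a sharper mask inside it.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor


def refine_with_sam(image: np.ndarray, coarse_mask: np.ndarray,
                    checkpoint: str = "sam_vit_h.pth") -> np.ndarray:
    # In practice the model and predictor would be built once and reused.
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # expects an RGB HxWx3 uint8 array
    # Use the coarse mask's extent as a box prompt (XYXY).
    ys, xs = np.nonzero(coarse_mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]
```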
We present an RS Generalized REC benchmark based on AVVG, which differs from standard REC in that one referring expression may correspond to multiple objects.
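For readers implementing this evaluation, one plausible scoring scheme for the generalized setting is sketched below: predictions are greedily matched to unused ground-truth boxes at an IoU threshold and scored with F1. The greedy matching rule and the 0.5 threshold are assumptions on our part, not necessarily the benchmark's official protocol.

```python
# Hypothetical Generalized REC scoring: one expression may map to several
# ground-truth boxes, so we match predictions to ground truth one-to-one.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grec_f1(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """Greedily match each prediction to an unused ground-truth box."""
    matched, used = 0, set()
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            score = iou(p, g)
            if j not in used and score > best:
                best, best_j = score, j
        if best >= thr:
            matched += 1
            used.add(best_j)
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```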
Our approach enhances object-level understanding without compromising the holistic image comprehension capabilities of VLMs.
@misc{zhou2024geoground,
title={GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding},
author={Yue Zhou and Mengcheng Lan and Xiang Li and Yiping Ke and Xue Jiang and Litong Feng and Wayne Zhang},
year={2024},
eprint={2411.11904},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.11904},
}