Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi
This repo is used for recording and tracking recent Remote Sensing Temporal Vision-Language Models (RS-TVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.
Give us a ⭐ if you're interested in this repo. We will continue to track relevant progress and update this repository.
- You are welcome to open an issue or PR for your RS-TVLM work! We will include it in the next version of our survey.
π₯π₯π₯ Updated on 2024.12.04 π₯π₯π₯
- 2024.12.04: The first version is available.
  - The first survey of Remote Sensing Temporal Vision-Language Models (RS-TVLMs).
  - Some public datasets and code links are provided.
Timeline of representative RS-TVLMs:
Change Captioning:

Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
---|---|---|---|---|
Pix4Cap | Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning | ViT-B/32 | Transformer Decoder | code |
Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | ViT-B/32 | Transformer Decoder | code |
Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | N/A |
DetACC | Detection Assisted Change Captioning for Remote Sensing Image | ResNet-101 | Transformer Decoder | N/A |
KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | code |
ChangeMinds | ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing | Swin Transformer | Transformer Decoder | code |
CTMTNet | A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images | ResNet-101 | Transformer Decoder | N/A |
...... |
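
Most captioning models in the table above share a common blueprint: a shared visual backbone encodes both temporal images, the two feature maps are fused, and a Transformer decoder generates the change caption. Below is a minimal PyTorch sketch of that blueprint, assuming a ResNet-101 backbone and simple feature differencing for fusion; it is an illustrative assumption, not the implementation of any listed model.

```python
# Minimal sketch of a generic bi-temporal change-captioning model
# (assumed design, not any specific paper's implementation).
import torch
import torch.nn as nn
import torchvision.models as tvm


class ChangeCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_layers=3):
        super().__init__()
        # Shared visual encoder for both temporal images (ResNet-101 is an arbitrary choice here).
        backbone = tvm.resnet101(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial feature map
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, img_t1, img_t2, caption_tokens):
        f1 = self.proj(self.encoder(img_t1))              # (B, D, H, W)
        f2 = self.proj(self.encoder(img_t2))
        # Fuse by differencing; many models use cross-attention or learned fusion instead.
        memory = (f2 - f1).flatten(2).transpose(1, 2)      # (B, H*W, D)
        tgt = self.embed(caption_tokens)                   # (B, L, D)
        L = tgt.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                              # (B, L, vocab_size) token logits


# Dummy usage:
model = ChangeCaptioner(vocab_size=5000)
logits = model(torch.randn(2, 3, 256, 256),
               torch.randn(2, 3, 256, 256),
               torch.randint(0, 5000, (2, 20)))
```
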
Change Visual Question Answering:

Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
---|---|---|---|---|
change-aware VQA | Change-Aware Visual Question Answering | CNN | RNN | N/A |
CDVQA-Net | Change Detection Meets Visual Question Answering | CNN | RNN | code |
ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | code |
CDchat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP ViT-L/14 | Vicuna-v1.5 | code |
TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | code |
GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Video encoder | LLaVA-NeXT and Video-LLaVA | code |
CDQAG | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | CLIP image Encoder | CLIP Text Encoder | code |
...... |
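
The LLM-based models above (e.g., ChangeChat, CDchat, TEOChat) broadly follow a LLaVA-style recipe: a CLIP image encoder produces patch tokens for each timestamp, a small projector maps them into the language model's embedding space, and the result is prepended to the tokenized instruction before causal generation. The sketch below only illustrates that token-assembly step with assumed dimensions (576 CLIP ViT-L/14 patches, a 4096-dim LLM); it is not the code of any listed model, and the LLM itself is left abstract.

```python
# Minimal sketch of LLaVA-style token assembly for bi-temporal change VQA
# (assumed design and dimensions; not the code of any model listed above).
import torch
import torch.nn as nn


class BiTemporalProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Maps CLIP patch features into the language model's embedding space.
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, patches_t1, patches_t2, question_embeds):
        v1 = self.proj(patches_t1)      # (B, N, llm_dim) tokens for the pre-event image
        v2 = self.proj(patches_t2)      # (B, N, llm_dim) tokens for the post-event image
        # Prepend both timestamps' visual tokens to the question embeddings; the
        # concatenated sequence is fed to a causal LLM via its `inputs_embeds` path.
        return torch.cat([v1, v2, question_embeds], dim=1)


# Dummy usage with CLIP ViT-L/14-like shapes (576 patches, 1024-dim) and a 4096-dim LLM:
assemble = BiTemporalProjector()
seq = assemble(torch.randn(1, 576, 1024), torch.randn(1, 576, 1024), torch.randn(1, 12, 4096))
print(seq.shape)  # torch.Size([1, 1164, 4096])
```
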
Change Retrieval:

Model Name | Paper Title | Code/Project |
---|---|---|
ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | code |
...... |
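
Change retrieval, as studied in ChangeRetCap, ranks captions (or image pairs) by similarity in a shared embedding space. The snippet below is a generic cosine-similarity scoring sketch with assumed embedding dimensions, not the paper's implementation.

```python
# Generic cosine-similarity retrieval scoring for a fused bi-temporal image embedding
# against candidate caption embeddings (assumed 512-dim space; illustrative only).
import torch
import torch.nn.functional as F


def retrieval_scores(pair_embed, caption_embeds, temperature=0.07):
    """pair_embed: (D,) fused bi-temporal image embedding; caption_embeds: (K, D)."""
    pair_embed = F.normalize(pair_embed, dim=-1)
    caption_embeds = F.normalize(caption_embeds, dim=-1)
    return caption_embeds @ pair_embed / temperature   # (K,) similarity logits


# Dummy usage: rank 100 candidate captions for one image pair.
scores = retrieval_scores(torch.randn(512), torch.randn(100, 512))
top5 = scores.topk(5).indices
```
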
Change Captioning Datasets:

Dataset | Image Size/Resolution | Image pairs | Captions | Annotation | Download Link |
---|---|---|---|---|---|
DUBAI CCD | 50×50 (30m) | 500 | 2,500 | Manual | Link |
LEVIR CCD | 256×256 (0.5m) | 500 | 2,500 | Manual | Link |
LEVIR-CC | 256×256 (0.5m) | 10,077 | 50,385 | Manual | Link |
WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | Manual | Link |
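
For quick experimentation with the captioning datasets above (e.g., LEVIR-CC), a dataset class is essentially an image-pair loader plus caption annotations. The PyTorch sketch below assumes a hypothetical layout with `A/` (pre-event) and `B/` (post-event) folders and a per-split JSON caption file named `{split}_captions.json`; the real datasets' directory structure and annotation schema may differ, so adapt the paths and keys.

```python
# Minimal PyTorch Dataset sketch for bi-temporal image pairs with captions
# (LEVIR-CC-style). The directory layout and JSON schema below are assumptions
# for illustration; check each dataset's own documentation for the real format.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class ChangeCaptionDataset(Dataset):
    def __init__(self, root, split="train", transform=None):
        root = Path(root)
        self.dir_t1 = root / split / "A"        # pre-event images (assumed layout)
        self.dir_t2 = root / split / "B"        # post-event images (assumed layout)
        # Assumed schema: [{"filename": "...png", "sentences": ["...", ...]}, ...]
        with open(root / f"{split}_captions.json") as f:
            self.samples = json.load(f)
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        img1 = Image.open(self.dir_t1 / item["filename"]).convert("RGB")
        img2 = Image.open(self.dir_t2 / item["filename"]).convert("RGB")
        if self.transform is not None:
            img1, img2 = self.transform(img1), self.transform(img2)
        return img1, img2, item["sentences"]    # typically 5 captions per pair
```
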
Change Captioning Datasets with Pixel-level Masks:

Dataset | Image Size/Resolution | Image pairs | Captions | Pixel-level Masks | Annotation | Download Link |
---|---|---|---|---|---|---|
LEVIR-MCI | 256×256 (0.5m) | 10,077 | 50,385 | 44,380 (building, road) | Manual | Link |
LEVIR-CDC | 256×256 (0.5m) | 10,077 | 50,385 | -- (building) | Manual | Link |
WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | -- (building) | Manual | Link |
Instruction Datasets for Change-related Tasks:

Dataset | Temporal Images | Image Resolution | Instruction Samples | Change-related Task | Annotation | Download Link |
---|---|---|---|---|---|---|
CDVQA | 2,968 pairs (bi-temporal) | 0.5m~3m | 122,000 | CVQA | Manual | Link |
ChangeChat-87k | 10,077 pairs (bi-temporal) | 0.5m | 87,195 | CVQA, Grounding | Automated | Link |
GeoLLaVA | 100,000 pairs (bi-temporal) | -- | 100,000 | CVQA | Automated | Link |
TEOChatlas | -- (variable temporal length) | -- | 554,071 | Classification, CVQA, Grounding | Automated | Link |
QVG-360K | 6,810 pairs (bi-temporal) | 0.1m~3m | 360,000 | CVQA, Grounding | Automated | Link |
If you find our survey and repository useful for your research, please consider citing our paper:
@misc{liu2024remotesensingtemporalvisionlanguage,
title={Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey},
author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
year={2024},
eprint={2412.02573},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.02573},
}