[Project Page] [Paper] [HuggingFace All-in-One Demo] [HuggingFace Instruct Demo] [Video]
by Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^.
- [2023.03.20] Inspired by our X-Decoder, we developed OpenSeeD ([Paper][Code]) to enable open-vocabulary segmentation and detection with a single model. Check it out!
- [2023.03.14] We release X-GPT, a conversational version of our X-Decoder built with GPT-3 and LangChain!
- [2023.03.01] The Segmentation in the Wild Challenge has launched and is ready for result submissions!
- [2023.02.28] We released the SGinW benchmark for our challenge. You are welcome to build your own models on the benchmark!
- [2023.02.27] Our X-Decoder has been accepted by CVPR 2023!
- [2023.02.07] We combine X-Decoder (strong image understanding), GPT-3 (strong language understanding) and Stable Diffusion (strong image generation) to make an instructional image editing demo. Check it out!
- [2022.12.21] We release inference code of X-Decoder.
- [2022.12.21] We release Focal-T pretrained checkpoint.
- [2022.12.21] We release open-vocabulary segmentation benchmark.
🔺[X-GPT] 🔺[Instruct X-Decoder]
X-Decoder is a generalized decoding model that can generate pixel-level segmentation and token-level texts seamlessly!
It achieves:
- State-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets;
- Better or competitive finetuned performance compared with generalist and specialist models on segmentation and VL tasks;
- Friendly for efficient finetuning and flexible for novel task composition.
It supports:
- One suite of parameters pretrained for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, and Image-Text Retrieval;
- One model architecture finetuned for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, Image-Text Retrieval and Visual Question Answering (with an extra cls head);
- Zero-shot task composition for Region Retrieval, Referring Captioning, Image Editing.
pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
python -m pip install -r requirements.txt
sh install_cococapeval.sh
export DATASET=/pth/to/dataset
To prepare the dataset: DATASET.md
mpirun -n 8 python eval.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT /pth/to/ckpt
Note: Because of zero-padding, placing multiple images on a single GPU may decrease performance.
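To make the note above concrete, here is a minimal sketch of what zero-padding does when images of different sizes share one batch. It is not the repository's batching code: `pad_batch` and `padding_fraction` are hypothetical helpers, and images are modeled as plain nested lists. It only illustrates that a mixed-size batch contains padded pixels, while a batch of one image contains none.

```python
from typing import List

Image = List[List[float]]  # a single-channel image as rows of pixel values


def pad_batch(images: List[Image], pad_value: float = 0.0) -> List[Image]:
    """Pad each image on the right and bottom to the largest height and width in the batch."""
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded = []
    for img in images:
        # Extend every row to max_w, then append all-padding rows up to max_h.
        rows = [row + [pad_value] * (max_w - len(row)) for row in img]
        rows += [[pad_value] * max_w for _ in range(max_h - len(img))]
        padded.append(rows)
    return padded


def padding_fraction(images: List[Image]) -> float:
    """Fraction of pixels in the padded batch that are padding, not image content."""
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    real = sum(len(img) * len(img[0]) for img in images)
    total = len(images) * max_h * max_w
    return 1.0 - real / total


# Two images of different sizes: 2x3 and 4x2 (height x width).
small = [[1.0] * 3 for _ in range(2)]
tall = [[1.0] * 2 for _ in range(4)]
batch = pad_batch([small, tall])
print(len(batch[0]), len(batch[0][0]))  # both images now share the same shape
print(padding_fraction([small, tall]))  # share of the batch that is zeros
print(padding_fraction([small]))        # a single-image batch needs no padding
```

With one image per GPU, the padded fraction is zero, which matches the evaluation setting the note recommends.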
# For Segmentation Tasks
python demo/demo_semseg.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT /pth/to/xdecoder_focalt_best_openseg.pt
# For VL Tasks
python demo/demo_captioning.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT /pth/to/xdecoder_focalt_last_novg.pt
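The commands above pass `--overrides` followed by a key and a value (`WEIGHT` and a checkpoint path). As a rough illustration of that pattern, the sketch below merges alternating key/value pairs into a config dictionary. It is an assumption about the general idea, not the repository's implementation: `apply_overrides` and the `defaults` dictionary are hypothetical names, and the real YAML config holds many more entries.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of config with alternating KEY VALUE pairs applied on top."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be given as KEY VALUE pairs")
    merged = dict(config)
    # Even positions hold keys, odd positions hold the matching values.
    for key, value in zip(overrides[0::2], overrides[1::2]):
        merged[key] = value
    return merged


# Hypothetical default config; only the WEIGHT key appears in the commands above.
defaults = {"WEIGHT": ""}
config = apply_overrides(defaults, ["WEIGHT", "/pth/to/ckpt"])
print(config["WEIGHT"])
```

Returning a copy keeps the defaults untouched, so the same base config can be reused with different checkpoints.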
Model | Checkpoint | ADE PQ | ADE AP | ADE mIoU | ADE-full mIoU | SUN mIoU | SCAN PQ | SCAN mIoU | SCAN40 mIoU | Cityscape PQ | Cityscape mAP | Cityscape mIoU | BDD PQ | BDD mIoU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X-Decoder | BestSeg Tiny | 19.1 | 10.1 | 25.1 | 6.2 | 35.7 | 30.3 | 38.4 | 22.4 | 37.7 | 18.5 | 50.2 | 16.9 | 47.6 |
- Finetuned ADE 150 (32 epochs)
Model | Task | Log | PQ | mAP | mIoU |
---|---|---|---|---|---|
X-Decoder (davit-d5,Deformable) | PanoSeg | log | 52.4 | 38.7 | 59.1 |
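For readers unfamiliar with the mIoU columns in the tables above, the following is a minimal sketch of mean intersection-over-union on a single flattened label map. It is illustrative only: `mean_iou` is a hypothetical helper, and benchmark evaluators typically accumulate intersections and unions over a whole dataset and handle ignored labels, which this sketch does not.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over the classes that occur in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious: List[float] = []
    for c in range(num_classes):
        # Pixels where both agree on class c, and pixels where either says class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: class 0 has IoU 1/2 and class 1 has IoU 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```

The tables report such scores as percentages, so a value like 0.58 here would correspond to 58.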
- We appreciate the constructive discussion with Haotian Zhang
- We build our work on top of Mask2Former
- We build our demos on HuggingFace 🤗 with sponsored GPUs
- We appreciate the discussion with Xiaoyu Xiang during rebuttal
@article{zou2022xdecoder,
author = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Behl, Harkirat and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee^, Yong Jae and Gao^, Jianfeng},
title = {Generalized Decoding for Pixel, Image and Language},
publisher = {arXiv},
year = {2022},
}