This repository is the official PyTorch implementation of the CVPR 2022 paper "Accelerating DETR Convergence via Semantic-Aligned Matching".
The recently developed DEtection TRansformer (DETR) has established a new object detection paradigm by eliminating a series of hand-crafted components. However, DETR suffers from extremely slow convergence, which increases the training cost significantly. We observe that the slow convergence can be largely attributed to the complication in matching object queries to encoded image features in DETR's decoder cross-attention modules.
Motivated by this observation, in our paper, we propose SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy. SAM-DETR addresses the slow convergence issue from two perspectives. First, it projects object queries into the same embedding space as encoded image features, where the matching can be accomplished efficiently with aligned semantics. Second, it explicitly searches for salient points with the most discriminative features for semantic-aligned matching, which further speeds up convergence and boosts detection accuracy. As a plug-and-play module, SAM-DETR complements existing convergence solutions well while introducing only slight computational overhead. Experiments show that the proposed SAM-DETR achieves superior convergence as well as competitive detection accuracy.
At the core of SAM-DETR is a plug-and-play module named "Semantics Aligner", placed ahead of the cross-attention module in each of DETR's decoder layers. SAM-DETR also models a learnable reference box for each object query, whose center location is used to generate the corresponding position embeddings.
The figure below illustrates the architecture of the appended "Semantics Aligner", which aligns the semantics of "encoded image features" and "object queries" by resampling features from multiple salient points as new object queries.
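As a rough illustration of what "resampling features from multiple salient points" can look like, the sketch below bilinearly interpolates a tiny feature map at a few fractional coordinates and joins the sampled vectors into one new query. It is a pure-Python toy, not the repository's implementation: the function names, the list-of-lists feature layout, the hard-coded points, and the choice to concatenate the sampled features are all assumptions made for illustration. In SAM-DETR the salient points are found by the model itself.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a feature vector at fractional (x, y).

    feature_map[row][col] is a list of C floats; x indexes columns, y rows.
    Coordinates are clamped to the map, so border points reuse edge values.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    channels = len(feature_map[0][0])
    return [
        (1 - wy) * ((1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c])
        + wy * ((1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c])
        for c in range(channels)
    ]


def resample_query(feature_map, salient_points):
    """Join the features sampled at each salient point (concatenation is an assumption)."""
    query = []
    for x, y in salient_points:
        query.extend(bilinear_sample(feature_map, x, y))
    return query


# Toy 2x2 feature map with one channel: rows are [0, 1] and [2, 3].
toy_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
# Hypothetical salient points given as (x, y) in feature-map coordinates.
new_query = resample_query(toy_map, [(0.5, 0.5), (1.0, 0.0)])
print(new_query)
```

Because the new query is built from the encoded image features themselves, it lives in the same embedding space as the features it is later matched against, which is the alignment the paragraph above describes.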
Because it is plug-and-play, our approach can be easily integrated with existing convergence solutions (e.g., SMCA) in a complementary manner, further boosting detection accuracy and convergence speed.
Please check our CVPR'2022 paper for more details.
You must have NVIDIA GPUs to run the code.
The code was developed and tested with the following environment:
- Linux
- 8x NVIDIA V100 GPUs (32GB)
- CUDA 10.1
- Python == 3.8
- PyTorch == 1.8.1+cu101, TorchVision == 0.9.1+cu101
- GCC == 7.5.0
- cython, pycocotools, tqdm, scipy
We recommend using the exact setups above. However, other environments (Linux, Python>=3.7, CUDA>=9.2, GCC>=5.4, PyTorch>=1.5.1, TorchVision>=0.6.1) should also work.
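If you use a different environment, a quick check against the minimum versions listed above can save a failed run. The following is a minimal sketch, not part of the repository: `parse_version` and `unmet_requirements` are hypothetical helpers, and the example versions are made up. In practice you would pass `platform.python_version()`, `torch.__version__`, and `torchvision.__version__`.

```python
import re

# Minimum versions quoted in the sentence above.
MINIMUMS = {"python": "3.7", "torch": "1.5.1", "torchvision": "0.6.1"}


def parse_version(text):
    """Turn '1.8.1+cu101' into (1, 8, 1), ignoring local suffixes such as '+cu101'."""
    public = text.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public))


def unmet_requirements(installed):
    """Return the names whose installed version is below the listed minimum."""
    return [
        name
        for name, minimum in MINIMUMS.items()
        if name in installed
        and parse_version(installed[name]) < parse_version(minimum)
    ]


# Hypothetical installed versions, for illustration only.
example = {"python": "3.8.10", "torch": "1.8.1+cu101", "torchvision": "0.5.0"}
problems = unmet_requirements(example)
print(problems)  # ['torchvision']
```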
First, clone the repository locally:
git clone https://github.com/ZhangGongjie/SAM-DETR.git
We recommend using Anaconda to create a conda environment:
conda create -n sam_detr python=3.8 pip
Then, activate the environment:
conda activate sam_detr
Then, install PyTorch and TorchVision (preferably using our recommended setups; the CUDA version should match your own environment):
conda install pytorch=1.8.1 torchvision=0.9.1 cudatoolkit=10.1 -c pytorch
After that, install other requirements:
conda install cython scipy tqdm
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
[Optional] If you wish to run the multi-scale version of SAM-DETR (results not reported in the CVPR paper), you need to compile Deformable Attention, which is used in the DETR encoder to generate the feature pyramid efficiently. If you don't need the multi-scale version of SAM-DETR, you may skip this step.
# Optionally compile CUDA operators of Deformable Attention for multi-scale SAM-DETR
cd SAM-DETR
cd ./models/ops
sh ./make.sh
python test.py # unit test (all checks should report True)
Please download the COCO 2017 dataset and organize it as follows:
sam_detr_root/
└── data/
    └── coco/
        ├── train2017/
        ├── val2017/
        └── annotations/
            ├── instances_train2017.json
            └── instances_val2017.json
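Before launching a long training run, it can help to confirm that the dataset is where the scripts expect it. Below is a minimal sketch that checks the layout shown above; `missing_coco_paths` is a hypothetical helper, not something provided by this repository, and the demo runs on a throwaway temporary directory.

```python
import tempfile
from pathlib import Path

# Paths expected under the repository root, taken from the tree above.
EXPECTED = [
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco/annotations/instances_train2017.json",
    "data/coco/annotations/instances_val2017.json",
]


def missing_coco_paths(root):
    """Return the expected COCO paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Demo: a temporary root that only contains train2017/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data" / "coco" / "train2017").mkdir(parents=True)
    missing = missing_coco_paths(tmp)
    print(missing)
```

Calling `missing_coco_paths(".")` from the repository root should return an empty list once the dataset is fully in place.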
All scripts to reproduce results reported in our CVPR'2022 paper are stored in ./scripts.
For example, to reproduce the results of SAM-DETR-R50 w/ SMCA (12 epochs), simply run:
bash scripts/r50_smca_e12_4gpu.sh
Similarly, to reproduce the results of SAM-DETR-R50 multiscale w/ SMCA (50 epochs), simply run:
bash scripts/r50_ms_smca_e50_8gpu.sh
Reminder: To reproduce results, please make sure the total batch size matches the implementation details described in our paper. For R50 (single-scale) experiments, we use 4 GPUs with a batch size of 4 on each GPU. For R50 (multi-scale) experiments, we use 8 GPUs with a batch size of 2 on each GPU. For R50-DC5 (single-scale) experiments, we use 8 GPUs with a batch size of 1 on each GPU.
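The total batch size is simply the number of GPUs multiplied by the per-GPU batch size. The short sketch below works this out for the three settings above and shows how to pick a per-GPU batch size when you have a different number of GPUs. The helper names are hypothetical and not part of the repository.

```python
# (number of GPUs, per-GPU batch size) for each setting, as stated above.
SETTINGS = {
    "R50 (single-scale)": (4, 4),
    "R50 (multi-scale)": (8, 2),
    "R50-DC5 (single-scale)": (8, 1),
}


def total_batch_size(num_gpus, batch_per_gpu):
    """Total batch size across all GPUs."""
    return num_gpus * batch_per_gpu


def per_gpu_batch(target_total, num_gpus):
    """Per-GPU batch size that keeps the total batch size unchanged."""
    if target_total % num_gpus != 0:
        raise ValueError(
            f"total batch size {target_total} is not divisible by {num_gpus} GPUs"
        )
    return target_total // num_gpus


totals = {name: total_batch_size(*cfg) for name, cfg in SETTINGS.items()}
for name, total in totals.items():
    print(f"{name}: total batch size {total}")

# Running the R50 (single-scale) setting on 8 GPUs instead of 4:
print(per_gpu_batch(totals["R50 (single-scale)"], 8))  # 2
```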
To perform training on COCO train2017, run:
# --nproc_per_node: number of GPUs used for training
# --batch_size: batch size on each GPU (NOT the total batch size)
# --smca: integrate with SMCA; remove this line to disable SMCA
# --dilation: enable DC5; remove this line to disable DC5
# --multiscale: enable multi-scale; remove this line to disable multi-scale
# --epochs: total number of epochs to train
# --lr_drop: epoch at which the learning rate is dropped
# --output_dir: where to store outputs; remove this line to store nothing
python -m torch.distributed.launch \
--nproc_per_node=4 \
--use_env \
main.py \
--batch_size 4 \
--smca \
--dilation \
--multiscale \
--epochs 50 \
--lr_drop 40 \
--output_dir output/xxxx
More arguments and their explanations are available in main.py.
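If you launch many variants, assembling the command programmatically avoids copy-paste mistakes with the optional flags. The following is a minimal sketch that only uses the arguments shown above; `build_launch_command` is a hypothetical helper and is not provided by this repository.

```python
import shlex


def build_launch_command(nproc_per_node, batch_size, epochs, lr_drop,
                         smca=False, dilation=False, multiscale=False,
                         output_dir=None):
    """Assemble the training command shown above as an argument list."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "--use_env",
        "main.py",
        "--batch_size", str(batch_size),
    ]
    # Optional switches are added only when enabled.
    for flag, enabled in (("--smca", smca),
                          ("--dilation", dilation),
                          ("--multiscale", multiscale)):
        if enabled:
            cmd.append(flag)
    cmd += ["--epochs", str(epochs), "--lr_drop", str(lr_drop)]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


cmd = build_launch_command(4, 4, epochs=50, lr_drop=40,
                           smca=True, output_dir="output/xxxx")
# Print a shell-safe, single-line version of the command.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.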
To evaluate a model on COCO val2017, simply add the --resume and --eval arguments:
# --resume: path to the trained model weights
# --eval: perform evaluation only
python -m torch.distributed.launch \
--nproc_per_node=4 \
--use_env \
main.py \
--batch_size 4 \
--smca \
--dilation \
--multiscale \
--epochs 50 \
--lr_drop 40 \
--resume <path/to/checkpoint.pth> \
--eval \
--output_dir output/xxxx
The original DETR models trained for 500 epochs:
Method | Epochs | Params (M) | GFLOPs | AP | URL |
---|---|---|---|---|---|
DETR-R50 | 500 | 41 | 86 | 42.0 | log |
DETR-R50-DC5 | 500 | 41 | 187 | 43.3 | log |
Our proposed SAM-DETR models (results reported in our CVPR paper):
Method | Epochs | Params (M) | GFLOPs | AP | URL |
---|---|---|---|---|---|
SAM-DETR-R50 | 12 | 58 | 100 | 33.1 | model log |
SAM-DETR-R50 w/ SMCA | 12 | 58 | 100 | 36.0 | model log |
SAM-DETR-R50-DC5 | 12 | 58 | 210 | 38.3 | model log |
SAM-DETR-R50-DC5 w/ SMCA | 12 | 58 | 210 | 40.6 | model log |
SAM-DETR-R50 | 50 | 58 | 100 | 39.8 | model log |
SAM-DETR-R50 w/ SMCA | 50 | 58 | 100 | 41.8 | model log |
SAM-DETR-R50-DC5 | 50 | 58 | 210 | 43.3 | model log |
SAM-DETR-R50-DC5 w/ SMCA | 50 | 58 | 210 | 45.0 | model log |
Our proposed multi-scale SAM-DETR models (results to appear in a journal extension):
Method | Epochs | Params (M) | GFLOPs | AP | URL |
---|---|---|---|---|---|
SAM-DETR-R50-MS | 12 | 55 | 203 | 41.1 | model log |
SAM-DETR-R50-MS w/ SMCA | 12 | 55 | 203 | 42.9 | model log |
SAM-DETR-R50-MS | 50 | 55 | 203 | 46.1 | model log |
SAM-DETR-R50-MS w/ SMCA | 50 | 55 | 203 | 47.0 | model log |
Note:
- AP is computed on COCO val2017.
- "DC5" means removing the stride in the C5 stage of ResNet and adding a dilation of 2 instead.
- The GFLOPs of our models are measured using fvcore on the first 100 images in COCO val2017. GFLOPs vary with input image size, so there may be slight deviations from the actual values.
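To make the DC5 note concrete: a standard ResNet downsamples the input by a factor of 32 at the C5 stage, and removing that stage's stride leaves a factor of 16. The sketch below computes the resulting feature map sizes. It is an illustration only: `c5_feature_size` is a hypothetical helper, the input size is made up, and ceiling division is an assumption about how sizes round.

```python
import math


def c5_feature_size(height, width, dc5=False):
    """Approximate spatial size of the C5 feature map for a given input size.

    Total downsampling is 32 for a standard ResNet and 16 with DC5.
    """
    stride = 16 if dc5 else 32
    return math.ceil(height / stride), math.ceil(width / stride)


h, w = 800, 1216  # hypothetical input size, for illustration only
plain = c5_feature_size(h, w)
dilated = c5_feature_size(h, w, dc5=True)
tokens_ratio = (dilated[0] * dilated[1]) / (plain[0] * plain[1])

print(plain)         # (25, 38)
print(dilated)       # (50, 76)
print(tokens_ratio)  # 4.0
```

With DC5, the Transformer receives about four times as many feature tokens, which is consistent with the higher GFLOPs of the DC5 models in the tables above.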
The implementation code of SAM-DETR is released under the MIT license.
Please see the LICENSE file for more information.
If you find SAM-DETR useful or inspiring, please consider citing:
@inproceedings{zhang2022-SAMDETR,
title = {Accelerating {DETR} Convergence via Semantic-Aligned Matching},
author = {Zhang, Gongjie and Luo, Zhipeng and Yu, Yingchen and Cui, Kaiwen and Lu, Shijian},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}
Our SAM-DETR is heavily inspired by many outstanding prior works, including DETR, Conditional-DETR, SMCA-DETR, and Deformable DETR. We thank the authors of the above projects for open-sourcing their implementations!