This is the official code of the ICCV 2021 paper:
Residual Attention: A Simple But Effective Method for Multi-Label Recognition
This package is developed by Mr. Ke Zhu (http://www.lamda.nju.edu.cn/zhuk/), and we have just finished the implementation code of the ViT models. If you have any questions about the code, please feel free to contact Mr. Ke Zhu ([email protected]). The package is free for academic usage. You can run it at your own risk. For other purposes, please contact Prof. Jianxin Wu (mail to [email protected]).
- Python 3.7
- pytorch 1.6
- torchvision 0.7.0
- pycocotools 2.0
- tqdm 4.49.0, pillow 7.2.0
We expect the VOC2007, COCO2014 and Wider-Attribute datasets to have the following structure:
Dataset/
|-- VOCdevkit/
|---- VOC2007/
|------ JPEGImages/
|------ Annotations/
|------ ImageSets/
......
|-- COCO2014/
|---- annotations/
|---- images/
|------ train2014/
|------ val2014/
......
|-- WIDER/
|---- Annotations/
|------ wider_attribute_test.json
|------ wider_attribute_trainval.json
|---- Image/
|------ train/
|------ val/
|------ test/
...
Then run the following commands to generate the JSON annotation files that the code reads for each dataset:
python utils/prepare/voc.py --data_path Dataset/VOCdevkit
python utils/prepare/coco.py --data_path Dataset/COCO2014
python utils/prepare/wider.py --data_path Dataset/WIDER
These commands write the annotation JSON files to ./data/voc07, ./data/coco and ./data/wider.
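Before running the preparation scripts, it can help to confirm that the dataset root matches the expected structure. The following is a minimal sketch, not part of this package: the helper name `missing_paths` and the `EXPECTED_LAYOUT` table are hypothetical, and the paths are copied from the tree above (entries elided with `...` there are not checked).

```python
from pathlib import Path

# Sub-paths copied from the directory tree above, relative to the dataset root.
# EXPECTED_LAYOUT and missing_paths are illustrative names, not part of the repo.
EXPECTED_LAYOUT = {
    "VOCdevkit": ["VOC2007/JPEGImages", "VOC2007/Annotations", "VOC2007/ImageSets"],
    "COCO2014": ["annotations", "images/train2014", "images/val2014"],
    "WIDER": [
        "Annotations/wider_attribute_test.json",
        "Annotations/wider_attribute_trainval.json",
        "Image/train",
        "Image/val",
        "Image/test",
    ],
}


def missing_paths(root, layout=EXPECTED_LAYOUT):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    missing = []
    for dataset, entries in layout.items():
        for entry in entries:
            path = root / dataset / entry
            if not path.exists():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    # "Dataset" is the root directory name used in the tree above.
    for path in missing_paths("Dataset"):
        print("missing:", path)
```

An empty result means every listed directory and annotation file was found under the root.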
We provide prediction demos of our models. The demo images (picked from VOC2007) have already been put into ./utils/demo_images/. You can run demo.py with our CSRA models pretrained on VOC2007:
CUDA_VISIBLE_DEVICES=0 python demo.py --model resnet101 --num_heads 1 --lam 0.1 --dataset voc07 --load_from OUR_VOC_PRETRAINED.pth --img_dir utils/demo_images
which produces output like this:
utils/demo_images/000001.jpg prediction: dog,person,
utils/demo_images/000004.jpg prediction: car,
utils/demo_images/000002.jpg prediction: train,
...
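If the predictions are needed in a script, the output lines can be parsed into a dictionary. This is a small sketch that assumes every line follows exactly the `<path> prediction: <label>,<label>,` pattern shown above; the function name `parse_demo_line` is hypothetical.

```python
def parse_demo_line(line):
    """Split one demo output line into (image_path, [labels])."""
    path, _, labels = line.partition(" prediction: ")
    # The trailing comma yields an empty last field, which is filtered out.
    return path.strip(), [name for name in labels.strip().split(",") if name]


# Sample lines copied from the demo output above.
sample = [
    "utils/demo_images/000001.jpg prediction: dog,person,",
    "utils/demo_images/000004.jpg prediction: car,",
]
predictions = dict(parse_demo_line(line) for line in sample)
print(predictions)
```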
We provide pretrained models on Google Drive for validation. ResNet101 trained on ImageNet with CutMix augmentation can be downloaded here.
Dataset | Backbone | Head nums | mAP(%) | Resolution | Download |
---|---|---|---|---|---|
VOC2007 | ResNet-101 | 1 | 94.7 | 448x448 | download |
VOC2007 | ResNet-cut | 1 | 95.2 | 448x448 | download |
COCO | ResNet-101 | 4 | 83.3 | 448x448 | download |
COCO | ResNet-cut | 6 | 85.6 | 448x448 | download |
Wider | VIT_B16_224 | 1 | 89.0 | 224x224 | download |
Wider | VIT_L16_224 | 1 | 90.2 | 224x224 | download |
For voc2007, run the following validation example:
CUDA_VISIBLE_DEVICES=0 python val.py --num_heads 1 --lam 0.1 --dataset voc07 --num_cls 20 --load_from MODEL.pth
For coco2014, run the following validation example:
CUDA_VISIBLE_DEVICES=0 python val.py --num_heads 4 --lam 0.5 --dataset coco --num_cls 80 --load_from MODEL.pth
For Wider-Attribute with ViT models, run the following:
CUDA_VISIBLE_DEVICES=0 python val.py --model vit_B16_224 --img_size 224 --num_heads 1 --lam 0.3 --dataset wider --num_cls 14 --load_from ViT_B16_MODEL.pth
CUDA_VISIBLE_DEVICES=0 python val.py --model vit_L16_224 --img_size 224 --num_heads 1 --lam 0.3 --dataset wider --num_cls 14 --load_from ViT_L16_MODEL.pth
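The validation settings above can also be kept in one place and turned into command lines. The sketch below only reuses the flags and values that appear in the examples above; the `VAL_CONFIGS` table and the `val_command` helper are hypothetical and not part of this package, and the checkpoint names remain placeholders.

```python
# Flag values copied from the validation examples above.
# VAL_CONFIGS and val_command are illustrative names, not part of the repo.
VAL_CONFIGS = {
    "voc07": {"num_heads": 1, "lam": 0.1, "num_cls": 20},
    "coco": {"num_heads": 4, "lam": 0.5, "num_cls": 80},
    "wider": {"model": "vit_B16_224", "img_size": 224,
              "num_heads": 1, "lam": 0.3, "num_cls": 14},
}


def val_command(dataset, checkpoint):
    """Build the val.py argument list for one dataset."""
    config = VAL_CONFIGS[dataset]
    args = ["python", "val.py"]
    # Optional flags first, in the same order as the examples above.
    for key in ("model", "img_size", "num_heads", "lam"):
        if key in config:
            args += ["--" + key, str(config[key])]
    args += ["--dataset", dataset,
             "--num_cls", str(config["num_cls"]),
             "--load_from", checkpoint]
    return args


print(" ".join(val_command("coco", "MODEL.pth")))
```

The returned list can be passed to `subprocess.run`, with `CUDA_VISIBLE_DEVICES` set in the environment as in the examples.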
To provide pretrained ViT models on the Wider-Attribute dataset, we recently retrained them, so their performance differs slightly (~0.1% mAP) from what is presented in our paper. The ViT models use the original ViT architecture (An image is worth 16x16 words: Transformers for image recognition at scale, link), and the implementation code of the ViT models is derived from http://github.com/rwightman/pytorch-image-models/.
You can run either of the two commands below:
CUDA_VISIBLE_DEVICES=0 python main.py --num_heads 1 --lam 0.1 --dataset voc07 --num_cls 20
CUDA_VISIBLE_DEVICES=0 python main.py --num_heads 1 --lam 0.1 --dataset voc07 --num_cls 20 --cutmix CutMix_ResNet101.pth
Note that the first command uses the official ResNet-101 backbone, while the second command uses the ResNet-101 pretrained on ImageNet with CutMix augmentation (link), which is expected to perform better.
run the ResNet-101 with 4 heads
CUDA_VISIBLE_DEVICES=0 python main.py --num_heads 4 --lam 0.5 --dataset coco --num_cls 80
run the ResNet-101 (pretrained with CutMix) with 6 heads
CUDA_VISIBLE_DEVICES=0 python main.py --num_heads 6 --lam 0.4 --dataset coco --num_cls 80 --cutmix CutMix_ResNet101.pth
Feel free to adjust hyper-parameters such as the number of attention heads (--num_heads) or lambda (--lam). Still, the values used in the commands above are expected to give the best results.
run the VIT_B16_224 with 1 head
CUDA_VISIBLE_DEVICES=0 python main.py --model vit_B16_224 --img_size 224 --num_heads 1 --lam 0.3 --dataset wider --num_cls 14
run the VIT_L16_224 with 1 head
CUDA_VISIBLE_DEVICES=0,1 python main.py --model vit_L16_224 --img_size 224 --num_heads 1 --lam 0.3 --dataset wider --num_cls 14
Note that the VIT_L16_224 model consumes more GPU memory, so we use 2 GPUs to train it.
To avoid confusion, please note that the 4 lines of code in Figure 1 (in the paper) are only used in the test stage (without training), which is our motivation. When our model is trained and tested end-to-end, multi-head attention (H=1, H=2, H=4, etc.) is used with different T values. Also, when H=1 and T=∞, the implementation code of multi-head attention is exactly the same as Figure 1.
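The relationship between the temperature T and the test-time variant can be illustrated with a framework-free sketch. It rests on assumptions based on a reading of the paper rather than on this repository's code: each head adds an average-pooled class score to a spatially pooled score weighted by --lam, the spatial pooling is a softmax with temperature T that becomes max pooling as T goes to infinity, and the heads are summed. The function names are hypothetical, and the actual implementation works on PyTorch feature tensors and may differ in details such as normalization.

```python
import math


def csra_score(scores, lam, T=None):
    """Residual-attention score for one class (illustrative sketch).

    scores: per-location scores of a single class, as a flat list of floats.
    lam:    weight of the spatially pooled term (the --lam flag).
    T:      softmax temperature; None stands for T = infinity (max pooling).
    """
    average = sum(scores) / len(scores)
    if T is None:
        attention = max(scores)
    else:
        peak = max(scores)  # subtract the peak for numerical stability
        weights = [math.exp(T * (s - peak)) for s in scores]
        total = sum(weights)
        attention = sum(w * s for w, s in zip(weights, scores)) / total
    return average + lam * attention


def multi_head_score(scores, lam, temperatures):
    """Sum single-head scores, one head per temperature (assumed combination)."""
    return sum(csra_score(scores, lam, T) for T in temperatures)


scores = [0.0, 1.0, 3.0]
# With one head and T = infinity, the result is average pooling plus
# lam times max pooling.
print(multi_head_score(scores, lam=0.1, temperatures=[None]))
# A large finite T gives nearly the same value as max pooling.
print(csra_score(scores, lam=0.1, T=1000.0))
```

With T=0 the softmax weights are uniform, so the spatial term reduces to the plain average; larger T values move it toward the maximum.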
We didn't use any newer augmentation, such as AutoAugment or RandAugment, in our ResNet series models.
We thank Lin Sui (http://www.lamda.nju.edu.cn/suil/) for his initial contribution to this project.