A modular reference PyTorch implementation of Associating Objects with Transformers for Video Object Segmentation (NeurIPS 2021). [paper]
## Highlights
- High performance: up to 85.5% (R50-AOTL) on YouTube-VOS 2018 and 82.1% (SwinB-AOTL) on DAVIS-2017 Test-dev under standard settings.
- High efficiency: up to 51fps (AOTT) on DAVIS-2017 (480p) and 41fps on YouTube-VOS (1.3x480p).
- Multi-GPU training and inference.
- Mixed precision training and inference.
- Test-time augmentation: multi-scale and flipping augmentations are supported.
## Requirements
- Python3
- PyTorch >= 1.7.0 and torchvision
- opencv-python
- Pillow

Optional (for better efficiency):
- Pytorch Correlation (we recommend installing it from source instead of via `pip`)
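A minimal installation sketch is shown below. The exact PyTorch build depends on your CUDA version, and the `Pytorch-Correlation-extension` repository is an assumption about which correlation package is meant; adjust both as needed.

```bash
# Core dependencies (pick the torch build matching your CUDA version).
pip install "torch>=1.7.0" torchvision opencv-python Pillow

# Optional: build the correlation op from source for better efficiency.
# Assumes the ClementPinard/Pytorch-Correlation-extension package; verify
# that this is the implementation the project expects.
git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
cd Pytorch-Correlation-extension
python setup.py install
cd ..
```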
## Demo
Coming
## Model Zoo and Results
Pre-trained models and the corresponding results reproduced by this project can be found in MODEL_ZOO.md.
## Getting Started
1. Prepare datasets:

   Please follow the instructions below to prepare each dataset in its corresponding folder. An example directory layout is sketched at the end of this step.
   - Static: `datasets/Static` is the pre-training dataset of static images. Guidance on preparing it can be found in AFB-URR.
   - YouTube-VOS: a commonly-used large-scale VOS dataset.
     - `datasets/YTB/2019`: version 2019, download link. `train` is required for training; `valid` (6fps) and `valid_all_frames` (30fps, optional) are used for evaluation.
     - `datasets/YTB/2018`: version 2018, download link. Only `valid` (6fps) and `valid_all_frames` (30fps, optional) are required for this project and used for evaluation.
   - DAVIS: a commonly-used small-scale VOS dataset.
     - `datasets/DAVIS`: TrainVal (480p) contains both the training and validation splits, and Test-Dev (480p) contains the Test-dev split. The full-resolution version is also supported for training and evaluation but is not required.
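   Once prepared, the `datasets` folder should look roughly like the sketch below; the sub-folders inside each dataset follow the official releases and may differ in detail.

   ```
   datasets/
   ├── Static/                     # pre-training images
   ├── YTB/
   │   ├── 2019/
   │   │   ├── train/
   │   │   ├── valid/
   │   │   └── valid_all_frames/   # optional, 30fps
   │   └── 2018/
   │       ├── valid/
   │       └── valid_all_frames/   # optional, 30fps
   └── DAVIS/                      # TrainVal and Test-Dev (480p)
   ```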
2. Prepare ImageNet pre-trained encoders:

   Select and download the checkpoints below into `pretrain_models` (see the download sketch at the end of this step):
   - MobileNet-V2 (default encoder)
   - MobileNet-V3
   - ResNet-50
   - ResNet-101
   - ResNeSt-101
   - Swin-Base
   The current default training configs are not optimized for encoders larger than ResNet-50. If you want to use a larger encoder, we recommend early-stopping the main-training stage at 80,000 iterations (instead of the default 100,000) to avoid over-fitting on the seen classes of YouTube-VOS.
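   For the torchvision encoders, a download sketch follows; the URLs are torchvision's standard ImageNet checkpoints, and whether these exact files are the ones the training configs expect is an assumption worth verifying. ResNeSt-101 and Swin-Base checkpoints come from their respective repositories.

   ```bash
   mkdir -p pretrain_models && cd pretrain_models

   # torchvision's standard ImageNet checkpoints (assumed to match the
   # encoders listed above -- verify against the project's configs).
   wget https://download.pytorch.org/models/mobilenet_v2-b0353104.pth
   wget https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth
   wget https://download.pytorch.org/models/resnet50-19c8e357.pth
   wget https://download.pytorch.org/models/resnet101-5d3b4d8f.pth

   # ResNeSt-101 and Swin-Base: download manually from their repositories.
   cd ..
   ```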
3. Training and evaluation:
   The example script will train AOTT in 2 stages using 4 GPUs and automatic mixed precision (`--amp`). The first stage is a pre-training stage using the `Static` dataset; the second stage is a main-training stage using both `YouTube-VOS 2019 train` and `DAVIS-2017 train`, which yields a model that generalizes across domains (YouTube-VOS and DAVIS) and frame rates (6fps and 24fps).

   Notably, you can use only the `YouTube-VOS 2019 train` split in the second stage, which leads to better YouTube-VOS performance on unseen classes.

   After training finishes, the example script will evaluate the model on YouTube-VOS and DAVIS, and the results will be packed into Zip files. For calculating scores, please use the official YouTube-VOS servers (2018 server and 2019 server) and the official DAVIS toolkit.
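   For orientation only, the two stages might be launched as in the sketch below. The entry points (`tools/train.py`, `tools/eval.py`) and every flag name here are assumptions rather than the project's verified interface; consult the example script and each tool's `--help`.

   ```bash
   # Hypothetical invocations -- script paths and flags are assumptions.
   # Stage 1: pre-training on the Static dataset (4 GPUs, mixed precision).
   python tools/train.py --amp --model aott --stage pre --gpu_num 4

   # Stage 2: main training on YouTube-VOS 2019 train + DAVIS-2017 train.
   python tools/train.py --amp --model aott --stage pre_ytb_dav --gpu_num 4

   # Evaluation; multi-scale and flipping test-time augmentation would be
   # switched on through similar (assumed) flags.
   python tools/eval.py --model aott --dataset youtubevos --split val
   ```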
## Adding your own dataset
Coming

## Troubleshooting
Waiting
## Citations
```
@inproceedings{yang2021aot,
  title={Associating Objects with Transformers for Video Object Segmentation},
  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}
```
## License
This project is released under the BSD-3-Clause license. See LICENSE for additional details.