
Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

[Figure: VideoMAE framework overview]

License: CC BY-NC 4.0 | Hugging Face Models | Hugging Face Spaces | Colab

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

📰 News

[2023.1.16] Code and pre-trained models for Action Detection are available!
[2022.11.20] 👀 VideoMAE is integrated into Hugging Face Spaces and Colab, supported by @Sayak Paul.
[2022.10.25] 👀 VideoMAE is integrated into MMAction2, and the Kinetics-400 results can be reproduced there.
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available!
[2022.10.19] The pre-trained models and scripts on UCF101 are available!
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉
[2022.8.8] 👀 VideoMAE is integrated into the official 🤗 Hugging Face Transformers library (see the usage sketch after this list)!
[2022.7.7] We have updated the results on the downstream AVA 2.2 benchmark. Please refer to our paper for details.
[2022.4.24] Code and pre-trained models are available now!
[2022.3.24] Code and pre-trained models will be released here. Watch this repository for the latest updates.
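
For the Hugging Face Transformers integration mentioned above, the model can be loaded directly from the Hub. Below is a minimal sketch for video classification; the `MCG-NJU/videomae-base-finetuned-kinetics` checkpoint name refers to the public upload, and the random frames are stand-ins for a real decoded clip:

```python
# A minimal sketch of running the Hugging Face port of VideoMAE for video
# classification; random frames stand in for a clip sampled from a real video.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# 16 RGB frames of shape (H, W, 3), uint8.
video = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (1, 400) Kinetics-400 class logits
print(model.config.id2label[logits.argmax(-1).item()])
```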

✨ Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
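
As a concrete illustration (a minimal sketch of the idea, not the repository's exact implementation), tube masking draws one random mask over the spatial patch grid and repeats it along time, so a masked patch is hidden in every frame and cannot be trivially recovered from its temporal neighbors:

```python
# A minimal sketch of tube masking: one random spatial mask shared by all
# temporal cubes, so the same patches are hidden throughout the clip.
import torch

def tube_mask(temporal_cubes: int, spatial_patches: int, mask_ratio: float) -> torch.Tensor:
    num_masked = int(mask_ratio * spatial_patches)
    order = torch.rand(spatial_patches).argsort()   # random permutation of spatial patches
    spatial = order < num_masked                    # True = masked, exactly num_masked entries
    return spatial.repeat(temporal_cubes)           # share the same mask across all cubes

# 16 frames with temporal patch size 2 -> 8 cubes; (224/16)^2 = 196 spatial patches.
mask = tube_mask(temporal_cubes=8, spatial_patches=196, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())       # torch.Size([1568]), ~0.9
```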

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder with a plain ViT backbone to perform video self-supervised learning. Thanks to the extremely high masking ratio, VideoMAE pre-trains much faster (3.2x speedup) than contrastive learning methods. VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
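
The speedup comes from the asymmetric MAE design: the encoder attends only over the tokens that survive masking, roughly 10% of the sequence at a 90% ratio. A rough sketch of that token-dropping step (a random mask stands in for the tube mask above):

```python
# A rough sketch of why high masking makes pre-training cheap: only the
# visible tokens are fed to the ViT encoder.
import torch

tokens = torch.randn(1, 1568, 768)                    # 8 cubes x 196 patches, ViT-B width
mask = torch.rand(1568).argsort() < int(0.9 * 1568)   # True = masked
encoder_input = tokens[:, ~mask, :]                   # drop masked tokens before the encoder
print(encoder_input.shape)                            # torch.Size([1, 157, 768]) vs 1568 full tokens
```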

😮 High performance, but NO extra data required

VideoMAE works well on video datasets of different scales, achieving 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first approach to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without needing any extra data or pre-trained models.

🚀 Main Results

✨ Something-Something V2

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |
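
In these tables, #Frames x Clips x Crops describes the multi-view test protocol: predictions from all temporal clips and spatial crops of a video are fused, typically by averaging softmax scores. A minimal sketch of that fusion step (the per-view logits are assumed to be computed already):

```python
# A minimal sketch of multi-view testing: average softmax scores over all
# clips x crops views of one video (e.g. 2 x 3 = 6 views for the 16x2x3 rows).
import torch

def fuse_views(view_logits: torch.Tensor) -> torch.Tensor:
    # view_logits: (num_views, num_classes)
    return view_logits.softmax(dim=-1).mean(dim=0)

scores = fuse_views(torch.randn(6, 174))  # Something-Something V2 has 174 classes
print(scores.argmax().item())             # fused class prediction
```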

✨ Kinetics-400

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |

✨ AVA 2.2

Please check the code and checkpoints in VideoMAE-Action-Detection.

| Method | Extra Data | Extra Label | Backbone | #Frames x Sample Rate | mAP |
| :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |

✨ UCF101 & HMDB51

| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| :---: | :---: | :---: | :---: | :---: |
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instruction is in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide a visualization script in vis.sh. A Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you find this project helpful, please consider leaving a star ⭐️ and citing our paper:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}
