By Hongwei Xue*, Yuchong Sun*, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo.
This repo is the official PyTorch implementation of CLIP-ViP: Adapting Image-Text Pre-training to Video-Language Representation Learning, accepted by ICLR 2023. CLIP-ViP is a video-language model that starts from the pre-trained image-text model CLIP and is further pre-trained (post-pretraining) on the large-scale video-text dataset HD-VILA-100M. This repo contains the code of the CLIP-ViP model, the post-pretraining method, and finetuning on text-to-video retrieval.
We provide a Docker image for easier reproduction: `tiankaihang/azureml_docker:horovod`. We use mixed-precision training; hence, GPUs with Tensor Cores are recommended.
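A minimal sketch of pulling the image and starting a container (the mount point and GPU flags below are illustrative; adjust them to your machine):

```bash
# pull the provided image
docker pull tiankaihang/azureml_docker:horovod

# start an interactive container with GPU access and the current directory mounted (paths illustrative)
docker run --gpus all -it --rm \
    -v $(pwd):/workspace \
    -w /workspace \
    tiankaihang/azureml_docker:horovod bash
```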
- Download Data.

  Download HD-VILA-100M and other required data following the instructions of HD-VILA. Also download the auxiliary captions from: Azure Blob Link
- Download pretrained models.

  We release the CLIP-ViP model under two settings:
  - CLIP-ViP-B/32: Azure Blob Link
  - CLIP-ViP-B/16: Azure Blob Link
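If you fetch the blob contents with azcopy, the commands look roughly like the following; the destination folders are only illustrative, and `<AZURE_BLOB_URL>` stands for the corresponding link above:

```bash
# auxiliary captions for HD-VILA-100M (destination folder is an example)
azcopy copy "<AZURE_BLOB_URL>" ./data/hdvila_aux_captions --recursive

# pre-trained CLIP-ViP checkpoints (destination folder is an example)
azcopy copy "<AZURE_BLOB_URL>" ./pretrained_models --recursive
```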
### Pre-training

```bash
# inside the container
horovodrun -np $NUM_GPUS python src/pretrain/run_pretrain.py \
    --config $CONFIG_PATH
```

`$CONFIG_PATH` should be set to one of the .json config files available under `src/configs/pretrain`. Currently, `pretrain_vip_base_32.json` and `pretrain_vip_base_16.json` are supported.
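For example, a concrete launch for the B/32 setting on 8 GPUs might look like this (the GPU count is arbitrary, and the full config path assumes the file sits directly under `src/configs/pretrain`):

```bash
# inside the container: post-pretrain CLIP-ViP-B/32 on 8 GPUs
NUM_GPUS=8
CONFIG_PATH=src/configs/pretrain/pretrain_vip_base_32.json
horovodrun -np $NUM_GPUS python src/pretrain/run_pretrain.py \
    --config $CONFIG_PATH
```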
To run text-to-video retrieval with horovod:

- With `--is_train` (training):

  ```bash
  HOROVOD_CACHE_CAPACITY=4096 horovodrun -np 2 python run_video_retrieval_hj.py --config /data2/Hyejin/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./ --is_train
  ```

- Without `--is_train` (evaluation only):

  ```bash
  HOROVOD_CACHE_CAPACITY=4096 horovodrun -np 2 python run_video_retrieval_hj.py --config /data2/Hyejin/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./
  ```
### Text-to-Video Retrieval Finetuning
1. Set up accelerate (only the first time):

   ```bash
   accelerate config --config_file [path/to/store/config_file]
   ```

   To use a different number of GPUs, open the generated .yaml file and change the number of processes (see the sketch after this list).

2. Set up wandb: create a project called "clipvip" in your wandb account, then log in and switch to online mode:

   ```bash
   wandb login
   wandb online
   ```

3. Run the code:

   ```bash
   cd ./CLIP-ViP
   bash src/tasks/run.sh
   ```
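As noted in step 1, the GPU count lives in the accelerate config file. A hedged sketch of checking and editing it from the shell (the `num_processes` key is what a typical accelerate config uses; verify against the file you actually generated):

```bash
# inspect the generated accelerate config
cat [path/to/store/config_file]

# change the number of processes, e.g. to 4 GPUs (key name assumed from a typical accelerate config)
sed -i 's/^num_processes: .*/num_processes: 4/' [path/to/store/config_file]
```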
`$CONFIG_PATH` should be set to one of the .json config files available under `src/configs`, postfixed with `_retrieval.json`. For example, you can use `src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json` for finetuning CLIP-ViP-B/32 on MSR-VTT retrieval. For the model, currently `pretrain_vip_base_32.json` and `pretrain_vip_base_16.json` are supported. For the dataset, MSR-VTT, DiDeMo, LSMDC, and ActivityNet Captions are supported.
- Test the code:

  ```bash
  accelerate launch --config_file /home/s2/dayelee/accel.yaml ./run_video_retrieval_val.py --config /home/s2/dayelee/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir /home/s2/dayelee --num_train_epochs 5 --output_dir /shared/s2/lab01/dayelee_store/checkpoints/msrvtt --train_batch_size 16 --e2e_weights_path /home/s2/dayelee/dayelee_store/checkpoints/msrvtt_16_gpu_2/epoch_5_bs_16_lr_1e-06_31/model_best.pt
  ```

  Make sure that you do not pass `--is_train` when testing, and set `--e2e_weights_path` to load a model checkpoint.
If you find the code and pre-trained models useful for your research, please consider citing our paper:
```bibtex
@article{xue2022clip,
  title={CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment},
  author={Xue, Hongwei and Sun, Yuchong and Liu, Bei and Fu, Jianlong and Song, Ruihua and Li, Houqiang and Luo, Jiebo},
  journal={arXiv preprint arXiv:2209.06430},
  year={2022}
}

@inproceedings{xue2022advancing,
  title={Advancing high-resolution video-language representation with large-scale video transcriptions},
  author={Xue, Hongwei and Hang, Tiankai and Zeng, Yanhong and Sun, Yuchong and Liu, Bei and Yang, Huan and Fu, Jianlong and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5036--5045},
  year={2022}
}
```
The code is based on HD-VILA.