CLIP-ViP (ICLR 2023)


By Hongwei Xue*, Yuchong Sun*, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo.

This repo is the official PyTorch implementation of CLIP-ViP: Adapting Image-Text Pre-training to Video-Language Representation Learning, accepted by ICLR 2023. CLIP-ViP is a video-language model that starts from the pre-trained image-text model CLIP and is further pre-trained (post-pretraining) on the large-scale video-text dataset HD-VILA-100M. This repo contains the code of the CLIP-ViP model, the post-pretraining method, and finetuning on text-to-video retrieval.

Requirements

We provide a Docker image for easier reproduction: tiankaihang/azureml_docker:horovod

We use mixed-precision training, so GPUs with Tensor Cores are recommended.
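If you want to work inside that container, a minimal sketch is given below; the GPU flag, volume mount, and working directory are assumptions, so adapt them to your environment.

```bash
# Hedged sketch: pull the provided image and start an interactive GPU container.
# The mount and working directory are assumptions; point them at your own checkout.
docker pull tiankaihang/azureml_docker:horovod
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace -w /workspace \
    tiankaihang/azureml_docker:horovod bash
```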

Getting Started

General

  1. Download Data.

    Download HD-VILA-100M and other required data following the instructions of HD-VILA. Also download the auxiliary captions from: Azure Blob Link

  2. Download pretrained models.

    We release the CLIP-ViP model under two settings:

    CLIP-ViP-B/32: Azure Blob Link

    CLIP-ViP-B/16: Azure Blob Link

Pre-training

#inside the container
horovodrun -np $NUM_GPUS python src/pretrain/run_pretrain.py \
    --config $CONFIG_PATH

$CONFIG_PATH should be set to one of the .json config files available at src/configs/pretrain. Currently, pretrain_vip_base_32.json and pretrain_vip_base_16.json are supported.
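For example, a concrete invocation for the B/32 setting could look like the sketch below; the choice of 8 GPUs is an assumption, so match -np to your hardware.

```bash
# Hedged example: 8 processes is an assumption; set -np to your actual GPU count.
horovodrun -np 8 python src/pretrain/run_pretrain.py \
    --config src/configs/pretrain/pretrain_vip_base_32.json
```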

Text-to-Video Retrieval Finetuning (Hyejin)

For training (add the --is_train flag):

HOROVOD_CACHE_CAPACITY=4096 horovodrun -np 2 python run_video_retrieval_hj.py --config /data2/Hyejin/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./ --is_train

For testing (omit --is_train):

HOROVOD_CACHE_CAPACITY=4096 horovodrun -np 2 python run_video_retrieval_hj.py --config /data2/Hyejin/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./




### Text-to-Video Retrieval Finetuning

1. Set up accelerate (only needed the first time):

```bash
accelerate config --config_file [path/to/store/config_file]
```

To use a different number of GPUs, open the generated .yaml file and change the number of processes (num_processes), as sketched below.
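For instance, if the config was written to accel.yaml, the process count can be changed in place as in this sketch; the file name and the target value of 4 are assumptions, and editing the file by hand works just as well.

```bash
# Hedged example: accel.yaml and the value 4 are assumptions; edit by hand if you prefer.
sed -i 's/^num_processes:.*/num_processes: 4/' accel.yaml
```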

2. Set up wandb:

Create a project called "clipvip" in your wandb account, then run:

wandb login
wandb online
3. Run the code:

cd ./CLIP-ViP
bash src/tasks/run.sh

$CONFIG_PATH should be set to one of the .json config files under src/configs postfixed with _retrieval.json. For example, use src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json to finetune CLIP-ViP-B/32 on MSR-VTT retrieval. For the model, pretrain_vip_base_32.json and pretrain_vip_base_16.json are currently supported. For the dataset, MSR-VTT, DiDeMo, LSMDC, and ActivityNet Captions are supported. A concrete sketch is given below.
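As an example for CLIP-ViP-B/32 on MSR-VTT, a minimal sketch follows; it assumes src/tasks/run.sh picks $CONFIG_PATH up from the environment, so if the script hard-codes its config path instead, edit it there.

```bash
# Hedged sketch: exporting CONFIG_PATH assumes run.sh reads it from the environment.
cd ./CLIP-ViP
export CONFIG_PATH=src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json
bash src/tasks/run.sh
```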

4. Test the code:

accelerate launch --config_file /home/s2/dayelee/accel.yaml ./run_video_retrieval_val.py --config /home/s2/dayelee/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir /home/s2/dayelee --num_train_epochs 5 --output_dir /shared/s2/lab01/dayelee_store/checkpoints/msrvtt --train_batch_size 16 --e2e_weights_path /home/s2/dayelee/dayelee_store/checkpoints/msrvtt_16_gpu_2/epoch_5_bs_16_lr_1e-06_31/model_best.pt

Make sure you do not pass --is_train when testing, and set --e2e_weights_path to load a model checkpoint.

Citation

If you find the code and pre-trained models useful for your research, please consider citing our paper:

@article{xue2022clip,
  title={CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment},
  author={Xue, Hongwei and Sun, Yuchong and Liu, Bei and Fu, Jianlong and Song, Ruihua and Li, Houqiang and Luo, Jiebo},
  journal={arXiv preprint arXiv:2209.06430},
  year={2022}
}

@inproceedings{xue2022advancing,
  title={Advancing high-resolution video-language representation with large-scale video transcriptions},
  author={Xue, Hongwei and Hang, Tiankai and Zeng, Yanhong and Sun, Yuchong and Liu, Bei and Yang, Huan and Fu, Jianlong and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5036--5045},
  year={2022}
}

Acknowledgements

The code is based on HD-VILA.