By Hongwei Xue*, Yuchong Sun*, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo.
This repo is the official PyTorch implementation of CLIP-ViP: Adapting Image-Text Pre-training to Video-Language Representation Learning, accepted by ICLR 2023. CLIP-ViP is a video-language model that starts from the pre-trained image-text model CLIP and is further pre-trained (post-pretraining) on the large-scale video-text dataset HD-VILA-100M. This repo contains the code of the CLIP-ViP model, the post-pretraining method, and finetuning on text-to-video retrieval.
We provide a Docker image for easier reproduction: `tiankaihang/azureml_docker:horovod`. We use mixed-precision training; hence, GPUs with Tensor Cores are recommended.
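A minimal sketch of pulling the image and starting a container (the mount point and GPU flags below are illustrative; adjust them to your machine):

```bash
# pull the provided image
docker pull tiankaihang/azureml_docker:horovod

# start an interactive container with GPU access and the current directory mounted (paths illustrative)
docker run --gpus all -it --rm \
    -v $(pwd):/workspace \
    -w /workspace \
    tiankaihang/azureml_docker:horovod bash
```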
- Download Data.

  Download HD-VILA-100M and other required data following the instructions of HD-VILA. Also download the auxiliary captions from: Azure Blob Link
- Download pretrained models.

  We release the CLIP-ViP model under two settings:
  - CLIP-ViP-B/32: Azure Blob Link
  - CLIP-ViP-B/16: Azure Blob Link
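If you fetch the blob contents with azcopy, the commands look roughly like the following; the destination folders are only illustrative, and `<AZURE_BLOB_URL>` stands for the corresponding link above:

```bash
# auxiliary captions for HD-VILA-100M (destination folder is an example)
azcopy copy "<AZURE_BLOB_URL>" ./data/hdvila_aux_captions --recursive

# pre-trained CLIP-ViP checkpoints (destination folder is an example)
azcopy copy "<AZURE_BLOB_URL>" ./pretrained_models --recursive
```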
### Pre-training

```bash
# inside the container
horovodrun -np $NUM_GPUS python src/pretrain/run_pretrain.py \
    --config $CONFIG_PATH
```

`$CONFIG_PATH` should be set to one of the .json config files available under `src/configs/pretrain`. Currently, `pretrain_vip_base_32.json` and `pretrain_vip_base_16.json` are supported.
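For example, a concrete launch for the B/32 setting on 8 GPUs might look like this (the GPU count is arbitrary, and the full config path assumes the file sits directly under `src/configs/pretrain`):

```bash
# inside the container: post-pretrain CLIP-ViP-B/32 on 8 GPUs
NUM_GPUS=8
CONFIG_PATH=src/configs/pretrain/pretrain_vip_base_32.json
horovodrun -np $NUM_GPUS python src/pretrain/run_pretrain.py \
    --config $CONFIG_PATH
```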
To run text-to-video retrieval with horovod:

- With `--is_train` (training):

  ```bash
  HOROVOD_CACHE_CAPACITY=4096 horovodrun -np 2 python run_video_retrieval_hj.py --config /data2/Hyejin/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./ --is_train
  ```

- Without `--is_train` (evaluation only):

  ```bash
  HOROVOD_CACHE_CAPACITY=4096 horovodrun -np 2 python run_video_retrieval_hj.py --config /data2/Hyejin/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./
  ```
### Text-to-Video Retrieval Finetuning
1. Set up accelerate (only the first time):

   ```bash
   accelerate config --config_file [path/to/store/config_file]
   ```

   To use a different number of GPUs, open the generated .yaml file and change the number of processes (see the sketch after this list).

2. Set up wandb: create a project called "clipvip" in your wandb account, then log in and switch to online mode:

   ```bash
   wandb login
   wandb online
   ```

3. Run the code:

   ```bash
   cd ./CLIP-ViP
   bash src/tasks/run.sh
   ```
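As noted in step 1, the GPU count lives in the accelerate config file. A hedged sketch of checking and editing it from the shell (the `num_processes` key is what a typical accelerate config uses; verify against the file you actually generated):

```bash
# inspect the generated accelerate config
cat [path/to/store/config_file]

# change the number of processes, e.g. to 4 GPUs (key name assumed from a typical accelerate config)
sed -i 's/^num_processes: .*/num_processes: 4/' [path/to/store/config_file]
```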
`$CONFIG_PATH` should be set to one of the .json config files available under `src/configs`, postfixed with `_retrieval.json`. For example, you can use `src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json` for finetuning CLIP-ViP-B/32 on MSR-VTT retrieval. For the model, currently `pretrain_vip_base_32.json` and `pretrain_vip_base_16.json` are supported. For the dataset, MSR-VTT, DiDeMo, LSMDC, and ActivityNet Captions are supported.
- Test the code:

  ```bash
  accelerate launch --config_file /home/s2/dayelee/accel.yaml ./run_video_retrieval_val.py --config /home/s2/dayelee/1517_XPretrain/CLIP-ViP/src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir /home/s2/dayelee --num_train_epochs 5 --output_dir /shared/s2/lab01/dayelee_store/checkpoints/msrvtt --train_batch_size 16 --e2e_weights_path /home/s2/dayelee/dayelee_store/checkpoints/msrvtt_16_gpu_2/epoch_5_bs_16_lr_1e-06_31/model_best.pt
  ```

  Make sure that you do not pass `--is_train` when testing, and set `--e2e_weights_path` to load a model checkpoint.
If you find the code and pre-trained models useful for your research, please consider citing our paper:
```bibtex
@article{xue2022clip,
  title={CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment},
  author={Xue, Hongwei and Sun, Yuchong and Liu, Bei and Fu, Jianlong and Song, Ruihua and Li, Houqiang and Luo, Jiebo},
  journal={arXiv preprint arXiv:2209.06430},
  year={2022}
}

@inproceedings{xue2022advancing,
  title={Advancing high-resolution video-language representation with large-scale video transcriptions},
  author={Xue, Hongwei and Hang, Tiankai and Zeng, Yanhong and Sun, Yuchong and Liu, Bei and Yang, Huan and Fu, Jianlong and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5036--5045},
  year={2022}
}
```
The code is based on HD-VILA.