✨This is the official implementation of the ICCV 2023 paper Accurate and Fast Compressed Video Captioning.
In this work, we propose an end-to-end video captioning method based on compressed-domain information from encoded H.264 videos. Our approach generates captions for compressed videos accurately while remaining fast and efficient.
By releasing this code, we hope to facilitate further research and development in compressed video processing. If you find this work useful in your research, please consider citing our paper.
To run the code, please install the dependencies using the following commands:
sudo apt update && sudo apt install default-jre -y # required by pycocoevalcap
pip3 install -r requirements.txt
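To verify the environment before moving on, a quick sanity check along these lines should work; it assumes pycocoevalcap was installed via requirements.txt, as implied by the comment above:
java -version  # pycocoevalcap calls Java when computing some caption metrics
python3 -c "import pycocoevalcap; print('pycocoevalcap OK')"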
Additionally, you will need to install the compressed video reader as described in the README.md of AcherStyx/Compressed-Video-Reader.
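The typical flow is to clone that repository and install it into the same Python environment. The steps below are illustrative only; the pip3 install . step is an assumption, and the authoritative instructions are in that project's README:
git clone https://github.com/AcherStyx/Compressed-Video-Reader.git
cd Compressed-Video-Reader
pip3 install .  # assumed install step -- follow the upstream README if it differs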
Our model is built on pretrained CLIP. You can run the following script to download the weights ahead of training and avoid any network issues:
sudo apt update && sudo apt install aria2 -y # install aria2
bash model_zoo/download_model.sh
This will download the CLIP model to model_zoo/clip_model. Note that this directory is hard-coded in our code.
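As a quick check that the download succeeded and the hard-coded path will resolve, you can list the directory (the exact file names inside depend on the download script):
ls -lh model_zoo/clip_model  # should contain the downloaded CLIP weights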
We have conducted experiments on three video captioning datasets: MSRVTT, MSVD, and VATEX. The datasets are stored in the dataset folder under the project root. For detailed instructions on downloading and preparing the training data, please refer to dataset/README.md.
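Once the data is prepared, you can confirm the per-dataset folders are in place; the folder names suggested in the comment are assumptions for illustration, and dataset/README.md is authoritative:
ls dataset/  # e.g. msrvtt/, msvd/, vatex/ after preparation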
The training is configured using YAML, and all the configurations are listed in configs/compressed_video. You can use the following commands to run the experiments:
# msrvtt
python3 mm_video/run_net.py --cfg configs/compressed_video/msrvtt_captioning.yaml
# msvd
python3 mm_video/run_net.py --cfg configs/compressed_video/msvd_captioning.yaml
# vatex
python3 mm_video/run_net.py --cfg configs/compressed_video/vatex_captioning.yaml
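If you want to run all three benchmarks back to back, a simple shell loop over the same entry point and configs works:
for name in msrvtt msvd vatex; do
    python3 mm_video/run_net.py --cfg configs/compressed_video/${name}_captioning.yaml
done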
By default, the logs and results will be saved to ./log/<experiment_name>/. The loss and metrics are visualized using TensorBoard.
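To monitor the curves during or after training, point TensorBoard at the log root (assuming TensorBoard is installed in your environment, e.g. via requirements.txt):
tensorboard --logdir ./log --port 6006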
TBD