
UniVTG (ICCV'23)


TL;DR: UniVTG is the first pretraining model for video temporal grounding. It unifies diverse temporal annotations to power moment retrieval (interval), highlight detection (curve), and video summarization (point).


πŸ“’ News

  • [2023.10.15] Upload the CLIP teacher scripts for creating scalable pseudo annotations.
  • [2023.8.22] Code cleaning; add training/inference instructions; upload all downstream checkpoints.
  • [2023.8.6] Create the Hugging Face Space demo!
  • [2023.7.31] Release the arXiv paper, code, checkpoints, and Gradio demo.

πŸ“ Todo

  • Connect UniVTG with LLMs, e.g., ChatGPT.
  • Upload all downstream checkpoints.
  • Upload all pretraining checkpoints.

🌟 Run on your own video

To support practical usage, we release the following checkpoints. They run on a single GPU with less than 4 GB of memory and are highly efficient: temporal grounding takes under one second, even on a 10-minute video.

| Video Enc. | Text Enc. | Pretraining | Fine-tuning | Checkpoints |
|---|---|---|---|---|
| CLIP-B/16 | CLIP-B/16 | 4M | - | Google Drive |
| CLIP-B/16 | CLIP-B/16 | 4M | QVHL + Charades + NLQ + TACoS + ActivityNet + DiDeMo | Google Drive |

1. Download the checkpoint and put it under results/omni/.

2. Download the example videos from here and put them under examples/.

3. Launch the Gradio demo: python3 main_gradio.py --resume ./results/omni/model_best.ckpt
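
Putting the steps together as shell commands (the files come from the download links above):

```bash
# Create the expected directories, then drop the downloaded files into them:
#   results/omni/model_best.ckpt  <- checkpoint from the table above
#   examples/                     <- example videos
mkdir -p results/omni examples
python3 main_gradio.py --resume ./results/omni/model_best.ckpt
```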

[ YouTube video ] · [ Egocentric video ] · [ Charades video ]

βš™οΈ Preparation

Please follow the instructions in install.md to set up the environment and datasets.

πŸ“¦ Model Zoo

Download the checkpoints listed in model.md to reproduce the benchmark results.

πŸš€ Training & Inference

We use Slurm to launch jobs; if you are not on a Slurm cluster, you may need to slightly modify the scripts to fit your environment (a minimal adaptation is sketched after the commands below).

Pretraining (multi-gpu)

Large-scale pretraining: bash scripts/pretrain.sh

Multi-dataset co-training: bash scripts/cotrain.sh
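
If you are not using Slurm, one common adaptation is to launch the multi-GPU run with torchrun instead. This is a hedged sketch, not the repo's actual launcher; main.py and the GPU count are placeholders for whatever scripts/pretrain.sh really invokes:

```bash
# Hedged non-Slurm sketch: copy the real entry point and arguments out of
# scripts/pretrain.sh, then launch them with torchrun for multi-GPU training.
# "main.py" and --nproc_per_node=4 are assumptions, not the repo's defaults.
torchrun --nproc_per_node=4 main.py  # append the arguments from the script
```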

Downstream (single-gpu)

Pass --resume to initialize the model with pretrained weights. Refer to the Model Zoo for detailed parameter settings.

Training: bash scripts/qvhl_pretrain.sh

Pass --eval_init and --n_epoch=0 to evaluate the checkpoint selected by --resume without any training (see the sketch below).

Inference: bash scripts/qvhl_inference.sh
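
To make the flag combination concrete, here is a hedged sketch of an evaluation-only invocation; main.py is a placeholder for whatever entry point scripts/qvhl_inference.sh actually calls:

```bash
# Hedged sketch: check scripts/qvhl_inference.sh for the real entry point and
# full flag set. --resume loads the chosen checkpoint, --eval_init evaluates
# it immediately, and --n_epoch=0 skips the training loop.
python3 main.py --resume ./results/omni/model_best.ckpt --eval_init --n_epoch=0
```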

CLIP teacher to create scalable pseudo labels

  1. Download the Open Images V6 class list from https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv.

  2. Convert it to json with python3 teacher/csv2json.py, then extract the textual class features with python3 teacher/label2feature.py.

  3. Make sure the video features have already been extracted, then run python3 teacher/clip2labels.py to generate the pseudo labels. The whole pipeline is chained below.
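
Chained together, the pipeline looks like this (any arguments beyond those shown above are assumptions; check each script before running):

```bash
# Prerequisite: video features for your corpus are already extracted.
# 1. Fetch the Open Images V6 class list.
wget https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv
# 2. Convert the class list to json, then embed the class names with CLIP.
python3 teacher/csv2json.py
python3 teacher/label2feature.py
# 3. Match video features against class features to produce pseudo labels.
python3 teacher/clip2labels.py
```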

🎨 Visualization

If you want to draw visualizations like those in our paper, simply run python3 plot/qvhl.py to generate the figures from prediction jsons (downloadable from the Model Zoo).
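
For example (assuming the prediction jsons have been placed where plot/qvhl.py expects them):

```bash
# Download the prediction jsons from the Model Zoo (model.md) first,
# then render the paper-style figures.
python3 plot/qvhl.py
```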


πŸŽ“ Citation

If you find our work helpful, please cite our paper.

@misc{lin2023univtg,
      title={UniVTG: Towards Unified Video-Language Temporal Grounding}, 
      author={Kevin Qinghong Lin and Pengchuan Zhang and Joya Chen and Shraman Pramanick and Difei Gao and Alex Jinpeng Wang and Rui Yan and Mike Zheng Shou},
      year={2023},
      eprint={2307.16715},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

βœ‰οΈ Contact

This repo is maintained by Kevin. Questions and discussion are welcome via [email protected] or by opening an issue.

😊 Acknowledgement

This codebase builds on moment_detr, HERO_Video_Feature_Extractor, and UMT.

We thank the authors for their open-source contributions.