- Authors: Boris Meinardus, Hector Garcia Rodriguez, Anil Batra, Anna Rohrbach, Marcus Rohrbach
- Paper: [arXiv](https://arxiv.org/abs/2406.18113)
The recent success of Large Language Models (LLMs) has prompted their extension to the multimodal domain, first with image-text Multimodal LLMs (MLLMs) and then with video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to best encode contextual and temporal information. Interestingly, we find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to an image-text pretrained MLLM. Through extensive ablations across different MLLM architectures, finetuning and zero-shot settings, and different datasets, we achieve a new state of the art in moment retrieval on the most widely used benchmarks, Charades-STA, QVHighlights, and ActivityNet Captions, as well as in grounded video question answering on NeXT-GQA.
- ./mr_BLIP_data: data & data preprocessing
- ./mr_BLIP_checkpoints: pretrained checkpoints
- ./lavis/: Mr. BLIP code
- ./run_scripts: running scripts for Mr. BLIP training and inference
- (Optional) Create a conda environment
conda create -n mrBlip python=3.8
conda activate mrBlip
- Install dependencies
pip install -r requirements.txt
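After installation, a quick sanity check can confirm that PyTorch, which the training and evaluation scripts rely on, sees your GPUs. This one-liner is just a convenience and assumes torch is installed via the requirements file:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"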
We train Mr. BLIP on QVHighlights, Charades-STA, and ActivityNet Captions and provide the checkpoints. Download the checkpoints and put them under ./mr_BLIP_checkpoints.
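For example, a layout like the following works; the subfolder names below are only illustrative (not the actual file names shipped with the checkpoints), so use whatever names the download provides and point the evaluation configs at them:
mkdir -p mr_BLIP_checkpoints
# illustrative layout, one checkpoint per training dataset:
# mr_BLIP_checkpoints/qvh/...
# mr_BLIP_checkpoints/charades/...
# mr_BLIP_checkpoints/anet/...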
We test our model on QVHighlights, Charades-STA, and ActivityNet Captions. Please download the original moment retrieval (MR) data and preprocess it via our scripts.
We provide Mr. BLIP training and inference script examples below. Please refer to the dataset page to customize your data path.
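If you want to locate every place a data path is configured, a simple search is often enough; this is a convenience sketch and assumes the configs reference the default ./mr_BLIP_data location (adjust the search string to your setup if they do not):
grep -rn "mr_BLIP_data" lavis/ run_scripts/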
You might want to update the config files for the respective runs to fit your machine. They are currently set to run on 8 A100-80GB GPUs. To fit on a smaller GPU, you can simply reduce the batch size, reduce the number of frames, or apply a frame-level embedding aggregation (32 frame tokens -> 1 token).
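As one possibility, the batch size can also be overridden from the command line instead of editing the YAML, assuming the run scripts wrap LAVIS's train.py (which accepts --cfg-path and --options) as in the upstream LAVIS repo. The config path and key name below are assumptions for illustration; check the project configs and the scripts under ./run_scripts for the real values:
# hypothetical config path and key name; adapt to the actual project config
python train.py --cfg-path lavis/projects/mr_blip/train/qvh.yaml --options run.batch_size_train=2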
sh run_scripts/mr_BLIP/train/qvh.sh
sh run_scripts/mr_BLIP/train/charades.sh
sh run_scripts/mr_BLIP/train/anet.sh
Evaluating on QVHighlights with the script below should roughly return:
| | R1@0.5 | R1@0.7 | mIoU | mAP@0.5 | mAP@0.75 |
|---|---|---|---|---|---|
| Mr. BLIP | 76.16 | 62.63 | 70.32 | 68.50 | 55.06 |
sh run_scripts/mr_BLIP/eval/qvh.sh
Evaluating on Charades-STA with the script below should roughly return:
| | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|
| Mr. BLIP | 69.31 | 49.29 | 58.63 |
sh run_scripts/mr_BLIP/eval/charades.sh
Evaluating on ActivityNet Captions with the script below should roughly return:
| | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|
| Mr. BLIP | 53.79 | 35.47 | 51.52 |
sh run_scripts/mr_BLIP/eval/anet.sh
We thank the developers of LAVIS and BLIP-2 for their public code release.
Please cite our paper if you use our models in your work:
@article{meinardus2025chronosimpleblueprintrepresenting,
title={Chrono: A Simple Blueprint for Representing Time in MLLMs},
author={Boris Meinardus and Hector Garcia Rodriguez and Anil Batra and Anna Rohrbach and Marcus Rohrbach},
year={2025},
eprint={2406.18113},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.18113},
}