PyTorch implementation of the following paper:
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang1,2, Dongdong Chen3, YiWeng Xie1,2, Chong Luo4, Xiyang Dai3, Lu Yuan3, Zuxuan Wu1,2†, Yu-Gang Jiang1,2†
1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University.
2Shanghai Collaborative Innovation Center on Intelligent Visual Computing.
3Microsoft Cloud + AI, 4Microsoft Research Asia.
† denotes corresponding authors.
- 🔥 We upgrade ChatVideo by integrating a more powerful tracking model (SAM2) and a more powerful captioning model (LLaVA-Next).
- 🚀 ChatVideo allows ChatGPT to watch videos for you. For the first time, it enables instance-level understanding of videos by detecting, tracking, and captioning tracklets (see the sketch below).
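To make the tracklet-centric pipeline concrete, below is a minimal, self-contained sketch of the detect → track → caption → query flow. Everything in it (the `Tracklet` class and the three stage functions) is an illustrative placeholder standing in for Grounding-DINO, SAM2, and LLaVA-Next respectively; it is not the actual ChatVideo code.

```python
# Conceptual sketch only: the three stage functions are placeholders for the real
# detector (Grounding-DINO), tracker (SAM2), and captioner (LLaVA-Next).
from dataclasses import dataclass, field

@dataclass
class Tracklet:
    track_id: int
    category: str                               # detector label, e.g. "person"
    boxes: dict = field(default_factory=dict)   # frame index -> [x1, y1, x2, y2]
    caption: str = ""                           # description produced by the captioner

def detect_objects(frame, prompt):
    # Placeholder for open-vocabulary detection on a key frame.
    return [Tracklet(track_id=0, category=prompt)]

def track_instances(frames, seeds):
    # Placeholder for propagating each detected instance through the whole video.
    for t in seeds:
        t.boxes = {i: [0, 0, 1, 1] for i in range(len(frames))}
    return seeds

def caption_tracklet(frames, tracklet):
    # Placeholder for tracklet-level captioning.
    return f"a {tracklet.category} visible in {len(tracklet.boxes)} frames"

def build_tracklet_database(frames, prompt):
    """Detect -> track -> caption, then flatten the tracklets into text an LLM can reason over."""
    tracklets = track_instances(frames, detect_objects(frames[0], prompt))
    for t in tracklets:
        t.caption = caption_tracklet(frames, t)
    return "\n".join(f"[track {t.track_id}] {t.category}: {t.caption}" for t in tracklets)

# The resulting text database, together with the user's question, is what ChatGPT reasons over.
print(build_tracklet_database(frames=[None] * 8, prompt="person"))
```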
ChatVideo for Appearance Understanding.
ChatVideo for Motion Understanding.
ChatVideo for Audio Understanding.
git clone https://github.com/yiwengxie/Chat-Video.git
cd Chat-Video
First, create and activate the conda environment:
conda create --name chat python=3.10
conda activate chat
# Install PyTorch:
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
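Optionally, verify that the PyTorch build can see CUDA before continuing:
# Should print the PyTorch version, True, and the CUDA version installed above
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"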
# Grounding-DINO and SAM2
cd projects/GroundedSAM2/
pip install -e .
pip install --no-build-isolation -e grounding_dino
# LLaVA NeXT (run from the repository root)
cd projects/LLaVA_NeXT/
pip install -e ".[train]"
# Install other dependencies:
pip install -r requirements.txt
# Download the SAM2 and Grounding-DINO checkpoints (from the repository root):
cd projects/GroundedSAM2/checkpoints
bash download_ckpts.sh
cd ../gdino_checkpoints
bash download_ckpts.sh
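If you want a quick sanity check that both download scripts succeeded, listing the two checkpoint folders is enough (run from gdino_checkpoints/, where the previous step left off); the exact filenames depend on what each download_ckpts.sh fetches:
# Both folders should now contain the downloaded .pt / .pth weight files
ls ../checkpoints ../gdino_checkpoints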
This project is released under the MIT license. Please see the LICENSE file for more information.
We appreciate the open source of the following projects: SAM2, Grounded-SAM2, LLaVA-Next, Whisper, UNINEXT, BLIP2.
If you find this repository helpful, please consider citing:
@article{wang2023chatvideo,
  title={ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System},
  author={Wang, Junke and Chen, Dongdong and Luo, Chong and Dai, Xiyang and Yuan, Lu and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2304.14407},
  year={2023}
}

@inproceedings{wang2022omnivl,
  title={Omnivl: One foundation model for image-language and video-language tasks},
  author={Wang, Junke and Chen, Dongdong and Wu, Zuxuan and Luo, Chong and Zhou, Luowei and Zhao, Yucheng and Xie, Yujia and Liu, Ce and Jiang, Yu-Gang and Yuan, Lu},
  booktitle={NeurIPS},
  year={2022}
}

@article{wang2023omnitracker,
  title={Omnitracker: Unifying object tracking by tracking-with-detection},
  author={Wang, Junke and Chen, Dongdong and Wu, Zuxuan and Luo, Chong and Dai, Xiyang and Yuan, Lu and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2303.12079},
  year={2023}
}

@inproceedings{wang2024omnivid,
  title={Omnivid: A generative framework for universal video understanding},
  author={Wang, Junke and Chen, Dongdong and Luo, Chong and He, Bo and Yuan, Lu and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={CVPR},
  year={2024}
}