# Awesome-Video-Understanding-with-LLM

TBD: Taxonomy

## Video Understanding

### LLM as a Controller

| Title | Date | Code | Data | Venue |
| --- | --- | --- | --- | --- |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 06/2023 | code | - | |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | 06/2023 | code | - | |
| VLog: Video as a Long Document | - | demo | - | |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | 04/2023 | code | - | |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | 04/2023 | project page | - | |
| VideoChat: Chat-Centric Video Understanding | 05/2023 | code | demo | |
| VideoLLM: Modeling Video Sequence with Large Language Models | 05/2023 | code | - | |
| Self-Chained Image-Language Model for Video Localization and Question Answering | 05/2023 | code | - | |
| [Learning Video Representations from Large Language Models](https://arxiv.org/abs/2212.04501) | 12/2022 | [code](https://github.com/facebookresearch/lavila) | - | |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | 04/2022 | project page | - | |
| CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos | 03/2023 | code | - | |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | 06/2023 | code | - | |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 02/2023 | code | - | |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | 03/2023 | code | - | |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 06/2023 | code | - | |
| Garbage in, garbage out: Zero-shot detection of crime using Large Language Models | 07/2023 | code | - | |
| A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot | 05/2023 | - | - | |
| Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | 04/2023 | - | - | |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | 06/2023 | - | - | |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | 05/2022 | code | - | |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | 05/2023 | - | - | |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 07/2023 | code | - | - |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 06/2023 | code | - | - |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 05/2023 | code | - | - |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | 07/2023 | code | - | - |
| FunQA: Towards Surprising Video Comprehension | 06/2023 | code | - | - |

### End-to-end Models

| Title | Date | Code | Data | Venue |
| --- | --- | --- | --- | --- |
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | 06/2023 | code | - | |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | 06/2023 | code | - | |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 04/2023 | code | - | |

## Video Generation

| Title | Date | Code | Data | Venue |
| --- | --- | --- | --- | --- |
| Generative Pretraining in Multimodality | 07/2023 | - | - | |
| NExT-GPT: Any-to-Any Multimodal LLM | 09/2023 | - | - | |