Qwen2.5 is the large language model series developed by the Qwen team at Alibaba Cloud.
Official code release of "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition"
SkyReels V1: The first and most advanced open-source human-centric video foundation model
Motion-Controllable Video Diffusion via Warped Noise
Simple Controlnet module for CogvideoX model.
"VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos"
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Official Implementation of paper "MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion"
[CVPR 2024] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
A Simple Framework of Small-scale Large Multimodal Models for Video Understanding Based on TinyLLaVA_Factory.
[ECCV 2024] DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting
[ICLR 2025] Official implementation of MotionClone: Training-Free Motion Cloning for Controllable Video Generation
A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World
[ECCV 2024] MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.
Investigating CoT Reasoning in Autoregressive Image Generation
Frontier Multimodal Foundation Models for Image and Video Understanding
VideoWorld is a simple generative model that learns purely from unlabeled videos, much like how babies learn by observing their environment.
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
[arXiv'24] [Image-to-Scene on a 4090(24G)] VistaDream: Sampling multiview consistent images for single-view scene reconstruction
The official implementation of "RepVideo: Rethinking Cross-Layer Representation for Video Generation"
JARVIS, a system to connect LLMs with the ML community. Paper: https://arxiv.org/pdf/2303.17580.pdf
Aligning pretrained language models with instruction data generated by the models themselves.
The code for the paper "C3: Zero-shot Text-to-SQL with ChatGPT"
The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"
[arXiv'25] GameFactory: Creating New Games with Generative Interactive Videos
Official implementation of the paper "Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content".