Stars
[ECCV 2024] DragAnything: Motion Control for Anything using Entity Representation
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
A machine learning framework for reconstructing articulated 3D animals from images
[NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
[ICCV 2023] ProPainter: Improving Propagation and Transformer for Video Inpainting
Long Context Transfer from Language to Vision
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
1.5−3.0× lossless training or pre-training speedup. An off-the-shelf, easy-to-implement algorithm for the efficient training of foundation visual backbones.
(ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator
Multilingual Medicine: Model, Dataset, Benchmark, Code
TALL: Temporal Activity Localization via Language Query
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
[SIGGRAPH'24] 2D Gaussian Splatting for Geometrically Accurate Radiance Fields
CapDec: SOTA Zero Shot Image Captioning Using CLIP and GPT2, EMNLP 2022 (findings)
[ECCV 2024 Oral] MotionDirector: Motion Customization of Text-to-Video Diffusion Models.
This repository contains curated prompts aimed at maximizing the effectiveness of Sora for generating videos.
An efficient pure-PyTorch implementation of Kolmogorov-Arnold Network (KAN).