Awesome Autoregressive Visual Generation Models

A curated list of recent autoregressive models for image/video generation, editing, restoration, etc. (focusing only on the next-set prediction paradigm).

Table of Contents

Video Generation
Long Video / Film Generation
Image Generation
Controllable Image Generation
Multimodal Visual Generation
Survey
Others

Video Generation

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models (5 Dec 2024)
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (28 Oct 2024)
ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction (7 Oct 2024)
VideoPoet: A Large Language Model for Zero-Shot Video Generation (21 Dec 2023 & ICML 2024)
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities (9 Nov 2023 & CVPR 2024)
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation (9 Oct 2023 & ICLR 2024)
MOSO: Decomposing MOtion, Scene and Object for Video Prediction (7 Mar 2023 & CVPR 2023)
MaskViT: Masked Visual Pre-Training for Video Prediction (23 Jun 2022)
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation (23 Nov 2022 & CVPR 2023)
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (29 May 2022 & ICLR 2023)
VideoGPT: Video Generation using VQ-VAE and Transformers (20 Apr 2021)

Long Video / Film Generation

Advancing Auto-Regressive Continuation for Video Frames (4 Dec 2024)
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation (27 Oct 2024)
Loong: Generating Minute-level Long Videos with Autoregressive Language Models (3 Oct 2024)
Phenaki: Variable length video generation from open domain textual descriptions (5 Oct 2022 & ICLR 2023)
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer (7 Apr 2022 & ECCV 2022)

Image Generation

Taming Scalable Visual Tokenizer for Autoregressive Image Generation (3 Dec 2024)
Normalizing Flows are Capable Generative Models (9 Dec 2024)
Liquid: Language Models are Scalable Multi-modal Generators (5 Dec 2024)
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis (5 Dec 2024)
ZipAR: Accelerating Auto-Regressive Image Generation through Spatial Locality (5 Dec 2024)
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization (26 Nov 2024 & NeurIPS 2024)
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient (26 Nov 2024)
Factorized Visual Tokenization and Generation (25 Nov 2024)
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs (24 Nov 2024)
High-Resolution Image Synthesis via Next-Token Prediction (22 Nov 2024)
Continuous Speculative Decoding for Autoregressive Image Generation (18 Nov 2024)
CART: Compositional Auto-Regressive Transformer for Image Generation (15 Nov 2024)
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation (15 Nov 2024)
Randomized Autoregressive Visual Generation (1 Nov 2024)
Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction (24 Oct 2024)
Elucidating the design space of language models for image generation (21 Oct 2024)
BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities (18 Oct 2024)
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (17 Oct 2024)
Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance (17 Oct 2024)
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective (16 Oct 2024)
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer (14 Oct 2024)
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation (10 Oct 2024)
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding (4 Oct 2024)
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding (2 Oct 2024)
AiM: Scalable Autoregressive Image Generation with Mamba (22 Aug 2024)
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (5 Aug 2024)
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling (2 Aug 2024)
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis (10 Jul 2024)
Autoregressive Image Generation without Vector Quantization (17 Jun 2024 & NeurIPS 2024)
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (10 Jun 2024)
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (3 Apr 2024 & NeurIPS 2024)
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (20 May 2023 & CVPR 2023)
Muse: Text-to-image generation via masked generative transformers (2 Jan 2023 & ICML 2023)
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis (6 Dec 2022 & CVPR 2024)
Vector-quantized Image Modeling with Improved VQGAN (9 Oct 2021 & ICLR 2022)
MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation (19 Sep 2022 & NeurIPS 2022)
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (22 Jun 2022 & ICLR 2024)
Autoregressive Image Generation using Residual Quantization (3 Mar 2022 & CVPR 2022)
MaskGIT: Masked Generative Image Transformer (8 Feb 2022 & CVPR 2022)
Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer (9 Jun 2022 & NeurIPS 2022)
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis (19 Aug 2021 & NeurIPS 2021)
Zero-Shot Text-to-Image Generation (24 Feb 2021 & ICML 2021)
Taming Transformers for High-Resolution Image Synthesis (17 Dec 2020 & CVPR 2021)

Controllable Image Generation

CAR: Controllable Autoregressive Modeling for Visual Generation (7 Oct 2024)
ControlAR: Controllable Image Generation with Autoregressive Models (3 Oct 2024)

Multimodal Visual Generation

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (12 Nov 2024)
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (17 Oct 2024)
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (17 Oct 2024)
Emu3: Next-Token Prediction is All You Need (27 Sep 2024)
MIO: A Foundation Model on Multimodal Tokens (26 Sep 2024)
MonoFormer: One Transformer for Both Diffusion and Autoregression (24 Sep 2024)
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (6 Sep 2024)
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (20 Aug 2024)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (22 Aug 2024)
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (8 Jul 2024)
X-VILA: Cross-Modality Alignment for Large Language Model (29 May 2024)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (16 May 2024)
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (22 Apr 2024)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (27 Mar 2024)
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (19 Feb 2024)
World Model on Million-Length Video And Language With Blockwise RingAttention (13 Feb 2024)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (5 Feb 2024 & ICML 2024)
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (18 Jan 2024)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (18 Jan 2024)
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (28 Dec 2023)
Emu2: Generative Multimodal Models are In-Context Learners (20 Dec 2023 & CVPR 2024)
Gemini: A Family of Highly Capable Multimodal Models (19 Dec 2023)
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (14 Dec 2023)
DreamLLM: Synergistic Multimodal Comprehension and Creation (26 Sep 2023 & ICLR 2024)
NExT-GPT: Any-to-Any Multimodal LLM (11 Sep 2023 & ICML 2024)
LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (9 Sep 2023 & ICLR 2024)
Emu: Generative Pretraining in Multimodality (11 Jul 2023 & ICLR 2024)
CoDi: Any-to-Any Generation via Composable Diffusion (19 May 2023 & NeurIPS 2023)

Survey

A Survey on Vision Autoregressive Model (13 Nov 2024)
Autoregressive Models in Vision: A Survey (8 Nov 2024)

Others

Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory (Mar 2022 & CVPR 2022)
