A curated list of recent autoregressive models for image/video generation, editing, restoration, etc. (focusing only on the next-set prediction paradigm).
- Video Generation
- Long Video / Film Generation
- Image Generation
- Controllable Image Generation
- Multimodal Generation
- Survey
- Others
- DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models (5 Dec 2024)
- LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (28 Oct 2024)
- ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction (7 Oct 2024)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation (21 Dec 2023 & ICML 2024)
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities (9 Nov 2023 & CVPR 2024)
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation (9 Oct 2023 & ICLR 2024)
- MOSO: Decomposing MOtion, Scene and Object for Video Prediction (7 Mar 2023 & CVPR 2023)
- MaskViT: Masked Visual Pre-Training for Video Prediction (23 Jun 2022)
- Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation (23 Nov 2022 & CVPR 2023)
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (29 May 2022 & ICLR 2023)
- VideoGPT: Video Generation using VQ-VAE and Transformers (20 Apr 2021)
- Advancing Auto-Regressive Continuation for Video Frames (4 Dec 2024)
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation (27 Oct 2024)
- Loong: Generating Minute-level Long Videos with Autoregressive Language Models (3 Oct 2024)
- Phenaki: Variable length video generation from open domain textual descriptions (5 Oct 2022 & ICLR 2023)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer (7 Apr 2022 & ECCV 2022)
- Taming Scalable Visual Tokenizer for Autoregressive Image Generation (3 Dec 2024)
- Normalizing Flows are Capable Generative Models (9 Dec 2024)
- Liquid: Language Models are Scalable Multi-modal Generators (5 Dec 2024)
- Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis (5 Dec 2024)
- ZipAR: Accelerating Auto-Regressive Image Generation through Spatial Locality (5 Dec 2024)
- LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization (26 Nov 2024 & NeurIPS 2024)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient (26 Nov 2024)
- Factorized Visual Tokenization and Generation (25 Nov 2024)
- PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs (24 Nov 2024)
- High-Resolution Image Synthesis via Next-Token Prediction (22 Nov 2024)
- Continuous Speculative Decoding for Autoregressive Image Generation (18 Nov 2024)
- CART: Compositional Auto-Regressive Transformer for Image Generation (15 Nov 2024)
- M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation (15 Nov 2024)
- Randomized Autoregressive Visual Generation (1 Nov 2024)
- Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction (24 Oct 2024)
- Elucidating the design space of language models for image generation (21 Oct 2024)
- BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities (18 Oct 2024)
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (17 Oct 2024)
- Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance (17 Oct 2024)
- Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective (16 Oct 2024)
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer (14 Oct 2024)
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation (10 Oct 2024)
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding (4 Oct 2024)
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding (2 Oct 2024)
- AiM: Scalable Autoregressive Image Generation with Mamba (22 Aug 2024)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (5 Aug 2024)
- VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling (2 Aug 2024)
- MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis (10 Jul 2024)
- Autoregressive Image Generation without Vector Quantization (17 Jun 2024 & NeurIPS 2024)
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (10 Jun 2024)
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (3 Apr 2024 & NeurIPS 2024)
- Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (20 Mar 2023 & CVPR 2023)
- Muse: Text-to-image generation via masked generative transformers (2 Jan 2023 & ICML 2023)
- Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis (6 Dec 2022 & CVPR 2024)
- Vector-quantized Image Modeling with Improved VQGAN (9 Oct 2021 & ICLR 2022)
- MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation (19 Sep 2022 & NeurIPS 2022)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (22 Jun 2022 & ICLR 2024)
- Autoregressive Image Generation using Residual Quantization (3 Mar 2022 & CVPR 2022)
- MaskGIT: Masked Generative Image Transformer (8 Feb 2022 & CVPR 2022)
- Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer (9 Jun 2022 & NeurIPS 2022)
- ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis (19 Aug 2021 & NeurIPS 2021)
- Zero-Shot Text-to-Image Generation (24 Feb 2021 & ICML 2021)
- Taming Transformers for High-Resolution Image Synthesis (17 Dec 2020 & CVPR 2021)
- CAR: Controllable Autoregressive Modeling for Visual Generation (7 Oct 2024)
- ControlAR: Controllable Image Generation with Autoregressive Models (3 Oct 2024)
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (12 Nov 2024)
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (17 Oct 2024)
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (17 Oct 2024)
- Emu3: Next-Token Prediction is All You Need (27 Sep 2024)
- MIO: A Foundation Model on Multimodal Tokens (26 Sep 2024)
- MonoFormer: One Transformer for Both Diffusion and Autoregression (24 Sep 2024)
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (6 Sep 2024)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (20 Aug 2024)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (22 Aug 2024)
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (8 Jul 2024)
- X-VILA: Cross-Modality Alignment for Large Language Model (29 May 2024)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (16 May 2024)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (22 Apr 2024)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (27 Mar 2024)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (19 Feb 2024)
- World Model on Million-Length Video And Language With Blockwise RingAttention (13 Feb 2024)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (5 Feb 2024 & ICML 2024)
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (18 Jan 2024)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (18 Jan 2024)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (28 Dec 2023)
- Emu2: Generative Multimodal Models are In-Context Learners (20 Dec 2023 & CVPR 2024)
- Gemini: A Family of Highly Capable Multimodal Models (19 Dec 2023)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (14 Dec 2023)
- DreamLLM: Synergistic Multimodal Comprehension and Creation (26 Sep 2023 & ICLR 2024)
- NExT-GPT: Any-to-Any Multimodal LLM (11 Sep 2023 & ICML 2024)
- LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (9 Sep 2023 & ICLR 2024)
- Emu: Generative Pretraining in Multimodality (11 Jul 2023 & ICLR 2024)
- CoDi: Any-to-Any Generation via Composable Diffusion (19 May 2023 & NeurIPS 2023)
- A Survey on Vision Autoregressive Model (13 Nov 2024)
- Autoregressive Models in Vision: A Survey (8 Nov 2024)
- Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory (Mar 2022 & CVPR 2022)