[NeurIPS 2024] VideoTetris: Towards Compositional Text-To-Video Generation

This repo contains the official implementation of our VideoTetris.

VideoTetris: Towards Compositional Text-To-Video Generation
Ye Tian, Ling Yang*, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
(* Equal Contribution and Corresponding Author)
Peking University, Kuaishou Technology

Introduction

VideoTetris is a novel framework that enables compositional text-to-video (T2V) generation. Specifically, we propose spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of the denoising network spatially and temporally. Moreover, we propose an enhanced video data preprocessing pipeline that improves the training data with respect to motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Our demonstrations include videos ranging from 10 seconds and 30 seconds to 2 minutes, and the method can be extended to even longer durations.
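
As a rough illustration of the spatial half of this idea, the toy sketch below blends per-sub-prompt maps inside region masks and falls back to a global map elsewhere. It is not the released implementation: the names `compose_regions` and `left_right_masks`, the nested-list grids, and the mask-weighted averaging rule are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]  # one H x W map for a single frame


def compose_regions(region_maps: List[Grid], masks: List[Grid], global_map: Grid) -> Grid:
    """Blend per-sub-prompt maps inside their masks (hypothetical rule).

    Where at least one mask is active, the output is the mask-weighted
    average of the region maps; elsewhere it falls back to the global map.
    """
    height, width = len(global_map), len(global_map[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weight = sum(m[y][x] for m in masks)
            if weight > 0:
                blended = sum(r[y][x] * m[y][x] for r, m in zip(region_maps, masks))
                out[y][x] = blended / weight
            else:
                out[y][x] = global_map[y][x]
    return out


def left_right_masks(height: int, width: int) -> List[Grid]:
    """Return two binary masks that split the frame into left and right halves."""
    split = width // 2
    left = [[1.0 if x < split else 0.0 for x in range(width)] for _ in range(height)]
    right = [[0.0 if x < split else 1.0 for x in range(width)] for _ in range(height)]
    return [left, right]


# Toy stand-ins for the maps produced by two sub-prompts and the full prompt.
H, W = 2, 4
dog_map = [[1.0] * W for _ in range(H)]     # e.g. "a cute brown dog"
cat_map = [[2.0] * W for _ in range(H)]     # e.g. "a sleepy cat"
global_map = [[0.0] * W for _ in range(H)]  # full prompt, used outside all masks

frame = compose_regions([dog_map, cat_map], left_right_masks(H, W), global_map)
```

In a temporal setting, the same composition would be applied per frame, with the sub-prompts and masks allowed to change from frame to frame.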

News and Todo List

  • [2024.6.7] VideoTetris paper released
  • Release the inference code of VideoTetris with VideoCrafter2
  • Release the checkpoint of our long compositional video generation model
  • Release VideoTetris with KLing/FIFO-Diffusion

Training and Inference

(TODO)

Example Results

We provide only a few example results here; more detailed results can be found on the project page.

A cute brown dog on the left and a sleepy cat on the right are napping in the sun.
@16 Frames
A cheerful farmer and a hardworking blacksmith are building a barn.
@16 Frames
One cute brown squirrel, on a pile of hazelnuts, cinematic.
------> transitions to
Two cute brown squirrels, on a pile of hazelnuts, cinematic.
------> transitions to
Three cute brown squirrels, on a pile of hazelnuts, cinematic.
------> transitions to
Four cute brown squirrels, on a pile of hazelnuts, cinematic.
@80 Frames
A cute brown squirrel, on a pile of hazelnuts, cinematic.
------> transitions to
A cute brown squirrel and a cute white squirrel, on a pile of hazelnuts, cinematic.
@240 Frames
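
The progressive examples above pair an ordered list of prompts with a total frame count. One way to represent such a prompt sequence is as a frame-indexed schedule, sketched below. The even split across prompts is purely an assumption (this README does not state where the transitions occur), and `even_schedule` and `prompt_at` are hypothetical helper names, not part of the VideoTetris code.

```python
from typing import List, Tuple

Schedule = List[Tuple[int, int, str]]  # (start_frame, end_frame_exclusive, prompt)


def even_schedule(prompts: List[str], total_frames: int) -> Schedule:
    """Assign each prompt a contiguous, roughly equal span of frames (assumed policy)."""
    base, extra = divmod(total_frames, len(prompts))
    schedule: Schedule = []
    start = 0
    for index, prompt in enumerate(prompts):
        # The first `extra` prompts absorb the remainder, one frame each.
        length = base + (1 if index < extra else 0)
        schedule.append((start, start + length, prompt))
        start += length
    return schedule


def prompt_at(schedule: Schedule, frame: int) -> str:
    """Look up the prompt that is active at a given frame index."""
    for start, end, prompt in schedule:
        if start <= frame < end:
            return prompt
    raise IndexError(f"frame {frame} is outside the schedule")


squirrel_prompts = [
    "One cute brown squirrel, on a pile of hazelnuts, cinematic.",
    "Two cute brown squirrels, on a pile of hazelnuts, cinematic.",
    "Three cute brown squirrels, on a pile of hazelnuts, cinematic.",
    "Four cute brown squirrels, on a pile of hazelnuts, cinematic.",
]
schedule = even_schedule(squirrel_prompts, total_frames=80)
```

With this assumed policy, each of the four squirrel prompts covers 20 of the 80 frames, and `prompt_at(schedule, 45)` returns the "Three cute brown squirrels" prompt.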

Citation

@article{tian2024videotetris,
  title={VideoTetris: Towards Compositional Text-to-Video Generation},
  author={Tian, Ye and Yang, Ling and Yang, Haotian and Gao, Yuan and Deng, Yufan and Chen, Jingmin and Wang, Xintao and Yu, Zhaochen and Tao, Xin and Wan, Pengfei and Zhang, Di and Cui, Bin},
  journal={arXiv preprint arXiv:2401.},
  year={2024}
}
