This repository contains the official implementation of our paper:
VideoTetris: Towards Compositional Text-To-Video Generation
Ye Tian, Ling Yang*, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
(* Equal Contribution and Corresponding Author)
Peking University, Kuaishou Technology
VideoTetris is a novel framework that enables compositional text-to-video (T2V) generation. Specifically, we propose spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of the denoising networks spatially and temporally. Moreover, we introduce an enhanced video data preprocessing pipeline that improves the training data in terms of motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Our demonstrations include videos of 10 seconds, 30 seconds, and 2 minutes, and the framework can be extended to even longer durations.
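The spatial composition step can be pictured as blending per-sub-prompt denoising outputs according to their region masks. Below is a minimal sketch of that idea, assuming per-sub-prompt noise predictions and soft spatial masks; all function names, tensor shapes, and the background-fill strategy are illustrative assumptions, not this repository's actual API.

```python
import torch

def compose_noise_predictions(noise_preds, region_masks, background_pred):
    """Blend per-sub-prompt noise predictions into a single prediction (illustrative sketch).

    noise_preds:     list of tensors, each (B, C, H, W), one per sub-prompt
    region_masks:    list of tensors, each (1, 1, H, W) with values in [0, 1]
    background_pred: (B, C, H, W) prediction conditioned on the full/background prompt
    """
    coverage = torch.zeros_like(region_masks[0])
    composed = torch.zeros_like(background_pred)
    # Weight each sub-prompt's prediction by its spatial region mask.
    for pred, mask in zip(noise_preds, region_masks):
        composed = composed + mask * pred
        coverage = coverage + mask
    # Fill any region not covered by a sub-prompt with the background prediction.
    composed = composed + (1.0 - coverage).clamp(min=0.0) * background_pred
    return composed
```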
- [2024.6.7] Paper VideoTetris released

TODO:
- Release the inference code of VideoTetris with VideoCrafter2
- Release the checkpoint of our long compositional video generation
- Release VideoTetris with KLing/FIFO-Diffusion
We only provide some example results here; more detailed results can be found on the project page.
- "A cute brown dog on the left and a sleepy cat on the right are napping in the sun." (16 frames)
- "A cheerful farmer and a hardworking blacksmith are building a barn." (16 frames)
@article{tian2024videotetris,
  title={VideoTetris: Towards Compositional Text-to-Video Generation},
  author={Tian, Ye and Yang, Ling and Yang, Haotian and Gao, Yuan and Deng, Yufan and Chen, Jingmin and Wang, Xintao and Yu, Zhaochen and Tao, Xin and Wan, Pengfei and Zhang, Di and Cui, Bin},
  journal={arXiv preprint arXiv:2406.04277},
  year={2024}
}