This repo contains the official implementation of VideoTetris (NeurIPS 2024).
VideoTetris: Towards Compositional Text-To-Video Generation
Ye Tian, Ling Yang*, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Pengfei Wan, Di Zhang, Bin Cui
(* Equal Contribution and Corresponding Author)
Peking University, Kuaishou Technology
VideoTetris is a novel framework that enables compositional text-to-video (T2V) generation. Specifically, we propose spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing pipeline that improves the training data with respect to motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Our demonstrations include videos of 10 seconds, 30 seconds, and 2 minutes, and the method can be extended to even longer durations.
We provide the inference code of VideoTetris for compositional video generation based on VideoCrafter2. Download the pretrained model from Hugging Face and put it at checkpoints/base_512_v2/model.ckpt. Then set up the environment with the following commands:
cd short
conda create -n videocrafter python=3.8.5
conda activate videocrafter
pip install -r requirements.txt
You can then plan the regions for different sub-objects in a JSON file; see prompts/demo_videotetris.json for an example. Each region is a bounding box defined by its top-left and bottom-right coordinates. The final planning JSON should look like:
[
{
"basic_prompt": "A cat on the left and a dog on the right are napping in the sun.",
"sub_objects":[
"A cute orange cat.",
"A cute dog."
],
"layout_boxes":[
[0, 0, 0.5, 1],
[0.5, 0, 1, 1]
]
}
]
In this example, we first define the basic prompt and then specify the sub-objects and their corresponding regions, which yields a video with a cat on the left and a dog on the right. To generate the video, run:
sh scripts/run_text2video_from_layout.sh
You can specify the input JSON file in the run_text2video_from_layout.sh script.
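Before launching generation, it can help to sanity-check the layout file described above. The following is a minimal sketch, not part of the VideoTetris code: `validate_layout_plan` is a hypothetical helper, and it assumes that the plan parses to a list of entries, that each box is ordered `[x0, y0, x1, y1]` (top-left, then bottom-right), and that coordinates are fractions of the frame in the range 0 to 1, as the example values suggest.

```python
import json


def validate_layout_plan(entries):
    """Return a list of human-readable problems found in a layout plan.

    Assumption: each box is [x0, y0, x1, y1] with values in [0, 1].
    """
    problems = []
    for i, entry in enumerate(entries):
        subs = entry.get("sub_objects", [])
        boxes = entry.get("layout_boxes", [])
        # Every sub-object needs exactly one region.
        if len(subs) != len(boxes):
            problems.append(
                f"entry {i}: {len(subs)} sub_objects but {len(boxes)} layout_boxes"
            )
        for j, box in enumerate(boxes):
            if len(box) != 4:
                problems.append(f"entry {i}, box {j}: expected 4 numbers, got {len(box)}")
                continue
            x0, y0, x1, y1 = box
            if not all(0 <= v <= 1 for v in box):
                problems.append(f"entry {i}, box {j}: coordinates outside [0, 1]")
            # The top-left corner must lie strictly above and left of the bottom-right one.
            if x0 >= x1 or y0 >= y1:
                problems.append(f"entry {i}, box {j}: top-left is not above/left of bottom-right")
    return problems


demo = json.loads("""
[
  {
    "basic_prompt": "A cat on the left and a dog on the right are napping in the sun.",
    "sub_objects": ["A cute orange cat.", "A cute dog."],
    "layout_boxes": [[0, 0, 0.5, 1], [0.5, 0, 1, 1]]
  }
]
""")

print(validate_layout_plan(demo))  # prints []
```

An empty list means no problems were found under these assumptions; the check says nothing about whether the generation script accepts the file.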
cd long
conda create -n st2v python=3.10
conda activate st2v
pip install -r requirements.txt
We release our VideoTetris-long model, finetuned on our filtered dataset, on Hugging Face. You can download the weights into the working directory with:
wget https://huggingface.co/tyfeld/VideoTetris-long/resolve/main/model-step=6000-v1.ckpt
You can then plan the regions for different sub-objects in a JSON file like prompts/prompt.json. For each video chunk, specify the video chunk index, prompt, sub-objects, and layout boxes.
Video chunk meaning: A long video is generated autoregressively, 8 frames per chunk, so a video with 80 frames takes (80-8)/8 = 9 autoregressive rounds. Each chunk is the 8 new frames generated in one round.
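The round count above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: `autoregressive_rounds` is a hypothetical helper name, and the requirement that the frame count be a multiple of the chunk size is an assumption of this sketch rather than a documented constraint.

```python
CHUNK_FRAMES = 8  # frames generated per chunk, as stated above


def autoregressive_rounds(num_frames, chunk_frames=CHUNK_FRAMES):
    """Number of autoregressive rounds needed for num_frames frames.

    Follows the formula (num_frames - chunk_frames) / chunk_frames.
    Assumption: num_frames is a positive multiple of chunk_frames.
    """
    if num_frames < chunk_frames or num_frames % chunk_frames != 0:
        raise ValueError("num_frames must be a positive multiple of chunk_frames")
    return (num_frames - chunk_frames) // chunk_frames


print(autoregressive_rounds(80))  # prints 9
```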
Each region is a bounding box defined by its top-left and bottom-right coordinates; see prompts/prompt.json for an example. The final planning JSON should look like:
[
{
"video_chunk_index": 0,
"prompt": "A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic.",
"sub_objects": [
"A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic."
],
"layout_boxes":[
[0, 0, 1, 1]
]
},
{
"video_chunk_index": 4,
"prompt": "A cute brown squirrel and a cute white squirrel in Antarctica, on a pile of hazelnuts cinematic",
"sub_objects": [
"A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic.",
"A cute white squirrel in Antarctica, on a pile of hazelnuts cinematic."
],
"layout_boxes":[
[0.5, 0, 1, 1],
[0, 0, 0.5, 1]
]
}
]
cd t2v_enhanced
python inference_videotetris.py --num_frames 80
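The planning file for long videos can also be generated from Python instead of being written by hand. The sketch below is illustrative only: `vertical_strips` and `make_chunk` are hypothetical helpers, not part of the repository, and splitting the frame into equal left-to-right strips is just one possible layout choice. It uses the same keys as the example above and assumes boxes are ordered `[x0, y0, x1, y1]`.

```python
import json


def vertical_strips(n):
    """Split the frame into n equal left-to-right boxes [x0, y0, x1, y1]."""
    return [[i / n, 0, (i + 1) / n, 1] for i in range(n)]


def make_chunk(index, prompt, sub_objects):
    """Build one planning entry; sub-objects are placed left to right."""
    return {
        "video_chunk_index": index,
        "prompt": prompt,
        "sub_objects": list(sub_objects),
        "layout_boxes": vertical_strips(len(sub_objects)),
    }


scene = "in Antarctica, on a pile of hazelnuts cinematic."
brown = f"A cute brown squirrel {scene}"
white = f"A cute white squirrel {scene}"

plan = [
    # One sub-object covering the whole frame.
    make_chunk(0, brown, [brown]),
    # Two sub-objects: white squirrel on the left half, brown on the right half.
    make_chunk(
        4,
        f"A cute brown squirrel and a cute white squirrel {scene}",
        [white, brown],
    ),
]

print(json.dumps(plan, indent=4))
```

The printed JSON can be saved to a file and used in place of prompts/prompt.json. Because this sketch assigns strips in list order, the order of `sub_objects` decides which object ends up on which side.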
We only provide a few example results here; more detailed results can be found on the project page.
A cute brown dog on the left and a sleepy cat on the right are napping in the sun. @16 Frames
A cheerful farmer and a hardworking blacksmith are building a barn. @16 Frames
@article{tian2024videotetris,
title={VideoTetris: Towards Compositional Text-to-Video Generation},
author={Tian, Ye and Yang, Ling and Yang, Haotian and Gao, Yuan and Deng, Yufan and Chen, Jingmin and Wang, Xintao and Yu, Zhaochen and Tao, Xin and Wan, Pengfei and Zhang, Di and Cui, Bin},
journal={arXiv preprint arXiv:2406.04277},
year={2024}
}