
Commit a85b34e

[refactor] CogVideoX followups + tiled decoding support (huggingface#9150)
* refactor context parallel cache; update torch compile time benchmark
* add tiling support
* make style
* remove num_frames % 8 == 0 requirement
* update default num_frames to original value
* add explanations + refactor
* update torch compile example
* update docs
* update
* clean up if-statements
* address review comments
* add test for vae tiling
* update docs
* update docs
* update docstrings
* add modeling test for cogvideox transformer
* make style
1 parent 5ffbe14 commit a85b34e

File tree

6 files changed: +529 -175 lines changed

docs/source/en/api/pipelines/cogvideox.md (+20 -23)
@@ -15,9 +15,7 @@
 
 # CogVideoX
 
-<!-- TODO: update paper with ArXiv link when ready. -->
-
-[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.
+[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.
 
 The abstract from the paper is:
 

@@ -43,43 +41,42 @@ from diffusers import CogVideoXPipeline
 from diffusers.utils import export_to_video
 
 pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
-prompt = (
-    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
-    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
-    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
-    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
-    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
-    "atmosphere of this unique musical performance."
-)
-video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
-export_to_video(video, "output.mp4", fps=8)
 ```
 
-Then change the memory layout of the pipelines `transformer` and `vae` components to `torch.channels-last`:
+Then change the memory layout of the pipeline's `transformer` component to `torch.channels_last`:
 
 ```python
-pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.vae.to(memory_format=torch.channels_last)
+pipe.transformer.to(memory_format=torch.channels_last)
 ```
 
 Finally, compile the components and run inference:
 
 ```python
-pipeline.transformer = torch.compile(pipeline.transformer)
-pipeline.vae.decode = torch.compile(pipeline.vae.decode)
+pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
 
-# CogVideoX works very well with long and well-described prompts
+# CogVideoX works well with long and well-described prompts
 prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
-video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
+video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
 ```
 
-The [benchmark](TODO: link) results on an 80GB A100 machine are:
+The [benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:
 
 ```
-Without torch.compile(): Average inference time: TODO seconds.
-With torch.compile(): Average inference time: TODO seconds.
+Without torch.compile(): Average inference time: 96.89 seconds.
+With torch.compile(): Average inference time: 76.27 seconds.
 ```
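These averages come from the linked gist. As a rough sketch of how such an average is typically measured, the loop below times a few runs after a compilation warmup; the run count, the use of `time.perf_counter`, and the loop structure are assumptions rather than the gist's exact methodology:

```python
import time

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
pipe.transformer.to(memory_format=torch.channels_last)
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest."

# Warmup: the first call triggers compilation and should not be timed.
pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50)

# Average wall-clock time over a few runs; synchronize so queued GPU work is counted.
num_runs = 3  # assumption: the gist may average over a different number of runs
total = 0.0
for _ in range(num_runs):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50)
    torch.cuda.synchronize()
    total += time.perf_counter() - start

print(f"Average inference time: {total / num_runs:.2f} seconds.")
```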
 
+### Memory optimization
+
+CogVideoX requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To reproduce the numbers, refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.
+
+- `pipe.enable_model_cpu_offload()`:
+  - Without enabling cpu offloading, memory usage is `33 GB`
+  - With enabling cpu offloading, memory usage is `19 GB`
+- `pipe.vae.enable_tiling()`:
+  - With enabling cpu offloading and tiling, memory usage is `11 GB`
+- `pipe.vae.enable_slicing()`
+
 ## CogVideoXPipeline
 
 [[autodoc]] CogVideoXPipeline
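The three calls named in the new `Memory optimization` section compose in a single pipeline setup. A minimal sketch, assuming fp16 weights and a placeholder prompt (the generation settings are illustrative, not prescribed by the diff):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Offload submodules to the CPU while idle; this manages device placement
# itself, so the pipeline is not moved to "cuda" beforehand.
pipe.enable_model_cpu_offload()

# Decode latents in overlapping spatial tiles instead of all at once, and
# decode batched latents one sample at a time.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A panda playing a miniature acoustic guitar in a serene bamboo forest."  # placeholder
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Tiling trades a small amount of decode time for the large activation savings shown in the list above, which is what makes the 11 GB figure reachable.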

0 commit comments