docs/source/en/api/pipelines/cogvideox.md (+20 -23)
@@ -15,9 +15,7 @@
 
 # CogVideoX
 
-<!-- TODO: update paper with ArXiv link when ready. -->
-
-[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.
+[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.
 
 The abstract from the paper is:
 
@@ -43,43 +41,42 @@ from diffusers import CogVideoXPipeline
-# CogVideoX works very well with long and well-described prompts
+# CogVideoX works well with long and well-described prompts
 prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
-video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
+video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
 ```
 
-The [benchmark](TODO: link) results on an 80GB A100 machine are:
+The [benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:
 
 ```
-Without torch.compile(): Average inference time: TODO seconds.
-With torch.compile(): Average inference time: TODO seconds.
+Without torch.compile(): Average inference time: 96.89 seconds.
+With torch.compile(): Average inference time: 76.27 seconds.
 ```
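As a rough sketch of how the compiled timing above could be reproduced: the snippet below wraps the pipeline's transformer with `torch.compile`, which is where the speed-up comes from. The checkpoint ID, dtype, and prompt are assumptions for illustration; the linked gist is the authoritative benchmark script.

```python
import torch
from diffusers import CogVideoXPipeline

# Assumed checkpoint and dtype; see the linked gist for the actual benchmark setup.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")

# Compile the transformer (the denoiser), which dominates inference time.
pipe.transformer = torch.compile(pipe.transformer)

prompt = "A panda playing a small guitar in a bamboo forest."
# The first call pays the compilation overhead; later calls reflect the averaged timings.
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```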
 
+### Memory optimization
+
+CogVideoX requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.
+
+- `pipe.enable_model_cpu_offload()`:
+  - Without enabling cpu offloading, memory usage is `33 GB`
+  - With enabling cpu offloading, memory usage is `19 GB`
+- `pipe.vae.enable_tiling()`:
+  - With enabling cpu offloading and tiling, memory usage is `11 GB`
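A minimal sketch of how the two optimizations listed above might be combined in practice; the checkpoint ID, dtype, and prompt here are assumptions, and the replication script linked above remains the reference.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Assumed checkpoint and dtype for illustration.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Keep submodules on CPU and move each to GPU only when needed (~33 GB -> ~19 GB above).
# Note: do not call pipe.to("cuda") when using cpu offloading.
pipe.enable_model_cpu_offload()

# Decode latents in tiles so the VAE never materializes the full video at once (~19 GB -> ~11 GB above).
pipe.vae.enable_tiling()

prompt = "A panda playing a small guitar in a bamboo forest."
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```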