diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index ab42a9a1b00e..9022ca669f3f 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -52,6 +52,8 @@ title: Image-to-image - local: using-diffusers/inpaint title: Inpainting + - local: using-diffusers/text-img2vid + title: Text or image-to-video - local: using-diffusers/depth2img title: Depth-to-image title: Tasks @@ -323,6 +325,8 @@ title: Text-to-image - local: api/pipelines/stable_diffusion/img2img title: Image-to-image + - local: api/pipelines/stable_diffusion/svd + title: Image-to-video - local: api/pipelines/stable_diffusion/inpaint title: Inpainting - local: api/pipelines/stable_diffusion/depth2img diff --git a/docs/source/en/api/attnprocessor.md b/docs/source/en/api/attnprocessor.md index e445f44e2405..3c0ee0563f07 100644 --- a/docs/source/en/api/attnprocessor.md +++ b/docs/source/en/api/attnprocessor.md @@ -20,14 +20,14 @@ An attention processor is a class for applying different types of attention mech ## AttnProcessor2_0 [[autodoc]] models.attention_processor.AttnProcessor2_0 -## FusedAttnProcessor2_0 -[[autodoc]] models.attention_processor.FusedAttnProcessor2_0 +## AttnAddedKVProcessor +[[autodoc]] models.attention_processor.AttnAddedKVProcessor -## LoRAAttnProcessor -[[autodoc]] models.attention_processor.LoRAAttnProcessor +## AttnAddedKVProcessor2_0 +[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0 -## LoRAAttnProcessor2_0 -[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0 +## CrossFrameAttnProcessor +[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor ## CustomDiffusionAttnProcessor [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor @@ -35,26 +35,29 @@ An attention processor is a class for applying different types of attention mech ## CustomDiffusionAttnProcessor2_0 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0 -## AttnAddedKVProcessor -[[autodoc]] models.attention_processor.AttnAddedKVProcessor +## CustomDiffusionXFormersAttnProcessor +[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor -## AttnAddedKVProcessor2_0 -[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0 +## FusedAttnProcessor2_0 +[[autodoc]] models.attention_processor.FusedAttnProcessor2_0 + +## LoRAAttnProcessor +[[autodoc]] models.attention_processor.LoRAAttnProcessor + +## LoRAAttnProcessor2_0 +[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0 ## LoRAAttnAddedKVProcessor [[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor -## XFormersAttnProcessor -[[autodoc]] models.attention_processor.XFormersAttnProcessor - ## LoRAXFormersAttnProcessor [[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor -## CustomDiffusionXFormersAttnProcessor -[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor - ## SlicedAttnProcessor [[autodoc]] models.attention_processor.SlicedAttnProcessor ## SlicedAttnAddedKVProcessor [[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor + +## XFormersAttnProcessor +[[autodoc]] models.attention_processor.XFormersAttnProcessor diff --git a/docs/source/en/api/pipelines/stable_diffusion/svd.md b/docs/source/en/api/pipelines/stable_diffusion/svd.md new file mode 100644 index 000000000000..87a9c2a5be86 --- /dev/null +++ b/docs/source/en/api/pipelines/stable_diffusion/svd.md @@ -0,0 +1,43 @@ + + +# Stable Video Diffusion + +Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling 
Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach. + +The abstract from the paper is: + +*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.* + + + +To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide. + +
+ +Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints! + +
## Tips

Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.

Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.

## StableVideoDiffusionPipeline

[[autodoc]] StableVideoDiffusionPipeline

## StableVideoDiffusionPipelineOutput

[[autodoc]] pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput

diff --git a/docs/source/en/api/pipelines/text_to_video.md b/docs/source/en/api/pipelines/text_to_video.md
index b47d6fa72bae..7522264e0b58 100644
--- a/docs/source/en/api/pipelines/text_to_video.md
+++ b/docs/source/en/api/pipelines/text_to_video.md
@@ -167,6 +167,12 @@ Here are some sample outputs:
+## Tips
+
+Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
+
+Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
+
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.

diff --git a/docs/source/en/using-diffusers/text-img2vid.md b/docs/source/en/using-diffusers/text-img2vid.md
new file mode 100644
index 000000000000..56cc85f0a87a
--- /dev/null
+++ b/docs/source/en/using-diffusers/text-img2vid.md
@@ -0,0 +1,497 @@

# Text or image-to-video

Driven by the success of text-to-image diffusion models, generative video models are able to generate short clips of video from a text prompt or an initial image. These models extend a pretrained diffusion model to generate videos by adding some type of temporal and/or spatial convolution layer to the architecture. A mixed dataset of images and videos is used to train the model, which learns to output a series of video frames based on the text or image conditioning.

This guide will show you how to generate videos, how to configure video model parameters, and how to control video generation.

## Popular models

> [!TIP]
> Discover other cool and trending video generation models on the Hub [here](https://huggingface.co/models?pipeline_tag=text-to-video&sort=trending)!

[Stable Video Diffusion (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid), [I2VGen-XL](https://huggingface.co/ali-vilab/i2vgen-xl/), [AnimateDiff](https://huggingface.co/guoyww/animatediff), and [ModelScopeT2V](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b) are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos.
### Stable Video Diffusion

[SVD](../api/pipelines/svd) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about the model, like micro-conditioning, in the [Stable Video Diffusion](../using-diffusers/svd) guide.

Begin by loading the [`StableVideoDiffusionPipeline`] and passing an initial image to generate a video from.

```py
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
*Figure: the initial image (left) and the generated video (right)*
+ +### I2VGen-XL + +[I2VGen-XL](../api/pipelines/i2vgenxl) is a diffusion model that can generate higher resolution videos than SVD and it is also capable of accepting text prompts in addition to images. The model is trained with two hierarchical encoders (detail and global encoder) to better capture low and high-level details in images. These learned details are used to train a video diffusion model which refines the video resolution and details in the generated video. + +You can use I2VGen-XL by loading the [`I2VGenXLPipeline`], and passing a text and image prompt to generate a video. + +```py +import torch +from diffusers import I2VGenXLPipeline +from diffusers.utils import export_to_gif, load_image + +pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16") +pipeline.enable_model_cpu_offload() + +image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png" +image = load_image(image_url).convert("RGB") + +prompt = "Papers were floating in the air on a table in the library" +negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" +generator = torch.manual_seed(8888) + +frames = pipeline( + prompt=prompt, + image=image, + num_inference_steps=50, + negative_prompt=negative_prompt, + guidance_scale=9.0, + generator=generator +).frames[0] +export_to_gif(frames, "i2v.gif") +``` + +
*Figure: the initial image (left) and the generated video (right)*
+ +### AnimateDiff + +[AnimateDiff](../api/pipelines/animatediff) is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion which is used to condition the generation process to create a video. It is faster and easier to only train the adapter and it can be loaded into most diffusion models, effectively turning them into "video models". + +Start by loading a [`MotionAdapter`]. + +```py +import torch +from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter +from diffusers.utils import export_to_gif + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) +``` + +Then load a finetuned Stable Diffusion model with the [`AnimateDiffPipeline`]. + +```py +pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16) +scheduler = DDIMScheduler.from_pretrained( + "emilianJR/epiCRealism", + subfolder="scheduler", + clip_sample=False, + timestep_spacing="linspace", + beta_schedule="linear", + steps_offset=1, +) +pipeline.scheduler = scheduler +pipeline.enable_vae_slicing() +pipeline.enable_model_cpu_offload() +``` + +Create a prompt and generate the video. + +```py +output = pipeline( + prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution", + negative_prompt="bad quality, worse quality, low resolution", + num_frames=16, + guidance_scale=7.5, + num_inference_steps=50, + generator=torch.Generator("cpu").manual_seed(49), +) +frames = output.frames[0] +export_to_gif(frames, "animation.gif") +``` + +
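Because the motion adapter is separate from the base weights, it can be reused with other Stable Diffusion 1.5-derived checkpoints. The sketch below only illustrates that idea; the `runwayml/stable-diffusion-v1-5` checkpoint and the generation settings are example choices, not a recommendation.

```py
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)

# any Stable Diffusion 1.5-derived checkpoint should work here; this one is only an example
pipeline = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
)
# reuse the scheduler settings from the example above
pipeline.scheduler = DDIMScheduler.from_config(
    pipeline.scheduler.config, clip_sample=False, timestep_spacing="linspace", beta_schedule="linear", steps_offset=1
)
pipeline.enable_vae_slicing()
pipeline.enable_model_cpu_offload()

frames = pipeline(
    prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
    negative_prompt="bad quality, worse quality, low resolution",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=50,
    generator=torch.Generator("cpu").manual_seed(49),
).frames[0]
export_to_gif(frames, "animation_sd15.gif")
```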
### ModelscopeT2V

[ModelscopeT2V](../api/pipelines/text_to_video) adds spatial and temporal convolutions and attention to a UNet, and it is trained on image-text and video-text datasets to enhance what it learns during training. The model takes a prompt and encodes it into text embeddings, which are denoised by the UNet and then decoded by a VQGAN into a video.

ModelScopeT2V generates watermarked videos due to the datasets it was trained on. To use a watermark-free model, try the [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w) model with the [`TextToVideoSDPipeline`] first, and then upscale its output with the [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL) checkpoint using the [`VideoToVideoSDPipeline`] (a sketch of this two-stage workflow follows the example below).

Load a ModelScopeT2V checkpoint into the [`DiffusionPipeline`] along with a prompt to generate a video.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipeline = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

prompt = "Confident teddy bear surfer rides the wave in the tropics"
video_frames = pipeline(prompt).frames[0]
export_to_video(video_frames, "modelscopet2v.mp4", fps=10)
```
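The sketch below illustrates the watermark-free workflow from the tip above: generate a lower-resolution video with the `zeroscope_v2_576w` checkpoint, then refine the frames at a higher resolution with `zeroscope_v2_XL`. The `num_frames`, resize resolution, and `strength` values are illustrative assumptions, and the frames are requested as PIL images to keep the resizing step simple (if your version of Diffusers returns NumPy frames instead, convert them with `PIL.Image.fromarray` first).

```py
import torch
from diffusers import TextToVideoSDPipeline, VideoToVideoSDPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# 1. generate a lower-resolution, watermark-free video
pipeline = TextToVideoSDPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

prompt = "Confident teddy bear surfer rides the wave in the tropics"
video_frames = pipeline(prompt, num_frames=24, output_type="pil").frames[0]

# 2. upscale and refine the frames with the XL checkpoint
upscaler = VideoToVideoSDPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
upscaler.scheduler = DPMSolverMultistepScheduler.from_config(upscaler.scheduler.config)
upscaler.enable_model_cpu_offload()
upscaler.enable_vae_slicing()

# resize the PIL frames before refining them at the higher resolution
video = [frame.resize((1024, 576)) for frame in video_frames]
upscaled_frames = upscaler(prompt, video=video, strength=0.6, output_type="pil").frames[0]
export_to_video(upscaled_frames, "modelscopet2v_upscaled.mp4", fps=10)
```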
## Configure model parameters

There are a few important parameters you can configure in the pipeline that'll affect the video generation process and quality. Let's take a closer look at what these parameters do and how changing them affects the output.

### Number of frames

The `num_frames` parameter determines how many video frames are generated. A frame is an image that is played in a sequence with other frames to create motion or a video. This affects video length because, at a given frame rate, more frames means a longer clip (check a pipeline's API reference for the default number of frames). To increase the video duration, you'll need to increase the `num_frames` parameter.

```py
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipeline(image, decode_chunk_size=8, generator=generator, num_frames=25).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
*Figure: videos generated with `num_frames=14` (left) and `num_frames=25` (right)*
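The frame rate itself is set when the clip is exported (`fps=7` above), so a rough way to estimate duration is to divide the number of generated frames by the export frame rate:

```py
num_frames = 25
fps = 7
print(f"approximate clip length: {num_frames / fps:.1f} seconds")  # ~3.6 seconds
```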
+ +### Guidance scale + +The `guidance_scale` parameter controls how closely aligned the generated video and text prompt or initial image is. A higher `guidance_scale` value means your generated video is more aligned with the text prompt or initial image, while a lower `guidance_scale` value means your generated video is less aligned which could give the model more "creativity" to interpret the conditioning input. + + + +SVD uses the `min_guidance_scale` and `max_guidance_scale` parameters for applying guidance to the first and last frames respectively. + + + +```py +import torch +from diffusers import I2VGenXLPipeline +from diffusers.utils import export_to_gif, load_image + +pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16") +pipeline.enable_model_cpu_offload() + +image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png" +image = load_image(image_url).convert("RGB") + +prompt = "Papers were floating in the air on a table in the library" +negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" +generator = torch.manual_seed(0) + +frames = pipeline( + prompt=prompt, + image=image, + num_inference_steps=50, + negative_prompt=negative_prompt, + guidance_scale=1.0, + generator=generator +).frames[0] +export_to_gif(frames, "i2v.gif") +``` + +
*Figure: videos generated with `guidance_scale=9.0` (left) and `guidance_scale=1.0` (right)*
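For Stable Video Diffusion, guidance is instead controlled with the `min_guidance_scale` and `max_guidance_scale` parameters mentioned in the tip above. Here is a minimal sketch based on the earlier SVD example; the values shown are only examples.

```py
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipeline(
    image,
    decode_chunk_size=8,
    generator=generator,
    min_guidance_scale=1.0,  # guidance applied at the first frame
    max_guidance_scale=3.0,  # guidance applied at the last frame
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```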
+ +### Negative prompt + +A negative prompt deters the model from generating things you don’t want it to. This parameter is commonly used to improve overall generation quality by removing poor or bad features such as “low resolution” or “bad details”. + +```py +import torch +from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter +from diffusers.utils import export_to_gif + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) + +pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16) +scheduler = DDIMScheduler.from_pretrained( + "emilianJR/epiCRealism", + subfolder="scheduler", + clip_sample=False, + timestep_spacing="linspace", + beta_schedule="linear", + steps_offset=1, +) +pipeline.scheduler = scheduler +pipeline.enable_vae_slicing() +pipeline.enable_model_cpu_offload() + +output = pipeline( + prompt="360 camera shot of a sushi roll in a restaurant", + negative_prompt="Distorted, discontinuous, ugly, blurry, low resolution, motionless, static", + num_frames=16, + guidance_scale=7.5, + num_inference_steps=50, + generator=torch.Generator("cpu").manual_seed(0), +) +frames = output.frames[0] +export_to_gif(frames, "animation.gif") +``` + +
*Figure: video generated with no negative prompt (left) and with the negative prompt applied (right)*
+ +### Model-specific parameters + +There are some pipeline parameters that are unique to each model such as adjusting the motion in a video or adding noise to the initial image. + + + + +Stable Video Diffusion provides additional micro-conditioning for the frame rate with the `fps` parameter and for motion with the `motion_bucket_id` parameter. Together, these parameters allow for adjusting the amount of motion in the generated video. + +There is also a `noise_aug_strength` parameter that increases the amount of noise added to the initial image. Varying this parameter affects how similar the generated video and initial image are. A higher `noise_aug_strength` also increases the amount of motion. To learn more, read the [Micro-conditioning](../using-diffusers/svd#micro-conditioning) guide. + + + + +Text2Video-Zero computes the amount of motion to apply to each frame from randomly sampled latents. You can use the `motion_field_strength_x` and `motion_field_strength_y` parameters to control the amount of motion to apply to the x and y-axes of the video. The parameters `t0` and `t1` are the timesteps to apply motion to the latents. + + + + +## Control video generation + +Video generation can be controlled similar to how text-to-image, image-to-image, and inpainting can be controlled with a [`ControlNetModel`]. The only difference is you need to use the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] so each frame attends to the first frame. + +### Text2Video-Zero + +Text2Video-Zero video generation can be conditioned on pose and edge images for even greater control over a subject's motion in the generated video or to preserve the identity of a subject/object in the video. You can also use Text2Video-Zero with [InstructPix2Pix](../api/pipelines/pix2pix) for editing videos with text. + + + + +Start by downloading a video and extracting the pose images from it. + +```py +from huggingface_hub import hf_hub_download +from PIL import Image +import imageio + +filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4" +repo_id = "PAIR/Text2Video-Zero" +video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) + +reader = imageio.get_reader(video_path, "ffmpeg") +frame_count = 8 +pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] +``` + +Load a [`ControlNetModel`] for pose estimation and a checkpoint into the [`StableDiffusionControlNetPipeline`]. Then you'll use the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] for the UNet and ControlNet. + +```py +import torch +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel +from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor + +model_id = "runwayml/stable-diffusion-v1-5" +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16) +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + model_id, controlnet=controlnet, torch_dtype=torch.float16 +).to("cuda") + +pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) +pipeline.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) +``` + +Fix the latents for all the frames, and then pass your prompt and extracted pose images to the model to generate a video. 
+ +```py +latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1) + +prompt = "Darth Vader dancing in a desert" +result = pipeline(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images +imageio.mimsave("video.mp4", result, fps=4) +``` + + + + +Download a video and extract the edges from it. + +```py +from huggingface_hub import hf_hub_download +from PIL import Image +import imageio + +filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4" +repo_id = "PAIR/Text2Video-Zero" +video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) + +reader = imageio.get_reader(video_path, "ffmpeg") +frame_count = 8 +pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] +``` + +Load a [`ControlNetModel`] for canny edge and a checkpoint into the [`StableDiffusionControlNetPipeline`]. Then you'll use the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] for the UNet and ControlNet. + +```py +import torch +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel +from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor + +model_id = "runwayml/stable-diffusion-v1-5" +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + model_id, controlnet=controlnet, torch_dtype=torch.float16 +).to("cuda") + +pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) +pipeline.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) +``` + +Fix the latents for all the frames, and then pass your prompt and extracted edge images to the model to generate a video. + +```py +latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1) + +prompt = "Darth Vader dancing in a desert" +result = pipeline(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images +imageio.mimsave("video.mp4", result, fps=4) +``` + + + + +InstructPix2Pix allows you to use text to describe the changes you want to make to the video. Start by downloading and reading a video. + +```py +from huggingface_hub import hf_hub_download +from PIL import Image +import imageio + +filename = "__assets__/pix2pix video/camel.mp4" +repo_id = "PAIR/Text2Video-Zero" +video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) + +reader = imageio.get_reader(video_path, "ffmpeg") +frame_count = 8 +video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] +``` + +Load the [`StableDiffusionInstructPix2PixPipeline`] and set the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] for the UNet. + +```py +import torch +from diffusers import StableDiffusionInstructPix2PixPipeline +from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor + +pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda") +pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3)) +``` + +Pass a prompt describing the change you want to apply to the video. 
```py
prompt = "make it Van Gogh Starry Night style"
result = pipeline(prompt=[prompt] * len(video), image=video).images
imageio.mimsave("edited_video.mp4", result, fps=4)
```

## Optimize

Video generation requires a lot of memory because you're generating many video frames at once. You can reduce your memory requirements at the expense of some inference speed. Try:

1. offloading pipeline components to the CPU once they're no longer needed
2. enabling feed-forward chunking so the feed-forward layer runs in a loop instead of all at once
3. breaking up the number of frames the VAE has to decode into chunks instead of decoding them all at once

Applied to the earlier Stable Video Diffusion call, these changes look like this (a consolidated sketch follows at the end of this section):

```diff
- pipeline.enable_model_cpu_offload()
- frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipeline.enable_model_cpu_offload()
+ pipeline.unet.enable_forward_chunking()
+ frames = pipeline(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```

If memory is not an issue and you want to optimize for speed, try wrapping the UNet with [`torch.compile`](../optimization/torch2.0#torchcompile).

```diff
- pipeline.enable_model_cpu_offload()
+ pipeline.to("cuda")
+ pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
```
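Putting these memory optimizations together, here is a rough end-to-end sketch based on the Stable Video Diffusion example from earlier in this guide; the `decode_chunk_size` and `num_frames` values are only examples.

```py
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()      # 1. offload idle components to the CPU
pipeline.unet.enable_forward_chunking()  # 2. run the feed-forward layers in chunks

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipeline(
    image,
    decode_chunk_size=2,                 # 3. decode only a few frames at a time in the VAE
    generator=generator,
    num_frames=25,
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```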