Misc fixes to diffusers blog post (huggingface#743)
* Misc fixes to diffusers blog post

* Add torch.no_grad() when decoding latents

* Update stable_diffusion.md

Co-authored-by: Pedro Cuenca <[email protected]>
osanseviero and pcuenca authored Dec 18, 2022
1 parent a24e068 commit a66f4ab
Showing 1 changed file with 14 additions and 21 deletions.
stable_diffusion.md
@@ -70,7 +70,7 @@ Now, let's get started by generating some images 🎨.

### License

-Before using the model, you need to accept the model [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) in order to download and use the weights.
+Before using the model, you need to accept the model [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) in order to download and use the weights. **Note: the license does not need to be explicitly accepted through the UI anymore**.

The license is designed to mitigate the potential harmful effects of such a powerful machine learning system.
We request users to **read the license entirely and carefully**. Here we offer a summary:
@@ -80,29 +80,21 @@ We request users to **read the license entirely and carefully**. Here we offer a

### Usage

-First, you should install `diffusers==0.4.0` to run the following code snippets:
+First, you should install `diffusers==0.10.2` to run the following code snippets:

```bash
-pip install diffusers==0.4.0 transformers scipy ftfy
+pip install diffusers==0.10.2 transformers scipy ftfy accelerate
```

-In this post we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
-Once you have requested access, make sure to pass your user token as:
-
-```py
-YOUR_TOKEN="/your/huggingface/hub/token"
-```
-
-After that one-time setup out of the way, we can proceed with Stable Diffusion inference.
+In this post we'll use model version [`v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4), but you can also use other versions of the model such as 1.5, 2, and 2.1 with minimal code changes.
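
For readers who want to try one of those other versions, a minimal sketch of what the swap might look like (the `runwayml/stable-diffusion-v1-5` and `stabilityai/stable-diffusion-2-1` checkpoint ids are assumptions on our part, not part of this diff):

```python
from diffusers import StableDiffusionPipeline

# Hypothetical version swap: only the repository id changes,
# the rest of the code in this post stays the same.
pipe_v1_5 = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe_v2_1 = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
```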

The Stable Diffusion model can be run in inference with just a couple of lines using the [`StableDiffusionPipeline`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py) pipeline. The pipeline sets up everything you need to generate images from text with
a simple `from_pretrained` function call.

```python
from diffusers import StableDiffusionPipeline

-# get your token at https://huggingface.co/settings/tokens
-pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=YOUR_TOKEN)
+pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
```

If a GPU is available, let's move it to one!
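
The snippet itself is collapsed in this diff; judging from the hunk context below, it presumably amounts to:

```python
# Sketch reconstructed from the collapsed hunk: move the pipeline to the GPU.
pipe = pipe.to("cuda")
```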
@@ -114,23 +106,23 @@ pipe.to("cuda")
**Note**: If you are limited by GPU memory and have less than 10GB of GPU RAM available, please
make sure to load the `StableDiffusionPipeline` in float16 precision instead of the default
float32 precision as done above.

You can do so by loading the weights from the `fp16` branch and by telling `diffusers` to expect the
weights to be in float16 precision:

```python
import torch
from diffusers import StableDiffusionPipeline

-# get your token at https://huggingface.co/settings/tokens
-pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=YOUR_TOKEN)
+pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16)
```

To run the pipeline, simply define the prompt and call `pipe`.

```python
prompt = "a photograph of an astronaut riding a horse"

-image = pipe(prompt)["sample"][0]
+image = pipe(prompt).images[0]

# you can save the image with
# image.save(f"astronaut_rides_horse.png")
@@ -153,7 +145,7 @@ print(result)

```json
{
-'sample': [<PIL.Image.Image image mode=RGB size=512x512>],
+'images': [<PIL.Image.Image image mode=RGB size=512x512>],
'nsfw_content_detected': [False]
}
```
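
Given the renamed `images` field and the `nsfw_content_detected` flags, a minimal sketch of consuming both (the saving logic here is illustrative, not from the post):

```python
result = pipe(prompt)

# Save only the outputs that passed the safety checker.
for i, (image, nsfw) in enumerate(zip(result.images, result.nsfw_content_detected)):
    if not nsfw:
        image.save(f"output_{i}.png")
```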
@@ -362,14 +354,14 @@ from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

# 1. Load the autoencoder model which will be used to decode the latents into image space.
-vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_auth_token=YOUR_TOKEN)
+vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# 2. Load the tokenizer and text encoder to tokenize and encode the text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 3. The UNet model for generating the latents.
-unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_auth_token=YOUR_TOKEN)
+unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
```

Now instead of loading the pre-defined scheduler, we load the [K-LMS scheduler](https://github.com/huggingface/diffusers/blob/71ba8aec55b52a7ba5a1ff1db1265ffdd3c65ea2/src/diffusers/schedulers/scheduling_lms_discrete.py#L26) with some fitting parameters.
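
The instantiation itself is collapsed in this diff; a sketch of a typical K-LMS setup in `diffusers` (the exact parameter values are assumptions, not visible here):

```python
from diffusers import LMSDiscreteScheduler

# Sketch: a common K-LMS configuration for Stable Diffusion v1 checkpoints.
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000,
)
```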
@@ -476,7 +468,7 @@ for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

-    latent_model_input = scheduler.scale_model_input(latent_model_input)
+    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
@@ -496,7 +488,8 @@ We now use the `vae` to decode the generated `latents` back into the image.
```python
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
-image = vae.decode(latents).sample
+with torch.no_grad():
+    image = vae.decode(latents).sample
```

And finally, let's convert the image to PIL so we can display or save it.
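
That conversion is collapsed in this diff; a sketch of the standard pattern it presumably follows:

```python
from PIL import Image

# Map decoder output from [-1, 1] to [0, 1], move channels last, convert to uint8.
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(img) for img in images]
```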
