Misc fixes to diffusers blog post (huggingface#743)
* Misc fixes to diffusers blog post

* Add torch.no_grad() when decoding latents

* Update stable_diffusion.md

Co-authored-by: Pedro Cuenca <[email protected]>
osanseviero and pcuenca authored Dec 18, 2022
1 parent a24e068 commit a66f4ab
Showing 1 changed file with 14 additions and 21 deletions.
stable_diffusion.md
@@ -70,7 +70,7 @@ Now, let's get started by generating some images 🎨.

### License

-Before using the model, you need to accept the model [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) in order to download and use the weights.
+Before using the model, you need to accept the model [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) in order to download and use the weights. **Note: the license does not need to be explicitly accepted through the UI anymore**.

The license is designed to mitigate the potential harmful effects of such a powerful machine learning system.
We request users to **read the license entirely and carefully**. Here we offer a summary:
@@ -80,29 +80,21 @@ We request users to **read the license entirely and carefully**. Here we offer a

### Usage

-First, you should install `diffusers==0.4.0` to run the following code snippets:
+First, you should install `diffusers==0.10.2` to run the following code snippets:

```bash
-pip install diffusers==0.4.0 transformers scipy ftfy
+pip install diffusers==0.10.2 transformers scipy ftfy accelerate
```

-In this post we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
-Once you have requested access, make sure to pass your user token as:
-
-```py
-YOUR_TOKEN="/your/huggingface/hub/token"
-```
-
-After that one-time setup out of the way, we can proceed with Stable Diffusion inference.
+In this post we'll use model version [`v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4), but you can also use other versions of the model such as 1.5, 2, and 2.1 with minimal code changes.
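
For readers who want to try one of those other versions, a minimal sketch of what the swap might look like (the `runwayml/stable-diffusion-v1-5` and `stabilityai/stable-diffusion-2-1` checkpoint ids are assumptions on our part, not part of this diff):

```python
from diffusers import StableDiffusionPipeline

# Hypothetical version swap: only the repository id changes,
# the rest of the code in this post stays the same.
pipe_v1_5 = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe_v2_1 = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
```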

The Stable Diffusion model can be run in inference with just a couple of lines using the [`StableDiffusionPipeline`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py) pipeline. The pipeline sets up everything you need to generate images from text with
a simple `from_pretrained` function call.

```python
from diffusers import StableDiffusionPipeline

-# get your token at https://huggingface.co/settings/tokens
-pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=YOUR_TOKEN)
+pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
```

If a GPU is available, let's move it to one!
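
The snippet itself is collapsed in this diff; judging from the hunk context below, it presumably amounts to:

```python
# Sketch reconstructed from the collapsed hunk: move the pipeline to the GPU.
pipe = pipe.to("cuda")
```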
@@ -114,23 +106,23 @@ pipe.to("cuda")
**Note**: If you are limited by GPU memory and have less than 10GB of GPU RAM available, please
make sure to load the `StableDiffusionPipeline` in float16 precision instead of the default
float32 precision as done above.

You can do so by loading the weights from the `fp16` branch and by telling `diffusers` to expect the
weights to be in float16 precision:

```python
import torch
from diffusers import StableDiffusionPipeline

-# get your token at https://huggingface.co/settings/tokens
-pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=YOUR_TOKEN)
+pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16)
```

To run the pipeline, simply define the prompt and call `pipe`.

```python
prompt = "a photograph of an astronaut riding a horse"

-image = pipe(prompt)["sample"][0]
+image = pipe(prompt).images[0]

# you can save the image with
# image.save(f"astronaut_rides_horse.png")
@@ -153,7 +145,7 @@ print(result)

```json
{
-'sample': [<PIL.Image.Image image mode=RGB size=512x512>],
+'images': [<PIL.Image.Image image mode=RGB size=512x512>],
'nsfw_content_detected': [False]
}
```
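
Given the renamed `images` field and the `nsfw_content_detected` flags, a minimal sketch of consuming both (the saving logic here is illustrative, not from the post):

```python
result = pipe(prompt)

# Save only the outputs that passed the safety checker.
for i, (image, nsfw) in enumerate(zip(result.images, result.nsfw_content_detected)):
    if not nsfw:
        image.save(f"output_{i}.png")
```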
@@ -362,14 +354,14 @@ from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

# 1. Load the autoencoder model which will be used to decode the latents into image space.
-vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_auth_token=YOUR_TOKEN)
+vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# 2. Load the tokenizer and text encoder to tokenize and encode the text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 3. The UNet model for generating the latents.
-unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_auth_token=YOUR_TOKEN)
+unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
```

Now instead of loading the pre-defined scheduler, we load the [K-LMS scheduler](https://github.com/huggingface/diffusers/blob/71ba8aec55b52a7ba5a1ff1db1265ffdd3c65ea2/src/diffusers/schedulers/scheduling_lms_discrete.py#L26) with some fitting parameters.
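
The instantiation itself is collapsed in this diff; a sketch of a typical K-LMS setup in `diffusers` (the exact parameter values are assumptions, not visible here):

```python
from diffusers import LMSDiscreteScheduler

# Sketch: a common K-LMS configuration for Stable Diffusion v1 checkpoints.
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000,
)
```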
@@ -476,7 +468,7 @@ for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

-    latent_model_input = scheduler.scale_model_input(latent_model_input)
+    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
@@ -496,7 +488,8 @@ We now use the `vae` to decode the generated `latents` back into the image.
```python
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
-image = vae.decode(latents).sample
+with torch.no_grad():
+    image = vae.decode(latents).sample
```

And finally, let's convert the image to PIL so we can display or save it.
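
That conversion is collapsed in this diff; a sketch of the standard pattern it presumably follows:

```python
from PIL import Image

# Map decoder output from [-1, 1] to [0, 1], move channels last, convert to uint8.
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(img) for img in images]
```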
