Merge branch 'main' into rl
natolambert committed Nov 14, 2022
2 parents 915c41e + 33d7e89 commit c901889
Showing 55 changed files with 2,501 additions and 1,187 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pr_tests.yml
@@ -136,7 +136,7 @@ jobs:
- name: Run fast PyTorch tests on M1 (MPS)
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps tests/
${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
- name: Failure short reports
if: ${{ failure() }}
2 changes: 1 addition & 1 deletion README.md
@@ -428,7 +428,7 @@ If you just want to play around with some web demos, you can try out the followi
<p>

**Schedulers**: Algorithm class for both **inference** and **training**.
The class provides functionality to compute the previous image according to the alpha/beta schedule, as well as to predict noise for training.
The class provides functionality to compute the previous image according to the alpha/beta schedule, as well as to predict noise for training. Also known as **Samplers**.
*Examples*: [DDPM](https://arxiv.org/abs/2006.11239), [DDIM](https://arxiv.org/abs/2010.02502), [PNDM](https://arxiv.org/abs/2202.09778), [DEIS](https://arxiv.org/abs/2204.13902)
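To make this concrete, a minimal denoising loop might look like the following sketch (the checkpoint id is only an example; any unconditional `UNet2DModel` works the same way):

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")  # example checkpoint
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)

sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample
    # the scheduler turns the model output into the previous, slightly less noisy sample
    sample = scheduler.step(noise_pred, t, sample).prev_sample
```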

<p align="center">
3 changes: 2 additions & 1 deletion docs/source/api/pipelines/ddim.mdx
@@ -20,7 +20,8 @@ The abstract of the paper is the following:

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.

The original codebase of this paper can be found [here](https://github.com/ermongroup/ddim).
The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
For questions, feel free to contact the author on [tsong.me](https://tsong.me/).

## Available Pipelines:

5 changes: 5 additions & 0 deletions docs/source/api/pipelines/latent_diffusion.mdx
@@ -33,10 +33,15 @@ The original codebase can be found [here](https://github.com/CompVis/latent-diff
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_latent_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) | *Text-to-Image Generation* | - |
| [pipeline_latent_diffusion_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py) | *Super Resolution* | - |

## Examples:
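A minimal sketch of the new super-resolution pipeline (the checkpoint id and exact argument names below are assumptions for illustration, not taken from this page):

```python
import torch
from PIL import Image
from diffusers import LDMSuperResolutionPipeline

pipeline = LDMSuperResolutionPipeline.from_pretrained("CompVis/ldm-super-resolution-4x-openimages")  # assumed checkpoint id
pipeline = pipeline.to("cuda")

low_res = Image.open("low_res.png").convert("RGB").resize((128, 128))
upscaled = pipeline(low_res, num_inference_steps=100, eta=1.0).images[0]
upscaled.save("upscaled.png")
```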


## LDMTextToImagePipeline
[[autodoc]] pipelines.latent_diffusion.pipeline_latent_diffusion.LDMTextToImagePipeline
- __call__

## LDMSuperResolutionPipeline
[[autodoc]] pipelines.latent_diffusion.pipeline_latent_diffusion_superresolution.LDMSuperResolutionPipeline
- __call__
2 changes: 1 addition & 1 deletion docs/source/api/schedulers.mdx
@@ -16,7 +16,7 @@ Diffusers contains multiple pre-built schedule functions for the diffusion proce

## What is a scheduler?

The schedule functions, denoted *Schedulers* in the library, take in the output of a trained model, a sample which the diffusion process is iterating on, and a timestep, and return a denoised sample.
The schedule functions, denoted *Schedulers* in the library, take in the output of a trained model, a sample which the diffusion process is iterating on, and a timestep, and return a denoised sample. That's why schedulers may also be called *Samplers* in other diffusion model implementations.

- Schedulers define the methodology for iteratively adding noise to an image or for updating a sample based on model outputs.
- Adding noise in different manners represents the algorithmic processes used to train a diffusion model by adding noise to images (a short sketch follows below).
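As a rough sketch of that training-side use (shapes and values here are purely illustrative):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

clean_images = torch.randn(4, 3, 64, 64)  # stand-in for a batch of training images
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.shape[0],))

# the scheduler produces the noisy samples the model learns to denoise
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
```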
2 changes: 1 addition & 1 deletion docs/source/using-diffusers/img2img.mdx
@@ -33,7 +33,7 @@ url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/st

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))
init_image.thumbnail((768, 768))
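# note: unlike resize((768, 512)), thumbnail keeps the aspect ratio, only ever shrinks
# the image so both sides fit within 768 px, and modifies it in place (it returns None)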

prompt = "A fantasy landscape, trending on artstation"

17 changes: 14 additions & 3 deletions examples/community/README.md
@@ -179,9 +179,20 @@ images = pipe.inpaint(prompt=prompt, init_image=init_image, mask_image=mask_imag
As shown above, this one pipeline can run "text-to-image", "image-to-image", and "inpainting" all in one pipeline.

### Long Prompt Weighting Stable Diffusion

The Pipeline lets you input prompt without 77 token length limit. And you can increase words weighting by using "()" or decrease words weighting by using "[]"
The Pipeline also lets you use the main use cases of the stable diffusion pipeline in a single class.
Features of this custom pipeline:
- Input a prompt without the 77 token length limit.
- Includes txt2img, img2img, and inpainting pipelines.
- Emphasize/weight part of your prompt with parentheses, like so: `a baby deer with (big eyes)`
- De-emphasize part of your prompt, like so: `a [baby] deer with big eyes`
- Precisely weight part of your prompt, like so: `a baby deer with (big eyes:1.3)`

Prompt weighting equivalents:
- `a baby deer with` == `(a baby deer with:1.0)`
- `(big eyes)` == `(big eyes:1.1)`
- `((big eyes))` == `(big eyes:1.21)`
- `[big eyes]` == `(big eyes:0.91)`
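The weights compound multiplicatively, consistent with the equivalents above: each level of `(...)` multiplies the weight by 1.1 and each level of `[...]` divides it by 1.1.

```python
weight = 1.1
print(round(weight ** 2, 2))  # 1.21 -> "((big eyes))"
print(round(1 / weight, 2))   # 0.91 -> "[big eyes]"
```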

You can run this custom pipeline like so:

#### pytorch
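A minimal, illustrative way to load it (the base checkpoint id is just an example and the generation call is simplified):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",         # example base checkpoint
    custom_pipeline="lpw_stable_diffusion",  # loads this community pipeline
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a baby deer with (big eyes:1.3), masterpiece, best quality"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("deer.png")
```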

30 changes: 26 additions & 4 deletions examples/community/clip_guided_stable_diffusion.py
@@ -5,7 +5,14 @@
from torch import nn
from torch.nn import functional as F

from diffusers import AutoencoderKL, DiffusionPipeline, LMSDiscreteScheduler, PNDMScheduler, UNet2DConditionModel
from diffusers import (
AutoencoderKL,
DDIMScheduler,
DiffusionPipeline,
LMSDiscreteScheduler,
PNDMScheduler,
UNet2DConditionModel,
)
from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion import StableDiffusionPipelineOutput
from torchvision import transforms
from transformers import CLIPFeatureExtractor, CLIPModel, CLIPTextModel, CLIPTokenizer
@@ -56,7 +63,7 @@ def __init__(
clip_model: CLIPModel,
tokenizer: CLIPTokenizer,
unet: UNet2DConditionModel,
scheduler: Union[PNDMScheduler, LMSDiscreteScheduler],
scheduler: Union[PNDMScheduler, LMSDiscreteScheduler, DDIMScheduler],
feature_extractor: CLIPFeatureExtractor,
):
super().__init__()
@@ -123,7 +130,7 @@ def cond_fn(
# predict the noise residual
noise_pred = self.unet(latent_model_input, timestep, encoder_hidden_states=text_embeddings).sample

if isinstance(self.scheduler, PNDMScheduler):
if isinstance(self.scheduler, (PNDMScheduler, DDIMScheduler)):
alpha_prod_t = self.scheduler.alphas_cumprod[timestep]
beta_prod_t = 1 - alpha_prod_t
# compute predicted original sample from predicted noise also called
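# "predicted x_0" (Eq. 12 of the DDIM paper, https://arxiv.org/abs/2010.02502),
# i.e. roughly (sample - beta_prod_t ** 0.5 * noise_pred) / alpha_prod_t ** 0.5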
@@ -176,6 +183,7 @@ def __call__(
num_inference_steps: Optional[int] = 50,
guidance_scale: Optional[float] = 7.5,
num_images_per_prompt: Optional[int] = 1,
eta: float = 0.0,
clip_guidance_scale: Optional[float] = 100,
clip_prompt: Optional[Union[str, List[str]]] = None,
num_cutouts: Optional[int] = 4,
@@ -275,6 +283,20 @@ def __call__(
# scale the initial noise by the standard deviation required by the scheduler
latents = latents * self.scheduler.init_noise_sigma

# prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
# eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
# eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
# and should be between [0, 1]
accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
extra_step_kwargs = {}
if accepts_eta:
extra_step_kwargs["eta"] = eta

# check if the scheduler accepts generator
accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
if accepts_generator:
extra_step_kwargs["generator"] = generator

for i, t in enumerate(self.progress_bar(timesteps_tensor)):
# expand the latents if we are doing classifier free guidance
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
@@ -306,7 +328,7 @@ def __call__(
)

# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred, t, latents).prev_sample
latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample

# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
16 changes: 0 additions & 16 deletions examples/dreambooth/train_dreambooth_flax.py
@@ -327,22 +327,6 @@ def main():
if args.seed is not None:
set_seed(args.seed)

if jax.process_index() == 0:
if args.push_to_hub:
if args.hub_model_id is None:
repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token)
else:
repo_name = args.hub_model_id
repo = Repository(args.output_dir, clone_from=repo_name)

with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore:
if "step_*" not in gitignore:
gitignore.write("step_*\n")
if "epoch_*" not in gitignore:
gitignore.write("epoch_*\n")
elif args.output_dir is not None:
os.makedirs(args.output_dir, exist_ok=True)

rng = jax.random.PRNGKey(args.seed)

if args.with_prior_preservation:
11 changes: 9 additions & 2 deletions examples/unconditional_image_generation/train_unconditional.py
@@ -11,10 +11,12 @@
from accelerate import Accelerator
from accelerate.logging import get_logger
from datasets import load_dataset
from diffusers import DDPMPipeline, DDPMScheduler, UNet2DModel
from diffusers import DDPMPipeline, DDPMScheduler, UNet2DModel, __version__
from diffusers.optimization import get_scheduler
from diffusers.training_utils import EMAModel
from diffusers.utils import deprecate
from huggingface_hub import HfFolder, Repository, whoami
from packaging import version
from torchvision.transforms import (
CenterCrop,
Compose,
@@ -28,6 +30,7 @@


logger = get_logger(__name__)
diffusers_version = version.parse(version.parse(__version__).base_version)


def _extract_into_tensor(arr, timesteps, broadcast_shape):
@@ -406,7 +409,11 @@ def transforms(examples):
scheduler=noise_scheduler,
)

generator = torch.manual_seed(0)
deprecate("todo: remove this check", "0.10.0", "when the most used version is >= 0.8.0")
if diffusers_version < version.parse("0.8.0"):
generator = torch.manual_seed(0)
else:
generator = torch.Generator(device=pipeline.device).manual_seed(0)
# run pipeline in inference (sample random noise and denoise)
images = pipeline(
generator=generator,
27 changes: 14 additions & 13 deletions scripts/convert_ldm_original_checkpoint_to_diffusers.py
@@ -112,9 +112,9 @@ def assign_to_checkpoint(
continue

# Global renaming happens here
new_path = new_path.replace("middle_block.0", "mid.resnets.0")
new_path = new_path.replace("middle_block.1", "mid.attentions.0")
new_path = new_path.replace("middle_block.2", "mid.resnets.1")
new_path = new_path.replace("middle_block.0", "mid_block.resnets.0")
new_path = new_path.replace("middle_block.1", "mid_block.attentions.0")
new_path = new_path.replace("middle_block.2", "mid_block.resnets.1")

if additional_replacements is not None:
for replacement in additional_replacements:
@@ -175,15 +175,16 @@ def convert_ldm_checkpoint(checkpoint, config):
attentions = [key for key in input_blocks[i] if f"input_blocks.{i}.1" in key]

if f"input_blocks.{i}.0.op.weight" in checkpoint:
new_checkpoint[f"downsample_blocks.{block_id}.downsamplers.0.conv.weight"] = checkpoint[
new_checkpoint[f"down_blocks.{block_id}.downsamplers.0.conv.weight"] = checkpoint[
f"input_blocks.{i}.0.op.weight"
]
new_checkpoint[f"downsample_blocks.{block_id}.downsamplers.0.conv.bias"] = checkpoint[
new_checkpoint[f"down_blocks.{block_id}.downsamplers.0.conv.bias"] = checkpoint[
f"input_blocks.{i}.0.op.bias"
]
continue

paths = renew_resnet_paths(resnets)
meta_path = {"old": f"input_blocks.{i}.0", "new": f"downsample_blocks.{block_id}.resnets.{layer_in_block_id}"}
meta_path = {"old": f"input_blocks.{i}.0", "new": f"down_blocks.{block_id}.resnets.{layer_in_block_id}"}
resnet_op = {"old": "resnets.2.op", "new": "downsamplers.0.op"}
assign_to_checkpoint(
paths, new_checkpoint, checkpoint, additional_replacements=[meta_path, resnet_op], config=config
@@ -193,18 +194,18 @@ def convert_ldm_checkpoint(checkpoint, config):
paths = renew_attention_paths(attentions)
meta_path = {
"old": f"input_blocks.{i}.1",
"new": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}",
"new": f"down_blocks.{block_id}.attentions.{layer_in_block_id}",
}
to_split = {
f"input_blocks.{i}.1.qkv.bias": {
"key": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}.key.bias",
"query": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}.query.bias",
"value": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}.value.bias",
"key": f"down_blocks.{block_id}.attentions.{layer_in_block_id}.key.bias",
"query": f"down_blocks.{block_id}.attentions.{layer_in_block_id}.query.bias",
"value": f"down_blocks.{block_id}.attentions.{layer_in_block_id}.value.bias",
},
f"input_blocks.{i}.1.qkv.weight": {
"key": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}.key.weight",
"query": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}.query.weight",
"value": f"downsample_blocks.{block_id}.attentions.{layer_in_block_id}.value.weight",
"key": f"down_blocks.{block_id}.attentions.{layer_in_block_id}.key.weight",
"query": f"down_blocks.{block_id}.attentions.{layer_in_block_id}.query.weight",
"value": f"down_blocks.{block_id}.attentions.{layer_in_block_id}.value.weight",
},
}
assign_to_checkpoint(
15 changes: 14 additions & 1 deletion scripts/convert_original_stable_diffusion_to_diffusers.py
@@ -30,6 +30,9 @@
from diffusers import (
AutoencoderKL,
DDIMScheduler,
DPMSolverMultistepScheduler,
EulerAncestralDiscreteScheduler,
EulerDiscreteScheduler,
LDMTextToImagePipeline,
LMSDiscreteScheduler,
PNDMScheduler,
@@ -647,7 +650,7 @@ def convert_ldm_clip_checkpoint(checkpoint):
"--scheduler_type",
default="pndm",
type=str,
help="Type of scheduler to use. Should be one of ['pndm', 'lms', 'ddim']",
help="Type of scheduler to use. Should be one of ['pndm', 'lms', 'ddim', 'euler', 'euler-ancest', 'dpm']",
)
parser.add_argument(
"--extract_ema",
@@ -686,6 +689,16 @@ def convert_ldm_clip_checkpoint(checkpoint):
)
elif args.scheduler_type == "lms":
scheduler = LMSDiscreteScheduler(beta_start=beta_start, beta_end=beta_end, beta_schedule="scaled_linear")
elif args.scheduler_type == "euler":
scheduler = EulerDiscreteScheduler(beta_start=beta_start, beta_end=beta_end, beta_schedule="scaled_linear")
elif args.scheduler_type == "euler-ancestral":
scheduler = EulerAncestralDiscreteScheduler(
beta_start=beta_start, beta_end=beta_end, beta_schedule="scaled_linear"
)
elif args.scheduler_type == "dpm":
scheduler = DPMSolverMultistepScheduler(
beta_start=beta_start, beta_end=beta_end, beta_schedule="scaled_linear"
)
elif args.scheduler_type == "ddim":
scheduler = DDIMScheduler(
beta_start=beta_start,
1 change: 1 addition & 0 deletions src/diffusers/__init__.py
@@ -35,6 +35,7 @@
DDPMPipeline,
KarrasVePipeline,
LDMPipeline,
LDMSuperResolutionPipeline,
PNDMPipeline,
RePaintPipeline,
ScoreSdeVePipeline,
3 changes: 3 additions & 0 deletions src/diffusers/models/attention.py
@@ -557,6 +557,9 @@ def _sliced_attention(self, query, key, value, sequence_length, dim):
return hidden_states

def _memory_efficient_attention_xformers(self, query, key, value):
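# xformers' fused attention kernels generally require contiguous inputs, so the reshaped
# query/key/value tensors are made contiguous before calling memory_efficient_attention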
query = query.contiguous()
key = key.contiguous()
value = value.contiguous()
hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=None)
hidden_states = self.reshape_batch_dim_to_heads(hidden_states)
return hidden_states