From aae27262f408957ff53c64b3a18581959e8fd8e0 Mon Sep 17 00:00:00 2001
From: Harutatsu Akiyama
Date: Sun, 30 Jul 2023 14:37:10 +1000
Subject: [PATCH] [SDXL-IP2P] Add gif for demonstrating training processes (#4342)

* [SDXL-IP2P] Add gif for demonstrating training processes

* [SDXL-IP2P] Add gif for demonstrating training processes

* [SDXL-IP2P] Change gif to URLs

* [SDXL-IP2P] Add URLs in case gif now show

---------

Co-authored-by: Harutatsu Akiyama
---
 examples/instruct_pix2pix/README_sdxl.md | 45 ++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/examples/instruct_pix2pix/README_sdxl.md b/examples/instruct_pix2pix/README_sdxl.md
index db3267b18d57..3d521916b47b 100644
--- a/examples/instruct_pix2pix/README_sdxl.md
+++ b/examples/instruct_pix2pix/README_sdxl.md
@@ -146,3 +146,48 @@ Particularly, `image_guidance_scale` and `guidance_scale` can have a profound im
 pact on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).
 
 If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd).
+
+## Comparison between SD and SDXL
+
+We aim to understand the differences resulting from the use of SD-1.5 and SDXL-0.9 as pretrained models. To achieve this, we trained on the [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) with both of these pretrained models. The training script is as follows:
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5" # or "stabilityai/stable-diffusion-xl-base-0.9"
+export DATASET_ID="fusing/instructpix2pix-1000-samples"
+
+CUDA_VISIBLE_DEVICES=1 python train_instruct_pix2pix.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --dataset_name=$DATASET_ID \
+  --use_ema \
+  --enable_xformers_memory_efficient_attention \
+  --resolution=512 --random_flip \
+  --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
+  --max_train_steps=15000 \
+  --checkpointing_steps=5000 --checkpoints_total_limit=1 \
+  --learning_rate=5e-05 --lr_warmup_steps=0 \
+  --conditioning_dropout_prob=0.05 \
+  --seed=42 \
+  --val_image_url="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
+  --validation_prompt="make it in Japan" \
+  --report_to=wandb
+```
+
+We found that, compared to training with SD-1.5 as the pretrained model, SDXL-0.9 reaches a lower training loss (0.0599 for SD-1.5 vs. 0.0254 for SDXL). Moreover, from a visual perspective, the results obtained with SDXL show fewer artifacts and richer detail. Notably, SDXL starts to preserve the structure of the original image earlier on.
+
+The following two GIFs provide intuitive visual results. We observed, at each step, what kind of results could be achieved using the image
+<p align="center">
+    <img src="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" alt="input for make it Japan"/>
+</p>
+with "make it in Japan” as the prompt. It can be seen that SDXL starts preserving the details of the original image earlier, resulting in higher fidelity outcomes sooner. + +* SD-1.5: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd_ip2p_training_val_img_progress.gif + +

+ input for make it Japan +

+
+* SDXL: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_ip2p_training_val_img_progress.gif
+
+<p align="center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_ip2p_training_val_img_progress.gif" alt="input for make it Japan"/>
+</p>
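+
+To get a feel for how `image_guidance_scale` and `guidance_scale` (discussed earlier in this README) shape the edited image once a checkpoint has been trained, the minimal inference sketch below can be used with the `StableDiffusionXLInstructPix2PixPipeline` class. It assumes the fine-tuned pipeline was saved to a local `instruct-pix2pix-model` directory (a placeholder; substitute your own `--output_dir`), and the parameter values are just a starting point; the validation image and prompt are the ones used in the script above:
+
+```python
+import torch
+from diffusers import StableDiffusionXLInstructPix2PixPipeline
+from diffusers.utils import load_image
+
+# Placeholder path to the fine-tuned pipeline; replace with your own --output_dir.
+pipeline = StableDiffusionXLInstructPix2PixPipeline.from_pretrained(
+    "instruct-pix2pix-model", torch_dtype=torch.float16
+).to("cuda")
+
+# Validation image used during training, resized to the training resolution.
+url = "https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg"
+image = load_image(url).resize((512, 512))
+
+# `guidance_scale` pulls the edit toward the text prompt, while
+# `image_guidance_scale` pulls it toward the original input image.
+edited_image = pipeline(
+    "make it in Japan",
+    image=image,
+    num_inference_steps=30,
+    guidance_scale=7.5,
+    image_guidance_scale=1.5,
+).images[0]
+edited_image.save("edited_image.png")
+```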