SD tutorial and benchmarks (deepspeedai#90)
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
4 people authored Nov 9, 2022
1 parent edc5cd2 commit 79b56af
Showing 15 changed files with 333 additions and 5 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ build.txt
.theia
.cache
__pycache__
mii/version.py
Binary file added docs/images/sd-hero-dark.png
Binary file added docs/images/sd-hero.png
201 changes: 201 additions & 0 deletions examples/benchmark/txt2img/README.md
@@ -0,0 +1,201 @@
# Stable Diffusion Image Generation under 1 second w. DeepSpeed MII

<div align="center">
<img src="../../../docs/images/sd-hero.png#gh-light-mode-only">
<img src="../../../docs/images/sd-hero-dark.png#gh-dark-mode-only">
</div>

In this tutorial you will learn how to deploy [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4) with state-of-the-art performance optimizations from [DeepSpeed Inference](https://github.com/microsoft/deepspeed) and [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii). In addition to deploying the model, we will run several performance evaluations.

The performance results above utilized NVIDIA GPUs from Azure: [ND96amsr\_A100\_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series) (NVIDIA A100-80GB) and [ND96asr\_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series) (A100-40GB). We have also used MII-Public with NVIDIA RTX-A6000 GPUs and will include those results at a future date.

## Outline
* [Stable Diffusion Optimizations with DeepSpeed-MII](#stable-diffusion-optimizations-with-deepspeed-mii)
* [Environment and dependency setup](#environment-and-dependency-setup)
* [Evaluation methodology](#evaluation-methodology)
* [Deploy baseline Stable Diffusion with diffusers](#deploy-baseline-stable-diffusion-with-diffusers)
* [Deploy Stable Diffusion with MII-Public](#deploy-stable-diffusion-with-mii-public)
* [Deploy Stable Diffusion with MII-Azure](#deploy-stable-diffusion-with-mii-azure)

## Stable Diffusion Optimizations with DeepSpeed-MII

DeepSpeed-MII will automatically inject a wide range of optimizations from DeepSpeed-Inference to accelerate Stable Diffusion deployments. We list the optimizations below:

1. FlashAttention for UNet cross-attention
   * The implementation is adapted from [Triton](https://github.com/openai/triton)'s FlashAttention and further tuned to accelerate Stable Diffusion-specific scenarios.
2. UNet channel-last memory format
   * Faster convolution performance using the NHWC data layout (see the short channels-last sketch below)
   * Removal of NHWC <--> NCHW data layout conversion through NHWC implementation of missing operators
3. Custom CUDA implementations of:
   * LayerNorm
   * Cross-attention
4. [CUDA Graph](https://developer.nvidia.com/blog/cuda-graphs/) for VAE, UNet, and CLIP encoder
5. Custom CUDA implementations of:
   * GroupNorm
   * Fusion across multiple elementwise operators
6. Partial UNet INT8 quantization via [ZeroQuant](https://arxiv.org/abs/2206.01861)
7. Exploitation of coarse-grained computation sparsity

The first four optimizations are available via MII-Public, while the rest are available via MII-Azure (see the DeepSpeed-MII repository to read more about MII-Public and MII-Azure). In the rest of this tutorial, we will show how you can deploy Stable Diffusion with both MII-Public and MII-Azure.
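
To make the channels-last optimization a bit more concrete, here is a minimal PyTorch sketch of what the NHWC memory format looks like for a single convolution. This is illustration only; MII injects its NHWC kernels into the UNet automatically, and the layer sizes below are hypothetical:

```python
import torch

# Illustration only: channels-last (NHWC) memory format for a convolutional
# module. MII applies this layout to the UNet automatically; you do not need
# to do this by hand. The layer sizes here are just for demonstration.
conv = torch.nn.Conv2d(4, 320, kernel_size=3, padding=1).half().cuda()
conv = conv.to(memory_format=torch.channels_last)

x = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")
x = x.to(memory_format=torch.channels_last)  # same shape, NHWC strides
y = conv(x)  # cuDNN can now pick NHWC-optimized kernels
print(y.is_contiguous(memory_format=torch.channels_last))  # typically True
```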

Keep an eye on the [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) repo and this tutorial for further updates and a deeper dive into these and future performance optimizations.

## Environment and dependency setup

Install [DeepSpeed](https://pypi.org/project/deepspeed/) and [DeepSpeed-MII](https://pypi.org/project/mii/) via pip. For this tutorial you'll want to include the `sd` extra with DeepSpeed; this adds a few extra dependencies that enable the optimizations covered here.

```bash
pip install deepspeed[sd] deepspeed-mii
```

> **Note**
> The DeepSpeed version used in the rest of this tutorial uses [this branch](https://github.com/microsoft/DeepSpeed/pull/2491), which will be merged into master and released as part of DeepSpeed v0.7.5 later this week.

To check that your DeepSpeed install is set up correctly, run `ds_report` from your command line. This will show which versions of DeepSpeed, PyTorch, and nvcc will be used at runtime. The bottom half of the `ds_report` output for our setup is shown below:

```
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.9/dist-packages/torch']
torch version .................... 1.12.1+cu116
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/usr/local/lib/python3.9/dist-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
```

You can see we are running PyTorch 1.12.1 built against CUDA 11.6, and our nvcc version (11.6) is properly aligned with the installed torch version. This alignment is highly recommended by both us and NVIDIA.
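
If you want to double-check this pairing outside of `ds_report`, a quick sanity check (just a convenience snippet, not part of the tutorial scripts) is:

```python
import torch

print(torch.__version__)         # e.g. 1.12.1+cu116
print(torch.version.cuda)        # e.g. 11.6 -- should match the nvcc version reported by ds_report
print(torch.cuda.is_available()) # True if a CUDA-capable GPU is visible
```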

The [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit), which includes `nvcc`, is required for DeepSpeed. All of the custom CUDA/C++ ops used in this tutorial are compiled in under 30 seconds the first time you run the model. By comparison, some similar projects can take upwards of 30 minutes to compile and get running.

Some additional environment context for reproducibility:
* deepspeed==0.7.5
* deepspeed-mii==0.0.3
* torch==1.12.1+cu116
* diffusers==0.7.1
* transformers==4.24.0
* triton==2.0.0.dev20221005
* Ubuntu 20.04.4 LTS
* Python 3.9.15

## Evaluation methodology

The evaluations in this tutorial are done with the full `diffusers` end-to-end pipeline. The primary steps of the pipeline include CLIP, UNet, VAE, the safety checker, and PIL conversion. We see that in `diffusers` v0.7.1 the resulting image tensors are moved from GPU to CPU for both safety checking and PIL conversion ([essentially this code block](https://github.com/huggingface/diffusers/blob/v0.7.1/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L406-L420)). We observe 40-60 ms for the safety checker and PIL conversion in our setup, but these times can vary more dramatically in shared CPU environments.

We run the same text prompt for single and multi-batch cases. In addition, we run each method for 10 trials and report the median value.
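
Concretely, the timing loop behind these numbers looks roughly like the sketch below, a condensed version of the `benchmark` helper in [utils.py](utils.py) shown in full later (`run_pipeline` and `inputs` are placeholders for whichever pipeline and prompts are being measured):

```python
import time

import numpy
import torch

durations = []
for trial in range(10):
    torch.cuda.synchronize()            # make sure prior GPU work has finished
    start = time.perf_counter()
    with torch.inference_mode():
        results = run_pipeline(inputs)  # placeholder for the pipeline call being measured
    torch.cuda.synchronize()
    duration = time.perf_counter() - start
    durations.append(duration)
    print(f"trial={trial}, time_taken={duration:.4f}")
print(f"median duration: {numpy.median(durations):.4f}")
```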

## Deploy baseline Stable Diffusion with diffusers

Let's first deploy the baseline Stable Diffusion from the [diffusers tutorial](https://github.com/huggingface/diffusers#text-to-image-generation-with-stable-diffusion). In this example we will show performance on an NVIDIA A100-40GB GPU from Azure. We've modified their example to use an explicit auth token for downloading the model. If you do not already have one, you can create a token by going to your [Hugging Face settings](https://huggingface.co/settings/tokens) and clicking on the `New Token` button. You will also need to accept the license of [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to be able to download it.

Going forward we will refer to [baseline-sd.py](baseline-sd.py) to run and benchmark a non-MII accelerated baseline.

We utilize the `StableDiffusionPipeline` from diffusers to download and set up the model and move it to our GPU via:

```python
import torch
import diffusers

hf_auth_key = "hf_xxxxxxxxxxx"  # your Hugging Face auth token
pipe = diffusers.StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=hf_auth_key,
    torch_dtype=torch.float16,
    revision="fp16").to("cuda")
```

In general, we're able to use this `pipe` to generate an image from text prompts; here is an example:

```python
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("horse-on-mars.png")
```

We use the `diffusers` [StableDiffusionPipeline](https://huggingface.co/docs/diffusers/v0.7.0/en/api/pipelines/stable_diffusion#diffusers.StableDiffusionPipeline.__call__) defaults for image height/width (512x512) and number of inference steps (50). Changing these values will impact the latency/throughput you see.
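
For example, the same call with those defaults spelled out (the parameter names follow the `StableDiffusionPipeline.__call__` signature linked above):

```python
# Making the defaults explicit; raising the resolution or the number of
# steps will increase latency roughly proportionally.
image = pipe("a photo of an astronaut riding a horse on mars",
             height=512,
             width=512,
             num_inference_steps=50).images[0]
```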

For your convenience we've created a runnable script that sets up the pipeline, runs an example, and runs a benchmark. You can run this example via:

```bash
export HF_AUTH_TOKEN=hf_xxxxxxxx
python baseline-sd.py
```

We've created a helper benchmark utility in [utils.py](utils.py) that adds basic timing around each image generation, prints the results, and saves the images.

You can modify the `baseline-sd.py` script to use different batch sizes; in this case we run batch size 1 to evaluate a latency-sensitive scenario.
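
A larger batch is simply a longer list of prompts passed to the same pipeline, mirroring how `baseline-sd.py` constructs its input (a sketch with a hypothetical batch size of 4):

```python
batch_size = 4  # this tutorial uses 1 for the latency-sensitive comparison
prompts = ["a photo of an astronaut riding a horse on mars"] * batch_size
images = pipe(prompts).images  # one generated image per prompt
```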

Here is what we observe in terms of A100-40GB performance over 10 trials with the same prompt:

```
trial=0, time_taken=2.2166
trial=1, time_taken=2.2143
trial=2, time_taken=2.2205
trial=3, time_taken=2.2106
trial=4, time_taken=2.2105
trial=5, time_taken=2.2254
trial=6, time_taken=2.2044
trial=7, time_taken=2.2304
trial=8, time_taken=2.2078
trial=9, time_taken=2.2097
median duration: 2.2125
```

## Deploy Stable Diffusion with MII-Public

MII-Public improves latency by up to 1.8x compared to several baselines (see the image at the top of this tutorial). To create an MII-Public deployment, simply provide your Hugging Face auth token in an `mii_config` and tell MII what model and task you want to deploy via the `mii.deploy` API.

```python
import mii

mii_config = {
    "dtype": "fp16",
    "hf_auth_token": "hf_xxxxxxxxxxxxxxx"
}

mii.deploy(task='text-to-image',
           model="CompVis/stable-diffusion-v1-4",
           deployment_name="sd_deploy",
           mii_config=mii_config)
```

The above code will deploy Stable Diffusion on your local machine using the DeepSpeed inference open-source optimizations listed above. It will keep the deployment **persistent** and expose a gRPC interface for you to make repeated queries via command-line or from custom applications. See below for how to make queries to your MII deployment:

```python
import mii
generator = mii.mii_query_handle("sd_deploy")
prompt = "a photo of an astronaut riding a horse on mars"
image = generator.query({'query': prompt}).images[0]
image.save("horse-on-mars.png")
```

We've packaged up all that you need to deploy, query, and tear down an SD MII deployment in [mii-sd.py](mii-sd.py) which we will refer to going forward. You can run this example via:

```bash
export HF_AUTH_TOKEN=hf_xxxxxxxx
python mii-sd.py
```

We use the same helper benchmark utility in [utils.py](utils.py) as we did in the baseline to evaluate the MII deployment.

Similar to the baseline, you can modify the `mii-sd.py` script to use different batch sizes; for comparison purposes we run with batch size 1 to evaluate a latency-sensitive scenario.
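
As with the baseline, batching just means a longer list of prompts inside the query dictionary, mirroring how `mii-sd.py` builds its input (a sketch with a hypothetical batch size of 4, reusing the `generator` handle from above):

```python
batch_size = 4  # this tutorial uses 1 for the latency-sensitive comparison
prompts = {"query": ["a photo of an astronaut riding a horse on mars"] * batch_size}
results = generator.query(prompts)  # one generated image per prompt
```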

Here is what we observe in terms of A100-40GB performance over 10 trials with the same prompt:

```
trial=0, time_taken=1.3935
trial=1, time_taken=1.2091
trial=2, time_taken=1.2110
trial=3, time_taken=1.2068
trial=4, time_taken=1.2064
trial=5, time_taken=1.2002
trial=6, time_taken=1.2063
trial=7, time_taken=1.2062
trial=8, time_taken=1.2069
trial=9, time_taken=1.2063
median duration: 1.2065
```

## Deploy Stable Diffusion with MII-Azure

Continue to watch this space for updates in the coming weeks on how to get access to MII-Azure. We will be providing two options for users to access these optimizations: (1) Azure VM Image and (2) AzureML endpoint deployments.

An Azure VM image will be released via the [Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/) in late November 2022. This VM will have MII-Azure pre-installed with all the required components to get started with models like Stable Diffusion and other MII-supported models. The VM image itself will be free; you only pay for the Azure compute costs associated with the instance(s) you use it with.

A more comprehensive [AzureML (AML)](https://azure.microsoft.com/en-us/free/machine-learning/) deployment option is also in the works to make deploying with MII-Azure on AML quick and easy. Keep watching our MII repo for more updates on this release.
28 changes: 28 additions & 0 deletions examples/benchmark/txt2img/baseline-sd.py
@@ -0,0 +1,28 @@
import os
import torch
import diffusers
from utils import benchmark

# Get HF auth key from environment or replace with key
hf_auth_key = os.environ["HF_AUTH_TOKEN"]

trials = 10
batch_size = 1
save_path = "."

# Setup the stable diffusion pipeline via the diffusers pipeline api
pipe = diffusers.StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4",
                                                         use_auth_token=hf_auth_key,
                                                         torch_dtype=torch.float16,
                                                         revision="fp16").to("cuda")

# Create batch size number of prompts
prompts = ["a photo of an astronaut riding a horse on mars"] * batch_size

# Example usage of diffusers pipeline
results = pipe(prompts)
for idx, img in enumerate(results.images):
    img.save(os.path.join(save_path, f"baseline-img{idx}.png"))

# Evaluate performance of pipeline
benchmark(pipe, prompts, save_path, trials, "baseline")
31 changes: 31 additions & 0 deletions examples/benchmark/txt2img/mii-sd.py
@@ -0,0 +1,31 @@
import os
import mii
from utils import benchmark

# Get HF auth key from environment or replace with key
hf_auth_key = os.environ["HF_AUTH_TOKEN"]

trials = 10
batch_size = 1
save_path = "."
deploy_name = "sd_deploy"

# Deploy Stable Diffusion w. MII
mii_config = {"dtype": "fp16", "hf_auth_token": hf_auth_key}
mii.deploy(task='text-to-image',
           model="CompVis/stable-diffusion-v1-4",
           deployment_name=deploy_name,
           mii_config=mii_config)

# Example usage of MII deployment
pipe = mii.mii_query_handle(deploy_name)
prompts = {"query": ["a photo of an astronaut riding a horse on mars"] * batch_size}
results = pipe.query(prompts)
for idx, img in enumerate(results.images):
    img.save(os.path.join(save_path, f"mii-img{idx}.png"))

# Evaluate performance of MII
benchmark(pipe.query, prompts, save_path, trials, "mii")

# Tear down the persistent deployment
mii.terminate(deploy_name)
3 changes: 3 additions & 0 deletions examples/benchmark/txt2img/requirements.txt
@@ -0,0 +1,3 @@
deepspeed>=0.7.4
deepspeed-mii>=0.0.3
diffusers>=0.6.0
36 changes: 36 additions & 0 deletions examples/benchmark/txt2img/utils.py
@@ -0,0 +1,36 @@
import os
import torch
import time
import deepspeed
import mii
import numpy
import diffusers
import transformers

from packaging import version

assert version.parse(diffusers.__version__) >= version.parse('0.7.1'), "diffusers must be 0.7.1+"
assert version.parse(mii.__version__) >= version.parse("0.0.3"), "mii must be 0.0.3+"
assert version.parse(deepspeed.__version__) >= version.parse("0.7.5"), "deepspeed must be 0.7.5+"
assert version.parse(transformers.__version__) >= version.parse("4.24.0"), "transformers must be 4.24.0+"


def benchmark(func, inputs, save_path=".", trials=5, tag="", save=True):
    # Turn off the tqdm progress bar
    if hasattr(func, "set_progress_bar_config"):
        func.set_progress_bar_config(disable=True)

    durations = []
    for trial in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.inference_mode():
            results = func(inputs)
        torch.cuda.synchronize()
        duration = time.perf_counter() - start
        durations.append(duration)
        print(f"trial={trial}, time_taken={duration:.4f}")
        if save:
            for idx, img in enumerate(results.images):
                img.save(os.path.join(save_path, f"{tag}-trial{trial}-img{idx}.png"))
    print(f"median duration: {numpy.median(durations):.4f}")
6 changes: 6 additions & 0 deletions mii/__init__.py
@@ -7,3 +7,9 @@

from .config import MIIConfig
from .grpc_related.proto import modelresponse_pb2_grpc

__version__ = "0.0.0"
try:
    from .version import __version__
except ImportError:
    pass
1 change: 1 addition & 0 deletions mii/config.py
@@ -14,6 +14,7 @@ class MIIConfig(BaseModel):
hf_auth_token: str = None
replace_with_kernel_inject: bool = True
profile_model_time: bool = False
skip_model_check: bool = False

@validator('dtype')
def dtype_valid(cls, value):
2 changes: 1 addition & 1 deletion mii/constants.py
@@ -107,4 +107,4 @@ class ModelProvider(enum.Enum):

MII_MODEL_PATH_DEFAULT = "/tmp/mii_models"

GRPC_MAX_MSG_SIZE = 2**30 # 1GB
GRPC_MAX_MSG_SIZE = 2**27 # ~100MB
8 changes: 5 additions & 3 deletions mii/deployment.py
@@ -77,9 +77,11 @@ def deploy(task,
assert set(deployment_name) <= allowed_chars, "AML deployment names can only contain a-z, A-Z, 0-9, and '-'"

task = mii.utils.get_task(task)
mii.utils.check_if_task_and_model_is_valid(task, model)
if enable_deepspeed:
mii.utils.check_if_task_and_model_is_supported(task, model)

if not mii_config.skip_model_check:
mii.utils.check_if_task_and_model_is_valid(task, model)
if enable_deepspeed:
mii.utils.check_if_task_and_model_is_supported(task, model)

if enable_deepspeed:
logger.info(
1 change: 1 addition & 0 deletions mii/models/providers/diffusers.py
@@ -15,4 +15,5 @@ def diffusers_provider(model_path, model_name, task_name, mii_config):
use_auth_token=mii_config.hf_auth_token,
**kwargs)
pipeline = pipeline.to(f"cuda:{local_rank}")
pipeline.set_progress_bar_config(disable=True)
return pipeline
9 changes: 8 additions & 1 deletion mii/server_client.py
@@ -15,6 +15,7 @@
from mii.utils import logger, kwarg_dict_to_proto
from mii.grpc_related.proto import modelresponse_pb2, modelresponse_pb2_grpc
from mii.models.utils import ImageResponse
from mii.constants import GRPC_MAX_MSG_SIZE


def mii_query_handle(deployment_name):
@@ -238,7 +239,13 @@ def print_helper(self, args):
def _initialize_grpc_client(self):
channels = []
for i in range(self.num_gpus):
channel = grpc.aio.insecure_channel(f'localhost:{self.port_number + i}')
channel = grpc.aio.insecure_channel(f'localhost:{self.port_number + i}',
options=[
('grpc.max_send_message_length',
GRPC_MAX_MSG_SIZE),
('grpc.max_receive_message_length',
GRPC_MAX_MSG_SIZE)
])
stub = modelresponse_pb2_grpc.ModelResponseStub(channel)
channels.append(channel)
self.stubs.append(stub)
11 changes: 11 additions & 0 deletions setup.py
@@ -63,8 +63,19 @@ def command_exists(cmd):
# None of the above, probably installing from source
version_str += f'+{git_hash}'

# write out installed version
with open("mii/version.py", 'w') as fd:
fd.write(f"__version__ = '{version_str}'\n")

# Parse README.md to make long_description for PyPI page.
thisdir = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(thisdir, 'README.md'), encoding='utf-8') as fin:
readme_text = fin.read()

setup(name="deepspeed-mii",
version=version_str,
long_description=readme_text,
long_description_content_type='text/markdown',
description='deepspeed mii',
author='DeepSpeed Team',
author_email='[email protected]',