Commit

[Serve][Doc] Multiplexing doc (ray-project#35701)

sihanwang41 authored May 26, 2023
1 parent 3d124cc commit e92e554

Showing 5 changed files with 138 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/_toc.yml
@@ -284,6 +284,7 @@ parts:
- file: serve/managing-java-deployments
- file: serve/migration
- file: serve/direct-ingress
- file: serve/model-multiplexing
- file: serve/architecture
- file: serve/tutorials/index
sections:
40 changes: 40 additions & 0 deletions doc/source/serve/doc_code/multiplexed.py
@@ -0,0 +1,40 @@
# __serve_deployment_example_begin__

from ray import serve
import aioboto3
import torch
import starlette.requests


@serve.deployment
class ModelInferencer:
    def __init__(self):
        self.bucket_name = "my_bucket"

    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Download the requested model from S3 and load it into memory.
        # Serve caches the returned model per model_id, holding at most
        # max_num_models_per_replica models on this replica.
        session = aioboto3.Session()
        async with session.resource("s3") as s3:
            obj = await s3.Bucket(self.bucket_name)
            await obj.download_file(f"{model_id}/model.pt", f"model_{model_id}.pt")
            return torch.load(f"model_{model_id}.pt")

    async def __call__(self, request: starlette.requests.Request):
        # Read the model ID from the "serve_multiplexed_model_id" request header
        # and fetch the corresponding (possibly cached) model.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model.forward(torch.rand(64, 3, 512, 512))


entry = ModelInferencer.bind()

# __serve_deployment_example_end__

serve.run(entry)

# __serve_request_send_example_begin__
import requests  # noqa: E402

resp = requests.get(
    "http://localhost:8000", headers={"serve_multiplexed_model_id": "1"}
)
# __serve_request_send_example_end__
71 changes: 71 additions & 0 deletions doc/source/serve/model-multiplexing.md
@@ -0,0 +1,71 @@
# Model Multiplexing

This section helps you understand how to write a multiplexed deployment using the `serve.multiplexed` and `serve.get_multiplexed_model_id` APIs.

This is an experimental feature and the API may change in the future. You are welcome to try it out and give us feedback!

## Why model multiplexing?

Model multiplexing is a technique for efficiently serving multiple models with similar input types from a shared pool of replicas, which helps optimize cost. Traffic is routed to the appropriate model based on a request header. This is useful when you have many models with the same shape but different weights that are invoked sparsely. If any replica of the deployment already has a model loaded, incoming traffic for that model (based on the request header) is automatically routed to that replica, avoiding unnecessary load time.

## Writing a multiplexed deployment

To write a multiplexed deployment, use the `serve.multiplexed` and `serve.get_multiplexed_model_id` APIs.

Assume you have multiple Torch models stored in an AWS S3 bucket with the following structure:
```
s3://my_bucket/1/model.pt
s3://my_bucket/2/model.pt
s3://my_bucket/3/model.pt
s3://my_bucket/4/model.pt
...
```
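
If you need to create this layout for testing, a minimal sketch using `boto3` might look like the following (the bucket name and local file names are assumptions that mirror the example below):

```python
import boto3

# Hypothetical one-time setup: upload each locally saved model under
# "<model_id>/model.pt" so the layout matches what the deployment expects.
s3 = boto3.client("s3")
for model_id in ["1", "2", "3", "4"]:
    s3.upload_file(f"model_{model_id}.pt", "my_bucket", f"{model_id}/model.pt")
```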

Define a multiplexed deployment:
```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_deployment_example_begin__
:end-before: __serve_deployment_example_end__
```

:::{note}
The `serve.multiplexed` API takes a `max_num_models_per_replica` parameter that configures how many models a single replica can hold. If the number of models exceeds `max_num_models_per_replica`, Serve evicts the least recently used (LRU) model.
:::

:::{tip}
This code example uses a PyTorch model object. You can also define your own model class and use it here. To release resources when a model is evicted, implement the `__del__` method; Ray Serve calls it internally when the model is evicted.
:::
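
For illustration, here is a minimal sketch of such a custom model class; the wrapper name and the cleanup it performs are hypothetical, not part of the Serve API:

```python
import torch


class MyModelWrapper:
    """Hypothetical wrapper owning resources that should be freed on eviction."""

    def __init__(self, path: str):
        # Assumption for this sketch: load weights onto a GPU when available.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = torch.load(path, map_location=self.device)

    def forward(self, batch: torch.Tensor) -> torch.Tensor:
        return self.model(batch.to(self.device))

    def __del__(self):
        # Ray Serve calls __del__ when this model is evicted from the replica
        # cache; release anything the garbage collector won't reclaim promptly.
        del self.model
        if self.device == "cuda":
            torch.cuda.empty_cache()
```

In the example above, `get_model` would then return `MyModelWrapper(f"model_{model_id}.pt")` instead of the raw `torch.load` result.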


`serve.get_multiplexed_model_id` retrieves the model ID from the request header, and that ID is then passed into the `get_model` function. If the model is not already loaded on the replica, Serve loads it from the S3 bucket and caches it on the replica; otherwise, Serve returns the cached model.

:::{note}
Internally, the Serve router routes traffic to the corresponding replica based on the model ID in the request header.
If all replicas holding the model are over-subscribed, Ray Serve sends the request to a replica that doesn't have the model loaded; that replica then loads the model from the S3 bucket and caches it.
:::

To send a request to a specific model, include the field `serve_multiplexed_model_id` in the request header, and set the value to the model ID to which you want to send the request.
```{literalinclude} doc_code/multiplexed.py
:language: python
:start-after: __serve_request_send_example_begin__
:end-before: __serve_request_send_example_end__
```

:::{note}
The `serve_multiplexed_model_id` field is required in the request header, and its value should be the ID of the model you want to target.
:::

After you run the above code, you should see the following lines in the deployment logs:
```
INFO 2023-05-24 01:19:03,853 default_Model default_Model#EjYmnQ CUpzhwUUNw / default replica.py:442 - Started executing request CUpzhwUUNw
INFO 2023-05-24 01:19:03,854 default_Model default_Model#EjYmnQ CUpzhwUUNw / default multiplex.py:131 - Loading model '1'.
INFO 2023-05-24 01:19:04,859 default_Model default_Model#EjYmnQ CUpzhwUUNw / default replica.py:542 - __CALL__ OK 1005.8ms
```

If you continue to load more models and exceed `max_num_models_per_replica`, the least recently used model is evicted and you see the following lines in the deployment logs:
```
INFO 2023-05-24 01:19:15,988 default_Model default_Model#rimNjA WzjTbJvbPN / default replica.py:442 - Started executing request WzjTbJvbPN
INFO 2023-05-24 01:19:15,988 default_Model default_Model#rimNjA WzjTbJvbPN / default multiplex.py:145 - Unloading model '3'.
INFO 2023-05-24 01:19:15,988 default_Model default_Model#rimNjA WzjTbJvbPN / default multiplex.py:131 - Loading model '4'.
INFO 2023-05-24 01:19:16,993 default_Model default_Model#rimNjA WzjTbJvbPN / default replica.py:542 - __CALL__ OK 1005.7ms
```
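
To reproduce this behavior, request more distinct model IDs than `max_num_models_per_replica` allows. The following sketch assumes the deployment above is running locally and that models `1` through `4` exist in the bucket:

```python
import requests

# With max_num_models_per_replica=3, requesting a fourth model forces the
# replica to evict its least recently used model before loading model "4".
for model_id in ["1", "2", "3", "4"]:
    resp = requests.get(
        "http://localhost:8000",
        headers={"serve_multiplexed_model_id": model_id},
    )
    print(model_id, resp.status_code)
```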
26 changes: 25 additions & 1 deletion doc/source/serve/production-guide/monitoring.md
@@ -307,8 +307,32 @@ The following metrics are exposed by Ray Serve:
* - ``serve_http_request_latency_ms`` [*]
- * route
* application
- The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy).
* - ``serve_multiplexed_model_load_latency_s``
- * deployment
* replica
* application
- The time it takes to load a model.
* - ``serve_multiplexed_model_unload_latency_s``
- * deployment
* replica
* application
- The time it takes to unload a model.
* - ``serve_num_multiplexed_models``
- * deployment
* replica
* application
- The number of models loaded on the current replica.
* - ``serve_multiplexed_models_unload_counter``
- * deployment
* replica
* application
- The number of times models have been unloaded on the current replica.
* - ``serve_multiplexed_models_load_counter``
- * deployment
* replica
* application
- The number of times models have been loaded on the current replica.
```
[*] - only available when using HTTP calls
[**] - only available when using Python `ServeHandle` calls
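
As a quick sanity check, you might query one of the multiplexed metrics through a Prometheus server that scrapes your Ray cluster; the Prometheus address and the `ray_` prefix on exported metric names are assumptions to verify for your setup:

```python
import requests

# Sketch: list the per-replica model counts reported by Prometheus.
PROMETHEUS_URL = "http://localhost:9090"
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "ray_serve_num_multiplexed_models"},
)
for sample in resp.json().get("data", {}).get("result", []):
    labels = sample["metric"]
    print(labels.get("deployment"), labels.get("replica"), sample["value"][1])
```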
1 change: 1 addition & 0 deletions doc/source/serve/user-guide.md
@@ -16,3 +16,4 @@ This user guide will help you navigate the Ray Serve project and show you how to
- [Experimental Java API](managing-java-deployments)
- [1.x to 2.x API Migration Guide](migration)
- [Experimental gRPC Support](direct-ingress)
- [Model Multiplexing](model-multiplexing)
