[serve] Rename to use replicas, not workers (ray-project#11822)
ijrsvt authored Nov 10, 2020
1 parent 9b8218a commit 1d158dd
Showing 14 changed files with 172 additions and 170 deletions.
10 changes: 5 additions & 5 deletions doc/source/serve/advanced.rst
@@ -16,7 +16,7 @@ the properties of a particular backend.
Scaling Out
===========

To scale out a backend to multiple workers, simply configure the number of replicas.
To scale out a backend to many instances, simply configure the number of replicas.

.. code-block:: python
@@ -27,14 +27,14 @@ To scale out a backend to multiple workers, simply configure the number of repli
config = {"num_replicas": 2}
client.update_backend_config("my_scaled_endpoint_backend", config)
This will scale up or down the number of workers that can accept requests.
This will scale up or down the number of replicas that can accept requests.
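
For context, a minimal end-to-end sketch of this flow, assuming the Serve client API of this era (``serve.start()`` returning a client) and hypothetical backend/endpoint names:

.. code-block:: python

    from ray import serve

    client = serve.start()

    def handler(request):
        return "hello"

    # Hypothetical names, used only for illustration.
    client.create_backend("my_scaled_endpoint_backend", handler,
                          config={"num_replicas": 1})
    client.create_endpoint("my_scaled_endpoint",
                           backend="my_scaled_endpoint_backend",
                           route="/scaled")

    # Scale up to two replicas, then back down to one.
    client.update_backend_config("my_scaled_endpoint_backend",
                                 {"num_replicas": 2})
    client.update_backend_config("my_scaled_endpoint_backend",
                                 {"num_replicas": 1})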

Using Resources (CPUs, GPUs)
============================

To assign hardware resources per worker, you can pass resource requirements to
To assign hardware resources per replica, you can pass resource requirements to
``ray_actor_options``.
By default, each worker requires one CPU.
By default, each replica requires one CPU.
To learn about options to pass in, take a look at :ref:`Resources with Actor<actor-resource-guide>` guide.

For example, to create a backend where each replica uses a single GPU, you can do the
@@ -49,7 +49,7 @@ Fractional Resources
--------------------

The resources specified in ``ray_actor_options`` can also be *fractional*.
This allows you to flexibly share resources between workers.
This allows you to flexibly share resources between replicas.
For example, if you have two models and each doesn't fully saturate a GPU, you might want to have them share a GPU by allocating 0.5 GPUs each.
The same could be done to multiplex over CPUs.
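
As an illustration, a hedged sketch of two backends sharing one GPU by requesting half a GPU each (class and backend names are hypothetical):

.. code-block:: python

    class ModelA:
        def __call__(self, request):
            return "A"

    class ModelB:
        def __call__(self, request):
            return "B"

    # Each replica of each backend requests half a GPU, so one replica of
    # ModelA and one of ModelB can be packed onto a single physical GPU.
    client.create_backend("model_a", ModelA,
                          ray_actor_options={"num_gpus": 0.5})
    client.create_backend("model_b", ModelB,
                          ray_actor_options={"num_gpus": 0.5})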

22 changes: 11 additions & 11 deletions doc/source/serve/architecture.rst
@@ -20,7 +20,7 @@ There are three kinds of actors that are created to make up a Serve instance:
destroying other actors. Serve API calls like :mod:`client.create_backend <ray.serve.api.Client.create_backend>`,
:mod:`client.create_endpoint <ray.serve.api.Client.create_endpoint>` make remote calls to the Controller.
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
server that accepts incoming requests, forwards them to the worker replicas, and
server that accepts incoming requests, forwards them to replicas, and
responds once they are completed.
- Worker Replica: Worker replicas actually execute the code in response to a
request. For example, they may contain an instantiation of an ML model. Each
@@ -36,16 +36,16 @@ When an HTTP request is sent to the router, the following things happen:
- One or more :ref:`backends <serve-backend>` is selected to handle the request given the :ref:`traffic
splitting <serve-split-traffic>` and :ref:`shadow testing <serve-shadow-testing>` rules. The requests for each backend
are placed on a queue.
- For each request in a backend queue, an available worker replica is looked up
and the request is sent to it. If there are no available worker replicas (there
- For each request in a backend queue, an available replica is looked up
and the request is sent to it. If there are no available replicas (there
are more than ``max_concurrent_queries`` requests outstanding), the request
is left in the queue until an outstanding request is finished.

Each worker maintains a queue of requests and processes one batch of requests at
Each replica maintains a queue of requests and processes one batch of requests at
a time. By default the batch size is 1, you can increase the batch size <ref> to
increase throughput. If the handler (the function for the backend or
``__call__``) is ``async``, worker will not wait for the handler to run;
otherwise, worker will block until the handler returns.
``__call__``) is ``async``, the replica will not wait for the handler to run;
otherwise, the replica will block until the handler returns.
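
For illustration, a hedged sketch contrasting a blocking handler with an ``async`` one (the sleeps are stand-ins for real model work):

.. code-block:: python

    import asyncio
    import time

    class SyncModel:
        def __call__(self, request):
            # The replica blocks here; no other request is processed
            # until this returns.
            time.sleep(0.1)  # stand-in for synchronous inference
            return "done"

    class AsyncModel:
        async def __call__(self, request):
            # While this coroutine awaits, the replica can keep accepting
            # requests, up to max_concurrent_queries.
            await asyncio.sleep(0.1)  # stand-in for async I/O or inference
            return "done"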

FAQ
---
@@ -54,14 +54,14 @@ How does Serve handle fault tolerance?

Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information. The
worker replica will be able to continue to handle requests.
replica will be able to continue to handle requests.

Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, backend
configurations, etc) is checkpointed to the Ray. Transient data in the
router and the worker replica (like network connections and internal request
router and the replica (like network connections and internal request
queues) will be lost upon failure.
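
As a sketch of the application-error path described above (hypothetical backend and route; assumes Serve's default HTTP address of 127.0.0.1:8000):

.. code-block:: python

    import requests

    def flaky(request):
        raise ValueError("model blew up")

    client.create_backend("flaky_backend", flaky)
    client.create_endpoint("flaky", backend="flaky_backend", route="/flaky")

    resp = requests.get("http://127.0.0.1:8000/flaky")
    print(resp.status_code)  # 500, with the wrapped traceback in the body
    # The replica stays alive and continues to serve later requests.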

How does Serve ensure horizontal scalability and availability?
@@ -72,14 +72,14 @@ should be able to reach Serve and send requests to any models via any of the
servers.

This architecture ensures horizontal scalability for Serve. You can scale the
router by adding more nodes and scale the model workers by increasing the number
router by adding more nodes and scale the model by increasing the number
of replicas.

How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^

:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one via worker replica to another via the handle, the
request is sent from one via replica to another via the handle, the
requests go through the same data path as incoming HTTP requests. This enables
the same backend selection and batching procedures to happen. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.
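
A minimal usage sketch, assuming ``client.get_handle`` from this version of the API and a hypothetical endpoint name:

.. code-block:: python

    import ray

    # Obtain a handle to an existing endpoint and call it directly;
    # the call goes through the router, so traffic splitting and
    # batching behave the same as for HTTP requests.
    handle = client.get_handle("my_endpoint")
    result = ray.get(handle.remote("some input"))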
@@ -91,4 +91,4 @@ What happens to large requests?
Serve utilizes Ray’s :ref:`shared memory object store <plasma-store>` and in process memory
store. Small request objects are directly sent between actors via network
call. Larger request objects (100KiB+) are written to a distributed shared
memory store and the worker can read them via zero-copy read.
memory store and the replica can read them via zero-copy read.
2 changes: 1 addition & 1 deletion doc/source/serve/key-concepts.rst
@@ -24,7 +24,7 @@ A backend is defined using :mod:`client.create_backend <ray.serve.api.Client.cre
Use a function when your response is stateless and a class when you might need to maintain some state (like a model).
When using a class, you can specify arguments to be passed to the constructor in :mod:`client.create_backend <ray.serve.api.Client.create_backend>`, shown below.

A backend consists of a number of *replicas*, which are individual copies of the function or class that are started in separate worker processes.
A backend consists of a number of *replicas*, which are individual copies of the function or class that are started in separate Ray Workers (processes).

.. code-block:: python
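The collapsed example is not shown in this diff; a hedged sketch of a class-based backend whose constructor arguments are forwarded through ``client.create_backend`` might look like this (names and values are hypothetical):

.. code-block:: python

    class Model:
        def __init__(self, threshold):
            # Per-replica state lives here (e.g. a loaded model).
            self.threshold = threshold

        def __call__(self, request):
            return {"above_threshold": 0.7 > self.threshold}

    # Positional args after the class are passed to __init__;
    # two replicas means two independent copies in separate processes.
    client.create_backend("model_backend", Model, 0.5,
                          config={"num_replicas": 2})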
2 changes: 1 addition & 1 deletion doc/source/serve/tutorials/batch.rst
@@ -63,7 +63,7 @@ are specifying the maximum batch size via ``config={"max_batch_size": 4}``. This
configuration option limits the maximum possible batch size sent to the backend.

.. note::
Ray Serve performs *opportunistic batching*. When a worker is free to evaluate
Ray Serve performs *opportunistic batching*. When a replica is free to evaluate
the next batch, Ray Serve will look at the pending queries and take
``max(number_of_pending_queries, max_batch_size)`` queries to form a batch.
You can provide :mod:`batch_wait_timeout <ray.serve.BackendConfig>` to override
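For reference, a hedged sketch of the pattern this tutorial describes, assuming the ``@serve.accept_batch`` decorator from this version (the handler receives a list of requests and must return one result per request, in order):

.. code-block:: python

    from ray import serve

    @serve.accept_batch
    def batched_handler(requests):
        # All pending requests (up to max_batch_size) arrive together.
        return ["served in a batch of {}".format(len(requests))
                for _ in requests]

    client.create_backend("batched_backend", batched_handler,
                          config={"max_batch_size": 4})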
4 changes: 2 additions & 2 deletions python/ray/serve/api.py
@@ -166,7 +166,7 @@ def update_backend_config(
config_options(dict, serve.BackendConfig): Backend config options
to update. Either a BackendConfig object or a dict mapping
strings to values for the following supported options:
- "num_replicas": number of worker processes to start up that
- "num_replicas": number of processes to start up that
will handle requests to this backend.
- "max_batch_size": the maximum number of requests that will
be processed in one batch by this backend.
@@ -221,7 +221,7 @@ def create_backend(
config (dict, serve.BackendConfig, optional): configuration options
for this backend. Either a BackendConfig, or a dictionary
mapping strings to values for the following supported options:
- "num_replicas": number of worker processes to start up that
- "num_replicas": number of processes to start up that
will handle requests to this backend.
- "max_batch_size": the maximum number of requests that will
be processed in one batch by this backend.
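For illustration, these options can also be supplied as a ``BackendConfig`` object rather than a dict, as the docstrings above suggest (a hedged sketch with hypothetical names):

.. code-block:: python

    from ray.serve import BackendConfig

    def my_handler(request):
        return "ok"

    backend_config = BackendConfig(num_replicas=2, max_batch_size=4)
    client.create_backend("my_backend", my_handler, config=backend_config)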
23 changes: 12 additions & 11 deletions python/ray/serve/backend_worker.py
@@ -87,8 +87,8 @@ async def wait_for_batch(self) -> List[Query]:
return batch


def create_backend_worker(func_or_class: Union[Callable, Type[Callable]]):
"""Creates a worker class wrapping the provided function or class."""
def create_backend_replica(func_or_class: Union[Callable, Type[Callable]]):
"""Creates a replica class wrapping the provided function or class."""

if inspect.isfunction(func_or_class):
is_function = True
@@ -98,7 +98,7 @@ def create_backend_worker(func_or_class: Union[Callable, Type[Callable]]):
assert False, "func_or_class must be function or class."

# TODO(architkulkarni): Add type hints after upgrading cloudpickle
class RayServeWrappedWorker(object):
class RayServeWrappedReplica(object):
def __init__(self, backend_tag, replica_tag, init_args,
backend_config: BackendConfig, controller_name: str):
# Set the controller name so that serve.connect() will connect to
@@ -109,8 +109,8 @@ def __init__(self, backend_tag, replica_tag, init_args,
else:
_callable = func_or_class(*init_args)

self.backend = RayServeWorker(backend_tag, replica_tag, _callable,
backend_config, is_function)
self.backend = RayServeReplica(backend_tag, replica_tag, _callable,
backend_config, is_function)

async def handle_request(self, request):
return await self.backend.handle_request(request)
@@ -121,8 +121,8 @@ def update_config(self, new_config: BackendConfig):
def ready(self):
pass

RayServeWrappedWorker.__name__ = "RayServeWorker_" + func_or_class.__name__
return RayServeWrappedWorker
RayServeWrappedReplica.__name__ = "RayServeReplica_{}".format(
func_or_class.__name__)
return RayServeWrappedReplica


def wrap_to_ray_error(exception: Exception) -> RayTaskError:
@@ -140,7 +141,7 @@ def ensure_async(func: Callable) -> Callable:
return sync_to_async(func)


class RayServeWorker:
class RayServeReplica:
"""Handles requests with the provided callable."""

def __init__(self, backend_tag: str, replica_tag: str, _callable: Callable,
@@ -172,8 +173,8 @@ def __init__(self, backend_tag: str, replica_tag: str, _callable: Callable,
self.error_counter.set_default_tags({"backend": self.backend_tag})

self.restart_counter = metrics.Count(
"backend_worker_starts",
description=("The number of time this replica workers "
"backend_replica_starts",
description=("The number of time this replica "
"has been restarted due to failure."),
tag_keys=("backend", "replica_tag"))
self.restart_counter.set_default_tags({
@@ -288,7 +289,7 @@ async def invoke_batch(self, request_item_list: List[Query]) -> List[Any]:
if not isinstance(result_list, Iterable) or isinstance(
result_list, (dict, set)):
error_message = ("RayServe expects an ordered iterable object "
"but the worker returned a {}".format(
"but the replica returned a {}".format(
type(result_list)))
raise RayServeException(error_message)

8 changes: 4 additions & 4 deletions python/ray/serve/config.py
@@ -30,7 +30,7 @@ class BackendMetadata:
class BackendConfig(BaseModel):
"""Configuration options for a backend, to be set by the user.
:param num_replicas: The number of worker processes to start up that will
:param num_replicas: The number of processes to start up that will
handle requests to this backend. Defaults to 0.
:type num_replicas: int, optional
:param max_batch_size: The maximum number of requests that will be
@@ -81,7 +81,7 @@ def _validate_complete(self):

# Dynamic default for max_concurrent_queries
@validator("max_concurrent_queries", always=True)
def set_max_queries_by_mode(cls, v, values):
def set_max_queries_by_mode(cls, v, values): # noqa 805
if v is None:
# Model serving mode: if the servable is blocking and the wait
# timeout is default zero seconds, then we keep the existing
@@ -95,8 +95,8 @@ def set_max_queries_by_mode(cls, v, values):
v = 8

# Pipeline/async mode: if the servable is not blocking,
# router should just keep pushing queries to the worker
# replicas until a high limit.
# router should just keep pushing queries to the replicas
# until a high limit.
if not values["internal_metadata"].is_blocking:
v = ASYNC_CONCURRENCY

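For illustration, a user who wants to bypass this dynamic default can set the limit explicitly (a hedged sketch with a hypothetical backend name):

.. code-block:: python

    # Allow up to 16 in-flight queries per replica instead of the
    # mode-dependent default chosen by the validator above.
    client.update_backend_config("my_backend",
                                 {"max_concurrent_queries": 16})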