[serve] Rename to use replicas, not workers (ray-project#11822)
ijrsvt authored Nov 10, 2020
1 parent 9b8218a commit 1d158dd
Showing 14 changed files with 172 additions and 170 deletions.
10 changes: 5 additions & 5 deletions doc/source/serve/advanced.rst
@@ -16,7 +16,7 @@ the properties of a particular backend.
Scaling Out
===========

To scale out a backend to multiple workers, simply configure the number of replicas.
To scale out a backend to many instances, simply configure the number of replicas.

.. code-block:: python
@@ -27,14 +27,14 @@ To scale out a backend to multiple workers, simply configure the number of repli
config = {"num_replicas": 2}
client.update_backend_config("my_scaled_endpoint_backend", config)
This will scale up or down the number of workers that can accept requests.
This will scale up or down the number of replicas that can accept requests.
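
For context, a minimal end-to-end sketch of this flow, assuming the Serve client API of this era (``serve.start()`` returning a client) and hypothetical backend/endpoint names:

.. code-block:: python

    from ray import serve

    client = serve.start()

    def handler(request):
        return "hello"

    # Hypothetical names, used only for illustration.
    client.create_backend("my_scaled_endpoint_backend", handler,
                          config={"num_replicas": 1})
    client.create_endpoint("my_scaled_endpoint",
                           backend="my_scaled_endpoint_backend",
                           route="/scaled")

    # Scale up to two replicas, then back down to one.
    client.update_backend_config("my_scaled_endpoint_backend",
                                 {"num_replicas": 2})
    client.update_backend_config("my_scaled_endpoint_backend",
                                 {"num_replicas": 1})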

Using Resources (CPUs, GPUs)
============================

To assign hardware resources per worker, you can pass resource requirements to
To assign hardware resources per replica, you can pass resource requirements to
``ray_actor_options``.
By default, each worker requires one CPU.
By default, each replica requires one CPU.
To learn about options to pass in, take a look at :ref:`Resources with Actor<actor-resource-guide>` guide.

For example, to create a backend where each replica uses a single GPU, you can do the
@@ -49,7 +49,7 @@ Fractional Resources
--------------------

The resources specified in ``ray_actor_options`` can also be *fractional*.
This allows you to flexibly share resources between workers.
This allows you to flexibly share resources between replicas.
For example, if you have two models and each doesn't fully saturate a GPU, you might want to have them share a GPU by allocating 0.5 GPUs each.
The same could be done to multiplex over CPUs.
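
As an illustration, a hedged sketch of two backends sharing one GPU by requesting half a GPU each (class and backend names are hypothetical):

.. code-block:: python

    class ModelA:
        def __call__(self, request):
            return "A"

    class ModelB:
        def __call__(self, request):
            return "B"

    # Each replica of each backend requests half a GPU, so one replica of
    # ModelA and one of ModelB can be packed onto a single physical GPU.
    client.create_backend("model_a", ModelA,
                          ray_actor_options={"num_gpus": 0.5})
    client.create_backend("model_b", ModelB,
                          ray_actor_options={"num_gpus": 0.5})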

22 changes: 11 additions & 11 deletions doc/source/serve/architecture.rst
@@ -20,7 +20,7 @@ There are three kinds of actors that are created to make up a Serve instance:
destroying other actors. Serve API calls like :mod:`client.create_backend <ray.serve.api.Client.create_backend>`,
:mod:`client.create_endpoint <ray.serve.api.Client.create_endpoint>` make remote calls to the Controller.
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
server that accepts incoming requests, forwards them to the worker replicas, and
server that accepts incoming requests, forwards them to replicas, and
responds once they are completed.
- Worker Replica: Worker replicas actually execute the code in response to a
request. For example, they may contain an instantiation of an ML model. Each
@@ -36,16 +36,16 @@ When an HTTP request is sent to the router, the following things happen:
- One or more :ref:`backends <serve-backend>` is selected to handle the request given the :ref:`traffic
splitting <serve-split-traffic>` and :ref:`shadow testing <serve-shadow-testing>` rules. The requests for each backend
are placed on a queue.
- For each request in a backend queue, an available worker replica is looked up
and the request is sent to it. If there are no available worker replicas (there
- For each request in a backend queue, an available replica is looked up
and the request is sent to it. If there are no available replicas (there
are more than ``max_concurrent_queries`` requests outstanding), the request
is left in the queue until an outstanding request is finished.

Each worker maintains a queue of requests and processes one batch of requests at
Each replica maintains a queue of requests and processes one batch of requests at
a time. By default the batch size is 1, you can increase the batch size <ref> to
increase throughput. If the handler (the function for the backend or
``__call__``) is ``async``, worker will not wait for the handler to run;
otherwise, worker will block until the handler returns.
``__call__``) is ``async``, the replica will not wait for the handler to run;
otherwise, the replica will block until the handler returns.
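
For illustration, a hedged sketch contrasting a blocking handler with an ``async`` one (the sleeps are stand-ins for real model work):

.. code-block:: python

    import asyncio
    import time

    class SyncModel:
        def __call__(self, request):
            # The replica blocks here; no other request is processed
            # until this returns.
            time.sleep(0.1)  # stand-in for synchronous inference
            return "done"

    class AsyncModel:
        async def __call__(self, request):
            # While this coroutine awaits, the replica can keep accepting
            # requests, up to max_concurrent_queries.
            await asyncio.sleep(0.1)  # stand-in for async I/O or inference
            return "done"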

FAQ
---
@@ -54,14 +54,14 @@ How does Serve handle fault tolerance?

Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information. The
worker replica will be able to continue to handle requests.
replica will be able to continue to handle requests.

Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, backend
configurations, etc) is checkpointed to the Ray. Transient data in the
router and the worker replica (like network connections and internal request
router and the replica (like network connections and internal request
queues) will be lost upon failure.
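
As a sketch of the application-error path described above (hypothetical backend and route; assumes Serve's default HTTP address of 127.0.0.1:8000):

.. code-block:: python

    import requests

    def flaky(request):
        raise ValueError("model blew up")

    client.create_backend("flaky_backend", flaky)
    client.create_endpoint("flaky", backend="flaky_backend", route="/flaky")

    resp = requests.get("http://127.0.0.1:8000/flaky")
    print(resp.status_code)  # 500, with the wrapped traceback in the body
    # The replica stays alive and continues to serve later requests.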

How does Serve ensure horizontal scalability and availability?
@@ -72,14 +72,14 @@ should be able to reach Serve and send requests to any models via any of the
servers.

This architecture ensures horizontal scalability for Serve. You can scale the
router by adding more nodes and scale the model workers by increasing the number
router by adding more nodes and scale the model by increasing the number
of replicas.

How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^

:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one via worker replica to another via the handle, the
request is sent from one via replica to another via the handle, the
requests go through the same data path as incoming HTTP requests. This enables
the same backend selection and batching procedures to happen. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.
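
A minimal usage sketch, assuming ``client.get_handle`` from this version of the API and a hypothetical endpoint name:

.. code-block:: python

    import ray

    # Obtain a handle to an existing endpoint and call it directly;
    # the call goes through the router, so traffic splitting and
    # batching behave the same as for HTTP requests.
    handle = client.get_handle("my_endpoint")
    result = ray.get(handle.remote("some input"))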
@@ -91,4 +91,4 @@ What happens to large requests?
Serve utilizes Ray’s :ref:`shared memory object store <plasma-store>` and in process memory
store. Small request objects are directly sent between actors via network
call. Larger request objects (100KiB+) are written to a distributed shared
memory store and the worker can read them via zero-copy read.
memory store and the replica can read them via zero-copy read.
2 changes: 1 addition & 1 deletion doc/source/serve/key-concepts.rst
@@ -24,7 +24,7 @@ A backend is defined using :mod:`client.create_backend <ray.serve.api.Client.cre
Use a function when your response is stateless and a class when you might need to maintain some state (like a model).
When using a class, you can specify arguments to be passed to the constructor in :mod:`client.create_backend <ray.serve.api.Client.create_backend>`, shown below.

A backend consists of a number of *replicas*, which are individual copies of the function or class that are started in separate worker processes.
A backend consists of a number of *replicas*, which are individual copies of the function or class that are started in separate Ray Workers (processes).

.. code-block:: python
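The collapsed example is not shown in this diff; a hedged sketch of a class-based backend whose constructor arguments are forwarded through ``client.create_backend`` might look like this (names and values are hypothetical):

.. code-block:: python

    class Model:
        def __init__(self, threshold):
            # Per-replica state lives here (e.g. a loaded model).
            self.threshold = threshold

        def __call__(self, request):
            return {"above_threshold": 0.7 > self.threshold}

    # Positional args after the class are passed to __init__;
    # two replicas means two independent copies in separate processes.
    client.create_backend("model_backend", Model, 0.5,
                          config={"num_replicas": 2})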
2 changes: 1 addition & 1 deletion doc/source/serve/tutorials/batch.rst
@@ -63,7 +63,7 @@ are specifying the maximum batch size via ``config={"max_batch_size": 4}``. This
configuration option limits the maximum possible batch size sent to the backend.

.. note::
Ray Serve performs *opportunistic batching*. When a worker is free to evaluate
Ray Serve performs *opportunistic batching*. When a replica is free to evaluate
the next batch, Ray Serve will look at the pending queries and take
``max(number_of_pending_queries, max_batch_size)`` queries to form a batch.
You can provide :mod:`batch_wait_timeout <ray.serve.BackendConfig>` to override
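For reference, a hedged sketch of the pattern this tutorial describes, assuming the ``@serve.accept_batch`` decorator from this version (the handler receives a list of requests and must return one result per request, in order):

.. code-block:: python

    from ray import serve

    @serve.accept_batch
    def batched_handler(requests):
        # All pending requests (up to max_batch_size) arrive together.
        return ["served in a batch of {}".format(len(requests))
                for _ in requests]

    client.create_backend("batched_backend", batched_handler,
                          config={"max_batch_size": 4})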
4 changes: 2 additions & 2 deletions python/ray/serve/api.py
@@ -166,7 +166,7 @@ def update_backend_config(
config_options(dict, serve.BackendConfig): Backend config options
to update. Either a BackendConfig object or a dict mapping
strings to values for the following supported options:
- "num_replicas": number of worker processes to start up that
- "num_replicas": number of processes to start up that
will handle requests to this backend.
- "max_batch_size": the maximum number of requests that will
be processed in one batch by this backend.
@@ -221,7 +221,7 @@ def create_backend(
config (dict, serve.BackendConfig, optional): configuration options
for this backend. Either a BackendConfig, or a dictionary
mapping strings to values for the following supported options:
- "num_replicas": number of worker processes to start up that
- "num_replicas": number of processes to start up that
will handle requests to this backend.
- "max_batch_size": the maximum number of requests that will
be processed in one batch by this backend.
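For illustration, these options can also be supplied as a ``BackendConfig`` object rather than a dict, as the docstrings above suggest (a hedged sketch with hypothetical names):

.. code-block:: python

    from ray.serve import BackendConfig

    def my_handler(request):
        return "ok"

    backend_config = BackendConfig(num_replicas=2, max_batch_size=4)
    client.create_backend("my_backend", my_handler, config=backend_config)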
23 changes: 12 additions & 11 deletions python/ray/serve/backend_worker.py
@@ -87,8 +87,8 @@ async def wait_for_batch(self) -> List[Query]:
return batch


def create_backend_worker(func_or_class: Union[Callable, Type[Callable]]):
"""Creates a worker class wrapping the provided function or class."""
def create_backend_replica(func_or_class: Union[Callable, Type[Callable]]):
"""Creates a replica class wrapping the provided function or class."""

if inspect.isfunction(func_or_class):
is_function = True
@@ -98,7 +98,7 @@ def create_backend_worker(func_or_class: Union[Callable, Type[Callable]]):
assert False, "func_or_class must be function or class."

# TODO(architkulkarni): Add type hints after upgrading cloudpickle
class RayServeWrappedWorker(object):
class RayServeWrappedReplica(object):
def __init__(self, backend_tag, replica_tag, init_args,
backend_config: BackendConfig, controller_name: str):
# Set the controller name so that serve.connect() will connect to
@@ -109,8 +109,8 @@ def __init__(self, backend_tag, replica_tag, init_args,
else:
_callable = func_or_class(*init_args)

self.backend = RayServeWorker(backend_tag, replica_tag, _callable,
backend_config, is_function)
self.backend = RayServeReplica(backend_tag, replica_tag, _callable,
backend_config, is_function)

async def handle_request(self, request):
return await self.backend.handle_request(request)
@@ -121,8 +121,8 @@ def update_config(self, new_config: BackendConfig):
def ready(self):
pass

RayServeWrappedWorker.__name__ = "RayServeWorker_" + func_or_class.__name__
return RayServeWrappedWorker
RayServeWrappedReplica.__name__ = "RayServeReplica_{}".format(
func_or_class.__name__)
return RayServeWrappedReplica


def wrap_to_ray_error(exception: Exception) -> RayTaskError:
@@ -140,7 +141,7 @@ def ensure_async(func: Callable) -> Callable:
return sync_to_async(func)


class RayServeWorker:
class RayServeReplica:
"""Handles requests with the provided callable."""

def __init__(self, backend_tag: str, replica_tag: str, _callable: Callable,
@@ -172,8 +173,8 @@ def __init__(self, backend_tag: str, replica_tag: str, _callable: Callable,
self.error_counter.set_default_tags({"backend": self.backend_tag})

self.restart_counter = metrics.Count(
"backend_worker_starts",
description=("The number of time this replica workers "
"backend_replica_starts",
description=("The number of time this replica "
"has been restarted due to failure."),
tag_keys=("backend", "replica_tag"))
self.restart_counter.set_default_tags({
@@ -288,7 +289,7 @@ async def invoke_batch(self, request_item_list: List[Query]) -> List[Any]:
if not isinstance(result_list, Iterable) or isinstance(
result_list, (dict, set)):
error_message = ("RayServe expects an ordered iterable object "
"but the worker returned a {}".format(
"but the replica returned a {}".format(
type(result_list)))
raise RayServeException(error_message)

8 changes: 4 additions & 4 deletions python/ray/serve/config.py
@@ -30,7 +30,7 @@ class BackendMetadata:
class BackendConfig(BaseModel):
"""Configuration options for a backend, to be set by the user.
:param num_replicas: The number of worker processes to start up that will
:param num_replicas: The number of processes to start up that will
handle requests to this backend. Defaults to 0.
:type num_replicas: int, optional
:param max_batch_size: The maximum number of requests that will be
@@ -81,7 +81,7 @@ def _validate_complete(self):

# Dynamic default for max_concurrent_queries
@validator("max_concurrent_queries", always=True)
def set_max_queries_by_mode(cls, v, values):
def set_max_queries_by_mode(cls, v, values): # noqa 805
if v is None:
# Model serving mode: if the servable is blocking and the wait
# timeout is default zero seconds, then we keep the existing
@@ -95,8 +95,8 @@ def set_max_queries_by_mode(cls, v, values):
v = 8

# Pipeline/async mode: if the servable is not blocking,
# router should just keep pushing queries to the worker
# replicas until a high limit.
# router should just keep pushing queries to the replicas
# until a high limit.
if not values["internal_metadata"].is_blocking:
v = ASYNC_CONCURRENCY

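For illustration, a user who wants to bypass this dynamic default can set the limit explicitly (a hedged sketch with a hypothetical backend name):

.. code-block:: python

    # Allow up to 16 in-flight queries per replica instead of the
    # mode-dependent default chosen by the validator above.
    client.update_backend_config("my_backend",
                                 {"max_concurrent_queries": 16})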