Skip to content

Commit

Permalink
[docs] page for using Modin with Ray (ray-project#13937)
Browse files Browse the repository at this point in the history
Co-authored-by: Richard Liaw <[email protected]>
  • Loading branch information
devin-petersohn and richardliaw authored Feb 6, 2021
1 parent f070b3c commit 1412f3c
Show file tree
Hide file tree
Showing 3 changed files with 101 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -305,6 +305,7 @@ Papers
joblib.rst
iter.rst
xgboost-ray.rst
modin/index.rst
dask-on-ray.rst
mars-on-ray.rst
ray-client.rst
Expand Down
97 changes: 97 additions & 0 deletions doc/source/modin/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
Modin (Pandas on Ray)
=====================

Modin_, previously Pandas on Ray, is a dataframe manipulation library that
allows users to speed up their pandas workloads by acting as a drop-in
replacement. Modin also provides support for other APIs (e.g. spreadsheet)
and libraries, like xgboost.

.. code-block:: python
import modin.pandas as pd
import ray
ray.init()
df = pd.read_parquet("s3://my-bucket/big.parquet")
You can use Modin on Ray with your laptop or cluster. In this document,
we show instructions for how to set up a Modin compatible Ray cluster
and connect Modin to Ray.

.. note:: In previous versions of Modin, you had to initialize Ray before importing Modin. As of Modin 0.9.0, This is no longer the case.

Using Modin with Ray's autoscaler
---------------------------------

In order to use Modin with :ref:`Ray's autoscaler <cluster-index>`, you need to ensure that the
correct dependencies are installed at startup. Modin's repository has an
example `yaml file and set of tutorial notebooks`_ to ensure that the Ray
cluster has the correct dependencies. Once the cluster is up, connect Modin
by simply importing.

.. code-block:: python
import modin.pandas as pd
import ray
ray.init(address="auto")
df = pd.read_parquet("s3://my-bucket/big.parquet")
As long as Ray is initialized before any dataframes are created, Modin
will be able to connect to and use the Ray cluster.

Modin with the Ray Client
-------------------------

When using Modin with the :ref:`Ray Client <ray-client>`, it is important to ensure that the
cluster has all dependencies installed.

.. code-block:: python
import modin.pandas as pd
import ray
import ray.util
ray.util.connect()
df = pd.read_parquet("s3://my-bucket/big.parquet")
Modin will automatically use the Ray Client for computation when the file
is read.

How Modin uses Ray
------------------

Modin has a layered architecture, and the core abstraction for data manipulation
is the Modin Dataframe, which implements a novel algebra that enables Modin to
handle all of pandas (see Modin's documentation_ for more on the architecture).
Modin's internal dataframe object has a scheduling layer that is able to partition
and operate on data with Ray.

Dataframe operations
''''''''''''''''''''

The Modin Dataframe uses Ray tasks to perform data manipulations. Ray Tasks have
a number of benefits over the actor model for data manipulation:

- Multiple tasks may be manipulating the same objects simultaneously
- Objects in Ray's object store are immutable, making provenance and lineage easier
to track
- As new workers come online the shuffling of data will happen as tasks are
scheduled on the new node
- Identical partitions need not be replicated, especially beneficial for operations
that selectively mutate the data (e.g. ``fillna``).
- Finer grained parallelism with finer grained placement control

Machine Learning
''''''''''''''''

Modin uses Ray Actors for the machine learning support it currently provides.
Modin's implementation of XGBoost is able to spin up one actor for each node
and aggregate all of the partitions on that node to the XGBoost Actor. Modin
is able to specify precisely the node IP for each actor on creation, giving
fine-grained control over placement - a must for distributed training
performance.

.. _Modin: https://github.com/modin-project/modin
.. _documentation: https://modin.readthedocs.io/en/latest/developer/architecture.html
.. _yaml file and set of tutorial notebooks: https://github.com/modin-project/modin/tree/master/examples/tutorial/tutorial_notebooks/cluster
4 changes: 3 additions & 1 deletion doc/source/ray-client.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _ray-client:

**********
Ray Client
**********
Expand Down Expand Up @@ -34,7 +36,7 @@ From here, another Ray script can access that server from a networked machine wi
do_work.remote(2)
#....
When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if directly disconnecting from the cluster.

============
Expand Down

0 comments on commit 1412f3c

Please sign in to comment.