Deploying Ray on a cluster currently requires a bit of manual work. The instructions here illustrate how to use parallel ssh commands to simplify the process of running commands and scripts on many machines simultaneously.
- Create an EC2 instance running Ray by following the instructions for installation on Ubuntu.
- Add any packages that you may need for running your application.
- Install the pssh package:
sudo apt-get install pssh
- Create an AMI of your installation.
- Use the EC2 console to launch additional instances from the AMI you created.
This section assumes that you have a cluster of machines running and that these nodes have network connectivity to one another. It also assumes that Ray is installed on each machine.
Additional assumptions:
- All of the following commands are run from a machine designated as the head node.
- The head node will run Redis and the global scheduler.
- The head node is the launching point for driver programs and for administrative tasks.
- The head node has ssh access to all other nodes.
- All nodes are accessible via ssh keys.
- Ray is checked out on each node at the location $HOME/ray.
Note: The commands below will probably need to be customized for your specific setup.
To run ssh commands from the cluster head node, we suggest enabling ssh agent forwarding. This allows the session that you initiate with the head node to connect to the other nodes in the cluster and run scripts on them. First, add your key to the ssh agent on your local machine, replacing <ssh-key> with the path to the private key that you would use when logging in to the nodes in the cluster.
ssh-add <ssh-key>
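If you want to confirm that the key was added, you can list the identities currently held by the agent.
ssh-add -l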
Now log in to the head node with the following command, where <head-node-public-ip> is the public IP address of the head node (just choose one of the nodes to be the head node).
ssh -A ubuntu@<head-node-public-ip>
Populate a file workers.txt with one IP address on each line. Do not include the head node IP address in this file. These IP addresses should typically be private network IP addresses, but any IP addresses which the head node can use to ssh to the worker nodes will work here.
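For example, a cluster with three worker nodes might have a workers.txt like the following. These addresses are placeholders; use your own nodes' IP addresses.
172.31.10.11
172.31.10.12
172.31.10.13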
Confirm that the head node can ssh to each of these machines by running the following.
for host in $(cat workers.txt); do
  ssh $host uptime
done
You may be prompted to verify the host keys during this process.
On the head node, run the following:
./ray/scripts/start_ray.sh --head --num-workers=<num-workers> --redis-port <redis-port>
Replace <redis-port> with a port of your choice, e.g., 6379. Also, replace <num-workers> with the number of workers that you wish to start on this node.
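For example, to start the head node with Redis on port 6379 and four workers (the worker count here is just an illustration), you would run the following.
./ray/scripts/start_ray.sh --head --num-workers=4 --redis-port 6379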
Create a file start_worker.sh that contains something like the following:
# Make sure the SSH session has the correct version of Python on its path.
export PATH=/home/ubuntu/anaconda2/bin/:$PATH
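# Start Ray on this node and connect it to the head node's Redis server.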
ray/scripts/start_ray.sh --num-workers=<num-workers> --redis-address=<head-node-ip>:<redis-port>
This script, when run on the worker nodes, will start up Ray. You will need to replace <head-node-ip> with the IP address that the worker nodes will use to connect to the head node (most likely a private IP address), and replace <num-workers> and <redis-port> as above. In this example we also export the path to the Python installation since our remote commands will not be executing in a login shell.
Warning: You may need to manually export the correct path to Python (you will need to change the export PATH line of start_worker.sh to point to the version of Python that Ray was built against). This is necessary because the PATH environment variable used by parallel-ssh can differ from the PATH environment variable that gets set when you ssh to the machine.
Warning: If the parallel-ssh command below appears to hang, <head-node-ip> may need to be a private IP address instead of a public IP address (e.g., if you are using EC2).
Now use parallel-ssh to start up Ray on each worker node.
parallel-ssh -h workers.txt -P -I < start_worker.sh
Note that on some distributions the parallel-ssh command may be called pssh.
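If you want to spot-check that Ray actually came up on a worker, you can ssh to one of the addresses in workers.txt and look for running Ray processes. The exact process names vary between Ray versions, so treat this only as a rough check.
ssh <worker-ip> "ps aux | grep -i ray | grep -v grep"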
Now you have started all of the Ray processes on each node. These include:
- Some worker processes on each machine.
- An object store on each machine.
- A local scheduler on each machine.
- One Redis server (on the head node).
- One global scheduler (on the head node).
To confirm that the Ray cluster setup is working, start up Python on one of the nodes in the cluster and enter the following commands to connect to the Ray cluster.
import ray
ray.init(redis_address="<redis-address>")
Here <redis-address> should have the form <head-node-ip>:<redis-port>.
Now you can define remote functions and execute tasks. For example:
@ray.remote
def f(x):
    return x

ray.get([f.remote(f.remote(f.remote(0))) for _ in range(1000)])
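Since f simply returns its argument, this call should return a list of one thousand zeros if the cluster is working correctly.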
To stop the Ray processes when you are done, run the following command from the head node.
parallel-ssh -h workers.txt -P ray/scripts/stop_ray.sh
This command will execute the stop_ray.sh script on each of the worker nodes. Then stop Ray on the head node itself by running the following.
ray/scripts/stop_ray.sh
Ray remains under active development, so you may at times want to upgrade the cluster to take advantage of improvements and fixes.
On the head node, create a file called upgrade.sh that contains the commands necessary to upgrade Ray. It should look something like the following:
# Make sure the SSH session has the correct version of Python on its path.
export PATH=/home/ubuntu/anaconda2/bin/:$PATH
# Do pushd/popd to make sure we end up in the same directory.
pushd .
# Upgrade Ray.
cd ray
git remote set-url origin https://github.com/ray-project/ray
git checkout master
git pull
cd python
python setup.py install --user
popd
This script executes a series of git commands to update the Ray source code, then builds and installs Ray.
Before running the upgrade, follow the instructions above for stopping Ray.
First run the upgrade script on the head node. This will upgrade the head node and help confirm that the upgrade script is working properly.
bash upgrade.sh
Next run the upgrade script on the worker nodes.
parallel-ssh -h workers.txt -P -t 0 -I < upgrade.sh
Note here that we use the -t 0 option to disable the timeout, since building and installing Ray can take a while.
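As a quick sanity check after the upgrade (and before restarting Ray), you can confirm that Ray still imports cleanly on every worker node. The Python path below matches the anaconda location assumed in start_worker.sh and upgrade.sh; adjust it to your own installation.
parallel-ssh -h workers.txt -P "/home/ubuntu/anaconda2/bin/python -c 'import ray'"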
Finally, follow the instructions above for starting Ray again.
If you are running an application that reads input files or uses Python libraries, then you may find it useful to copy a directory from the head node to the worker nodes.
You can do this using the parallel-rsync command:
parallel-rsync -h workers.txt -r <workload-dir> /home/ubuntu/<workload-dir>
where <workload-dir> is the directory you want to synchronize.
Note that the destination argument for this command must represent an absolute path on the worker node.
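For example, if your application lives in a directory named my_app under the home directory on the head node (the name is just an illustration), you could push it to every worker with the following.
parallel-rsync -h workers.txt -r my_app /home/ubuntu/my_app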
If any of the above commands fail, verify that the head node has SSH access to the other nodes by running the following.
for host in $(cat workers.txt); do
  ssh $host uptime
done
If you get a permission denied error, then make sure you have SSH'ed to the head node with agent forwarding enabled. This is done as follows.
ssh-add <ssh-key>
ssh -A ubuntu@<head-node-public-ip>