[Doc][KubeRay] Update RayJob doc (ray-project#42790)
Update the documentation for the KubeRay v1.1.0 release. Additionally, this PR adds a new section to demonstrate how the KubeRay operator cleans up computing resources after the Ray job finishes.

---------

Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: angelinalg <[email protected]>
kevin85421 and angelinalg authored Jan 30, 2024
1 parent 2237aff commit c1aaaa3
Showing 1 changed file with 82 additions and 42 deletions.
124 changes: 82 additions & 42 deletions doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md
@@ -2,40 +2,46 @@

# RayJob Quickstart

:::{warning}
RayJob support in KubeRay v0.x is in alpha.
:::

## Prerequisites

* Ray 1.10 or higher
* KubeRay v0.3.0+. (v0.6.0+ is recommended)
* KubeRay v0.6.0 or higher
* KubeRay v0.6.0 or v1.0.0: Ray 1.10 or higher.
* KubeRay v1.1.0 is highly recommended: Ray 2.8.0 or higher. This document is mainly for KubeRay v1.1.0.

## What is a RayJob?
## What's a RayJob?

A RayJob manages two aspects:

* **RayCluster**: Manages resources in a Kubernetes cluster.
* **RayCluster**: A RayCluster custom resource manages all Pods in a Ray cluster, including a head Pod and multiple worker Pods.
* **Job**: A Kubernetes Job runs `ray job submit` to submit a Ray job to the RayCluster.

## What does the RayJob provide?

* **Kubernetes-native support for Ray clusters and Ray jobs**: You can use a Kubernetes config to define a Ray cluster and job, and use `kubectl` to create them. The cluster can be deleted automatically once the job is finished.
With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.

To understand the following content better, you should understand the difference between:
* RayJob: A Kubernetes custom resource definition (CRD) provided by KubeRay.
* Ray job: A Ray job is a packaged Ray application that can run on a remote Ray cluster. See [this document](jobs-overview) for more details.
* Submitter: The submitter is a Kubernetes Job that runs `ray job submit` to submit a Ray job to the RayCluster.

## RayJob Configuration

* `entrypoint` - The shell command to run for this job.
* `rayClusterSpec` - The spec for the **RayCluster** to run the job on.
* `jobId` - _(Optional)_ Job ID to specify for the job. If not provided, one will be generated.
* `metadata` - _(Optional)_ Arbitrary user-provided metadata for the job.
* `runtimeEnvYAML` - _(Optional)_ The runtime environment configuration provided as a multi-line YAML string. _(New in KubeRay version 1.0.)_
* `shutdownAfterJobFinishes` - _(Optional)_ whether to recycle the cluster after the job finishes. Defaults to false.
* `ttlSecondsAfterFinished` - _(Optional)_ TTL to clean up the cluster. This only works if `shutdownAfterJobFinishes` is set.
* `submitterPodTemplate` - _(Optional)_ Pod template spec for the pod that runs `ray job submit` against the Ray cluster.
* `entrypointNumCpus` - _(Optional)_ Specifies the quantity of CPU cores to reserve for the entrypoint command. _(New in KubeRay version 1.0.)_
* `entrypointNumGpus` - _(Optional)_ Specifies the number of GPUs to reserve for the entrypoint command. _(New in KubeRay version 1.0.)_
* `entrypointResources` - _(Optional)_ A json formatted dictionary to specify custom resources and their quantity. _(New in KubeRay version 1.0.)_
* `runtimeEnv` - [DEPRECATED] _(Optional)_ base64-encoded string of the runtime env json string.
* RayCluster configuration
* `rayClusterSpec` - Defines the **RayCluster** custom resource to run the Ray job on.
* Ray job configuration
* `entrypoint` - The submitter runs `ray job submit --address ... --submission-id ... -- $entrypoint` to submit a Ray job to the RayCluster.
* `runtimeEnvYAML` - _(Optional)_ A runtime environment that describes the dependencies the Ray job needs to run, including files, packages, environment variables, and more. Provide the configuration as a multi-line YAML string. See {ref}`Runtime Environments <runtime-environments>` for more details. _(New in KubeRay version 1.0.0)_
* `jobId` - _(Optional)_ Defines the submission ID for the Ray job. If not provided, KubeRay generates one automatically. See {ref}`Ray Jobs CLI API Reference <ray-job-submission-cli-ref>` for more details about the submission ID.
* `metadata` - _(Optional)_ See {ref}`Ray Jobs CLI API Reference <ray-job-submission-cli-ref>` for more details about the `--metadata-json` option.
* `entrypointNumCpus` / `entrypointNumGpus` / `entrypointResources` _(Optional)_: See {ref}`Ray Jobs CLI API Reference <ray-job-submission-cli-ref>` for more details.
* Submitter configuration
* `submitterPodTemplate` - _(Optional)_ Defines the Pod template for the submitter Kubernetes Job.
* `RAY_DASHBOARD_ADDRESS` - The KubeRay operator injects this environment variable into the submitter Pod. The value is `$HEAD_SERVICE:$DASHBOARD_PORT`.
* `RAY_JOB_SUBMISSION_ID` - The KubeRay operator injects this environment variable into the submitter Pod. The value is the `RayJob.Status.JobId` of the RayJob.
* Example: `ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID ...`
* Automatic resource cleanup
* `shutdownAfterJobFinishes` - _(Optional)_ Determines whether the KubeRay operator deletes the RayCluster and the submitter after the Ray job finishes. The default value is false.
* `ttlSecondsAfterFinished` - _(Optional)_ Only works if `shutdownAfterJobFinishes` is true. The KubeRay operator deletes the RayCluster and the submitter `ttlSecondsAfterFinished` seconds after the Ray job finishes. The default value is 0.
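
The fields above can be combined into a single manifest. The following is a minimal sketch, not the official sample: the resource name `rayjob-sketch`, the file name, the `entrypoint` command, the `env_vars` entry, the image tag, and the abbreviated head-only `rayClusterSpec` are all assumptions for illustration, and the `ray.io/v1` API version is assumed to match KubeRay v1.1.0. The block only writes the file locally; it hasn't been validated against a cluster.

```sh
# Write a hypothetical RayJob manifest to a local file. Nothing is applied to a cluster here.
cat > rayjob-sketch.yaml <<'EOF'
apiVersion: ray.io/v1              # assumed API version for KubeRay v1.1.0
kind: RayJob
metadata:
  name: rayjob-sketch              # hypothetical name
spec:
  # Ray job configuration
  entrypoint: 'python -c "import ray; ray.init(); print(ray.cluster_resources())"'
  runtimeEnvYAML: |
    env_vars:
      counter_name: "test_counter"
  # Automatic resource cleanup
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 10
  # RayCluster configuration (abbreviated: head group only, no worker groups)
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0   # assumed image tag
EOF
```

If the sketch suits your cluster, `kubectl apply -f rayjob-sketch.yaml` would create the RayJob in the same way as the sample files used in the steps below.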

## Example: Run a simple Ray job with RayJob

@@ -47,17 +53,16 @@ kind create cluster --image=kindest/node:v1.23.0

## Step 2: Install the KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator via Helm repository.
Please note that the YAML file in this example uses `serveConfigV2` to specify a multi-application Serve config, which is supported starting from KubeRay v0.6.0.
Follow the [RayCluster Quickstart](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository.

## Step 3: Install a RayJob

```sh
# Step 3.1: Download `ray_v1alpha1_rayjob.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml
# Step 3.1: Download `ray-job.sample.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.1.0/ray-operator/config/samples/ray-job.sample.yaml

# Step 3.2: Create a RayJob
kubectl apply -f ray_v1alpha1_rayjob.yaml
kubectl apply -f ray-job.sample.yaml
```

## Step 4: Verify the Kubernetes cluster status
@@ -74,8 +79,8 @@ kubectl get rayjob
kubectl get raycluster

# [Example output]
# NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE
# rayservice-sample-raycluster-6mj28 1 1 ready 2m27s
# NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
# rayjob-sample-raycluster-tlsxc 1 1 400m 0 0 ready 91m

# Step 4.3: List all Pods in the `default` namespace.
# The Pod created by the Kubernetes Job will be terminated after the Kubernetes Job finishes.
@@ -88,18 +93,18 @@ kubectl get pods
# rayjob-sample-raycluster-9c546-worker-small-group-nfbxm 1/1 Running 0 3m46s

# Step 4.4: Check the status of the RayJob.
# The field `jobStatus` in the RayJob custom resource will be updated to `SUCCEEDED` once the job finishes.
kubectl get rayjobs.ray.io rayjob-sample -o json | jq '.status.jobStatus'
# The field `jobStatus` in the RayJob custom resource will be updated to `SUCCEEDED` and `jobDeploymentStatus`
# should be `Complete` once the job finishes.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobStatus}'
# [Expected output]: SUCCEEDED

# [Example output]
# "SUCCEEDED"
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expected output]: Complete
```

The KubeRay operator will create a RayCluster as defined by the `rayClusterSpec` custom resource, as well as a Kubernetes Job to submit a Ray job to the RayCluster.
The Ray job is defined in the `entrypoint` field of the RayJob custom resource.
In this example, the `entrypoint` is `python /home/ray/samples/sample_code.py`,
and `sample_code.py` is a Python script stored in a Kubernetes ConfigMap mounted to the head Pod of the RayCluster.
Since the default value of `shutdownAfterJobFinishes` is false, the RayCluster will not be deleted after the job finishes.
The KubeRay operator creates a RayCluster custom resource based on the `rayClusterSpec` and a submitter Kubernetes Job to submit a Ray job to the RayCluster.
In this example, the `entrypoint` is `python /home/ray/samples/sample_code.py`, and `sample_code.py` is a Python script stored in a Kubernetes ConfigMap mounted to the head Pod of the RayCluster.
Because the default value of `shutdownAfterJobFinishes` is false, the KubeRay operator doesn't delete the RayCluster or the submitter when the Ray job finishes.

## Step 5: Check the output of the Ray job

@@ -134,17 +139,52 @@ kubectl logs -l=job-name=rayjob-sample

The Python script `sample_code.py` used by `entrypoint` is a simple Ray script that executes a counter's increment function 5 times.

## Step 6: Delete the RayJob

```sh
kubectl delete -f ray-job.sample.yaml
```

## Step 7: Create a RayJob with `shutdownAfterJobFinishes` set to true

```sh
# Step 7.1: Download `ray-job.shutdown.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.1.0/ray-operator/config/samples/ray-job.shutdown.yaml

# Step 7.2: Create a RayJob
kubectl apply -f ray-job.shutdown.yaml
```

The `ray-job.shutdown.yaml` defines a RayJob custom resource with `shutdownAfterJobFinishes: true` and `ttlSecondsAfterFinished: 10`.
Hence, the KubeRay operator deletes the RayCluster and the submitter 10 seconds after the Ray job finishes.

## Step 8: Check the RayJob status

```sh
# Wait until `jobStatus` is `SUCCEEDED` and `jobDeploymentStatus` is `Complete`.
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobDeploymentStatus}'
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobStatus}'
```
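
Rather than rerunning the two commands by hand, the wait can be scripted. The sketch below assumes a hypothetical helper named `wait_for_rayjob`; the timeout and polling interval defaults are arbitrary. It polls `jobDeploymentStatus` until it reads `Complete`, then prints `jobStatus`.

```sh
# wait_for_rayjob NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Hypothetical helper: polls the RayJob until `jobDeploymentStatus` is `Complete`,
# then prints `jobStatus`. Returns 1 if the timeout elapses first.
wait_for_rayjob() {
  name="$1"
  timeout="${2:-600}"
  interval="${3:-5}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    deployment_status="$(kubectl get rayjobs.ray.io "$name" -o jsonpath='{.status.jobDeploymentStatus}')"
    if [ "$deployment_status" = "Complete" ]; then
      # Print the final job status followed by a newline.
      kubectl get rayjobs.ray.io "$name" -o jsonpath='{.status.jobStatus}'
      echo
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out waiting for RayJob $name" >&2
  return 1
}
```

For this step, the call would be `wait_for_rayjob rayjob-sample-shutdown`.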

## Step 9: Check if the KubeRay operator deletes the RayCluster and the submitter

```sh
# List the RayCluster custom resources in the `default` namespace. The RayCluster and the submitter Kubernetes
# Job associated with the RayJob `rayjob-sample-shutdown` should be deleted.
kubectl get raycluster
kubectl get jobs
```
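
The same check can be turned into a pass/fail helper. This is a sketch under two naming assumptions inferred from the example output in this document rather than from a documented guarantee: the RayCluster is named `<RayJob name>-raycluster-<suffix>`, and the submitter Kubernetes Job has the same name as the RayJob. The function name `rayjob_resources_deleted` is hypothetical.

```sh
# rayjob_resources_deleted NAME
# Hypothetical helper: returns 0 when neither a RayCluster named "<NAME>-raycluster-*"
# nor a Kubernetes Job named "<NAME>" remains in the current namespace.
rayjob_resources_deleted() {
  name="$1"
  # `-o name` prints one "<kind>/<name>" entry per line, or nothing when no resource exists.
  if kubectl get raycluster -o name | grep -q "/${name}-raycluster-"; then
    echo "RayCluster for $name still exists" >&2
    return 1
  fi
  if kubectl get jobs -o name | grep -q "/${name}\$"; then
    echo "Submitter Job for $name still exists" >&2
    return 1
  fi
  return 0
}
```

Running `rayjob_resources_deleted rayjob-sample-shutdown` more than 10 seconds after the Ray job finishes should return 0 if the operator cleaned up as configured.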

## Step 6: Cleanup
## Step 10: Clean up

```sh
# Step 6.1: Delete the RayJob
kubectl delete -f ray_v1alpha1_rayjob.yaml
# Step 10.1: Delete the RayJob
kubectl delete -f ray-job.shutdown.yaml

# Step 6.2: Delete the KubeRay operator
# Step 10.2: Delete the KubeRay operator
helm uninstall kuberay-operator

# Step 6.3: Delete the Kubernetes cluster
# Step 10.3: Delete the Kubernetes cluster
kind delete cluster
```
