# Quick Start Guide for Distributed Workloads with the CodeFlare Stack
This quick start guide is intended to walk users through installation of the CodeFlare stack and an initial demo using the CodeFlare-SDK from within a Jupyter notebook environment. This will enable users to run and submit distributed workloads.
The CodeFlare-SDK was built to make managing distributed compute infrastructure in the cloud easy and intuitive for Data Scientists. However, that means there needs to be some cloud infrastructure on the backend for users to get the benefit of using the SDK. Currently, we support the CodeFlare stack.
This stack integrates well with Red Hat OpenShift AI and Open Data Hub, and helps to bring batch workloads, jobs, and queuing to the Data Science platform. This guide proceeds with Red Hat OpenShift AI (RHOAI), but the steps apply equally to Open Data Hub (ODH): both platforms are available in the OperatorHub, and their installation and configuration steps are quite similar.
In addition to the resources required by default Red Hat OpenShift AI deployments, you will need the following to deploy the Distributed Workloads stack infrastructure pods:
Total:
- CPU: 1600m (1.6 vCPU)
- Memory: 2048Mi (2 GiB)
**Note:** The above resources are just for the infrastructure pods. To run actual workloads on your cluster you will need additional resources based on the size and type of the workload.
**Important:** This step is necessary only if you require GPU capabilities for your workloads and your OpenShift cluster does not already include GPU-equipped nodes. If that is the case, follow these steps:
- Open the OpenShift Cluster Console.
- Navigate to your-cluster -> Machine pools.
- Click on “Add machine pool”.
- Provide a name for the new machine pool.
- In the “Compute node instance type” dropdown, scroll all the way down and search for the GPU instance type `g4dn.xlarge` or similar.
- Click on “Add machine pool” to finalize the creation of your new GPU-enabled machine pool.
After adding the machine pool, OpenShift will begin provisioning the new GPU worker node. This process can take a few minutes. Once completed, the new node will be ready to handle GPU-accelerated workloads.
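If your cluster is managed through ROSA, the same machine pool can also be created with the `rosa` CLI. A minimal sketch, where the cluster name, pool name, and replica count are assumptions to adapt:

```bash
# Hypothetical values: replace <cluster-name> and the sizing to suit your needs.
rosa create machinepool --cluster <cluster-name> \
  --name gpu-pool \
  --instance-type g4dn.xlarge \
  --replicas 1
```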
**Note:** The `g4dn.xlarge` instance type is used for GPU worker nodes. Ensure this instance type meets your application needs or select another as required.
As a quick alternative to the manual deployment steps that follow, an automated Makefile script can be used to deploy the entire CodeFlare stack, including the prerequisite operators.
- Clone the repository:

  ```bash
  git clone https://github.com/project-codeflare/codeflare-operator.git
  cd codeflare-operator
  ```

- Run the Makefile script:

  ```bash
  make all-in-one
  ```
**Tip:** Execute `make help` to list additional available operations.
After the automatic deployment is complete, you can proceed directly to the section Configure Kueue for Task Scheduling to finish setting up your environment.
This Quick Start guide assumes that you have administrator access to an OpenShift cluster and that an existing Red Hat OpenShift AI (RHOAI) installation, version > 2.9, is present on your cluster. If it is not, the quick steps to install RHOAI are as follows:
- Using the OpenShift web console, navigate to Operators -> OperatorHub.
- Search for `Red Hat OpenShift AI`.
- Install it using the `fast` channel.
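Alternatively, the operator can be installed from the command line by creating a Subscription. This is a sketch under the assumption that the package is named `rhods-operator` and lives in the `redhat-operators` catalog; verify both in your OperatorHub before applying:

```bash
# Sketch only: confirm the package name and channel for your cluster.
oc create namespace redhat-ods-operator
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: fast
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```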
After the installation of the Red Hat OpenShift AI Operator, proceed to configure the necessary components for data science work:
- From the OpenShift web console, navigate to the installed RHOAI Operator.
- Look for the tab labeled DSC Initialization.
- If a DSCInitialization has not already been created, locate `Create DSCInitialization` and create one.
- Look for the tab labeled Data Science Cluster.
- Locate `Create DataScienceCluster` and create one (a CLI sketch follows this list).
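If you prefer the CLI, the DataScienceCluster can also be created with `oc`. The following is a minimal sketch assuming the default component set; component names and management states can vary between RHOAI versions, so verify them against your operator's documentation:

```bash
# Sketch only: adjust the component list to match your RHOAI version.
cat <<'EOF' | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    ray:
      managementState: Managed
    kueue:
      managementState: Managed
    dashboard:
      managementState: Managed
    workbenches:
      managementState: Managed
EOF
```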
To leverage GPU-enabled workloads on your OpenShift cluster, you need to install both the Node Feature Discovery (NFD) Operator and the NVIDIA GPU Operator.
Both the NFD and the NVIDIA GPU Operators can be installed from the OperatorHub. Detailed steps for installation and configuration are provided in the NVIDIA GPU Operator documentation.
- Open the OpenShift dashboard.
- Navigate to OperatorHub.
- Search for and install the following operators (default settings are fine):
- Node Feature Discovery Operator
- NVIDIA GPU Operator
After installing the Node Feature Discovery Operator, you need to create a Node Feature Discovery Custom Resource (CR). You can use the default settings for this CR:
- Create the Node Feature Discovery CR in the dashboard.
- Several pods will start in the `openshift-nfd` namespace (the default). Wait for all of these pods to become operational. Once active, your nodes will be labeled with numerous feature flags, indicating that the operator is functioning correctly.
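To confirm that the operator is healthy, you can also check the namespace from a terminal:

```bash
# All NFD pods should eventually reach the Running state.
oc get pods -n openshift-nfd
```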
After installing the NVIDIA GPU Operator, proceed with creating a GPU ClusterPolicy Custom Resource (CR):
- Create the GPU ClusterPolicy CR through the dashboard.
- This action will trigger several pods to start in the NVIDIA GPU namespace.
**Note:** These pods may take some time to become operational as they compile the necessary drivers.
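You can watch these pods in the same way (the namespace is typically `nvidia-gpu-operator`, but verify it on your cluster):

```bash
# Re-run until all pods are Running or Completed; driver compilation takes time.
oc get pods -n nvidia-gpu-operator
```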
Kueue is used for managing and scheduling task workflows in your cluster. To configure Kueue in your environment, follow these steps:
- Install the Kueue resources, namely a ClusterQueue, a ResourceFlavor, and a LocalQueue, as shown in the sketch below.
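A minimal sketch of the three resources follows; the queue names, covered resources, and quota values are placeholder assumptions that you should adapt to your cluster's actual capacity:

```bash
# Sketch only: names and quotas below are illustrative placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}  # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8
      - name: "memory"
        nominalQuota: 32Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 1
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default  # create one LocalQueue per workload namespace
  name: local-queue
spec:
  clusterQueue: cluster-queue
EOF
```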
After setting up the Data Science Cluster components, you can start using the Jupyter notebooks for your data science projects. Here’s how to launch a Jupyter notebook:
- Access the RHOAI Dashboard:
- Create a Data Science Project:
- Go to the Data Science Projects section from the dashboard menu.
- Click on `Create data science project` and follow the prompts to set up a new project.
- Launch Jupyter Workbench:
- Inside your newly created project, find and click on the "Create Workbench" button.
- On the Workbench creation page, select "Standard Data Science" from the list of available notebook images. This image will include common data science libraries and tools that you might need.
- Configure any additional settings, such as compute resources or environment variables, as needed, then click `Create workbench`.
- Access Your Notebook:
- Once the workbench is ready, click on the provided link or button to open your Jupyter notebook.
We can now go ahead and submit our first distributed model training job to our cluster.
This can be done from any Python-based environment, including a script or a Jupyter notebook. For this guide, we'll assume you've selected the "Standard Data Science" image from the list of available images on your notebook spawner page.
Once your notebook environment is ready, in order to test our CodeFlare stack we will want to run through some of the demo notebooks provided by the CodeFlare community. So let's start by cloning their repo into our working environment.
```bash
git clone https://github.com/project-codeflare/codeflare-sdk
cd codeflare-sdk
```
For further development guidelines and instructions on setting up your development environment for codeflare-sdk, please refer to the CodeFlare SDK README.
Get started with the guided demo notebooks for the CodeFlare-SDK by following these steps:
- Access Your Jupyter Notebook Server:
- Update Your Notebook with Access Token and Server Details:
- Retrieve your OpenShift access token by selecting your username in the console, choosing "Copy Login Command", and then "Display Token".
- Open your desired demo notebook from the `codeflare-sdk/demo-notebooks/guided-demos` directory.
- Update the notebook with your access token and server details and run the demos.
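If you are already logged in with the `oc` CLI, the token and server URL can also be retrieved from the command line:

```bash
# Print the API token and server URL for the current oc session.
oc whoami --show-token
oc whoami --show-server
```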
To completely clean up all the components after an install, run:

```bash
make delete-all-in-one
```
If you prefer to manually clean up the installation or need to manually remove individual components and operators, follow these steps:
- Uninstall Operators
- Open the OpenShift dashboard.
- Go to Installed Operators.
- Look for any operators you have installed, such as the NVIDIA GPU Operator, Node Feature Discovery Operator, and Red Hat OpenShift AI Operator.
- Click on the operator and then click Uninstall Operator. Follow the prompts to remove the operator and its associated resources.
And with that, you have gotten started using the CodeFlare stack alongside your Red Hat OpenShift AI deployment, adding distributed workloads and batch computing to your machine learning platform.
You are now ready to try out the stack with your own machine learning workloads. If you'd like more examples, you can also run through the existing demo code provided by the CodeFlare-SDK community.