This reference architecture is intended to help platform administrators, cloud architects, and operations professionals deploy a batch processing platform on Google Kubernetes Engine (GKE). This document features GKE in Standard mode, using Kueue to manage resource quotas and borrowing rules between multiple tenant teams sharing the cluster to run their batch workloads in a fair, cost-efficient and performant way. Best practices for running batch workloads on GKE discusses many of the recommendations that are implemented in this document.
Traditionally, batch platforms have two main user personas, developers and platform administrators:
- A developer submits a Job specifying the program, the data to be processed, and requirements for the Job.
- A platform administrator manages and delivers an efficient and reliable batch processing platform to the developers.
Let's assume we have four teams of developers (`team-a`, `team-b`, `team-c`, `team-d`) who want to share a GKE cluster to run their batch machine learning training workloads. Additionally, `team-a` and `team-b` share billing, so they would like to use each other's unused resource quota for their workloads. `team-c` and `team-d` have a similar agreement. The organization has purchased a Committed Use Discount consumed via reservations and would like to maximize the utilization of its reservations before bursting into On-demand or Spot VMs.
To support these teams, the platform administrator would create a GKE cluster and configure it with Google published good practices for building batch platforms on GKE, in addition to their specific organizational best practices.
This reference architecture illustrates an example of a batch platform on GKE that can support multiple teams:
- A regional GKE cluster with Node Auto Provisioning enabled, private nodes and Cloud NAT for outbound external access. A Cloud Monitoring dashboard tracking GPU utilization metrics is also created.
- The cluster is configured with user-defined node pools and is capable of creating auto-provisioned GPU node pools based on workload requirements:
  - Reserved node pool consuming zonal NVIDIA L4 GPU VM reservations.
  - Autoscaling On-demand node pool with NVIDIA L4 GPUs for high-priority Jobs spilling over from reservations.
  - Autoscaling Spot node pool with NVIDIA L4 GPUs for failure-tolerant or low-priority workloads spilling over from reservations.
  - Compactly placed, auto-provisioned Spot node pools for low-latency communication between workload Pods consuming NVIDIA A100 GPUs.
- Four teams (`team-a`, `team-b`, `team-c`, `team-d`), each with their own namespace on the cluster, with Workload Identity established.
- PriorityClasses defined for low-priority (default), high-priority and compactly placed Jobs. Incoming high-priority Jobs can preempt running low-priority Jobs to reclaim reserved resources.
- Kueue is configured such that:
  - Each team has a `high-priority` queue, a `low-priority` queue and a `compact-placement` queue to which Jobs are submitted.
  - Four ResourceFlavors are defined: one each for `reserved`, `on-demand` and `spot` G2 VMs with NVIDIA L4 GPUs, and one for `spot` A2 VMs with NVIDIA A100 GPUs that are compactly placed.
- Prometheus and Grafana are installed on the cluster for monitoring Kueue.
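As a rough illustration of the Kueue objects described above, the sketch below writes a hypothetical manifest for one team's low-priority queue. The flavor names (`reserved`, `spot`), the ClusterQueue name `cq-team-a-lp`, the cohort `team-a-b` and the `StrictFIFO` strategy appear in this guide; the LocalQueue name, the node pool labels used to bind each flavor (inferred from the node names shown later) and every quota number are assumptions made for illustration. The real configuration lives in `02_platform/kueue`.

```shell
# Write a hypothetical Kueue manifest; quotas and label values are illustrative only.
cat > kueue-sketch.yaml <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: reserved
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: reserved-np   # assumed node pool name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: spot-np       # assumed node pool name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-team-a-lp
spec:
  cohort: team-a-b                 # queues in one cohort can borrow unused quota
  queueingStrategy: StrictFIFO
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: team-a
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: reserved               # flavors are tried in the order listed
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 2            # assumed value
        borrowingLimit: 2          # assumed value
    - name: spot                   # fallback once reserved quota is used up
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16           # assumed value
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-team-a-lp               # hypothetical name
  namespace: team-a
spec:
  clusterQueue: cq-team-a-lp
EOF

# Apply once Kueue is installed on the cluster:
#   kubectl apply -f kueue-sketch.yaml
echo "wrote kueue-sketch.yaml"
```

Only `nvidia.com/gpu` is covered here to keep the sketch short; workloads that also request CPU or memory would likely need those resources listed in `coveredResources` with quotas of their own.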
This reference architecture shows the teams submitting the following workloads:
- Distributed (multi-GPU, multi-host) machine learning model training using PyTorch and the MNIST dataset, using Kubernetes Jobs in Indexed mode. See original example published here. This workload is used to showcase Kubernetes Job priority and preemption behavior using L4 GPUs and to show how to set up A100 GPUs in compact placement for multi-node training:
  - Low-priority Jobs: Jobs that don't specify a PriorityClass are assigned this default PriorityClass. Kueue is configured to allow these Jobs to run on reserved VMs or Spot G2 VMs. If there is no room in the reserved node pool, or if a low-priority Job is preempted from it, Kueue evaluates the Job against the other ResourceFlavors and assigns it the Spot G2 VM ResourceFlavor. In this example, each low-priority Job has two Pods, each consuming two L4 GPUs.
  - High-priority Jobs: Jobs specifying this PriorityClass will preempt any low-priority Jobs running on reserved VMs. Any overflow high-priority Jobs will trigger scale-up in the On-demand node pool. In this example, each high-priority Job has two Pods, each consuming two L4 GPUs.
  - Compactly placed Jobs: When admitted by Kueue, these Jobs trigger GKE Node Auto Provisioning to create node pools purpose-built for each of the Jobs, with nodes placed in close physical proximity to each other for low-latency communication. In this example, each compactly placed Job has two Pods, each consuming a single A100 GPU.
- Distributed (multi-GPU, multi-host) machine learning training using PyTorch and the MNIST dataset, using the Kubernetes JobSet API. In this example, JobSet automatically creates the headless ClusterIP service for all the workers in the Job to communicate with each other. See original example published here.
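To make the shape of a low-priority workload concrete, here is a minimal sketch of an Indexed Job with two Pods, each requesting two GPUs, submitted to Kueue through a queue label. The Job and Service names match the expected output shown later in this guide; the container image and the LocalQueue name `lq-team-a-lp` are hypothetical placeholders, and the real manifests in the repository may differ.

```shell
# Write a hypothetical low-priority Indexed Job manifest (not the repository's own).
cat > job-sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: team-a-low-priority-job-0
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: lq-team-a-lp   # hypothetical LocalQueue name
spec:
  suspend: true              # Kueue unsuspends the Job once it is admitted
  completionMode: Indexed    # each Pod gets a stable completion index
  completions: 2
  parallelism: 2             # two Pods run at the same time
  template:
    spec:
      restartPolicy: Never
      subdomain: team-a-low-priority-svc-0    # headless Service for Pod-to-Pod DNS
      containers:
      - name: trainer
        image: example.com/pytorch-mnist:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 2                 # two L4 GPUs per Pod
EOF

# Submit once the platform is deployed:
#   kubectl apply -f job-sketch.yaml
echo "wrote job-sketch.yaml"
```

No `priorityClassName` is set, so under the setup described above the Job would fall back to the default (low-priority) PriorityClass.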
- This reference architecture has been tested on Cloud Shell, which comes preinstalled with the Google Cloud SDK required to complete this guide.
- Since a recent change in Grafana, services that run behind a reverse proxy cannot directly access metrics data, so you would not be able to view Kueue metrics in Grafana when port-forwarding from Cloud Shell. For a portion of this guide you will port-forward from your local machine, which must have the gcloud CLI available.
- It is recommended to start the guide in a fresh project, since the easiest way to clean up once complete is to delete the project. See here for more details.
- This guide requires a number of different GCP quotas (~60 L4 GPUs, ~30 Spot A100 GPUs and up to 600 CPU cores) in the region of your choosing. Visit the IAM -> Quotas page in the context of your project, region and zone to request additional quota before proceeding with this guide. This document can help you find the appropriate regions and zones where the G2 (with NVIDIA L4 GPUs) and A2 (with NVIDIA A100 GPUs) VM families are available. For example, the zones us-central1-a, us-central1-b and us-central1-c all have both VM families available.
- Export the required environment variables; replace the value of `YOUR_PROJECT_ID` with that of a fresh project you created for this tutorial, `YOUR_REGION` with the name of your chosen Google Cloud region (e.g. us-central1) and `YOUR_ZONE` with one of the zones in your chosen region (e.g. us-central1-c).

      export PROJECT_ID=YOUR_PROJECT_ID
      export REGION=YOUR_REGION
      export ZONE=YOUR_ZONE
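Before moving on, it can help to confirm that the three variables are set, are no longer placeholders, and that the zone belongs to the region. The helper below is a small sketch for that purpose; `check_env` is a hypothetical function name and is not part of the repository.

```shell
# Hypothetical helper: validate PROJECT_ID, REGION and ZONE before continuing.
check_env() {
  for name in PROJECT_ID REGION ZONE; do
    # Read the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ] || [ "${value}" = "YOUR_${name}" ]; then
      echo "error: ${name} is unset or still a placeholder" >&2
      return 1
    fi
  done
  # A zone name is its region name followed by a suffix such as "-c".
  case "${ZONE}" in
    "${REGION}"-?*) ;;
    *)
      echo "error: zone ${ZONE} does not belong to region ${REGION}" >&2
      return 1
      ;;
  esac
  echo "ok: ${PROJECT_ID} ${REGION} ${ZONE}"
}
```

Run `check_env` after the exports; it prints a single `ok:` line when the values look consistent and returns a non-zero status otherwise.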
- Clone this repo, switch to the appropriate subdirectory and run the `create-platform.sh` script. The script enables the required APIs, creates an IAM policy binding for the Cloud Build service account, creates an Artifact Registry repository to host the Cloud Build container images and submits a Cloud Build job to create the required Google Cloud resources using Terraform. For more details see `create-platform.sh`. Navigate to the Cloud Build page in the Google Cloud console to view the status of the build; ensure you're in the context of the correct project.

      cd $HOME && \
      git clone https://github.com/GoogleCloudPlatform/ai-on-gke.git && \
      cd ai-on-gke/gke-batch-refarch && \
      ./create-platform.sh
- While the build runs, head over to the Kubernetes Engine page in the console. You can see the cluster being created and configured by the steps running in Cloud Build.

  a. The `Setup GKE` step will create the GKE cluster with a default node pool for system Pods, a statically sized reserved node pool with four g2-standard-24 reserved VMs with two NVIDIA L4 accelerators each, one auto-scaled (0 - 24 nodes) On-demand node pool for spill-over high-priority Jobs, and one auto-scaled (0 - 36 nodes) Spot node pool for spill-over low-priority Jobs. For more details, or to modify the infrastructure deployed in this document, see `01_gke/main.tf` in this repository.

  b. Once the cluster is ready, the next steps deploy the recommended DaemonSet to install GPU drivers and install the system resources for Kueue and JobSet.

  c. The `Deploy Manifests: Priority Classes, Teams and Kueue configuration` step establishes default, high and compact PriorityClasses to showcase preemption of lower priority workloads by higher priority workloads for reserved resources. This step deploys four team manifests, one for each of `team-a`, `team-b`, `team-c` and `team-d`, including Namespaces and Kubernetes Service Accounts to be used for Workload Identity. In addition, this step configures Kueue such that each team gets a high priority queue, a low priority queue and a compact placement queue, with the ability to specify `nominalQuota` and `borrowingLimit` for the respective flavors. See `02_platform/kueue` for more details.

  d. As a part of the platform deployment, a monitoring dashboard is created that tracks the number and utilization of GPUs on the cluster. Head over to the Cloud Monitoring Dashboards page and you should see a dashboard entitled `NVIDIA GPU Monitoring Overview (GCE & GKE)`. Keep a tab with this dashboard open and auto-refresh enabled (icon in the top right corner next to the time-frame).
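For readers unfamiliar with PriorityClasses, the sketch below shows what the three classes established by the Deploy Manifests step might look like. The class names and numeric values are assumptions chosen for illustration; only the existence of a default, a high and a compact class comes from this guide.

```shell
# Write hypothetical PriorityClass manifests; names and values are illustrative.
cat > priorityclasses-sketch.yaml <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority          # hypothetical name
value: 100                    # assumed value
globalDefault: true           # applied to Pods that name no PriorityClass
description: "Default class for low-priority batch Jobs."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority         # hypothetical name
value: 1000                   # assumed value; higher values win
globalDefault: false
description: "High-priority Jobs that may preempt low-priority Jobs."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: compact-priority      # hypothetical name
value: 500                    # assumed value
globalDefault: false
description: "Compactly placed Jobs on auto-provisioned node pools."
EOF

# Apply with:
#   kubectl apply -f priorityclasses-sketch.yaml
echo "wrote priorityclasses-sketch.yaml"
```

Only one PriorityClass in a cluster should set `globalDefault: true`; that is how Jobs without an explicit class end up in the low-priority tier.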
- For monitoring Kueue metrics, Prometheus and Grafana have been deployed on the GKE cluster and a dashboard to visualize the data has been configured. Since a recent change in Grafana, services that run behind a reverse proxy cannot directly access metrics data, so you would not be able to view Kueue metrics when port-forwarding from Cloud Shell. Instead, you will port-forward from your local machine, which must have the gcloud CLI available.

  a. Open a new terminal on your local machine and get the cluster credentials using the following command (export `REGION` and `PROJECT_ID` in this terminal first if they are not already set):

      gcloud container clusters get-credentials gke-batch-refarch --region $REGION --project $PROJECT_ID

  b. Next, create a port-forward to the `grafana` service running in the cluster so you can use your web browser to access the Grafana UI. Keep this terminal open for the rest of this guide.

      kubectl port-forward svc/grafana 8080:3000 -n monitoring

  Expected output:

      Forwarding from 127.0.0.1:8080 -> 3000

  c. Open a new tab in your web browser and navigate to `http://localhost:8080`. You should see the Grafana login page. Use `admin` as both the username and the password.

  d. In the menu located on the top left part of the home page, click `Dashboards`.

  e. Navigate to `Kueue Dashboard`. You should see the current number of nodes in the cluster and other useful Kueue metrics; there should be no workloads pending or admitted. We will return to this dashboard periodically during this guide to see how the system has reacted to incoming workloads.
- Deploying Low Priority workloads: Switch to the `03_low_priority` directory and run the `create_workloads.sh` script. This script will connect to the cluster and deploy one Job from each team at a time until all teams have four low priority Jobs submitted (job-0 through job-3).

      cd $HOME/ai-on-gke/gke-batch-refarch/03_low_priority && \
      ./create_workloads.sh

  Expected output:

      service/team-a-low-priority-svc-0 created
      configmap/team-a-low-priority-config-0 created
      job.batch/team-a-low-priority-job-0 created
      ...
      service/team-d-low-priority-svc-3 created
      configmap/team-d-low-priority-config-3 created
      job.batch/team-d-low-priority-job-3 created
  a. List the Jobs running on the GKE cluster across all namespaces. You should see four Jobs created for each team. These Jobs have been admitted by Kueue.

      watch kubectl get jobs --all-namespaces

  Expected output:

      NAMESPACE   NAME                        COMPLETIONS   DURATION   AGE
      team-a      team-a-low-priority-job-0   0/2           54s        55s
      team-a      team-a-low-priority-job-1   0/2           49s        49s
      team-a      team-a-low-priority-job-2   0/2           43s        43s
      team-a      team-a-low-priority-job-3   0/2           38s        38s
      team-b      team-b-low-priority-job-0   0/2           53s        53s
      team-b      team-b-low-priority-job-1   0/2           47s        48s
      team-b      team-b-low-priority-job-2   0/2           42s        42s
      team-b      team-b-low-priority-job-3   0/2           37s        37s
      team-c      team-c-low-priority-job-0   0/2           52s        52s
      team-c      team-c-low-priority-job-1   0/2           46s        47s
      team-c      team-c-low-priority-job-2   0/2           41s        41s
      team-c      team-c-low-priority-job-3   0/2           36s        36s
      team-d      team-d-low-priority-job-0   0/2           51s        51s
      team-d      team-d-low-priority-job-1   0/2           45s        46s
      team-d      team-d-low-priority-job-2   0/2           40s        40s
      team-d      team-d-low-priority-job-3   0/2           35s        35s
  b. In a new terminal tab, watch the ClusterQueues. You should see four admitted workloads for each team's low priority ClusterQueue. These workloads correspond to the Jobs you saw in the previous step.

      watch kubectl get clusterqueues

  Expected output:

      NAME                COHORT     STRATEGY     PENDING WORKLOADS   ADMITTED WORKLOADS
      cq-team-a-compact   team-a-b   StrictFIFO   0                   0
      cq-team-a-hp        team-a-b   StrictFIFO   0                   0
      cq-team-a-lp        team-a-b   StrictFIFO   0                   4
      cq-team-b-compact   team-a-b   StrictFIFO   0                   0
      cq-team-b-hp        team-a-b   StrictFIFO   0                   0
      cq-team-b-lp        team-a-b   StrictFIFO   0                   4
      cq-team-c-compact   team-c-d   StrictFIFO   0                   0
      cq-team-c-hp        team-c-d   StrictFIFO   0                   0
      cq-team-c-lp        team-c-d   StrictFIFO   0                   4
      cq-team-d-compact   team-c-d   StrictFIFO   0                   0
      cq-team-d-hp        team-c-d   StrictFIFO   0                   0
      cq-team-d-lp        team-c-d   StrictFIFO   0                   4
  c. Since the Reserved node pool already has nodes available to run admitted low priority workloads, some of these workloads will be scheduled on the Reserved nodes.

  d. In a new terminal tab, watch the cluster nodes. Initially you should see ten nodes: six in the default node pool and four in the reserved node pool.

      watch kubectl get nodes

  Expected output:

      NAME                                               STATUS   ROLES    AGE     VERSION
      gke-gke-batch-refarch-default-pool-8bba21a2-3328   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-8bba21a2-bmzg   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-8k19   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-srxf   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-gn5j   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-wht3   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-djvf    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p2w7    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p42h    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-r6rt    Ready    <none>   12h     v1.28.3-gke.1203001
  e. After a short while you should see Spot nodes being added to the Spot node pool to accommodate the low priority Jobs that could not fit on the Reserved node pool (sized for two Jobs at a time).

  Expected output:

      NAME                                               STATUS   ROLES    AGE     VERSION
      gke-gke-batch-refarch-default-pool-8bba21a2-3328   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-8bba21a2-bmzg   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-8k19   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-srxf   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-gn5j   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-wht3   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-djvf    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p2w7    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p42h    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-r6rt    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-2gl8        Ready    <none>   102s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-2thb        Ready    <none>   97s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-4ccx        Ready    <none>   89s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-69p2        Ready    <none>   94s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-7246        Ready    <none>   80s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-7rhl        Ready    <none>   96s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-8267        Ready    <none>   95s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-c5x5        Ready    <none>   100s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-cc2h        Ready    <none>   109s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-fsjs        Ready    <none>   95s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-fsqw        Ready    <none>   100s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-h9bh        Ready    <none>   96s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-hshk        Ready    <none>   69s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-htjv        Ready    <none>   97s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-j7zc        Ready    <none>   94s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-jr45        Ready    <none>   104s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-l5v2        Ready    <none>   90s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-ltv4        Ready    <none>   98s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-ptx9        Ready    <none>   93s     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-q57t        Ready    <none>   100s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-qlsz        Ready    <none>   106s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-t5tl        Ready    <none>   100s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-t62s        Ready    <none>   105s    v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-wxrr        Ready    <none>   91s     v1.28.3-gke.1203001
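The statement that the Reserved node pool is sized for two Jobs at a time follows from numbers given earlier in this guide: four reserved g2-standard-24 nodes with two L4 GPUs each, and two Pods per Job with two GPUs per Pod. The short calculation below works this through; the variable names are only illustrative.

```shell
# Values stated in this guide.
reserved_nodes=4     # statically sized reserved node pool
gpus_per_node=2      # g2-standard-24 with two NVIDIA L4 GPUs
pods_per_job=2       # Pods per low-priority Job
gpus_per_pod=2       # L4 GPUs requested by each Pod
teams=4
jobs_per_team=4

reserved_gpus=$((reserved_nodes * gpus_per_node))     # GPUs in the reserved pool
gpus_per_job=$((pods_per_job * gpus_per_pod))         # GPUs one Job needs
jobs_on_reserved=$((reserved_gpus / gpus_per_job))    # Jobs that fit at once
submitted_jobs=$((teams * jobs_per_team))             # low-priority Jobs submitted
overflow_jobs=$((submitted_jobs - jobs_on_reserved))  # Jobs left for other flavors

echo "reserved_gpus=${reserved_gpus} gpus_per_job=${gpus_per_job}"
echo "jobs_on_reserved=${jobs_on_reserved} overflow_jobs=${overflow_jobs}"
```

This prints `reserved_gpus=8 gpus_per_job=4` and `jobs_on_reserved=2 overflow_jobs=14`. The number of Spot nodes visible at any given moment depends on Kueue admission and autoscaler timing, so it need not match the overflow exactly.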
- Deploying High Priority workloads: Switch to the `04_high_priority` directory and run the `create_workloads.sh` script. This script will connect to the cluster and deploy one Job from each team at a time until all teams have four high priority Jobs submitted (job-0 through job-3).

      cd $HOME/ai-on-gke/gke-batch-refarch/04_high_priority && \
      ./create_workloads.sh

  Expected output:

      service/team-a-high-priority-svc-0 created
      configmap/team-a-high-priority-config-0 created
      job.batch/team-a-high-priority-job-0 created
      ...
      service/team-d-high-priority-svc-3 created
      configmap/team-d-high-priority-config-3 created
      job.batch/team-d-high-priority-job-3 created
  a. Return to the terminal tab watching the `clusterqueues`. You should see high-priority workloads being added to the high-priority ClusterQueues.

      watch kubectl get clusterqueues

  Expected output:

      NAME                COHORT     STRATEGY     PENDING WORKLOADS   ADMITTED WORKLOADS
      cq-team-a-compact   team-a-b   StrictFIFO   0                   0
      cq-team-a-hp        team-a-b   StrictFIFO   2                   2
      cq-team-a-lp        team-a-b   StrictFIFO   1                   3
      cq-team-b-compact   team-a-b   StrictFIFO   0                   0
      cq-team-b-hp        team-a-b   StrictFIFO   2                   2
      cq-team-b-lp        team-a-b   StrictFIFO   0                   3
      cq-team-c-compact   team-c-d   StrictFIFO   0                   0
      cq-team-c-hp        team-c-d   StrictFIFO   2                   2
      cq-team-c-lp        team-c-d   StrictFIFO   0                   3
      cq-team-d-compact   team-c-d   StrictFIFO   0                   0
      cq-team-d-hp        team-c-d   StrictFIFO   2                   2
      cq-team-d-lp        team-c-d   StrictFIFO   0                   3
  b. As high priority workloads are admitted, the low priority workloads running on the Reserved node pool are evicted to make room.

  c. These low priority workloads are then set to pending and reevaluated by Kueue. They will use the Spot flavor to schedule onto the Spot node pool once new nodes have spun up to accommodate them.
  d. Return to the terminal tab watching the cluster nodes. After a short while you should see nodes being added to the On Demand node pool to accommodate the high priority Jobs that could not fit on the Reserved node pool.

  Expected output:

      NAME                                               STATUS   ROLES    AGE     VERSION
      gke-gke-batch-refarch-default-pool-8bba21a2-3328   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-8bba21a2-bmzg   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-8k19   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-srxf   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-gn5j   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-wht3   Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-426d    Ready    <none>   3m25s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-bs59    Ready    <none>   3m30s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-vktd    Ready    <none>   3m24s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-znst    Ready    <none>   3m30s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-7dpd    Ready    <none>   2m39s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-ggs4    Ready    <none>   2m41s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-qb49    Ready    <none>   2m32s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-x2cq    Ready    <none>   2m37s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-djvf    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p2w7    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p42h    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-r6rt    Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-2gl8        Ready    <none>   7m45s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-2thb        Ready    <none>   7m40s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-4ccx        Ready    <none>   7m32s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-69p2        Ready    <none>   7m37s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-7246        Ready    <none>   7m23s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-7rhl        Ready    <none>   7m39s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-8267        Ready    <none>   7m38s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-c5x5        Ready    <none>   7m43s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-cc2h        Ready    <none>   7m52s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-fsjs        Ready    <none>   7m38s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-fsqw        Ready    <none>   7m43s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-h9bh        Ready    <none>   7m39s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-hshk        Ready    <none>   7m12s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-htjv        Ready    <none>   7m40s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-j7zc        Ready    <none>   7m37s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-jr45        Ready    <none>   7m47s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-l5v2        Ready    <none>   7m33s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-ltv4        Ready    <none>   7m41s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-ptx9        Ready    <none>   7m36s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-q57t        Ready    <none>   7m43s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-qlsz        Ready    <none>   7m49s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-t5tl        Ready    <none>   7m43s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-t62s        Ready    <none>   7m48s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-spot-np-fd534d43-wxrr        Ready    <none>   7m34s   v1.28.3-gke.1203001

  The following diagram illustrates scale up in the On Demand node pool:

  The Kueue dashboard also shows updated node counts, pending and active workloads, and other useful metrics from the batch platform.
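The pending and admitted counts can also be totalled from the terminal without opening Grafana. The function below is a small sketch that sums the last two columns of `kubectl get clusterqueues` output; `summarize_clusterqueues` is a hypothetical helper name, and it assumes the column layout shown in this guide, with pending and admitted counts as the final two fields.

```shell
# Hypothetical helper: total the PENDING and ADMITTED columns read from stdin.
summarize_clusterqueues() {
  # Skip the header row, then add up the last two fields of every data row.
  awk 'NR > 1 { pending += $(NF - 1); admitted += $NF }
       END { printf "pending=%d admitted=%d\n", pending, admitted }'
}
```

Usage: `kubectl get clusterqueues | summarize_clusterqueues`. For the ClusterQueue listing shown in sub-step a of this step, the totals work out to `pending=9 admitted=20`.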
- Deploying compact placement workloads: Switch to the `05_compact_placement` directory and run the `create_workloads.sh` script. This script will connect to the cluster and deploy one Job from each team at a time until all teams have four compactly placed Jobs submitted (job-0 through job-3).

      cd $HOME/ai-on-gke/gke-batch-refarch/05_compact_placement && \
      ./create_workloads.sh

  Expected output:

      service/team-a-compact-svc-0 created
      configmap/team-a-compact-config-0 created
      job.batch/team-a-compact-job-0 created
      ...
      service/team-d-compact-svc-3 created
      configmap/team-d-compact-config-3 created
      job.batch/team-d-compact-job-3 created
  a. Return to the terminal tab watching the `clusterqueues`. You should see the submitted compact workloads pending in, and being admitted to, the compact ClusterQueues for all four teams.

  Expected output:

      NAME                COHORT     STRATEGY     PENDING WORKLOADS   ADMITTED WORKLOADS
      cq-team-a-compact   team-a-b   StrictFIFO   3                   1
      cq-team-a-hp        team-a-b   StrictFIFO   0                   2
      cq-team-a-lp        team-a-b   StrictFIFO   0                   0
      cq-team-b-compact   team-a-b   StrictFIFO   2                   1
      cq-team-b-hp        team-a-b   StrictFIFO   0                   2
      cq-team-b-lp        team-a-b   StrictFIFO   0                   0
      cq-team-c-compact   team-c-d   StrictFIFO   3                   1
      cq-team-c-hp        team-c-d   StrictFIFO   1                   2
      cq-team-c-lp        team-c-d   StrictFIFO   0                   0
      cq-team-d-compact   team-c-d   StrictFIFO   2                   1
      cq-team-d-hp        team-c-d   StrictFIFO   0                   2
      cq-team-d-lp        team-c-d   StrictFIFO   0                   1
  b. As these Jobs are admitted by Kueue, their Pods will initially be Unschedulable, meaning that the cluster does not yet have the resources these workloads are asking for. However, since the cluster is configured with Node Auto Provisioning, GKE will create and scale a node pool purpose-built for each Job based on its resource requests and limits. Once the Job is complete, the node pool will be cleaned up by GKE automatically, eliminating unnecessary spend.

  c. Return to the terminal tab watching the cluster nodes. After a short while you should see nodes being created in auto-provisioned Spot node pools to accommodate the compactly placed Jobs that could not find A100 GPUs in the node pools present on the cluster.

  Expected output:

      NAME                                                  STATUS   ROLES    AGE     VERSION
      gke-gke-batch-refarc-team-a-0-compact-4391522a-k57w   Ready    <none>   5m14s   v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-a-0-compact-4391522a-zqcr   Ready    <none>   5m12s   v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-b-0-compact-5ed4b762-7lsv   Ready    <none>   3m26s   v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-b-0-compact-5ed4b762-8h7n   Ready    <none>   3m26s   v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-c-0-compact-a2204c20-cwrt   Ready    <none>   3m44s   v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-c-0-compact-a2204c20-g4lz   Ready    <none>   4m7s    v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-d-0-compact-9f10df61-fwpj   Ready    <none>   2m49s   v1.28.3-gke.1203001
      gke-gke-batch-refarc-team-d-0-compact-9f10df61-jw6k   Ready    <none>   2m49s   v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-8bba21a2-3328      Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-8bba21a2-bmzg      Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-8k19      Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-srxf      Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-gn5j      Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-wht3      Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-426d       Ready    <none>   17m     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-bs59       Ready    <none>   17m     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-48a23fe6-vktd       Ready    <none>   17m     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-7dpd       Ready    <none>   16m     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-ggs4       Ready    <none>   16m     v1.28.3-gke.1203001
      gke-gke-batch-refarch-ondemand-np-8687a394-qb49       Ready    <none>   16m     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-djvf       Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p2w7       Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p42h       Ready    <none>   12h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-r6rt       Ready    <none>   12h     v1.28.3-gke.1203001
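The auto-provisioning in this step is driven by what each Job's Pod template asks for. The sketch below shows one plausible way to express that: a node selector naming the A100 accelerator and a placement group. The Job name comes from the expected output above; the LocalQueue name, the placement group value, the container image and the exact set of selector keys are assumptions and should be checked against the manifests in `05_compact_placement` and the GKE compact placement documentation. Spot selection is deliberately left out of the sketch.

```shell
# Write a hypothetical compact-placement Job manifest (illustrative only).
cat > compact-job-sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: team-a-compact-job-0
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: lq-team-a-compact   # hypothetical LocalQueue name
spec:
  suspend: true              # Kueue unsuspends the Job once it is admitted
  completionMode: Indexed
  completions: 2
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
        cloud.google.com/machine-family: a2                      # assumed key/value
        cloud.google.com/gke-placement-group: team-a-0-compact   # assumed group name
      containers:
      - name: trainer
        image: example.com/pytorch-mnist:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1                     # one A100 GPU per Pod
EOF

echo "wrote compact-job-sketch.yaml"
```

Because no existing node pool satisfies these selectors, the Pods start out Unschedulable, which is the signal Node Auto Provisioning acts on.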
- Deploying JobSet workloads: Switch to the `06_jobset` directory and run the `create_workloads.sh` script. This script will connect to the cluster and deploy one JobSet from each team at a time until all teams have four JobSets submitted (jobset-0 through jobset-3).

      cd $HOME/ai-on-gke/gke-batch-refarch/06_jobset && \
      ./create_workloads.sh

  Expected output:

      jobset.jobset.x-k8s.io/team-a-jobset-0 created
      jobset.jobset.x-k8s.io/team-b-jobset-0 created
      jobset.jobset.x-k8s.io/team-c-jobset-0 created
      jobset.jobset.x-k8s.io/team-d-jobset-0 created
      ...
      jobset.jobset.x-k8s.io/team-a-jobset-3 created
      jobset.jobset.x-k8s.io/team-b-jobset-3 created
      jobset.jobset.x-k8s.io/team-c-jobset-3 created
      jobset.jobset.x-k8s.io/team-d-jobset-3 created
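For orientation, a JobSet of the kind submitted here could look roughly like the sketch below. The JobSet name and the replicated Job name `worker` are consistent with the Job names that appear in this guide's output (for example `team-a-jobset-0-worker-0`); the LocalQueue name, the container image and the GPU count per Pod are assumptions, and the repository's `06_jobset` manifests are the authoritative version.

```shell
# Write a hypothetical JobSet manifest (illustrative only).
cat > jobset-sketch.yaml <<'EOF'
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: team-a-jobset-0
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: lq-team-a-lp   # hypothetical LocalQueue name
spec:
  replicatedJobs:
  - name: worker             # child Jobs are named <jobset>-worker-<index>
    replicas: 1
    template:
      spec:
        completionMode: Indexed
        completions: 2
        parallelism: 2       # two worker Pods run together
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: example.com/pytorch-mnist:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 2                     # assumed GPU count
EOF

# Submit once JobSet and Kueue are installed:
#   kubectl apply -f jobset-sketch.yaml
echo "wrote jobset-sketch.yaml"
```

Unlike the plain Job examples, no Service is declared here, since JobSet creates the headless Service for the workers itself.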
  a. Return to the terminal tab watching the `clusterqueues`. You should see the submitted JobSet workloads pending in, and being admitted to, the low priority ClusterQueues for all four teams.

  Expected output:

      NAME                COHORT     STRATEGY     PENDING WORKLOADS   ADMITTED WORKLOADS
      cq-team-a-compact   team-a-b   StrictFIFO   2                   1
      cq-team-a-hp        team-a-b   StrictFIFO   0                   0
      cq-team-a-lp        team-a-b   StrictFIFO   0                   4
      cq-team-b-compact   team-a-b   StrictFIFO   2                   0
      cq-team-b-hp        team-a-b   StrictFIFO   0                   0
      cq-team-b-lp        team-a-b   StrictFIFO   0                   4
      cq-team-c-compact   team-c-d   StrictFIFO   0                   1
      cq-team-c-hp        team-c-d   StrictFIFO   0                   1
      cq-team-c-lp        team-c-d   StrictFIFO   0                   4
      cq-team-d-compact   team-c-d   StrictFIFO   2                   1
      cq-team-d-hp        team-c-d   StrictFIFO   0                   0
      cq-team-d-lp        team-c-d   StrictFIFO   0                   4
  b. At this point all the workloads have been submitted to the batch platform and will continue to be processed in the order decided by Kueue.

  c. Head over to the GPU monitoring dashboard you opened earlier. You should see charts displaying useful GPU utilization data from the batch platform, for example:

  - Number and type of GPUs in the cluster over time
  - Distribution (heatmap) of GPU utilization
  - Distribution (heatmap) of GPU memory utilization
  d. Return to the terminal tab watching Jobs across all namespaces. After a while you should see all Jobs completed.

  Expected output:

      NAMESPACE   NAME                         COMPLETIONS   DURATION   AGE
      team-a      team-a-compact-job-0         2/2           7m22s      42m
      team-a      team-a-compact-job-1         2/2           11m        42m
      team-a      team-a-compact-job-2         2/2           7m52s      42m
      team-a      team-a-compact-job-3         2/2           7m15s      42m
      team-a      team-a-high-priority-job-0   2/2           8m37s      53m
      team-a      team-a-high-priority-job-1   2/2           15m        53m
      team-a      team-a-high-priority-job-2   2/2           4m38s      53m
      team-a      team-a-high-priority-job-3   2/2           4m38s      53m
      team-a      team-a-jobset-0-worker-0     2/2           18m        24m
      team-a      team-a-jobset-1-worker-0     2/2           5m4s       26m
      team-a      team-a-jobset-2-worker-0     2/2           91s        22m
      team-a      team-a-jobset-3-worker-0     2/2           5m         26m
      team-a      team-a-low-priority-job-0    2/2           3m31s      58m
      team-a      team-a-low-priority-job-1    2/2           11m        58m
      team-a      team-a-low-priority-job-2    2/2           10m        58m
      team-a      team-a-low-priority-job-3    2/2           10m        57m
      team-b      team-b-compact-job-0         2/2           8m56s      42m
      team-b      team-b-compact-job-1         2/2           7m28s      42m
      team-b      team-b-compact-job-2         2/2           7m16s      42m
      team-b      team-b-compact-job-3         2/2           8m3s       42m
      team-b      team-b-high-priority-job-0   2/2           9m         53m
      team-b      team-b-high-priority-job-1   2/2           7m34s      53m
      team-b      team-b-high-priority-job-2   2/2           13m        53m
      team-b      team-b-high-priority-job-3   2/2           4m52s      53m
      team-b      team-b-jobset-0-worker-0     2/2           5m25s      26m
      team-b      team-b-jobset-1-worker-0     2/2           3m37s      4m21s
      team-b      team-b-jobset-2-worker-0     2/2           5m2s       26m
      team-b      team-b-jobset-3-worker-0     2/2           5m10s      26m
      team-b      team-b-low-priority-job-0    2/2           3m40s      58m
      team-b      team-b-low-priority-job-1    2/2           11m        58m
      team-b      team-b-low-priority-job-2    2/2           12m        58m
      team-b      team-b-low-priority-job-3    2/2           10m        57m
      team-c      team-c-compact-job-0         2/2           8m39s      42m
      team-c      team-c-compact-job-1         2/2           7m6s       42m
      team-c      team-c-compact-job-2         2/2           7m12s      42m
      team-c      team-c-compact-job-3         2/2           7m11s      42m
      team-c      team-c-high-priority-job-0   2/2           28m        53m
      team-c      team-c-high-priority-job-1   2/2           11m        53m
      team-c      team-c-high-priority-job-2   2/2           3m36s      53m
      team-c      team-c-high-priority-job-3   2/2           3m30s      53m
      team-c      team-c-jobset-0-worker-0     2/2           3m37s      11m
      team-c      team-c-jobset-1-worker-0     2/2           3m33s      16m
      team-c      team-c-jobset-2-worker-0     2/2           5m2s       26m
      team-c      team-c-jobset-3-worker-0     2/2           3m38s      18m
      team-c      team-c-low-priority-job-0    2/2           2m46s      58m
      team-c      team-c-low-priority-job-1    2/2           11m        58m
      team-c      team-c-low-priority-job-2    2/2           9m50s      58m
      team-c      team-c-low-priority-job-3    2/2           10m        57m
      team-d      team-d-compact-job-0         2/2           9m50s      42m
      team-d      team-d-compact-job-1         2/2           7m34s      42m
      team-d      team-d-compact-job-2         2/2           7m42s      42m
      team-d      team-d-compact-job-3         2/2           7m13s      42m
      team-d      team-d-high-priority-job-0   2/2           10m        53m
      team-d      team-d-high-priority-job-1   2/2           27m        53m
      team-d      team-d-high-priority-job-2   2/2           6m16s      53m
      team-d      team-d-high-priority-job-3   2/2           6m22s      53m
      team-d      team-d-jobset-0-worker-0     2/2           5m3s       26m
      team-d      team-d-jobset-1-worker-0     2/2           97s        22m
      team-d      team-d-jobset-2-worker-0     2/2           93s        22m
      team-d      team-d-jobset-3-worker-0     2/2           16m        22m
      team-d      team-d-low-priority-job-0    2/2           2m57s      58m
      team-d      team-d-low-priority-job-1    2/2           10m        58m
      team-d      team-d-low-priority-job-2    2/2           19m        58m
      team-d      team-d-low-priority-job-3    2/2           17m        57m
  e. Return to the terminal tab watching the `clusterqueues`. You should see no workloads pending or admitted.

  Expected output:

      NAME                COHORT     STRATEGY     PENDING WORKLOADS   ADMITTED WORKLOADS
      cq-team-a-compact   team-a-b   StrictFIFO   0                   0
      cq-team-a-hp        team-a-b   StrictFIFO   0                   0
      cq-team-a-lp        team-a-b   StrictFIFO   0                   0
      cq-team-b-compact   team-a-b   StrictFIFO   0                   0
      cq-team-b-hp        team-a-b   StrictFIFO   0                   0
      cq-team-b-lp        team-a-b   StrictFIFO   0                   0
      cq-team-c-compact   team-c-d   StrictFIFO   0                   0
      cq-team-c-hp        team-c-d   StrictFIFO   0                   0
      cq-team-c-lp        team-c-d   StrictFIFO   0                   0
      cq-team-d-compact   team-c-d   StrictFIFO   0                   0
      cq-team-d-hp        team-c-d   StrictFIFO   0                   0
      cq-team-d-lp        team-c-d   StrictFIFO   0                   0
  f. Return to the terminal tab watching the nodes. You should see GKE shrink the cluster back down to the initial state of ten nodes: six in the default node pool and four in the reserved node pool.

  Expected output:

      NAME                                               STATUS   ROLES    AGE     VERSION
      gke-gke-batch-refarch-default-pool-8bba21a2-3328   Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-8bba21a2-bmzg   Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-8k19   Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-9899b2fc-srxf   Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-gn5j   Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-default-pool-ab9bedc3-wht3   Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-djvf    Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p2w7    Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-p42h    Ready    <none>   13h     v1.28.3-gke.1203001
      gke-gke-batch-refarch-reserved-np-866c1d22-r6rt    Ready    <none>   13h     v1.28.3-gke.1203001
- The easiest way to prevent continued billing for the resources that you created for this tutorial is to delete the project you created for the tutorial. Run the following commands from Cloud Shell:

      gcloud config unset project && \
      echo y | gcloud projects delete $PROJECT_ID
- If the project needs to be left intact, another option is to destroy the infrastructure created for this tutorial using Cloud Build. Note that this does not destroy the Cloud Storage bucket containing the Terraform state, the Artifact Registry repository used to host container images, or the IAM bindings and service enablement created via the `create-platform.sh` script.

      cd $HOME/ai-on-gke/gke-batch-refarch && \
      ./destroy-platform.sh