GPUs require special drivers and software, which are not pre-installed on Cloud Dataproc clusters by default. This initialization action installs NVIDIA GPU drivers on the master and worker nodes of a Google Cloud Dataproc cluster.
Note: This feature is in Beta mode.
You can use this initialization action to create a new Dataproc cluster with GPU support. It installs the GPU drivers and CUDA. If you need a more recent GPU driver, visit the NVIDIA site.
- Use the gcloud command to create a new cluster with this initialization action. The following command creates a new cluster named <CLUSTER_NAME> and installs GPU drivers.
gcloud beta dataproc clusters create <CLUSTER_NAME> \
    --master-accelerator type=nvidia-tesla-v100 \
    --worker-accelerator type=nvidia-tesla-v100,count=4 \
    --initialization-actions gs://$MY_BUCKET/gpu/install_gpu_driver.sh \
    --metadata install_gpu_agent=false
- Use the gcloud command to create a new cluster with this initialization action. The following command creates a new cluster named <CLUSTER_NAME>, installs GPU drivers, and adds the GPU monitoring service.
gcloud beta dataproc clusters create <CLUSTER_NAME> \
    --master-accelerator type=nvidia-tesla-v100 \
    --worker-accelerator type=nvidia-tesla-v100,count=4 \
    --initialization-actions gs://$MY_BUCKET/gpu/install_gpu_driver.sh \
    --metadata install_gpu_agent=true \
    --scopes https://www.googleapis.com/auth/monitoring.write
- install_gpu_agent: true|false - an optional parameter with a case-sensitive value. Note: When set to true, this parameter enables an agent that collects GPU utilization and sends the statistics to Stackdriver. Make sure you add the correct scope to access Stackdriver.
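When cluster creation is scripted, the two commands above differ only in the install_gpu_agent value and the extra monitoring scope. The following is a minimal Python sketch that assembles the same argument list. The helper name build_create_command and its parameters are hypothetical and not part of this initialization action; the flags themselves are copied from the commands above.

```python
import shlex


def build_create_command(cluster_name, bucket, install_gpu_agent=False,
                         accelerator="nvidia-tesla-v100", worker_gpu_count=4):
    """Return the gcloud argument list for creating a GPU cluster.

    Hypothetical helper for illustration only; the flags mirror the
    commands shown in this README.
    """
    # The metadata value is case sensitive, so always emit lowercase.
    agent_value = "true" if install_gpu_agent else "false"
    args = [
        "gcloud", "beta", "dataproc", "clusters", "create", cluster_name,
        "--master-accelerator", f"type={accelerator}",
        "--worker-accelerator", f"type={accelerator},count={worker_gpu_count}",
        "--initialization-actions", f"gs://{bucket}/gpu/install_gpu_driver.sh",
        "--metadata", f"install_gpu_agent={agent_value}",
    ]
    if install_gpu_agent:
        # The agent needs permission to write metrics to Stackdriver.
        args += ["--scopes", "https://www.googleapis.com/auth/monitoring.write"]
    return args


# Print a copy-pasteable command; "my-cluster" and "my-bucket" are placeholders.
print(shlex.join(build_create_command("my-cluster", "my-bucket",
                                      install_gpu_agent=True)))
```

Building the command as a list rather than a single string avoids quoting problems if the list is later passed to subprocess.run.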
- Once the cluster has been created, you can access the Dataproc cluster and verify that the NVIDIA drivers were installed successfully:
sudo nvidia-smi
- If you installed the GPU collection service, verify the installation by using the following command:
sudo systemctl status gpu_utilization_agent.service
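The driver check can also be automated, for example from a job that should fail fast on nodes without a working GPU. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; the function names parse_gpu_list and detect_gpus are hypothetical.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse `name, driver_version` CSV rows into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the GPU name is tolerated.
        name, driver_version = line.rsplit(",", 1)
        gpus.append({"name": name.strip(),
                     "driver_version": driver_version.strip()})
    return gpus


def detect_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if the driver is unusable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # nvidia-smi is missing or cannot talk to the driver.
        return []
    return parse_gpu_list(result.stdout)
```

Calling detect_gpus() on a node returns an empty list when the driver is not installed, which a caller can treat as a failed verification.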
For more information about GPU support, take a look at the Dataproc documentation.
The initialization action installs a monitoring agent that monitors the GPU usage on the instance. The agent automatically creates the GPU metrics. To install its dependencies and start reporting manually in the background:
pip3 install -r ./requirements.txt
python3 report_gpu_metrics.py &
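To illustrate what such an agent samples, the sketch below derives two percentages per GPU from nvidia-smi output. It is not taken from report_gpu_metrics.py: the function name parse_gpu_samples is hypothetical, and computing memory utilization as used memory divided by total memory is an assumption about how the metric is defined.

```python
# Expected input is the output of:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total
#              --format=csv,noheader,nounits
# (one row per GPU, for example "35, 4096, 16384").


def parse_gpu_samples(csv_text):
    """Turn nvidia-smi CSV rows into per-GPU utilization percentages."""
    samples = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        util, mem_used, mem_total = (float(field) for field in line.split(","))
        samples.append({
            # Percentage of time the GPU was busy, as reported by nvidia-smi.
            "gpu_utilization": util,
            # Assumption: memory utilization is used / total, in percent.
            "gpu_memory_utilization": 100.0 * mem_used / mem_total,
        })
    return samples
```

The dictionary keys reuse the metric names gpu_utilization and gpu_memory_utilization so that each sample maps directly onto one custom metric.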
If you need to create the metrics using create_metric_descriptor, first run the following commands:
pip3 install -r ./requirements.txt
python3 create_gpu_metrics.py
Example:
Created projects/project-sample/metricDescriptors/custom.googleapis.com/gpu_utilization.
Created projects/project-sample/metricDescriptors/custom.googleapis.com/gpu_memory_utilization.
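The resource names in the example output follow the pattern projects/<project>/metricDescriptors/custom.googleapis.com/<metric>. The sketch below builds a descriptor body with that naming, using field names from the Cloud Monitoring REST API. The helper name metric_descriptor is hypothetical, and the metric kind and value type are assumptions; create_gpu_metrics.py may use different settings.

```python
def metric_descriptor(project_id, metric_name, description):
    """Build a custom metric descriptor body as a plain dict.

    Hypothetical helper for illustration; it does not call any API.
    """
    metric_type = f"custom.googleapis.com/{metric_name}"
    return {
        "name": f"projects/{project_id}/metricDescriptors/{metric_type}",
        "type": metric_type,
        "metricKind": "GAUGE",   # assumption: a point-in-time measurement
        "valueType": "INT64",    # assumption: whole-number percentages
        "description": description,
    }


descriptors = [
    metric_descriptor("project-sample", "gpu_utilization",
                      "GPU utilization percentage"),
    metric_descriptor("project-sample", "gpu_memory_utilization",
                      "GPU memory utilization percentage"),
]
for descriptor in descriptors:
    print(descriptor["name"])
```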
Problem: Error when running report_gpu_metrics
google.api_core.exceptions.InvalidArgument: 400 One or more TimeSeries could not be written:
One or more points were written more frequently than the maximum sampling period configured for the metric.
:timeSeries[0]
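Solution: Check whether the service is already running in the background. If it is, a manually started copy of report_gpu_metrics.py can write points for the same metric too frequently.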
sudo systemctl status gpu_utilization_agent.service
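Within a single reporting process, the same error can be avoided by enforcing a minimum interval between writes. The sketch below shows one way to do that. The class name MinIntervalGate is hypothetical, and the interval to use depends on the sampling period configured for the metric, which this README does not state.

```python
import time


class MinIntervalGate:
    """Allow an action at most once per `min_interval_seconds`."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self._min_interval = min_interval_seconds
        self._clock = clock          # injectable so the gate can be tested
        self._last_write = None

    def ready(self):
        """Return True and record the time if enough time has passed."""
        now = self._clock()
        if self._last_write is None or now - self._last_write >= self._min_interval:
            self._last_write = now
            return True
        return False
```

A reporting loop would call gate.ready() before each write and skip the write when it returns False. This guards only one process; two processes reporting the same metric can still collide, which is why checking the service status matters.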
- This initialization script installs NVIDIA GPU drivers on all nodes on which a GPU is detected.
- This initialization script uses Debian packages to install the CUDA driver.
- Tested with Dataproc 1.2+.