patch: new GPU role to support both AMD and NVIDIA drivers
* Created new gpu role that can support installing a variety of GPU drivers
* Added AMD support
* Moved NVIDIA into the new role and added gpu_arch var to support selecting the type of GPU driver to install
drew-viles committed Feb 12, 2024
1 parent d967e60 commit c5cbb4f
Showing 9 changed files with 177 additions and 44 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
 .vscode/
+.idea/
16 changes: 12 additions & 4 deletions images/capi/.ansible-lint-ignore
@@ -62,10 +62,18 @@ ansible/roles/node/tasks/main.yml no-changed-when
 ansible/roles/node/tasks/photon.yml fqcn[action-core]
 ansible/roles/node/tasks/photon.yml fqcn[action]
 ansible/roles/node/tasks/photon.yml risky-file-permissions
-ansible/roles/nvidia/tasks/main.yml fqcn[action-core]
-ansible/roles/nvidia/tasks/main.yml fqcn[action]
-ansible/roles/nvidia/tasks/main.yml ignore-errors
-ansible/roles/nvidia/tasks/main.yml no-changed-when
+ansible/roles/gpu/tasks/main.yml fqcn[action-core]
+ansible/roles/gpu/tasks/main.yml fqcn[action]
+ansible/roles/gpu/tasks/main.yml ignore-errors
+ansible/roles/gpu/tasks/main.yml no-changed-when
+ansible/roles/gpu/tasks/amd.yml fqcn[action-core]
+ansible/roles/gpu/tasks/amd.yml fqcn[action]
+ansible/roles/gpu/tasks/amd.yml ignore-errors
+ansible/roles/gpu/tasks/amd.yml no-changed-when
+ansible/roles/gpu/tasks/nvidia.yml fqcn[action-core]
+ansible/roles/gpu/tasks/nvidia.yml fqcn[action]
+ansible/roles/gpu/tasks/nvidia.yml ignore-errors
+ansible/roles/gpu/tasks/nvidia.yml no-changed-when
 ansible/roles/providers/defaults/main.yml var-naming[no-role-prefix]
 ansible/roles/providers/tasks/aws.yml command-instead-of-shell
 ansible/roles/providers/tasks/aws.yml fqcn[action-core]
58 changes: 58 additions & 0 deletions images/capi/ansible/roles/gpu/README.md
@@ -0,0 +1,58 @@
# GPU driver installation

The GPU drivers must be installed via the `node_custom_roles_pre` option. Otherwise, if a dist-upgrade installs a
new kernel, the driver won't work when the image is booted. This is because the DKMS hook doesn't run when the
driver is installed after the kernel has been installed. To get around this, we install the drivers first.
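
One way to confirm that the DKMS module was registered and built is to query DKMS after the role has run.
The tasks below are a minimal sketch and are not part of this role; the `gpu_dkms_status` variable name is
hypothetical.

```yaml
# Hypothetical post-install check; not part of the gpu role.
- name: List DKMS modules and their build status
  ansible.builtin.command:
    cmd: dkms status
  register: gpu_dkms_status
  # Querying the status does not modify the system.
  changed_when: false

- name: Show the DKMS status output
  ansible.builtin.debug:
    var: gpu_dkms_status.stdout_lines
```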

# NVIDIA vGPU

To install the NVIDIA vGPU driver as part of the image build process, you must have a `.run` file and a `.tok` file
from NVIDIA available from an S3 endpoint.
You then need to reference those files in your Packer file.

_This is because NVIDIA place the vGPU drivers behind a licensing wall which means you can't just use the standard
installation process for them._
_NVIDIA, as of July 2023, no longer support an internal licensing server being hosted by a customer._
_This role currently doesn't support installing the publicly available drivers._

An example of the fields you need is shown below. Make sure to review them and change any values where required.
If the gridd configuration or the licensing `.tok` file is not required, you can omit `gridd_feature_type`
and `nvidia_tok_location` respectively.

```json
{
  "ansible_user_vars": "gpu_vendor=nvidia nvidia_s3_url=https://s3-endpoint nvidia_bucket=nvidia nvidia_bucket_access=ACCESS_KEY nvidia_bucket_secret=SECRET_KEY nvidia_installer_location=NVIDIA-Linux-x86_64-525.85.05-grid.run nvidia_tok_location=client_configuration_token.tok gridd_feature_type=4",
  "node_custom_roles_pre": "gpu"
}
```

The NVIDIA tasks in the `gpu` custom role do not make use of the `load_additional_components->s3` role due to a
conflict that can occur when attempting to also use other aspects of `load_additional_components`.
Because the `gpu` role is loaded as part of `node_custom_roles_pre`, `load_additional_components` could be
called out of order.

Because NVIDIA no longer support customer-hosted licensing servers, they now require a `.tok` file to be available
for licensing via their cloud services.
This file contains sensitive information and is unique to the company or license to which it is issued.
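
A pre-flight check can make a missing variable fail the build early rather than partway through the download.
The task below is a sketch, not part of this role. The variable names come from the example above, but treating
exactly these five as mandatory is an assumption.

```yaml
# Hypothetical pre-flight check; which variables are mandatory is an assumption.
- name: Check that the NVIDIA download variables are set
  ansible.builtin.assert:
    that:
      - nvidia_s3_url is defined
      - nvidia_bucket is defined
      - nvidia_bucket_access is defined
      - nvidia_bucket_secret is defined
      - nvidia_installer_location is defined
    fail_msg: "One or more nvidia_* variables are missing from ansible_user_vars."
  when: gpu_vendor == "nvidia"
```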

# AMD

Installing the AMD GPU driver is much more straightforward due to the public availability of the drivers.

An example of the fields you need is shown below. Make sure to review them and change any values where required.

```json
{
  "ansible_user_vars": "gpu_vendor=amd amd_version=6.0.2 amd_deb_version=6.0.60002-1 gpu_amd_usecase=dkms",
  "node_custom_roles_pre": "gpu"
}
```

_**It is highly recommended you read through
the [AMDGPU_Installer use-cases](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/amdgpu-install.html#use-cases)
first to ensure you supply the correct one.**_

_**For example, the `rocm` use case installs more than 24GB of libraries in addition to
the driver, so your disk size must be large enough to accommodate this.**_
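
If you select a large use case such as `rocm`, a pre-flight check can stop the build early when the disk is too
small. The task below is a sketch only and is not part of this role. The 30 GiB threshold and the choice of `/`
as the mount to check are assumptions, and the task relies on gathered facts (`ansible_mounts`).
`gpu_amd_usecase` is the variable read by the role's tasks.

```yaml
# Hypothetical pre-flight check; the threshold and mount point are assumptions.
- name: Check free space before installing the rocm use case
  ansible.builtin.assert:
    that:
      - "(ansible_mounts | selectattr('mount', 'equalto', '/') | map(attribute='size_available') | first) > 30 * 1024 ** 3"
    fail_msg: "Less than 30 GiB free on /; the rocm use case may not fit."
  when: "'rocm' in gpu_amd_usecase"
```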
17 changes: 17 additions & 0 deletions images/capi/ansible/roles/gpu/defaults/main.yml
@@ -0,0 +1,17 @@
# Copyright 2024 The Kubernetes Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

gpu_amd_usecase: dkms
53 changes: 53 additions & 0 deletions images/capi/ansible/roles/gpu/tasks/amd.yml
@@ -0,0 +1,53 @@
# Copyright 2024 The Kubernetes Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

- name: Add the root user to the render and video groups
  ansible.builtin.user:
    name: root
    groups: render,video
    append: true
  when: ansible_os_family == "Debian"

- name: Install the .deb for AMDGPU-Install
  ansible.builtin.apt:
    deb: "https://repo.radeon.com/amdgpu-install/{{ amd_version }}/ubuntu/jammy/amdgpu-install_{{ amd_deb_version }}_all.deb"
  when: ansible_os_family == "Debian"

- name: Perform a cache update
  ansible.builtin.apt:
    force_apt_get: true
    update_cache: true
  register: apt_lock_status
  until: apt_lock_status is not failed
  retries: 5
  delay: 10
  when: ansible_os_family == "Debian"

- name: Install packages required for AMD driver installation
  become: true
  ansible.builtin.apt:
    pkg:
      - "linux-headers-{{ ansible_kernel }}"
      - "linux-modules-extra-{{ ansible_kernel }}"
      - build-essential
      - dkms
      - rocminfo
      - clinfo
  when: ansible_os_family == "Debian"

- name: Run AMDGPU_Install binary with use-cases
  ansible.builtin.command:
    cmd: "amdgpu-install -y --usecase={{ gpu_amd_usecase }}"
29 changes: 29 additions & 0 deletions images/capi/ansible/roles/gpu/tasks/main.yml
@@ -0,0 +1,29 @@
# Copyright 2024 The Kubernetes Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

- name: Unload nouveau
  community.general.modprobe:
    name: nouveau
    state: absent
  ignore_errors: true

- name: Include AMD
  ansible.builtin.include_tasks: amd.yml
  when: gpu_vendor == "amd"

- name: Include NVIDIA
  ansible.builtin.include_tasks: nvidia.yml
  when: gpu_vendor == "nvidia"
@@ -1,4 +1,4 @@
-# Copyright 2023 The Kubernetes Authors.
+# Copyright 2024 The Kubernetes Authors.

 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -13,19 +13,14 @@
 # limitations under the License.

 ---
-- name: Unload nouveau
-  modprobe:
-    name: nouveau
-    state: absent
-  ignore_errors: true

 - name: Add NVIDIA package signing key
   ansible.builtin.apt_key:
     url: https://nvidia.github.io/libnvidia-container/gpgkey
   when: ansible_os_family == "Debian"

 - name: Perform a cache update
-  apt:
+  ansible.builtin.apt:
     force_apt_get: true
     update_cache: true
   register: apt_lock_status
@@ -47,7 +42,7 @@

 - name: Make /etc/nvidia/ClientConfigToken directory
   become: true
-  file:
+  ansible.builtin.file:
     path: /etc/nvidia/ClientConfigToken
     state: directory
     owner: root
@@ -70,7 +65,7 @@
   when: nvidia_tok_location is defined

 - name: Set Permissions of NVIDIA License Token
-  file:
+  ansible.builtin.file:
     path: /etc/nvidia/ClientConfigToken/client_configuration_token.tok
     state: file
     owner: root
@@ -80,7 +75,7 @@

 - name: Create GRIDD licensing config
   become: true
-  template:
+  ansible.builtin.template:
     src: templates/gridd.conf.j2
     dest: /etc/nvidia/gridd.conf
     mode: "0644"
@@ -100,7 +95,7 @@
   delay: 3

 - name: Set Permissions of NVIDIA driver installer file
-  file:
+  ansible.builtin.file:
     path: /tmp/NVIDIA-Linux.run
     state: file
     owner: root
@@ -113,7 +108,7 @@
     cmd: /tmp/NVIDIA-Linux.run -s --dkms --no-cc-version-check

 - name: Remove the NVIDIA driver installer file
-  file:
+  ansible.builtin.file:
     path: /tmp/NVIDIA-Linux.run
     state: absent

28 changes: 0 additions & 28 deletions images/capi/ansible/roles/nvidia/README.md

This file was deleted.
