Issue Description

Summary

We are encountering inconsistent GPU resource availability in our Kubernetes (K8s) environment, where we use the NVIDIA GPU Operator and MIG (Multi-Instance GPU) to manage H100 GPUs. Specifically, when allocating multiple MIG instances or non-MIG GPUs to pods, the number of GPUs visible inside the user's container does not consistently match the number requested.
Environment Details
Kubernetes Version: v1.29.9
OS Image: Garden Linux 1443.10
Kernel Version: 6.6.41-amd64
Container Runtime: containerd://1.6.24
GPU Operator Version: v23.9
k8s-device-plugin Version: v0.14.5-ubi8
CUDA Version: 12.2
Driver Version: 535.86.10
Pod Docker Image: kubeflow1.8-jupyter7.1.1-pytorch2.2.1-cuda12.1
Problem Description
MIG Allocation Inconsistency
When allocating 10 mig-3g.40gb MIG instances to a Kubeflow Notebook pod, only a subset of the requested GPUs is visible.
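The pod spec itself is not quoted in the issue, so the following is only a rough sketch of the request (pod and container names are placeholders, and a bare Pod is shown instead of the Kubeflow Notebook resource for brevity):

```yaml
# Illustrative request only; names below are placeholders, not values from the issue.
apiVersion: v1
kind: Pod
metadata:
  name: notebook-mig-test
spec:
  containers:
    - name: notebook
      image: kubeflow1.8-jupyter7.1.1-pytorch2.2.1-cuda12.1   # image tag from the environment details
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 10   # 10 MIG instances requested; only 7 show up in nvidia-smi -L
```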
Observed Behavior:
Only 7 out of the 10 requested mig-3g.40gb GPUs are detected within the pod. The output of nvidia-smi -L inside the pod lists only 7 MIG devices instead of 10.
Non-MIG GPU Allocation Variability
When allocating 4 standard nvidia.com/gpu (80GB) GPUs to a pod, the number of GPUs actually visible inside the container varies unpredictably between 2 and 4.
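The spec for this pod is likewise not shown; as an assumption, its container would simply request the full-GPU resource instead:

```yaml
# Illustrative fragment of the second pod's container spec.
resources:
  limits:
    nvidia.com/gpu: 4   # 4 full 80GB GPUs requested; only 2-4 appear inside the container
```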
Configuration
The k8s-device-plugin configuration is defined in its ConfigMap; the relevant part is the sharing (time-slicing) section.
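The ConfigMap contents are not quoted above, so the block below is only a representative sketch of a time-slicing configuration consistent with the reply below (one physical 3g.40gb advertised as four devices); the metadata names, the config key, and the nvidia.com/gpu entry are assumptions rather than values taken from the issue:

```yaml
# Representative sketch of a k8s-device-plugin time-slicing config.
# Only replicas: 4 for nvidia.com/mig-3g.40gb is implied by the reply below.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # placeholder
  namespace: gpu-operator             # placeholder
data:
  any: |
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/mig-3g.40gb
            replicas: 4               # each physical 3g.40gb instance advertised as 4 devices
          - name: nvidia.com/gpu      # assumption: a similar entry would explain the 2-4
            replicas: 2               # variability observed for the full 80GB GPUs
```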
Steps to Reproduce
1. Allocate 10 mig-3g.40gb MIG instances to a Kubeflow Notebook pod.
2. Inside the pod, list the visible devices with nvidia-smi -L.
3. Allocate 4 nvidia.com/gpu (80GB) GPUs to another pod.
4. Inside that pod, list the visible devices with nvidia-smi -L.

Expected Outcome
For both MIG and non-MIG GPU allocations, the number of GPUs visible within the pod should match the number specified in the resource limits.

Comment from @cdesiniotis:
The issue here is that you have configured time-slicing in your k8s-device-plugin configuration for the nvidia.com/mig-3g.40gb extended resource, so one physical 3g.40gb instance is advertised as 4 independent devices to kubelet. kubelet is the component that actually makes the allocation decision, so it is entirely possible that, when you request 10 nvidia.com/mig-3g.40gb resources for your container, kubelet picks devices which map back to the same underlying MIG instance. If you want to guarantee that your container is allocated 10 different MIG instances, you will need to disable time-slicing for the nvidia.com/mig-3g.40gb extended resource.
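Concretely, the suggested fix is to drop the nvidia.com/mig-3g.40gb entry from the timeSlicing resources list (or remove the sharing block entirely). With replicas: 4, a request for 10 devices can be satisfied by as few as ceil(10/4) = 3 distinct physical instances, so nvidia-smi -L may list anywhere from 3 to 10 unique MIG devices; 7 in this case. A minimal sketch, continuing the illustrative config above:

```yaml
# Sketch: the same plugin config with the timeSlicing entry for nvidia.com/mig-3g.40gb
# removed, so each MIG instance is advertised to kubelet exactly once and a request
# for 10 devices is guaranteed to map to 10 distinct 3g.40gb instances.
version: v1
flags:
  migStrategy: mixed
# no sharing: block, i.e. no time-slicing for any resource
```

Note that the device plugin has to pick up the updated ConfigMap before the nvidia.com/mig-3g.40gb capacity advertised on the node reflects the change.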
@cdesiniotis Thank you for the detailed explanation. We appreciate the guidance and will disable time-slicing for nvidia.com/mig-3g.40gb so that distinct MIG instances are allocated; we will report back on whether this resolves the problem.