Inconsistent GPU Resource Allocation with MIG and Non-MIG Profiles in Kubernetes #1078

Closed
haitwang-cloud opened this issue Dec 2, 2024 · 3 comments

@haitwang-cloud

Issue Description

Summary

We are encountering inconsistent GPU resource availability in our Kubernetes (K8s) environment, where we use the NVIDIA GPU Operator and MIG (Multi-Instance GPU) technology to manage H100 GPUs. Specifically, when allocating multiple MIG instances or non-MIG GPUs to pods, the expected number of GPUs is not consistently visible inside the user's container.

Environment Details

  • Kubernetes Version: v1.29.9
  • OS Image: Garden Linux 1443.10
  • Kernel Version: 6.6.41-amd64
  • Container Runtime: containerd://1.6.24
  • GPU Operator Version: v23.9
  • k8s-device-plugin Version: v0.14.5-ubi8
  • CUDA Version: 12.2
  • Driver Version: 535.86.10
  • Pod Docker Image: kubeflow1.8-jupyter7.1.1-pytorch2.2.1-cuda12.1

Problem Description

MIG Allocation Inconsistency

When allocating mig-3g.40gb MIG instances to a Kubeflow Notebook pod, only a subset of the requested GPUs is visible:

  • Requested Resources:

    resources:
      limits:
        cpu: '18'
        memory: 180Gi
        nvidia.com/mig-3g.40gb: '10'
  • Observed Behavior:
    Only 7 of the 10 requested mig-3g.40gb GPUs are detected within the pod; the output of nvidia-smi -L inside the pod lists only 7 MIG devices instead of 10 (see the duplicate-UUID check below).
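
A quick way to check whether the devices visible in the pod are actually distinct is to compare the MIG device UUIDs reported by nvidia-smi. This is a minimal sketch and assumes the usual nvidia-smi -L output format, where each MIG device line contains a UUID of the form MIG-<uuid>:

# Inside the pod: extract the MIG device UUIDs and count how often each one appears.
# Any count greater than 1 means that several allocated resources map back to the
# same underlying MIG instance.
nvidia-smi -L | grep -o 'MIG-[^)]*' | sort | uniq -c | sort -rn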

Non-MIG GPU Allocation Variability

When attempting to allocate 4 standard nvidia.com/gpu (80GB) vGPUs to a pod, the actual number of GPUs visible varies unpredictably between 2 and 4.

Configuration

The following is the configuration for the k8s-device-plugin as defined in its ConfigMap:

version: v1
flags:
  failOnInitError: true
  nvidiaDriverRoot: "/run/nvidia/driver/"
  plugin:
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 2
    - name: nvidia.com/mig-1g.10gb
      replicas: 4
    - name: nvidia.com/mig-2g.20gb
      replicas: 4
    - name: nvidia.com/mig-3g.40gb
      replicas: 4

Steps to Reproduce

  1. Allocate 10 mig-3g.40gb GPUs to a Kubeflow Notebook (a minimal pod spec sketch follows this list).
  2. Check the available GPUs within the pod using nvidia-smi -L.
  3. Allocate 4 nvidia.com/gpu (80GB) vGPUs to another pod.
  4. Verify the available GPUs within this pod using nvidia-smi -L.
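
For reference, a minimal pod spec sketch for step 1; the pod name and image are placeholders, not our actual Kubeflow Notebook manifest:

apiVersion: v1
kind: Pod
metadata:
  name: mig-allocation-test    # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8    # placeholder; any CUDA-enabled image works
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-3g.40gb: '10'

The nvidia-smi -L output can then be read from the pod logs with kubectl logs.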

Expected Outcome

For both MIG and non-MIG GPU allocations, the number of GPUs visible within the pod should match the number specified in the resource limits.

Additional Information

  • The cluster consists of three nodes, with one node configured to use NVIDIA MIG for vGPU support.
  • We suspect that the issue may be related to how the device plugin handles GPU resource allocation or reporting.
@cdesiniotis
Contributor

The issue here is that you have configured timeslicing for the nvidia.com/mig-3g.40gb extended resource in your k8s-device-plugin configuration, so each physical 3g.40gb MIG instance is advertised to the kubelet as 4 independent devices. The kubelet is the component that actually makes the allocation decision, so it is entirely possible that when you request 10 nvidia.com/mig-3g.40gb resources for your container, the kubelet picks devices that map back to the same underlying MIG instance. If you want to guarantee that your container is allocated 10 distinct MIG instances, you will need to disable timeslicing for the nvidia.com/mig-3g.40gb extended resource.
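
A minimal sketch of that change, based on the ConfigMap posted above: drop the nvidia.com/mig-3g.40gb entry from the timeSlicing resources so that each advertised nvidia.com/mig-3g.40gb device maps to a distinct MIG instance (the other entries are kept as in the original config; the same reasoning would apply to nvidia.com/gpu if distinct physical GPUs are required):

version: v1
flags:
  failOnInitError: true
  nvidiaDriverRoot: "/run/nvidia/driver/"
  plugin:
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    # nvidia.com/mig-3g.40gb is intentionally omitted here
    - name: nvidia.com/gpu
      replicas: 2
    - name: nvidia.com/mig-1g.10gb
      replicas: 4
    - name: nvidia.com/mig-2g.20gb
      replicas: 4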

@haitwang-cloud
Author

haitwang-cloud commented Dec 3, 2024

@cdesiniotis Thank you for the detailed explanation. We will disable timeslicing for the nvidia.com/mig-3g.40gb resource as suggested and verify whether that resolves the allocation problem.

@haitwang-cloud
Author

Closing this since our issue has been fixed.
