When I want to use MPS in Kubernetes, I need to specify --mps-root. #816

Closed
zbk2012 opened this issue Jul 11, 2024 · 5 comments
Labels: lifecycle/stale


zbk2012 commented Jul 11, 2024

####################
logs:
using mps requires --mps-root to be specified.
####################
The contents of the nvidia-device-plugin.yml file are as follows:

...
env:
- name: CONFIG_FILE
  value: "/data/system-yaml/a100-mps.yaml"
...

####################
The contents of the /data/system-yaml/a100-mps.yaml file are as follows:

version: v1
sharing:
mps:
resources:
- name: nvidia.com/gpu
replicas: 2

####################
I have added the following content to the nvidia-device-plugin.yml file:

...
env:
- name: CONFIG_FILE
  value: "/data/system-yaml/a100-mps.yaml"
- name: MPS_ROOT
  value: "/run/nvidia/mps"
...

The container successfully started, but no GPU was found and there is nothing in the /run/nvidia/mps directory.

How should MPS_ROOT be set?
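A minimal sketch for checking the state of the MPS root on the node, assuming shell access to the host (or to a container that has the directory mounted). The function name `mps_root_status` is hypothetical and not part of the device plugin; the default path is the MPS_ROOT value quoted above. An empty directory may simply mean that nothing has written to it yet, so treat the result as a hint only.

```shell
# Hypothetical helper (not part of the device plugin): report whether the
# directory used as MPS_ROOT is missing, empty, or populated.
mps_root_status() {
  mps_root="${1:-/run/nvidia/mps}"
  if [ ! -d "$mps_root" ]; then
    echo "missing"
  elif [ -z "$(ls -A "$mps_root")" ]; then
    echo "empty"
  else
    echo "populated"
  fi
}

# Check the path that was used for MPS_ROOT in the manifest above.
mps_root_status /run/nvidia/mps
```

Including this output alongside the device plugin logs makes it easier to tell whether the path itself is the problem.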

elezar (Member) commented Jul 17, 2024

Hi @zbk2012. From your example, it seems as if your config file is not properly indented. You are probably looking for something like this instead:

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2

This should also be confirmed by your device plugin logs.
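One rough way to check the file on disk, sketched under the assumption that a plain `grep` is enough for this purpose: the function below only tests that `mps:` and `resources:` start with whitespace, so it is not a YAML parser. The name `check_mps_indent` and the temporary file are illustrative only.

```shell
# Rough sanity check (not a YAML parser): succeed only if "sharing:" is at the
# top level and "mps:" and "resources:" are indented beneath it.
check_mps_indent() {
  config_file="$1"
  grep -Eq '^sharing:[[:space:]]*$' "$config_file" &&
    grep -Eq '^[[:space:]]+mps:[[:space:]]*$' "$config_file" &&
    grep -Eq '^[[:space:]]+resources:[[:space:]]*$' "$config_file"
}

# Example input: the indented config from the comment above, in a temp file.
tmp_config="$(mktemp)"
cat > "$tmp_config" << 'EOF'
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2
EOF

if check_mps_indent "$tmp_config"; then
  echo "nested keys are indented"
else
  echo "nested keys are NOT indented; check the file on disk"
fi
```

Running `check_mps_indent /data/system-yaml/a100-mps.yaml` on the node would apply the same check to the file referenced by CONFIG_FILE.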

zbk2012 (Author) commented Jul 17, 2024

> Hi @zbk2012. From your example, it seems as if your config file is not properly indented. You are probably looking for something like instead:
>
> version: v1
> sharing:
>   mps:
>     resources:
>     - name: nvidia.com/gpu
>       replicas: 2
>
> This should also be confirmed by your device plugin logs.

Oh, I'm sorry, the indentation was lost when I copied the contents into this issue. The indentation in the actual config file is correct.

elezar (Member) commented Aug 8, 2024

@zbk2012 could you provide the logs for GFD and the device plugin? For example, I use the following to deploy the plugin:

helm upgrade nvidia -i deployments/helm/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set runtimeClassName=nvidia \
    --set config.name=nvidia-plugin-configs \
    --set nvidiaDriverRoot=/ \
    --set gfd.enabled=true

Where the config is created from:

cat << EOF > dp-mps-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
    - envvar
    deviceIDStrategy: uuid
sharing:
  mps:
    renameByDefault: false
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF

by running:

kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=dp-mps-config.yaml
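To gather the requested logs after a deployment like the one above, something along these lines may help. It is a sketch that assumes `kubectl` is already configured for the cluster and that both the device plugin and GFD pods run in the `nvidia-device-plugin` namespace used in the helm command; the function name `collect_plugin_logs` is hypothetical.

```shell
# Hypothetical helper: print the logs of every pod in the given namespace,
# covering all containers in each pod.
collect_plugin_logs() {
  namespace="${1:-nvidia-device-plugin}"
  for pod in $(kubectl get pods -n "$namespace" -o name); do
    echo "===== $pod ====="
    kubectl logs -n "$namespace" "$pod" --all-containers=true
  done
}

# Usage:
#   collect_plugin_logs nvidia-device-plugin > plugin-logs.txt
```

Iterating over every pod in the namespace avoids guessing at label selectors, at the cost of including logs from any unrelated pods that share the namespace.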


github-actions bot commented Nov 7, 2024

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label Nov 7, 2024

github-actions bot commented Dec 7, 2024

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned Dec 7, 2024