NVIDIA-DALI Capabilities issue #100
You can set the environment variable in your Dockerfile, or on the command line:
$ enroot import docker://nvidia/cuda:11.4.0-base
$ enroot create nvidia+cuda+11.4.0-base.sqsh
$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video enroot start nvidia+cuda+11.4.0-base ldconfig -p | grep nvcuvid
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
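If you would rather bake it into the image, the same variable can be set with an ENV line in the Dockerfile (a minimal sketch, assuming you build and import your own image on top of the CUDA base):
FROM nvidia/cuda:11.4.0-base
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,video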
I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D. nvidia-smi is also working correctly, and the basic pipeline executes. It's only when the video pipeline is created that the error occurs.
Interestingly, when running enroot via Slurm's pyxis integration, the following will NOT work as one would expect:
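A call of roughly this shape, to illustrate what I mean (just a sketch, using the same partition and image as in the experiments further down):
$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'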
The output here seems to be some default depending on the image rather than our env var setting. Is it possible that the docker://nvidia/cuda:11.4.0-base image mentioned above has some default capability setting baked in? Anyhow, the rather obvious workaround seems to be to just set the env var inside the container:
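For example something like this (again only a sketch; train.py stands in for whatever the job actually runs):
$ srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=compute,utility,video ; python train.py'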
Another option seems to be an enroot env var config file, but that's probably overkill and more confusing if other containers need other settings...
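If I read the enroot docs correctly, that would be a drop-in .env file in one of enroot's environ.d configuration directories, e.g. (the exact path is an assumption, check your site's enroot setup):
# /etc/enroot/environ.d/50-nvidia-caps.env
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video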
Ah, I think I actually found it... it seems to be a scoping issue. Observe the following 4 calls (all without the mentioned enroot env var config file):
# default
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157794 queued and waiting for resources
srun: job 157794 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
compute,utility
# inside only
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
all
# outside only
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility
# inside and outside
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all
Notice how libnvcuvid is only available in the container if the outside env var was set. Also notice how, inside the container, $NVIDIA_DRIVER_CAPABILITIES does not reflect the outside env var! Let's repeat the same with a PyTorch image:
# default:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157806 queued and waiting for resources
srun: job 157806 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video
# inside only:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all
# outside only:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video
# inside and outside:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all
# explicitly setting outside to compute only
$ NVIDIA_DRIVER_CAPABILITIES=compute LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
compute,utility,video
Summarizing, there seem to be two scopes, one outer and one inner, both of which are dangerously out of sync, probably causing the confusion: the outer value (what you set on the srun command line) determines which driver libraries get mounted into the container, while the inner value (what $NVIDIA_DRIVER_CAPABILITIES shows inside) comes from the image and is not updated by the outer setting.
So if your base image already set the capabilities right, apparently magic kicks in and you don't need to worry. If your base image doesn't, then things get confusing, and currently I'd suggest explicitly setting the NVIDIA_DRIVER_CAPABILITIES env var outside, on the srun command line.
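For the concrete problem in this issue (DALI's video reader), that means something like the following (a sketch; the image path and script name are placeholders for your own):
$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video srun --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/your_image.sqsh python train.py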
Yes, sorry, it's a bit confusing. The environment of ...
This was discussed in NVIDIA/pyxis#26, but I admit that this particular case here is even more confusing than the problems I saw before. While the mismatch of ...
I am running an enroot container on a Slurm cluster and I am getting the following error:
This is the whole error: https://pastebin.com/96CYv9fs
I am trying to run training for this repo: https://github.com/m-tassano/fastdvdnet
The error mentioned in the Pastebin occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102
The code works fine on my local machine; the error occurs only on the Slurm cluster. I searched a bit and came across this post: NVIDIA/DALI#2229,
which describes a similar issue to mine.
After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable the required driver capabilities. For plain Docker images, this can be done using the syntax described here: NVIDIA/nvidia-docker#1128 (comment)
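If I read that comment correctly, the Docker-side variant boils down to passing the variable through the NVIDIA runtime, roughly (a sketch, the image name is just an example):
$ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video nvcr.io/nvidia/pytorch:21.08-py3 bash -c 'ldconfig -p | grep nvcuvid'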
However, I am not sure how to achieve this with our enroot containers.