NVIDIA-DALI Capabilities issue #100

Open
gulzainali98 opened this issue Sep 29, 2021 · 6 comments

@gulzainali98

I am running an enroot container on a Slurm cluster and I am getting the following error:

This is the whole error: https://pastebin.com/96CYv9fs
I am trying to run training for this repo: https://github.com/m-tassano/fastdvdnet

The error mentioned in the Pastebin occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102

The code works fine on my local machine; the error occurs only on the Slurm cluster. I searched a bit and came across NVIDIA/DALI#2229, which describes a similar issue to mine.

After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable all the capabilities. For plain Docker containers, this can be done with the syntax shown here: NVIDIA/nvidia-docker#1128 (comment)

However, I am not sure how to achieve this with our enroot containers.
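For reference, roughly what the Docker-side approach looks like (a sketch of the env-var variant; the exact syntax is in the linked comment, and the image tag here is just an example):

# ask the NVIDIA container runtime to expose the video capability, then check for libnvcuvid
$ docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video \
    nvidia/cuda:11.4.0-base bash -c 'ldconfig -p | grep nvcuvid'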

@flx42
Member

flx42 commented Sep 29, 2021

You can set the environment variable in your Dockerfile, or on the command line:

$ enroot import docker://nvidia/cuda:11.4.0-base
$ enroot create nvidia+cuda+11.4.0-base.sqsh 

$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video enroot start nvidia+cuda+11.4.0-base ldconfig -p | grep nvcuvid
        libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
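For the Dockerfile route, a minimal sketch (assuming you build your own image on top of the CUDA base; the tag and capability list are just examples):

# hypothetical Dockerfile: bake the capabilities into the image so they apply by default
FROM nvidia/cuda:11.4.0-base
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,video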

But if nvidia-smi doesn't work in the container, as you mentioned in NVIDIA/DALI#3390 (comment), then you probably have a different problem.

@gulzainali98
Author

I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D

nvidia-smi is also working correctly, and a basic pipeline executes fine. The error only occurs when the video pipeline is created.

@joernhees

Interestingly, when running enroot via Slurm's pyxis integration, the following does NOT work as one might expect:

NVIDIA_DRIVER_CAPABILITIES=all srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'
#output: 
compute,utility,video

The output here seems to be a default that depends on the image rather than our env var setting. (Using srun's --export=... argument also doesn't work, and probably confuses people because it defaults to passing everything when absent and because of how it parses commas.)

Is it possible that the docker://nvidia/cuda:11.4.0-base image mentioned above has some ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility line in its Dockerfile, which in the case of pyxis overrides the same env var from the current context?

Anyhow, the rather obvious workaround seems to be to just set the env var inside the container:

srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; echo $NVIDIA_DRIVER_CAPABILITIES ; ...'
#output: 
all

Another option seems to be an enroot env var config file, but that's probably overkill and more confusing if other containers need different settings...
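For completeness, a sketch of what that could look like, assuming enroot's per-user environ.d configuration (the exact path and file name depend on your site's enroot setup):

# hypothetical ~/.config/enroot/environ.d/99-nvidia-caps.env
NVIDIA_DRIVER_CAPABILITIES=all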

@joernhees

joernhees commented Sep 30, 2021

Ah, I think I actually found it... it seems to be a scoping issue.

Observe the following 4 calls (all without the mentioned enroot env var config file):

# default
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157794 queued and waiting for resources
srun: job 157794 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
compute,utility

# inside only
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
all

# outside only
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility

# inside and outside
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

Notice how libnvcuvid is only available in the container if the outside env var was set. Also notice how, inside the container, $NVIDIA_DRIVER_CAPABILITIES does not reflect the outside env var!

Let's repeat the same with a PyTorch image:

# default:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157806 queued and waiting for resources
srun: job 157806 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video

# inside only:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

# outside only:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video

# inside and outside:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
glasgow
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

# explicitly setting outside to compute only
$ NVIDIA_DRIVER_CAPABILITIES=compute LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
compute,utility,video

Summarizing, there seem to be two scopes, one outer and one inner, which are dangerously out of sync and probably the cause of the confusion:

  • The outer scope
    • seems to influence the availability/loading of libs inside the container
    • seems to default to the image's (Dockerfile ENV) settings
    • is not necessarily reflected by the $NVIDIA_DRIVER_CAPABILITIES var inside the container!!!
  • The inner scope
    • is not linked nor synced to the outer scope
    • seems to default to the image's (Dockerfile ENV) settings

So if your base image already sets the capabilities right, magic apparently kicks in and you don't need to worry. If your base image doesn't, then things get confusing, and currently I'd suggest explicitly setting the NVIDIA_DRIVER_CAPABILITIES env var to the same value twice, in both the outer and the inner scope, as in the sketch below.
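Concretely, something like this (a sketch following the "inside and outside" example above; the trailing python call is just a placeholder for the actual training command):

# outside: makes enroot/pyxis mount libnvcuvid.so.1 into the container
# inside:  makes $NVIDIA_DRIVER_CAPABILITIES consistent for anything that inspects it
$ NVIDIA_DRIVER_CAPABILITIES=all srun -p V100-16GB --ntasks=1 --gpus-per-task=1 \
    --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh \
    bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; python your_training_script.py'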

@flx42
Member

flx42 commented Sep 30, 2021

Yes, sorry, it's a bit confusing. The environment of srun is passed to enroot so it can influence how the container is started and thus whether libnvcuvid.so.1 is mounted inside the container.
However, the environment variables of srun and the environment variables of the container are then merged, but the container environment variables always take precedence.

@flx42
Member

flx42 commented Sep 30, 2021

This was discussed in NVIDIA/pyxis#26, but I admit that this particular case here is even more confusing than the problems I saw before.

While the mismatch of NVIDIA_DRIVER_CAPABILITIES is confusing, there is no reason to set NVIDIA_DRIVER_CAPABILITIES inside the container.
