"Could not acquire rootfs lock" when using enroot with pyxis on a large cluster #154

Open
vishwakaria opened this issue Feb 26, 2023 · 1 comment

Comments

vishwakaria commented Feb 26, 2023

I have set up enroot with a docker image and am trying to launch a parallel job with pyxis on a cluster with 128 AWS EC2 t3.2xlarge nodes. Beyond 64 nodes, I see this error while starting the container:

 58: slurmstepd: error: pyxis: container start failed with error code: 1
 58: slurmstepd: error: pyxis: printing enroot log file:
 58: slurmstepd: error: pyxis: [ERROR] Could not acquire rootfs lock
 58: slurmstepd: error: pyxis: couldn't start container
 58: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
 58: slurmstepd: error: Failed to invoke spank plugin stack

From the code, it seems like flock times out because many nodes are trying to access the shared file system at the same time.
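For readers unfamiliar with the failure mode: a bounded `flock` wait fails whenever the current holder keeps the lock longer than the waiter's timeout. The sketch below reproduces that on a single machine with the util-linux `flock` command. It is not enroot's code; the scratch lock file and the 3-second and 1-second durations are assumptions chosen only for illustration.

```shell
#!/bin/sh
# Illustrative only: a scratch lock file, not enroot's real lock path.
lockfile="$(mktemp)"

# Holder: take an exclusive lock and keep it for 3 seconds in the background.
flock -x "$lockfile" sleep 3 &
holder=$!
sleep 1   # give the holder time to acquire the lock

# Waiter: give up after 1 second, which is shorter than the holder's
# remaining hold time, so the bounded wait is expected to fail.
if flock -x -w 1 "$lockfile" true; then
    result="acquired"
else
    result="timed out"
fi
echo "waiter: $result"

wait "$holder"
rm -f "$lockfile"
```

Raising the `-w` value makes the waiter succeed here, which mirrors the workaround described next, but it does not shorten how long the lock is held.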

I was temporarily able to get past this error by increasing the timeout on my cluster to 400s for a 120-node job.

I'm reporting this to check whether increasing the timeout is the right long-term fix, or whether you have other guidance on how to resolve it.

flx42 (Member) commented Feb 28, 2023

How exactly are you configuring enroot.conf? It looks like one of the enroot paths points at the shared filesystem. We usually recommend against storing the extracted rootfs on a shared filesystem; use local storage (or even a tmpfs) for enroot instead.
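As a rough sketch of that recommendation, an enroot.conf along the following lines keeps enroot's working directories on each node's own storage. The setting names are enroot's documented ones, and `ENROOT_DATA_PATH` is the one that holds extracted root filesystems. The directory choices (`/run`, `/tmp`) are assumptions and should be replaced with whatever node-local disk or tmpfs the cluster actually provides.

```text
# /etc/enroot/enroot.conf (sketch; the directories below are assumed to be node-local)
ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
ENROOT_CACHE_PATH   /tmp/enroot-cache/user-$(id -u)
ENROOT_DATA_PATH    /tmp/enroot-data/user-$(id -u)
```

With per-node paths, each node locks its own rootfs directory, so nodes no longer wait on a lock file that lives on the shared filesystem.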

If you need to share common container images on a shared filesystem, you can use enroot import or --container-save to save a squashfs file there; each node will then copy the image to its local storage.
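A minimal sketch of that workflow follows. The image URI, the shared path, and the node count are hypothetical placeholders, and the small `run` wrapper is not part of enroot or pyxis: it only prints each command by default so the sketch can be checked on a machine without enroot or Slurm. Set `DRY_RUN=0` to execute for real.

```shell
#!/bin/sh
# Hypothetical values for illustration; substitute your own.
IMAGE_URI="docker://ubuntu:22.04"
SHARED_SQSH="/shared/images/ubuntu.sqsh"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One-time step: convert the Docker image into a squashfs file on the
# shared filesystem. (With pyxis, adding --container-save="$SHARED_SQSH"
# to an srun that pulls the image is an alternative way to produce it.)
run enroot import -o "$SHARED_SQSH" "$IMAGE_URI"

# Per job: point pyxis at the squashfs file instead of a registry URI.
run srun --nodes=128 --container-image="$SHARED_SQSH" hostname
```

Only the read-only squashfs file lives on the shared filesystem in this arrangement; where each node places its extracted rootfs is still governed by enroot.conf.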
