"Could not acquire rootfs lock" when using enroot with pyxis on a large cluster #154

vishwakaria · 2023-02-26T22:44:12Z

I have set up enroot with a docker image and am trying to launch a parallel job with pyxis on a cluster with 128 AWS EC2 t3.2xlarge nodes. Beyond 64 nodes, I see this error while starting the container:

 58: slurmstepd: error: pyxis: container start failed with error code: 1
 58: slurmstepd: error: pyxis: printing enroot log file:
 58: slurmstepd: error: pyxis: [ERROR] Could not acquire rootfs lock
 58: slurmstepd: error: pyxis: couldn't start container
 58: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
 58: slurmstepd: error: Failed to invoke spank plugin stack

From the code, it seems like flock times out because many nodes are trying to access the shared file system at the same time.

I was temporarily able to get past this error by increasing the timeout on my cluster to 400s for a 120-node job.

I'm reporting this to check whether increasing the timeout is the right long-term fix, or whether you have other guidance on how to resolve it.
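To illustrate the failure mode described above, here is a minimal sketch of a lock wait timing out under contention. It uses the util-linux `flock` command on a throwaway temp file; it is not enroot's actual locking code, and the durations are arbitrary assumptions chosen to keep the demo short.

```shell
#!/usr/bin/env bash
# Sketch only: a stand-in lock file, not the enroot rootfs lock.
lockfile="$(mktemp)"

# Holder: take an exclusive lock and keep it for 3 seconds in the background.
flock "$lockfile" sleep 3 &
holder=$!
sleep 0.5   # give the holder time to acquire the lock

# Contender: wait at most 1 second for the same lock, then give up.
if flock -w 1 "$lockfile" true; then
    result="acquired"
else
    result="timed out"
fi
echo "contender: $result"

# Clean up once the holder has released the lock.
wait "$holder"
rm -f "$lockfile"
```

Raising the wait (the `-w` value here) lets the contender succeed eventually, but every waiter is still serialized behind the same lock, which is why a longer timeout only hides the contention.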


flx42 · 2023-02-28T01:31:55Z

How exactly are you configuring enroot.conf? It seems like one of the enroot paths is targeting the shared filesystem. Our recommendation is usually to avoid storing the extracted rootfs on a shared filesystem and to use local storage (or even a tmpfs) for enroot instead.

If you need to share common container images on a shared filesystem you can use enroot import or --container-save to save a squashfs to the shared filesystem, then each node will copy the image to their local storage.
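A rough sketch of that workflow follows. All paths, the image name, and the node count are hypothetical placeholders, and the enroot.conf lines in the comments are only one possible layout, so adapt them to what exists on your nodes.

```shell
# Hypothetical paths and image name throughout; adjust for your cluster.
#
# In enroot.conf on every node, keep per-node state on local disk or tmpfs
# rather than on the shared filesystem, for example:
#   ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
#   ENROOT_CACHE_PATH   /tmp/enroot-cache/user-$(id -u)
#   ENROOT_DATA_PATH    /tmp/enroot-data/user-$(id -u)

# Import the image once and write the squashfs file to shared storage.
enroot import -o /shared/images/myjob.sqsh docker://ubuntu:22.04

# Launch the job from the squashfs file; with the paths above, each node
# creates its container rootfs on local storage instead of the shared one.
srun -N 128 --container-image=/shared/images/myjob.sqsh hostname
```

With this layout, only the read-only squashfs file lives on the shared filesystem, so nodes should no longer compete for a rootfs lock there.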
