How exactly are you configuring enroot.conf? It looks like one of the enroot paths points at the shared filesystem. We usually recommend against storing the extracted rootfs on a shared filesystem; use local storage (or even a tmpfs) for enroot instead.
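As a rough illustration of that recommendation, an enroot.conf that keeps the enroot paths on node-local storage might look like the sketch below. The directory locations are assumptions chosen for the example, not values taken from this cluster; adjust them to whatever local disk or tmpfs the nodes actually provide.

```shell
# enroot.conf (sketch; all directory locations below are assumed, not from this thread)
# Runtime state on a tmpfs-backed path, one directory per user.
ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
# Layer cache on node-local disk rather than the shared filesystem.
ENROOT_CACHE_PATH   /tmp/enroot-cache/user-$(id -u)
# Extracted container rootfs on node-local disk.
ENROOT_DATA_PATH    /tmp/enroot-data/user-$(id -u)
```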
If you need to share common container images on a shared filesystem, you can use enroot import or --container-save to save a squashfs to the shared filesystem; each node will then copy the image to its local storage.
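A minimal sketch of that workflow follows, assuming a hypothetical shared path /shared/images/myimage.sqsh and a placeholder image docker://ubuntu:22.04. The helper names (run, import_image, launch_job) and the DRY_RUN switch are made up for this example so the commands can be inspected without a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: the path and image below are hypothetical placeholders.
SHARED_SQSH="${SHARED_SQSH:-/shared/images/myimage.sqsh}"
IMAGE_URI="${IMAGE_URI:-docker://ubuntu:22.04}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 (run once): import the image into a squashfs file on the shared filesystem.
import_image() {
    run enroot import --output "$SHARED_SQSH" "$IMAGE_URI"
}

# Step 2: launch the job from the squashfs file instead of pulling from the registry.
launch_job() {
    local nodes="$1"
    shift
    run srun --nodes="$nodes" --container-image="$SHARED_SQSH" "$@"
}
```

For example, `DRY_RUN=1 launch_job 128 hostname` prints the srun command that would be submitted. The squashfs file could alternatively be produced from a running job by passing --container-save with the same shared path.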
I have set up enroot with a docker image and am trying to launch a parallel job with pyxis on a cluster with 128 AWS EC2 t3.2xlarge nodes. Beyond 64 nodes, I see this error while starting the container:
From the code, it seems like flock times out because many nodes are trying to access the shared file system at the same time. I was temporarily able to get past this error by increasing the timeout on my cluster to 400s for a 120-node job.
I am reporting this error to check whether increasing the timeout is the right long-term fix, or whether you have other guidance on how to resolve it.