Severe performance implications of setting global OMP_NUM_THREADS=1 #185
Hello @kftse-ust-hk, this was discussed recently in #174. The same arguments still apply; internally, the feedback I got was that
Hello @flx42, I would like to further bring your attention to the implications of such a default, and please reconsider the issue in this context. There are quite a lot of ML researchers who have never written a line of

What users can feel is that training may be somewhat slower, and it is hard to quantify the difference without a comparison (e.g. to bare metal).

### Affected

Many ML users. We estimated 15% of jobs had been set `OMP_NUM_THREADS=1`.

As we support 3 ways of launching:

### Subtle Resource Wasting

**Clue 1:** According to
I am unable to evaluate how this value of

**Clue 2:** Unable to fully utilize the "cores" users have requested and allocated.

We observe (without rigorous proof) the pytorch dataloader setting a large

### Inconsistency and Confusion

I personally spent 4 hours testing different SLURM configurations and all SLURM parameters / defaults before finally ending up with the conclusion that it is the `OMP_NUM_THREADS=1` set by this hook. And please note it is already 2 months after I first suspected there was a performance issue, having performed a large-scale hands-on model training technical demonstration on both bare metal and containers as part of the system administration team.

And this depends on which base container the image is derived from, not whether pytorch is actually installed.

Not working:

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install torch
```

All below are fine:

```dockerfile
# Working
FROM nvcr.io/nvidia/cuda:12.3.2-devel-ubi8
RUN python3 -m pip install torch

# Working
FROM nvcr.io/nvidia/nvhpc:24.3-devel-cuda_multi-ubuntu22.04
RUN python3 -m pip install torch

# Working, the env is called PYTORCH_BUILD_VERSION
FROM nvcr.io/partners/gridai/pytorch-lightning:v1.4.0
```
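For context, here is a minimal sketch (mine, not from the hook or the report above) of how one might check, inside a running container, whether the hook's default is in effect and what PyTorch ends up using; it assumes Linux and an installed PyTorch:

```python
# Minimal sketch: inspect the effective threading configuration inside the container.
# Assumes Linux and an installed PyTorch; nothing here is part of the enroot hook itself.
import os

import torch

print("OMP_NUM_THREADS  =", os.environ.get("OMP_NUM_THREADS"))   # "1" when the hook has fired
print("CPUs in affinity =", len(os.sched_getaffinity(0)))        # what the cpuset actually allows
print("torch intra-op   =", torch.get_num_threads())             # threads used by CPU kernels
print("torch inter-op   =", torch.get_num_interop_threads())
```

When `OMP_NUM_THREADS=1` is exported, PyTorch typically reports a single intra-op thread even though the affinity shows every CPU that SLURM allocated, which is the kind of hard-to-quantify slowdown described above.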
I'm curious, which data loader? As far as I know, the pytorch data loader is not impacted, and DALI isn't either.
I really don't understand why
This is in fact a hard problem that is not handled fully or correctly by most applications / libraries; you can take a look at https://fosdem.org/2024/schedule/event/fosdem-2024-3033-libamicontained-a-low-level-library-for-reasoning-about-resource-restriction/
If they sum to no more than 100%, it usually means that your job is actually restricted to one CPU core, either through CPU shares or through a CPU set. And if that's the case, then
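As an aside (my own sketch, not something from this thread), one way to tell whether the job is really confined to one core by the cgroup, as opposed to merely having `OMP_NUM_THREADS=1` exported, is to compare the affinity mask and the cgroup CPU quota; the `/sys/fs/cgroup/cpu.max` path assumes cgroup v2 and differs on cgroup v1 systems:

```python
# Rough sketch: is the limit coming from the cgroup/cpuset or just from OMP_NUM_THREADS?
# Assumes Linux with cgroup v2 mounted at /sys/fs/cgroup; adjust for cgroup v1.
import os
from pathlib import Path

print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))   # env-based limit only
print("visible CPUs   :", os.cpu_count())                      # all online CPUs on the host
print("cpuset affinity:", len(os.sched_getaffinity(0)))        # CPUs the scheduler will use

cpu_max = Path("/sys/fs/cgroup/cpu.max")
if cpu_max.exists():
    quota, period = cpu_max.read_text().split()
    if quota != "max":
        # e.g. quota=100000, period=100000 -> roughly 1 CPU worth of bandwidth
        print("cgroup CPU bandwidth:", int(quota) / int(period), "CPUs")
```

If the affinity and quota look unrestricted while `OMP_NUM_THREADS` is 1, the single-threaded behaviour is coming from the environment variable rather than from the job's CPU limits.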
BTW, this hook is not enabled by default; it's up to the administrator to understand the consequences of enabling it and to modify it if needed for their environment.
Unfortunately it is delivered as-is on NVIDIA SuperPOD-grade machines for some unknown reason.
enroot/conf/hooks/extra/50-slurm-pytorch.sh, lines 39 to 42 in 0d85f8d:
These lines are designed to "mimic" `torch.distributed.run`, but consider that `nproc` shows 1 if `OMP_NUM_THREADS=1`, which does not represent how many CPUs can be used. I would argue that:
- The method used above is not a match for the pytorch config; these two code blocks below are very different.
- `OMP_NUM_THREADS=1` is a poor performance trade-off; various better trade-offs can be made, e.g.
- As there is no visible notification even when `torch.distributed.run` is never used, I suggest removing these lines or modifying them to some more sensible defaults following SLURM's resource allocation convention (a sketch follows below).
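To make the suggestion concrete, here is a rough sketch (my own, not the hook's actual code) of a default that follows SLURM's allocation by deriving the thread count from `SLURM_CPUS_PER_TASK`, falling back to the CPU affinity, instead of hard-coding 1:

```python
# Sketch of a SLURM-aware default for OMP_NUM_THREADS (illustrative, not the hook's logic).
# Prefer the CPUs SLURM allocated to the task; fall back to the scheduler affinity.
import os

def default_omp_threads() -> int:
    cpus_per_task = os.environ.get("SLURM_CPUS_PER_TASK")
    if cpus_per_task:
        return max(1, int(cpus_per_task))
    # sched_getaffinity reflects the cpuset that SLURM/cgroups actually granted us.
    return max(1, len(os.sched_getaffinity(0)))

os.environ.setdefault("OMP_NUM_THREADS", str(default_omp_threads()))
print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```

Using `setdefault` keeps any value the user exported explicitly, so such a default would only apply when nothing else has set `OMP_NUM_THREADS`.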