torchelastic: change monitor_interval default to 0.1 (pytorch#124692)
This reduces the default monitor_interval for torchelastic to 0.1s, as testing shows negligible load for common use cases. Even at the extreme of 100k processes, CPU utilization is only 45.4% of a single core.

Torchelastic's monitor_interval only monitors the processes on a single worker, so under typical loads, even for huge jobs, we expect ~8 subprocesses per machine, one per GPU. As an external data point, Python's subprocess wait polls at intervals between 50 usec and 50 ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035).

## Motivation

This setting controls how frequently we poll for failed processes in elastic.

* For some jobs of note we run elastic 3 times per try, so with the default interval of 5 seconds we should save ~15 seconds per retry.
* @kiukchung's use case: the polling delay is annoying in notebooks etc., since it adds to shutdown time when testing things.

## Results

Measured in cores (100% is a single core under full load).

| monitor_interval (s) | nproc-per-node | CPU util (highest observed) |
| -------------------- | -------------- | --------------------------- |
| 1.0 | 10 | 0.2% |
| 0.1 | 1 | 0.4% |
| 0.1 | 10 | 0.4% |
| 0.01 | 10 | 0.9% |
| 0.001 | 10 | 4.0% |
| 0.1 | 100 | 0.5% |
| 0.1 | 1000 | 2.2% |
| 0.1 | 10000 | 15.7% |
| 0.1 | 100000 | 45.4% |

## Methodology

```sh
# run command
$ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py

# wait a few seconds for all processes to start and reach steady state, then run the
# command below; watch for ~30s (about 3 refreshes) and take the highest reading
$ top -b -d 10 -c | rg 'torchrun.*wait'
```

wait.py

```py
import time

time.sleep(10 * 60)
```

Pull Request resolved: pytorch#124692
Approved by: https://github.com/kiukchung, https://github.com/kurman
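Besides the torchrun `--monitor-interval` flag used above, the interval can also be overridden when launching elastic jobs programmatically. The following is a minimal sketch, not part of this PR, assuming the `torch.distributed.launcher.api.LaunchConfig` / `elastic_launch` API surface; the `trainer` function, rendezvous settings, and values are placeholders.

```py
# Sketch only: override monitor_interval when launching an elastic job from
# Python instead of via the torchrun --monitor-interval flag.
import time

from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    # Placeholder workload standing in for a real training loop.
    time.sleep(10 * 60)


# Assumed LaunchConfig fields; rendezvous endpoint/backend are illustrative.
config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=10,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    run_id="example",
    monitor_interval=0.1,  # seconds between polls for failed worker processes
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```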