Running Hyperparameter Optimization with Ray Tune

Installation

pip install ray "ray[tune]"

Model config considerations

The current Ray Tune implementation uses the standard OCP config. However, a few config settings require additional consideration:

logger: None
is_hpo: True

optim:
  …
  eval_every: (int) number of steps
  checkpoint_every: (int: optional) number of steps

The first two are easily set. The logger is set to None because Ray Tune internally handles the logging.

The eval_every setting is case specific and will likely require some experimentation. It sets how often, in number of steps, the validation set is run. Depending on the OCP model and dataset of interest, training for a single epoch can take a substantial amount of time. However, HPO methods that minimize compute by terminating unpromising trials, such as successive halving, only pay off if train and val metrics are communicated on shorter timescales. Paraphrasing the Ray Tune docs, eval_every should be large enough to avoid overhead but small enough to report progress periodically; an interval on the order of minutes is recommended.
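As a starting point for that experimentation, the minutes-scale recommendation can be turned into a step count from a measured training throughput. The sketch below is only a rough estimate: the function name, the default of five minutes, and the example throughput are hypothetical, and the time spent running validation itself is not accounted for.

```python
def suggest_eval_every(steps_per_second, target_minutes=5.0):
    """Estimate eval_every (in steps) so validation runs about every target_minutes.

    steps_per_second is the training throughput measured on your own
    model, dataset, and hardware. Both arguments must be positive.
    """
    if steps_per_second <= 0 or target_minutes <= 0:
        raise ValueError("steps_per_second and target_minutes must be positive")
    # Steps completed in the target interval, never fewer than one.
    return max(1, round(steps_per_second * target_minutes * 60))


# Hypothetical throughput of 2 steps/s with a 5-minute reporting interval.
print(suggest_eval_every(2.0, target_minutes=5.0))
```

Treat the result as an initial value for optim.eval_every and adjust it after observing how long validation takes.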

The eval_every setting is only available for the force trainer. When using the energy trainer, validation is run and metrics are reported to Ray Tune once per epoch.
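To judge whether per-epoch reporting is frequent enough for the energy trainer, it can help to estimate how long one epoch takes. The sketch below assumes that each step consumes one batch per GPU and that the throughput is measured separately; the function names and all numbers are hypothetical.

```python
import math


def steps_per_epoch(num_samples, batch_size_per_gpu, num_gpus=1):
    """Number of optimizer steps needed to see every training sample once."""
    return math.ceil(num_samples / (batch_size_per_gpu * num_gpus))


def minutes_per_epoch(num_samples, batch_size_per_gpu, steps_per_second, num_gpus=1):
    """Approximate wall time of one epoch, i.e. the reporting interval."""
    steps = steps_per_epoch(num_samples, batch_size_per_gpu, num_gpus)
    return steps / steps_per_second / 60


# Hypothetical dataset of 1000 samples, batch size 8 per GPU, 4 GPUs.
print(steps_per_epoch(1000, 8, num_gpus=4))
```

If the estimated epoch time is much longer than a few minutes, early-stopping schedulers will have little information to act on between reports.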

The checkpoint_every setting determines how frequently, in steps, Ray Tune will write a checkpoint. Checkpointing can add significant overhead for certain HPO methods, so avoid doing it too frequently. By default, no checkpoints are written.
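The settings discussed in this section can also be applied programmatically. The following is a minimal sketch, assuming the config has already been loaded into a plain Python dict with the layout shown above; apply_hpo_settings is a hypothetical helper, not part of OCP or Ray Tune.

```python
import copy


def apply_hpo_settings(config, eval_every, checkpoint_every=None):
    """Return a copy of a config dict with the HPO-related settings applied."""
    cfg = copy.deepcopy(config)  # leave the caller's config untouched
    cfg["logger"] = None         # Ray Tune handles logging internally
    cfg["is_hpo"] = True
    optim = cfg.setdefault("optim", {})
    optim["eval_every"] = int(eval_every)
    # checkpoint_every is optional; omitting it keeps the default (no checkpointing).
    if checkpoint_every is not None:
        optim["checkpoint_every"] = int(checkpoint_every)
    return cfg


# Hypothetical base config and step counts.
base = {"optim": {"batch_size": 8}}
print(apply_hpo_settings(base, eval_every=500, checkpoint_every=5000))
```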

Usage with Slurm

Make the necessary changes to run_tune.py and slurm/submit-ray-cluster.sbatch

Example run_tune.py updates
- choose search and scheduler algorithms and set associated parameters (see Ray Tune docs for details)
- set the resources to use per individual trial

Example slurm/submit-ray-cluster.sbatch updates
- load modules or set conda env
- change the total run time and resources to use

Submit using sbatch slurm/submit-ray-cluster.sbatch
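The run_tune.py updates listed above (scheduler choice and per-trial resources) might look roughly like the following. This is a hedged sketch rather than the contents of run_tune.py: the metric name val_loss, the resource amounts, and the scheduler parameters are placeholders, and trainable stands for whatever training function you pass in. It uses tune.run, the older Ray Tune entry point; newer Ray releases also offer tune.Tuner.

```python
# Hypothetical per-trial resources; match these to your model and nodes.
RESOURCES_PER_TRIAL = {"cpu": 8, "gpu": 1}


def max_concurrent_trials(cluster_resources, per_trial=None):
    """How many trials fit at once, given total cluster resources."""
    per_trial = RESOURCES_PER_TRIAL if per_trial is None else per_trial
    return min(
        cluster_resources.get(name, 0) // amount
        for name, amount in per_trial.items()
        if amount > 0
    )


def launch(trainable, config, num_samples=20):
    """Run an ASHA search over `config` (requires Ray Tune)."""
    # Imported here so the helper above can be used without Ray installed.
    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    scheduler = ASHAScheduler(
        metric="val_loss",   # assumed name of the reported validation metric
        mode="min",
        max_t=100,           # placeholder, in units of the scheduler's time_attr
        grace_period=1,      # placeholder
        reduction_factor=4,  # placeholder
    )
    return tune.run(
        trainable,
        config=config,
        scheduler=scheduler,
        num_samples=num_samples,
        resources_per_trial=RESOURCES_PER_TRIAL,
    )


# Hypothetical allocation of 64 CPUs and 4 GPUs.
print(max_concurrent_trials({"cpu": 64, "gpu": 4}))
```

Comparing max_concurrent_trials against the resources requested in the sbatch script is a quick way to check that the allocation and the per-trial settings are consistent.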

Slurm scripts taken from https://github.com/NERSC/slurm-ray-cluster

For usage with other cluster managers or cloud resources, please refer to the Distributed Ray Docs.

Examples

- Asynchronous Successive Halving: ocp/scripts/hpo/run_tune.py
- Population Based Training: ocp/scripts/hpo/run_tune_pbt.py
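For intuition about the successive halving example, the sketch below computes the steps at which trials are compared ("rungs") and roughly how many trials remain at each one. It is a simplified, synchronous idealization and not Ray's implementation; the asynchronous variant promotes trials as results arrive, and every number here is hypothetical. The practical takeaway is that metrics must be reported at least as often as the first rung, which is why eval_every matters.

```python
def rung_milestones(grace_period, max_t, reduction_factor):
    """Steps at which trials are compared: grace_period * reduction_factor**k."""
    if grace_period <= 0 or reduction_factor < 2:
        raise ValueError("need grace_period > 0 and reduction_factor >= 2")
    milestones = []
    step = grace_period
    while step <= max_t:
        milestones.append(step)
        step *= reduction_factor
    return milestones


def trials_per_rung(num_trials, num_rungs, reduction_factor):
    """Trials reaching each rung if about 1/reduction_factor survive each cut."""
    counts = []
    remaining = num_trials
    for _ in range(num_rungs):
        counts.append(remaining)
        remaining = max(1, remaining // reduction_factor)
    return counts


# Hypothetical settings: first comparison at 2000 steps, budget of 100000 steps.
rungs = rung_milestones(grace_period=2000, max_t=100000, reduction_factor=4)
print(rungs)
print(trials_per_rung(64, len(rungs), reduction_factor=4))
```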

Testing/Debugging Ray Tune

- In run_tune.py, set ray.init(local_mode=True)
- Run python path_to/run_tune.py --mode train --config-yml path_to/config --run_dir path_to_run_dir
