Examples of how to distribute deep learning on a High Performance Computer (HPC).
These examples use Ray Train in a static job on an HPC. Ray handles most of the complexity of distributing the work, with minimal changes to your TensorFlow or PyTorch code.
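To give a feel for how little the training code changes, here is a minimal sketch of a Ray Train driver for PyTorch. It assumes a recent Ray 2.x; the model, data, and worker count are illustrative placeholders, not code from the example scripts below.

```python
# Minimal sketch of a Ray Train driver for PyTorch (assumes a recent Ray 2.x).
# The model, data, and worker count are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # An ordinary PyTorch training loop; Ray only needs the model and data
    # loader wrapped so they are moved to the right device and wrapped in
    # DistributedDataParallel / given a distributed sampler automatically.
    model = nn.Linear(10, 1)
    model = ray.train.torch.prepare_model(model)

    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32)
    loader = ray.train.torch.prepare_data_loader(loader)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(config["epochs"]):
        for X, y in loader:
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    # num_workers/use_gpu should match the resources requested in the batch script.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```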
- First, install the Python environments for the required HPC: `install_python_environments.md`.
- Python script examples (a TensorFlow sketch follows this list):
  - TensorFlow
    - MNIST end-to-end: `tensorflow_mnist_example.py`
    - MNIST tuning: `tensorflow_tune_mnist_example.py`
    - Train linear model with Ray Datasets: `tensorflow_linear_dataset_example.py`
  - PyTorch
    - Linear: `pytorch_train_linear_example.py`
    - Fashion MNIST: `pytorch_train_fashion_mnist_example.py`
    - HuggingFace Transformer: `pytorch_transformers_example.py`
    - Tune linear model with Ray Datasets: `pytorch_tune_linear_dataset_example.py`
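The TensorFlow examples follow the same pattern. As a rough sketch (again assuming a recent Ray 2.x, with a placeholder Keras model rather than code from the scripts above), Ray sets up `TF_CONFIG` on each worker so the standard `MultiWorkerMirroredStrategy` handles the distribution:

```python
# Rough TensorFlow counterpart (assumes a recent Ray 2.x); the Keras model
# and data are placeholders, not code from the example scripts.
import numpy as np
import tensorflow as tf

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_func(config):
    # Ray configures TF_CONFIG for each worker, so the standard
    # MultiWorkerMirroredStrategy takes care of the distribution.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")

    X = np.random.rand(256, 10).astype("float32")
    y = np.random.rand(256, 1).astype("float32")
    model.fit(X, y, epochs=config["epochs"], batch_size=32, verbose=0)


trainer = TensorflowTrainer(
    train_func,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()
```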
- Then submit the job to the HPC (choose one and update the Python script within it; a sketch of how the script attaches to the cluster follows this list):
  - ARC4 (SGE)
    - CPU: `ray_train_on_arc4_cpu.bash`
    - GPU: `ray_train_on_arc4_gpu.bash`
  - Bede (SLURM)
    - GPU: `ray_train_on_bede.bash`
  - JADE-2 (SLURM)
    - GPU: ...
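The batch scripts request the nodes and start Ray on them before running the Python script. Assuming the script starts a head node with `ray start --head` (check the `.bash` files above for the exact startup commands), the training script then just attaches to that already-running cluster:

```python
# Sketch of attaching to a Ray cluster the batch script has already started
# (an assumption about the scripts above; check the .bash files for details).
import ray

# address="auto" connects to the head node started by the batch script;
# calling ray.init() without a running cluster would instead start a
# local single-node instance.
ray.init(address="auto")
print(ray.cluster_resources())  # confirms the CPUs/GPUs the job was allocated
```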
It's preferable to use a static job on the HPC. To do this, you could test out different ideas locally in a Jupyter Notebook, then, when ready, convert it to an executable script (`.py`) and move it over. However, it is also possible to use Jupyter Notebooks interactively on the HPC by following the instructions in `jupyter_notebook_to_hpc.md`.
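One way to do that conversion programmatically is with nbconvert's Python API; a minimal sketch (the notebook filename is hypothetical, and running `jupyter nbconvert --to script` from the command line does the same job):

```python
# Minimal sketch: export a notebook to a .py script with nbconvert
# (the filename my_experiment.ipynb is hypothetical).
import nbformat
from nbconvert import PythonExporter

notebook = nbformat.read("my_experiment.ipynb", as_version=4)
source, _ = PythonExporter().from_notebook_node(notebook)

with open("my_experiment.py", "w") as f:
    f.write(source)
```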