Examples of how to distribute deep learning on a High Performance Computer (HPC).
These examples use Ray Train in a static job on an HPC. Ray handles most of the complexity of distributing the work, with minimal changes to your TensorFlow or PyTorch code.
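To give a feel for how little the training code changes, here is a minimal sketch of a Ray Train driver for PyTorch. It assumes a recent Ray 2.x; the model, data, and worker count are illustrative placeholders, not code from the example scripts below.

```python
# Minimal sketch of a Ray Train driver for PyTorch (assumes a recent Ray 2.x).
# The model, data, and worker count are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # An ordinary PyTorch training loop; Ray only needs the model and data
    # loader wrapped so they are moved to the right device and wrapped in
    # DistributedDataParallel / given a distributed sampler automatically.
    model = nn.Linear(10, 1)
    model = ray.train.torch.prepare_model(model)

    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32)
    loader = ray.train.torch.prepare_data_loader(loader)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(config["epochs"]):
        for X, y in loader:
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    # num_workers/use_gpu should match the resources requested in the batch script.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```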
- First, install the Python environments for the required HPC: `install_python_environments.md`.
- Python script examples (a TensorFlow sketch follows this list):
  - TensorFlow
    - MNIST end-to-end: `tensorflow_mnist_example.py`
    - MNIST tuning: `tensorflow_tune_mnist_example.py`
    - Train linear model with Ray Datasets: `tensorflow_linear_dataset_example.py`
  - PyTorch
    - Linear: `pytorch_train_linear_example.py`
    - Fashion MNIST: `pytorch_train_fashion_mnist_example.py`
    - HuggingFace Transformer: `pytorch_transformers_example.py`
    - Tune linear model with Ray Datasets: `pytorch_tune_linear_dataset_example.py`
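The TensorFlow examples follow the same pattern. As a rough sketch (again assuming a recent Ray 2.x, with a placeholder Keras model rather than code from the scripts above), Ray sets up `TF_CONFIG` on each worker so the standard `MultiWorkerMirroredStrategy` handles the distribution:

```python
# Rough TensorFlow counterpart (assumes a recent Ray 2.x); the Keras model
# and data are placeholders, not code from the example scripts.
import numpy as np
import tensorflow as tf

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_func(config):
    # Ray configures TF_CONFIG for each worker, so the standard
    # MultiWorkerMirroredStrategy takes care of the distribution.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")

    X = np.random.rand(256, 10).astype("float32")
    y = np.random.rand(256, 1).astype("float32")
    model.fit(X, y, epochs=config["epochs"], batch_size=32, verbose=0)


trainer = TensorflowTrainer(
    train_func,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()
```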
- Then submit the job to the HPC (choose one and update the Python script within it; a sketch of how the script attaches to the cluster follows this list):
  - ARC4 (SGE)
    - CPU: `ray_train_on_arc4_cpu.bash`
    - GPU: `ray_train_on_arc4_gpu.bash`
  - Bede (SLURM)
    - GPU: `ray_train_on_bede.bash`
  - JADE-2 (SLURM)
    - GPU: ...
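The batch scripts request the nodes and start Ray on them before running the Python script. Assuming the script starts a head node with `ray start --head` (check the `.bash` files above for the exact startup commands), the training script then just attaches to that already-running cluster:

```python
# Sketch of attaching to a Ray cluster the batch script has already started
# (an assumption about the scripts above; check the .bash files for details).
import ray

# address="auto" connects to the head node started by the batch script;
# calling ray.init() without a running cluster would instead start a
# local single-node instance.
ray.init(address="auto")
print(ray.cluster_resources())  # confirms the CPUs/GPUs the job was allocated
```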
It's preferable to use a static job on the HPC. To do this, you could test out different ideas locally in a Jupyter Notebook, then, when ready, convert it to an executable script (`.py`) and move it over. However, it is also possible to use Jupyter Notebooks interactively on the HPC by following the instructions in `jupyter_notebook_to_hpc.md`.
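One way to do that conversion programmatically is with nbconvert's Python API; a minimal sketch (the notebook filename is hypothetical, and running `jupyter nbconvert --to script` from the command line does the same job):

```python
# Minimal sketch: export a notebook to a .py script with nbconvert
# (the filename my_experiment.ipynb is hypothetical).
import nbformat
from nbconvert import PythonExporter

notebook = nbformat.read("my_experiment.ipynb", as_version=4)
source, _ = PythonExporter().from_notebook_node(notebook)

with open("my_experiment.py", "w") as f:
    f.write(source)
```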