[Example] PyTorch distributed training with minGPT #4464

Michaelvll · 2024-12-12T01:18:26Z

This PR adds a more modern distributed training example.

TODOs:

- Update the examples in our doc with this example

Tested (run the relevant ones):

- Code formatting: `bash format.sh`
- Any manual or new tests for this PR (please specify below)
- All smoke tests: `pytest tests/test_smoke.py`
- Relevant individual smoke tests: `pytest tests/test_smoke.py::test_fill_in_the_name`
- Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh`

romilbhardwaj

Awesome, thanks @Michaelvll! Left some minor nit comments

romilbhardwaj · 2024-12-12T04:53:49Z

examples/distributed-pytorch/README.md

+### Using normal `torchrun`
+
+
+The following command spawn 2 nodes with 2 L4 GPU each. 


Suggested change

The following command spawn 2 nodes with 2 L4 GPU each.

The following command will spawn 2 nodes with 2 L4 GPUs each:

romilbhardwaj · 2024-12-12T04:54:05Z

examples/distributed-pytorch/README.md

+
+The main difference between the two for fixed-size distributed training is that `rdvz` backend automatically handles the rank for each node, while `torchrun` requires the rank to be set manually.
+
+SkyPilot offers easy built-in environment variables to help you start distributed training easily.


nit

Suggested change

SkyPilot offers easy built-in environment variables to help you start distributed training easily.

SkyPilot offers convenient built-in environment variables to help you start distributed training easily.

romilbhardwaj · 2024-12-12T04:55:02Z

examples/distributed-pytorch/README.md

+
+The following command spawn 2 nodes with 2 L4 GPU each. 
+
+`sky launch -c train.yaml`


Missing cluster name? Also might be nice to put in a code block

Suggested change

`sky launch -c train.yaml`

```
sky launch -c train train.yaml
```

romilbhardwaj · 2024-12-12T05:25:46Z

examples/distributed-pytorch/README.md

+
+`sky launch -c train.yaml`
+
+In the [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.


Suggested change

In the [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.

In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.
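For readers following the thread, here is a minimal sketch of how the `torchrun` arguments could be derived from SkyPilot's environment variables. It assumes the documented variables `SKYPILOT_NODE_IPS`, `SKYPILOT_NUM_NODES`, `SKYPILOT_NUM_GPUS_PER_NODE`, and `SKYPILOT_NODE_RANK`. The script name `main.py`, the port `8008`, and the helper `build_torchrun_args` are hypothetical and may not match what `train.yaml` in this PR actually does.

```python
def build_torchrun_args(env, script="main.py", port=8008):
    """Build a static (non-rdzv) torchrun command from SkyPilot variables.

    `script` and `port` are illustrative assumptions, not values from the PR.
    """
    # SKYPILOT_NODE_IPS holds one IP per line; the first entry is the head node.
    head_ip = env["SKYPILOT_NODE_IPS"].splitlines()[0].strip()
    return [
        "torchrun",
        f"--nnodes={env['SKYPILOT_NUM_NODES']}",
        f"--nproc_per_node={env['SKYPILOT_NUM_GPUS_PER_NODE']}",
        # With static configuration, each node must pass its own rank.
        f"--node_rank={env['SKYPILOT_NODE_RANK']}",
        f"--master_addr={head_ip}",
        f"--master_port={port}",
        script,
    ]


if __name__ == "__main__":
    # Hypothetical values for the second node of a 2-node, 2-GPU-per-node cluster.
    example_env = {
        "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
        "SKYPILOT_NUM_NODES": "2",
        "SKYPILOT_NUM_GPUS_PER_NODE": "2",
        "SKYPILOT_NODE_RANK": "1",
    }
    print(" ".join(build_torchrun_args(example_env)))
```

In a real task you would pass `os.environ` instead of the example dictionary. The point of the sketch is only that every node can compute the same command from variables SkyPilot already sets.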

romilbhardwaj · 2024-12-12T05:26:09Z

examples/distributed-pytorch/README.md

+`rdvz` is an alternative backend for distributed training:
+
+```
+sky launch -c train-rdzv.yaml


Suggested change

sky launch -c train-rdzv.yaml

sky launch -c train-rdzv train-rdzv.yaml

romilbhardwaj · 2024-12-12T05:26:54Z

examples/distributed-pytorch/README.md

+
+
+
+### Using `rdvz` backend


Suggested change

### Using `rdvz` backend

### Using `rdzv` backend

romilbhardwaj · 2024-12-12T05:27:03Z

examples/distributed-pytorch/README.md

+
+### Using `rdvz` backend
+
+`rdvz` is an alternative backend for distributed training:


Suggested change

`rdvz` is an alternative backend for distributed training:

`rdzv` is an alternative backend for distributed training:

romilbhardwaj · 2024-12-12T05:27:38Z

examples/distributed-pytorch/README.md

+sky launch -c train-rdzv.yaml
+```
+
+In the [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.


Suggested change

In the [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.

In [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.
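As a point of comparison, a rough sketch of the `rdzv` variant follows. It assumes the `c10d` rendezvous backend and reuses the head node's IP from `SKYPILOT_NODE_IPS` as the rendezvous endpoint. The port `29500`, the job id `mingpt`, the script name `main.py`, and the helper `build_rdzv_args` are hypothetical, and the actual `train-rdzv.yaml` may differ.

```python
def build_rdzv_args(env, script="main.py", port=29500, job_id="mingpt"):
    """Build a torchrun command that uses the c10d rendezvous backend.

    `script`, `port`, and `job_id` are illustrative assumptions.
    """
    # The first line of SKYPILOT_NODE_IPS is the head node, used as the endpoint.
    head_ip = env["SKYPILOT_NODE_IPS"].splitlines()[0].strip()
    return [
        "torchrun",
        f"--nnodes={env['SKYPILOT_NUM_NODES']}",
        f"--nproc_per_node={env['SKYPILOT_NUM_GPUS_PER_NODE']}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={head_ip}:{port}",
        # All nodes of one job must share the same rendezvous id.
        f"--rdzv_id={job_id}",
        script,
    ]


if __name__ == "__main__":
    # Hypothetical 2-node cluster; note that no node rank is needed here.
    example_env = {
        "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
        "SKYPILOT_NUM_NODES": "2",
        "SKYPILOT_NUM_GPUS_PER_NODE": "2",
    }
    print(" ".join(build_rdzv_args(example_env)))
```

Unlike the static setup, no `--node_rank` is passed: the rendezvous backend assigns ranks once the nodes have joined.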

romilbhardwaj · 2024-12-12T05:28:17Z

examples/distributed-pytorch/README.md

+
+For example, the following command will spawn 4 nodes with 4 L4 GPUs each.
+
+`sky launch -c train.yaml --num-nodes 2 --gpus L4:2 --cpus 8+`


Change to `--num-nodes 4` and `--gpus L4:4` so the command matches the description above.

Suggested change

`sky launch -c train.yaml --num-nodes 2 --gpus L4:2 --cpus 8+`

```
sky launch -c train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```
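To make the mismatch concrete, here is a small sketch of the arithmetic behind the node and GPU counts. It assumes one worker process per GPU and the usual static rank layout, where a worker's global rank is `node_rank * gpus_per_node + local_rank`. With the `rdzv` backend the ordering of nodes is not guaranteed to follow this layout. The helper names are hypothetical.

```python
def world_size(num_nodes, gpus_per_node):
    """Total number of worker processes, assuming one process per GPU."""
    return num_nodes * gpus_per_node


def global_ranks(num_nodes, gpus_per_node, node_rank):
    """Global ranks hosted on one node under the static rank layout."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    first = node_rank * gpus_per_node
    return list(range(first, first + gpus_per_node))


if __name__ == "__main__":
    # The original command (2 nodes, L4:2) versus the described setup (4 nodes, L4:4).
    print(world_size(2, 2))
    print(world_size(4, 4))
    print(global_ranks(4, 4, node_rank=3))
```

Under these assumptions the original flags start 4 workers, while the described 4-node, 4-GPU setup starts 16.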

Michaelvll added 2 commits December 12, 2024 01:16

Add example for distributed pytorch

179520e

update

2dd3af5
