[Example] PyTorch distributed training with minGPT #4464

Merged · 11 commits · Dec 18, 2024
81 changes: 81 additions & 0 deletions examples/distributed-pytorch/README.md
@@ -0,0 +1,81 @@
# Distributed Training with PyTorch

This example demonstrates how to run distributed training with PyTorch using SkyPilot.

**This example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp).**


## Overview

There are two ways to run distributed training with PyTorch:

1. Using normal `torchrun`
2. Using the `rdzv` backend

The main difference between the two for fixed-size distributed training is that the `rdzv` backend automatically assigns a rank to each node, while plain `torchrun` requires the rank to be set manually via `--node_rank`.

SkyPilot provides convenient built-in environment variables to help you start distributed training easily.
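
To see what these variables look like on your cluster, you could temporarily add a snippet like the one below to a task's `run` section (a minimal sketch for illustration; the exact values depend on the cluster you launch):

```bash
# Print the SkyPilot-provided variables used throughout this example.
# Each node of a multi-node task runs the same `run` commands, but with
# its own SKYPILOT_NODE_RANK.
echo "Node rank:       $SKYPILOT_NODE_RANK"         # 0 on the head node
echo "Number of nodes: $SKYPILOT_NUM_NODES"
echo "GPUs per node:   $SKYPILOT_NUM_GPUS_PER_NODE"

# SKYPILOT_NODE_IPS is a newline-separated list of the nodes' IPs;
# the first entry is used as the master address in the examples below.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Head node IP:    $MASTER_ADDR"
```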

### Using normal `torchrun`


The following command will spawn 2 nodes, each with 1 L4 GPU:
```bash
sky launch -c train train.yaml
```

In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```
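
Because SkyPilot runs the same `run` commands on every node, the variables above resolve differently per node. For the 2-node, 1-GPU cluster launched earlier, the command effectively expands to the following (shown only for illustration; `<head-node-ip>` is a placeholder for the first entry of `SKYPILOT_NODE_IPS`):

```bash
# On the head node (SKYPILOT_NODE_RANK=0):
torchrun --nnodes=2 --nproc_per_node=1 \
  --master_addr=<head-node-ip> --master_port=8008 \
  --node_rank=0 main.py

# On the worker node (SKYPILOT_NODE_RANK=1):
torchrun --nnodes=2 --nproc_per_node=1 \
  --master_addr=<head-node-ip> --master_port=8008 \
  --node_rank=1 main.py
```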



### Using `rdzv` backend

`rdzv` (rendezvous) is an alternative backend for distributed training:

```bash
sky launch -c train-rdzv train-rdzv.yaml
```

In [train-rdzv.yaml](./train-rdzv.yaml), we again launch the training with `torchrun`, but instead of setting the rank manually, we point every node at the same `c10d` rendezvous endpoint using the [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"

  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
```
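
With the `c10d` rendezvous, no `--node_rank` is passed: every node connects to the same endpoint on the head node, and `$SKYPILOT_TASK_ID` serves as a job-unique rendezvous ID so separate launches on the same machines don't collide. If the workers hang at startup, a quick reachability check like the following can help rule out networking issues (a diagnostic sketch, not part of the example; it assumes bash's `/dev/tcp` redirection and the `timeout` utility are available):

```bash
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
RDZV_PORT=29500

# The head node's torchrun process hosts the c10d store on this port once it
# has started, so the connection should succeed from any worker node.
if timeout 3 bash -c "exec 3<>/dev/tcp/${MASTER_ADDR}/${RDZV_PORT}"; then
  echo "Rendezvous endpoint ${MASTER_ADDR}:${RDZV_PORT} is reachable"
else
  echo "Cannot reach ${MASTER_ADDR}:${RDZV_PORT}; is torchrun running on the head node?"
fi
```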


## Scale up

If you would like to scale up the training, simply change the resource requirements; SkyPilot's built-in environment variables will be set accordingly.

For example, the following command will spawn 4 nodes, each with 4 L4 GPUs:

```bash
sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```

We also increase `--cpus` to `8+` so that the training is not bottlenecked by the CPU.
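
The `run` commands do not change when you scale: the DDP world size is simply the product of the two SkyPilot variables, which `torchrun` derives from `--nnodes` and `--nproc_per_node`. As a rough sanity check (a sketch using only the variables already shown), the 4-node, 4-GPU launch corresponds to 16 training processes:

```bash
# World size = nodes * processes per node (4 * 4 = 16 for the command above).
WORLD_SIZE=$(( SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE ))
echo "Launching ${WORLD_SIZE} training processes across ${SKYPILOT_NUM_NODES} nodes"
```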

29 changes: 29 additions & 0 deletions examples/distributed-pytorch/train-rdzv.yaml
@@ -0,0 +1,29 @@
name: minGPT-ddp-rdzv

resources:
  cpus: 4+
  accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"

  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
29 changes: 29 additions & 0 deletions examples/distributed-pytorch/train.yaml
@@ -0,0 +1,29 @@
name: minGPT-ddp

resources:
  cpus: 4+
  accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"

  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py