Fix broken links in README
li-yi-dong committed Sep 4, 2023
1 parent d2e05b8 commit 827c3e1
Showing 2 changed files with 9 additions and 9 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -35,7 +35,7 @@ Therefore, to facilitate the training of LLaMA-based models and reduce the cost

# OverlappedDistributedOptimizer

In the vanilla Megatron-LM, users can leverage [`DistributedOptimizer`]("https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md") to partition gradients and optimizer states to reduce GPU memory occupation. After accumulating all gradients in GA, `DistributedOptimizer` employs a `ReduceScatter` operation to scatter the gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an `AllGather` operation. However, we observe a significant communication overhead under small GA settings (over 50% of total time without GA).
In the vanilla Megatron-LM, users can leverage [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md) to partition gradients and optimizer states to reduce GPU memory occupation. After accumulating all gradients in GA, `DistributedOptimizer` employs a `ReduceScatter` operation to scatter the gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an `AllGather` operation. However, we observe a significant communication overhead under small GA settings (over 50% of total time without GA).
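
To make the data flow concrete, the following minimal PyTorch-style sketch illustrates this scatter/update/gather pattern. It is only an illustration of the collective pattern, assuming flat 1-D buffers whose length is divisible by the world size; the names `flat_grads`, `flat_params`, and `optimizer_step_fn` are hypothetical, and this is not Megatron-LM's actual `DistributedOptimizer` implementation.

```python
# Sketch of the DistributedOptimizer-style step: scatter-reduce gradients,
# update the local shard, then all-gather the updated parameters.
# Assumes torch.distributed is already initialized.
import torch
import torch.distributed as dist


def sharded_step(flat_grads: torch.Tensor,
                 flat_params: torch.Tensor,
                 optimizer_step_fn) -> None:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_params.numel() // world_size

    # 1. ReduceScatter: each rank receives the summed gradients of its own shard.
    grad_shard = torch.empty(shard_size, dtype=flat_grads.dtype,
                             device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.SUM)

    # 2. Each rank updates only the parameters it owns.
    param_shard = flat_params.narrow(0, rank * shard_size, shard_size)
    optimizer_step_fn(param_shard, grad_shard)

    # 3. AllGather: every rank collects the updated shards from all the others.
    dist.all_gather_into_tensor(flat_params, param_shard.clone())
```

With gradient accumulation, these two collectives run once per optimizer step rather than once per micro-batch, which is why their cost dominates when GA is small.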

To mitigate this overhead, we tried overlapping the collective communication with computation, following the partition strategy of DeepSpeed ZeRO Stage-2. This strategy fails to scale: at large scale it issues too many small `Reduce` operations, which under-utilize the interconnect bandwidth.
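
As a rough illustration of why this produces so many small collectives, the sketch below registers a backward hook per parameter that asynchronously `Reduce`s each gradient to an owner rank as soon as it is ready. The hook-based overlap and the round-robin ownership mapping are assumptions made for illustration only, not the Megatron-LLaMA or DeepSpeed implementation.

```python
# Illustrative sketch: overlap gradient reduction with the backward pass by
# launching one asynchronous Reduce per parameter toward a round-robin
# "owner" rank. One small collective per parameter is exactly the kind of
# fragmentation that under-utilizes interconnect bandwidth at large scale.
import torch
import torch.distributed as dist


def register_overlapped_reduce(model: torch.nn.Module):
    world_size = dist.get_world_size()
    pending = []

    def make_hook(owner_rank):
        def hook(grad):
            # async_op=True lets the reduction overlap with the rest of backward.
            work = dist.reduce(grad, dst=owner_rank,
                               op=dist.ReduceOp.SUM, async_op=True)
            pending.append(work)
            return grad
        return hook

    for idx, p in enumerate(model.parameters()):
        if p.requires_grad:
            p.register_hook(make_hook(idx % world_size))

    def wait_all():
        # Call after loss.backward(); only the owner ranks hold the fully
        # reduced gradients and run the optimizer step for their shard.
        for work in pending:
            work.wait()
        pending.clear()

    return wait_all
```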

@@ -125,7 +125,7 @@ In particular, we recommend increasing the micro-batch size to fully occupy the
| `--tokenizer-type=PretrainedFromHF` | Use a tokenizer from Hugging Face (loaded via `transformers.AutoTokenizer`) |
| `--distributed-checkpointing` | Distributed saving of checkpoint files. |

Megatron-LLaMA supports the canonical [data preprocessing]("https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#data-preprocessing") and [evaluation]("https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#evaluation-and-tasks") workflows described in the Megatron-LM library.
Megatron-LLaMA supports the canonical [data preprocessing](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#data-preprocessing) and [evaluation](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#evaluation-and-tasks) workflows described in the Megatron-LM library.

### Future work

@@ -144,9 +144,9 @@ Megatron-LLaMA is developed by Aicheng Technology, Alibaba Group and is based on

The following repositories are used in Megatron-LLaMA, either in close-to-original form or as inspiration:

[Megatron-LM]("https://github.com/NVIDIA/Megatron-LM")
[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)

[LLaMA]("https://github.com/facebookresearch/llama")
[LLaMA](https://github.com/facebookresearch/llama)

[DeepSpeed]("https://github.com/microsoft/DeepSpeed")
[DeepSpeed](https://github.com/microsoft/DeepSpeed)

8 changes: 4 additions & 4 deletions README_zh.md
@@ -36,7 +36,7 @@ LLaMA is an important work in the current open-source large language model community. In the LLM

## 2. Introduction to `OverlappedDistributedOptimizer` in Megatron-LLaMA

In vanilla Megatron-LM, users can use [`DistributedOptimizer`]("https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md") to partition gradients and optimizer states and reduce GPU memory usage during training. After obtaining the gradients of each preset gradient-accumulation group, `DistributedOptimizer` distributes all previously accumulated gradients to the different ranks via a `ReduceScatter` operator. After each rank has updated the parameters it owns, the updated parameters are copied to all ranks via an `AllGather` operator. In practice, we observe that the collective communication of `DistributedOptimizer` introduces very large extra overhead when gradient accumulation is small; in the extreme case without gradient accumulation, the extra overhead exceeds 50% of the total time.
In vanilla Megatron-LM, users can use [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md) to partition gradients and optimizer states and reduce GPU memory usage during training. After obtaining the gradients of each preset gradient-accumulation group, `DistributedOptimizer` distributes all previously accumulated gradients to the different ranks via a `ReduceScatter` operator. After each rank has updated the parameters it owns, the updated parameters are copied to all ranks via an `AllGather` operator. In practice, we observe that the collective communication of `DistributedOptimizer` introduces very large extra overhead when gradient accumulation is small; in the extreme case without gradient accumulation, the extra overhead exceeds 50% of the total time.

While trying to overlap communication with computation, we experimented with the gradient and optimizer-state partitioning scheme of DeepSpeed ZeRO-2. At very large scale, we observed that this partitioning requires a large number of small, fragmented communication kernels, which cannot fully utilize the communication bandwidth; communication takes too long, and the model's computation is not enough to fully overlap with it.

@@ -142,8 +142,8 @@ Megatron-LLaMA is released under the Apache 2.0 open-source license and may be used commercially. For details

### Referenced works

[Megatron-LM]("https://github.com/NVIDIA/Megatron-LM")
[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)

[LLaMA]("https://github.com/facebookresearch/llama")
[LLaMA](https://github.com/facebookresearch/llama)

[DeepSpeed]("https://github.com/microsoft/DeepSpeed")
[DeepSpeed](https://github.com/microsoft/DeepSpeed)
