
Commit

updated docs
Signed-off-by: ftgreat <[email protected]>
ftgreat committed Jun 8, 2023
1 parent c2f99c5 commit 9a5553e
Showing 5 changed files with 37 additions and 12 deletions.
7 changes: 6 additions & 1 deletion examples/aquila/aquila-code/README_AquilaCode-7B-nv.md
@@ -31,12 +31,17 @@ We also support [Huggingface](hflink)

我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表:

我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。


We used a series of more efficient low-level operators to assist with model training, including an attention implementation that follows [flash-attention](https://github.com/HazyResearch/flash-attention) and replaces some intermediate computations, as well as RMSNorm. On top of this, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.
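To make the RMSNorm piece above concrete, here is a minimal PyTorch sketch of the layer; it is a generic reference implementation for illustration only, not the exact operator used in this repository.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root-mean-square of the activations,
    with a learnable gain and no mean subtraction (unlike LayerNorm)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # inverse RMS over the hidden dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```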

The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below:

For English, Chinese, and code data, we sampled ten thousand examples each, tokenized them with each tokenizer, counted the number of tokens per sample, and recorded the average token counts in the table as well; a short counting sketch follows the table.

| 模型/Model | 词表大小/Vocab size | 说明/Note | 英文平均tokens量/Avg tokens(English) | 中文平均tokens量/Avg tokens(Chinese) | 代码平均tokens量/Avg tokens(code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| gpt2 | 50257 | bpe | 1717 | 1764 | 2323 |
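The average-token columns in the table above can be reproduced with a short counting script along the lines of the sketch below; the tokenizer list and the sample file are placeholders, not the exact data used for the table.

```python
# Hedged sketch: average tokens per sample for several tokenizers.
# "samples_en.txt" and the tokenizer list are placeholders, not the repo's actual setup.
from transformers import AutoTokenizer

tokenizer_names = ["gpt2"]  # extend with the other tokenizers being compared

with open("samples_en.txt", encoding="utf-8") as f:
    samples = [line.strip() for line in f][:10000]

for name in tokenizer_names:
    tok = AutoTokenizer.from_pretrained(name)
    counts = [len(tok.encode(text)) for text in samples]
    print(f"{name}: vocab={tok.vocab_size}, avg tokens={sum(counts) / len(counts):.1f}")
```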
7 changes: 6 additions & 1 deletion examples/aquila/aquila-code/README_AquilaCode-7B-ts.md
@@ -32,12 +32,17 @@ We also support [Huggingface](hflink)

我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表:

我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。


We used a series of more efficient low-level operators to assist with model training, including an attention implementation that follows [flash-attention](https://github.com/HazyResearch/flash-attention) and replaces some intermediate computations, as well as RMSNorm. On top of this, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.

The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below:

For English, Chinese, and code data, we sampled ten thousand examples each, tokenized them with each tokenizer, counted the number of tokens per sample, and recorded the average token counts in the table as well.

| 模型/Model | 词表大小/Vocab size | 说明/Note | 英文平均tokens量/Avg tokens(English) | 中文平均tokens量/Avg tokens(Chinese) | 代码平均tokens量/Avg tokens(code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| gpt2 | 50257 | bpe | 1717 | 1764 | 2323 |
13 changes: 9 additions & 4 deletions examples/aquila/aquila-pretrain/README_Aquila-33B.md
@@ -34,12 +34,17 @@ We also support [Huggingface](hflink)

我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表:

我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。


We used a series of more efficient low-level operators to assist with model training, including an attention implementation that follows [flash-attention](https://github.com/HazyResearch/flash-attention) and replaces some intermediate computations, as well as RMSNorm. On top of this, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.

The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below:

For English, Chinese, and code data, we sampled ten thousand examples each, tokenized them with each tokenizer, counted the number of tokens per sample, and recorded the average token counts in the table as well.

| 模型/Model | 词表大小/Vocab size | 说明/Note | 英文平均tokens量/Avg tokens(English) | 中文平均tokens量/Avg tokens(Chinese) | 代码平均tokens量/Avg tokens(code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| gpt2 | 50257 | bpe | 1717 | 1764 | 2323 |
@@ -57,9 +62,9 @@ We used a series of high-quality Chinese and English datasets to train and fine-
![Screenshot](../img/data.jpg)


-## 快速使用/Quick start
+## 使用方式/How to use

-### 预训练/Pre-training
+### 1. 预训练/Pre-training
#### Step 1: 修改参数/Modify Parameters

* `cd /examples/aquila/aquila-pretrain`
@@ -79,7 +84,7 @@ bash dist_trigger_docker.sh hostfile aquila-pretrain.yaml aquila-30b [实验名]

![Screenshot](../img/info2.jpg)

-### 可监督微调/Supervised Fine-tuning(SFT)
+### 2. 可监督微调/Supervised Fine-tuning(SFT)
#### Step 1: 修改参数/Modify Parameters
* `cd /examples/aquila/aquila-pretrain`
* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
9 changes: 7 additions & 2 deletions examples/aquila/aquila-pretrain/README_Aquila-7B.md
@@ -30,17 +30,22 @@ We also support [Huggingface](hflink)
| Aquila-33B | Apache 2.0 || xx | xx | Nvidia-A100 |
| AquilaCode-7B-nv | Apache 2.0 || 235B | 14x8x8 | Nvidia-A100 |
| AquilaCode-7B-ts | Apache 2.0 || 75B | 9x32x8 | Tianshu-BI-V100 |
-| AquilaChat-7B | Apache 2.0 || 1 | dx1x8 | Nvidia-A100 |
+| AquilaChat-7B | Apache 2.0 || 15万条 | 8/24x1x8 | Nvidia-A100 |


我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表:

我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。


We used a series of more efficient low-level operators to assist with model training, including an attention implementation that follows [flash-attention](https://github.com/HazyResearch/flash-attention) and replaces some intermediate computations, as well as RMSNorm. On top of this, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.

The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below:

For English, Chinese, and code data, we sampled ten thousand examples each, tokenized them with each tokenizer, counted the number of tokens per sample, and recorded the average token counts in the table as well.

| 模型/Model | 词表大小/Vocab size | 说明/Note | 英文平均tokens量/Avg tokens(English) | 中文平均tokens量/Avg tokens(Chinese) | 代码平均tokens量/Avg tokens(code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| gpt2 | 50257 | bpe | 1717 | 1764 | 2323 |
13 changes: 9 additions & 4 deletions examples/aquila/aquila-sft/README_AquilaChat-7B.md
@@ -34,12 +34,17 @@ We also support [Huggingface](hflink)

我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表:

我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。


We used a series of more efficient low-level operators to assist with model training, including an attention implementation that follows [flash-attention](https://github.com/HazyResearch/flash-attention) and replaces some intermediate computations, as well as RMSNorm. On top of this, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.

The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below:

For English, Chinese, and code data, we sampled ten thousand examples each, tokenized them with each tokenizer, counted the number of tokens per sample, and recorded the average token counts in the table as well.

| 模型/Model | 词表大小/Vocab size | 说明/Note | 英文平均tokens量/Avg tokens(English) | 中文平均tokens量/Avg tokens(Chinese) | 代码平均tokens量/Avg tokens(code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| gpt2 | 50257 | bpe | 1717 | 1764 | 2323 |
@@ -61,9 +66,9 @@ We used a series of high-quality Chinese and English datasets to train and fine-
![Screenshot](../img/data.jpg)


-## 快速使用/Quick start
+## 使用方式/How to use

-### 推理/Inference
+### 1. 推理/Inference

```python
import os
@@ -171,7 +176,7 @@ with torch.no_grad():

```
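The FlagAI inference snippet above is collapsed in this diff view. As a rough alternative for the Hugging Face build mentioned at the top of the README ("We also support Huggingface"), generation might look like the sketch below; the model id and generation settings are assumptions rather than values taken from this commit.

```python
# Hedged sketch of Hugging Face-style inference; the model id and the
# generation settings are assumptions and may differ from the released checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/AquilaChat-7B"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

prompt = "What is supervised fine-tuning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```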

-### 可监督微调/Supervised Fine-tuning(SFT)
+### 2. 可监督微调/Supervised Fine-tuning(SFT)
#### Step 1: 配置模型/Setup Checkpoints
`./checkpoints_in`里新建`aquila-7b`目录。将微调后的checkpoint,以及原始`aquila-7b`模型里的其余文件,包括`config.json`, `merges.txt`, `vocab.json`, `special_tokens_map.json`放进去。Create an `aquila-7b` directory under `./checkpoints_in`, then place the fine-tuned checkpoint there together with the remaining files from the original `aquila-7b` model, including `config.json`, `merges.txt`, `vocab.json`, and `special_tokens_map.json`.
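A minimal Python sketch of that layout step is shown below; the source paths and the checkpoint file name are placeholders to be replaced with your actual locations.

```python
# Hedged helper for the checkpoint layout described above; the source paths
# and the checkpoint file name are placeholders, not values from this repo.
import os
import shutil

sft_ckpt = "/path/to/finetuned/pytorch_model.bin"  # fine-tuned checkpoint (placeholder)
base_dir = "/path/to/original/aquila-7b"           # original aquila-7b files (placeholder)
dst_dir = "./checkpoints_in/aquila-7b"

os.makedirs(dst_dir, exist_ok=True)
shutil.copy(sft_ckpt, dst_dir)
for name in ["config.json", "merges.txt", "vocab.json", "special_tokens_map.json"]:
    shutil.copy(os.path.join(base_dir, name), dst_dir)
```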

