modified docs and files
Signed-off-by: ftgreat <[email protected]>
ftgreat committed Jun 8, 2023
1 parent e4b9830 commit 5af62c6
Showing 56 changed files with 689 additions and 1,314 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -10,13 +10,7 @@

FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality.

<p align="center">
Platforms supported
</p>
Put this at the end and add a logo------------------
****
Tianshu Nvidia
****


## Why should I use FlagAI?

@@ -299,6 +293,12 @@ The majority of FlagAI is licensed under the [Apache 2.0 license](LICENSE), however
- [29 Jun 2022] release v1.1.0, support OPTs downloading and inference/fine-tuning [#63](https://github.com/FlagAI-Open/FlagAI/pull/63)
- [17 May 2022] made our first contribution in [#1](https://github.com/FlagAI-Open/FlagAI/pull/1)

## Platforms supported

<div align="center">
<img src="./examples/aquila/img/merged_platform.jpg" height="100" align="center" />
</div>



## Misc
5 changes: 5 additions & 0 deletions README_zh.md
@@ -289,6 +289,11 @@ FlagAI飞智大部分项目基于 [Apache 2.0 license](LICENSE),但是请注
* GLM 是基于协议 [MIT license](https://github.com/THUDM/GLM/blob/main/LICENSE)
* AltDiffusion 是基于协议 [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license)

## 平台支持

<div align="center">
<img src="./examples/aquila/img/merged_platform.jpg" height="100" align="center" />
</div>


## Misc
16 changes: 16 additions & 0 deletions examples/Aquila/Aquila-code/Aquila-code.yaml
@@ -0,0 +1,16 @@
batch_size: 10
gradient_accumulation_steps: 1
lr: 2.0e-5
warm_up: 0.01
save_interval: 1000

bmt_cpu_offload: False
bmt_pre_load: False
bmt_async_load: False
bmt_loss_scale: 524288

save_optim: True
save_rng: True

load_optim: False
resume_dataset: False
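
These keys are merged into the trainer's `EnvArgs` at launch (see the override loop in `aquila_code.py` later in this commit). A minimal sketch of that behavior, assuming PyYAML; the file path and the stand-in defaults are illustrative:

```python
import yaml

# Read the overrides from the yaml shown above (path is illustrative).
with open("examples/Aquila/Aquila-code/Aquila-code.yaml", encoding="utf-8") as f:
    overrides = yaml.safe_load(f)

# Stand-in for a few EnvArgs defaults from aquila_code.py.
args = {"batch_size": 1, "lr": 2e-4, "save_interval": 5000}
args.update(overrides)  # yaml values win over the defaults

print(args["batch_size"], args["lr"], args["save_interval"])  # -> 10 2e-05 1000
```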
@@ -1,25 +1,25 @@
license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement)


# AquilaCode-7B-nv
# AquilaCode-7B

## 简介/Overview
Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Megatron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。

The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replaces a set of underlying operators with more efficient implementations, redesigns the tokenizer for Chinese-English bilingual support, and upgrades the BMTrain parallel training method, achieving nearly 8x the training efficiency of Megatron+DeepSpeed ZeRO-2 during Aquila's training. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data-quality control and various training optimizations, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports bilingual Chinese-English knowledge, commercial licensing, and compliance with domestic data regulations.

AquilaCode-7B-nv是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下
<!-- AquilaCode-7B-NV是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下
AquilaCode-7B-nv is a foundational code model obtained by continued pretraining on code data on top of the Aquila-7B model. It was developed by the Beijing Academy of Artificial Intelligence. The evaluation results on mainstream benchmark datasets are as follows:
| 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM|
| ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- |
| [AquilaCode-7B-nv](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx |
| [AquilaCode-7B-nv](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx | -->


您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标
<!-- 您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标
You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details
You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details -->



@@ -49,17 +49,11 @@ We used different tokenizers to extract ten thousand data samples from English,
| gpt2_new_100k | 100000 | bpe | 1575 | 477 | 1679 |


模型在8台8卡Nvidia A100-40G上训练14天,数据集规模为2350亿。

The model was trained for 14 days on 8 nodes with 8 Nvidia A100-40G GPUs each; the training set contains 235B tokens.

## 训练数据集/Training data
AquilaCode-7B-nv训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据
`AquilaCode-7B-NV`和`AquilaCode-7B-TS`训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据

Continued pretraining was performed on top of our model--------
The AquilaCode-7B-nv model underwent supervised fine-tuning on [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) (shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text).

![Screenshot](../img/data.jpg)
The AquilaCode-7B-NV model underwent supervised fine-tuning on [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) (shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text).

## 使用方式/How to use

@@ -125,12 +119,12 @@ with torch.no_grad():

### 2. 可监督微调/Supervised Fine-tuning(SFT)
#### Step 1: 配置模型/ Setup Checkpoints
`./checkpoints_in`里新建`aquilacode-7b-nv`目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `merges.txt`, `vocab.json`, `special_tokens_map.json`放进去
`./checkpoints_in`里新建`aquilacode-7b-nv`(或`aquilacode-7b-ts`)目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `merges.txt`, `vocab.json`, `special_tokens_map.json`放进去

Create a new directory named `aquilacode-7b-nv` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `merges.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory.
Create a new directory named `aquilacode-7b-nv` (or `aquilacode-7b-ts`) inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `merges.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory.

#### Step 2: 修改参数/Modify Parameters
* `cd /examples/aquila`
* `cd /examples/Aquila/Aquila-code`
* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft_code.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_sft_code.py`
* (可选) 在`Aquila-sft.yaml`文件里更改参数 ; (optional) change parameters in `Aquila-sft.yaml`
@@ -148,7 +142,7 @@ Create a new directory named `aquilacode-7b-nv` inside `./checkpoints_in`. Place

#### Step 3: 启动可监督微调/Start SFT
```
bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [实验名]
bash dist_trigger_docker.sh hostfile Aquila-sft.yaml [aquilacode-7b-nv/aquilacode-7b-ts] [实验名]
```
接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run.

224 changes: 224 additions & 0 deletions examples/Aquila/Aquila-code/aquila_code.py
@@ -0,0 +1,224 @@
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import os
import torch
from torch.utils.data import Dataset
import gc
gc.collect()              # release leftover Python objects
torch.cuda.empty_cache()  # and any cached CUDA memory before building the model
import sys
sys.path.append("/data2/yzd/workspace/FlagAI")  # developer-local checkout; adjust or remove for your environment
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.tokenizer import Tokenizer
from flagai.env_args import EnvArgs
from flagai.env_trainer_v1 import EnvTrainer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# You can pass all parameters on the command line.
# For example: python aquila_code.py --epochs=300 --batch_size=4 --env_type=pytorch
env_args = EnvArgs(
    env_type="bmtrain",
    experiment_name="aquila",
    batch_size=1,
    gradient_accumulation_steps=1,
    lr=2e-4,
    weight_decay=1e-3,
    epochs=100,
    log_interval=10,
    eval_interval=5000,
    num_gpus=1,
    load_dir=None,
    pytorch_device=device,
    save_dir="checkpoints_aquila",
    checkpoint_activations=False,
    save_interval=5000,
    fp16=True,
    training_script=__file__,
)
env_args = env_args.parse_args()
#env_args.wandb = False

# overwrite defaults with values from the yaml config, if one was given
if env_args.yaml_config:
    import yaml
    with open(env_args.yaml_config, 'r', encoding="utf-8") as f:
        file_data = f.read()
    data = yaml.safe_load_all(file_data)
    delattr(env_args, 'yaml_config')
    arg_dict = env_args.__dict__
    for subdata in data:
        for key, value in subdata.items():
            if isinstance(value, list):
                for v in value:
                    arg_dict[key].append(v)
            else:
                arg_dict[key] = value
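# Effective precedence from here on: yaml values override command-line flags,
# which override the EnvArgs defaults above; list-valued yaml keys are
# appended to the existing list rather than replaced.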
trainer = EnvTrainer(env_args)

# Trainer as trigger: the initial launcher process exits here; only the
# re-launched worker processes (started with --not_call_launch) continue.
if not env_args.not_call_launch:
    import sys
    sys.exit(0)

print(f"Trainer effective env_args={env_args} local_rank={trainer.local_rank}", flush=True)

checkpoints = env_args.pre_load_dir

model_name = env_args.model_name

env_args.enable_sft_conversations_dataset_v3 = True


print('*'*20, "model_name", model_name, flush=True)

'''
auto_loader = AutoLoader(
"lm",
model_name=model_name,
model_dir=checkpoints,
only_download_config=True,
)
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()
print('*'*20, "model", model)
trainer.pre_train(model)
print('*'*20, "model", model)
'''

cache_dir = os.path.join(checkpoints, model_name)
print('*'*20, "cache_dir", cache_dir)
tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir)
print('*'*20, "tokenizer", tokenizer)

# avoid all ranks loading the model at once, which can exhaust host memory
if env_args.bmt_async_load:
    import time
    # stagger local ranks 0-3 by 0/10/20/30 minutes
    time.sleep(10 * 60 * (trainer.local_rank % 4))


config_file = os.path.join(cache_dir, 'config.json')
from flagai.model.aquila_model import AQUILAModel
model = AQUILAModel.init_from_json(config_file=config_file)
print('*'*20, "model", model)

## bmt_pre_load
checkpoint_path = os.path.join(cache_dir, "pytorch_model.bin")
if env_args.bmt_pre_load:
    model.load_weights(checkpoint_path)

trainer.pre_train(model)

print('*'*20, "model", model, flush=True)

assert env_args.enable_sft_dataset_dir is not None and \
       env_args.enable_sft_dataset_file is not None

cur_dir = env_args.enable_sft_dataset_dir
jsonl_data = os.path.join(cur_dir, env_args.enable_sft_dataset_file)
max_seq_len = 2048

import jsonlines

def read_file():
    conversations = []
    with jsonlines.open(jsonl_data) as reader:
        for line in reader:
            if 'chat_desc' not in line or 'instruction' not in line or 'conversations' not in line:
                continue
            obj = dict()
            obj['chat_desc'] = line['chat_desc']
            obj['conversations'] = line['conversations']
            obj['instruction'] = line['instruction']
            conversations.append(obj)
    return conversations
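
# Hypothetical shape of one JSONL record (field names taken from the checks
# above; the values are purely illustrative):
# {"chat_desc": "A chat between a curious human and an AI assistant.",
#  "instruction": "Write clean Python code.",
#  "conversations": [{"from": "human", "value": "Reverse a string."},
#                    {"from": "gpt", "value": "def rev(s): return s[::-1]"}]}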

class ConversationDataset(Dataset):
    def __init__(self, conversations, tokenizer, maxlen=512):
        super(ConversationDataset, self).__init__()
        self.conversations = conversations
        self.tokenizer = tokenizer
        self.maxlen = maxlen

    def __getitem__(self, i):
        chat_desc = self.conversations[i]['chat_desc']
        instruction = self.conversations[i]['instruction']
        conversations = self.conversations[i]['conversations']

        # chat_desc
        example = self.tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
        EOS_TOKEN = example[-1]
        example = example[:-1]  # remove eos
        # instruction
        instruction = self.tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
        instruction = instruction[1:-1]  # remove bos & eos
        example += instruction

        import copy
        labels = copy.deepcopy(example)

        for conversation in conversations:
            role = conversation['from']
            content = conversation['value']
            content = self.tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
            content = content[1:-1]  # remove bos & eos
            example += content
            if role == 'gpt':
                role_labels = copy.deepcopy(content)
            else:
                # mask non-gpt turns so the loss is computed only on replies
                role_labels = [env_args.IGNORE_INDEX] * len(content)
            labels += role_labels

        example.append(EOS_TOKEN)
        labels.append(EOS_TOKEN)

        # truncate to maxlen (this may also drop the EOS appended above)
        example = example[:self.maxlen]
        labels = labels[:self.maxlen]

        output = {
            "input_ids": example,
            "labels": labels,
        }
        return output

    def __len__(self):
        return len(self.conversations)

    @staticmethod
    def collate_fn(batch):
        def padding(indice, max_length, pad_idx=0):
            pad_indice = [
                item + [pad_idx] * max(0, max_length - len(item)) for item in indice
            ]
            return torch.tensor(pad_indice)

        input_ids = [data["input_ids"] for data in batch]
        labels = [data["labels"] for data in batch]
        max_length = max_seq_len
        # pad input_ids with 0 and labels with IGNORE_INDEX so that padded
        # positions are excluded from the loss
        input_ids = padding(input_ids, max_length)[:, :max_length]
        labels = padding(labels, max_length, pad_idx=env_args.IGNORE_INDEX)[:, :max_length]

        data = {
            "input_ids": input_ids,
            "labels": labels
        }
        return data

conversations = read_file()
data_len = len(conversations)
#train_size = int(data_len * 0.95)
train_size = data_len
train_conversations = conversations[:train_size]

train_dataset = ConversationDataset(train_conversations,
                                    tokenizer=tokenizer,
                                    maxlen=max_seq_len)

trainer.do_train(
    train_dataset=train_dataset,
    valid_dataset=None,
    collate_fn=ConversationDataset.collate_fn,
    optimizer=None,
    rank_split=False)
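
For a quick offline sanity check of the dataset logic above, here is a hypothetical smoke test with a stub tokenizer (the class, token ids, and record are all illustrative; the real script uses the Aquila tokenizer and the parsed `env_args`):

```python
# Hypothetical smoke test; not part of the training script.
class StubTokenizer:
    """Fake tokenizer: BOS=1, EOS=2, one synthetic id per character."""
    def encode_plus(self, text, second=None, max_length=None):
        return {"input_ids": [1] + [ord(c) % 1000 + 10 for c in text] + [2]}

record = {
    "chat_desc": "A chat between a human and an AI assistant.",
    "instruction": "Write a Python function.",
    "conversations": [
        {"from": "human", "value": "Reverse a string."},
        {"from": "gpt", "value": "def rev(s): return s[::-1]"},
    ],
}

ds = ConversationDataset([record], tokenizer=StubTokenizer(), maxlen=128)
item = ds[0]
assert len(item["input_ids"]) == len(item["labels"])
# The human turn should be masked with IGNORE_INDEX in the labels.
print(sum(l == env_args.IGNORE_INDEX for l in item["labels"]), "masked positions")
```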
File renamed without changes.
@@ -11,13 +11,7 @@
import random
import numpy as np
from flagai.model.predictor.predictor import Predictor
from pathlib import Path
from flagai.data.tokenizer import Tokenizer
import torch.distributed as dist
import json
import json, datetime

import os

model_dir = "./checkpoints_in"
device = "cuda"
@@ -32,11 +26,6 @@
model = loader.get_model()
tokenizer = loader.get_tokenizer()

# import pdb;pdb.set_trace()
# ckpt = torch.load('./checkpoints_in/aquilacode-7b-nv/pytorch_model.bin', map_location=torch.device('cpu'))
# # print(ckpt)
# model.load_state_dict(ckpt, strict=True)

model.eval()

model.to(device)
Expand All @@ -61,5 +50,5 @@
res = predictor.predict_generate_randomsample(prompt,
out_max_length=max_length,
top_p=0.95,
temperature=t0.7)
temperature=0.7)
print(res)
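
For reference, a hypothetical sweep over the sampling temperature using the same predictor call shown above (signature as in the diff; the temperature values are illustrative):

```python
# Hypothetical: compare generations at a few sampling temperatures.
for t in (0.3, 0.7, 1.0):
    res = predictor.predict_generate_randomsample(prompt,
                                                  out_max_length=max_length,
                                                  top_p=0.95,
                                                  temperature=t)
    print(f"temperature={t}:\n{res}\n")
```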
File renamed without changes.