diff --git a/README.md b/README.md index 85c75f45..735e8cb1 100755 --- a/README.md +++ b/README.md @@ -10,13 +10,7 @@ FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale model. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality. -

-Platforms supported -

-放到后面,加logo------------------ -**** - Tianshu Nvidia -**** + ## Why should I use FlagAI? @@ -299,6 +293,12 @@ The majority of FlagAI is licensed under the [Apache 2.0 license](LICENSE), howe - [29 Jun 2022] release v1.1.0, support OPTs downloading and inference/fine-tuning [#63](https://github.com/FlagAI-Open/FlagAI/pull/63) - [17 May 2022] made our first contribution in [#1](https://github.com/FlagAI-Open/FlagAI/pull/1) +## Platforms supported + +
+ +
+ ## Misc diff --git a/README_zh.md b/README_zh.md index 0a51ff13..3800c70b 100755 --- a/README_zh.md +++ b/README_zh.md @@ -289,6 +289,11 @@ FlagAI飞智大部分项目基于 [Apache 2.0 license](LICENSE),但是请注 * GLM 是基于协议 [MIT license](https://github.com/THUDM/GLM/blob/main/LICENSE) * AltDiffusion 是基于协议 [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) +## 平台支持 + +
+ +
## Misc diff --git a/examples/Aquila/Aquila-code/Aquila-code.yaml b/examples/Aquila/Aquila-code/Aquila-code.yaml new file mode 100755 index 00000000..4de3b15e --- /dev/null +++ b/examples/Aquila/Aquila-code/Aquila-code.yaml @@ -0,0 +1,16 @@ +batch_size: 10 +gradient_accumulation_steps: 1 +lr: 2.0e-5 +warm_up: 0.01 +save_interval: 1000 + +bmt_cpu_offload: False +bmt_pre_load: False +bmt_async_load: False +bmt_loss_scale: 524288 + +save_optim: True +save_rng: True + +load_optim: False +resume_dataset: False \ No newline at end of file diff --git a/examples/aquila/aquila-code/README_AquilaCode-7B-nv.md b/examples/Aquila/Aquila-code/README_AquilaCode-7B-NV.md similarity index 85% rename from examples/aquila/aquila-code/README_AquilaCode-7B-nv.md rename to examples/Aquila/Aquila-code/README_AquilaCode-7B-NV.md index 229592ed..42baa0f9 100755 --- a/examples/aquila/aquila-code/README_AquilaCode-7B-nv.md +++ b/examples/Aquila/Aquila-code/README_AquilaCode-7B-NV.md @@ -1,25 +1,25 @@ license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement) -# AquilaCode-7B-nv +# AquilaCode-7B ## 简介/Overview Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Magtron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。 The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replacing a batch of more efficient underlying operator implementations and redesigning the tokenizer for Chinese-English bilingual support. It upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Magtron+DeepSpeed ZeRO-2 in the training process of Aquila. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. 
Through data quality control and various training optimization methods, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports Chinese-English-Knowledge, commercial licensing, and complies with domestic data regulations. -AquilaCode-7B-nv是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下 + -您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标 + @@ -49,17 +49,11 @@ We used different tokenizers to extract ten thousand data samples from English, | gpt2_new_100k | 100000 | bpe|1575 | 477|1679 | -模型在8台8卡Nvidia A100-40G上训练14天,数据集规模为2350亿。 - -The model was trained on an 8 8-card Nvidia A100-40G for 14 days, and there are 235B tokens in the train set. - ## 训练数据集/Training data -AquilaCode-7B-nv训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql,C, C++, Java, Javascript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据 +`AquilaCode-7B-NV`和`AquilaCode-7B-TS`训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql,C, C++, Java, Javascript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据 给予我们的模型进行了continue pretrain-------- -The AquilaCode-7B-nv model was supervised fine-tuning on [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)(shell, sql,C, C++, Java, Javascript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text). - -![Screenshot](../img/data.jpg) +The AquilaCode-7B-NV model was supervised fine-tuning on [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)(shell, sql,C, C++, Java, Javascript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text). ## 使用方式/How to use @@ -125,12 +119,12 @@ with torch.no_grad(): ### 2. 
可监督微调/Supervised Fine-tuning(SFT) #### Step 1: 配置模型/ Setup Checkpoints -在`./checkpoints_in`里新建`aquilacode-7b-nv`目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `mergex.txt`, `vocab.json`, `special_tokens_map.json`放进去 +在`./checkpoints_in`里新建`aquilacode-7b-nv`(或`aquilacode-7b-ts`)目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `mergex.txt`, `vocab.json`, `special_tokens_map.json`放进去 -Create a new directory named `aquilacode-7b-nv` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. +Create a new directory named `aquilacode-7b-nv` (or`aquilacode-7b-ts`) inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. #### Step 2: 修改参数/Modify Parameters -* `cd /examples/aquila` +* `cd /examples/Aquila/Aquila-code` * 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) * 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft_code.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_sft_code.py` * (可选) 在`Aquila-sft.yaml`文件里更改参数 ; (optional) change parameters in `Aquila-sft-code.yaml` @@ -148,7 +142,7 @@ Create a new directory named `aquilacode-7b-nv` inside `./checkpoints_in`. Place #### Step 3: 启动可监督微调/Start SFT ``` -bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [实验名] +bash dist_trigger_docker.sh hostfile Aquila-sft.yaml [aquilacode-7b-nv/aquilacode-7b-ts] [实验名] ``` 接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. 
Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. diff --git a/examples/Aquila/Aquila-code/aquila_code.py b/examples/Aquila/Aquila-code/aquila_code.py new file mode 100755 index 00000000..67c6af4a --- /dev/null +++ b/examples/Aquila/Aquila-code/aquila_code.py @@ -0,0 +1,224 @@ +# Copyright © 2022 BAAI. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License") +import os +import torch +from torch.utils.data import Dataset +import gc +gc.collect() +torch.cuda.empty_cache() +import sys;sys.path.append("/data2/yzd/workspace/FlagAI") +from flagai.auto_model.auto_loader import AutoLoader +from flagai.data.tokenizer import Tokenizer +from flagai.env_args import EnvArgs +from flagai.env_trainer_v1 import EnvTrainer + +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + +# You can input all parameters by the command line. +# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch +env_args = EnvArgs( + env_type="bmtrain", + experiment_name="aquila", + batch_size=1, + gradient_accumulation_steps=1, + lr=2e-4, + weight_decay=1e-3, + epochs=100, + log_interval=10, + eval_interval=5000, + num_gpus=1, + load_dir=None, + pytorch_device=device, + save_dir="checkpoints_aquila", + checkpoint_activations=False, + save_interval=5000, + fp16=True, + training_script=__file__, +) +env_args = env_args.parse_args() +#env_args.wandb = False + +# override the defaults above with the values from the YAML config +if env_args.yaml_config: + import yaml + file_data = open(env_args.yaml_config, 'r', encoding="utf-8").read() + data = yaml.safe_load_all(file_data)  # PyYAML >= 6 requires an explicit Loader for yaml.load_all + delattr(env_args, 'yaml_config') + arg_dict = env_args.__dict__ + for subdata in data: + for key, value in subdata.items(): + if isinstance(value, list): + for v in value: + arg_dict[key].append(v) + else: + arg_dict[key] = value +trainer = EnvTrainer(env_args) + +# Trainer as Trigger +if not env_args.not_call_launch: + import sys + sys.exit(0) + 
+print(f"Trainer effective env_args={env_args} local_rank={trainer.local_rank}", flush=True) + +checkpoints = env_args.pre_load_dir + +model_name = env_args.model_name + +env_args.enable_sft_conversations_dataset_v3 = True + + +print('*'*20, "model_name", model_name, flush=True) + +''' +auto_loader = AutoLoader( + "lm", + model_name=model_name, + model_dir=checkpoints, + only_download_config=True, +) +model = auto_loader.get_model() +tokenizer = auto_loader.get_tokenizer() +print('*'*20, "model", model) +trainer.pre_train(model) +print('*'*20, "model", model) + +''' + +cache_dir = os.path.join(checkpoints, model_name) +print('*'*20, "cache_dir", cache_dir) +tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir) +print('*'*20, "tokenizer", tokenizer) + +# avoid sync loading models in case of Mem OOM +if env_args.bmt_async_load: + import time + time.sleep(10*60*(trainer.local_rank%4)) + + +config_file = os.path.join(cache_dir, 'config.json') +from flagai.model.aquila_model import AQUILAModel +model = AQUILAModel.init_from_json(config_file=config_file) +print('*'*20, "model", model) + +## bmt_pre_load +checkpoint_path = os.path.join(cache_dir, "pytorch_model.bin") +if env_args.bmt_pre_load: + model.load_weights(checkpoint_path) + +trainer.pre_train(model) + +print('*'*20, "model", model, flush=True) + +assert env_args.enable_sft_dataset_dir is not None and \ + env_args.enable_sft_dataset_file is not None + +cur_dir = env_args.enable_sft_dataset_dir +jsonl_data = os.path.join(cur_dir, env_args.enable_sft_dataset_file) +max_seq_len = 2048 + +import jsonlines +import numpy as np +def read_file(): + conversations = [] + with jsonlines.open(jsonl_data) as reader: + for line in reader: + if 'chat_desc' not in line or 'instruction' not in line or 'conversations' not in line: + continue + obj = dict() + obj['chat_desc'] = line['chat_desc'] + obj['conversations'] = line['conversations'] + obj['instruction'] = line['instruction'] + conversations.append(obj) + 
return conversations + +class ConversationDataset(Dataset): + def __init__(self, conversations, tokenizer, maxlen=512): + super(ConversationDataset, self).__init__() + self.conversations = conversations + self.tokenizer = tokenizer + self.maxlen = maxlen + + def __getitem__(self, i): + chat_desc = self.conversations[i]['chat_desc'] + instruction = self.conversations[i]['instruction'] + conversations = self.conversations[i]['conversations'] + + # chat_desc + example = self.tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids'] + EOS_TOKEN = example[-1] + example = example[:-1] # remove eos + # instruction + instruction = self.tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids'] + instruction = instruction[1:-1] # remove bos & eos + example += instruction + + import copy + labels = copy.deepcopy(example) + + for conversation in conversations: + role = conversation['from'] + content = conversation['value'] + content = self.tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids'] + content = content[1:-1] # remove bos & eos + example += content + if role == 'gpt': + role_labels = copy.deepcopy(content) + else: + # masking + role_labels = [env_args.IGNORE_INDEX] * len(content) + labels += role_labels + + example.append(EOS_TOKEN) + labels.append(EOS_TOKEN) + + ## maxlen + example = example[:self.maxlen] + labels = labels[:self.maxlen] + + output = { + "input_ids": example, + "labels": labels, + } + return output + + def __len__(self): + return len(self.conversations) + + @staticmethod + def collate_fn(batch): + def padding(indice, max_length, pad_idx=0): + pad_indice = [ + item + [pad_idx] * max(0, max_length - len(item)) for item in indice + ] + return torch.tensor(pad_indice) + + input_ids = [data["input_ids"] for data in batch] + labels = [data["labels"] for data in batch] + max_length = max_seq_len + input_ids = padding(input_ids, max_length)[:,:max_length] + labels = padding(labels, max_length, 
pad_idx=env_args.IGNORE_INDEX)[:,:max_length] + + data = { + "input_ids": input_ids, + "labels": labels + } + return data + +conversations = read_file() +data_len = len(conversations) +#train_size = int(data_len * 0.95) +train_size = data_len +train_conversations = conversations[:train_size] + +train_dataset = ConversationDataset(train_conversations, + tokenizer=tokenizer, + maxlen=max_seq_len) + +trainer.do_train( + train_dataset=train_dataset, + valid_dataset=None, + collate_fn=ConversationDataset.collate_fn, + optimizer=None, + rank_split=False) \ No newline at end of file diff --git a/examples/aquila/aquila-code/bmtrain_mgpu.sh b/examples/Aquila/Aquila-code/bmtrain_mgpu.sh similarity index 100% rename from examples/aquila/aquila-code/bmtrain_mgpu.sh rename to examples/Aquila/Aquila-code/bmtrain_mgpu.sh diff --git a/examples/aquila/aquila-code/dist_trigger_docker.sh b/examples/Aquila/Aquila-code/dist_trigger_docker.sh similarity index 100% rename from examples/aquila/aquila-code/dist_trigger_docker.sh rename to examples/Aquila/Aquila-code/dist_trigger_docker.sh diff --git a/examples/aquila/aquila-code/generate_code.py b/examples/Aquila/Aquila-code/generate_code.py similarity index 82% rename from examples/aquila/aquila-code/generate_code.py rename to examples/Aquila/Aquila-code/generate_code.py index b9dae941..8b6fdb78 100755 --- a/examples/aquila/aquila-code/generate_code.py +++ b/examples/Aquila/Aquila-code/generate_code.py @@ -11,13 +11,7 @@ import random import numpy as np from flagai.model.predictor.predictor import Predictor -from pathlib import Path from flagai.data.tokenizer import Tokenizer -import torch.distributed as dist -import json -import json, datetime - -import os model_dir = "./checkpoints_in" device = "cuda" @@ -32,11 +26,6 @@ model = loader.get_model() tokenizer = loader.get_tokenizer() -# import pdb;pdb.set_trace() -# ckpt = torch.load('./checkpoints_in/aquilacode-7b-nv/pytorch_model.bin', map_location=torch.device('cpu')) -# # print(ckpt) 
-# model.load_state_dict(ckpt, strict=True) - model.eval() model.to(device) @@ -61,5 +50,5 @@ res = predictor.predict_generate_randomsample(prompt, out_max_length=max_length, top_p=0.95, - temperature=t0.7) + temperature=0.7) print(res) \ No newline at end of file diff --git a/examples/aquila/aquila-code/hostfile b/examples/Aquila/Aquila-code/hostfile similarity index 100% rename from examples/aquila/aquila-code/hostfile rename to examples/Aquila/Aquila-code/hostfile diff --git a/examples/Aquila/Aquila-pretrain/Aquila-pretrain-33B.yaml b/examples/Aquila/Aquila-pretrain/Aquila-pretrain-33B.yaml new file mode 100755 index 00000000..ca3e3c59 --- /dev/null +++ b/examples/Aquila/Aquila-pretrain/Aquila-pretrain-33B.yaml @@ -0,0 +1,10 @@ +batch_size: 10 +gradient_accumulation_steps: 1 +lr: 1.5e-4 +warm_up: 0.01 +save_interval: 1000 +log_interval: 10 +bmt_loss_scale: 131072 +save_optim: True +save_rng: True +eps: 1.e-8 \ No newline at end of file diff --git a/examples/Aquila/Aquila-pretrain/Aquila-pretrain.yaml b/examples/Aquila/Aquila-pretrain/Aquila-pretrain.yaml new file mode 100755 index 00000000..49ee411b --- /dev/null +++ b/examples/Aquila/Aquila-pretrain/Aquila-pretrain.yaml @@ -0,0 +1,10 @@ +batch_size: 10 +gradient_accumulation_steps: 1 +lr: 3.0e-4 +warm_up: 0.01 +save_interval: 1000 +log_interval: 10 +bmt_loss_scale: 131072 +save_optim: True +save_rng: True +eps: 1.e-8 \ No newline at end of file diff --git a/examples/aquila/aquila-pretrain/README_Aquila-7B.md b/examples/Aquila/Aquila-pretrain/README_Aquila-7B.md similarity index 78% rename from examples/aquila/aquila-pretrain/README_Aquila-7B.md rename to examples/Aquila/Aquila-pretrain/README_Aquila-7B.md index 727f508b..625836c8 100755 --- a/examples/aquila/aquila-pretrain/README_Aquila-7B.md +++ b/examples/Aquila/Aquila-pretrain/README_Aquila-7B.md @@ -1,7 +1,7 @@ license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement) -# Aquila-7B +# Aquila ## 简介/Overview 
Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Megatron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。 @@ -9,7 +9,6 @@ Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点 The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, swaps in a set of more efficient low-level operator implementations, and uses a tokenizer redesigned for Chinese-English bilingual support. It upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2 in the training process of Aquila. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data quality control and various training optimization methods, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports both Chinese and English knowledge, allows commercial licensing, and complies with domestic data regulations. - | 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM| | ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- | | [Aquila-7B](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx |0.xxx| 0.xxx| @@ -24,27 +23,31 @@ You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/ We also support [Huggingface](hflink) ## 模型细节/Model details -| Model | License | Commercial use? | Pretraining length [tokens] | Pretraining compute (GPU days) | GPU + +| Model | License | Commercial use? 
| GPU | Model link +| :---------------- | :------- | :-- |:-- | :-- | +| Aquila-7B | Apache 2.0 | ✅ | Nvidia-A100 | mhlink +| Aquila-33B | Apache 2.0 | ✅ | Nvidia-A100 | mhlink +| AquilaCode-7B-nv | Apache 2.0 | ✅ | Nvidia-A100 | mhlink +| AquilaCode-7B-ts | Apache 2.0 | ✅ | Tianshu-BI-V100 | mhlink +| AquilaChat-7B | Apache 2.0 | ✅ | Nvidia-A100 | mhlink 我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们升级了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。 -Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表: - -我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。 - +Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。Aquila tokenizer与其他tokenizer的参数对比见下表: We used a series of more efficient low-level operators to assist with model training, including methods referenced from [flash-attention](https://github.com/HazyResearch/flash-attention) and replacing some intermediate calculations, as well as using RMSNorm. Building upon this foundation, we applied the [BMtrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which utilizes methods such as data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpoint and operation fusion, and communication-computation overlap to optimize the model training process. -The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below: +The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. 
We used different tokenizers to extract ten thousand data samples from English, Chinese, and code data respectively, obtained the count of tokens for each sample, and also included it in the table. The parameters of this tokenizer are compared to those of other tokenizers in the table below: + -We used different tokenizers to extract ten thousand data samples from English, Chinese, and code data respectively, obtained the count of tokens for each sample, and also included it in the table. | 模型/Model | 词表大小/Vocab size | 说明/Note |英文平均tokens量/Avg tokens(English)| 中文平均tokens量/Avg tokens(Chinese)|代码平均tokens量/Avg tokens(code) | | ----- | ---- | ----- | ---- | ----- | ---- | @@ -55,11 +58,10 @@ We used different tokenizers to extract ten thousand data samples from English, ## 训练数据集/Training data -Aquila-7B训练使用了Pile,[RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), [Wikipedia](https://huggingface.co/datasets/wikipedia), [C4](https://huggingface.co/datasets/c4), 悟道、电子书、专利、百科、论坛, github数据等 +Aquila预训练使用了Pile,[RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), [Wikipedia](https://huggingface.co/datasets/wikipedia), [C4](https://huggingface.co/datasets/c4), 悟道中文数据集、电子书、专利、百科、论坛, github数据等 -The Aquila-7B model was pretrained on Pile,[RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), [Wikipedia](https://huggingface.co/datasets/wikipedia), [C4](https://huggingface.co/datasets/c4), wudao、e-book、Patent, encyclopedia, forum, github etc. +The Aquila-7B model was pretrained on Pile, [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), [Wikipedia](https://huggingface.co/datasets/wikipedia), [C4](https://huggingface.co/datasets/c4), the Wudao Chinese corpus, e-books, patents, encyclopedias, forums, GitHub data, etc. 
-![Screenshot](../img/data.jpg) ## 使用方式/How to use @@ -77,13 +79,20 @@ The Aquila-7B model was pretrained on Pile,[RedPajama-Data-1T](https://hugging | gradient_accumulation_steps | int | 在更新模型权重之前,要对多个小批次进行梯度计算的次数。主要应用于GPU显存较小的情况下,可以使用小的batch_size,通过梯度累积达到与大batch_size相同的效果; The number of mini-batches over which gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size combined with gradient accumulation gives the same effect as a larger batch_size | | lr | float | 指控制模型更新参数时的步长或速率。学习率过高可能导致模型不收敛,而学习率过低则可能导致训练时间过长或者陷入局部最优解; The step size or rate at which the model updates its parameters during training. A high learning rate may cause the model not to converge, while a low learning rate may result in long training times or being stuck in a local optimum | | warm_up | float | 初始学习率与原始学习率的比例; The ratio between the initial learning rate and the original learning rate -| save_interval | int | 模型保存的间隔,即每训练多少个iteration保存一次模型。当训练时间较长时,保存间隔可以避免因突然中断或出现错误导致训练成果全部丢失; The interval at which the model is saved, i.e., how often the model is saved per epoch during training. When training takes a long time, saving intervals can prevent all training achievements from being lost due to sudden interruptions or errors. | +| save_interval | int | 模型保存的间隔,即每训练多少个iteration保存一次模型。当训练时间较长时,保存间隔可以避免因突然中断或出现错误导致训练成果全部丢失; The checkpoint interval, i.e., the number of training iterations between two saves of the model. For long training runs, saving periodically prevents all progress from being lost after a sudden interruption or error. | * 我们的演示数据集放在`../indexed_dataset/data/demo_text_document`里。 如果想修改预训练数据集,可更改`aquila_pretrain.py`里的`data_prefix`参数; Our demo dataset is located in `../indexed_dataset/data/demo_text_document`. If you want to modify the pre-training dataset, you can change the `data_prefix` parameter in `aquila_pretrain.py`. 
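To make the interaction between `batch_size`, `gradient_accumulation_steps`, and `save_interval` concrete, here is a minimal sketch in plain Python. The helper names and the worker count of 8 are assumptions for illustration and are not part of FlagAI; the YAML values are the ones from `Aquila-pretrain.yaml`. The sketch also assumes plain data parallelism, where every worker processes its own mini-batch, and that checkpoints are written at exact multiples of `save_interval`.

```python
from typing import List

# Illustrative helpers only; these functions are hypothetical and do not exist in FlagAI.


def effective_batch_size(batch_size: int,
                         gradient_accumulation_steps: int,
                         num_workers: int = 1) -> int:
    """Number of samples that contribute to one optimizer step,
    assuming each data-parallel worker sees its own mini-batch."""
    return batch_size * gradient_accumulation_steps * num_workers


def checkpoint_iterations(total_iterations: int, save_interval: int) -> List[int]:
    """Iterations at which a checkpoint would be written, assuming one save
    every `save_interval` iterations, counted from 1."""
    return list(range(save_interval, total_iterations + 1, save_interval))


# batch_size, gradient_accumulation_steps, save_interval come from Aquila-pretrain.yaml;
# 8 workers and 3500 iterations are assumed example values.
print(effective_batch_size(batch_size=10, gradient_accumulation_steps=1, num_workers=8))
print(checkpoint_iterations(total_iterations=3500, save_interval=1000))
```

With these assumed values the first call prints `80` and the second prints `[1000, 2000, 3000]`. Doubling `gradient_accumulation_steps` while halving `batch_size` leaves the effective batch size unchanged, which is the trade-off the table describes for GPUs with limited memory.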
#### Step 2: 启动训练/Start training +对于Aquila-7B模型 / For the Aquila-7B model: ``` bash dist_trigger_docker.sh hostfile Aquila-pretrain.yaml aquila-7b [实验名] ``` + +对于Aquila-33B模型 / For the Aquila-33B model: +``` +bash dist_trigger_docker.sh hostfile Aquila-pretrain-33B.yaml aquila-33b [实验名] +``` + 接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. ![Screenshot](../img/info.jpg) @@ -103,13 +112,21 @@ bash dist_trigger_docker.sh hostfile Aquila-pretrain.yaml aquila-7b [实验名] #### Step 2: 启动可监督微调/Start SFT ``` -bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [实验名] +cd ../Aquila-sft/ +``` +对于Aquila-7B模型 / For the Aquila-7B model: +``` +bash dist_trigger_docker.sh hostfile Aquila-sft.yaml aquila-7b [实验名 experiment name] +``` +对于Aquila-33B模型 / For the Aquila-33B model: +``` +bash dist_trigger_docker.sh hostfile Aquila-sft.yaml aquila-33b [实验名 experiment name] ``` 接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. 
![Screenshot](../img/info.jpg) -成功训练之前能看到如下信息(具体参数可能不同); Before successful training, you may see the following information with parameters that may differ: +成功训练之前能在日志里看到如下信息(具体参数可能不同); Before successful training, you may see the following information in the log file with parameters that may differ: ![Screenshot](../img/info2.jpg) @@ -124,7 +141,7 @@ from flagai.data.tokenizer import Tokenizer import bminf state_dict = "./checkpoints_in/" -model_name = 'aquila-7b' +model_name = 'aquila-7b' # 'aquila-33b' loader = AutoLoader( "lm", @@ -154,7 +171,7 @@ with torch.no_grad(): ## 证书/License -Aquila-7B开源模型使用 [智源Aquila系列模型许可协议](linkhere), 原始代码基于[Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) +Aquila-7B和Aquila-33B开源模型使用 [智源Aquila系列模型许可协议](linkhere), 原始代码基于[Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) -Aquila-7B open-source model is licensed under [ BAAI Aquila Model Licence Agreement](linkhere). The source code is under [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) +Aquila-7B and Aquila-33B open-source model is licensed under [ BAAI Aquila Model Licence Agreement](linkhere). 
The source code is under [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) diff --git a/examples/aquila/aquila-pretrain/aquila_pretrain.py b/examples/Aquila/Aquila-pretrain/aquila_pretrain.py similarity index 96% rename from examples/aquila/aquila-pretrain/aquila_pretrain.py rename to examples/Aquila/Aquila-pretrain/aquila_pretrain.py index ce1162ec..58976322 100755 --- a/examples/aquila/aquila-pretrain/aquila_pretrain.py +++ b/examples/Aquila/Aquila-pretrain/aquila_pretrain.py @@ -15,8 +15,7 @@ #torch.autograd.set_detect_anomaly(True) -from examples.aquila.build_index_mappings import _build_train_valid_test_datasets -from examples.aquila.build_index_mappings import _build_train_valid_test_weighted_datasets +from flagai.data.datasets.indexed_dataset.build_index_mappings import _build_train_valid_test_datasets,_build_train_valid_test_weighted_datasets device = torch.device("cuda" if torch.cuda.is_available() else "cpu") diff --git a/examples/aquila/aquila-pretrain/bmtrain_mgpu.sh b/examples/Aquila/Aquila-pretrain/bmtrain_mgpu.sh similarity index 100% rename from examples/aquila/aquila-pretrain/bmtrain_mgpu.sh rename to examples/Aquila/Aquila-pretrain/bmtrain_mgpu.sh diff --git a/examples/aquila/aquila-pretrain/generate.py b/examples/Aquila/Aquila-pretrain/generate.py similarity index 92% rename from examples/aquila/aquila-pretrain/generate.py rename to examples/Aquila/Aquila-pretrain/generate.py index d82e3c42..229dbf6d 100755 --- a/examples/aquila/aquila-pretrain/generate.py +++ b/examples/Aquila/Aquila-pretrain/generate.py @@ -20,14 +20,12 @@ # from flagai.model.aquila_model import AQUILAModel # model = AQUILAModel.from # tokenizer = Tokenizer.from_pretrained('aquila-7b', cache_dir='./checkpoints_in/aquila-7b') -pl_sd = torch.load('../checkpoints_in/aquila-7b/pytorch_model.bin', map_location="cpu") -if "state_dict" in pl_sd: - sd = pl_sd["state_dict"] -else: - sd = pl_sd -model.load_state_dict(sd, strict=True) - -model.eval() +# pl_sd = 
torch.load('./checkpoints_in/aquila-7b/pytorch_model.bin', map_location="cpu") +# if "state_dict" in pl_sd: +# sd = pl_sd["state_dict"] +# else: +# sd = pl_sd +# model.load_state_dict(sd, strict=True) model.eval() model.half() @@ -36,7 +34,6 @@ model.cuda() - predictor = Predictor(model, tokenizer) texts = [ diff --git a/examples/aquila/aquila-pretrain/hostfile b/examples/Aquila/Aquila-pretrain/hostfile similarity index 100% rename from examples/aquila/aquila-pretrain/hostfile rename to examples/Aquila/Aquila-pretrain/hostfile diff --git a/examples/Aquila/Aquila-sft/Aquila-sft.yaml b/examples/Aquila/Aquila-sft/Aquila-sft.yaml new file mode 100755 index 00000000..2ec35729 --- /dev/null +++ b/examples/Aquila/Aquila-sft/Aquila-sft.yaml @@ -0,0 +1,13 @@ +epochs: 3 +batch_size: 4 +gradient_accumulation_steps: 1 +lr: 9.65e-6 +warm_up: 0.1 +save_interval: 1000 + +bmt_lr_decay_style: "linear" +bmt_cpu_offload: False + +bmt_pre_load: True +enable_sft_dataset_dir: './data/' +enable_sft_dataset_file: 'sft_samples.jsonl' diff --git a/examples/aquila/aquila-sft/README_AquilaChat-7B.md b/examples/Aquila/Aquila-sft/README_AquilaChat-7B.md similarity index 93% rename from examples/aquila/aquila-sft/README_AquilaChat-7B.md rename to examples/Aquila/Aquila-sft/README_AquilaChat-7B.md index 98d55e24..d06a116a 100755 --- a/examples/aquila/aquila-sft/README_AquilaChat-7B.md +++ b/examples/Aquila/Aquila-sft/README_AquilaChat-7B.md @@ -52,19 +52,12 @@ We used different tokenizers to extract ten thousand data samples from English, | gpt2_new_100k | 100000 | bpe|1575 | 477|1679 | - -模型在一台8卡Nvidia A100上训练8小时,总共对15万条数据训练了3个epoch。 - -The model was trained on an 8-card Nvidia A100 for 8 hours, and a total of 150,000 lines of data were trained for 3 epochs. - ## 训练数据集/Training data 我们采用了一系列高质量中英文数据集来训练和微调我们的对话语言模型,并且在不断更新迭代 We used a series of high-quality Chinese and English datasets to train and fine-tune our conversational language model, and continuously updated it through iterations. 
-![Screenshot](../img/data.jpg) - ## 使用方式/How to use @@ -178,12 +171,12 @@ with torch.no_grad(): ### 2. 可监督微调/Supervised Fine-tuning(SFT) #### Step 1: 配置模型/ Setup Checkpoints -在`./checkpoints_in`里新建`aquila-7b`目录。将微调后的checkpoint,以及原始`aquila-7b`模型里的其余文件,包括`config.json`, `mergex.txt`, `vocab.json`, `special_tokens_map.json`放进去 +在`./checkpoints_in`里新建`aquilachat-7b`目录。将微调后的checkpoint,以及原始`aquilachat-7b`模型里的其余文件,包括`config.json`, `mergex.txt`, `vocab.json`, `special_tokens_map.json`放进去 -Create a new directory named `aquila-7b` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquila-7b` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. +Create a new directory named `aquilachat-7b` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilachat-7b` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. #### Step 2: 修改参数/ Modify Parameters -* `cd /examples/aquila` +* `cd /examples/Aquila/Aquila-sft` * 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) * 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_sft.py` * (可选) 在`Aquila-sft.yaml`文件里更改参数 ; (optional) change parameters in `Aquila-sft.yaml` @@ -204,7 +197,7 @@ Create a new directory named `aquila-7b` inside `./checkpoints_in`. Place the fi #### Step 3: 启动可监督微调/Start SFT ``` -bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [实验名] +bash dist_trigger_docker.sh hostfile Aquila-sft.yaml aquilachat-7b [实验名] ``` 接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. 
Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. diff --git a/examples/Aquila/Aquila-sft/aquila_sft.py b/examples/Aquila/Aquila-sft/aquila_sft.py new file mode 100755 index 00000000..d85e3299 --- /dev/null +++ b/examples/Aquila/Aquila-sft/aquila_sft.py @@ -0,0 +1,270 @@ +# Copyright © 2022 BAAI. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License") +import os +import torch +from torch.utils.data import Dataset +import gc +gc.collect() +torch.cuda.empty_cache() +import sys;sys.path.append("/data2/yzd/workspace/FlagAI") +from flagai.auto_model.auto_loader import AutoLoader +from flagai.data.tokenizer import Tokenizer +from flagai.env_args import EnvArgs +from flagai.env_trainer_v1 import EnvTrainer +import jsonlines +import numpy as np +from examples.Aquila import cyg_conversation as conversation_lib +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + +# You can input all parameters by the command line. 
+# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch +env_args = EnvArgs( + env_type="bmtrain", + batch_size=1, + gradient_accumulation_steps=1, + lr=2e-4, + weight_decay=1e-3, + epochs=100, + log_interval=10, + eval_interval=5000, + num_gpus=1, + load_dir=None, + pytorch_device=device, + save_dir="checkpoints_aquila", + checkpoint_activations=False, + save_interval=5000, + fp16=True, + training_script=__file__, +) +env_args = env_args.parse_args() +#env_args.wandb = False + +# overwrite +if env_args.yaml_config: + import yaml + file_data = open(env_args.yaml_config, 'r', encoding="utf-8").read() + data = yaml.safe_load_all(file_data) + delattr(env_args, 'yaml_config') + arg_dict = env_args.__dict__ + for subdata in data: + for key, value in subdata.items(): + if isinstance(value, list): + for v in value: + arg_dict[key].append(v) + else: + arg_dict[key] = value +trainer = EnvTrainer(env_args) + +# Trainer as Trigger +if not env_args.not_call_launch: + import sys + sys.exit(0) + +print(f"Trainer effective env_args={env_args} local_rank={trainer.local_rank}", flush=True) + +checkpoints = env_args.pre_load_dir + +model_name = env_args.model_name + +print('*'*20, "model_name", model_name, flush=True) + +''' +auto_loader = AutoLoader( + "lm", + model_name=model_name, + model_dir=checkpoints, + only_download_config=True, +) +model = auto_loader.get_model() +tokenizer = auto_loader.get_tokenizer() +print('*'*20, "model", model) +trainer.pre_train(model) +print('*'*20, "model", model) + +''' + +cache_dir = os.path.join(checkpoints, model_name) +print('*'*20, "cache_dir", cache_dir) +tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir) +print('*'*20, "tokenizer", tokenizer) + +# avoid sync loading models in case of Mem OOM +if env_args.bmt_async_load: + import time + time.sleep(10*60*(trainer.local_rank%4)) + + +config_file = os.path.join(cache_dir, 'config.json') +from flagai.model.aquila_model import AQUILAModel +model = 
AQUILAModel.init_from_json(config_file=config_file) +print('*'*20, "model", model) + +## bmt_pre_load +checkpoint_path = os.path.join(cache_dir, "pytorch_model.bin") +if env_args.bmt_pre_load: + model.load_weights(checkpoint_path) + +trainer.pre_train(model) + +print('*'*20, "model", model, flush=True) + +assert env_args.enable_sft_dataset_dir is not None and \ + env_args.enable_sft_dataset_file is not None + +cur_dir = env_args.enable_sft_dataset_dir +jsonl_data = os.path.join(cur_dir, env_args.enable_sft_dataset_file) +jsonl_data_val = None +if env_args.enable_sft_dataset_val_file is not None: + jsonl_data_val = os.path.join(cur_dir, env_args.enable_sft_dataset_val_file) +max_seq_len = 2048 + + +def read_file(jsonl_file): + conversations = [] + with jsonlines.open(jsonl_file) as reader: + for line in reader: + conversations.append(line) + return conversations + + +def _add_speaker_and_signal(header, source, get_conversation=True): + + """Add speaker and start/end signal on each round.""" + BEGIN_SIGNAL = "### " + END_SIGNAL = "\n" + conversation = header + unknown_role = "unknown" # use default unknown role + roles = { + "human": conversation_lib.default_conversation.roles[0], # human role + "gpt": conversation_lib.default_conversation.roles[1], # gpt role + } + if "instruction" in source and source["instruction"] is not None and len(source["instruction"]) > 0: + source["instruction"] = ( + BEGIN_SIGNAL + + conversation_lib.default_conversation.roles[2] + + ": " + + source["instruction"] + + END_SIGNAL + ) + if get_conversation: + conversation += source["instruction"] + for sentence in source["conversations"]: + sentence_from = sentence["from"].lower() + sentence["value"] = ( + BEGIN_SIGNAL + + roles.get(sentence_from, unknown_role) + + ": " + + sentence["value"] + + END_SIGNAL + ) + if get_conversation: + conversation += sentence["value"] + return conversation + +class ConversationDatasetV2(Dataset): + def __init__(self, conversations, tokenizer, maxlen=512): + 
super(ConversationDatasetV2, self).__init__() + self.conversations = conversations + self.tokenizer = tokenizer + self.maxlen = maxlen + + def __getitem__(self, i): + header = f"{conversation_lib.default_conversation.system}\n\n" + source = self.conversations[i] + _add_speaker_and_signal(header, source) + + source["chat_desc"] = header + chat_desc = source['chat_desc'] + instruction = source['instruction'] + conversations = source['conversations'] + + # chat_desc + example = self.tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids'] + EOS_TOKEN = example[-1] + example = example[:-1] # remove eos + # instruction + instruction = self.tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids'] + instruction = instruction[1:-1] # remove bos & eos + example += instruction + + import copy + labels = copy.deepcopy(example) + + for conversation in conversations: + role = conversation['from'] + content = conversation['value'] + content = self.tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids'] + content = content[1:-1] # remove bos & eos + example += content + if role == 'gpt': + role_labels = copy.deepcopy(content) + else: + # masking + role_labels = [env_args.IGNORE_INDEX] * len(content) + labels += role_labels + + example.append(EOS_TOKEN) + labels.append(EOS_TOKEN) + + ## delete bos & eos + #example = example[1:-1] + #labels = labels[1:-1] + + ## maxlen + example = example[:self.maxlen] + labels = labels[:self.maxlen] + + output = { + "input_ids": example, + "labels": labels, + } + return output + + def __len__(self): + return len(self.conversations) + + @staticmethod + def collate_fn(batch): + def padding(indice, max_length, pad_idx=0): + pad_indice = [ + item + [pad_idx] * max(0, max_length - len(item)) for item in indice + ] + return torch.tensor(pad_indice) + + input_ids = [data["input_ids"] for data in batch] + labels = [data["labels"] for data in batch] + max_length = max_seq_len + input_ids = 
padding(input_ids, max_length)[:,:max_length] + labels = padding(labels, max_length, pad_idx=env_args.IGNORE_INDEX)[:,:max_length] + + data = { + "input_ids": input_ids, + "labels": labels + } + return data + +conversations = read_file(jsonl_data) +data_len = len(conversations) +#train_size = int(data_len * 0.95) +train_size = data_len +train_conversations = conversations[:train_size] + +train_dataset = ConversationDatasetV2(train_conversations, + tokenizer=tokenizer, + maxlen=max_seq_len) +#print(f"train_dataset \n {train_dataset[0]}") + +valid_dataset = None +if jsonl_data_val is not None: + conversations_val = read_file(jsonl_data_val) + valid_dataset = ConversationDatasetV2(conversations_val, + tokenizer=tokenizer, + maxlen=max_seq_len) + +trainer.do_train( + train_dataset=train_dataset, + valid_dataset=valid_dataset, + collate_fn=ConversationDatasetV2.collate_fn, + optimizer=None, + rank_split=False) diff --git a/examples/aquila/aquila-sft/bmtrain_mgpu.sh b/examples/Aquila/Aquila-sft/bmtrain_mgpu.sh similarity index 98% rename from examples/aquila/aquila-sft/bmtrain_mgpu.sh rename to examples/Aquila/Aquila-sft/bmtrain_mgpu.sh index 236edd89..26854c25 100755 --- a/examples/aquila/aquila-sft/bmtrain_mgpu.sh +++ b/examples/Aquila/Aquila-sft/bmtrain_mgpu.sh @@ -1,6 +1,6 @@ # Defined by User export TRIGGER_FILE=bmtrain_mgpu.sh -export SCRIPT_FILE=aquila_code.py +export SCRIPT_FILE=aquila_sft.py # ENVS export PROJ_HOME=$PWD diff --git a/examples/Aquila/Aquila-sft/data/sft_samples.jsonl b/examples/Aquila/Aquila-sft/data/sft_samples.jsonl new file mode 100644 index 00000000..01cb29db --- /dev/null +++ b/examples/Aquila/Aquila-sft/data/sft_samples.jsonl @@ -0,0 +1,3 @@ +{"id": "sft_v0.8.9_val.jsonl_12", "conversations": [{"from": "human", "value": "你能画一张细致的数字化的小黄瓜拉瑞吗?"}, {"from": "gpt", "value": "作为一款文本生成模型,我并不具备图像生成能力。但我可以尝试为您调用文生图模型进行图像生成。\n#text2image#\n拉里黄瓜戴着白色长袍,宏头镜头,D&D,奇幻,精细复杂,边缘光线,神圣光芒,体积光照明,优雅,高度详细,数字绘画,ArtStation,概念艺术,光滑,清晰聚焦,插图,正统,Greg Rutkowski、Maxfield 
Parrish和Alphonse Mucha的作品,新艺术风格,塔罗牌。"}], "instruction": "", "raw": " #用户#你能画一张细致的数字化的小黄瓜拉瑞吗? #ai助手#作为一款文本生成模型,我并不具备图像生成能力。但我可以尝试为您调用文生图模型进行图像生成。\n#text2image#\n拉里黄瓜戴着白色长袍,宏头镜头,D&D,奇幻,精细复杂,边缘光线,神圣光芒,体积光照明,优雅,高度详细,数字绘画,ArtStation,概念艺术,光滑,清晰聚焦,插图,正统,Greg Rutkowski、Maxfield Parrish和Alphonse Mucha的作品,新艺术风格,塔罗牌。"} +{"id": "sft_v0.8.9_val.jsonl_39", "conversations": [{"from": "human", "value": "Change the girl in the photograph's eye color to brown and make her skin color appear pink."}, {"from": "gpt", "value": "作为一款文本生成模型,我并不具备图像编辑能力。但我可以尝试为您调用多步图像编辑模型进行图像编辑。\n#InstructFace#\nChange the girl's eye color to brown.\nMake the girl's skin color appear pink."}], "instruction": "", "raw": " #用户#Change the girl in the photograph's eye color to brown and make her skin color appear pink. #ai助手#作为一款文本生成模型,我并不具备图像编辑能力。但我可以尝试为您调用多步图像编辑模型进行图像编辑。\n#InstructFace#\nChange the girl's eye color to brown.\nMake the girl's skin color appear pink."} +{"id": "sft_v0.8.9_val.jsonl_0", "conversations": [{"from": "human", "value": "对下列文本进行摘要,摘要长度为两个句子: “最新的医学研究表明,服用维生素C可以缓解感冒症状,并缩短感冒的持续时间。但需要注意的是,过量服用维生素C会造成不良反应。”"}, {"from": "gpt", "value": "医学研究显示,维生素C可以减轻感冒症状并缩短持续时间。不过,应注意避免过量摄入维生素C导致的不良反应。"}], "instruction": "", "raw": " #用户#对下列文本进行摘要,摘要长度为两个句子: “最新的医学研究表明,服用维生素C可以缓解感冒症状,并缩短感冒的持续时间。但需要注意的是,过量服用维生素C会造成不良反应。” #ai助手#医学研究显示,维生素C可以减轻感冒症状并缩短持续时间。不过,应注意避免过量摄入维生素C导致的不良反应。"} diff --git a/examples/aquila/aquila-sft/dist_trigger_docker.sh b/examples/Aquila/Aquila-sft/dist_trigger_docker.sh similarity index 100% rename from examples/aquila/aquila-sft/dist_trigger_docker.sh rename to examples/Aquila/Aquila-sft/dist_trigger_docker.sh diff --git a/examples/aquila/aquila-sft/generate_sft.py b/examples/Aquila/Aquila-sft/generate_sft.py similarity index 63% rename from examples/aquila/aquila-sft/generate_sft.py rename to examples/Aquila/Aquila-sft/generate_sft.py index a5d720fb..7d7f453b 100755 --- a/examples/aquila/aquila-sft/generate_sft.py +++ b/examples/Aquila/Aquila-sft/generate_sft.py @@ -18,8 +18,6 @@ 
model = loader.get_model() tokenizer = loader.get_tokenizer() cache_dir = os.path.join(state_dict, model_name) -# tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir) -#print('*'*20, "tokenizer", tokenizer) model.eval() model.half() @@ -29,42 +27,12 @@ predictor = Predictor(model, tokenizer) - -texts = [ - #"I am ", - #"1月7日,五华区召开“中共昆明市五华区委十届三次全体(扩大)会议”,", - #"1月7日,五华区召开“中共昆明市五华区委十届三次全体(扩大)会议”,区委书记金幼和作了《深入学习贯彻党的十八大精神,奋力开创五华跨越发展新局面》的工作报告。", - "拥有美丽身材是大多数女人追求的梦想,甚至有不少mm为了实现这个梦而心甘情愿付出各种代价,", - "2007年乔布斯向人们展示iPhone并宣称它将会改变世界", - "从前有座山,", - "如何摆脱无效焦虑?", - "北京在哪儿?", - #"北京", - "汽车EDR是什么", - "My favorite animal is", - "今天天气不错", - "如何评价许嵩?", - "汽车EDR是什么", - "给妈妈送生日礼物,怎么选好?", - "1加1等于18497是正确的吗?", - "如何给汽车换胎?", - "以初春、黄山为题,做一首诗。", - "What is machine learning?", - #"Machine learning is", - #"Nigerian billionaire Aliko Dangote says he is planning a bid to buy the UK Premier League football club.", - #"The capital of Germany is the city of ", - ] - texts = [ "北京为什么是中国的首都?", "1+1=", "为什么湘菜那么甜?", "东三省和海南岛的区别?", ] - -texts = [ - "Ask the user for their name and say 'Hello'" - ] ## def pack_obj(text): obj = dict() @@ -128,21 +96,16 @@ def convo_tokenize(convo_obj, tokenizer): print('-'*80) print(f"text is {text}") - from examples.aquila.cyg_conversation import default_conversation + from examples.Aquila.cyg_conversation import default_conversation conv = default_conversation.copy() conv.append_message(conv.roles[0], text) conv.append_message(conv.roles[1], None) - #print(conv.get_prompt()) tokens = tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids'] tokens = tokens[1:-1] - #print(f"tokens \n {tokens}") with torch.no_grad(): - #out = predictor.predict_generate_randomsample(text, out_max_length=200,top_p=0.95) - #out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0) - #out = llama_generate(tokenizer, model, [text], max_gen_len:=200, temperature=0, prompts_tokens=[tokens]) out = 
aquila_generate(tokenizer, model, [text], max_gen_len:=200, top_p=0.95, prompts_tokens=[tokens]) print(f"pred is {out}") diff --git a/examples/aquila/aquila-sft/hostfile b/examples/Aquila/Aquila-sft/hostfile similarity index 100% rename from examples/aquila/aquila-sft/hostfile rename to examples/Aquila/Aquila-sft/hostfile diff --git a/examples/aquila/cyg_conversation.py b/examples/Aquila/cyg_conversation.py similarity index 100% rename from examples/aquila/cyg_conversation.py rename to examples/Aquila/cyg_conversation.py diff --git a/examples/aquila/dist_trigger_docker.sh b/examples/Aquila/dist_trigger_docker.sh similarity index 100% rename from examples/aquila/dist_trigger_docker.sh rename to examples/Aquila/dist_trigger_docker.sh diff --git a/examples/aquila/img/data.jpg b/examples/Aquila/img/data.jpg similarity index 100% rename from examples/aquila/img/data.jpg rename to examples/Aquila/img/data.jpg diff --git a/examples/aquila/img/info.jpg b/examples/Aquila/img/info.jpg similarity index 100% rename from examples/aquila/img/info.jpg rename to examples/Aquila/img/info.jpg diff --git a/examples/aquila/img/info2.jpg b/examples/Aquila/img/info2.jpg similarity index 100% rename from examples/aquila/img/info2.jpg rename to examples/Aquila/img/info2.jpg diff --git a/examples/aquila/img/merged_platform.jpg b/examples/Aquila/img/merged_platform.jpg similarity index 100% rename from examples/aquila/img/merged_platform.jpg rename to examples/Aquila/img/merged_platform.jpg diff --git a/examples/aquila/aquila-code/Aquila-sft-code.yaml b/examples/aquila/aquila-code/Aquila-sft-code.yaml deleted file mode 100755 index d4e8a37c..00000000 --- a/examples/aquila/aquila-code/Aquila-sft-code.yaml +++ /dev/null @@ -1,20 +0,0 @@ -# comments -batch_size: 10 -gradient_accumulation_steps: 1 -lr: 2.e-4 -warm_up: 0.001 -save_interval: 500 - -bmt_cpu_offload: False -bmt_pre_load: False -bmt_async_load: False -bmt_loss_scale: 65536 - -save_optim: True -save_rng: True - -load_optim: False - 
-enable_sft_conversations_dataset: True -enable_sft_dataset_dir: './datasets/' -enable_sft_dataset_file: 'convo_v2.jsonl' diff --git a/examples/aquila/aquila-code/README_AquilaCode-7B-ts.md b/examples/aquila/aquila-code/README_AquilaCode-7B-ts.md deleted file mode 100755 index 39651745..00000000 --- a/examples/aquila/aquila-code/README_AquilaCode-7B-ts.md +++ /dev/null @@ -1,168 +0,0 @@ -license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement) - - -# AquilaCode-7B-tianshu - -## 简介/Overview -Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Magtron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。 - -The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replacing a batch of more efficient underlying operator implementations and redesigning the tokenizer for Chinese-English bilingual support. It upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Magtron+DeepSpeed ZeRO-2 in the training process of Aquila. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data quality control and various training optimization methods, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports Chinese-English-Knowledge, commercial licensing, and complies with domestic data regulations. - - -AquilaCode-7B-tianshu是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下 - -AquilaCode-7B-tianshu is a foundational code model obtained by further pretraining on code data based on the Aquila-7B model. It was developed by Beijing Academy of Artificial Intelligence. 
The evaluation results on mainstream benchmark datasets are as follows: - -| 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM| -| ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- | -| [AquilaCode-7B-tianshu](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx | - -您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标 - -You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details - - - -我们的模型也同时支持[Huggingface平台](hflink) - -We also support [Huggingface](hflink) - - -## 模型细节/Model details - -我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。 - -Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表: - -我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。 - - -We used a series of more efficient low-level operators to assist with model training, including methods referenced from [flash-attention](https://github.com/HazyResearch/flash-attention) and replacing some intermediate calculations, as well as using RMSNorm. Building upon this foundation, we applied the [BMtrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which utilizes methods such as data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpoint and operation fusion, and communication-computation overlap to optimize the model training process. - -The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. 
The parameters of this tokenizer are compared to those of other tokenizers in the table below: - -We sampled ten thousand examples each from English, Chinese, and code data, tokenized them with each tokenizer, and recorded the average number of tokens per sample in the table. - -| 模型/Model | 词表大小/Vocab size | 说明/Note |英文平均tokens量/Avg tokens(English)| 中文平均tokens量/Avg tokens(Chinese)|代码平均tokens量/Avg tokens(code) | -| ----- | ---- | ----- | ---- | ----- | ---- | -| gpt2 | 50527 | bpe|1717 | 1764|2323 | -| llama | 32000 | sp(bpe)|1805| 1257|1970 | -| gpt2_new_100k | 100000 | bpe|1575 | 477|1679 | - -模型在32台8卡天数显卡上训练9天,数据集规模为750亿。 - -The model was trained for 9 days on 32 machines with 8 Tianshu GPUs each, and the training set contains 75B tokens. - - - -## 训练数据集/Training data -AquilaCode-7B-tianshu训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的Python, jupyter-scripts, jupyter-structured-text数据 - -The AquilaCode-7B-tianshu model was trained on the Python, jupyter-scripts, and jupyter-structured-text subsets of [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata). - -![Screenshot](../img/data.jpg) - -## 使用方式/How to use - -### 1. 
推断/Inference - -```python -import torch -import os -import argparse -import sys -from flagai import mpu -from flagai.auto_model.auto_loader import AutoLoader -import numpy as np -from flagai.model.predictor.predictor import Predictor -from pathlib import Path -from flagai.data.tokenizer import Tokenizer -import time -import torch.distributed as dist -import json, datetime - -model_dir = "./checkpoints_in" -device = "cuda" - -print(f"building model...") -loader = AutoLoader("lm", model_name="aquilacode-7b-ts", - only_download_config=True, - use_cache=True, - fp16=True, - model_dir=model_dir) - -model = loader.get_model() -tokenizer = loader.get_tokenizer() - -model.eval() - -model.to(device) - -vocab = tokenizer.get_vocab() - -id2word = {v:k for k, v in vocab.items()} - -predictor = Predictor(model, tokenizer) - -max_new_tokens = 256 - -test_file = "./datasets/code_test.txt" -with open(test_file) as fin: - prompt = '\n'+fin.read()+'\n' - -input_ids = tokenizer.encode_plus_non_glm(prompt)["input_ids"][:-1] -input_length = len(input_ids) - -max_length = input_length+max_new_tokens -with torch.no_grad(): - res = predictor.predict_generate_randomsample(prompt, - out_max_length=max_length, - top_p=0.95, - temperature=0.7) - print(res) -``` - -### 2. 可监督微调/Supervised Fine-tuning(SFT) -#### Step 1: 配置模型/ Setup Checkpoints -在`./checkpoints_in`里新建`aquilacode-7b-ts`目录。将微调后的checkpoint,以及原始`aquilacode-7b-ts`模型里的其余文件,包括`config.json`, `mergex.txt`, `vocab.json`, `special_tokens_map.json`放进去 - -Create a new directory named `aquilacode-7b-ts` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-ts` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. 
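The directory layout described in Step 1 can be checked before launching a run. The following is a minimal sketch, not part of FlagAI: the helper `missing_checkpoint_files` and the constant `REQUIRED_FILES` are hypothetical names introduced here for illustration, and the file names are copied from the README text exactly as written (including `mergex.txt`), so adjust them if your checkpoint ships different files.

```python
import os
import tempfile

# File names as listed in the README text; treat this list as an assumption
# and adjust it to match the files your checkpoint actually contains.
REQUIRED_FILES = ("config.json", "mergex.txt", "vocab.json", "special_tokens_map.json")


def missing_checkpoint_files(checkpoints_root, model_name, required=REQUIRED_FILES):
    """Return the required file names that are absent from <checkpoints_root>/<model_name>."""
    model_dir = os.path.join(checkpoints_root, model_name)
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    # Demo in a throwaway directory so no real checkpoint is touched.
    with tempfile.TemporaryDirectory() as root:
        model_dir = os.path.join(root, "aquilacode-7b-ts")
        os.makedirs(model_dir)
        open(os.path.join(model_dir, "config.json"), "w").close()
        print(missing_checkpoint_files(root, "aquilacode-7b-ts"))
```

An empty list means every listed file is present; a missing model directory reports all files as missing rather than raising, which keeps the check usable as a simple pre-flight step.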
- -#### Step 2: 修改参数/Modify Parameters -* `cd /examples/aquila` -* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) -* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft_code.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_sft_code.py` -* (可选) 在`Aquila-sft-code.yaml`文件里更改参数 ; (optional) change parameters in `Aquila-sft-code.yaml` - -| 参数名 Parameter | 类型 Type | 描述 Description | -|--------------------------------|------------|-------------------------------------------------------| -| batch_size | int | 每次迭代训练时,从数据集中抽取的样本数。一般来说,它越大,处理速度越快,但会占用更多的内存; The number of samples extracted from the dataset for each iteration during training. Generally, a larger batch size can speed up processing but may also consume more memory | -| gradient_accumulation_steps | int | 在更新模型权重之前,要对多个小批次进行梯度计算的次数。主要应用于GPU显存较小的情况下,可以使用小的batch_size,通过梯度累积达到与大batch_size相同的效果; The number of mini-batches whose gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size combined with gradient accumulation gives the same effect as a large batch_size | -| lr | float | 指控制模型更新参数时的步长或速率。学习率过高可能导致模型不收敛,而学习率过低则可能导致训练时间过长或者陷入局部最优解; The step size or rate at which the model updates its parameters during training. A high learning rate may cause the model not to converge, while a low learning rate may result in long training times or being stuck in a local optimum | -| warm_up | float | 初始学习率与原始学习率的比例; The ratio between the initial learning rate and the original learning rate | -| save_interval | int | 模型保存的间隔,即每训练多少个iteration保存一次模型。当训练时间较长时,保存间隔可以避免因突然中断或出现错误导致训练成果全部丢失; The interval at which the model is saved, i.e., the number of training iterations between checkpoints. When training takes a long time, saving at intervals prevents all progress from being lost due to sudden interruptions or errors. 
| -| enable_sft_conversations_dataset_v3 | bool | 数据处理方式; Data preprocessing method | -| enable_sft_dataset_dir | str | 可监督微调的数据集目录; Dataset directory of SFT dataset | -| enable_sft_dataset_file | str | 可监督微调的数据集文件名; Filename of SFT dataset | | - - -#### Step 3: 启动可监督微调/Start SFT -``` -bash dist_trigger_docker.sh hostfile aquila-sft.yaml AquilaCode-7B-ts [实验名] -``` -接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. - -![Screenshot](../img/info.jpg) - -成功训练之前能看到如下信息(具体参数可能不同); Before successful training, you may see the following information with parameters that may differ: - -![Screenshot](../img/info2.jpg) - -## 证书/License - -AquilaCode-7B-TS开源模型使用 [智源Aquila系列模型许可协议](linkhere), 原始代码基于[Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) - - -AquilaCode-7B-TS open-source model is licensed under [ BAAI Aquila Model Licence Agreement](linkhere). The source code is under [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) \ No newline at end of file diff --git a/examples/aquila/aquila-code/aquila_code.py b/examples/aquila/aquila-code/aquila_code.py deleted file mode 100755 index e0cbc222..00000000 --- a/examples/aquila/aquila-code/aquila_code.py +++ /dev/null @@ -1,158 +0,0 @@ -# Copyright © 2022 BAAI. All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License") -import os -import torch -from torch.utils.data import Dataset -import gc -gc.collect() -torch.cuda.empty_cache() -import sys;sys.path.append("/data2/yzd/workspace/FlagAI") -from flagai.auto_model.auto_loader import AutoLoader -from flagai.data.tokenizer import Tokenizer -from flagai.env_args import EnvArgs -from flagai.env_trainer_v1 import EnvTrainer - -#torch.autograd.set_detect_anomaly(True) - -from examples.aquila.build_index_mappings import _build_train_valid_test_datasets -from examples.aquila.build_index_mappings import _build_train_valid_test_weighted_datasets - -device = torch.device("cuda" if torch.cuda.is_available() else "cpu") - -# You can input all parameters by the command line. -# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch -env_args = EnvArgs( - env_type="bmtrain", - experiment_name="aquila", - batch_size=1, - gradient_accumulation_steps=1, - lr=2e-4, - weight_decay=1e-3, - epochs=100, - log_interval=10, - eval_interval=5000, - num_gpus=1, - load_dir=None, - pytorch_device=device, - save_dir="checkpoints_aquila", - checkpoint_activations=False, - save_interval=5000, - fp16=True, - training_script=__file__, -) -env_args = env_args.parse_args() -#env_args.wandb = False - -# overwrite -if env_args.yaml_config: - import yaml - file_data = open(env_args.yaml_config, 'r', encoding="utf-8").read() - data = yaml.load_all(file_data) - delattr(env_args, 'yaml_config') - arg_dict = env_args.__dict__ - for subdata in data: - for key, value in subdata.items(): - if isinstance(value, list): - for v in value: - arg_dict[key].append(v) - else: - arg_dict[key] = value -trainer = EnvTrainer(env_args) - -# Trainer as Trigger -if not env_args.not_call_launch: - import sys - sys.exit(0) - -print(f"Trainer effective env_args={env_args} local_rank={trainer.local_rank}", flush=True) - -checkpoints = env_args.pre_load_dir - -model_name = env_args.model_name - 
-env_args.enable_sft_conversations_dataset_v3 = True - - -print('*'*20, "model_name", model_name, flush=True) - -''' -auto_loader = AutoLoader( - "lm", - model_name=model_name, - model_dir=checkpoints, - only_download_config=True, -) -model = auto_loader.get_model() -tokenizer = auto_loader.get_tokenizer() -print('*'*20, "model", model) -trainer.pre_train(model) -print('*'*20, "model", model) - -''' - -cache_dir = os.path.join(checkpoints, model_name) -print('*'*20, "cache_dir", cache_dir) -tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir) -print('*'*20, "tokenizer", tokenizer) - -# avoid sync loading models in case of Mem OOM -if env_args.bmt_async_load: - import time - time.sleep(10*60*(trainer.local_rank%4)) - - -config_file = os.path.join(cache_dir, 'config.json') -from flagai.model.aquila_model import AQUILAModel -model = AQUILAModel.init_from_json(config_file=config_file) -print('*'*20, "model", model) - -## bmt_pre_load -checkpoint_path = os.path.join(cache_dir, "pytorch_model.bin") -if env_args.bmt_pre_load: - model.load_weights(checkpoint_path) - -trainer.pre_train(model) - -print('*'*20, "model", model, flush=True) - - -## Use Prebuilt DataSets -data_prefix = '../indexed_dataset/data/demo_text_document' -data_impl = 'mmap' -splits_string = '90,10' -train_valid_test_num_samples = [90, 10] -seq_length = 1024 -seed = 2023 -skip_warmup = True - -train_dataset, valid_dataset, _ = _build_train_valid_test_datasets( - data_prefix, data_impl, splits_string, - train_valid_test_num_samples, - seq_length, seed, skip_warmup) -print("Total train_dataset: ", len(train_dataset), flush=True) -print("Total valid_dataset: ", len(valid_dataset), flush=True) - -def collate_fn(batch): - def padding(indice, max_length, pad_idx=tokenizer.token_end_id): - pad_indice = [ - item.tolist() + [pad_idx] * max(0, max_length - len(item.tolist())) for item in indice - ] - return torch.tensor(pad_indice) - - input_ids = [data["input_ids"] for data in batch] - 
max_length = max([len(t) for t in input_ids]) - input_ids = padding(input_ids, max_length)[:,:seq_length] - - data = { - "input_ids": input_ids, - "labels": input_ids - } - return data - -trainer.do_train( - train_dataset=train_dataset, - valid_dataset=None, - collate_fn=collate_fn, - optimizer=None, - rank_split=False) diff --git a/examples/aquila/aquila-pretrain/Aquila-pretrain.yaml b/examples/aquila/aquila-pretrain/Aquila-pretrain.yaml deleted file mode 100755 index 7f526eed..00000000 --- a/examples/aquila/aquila-pretrain/Aquila-pretrain.yaml +++ /dev/null @@ -1,11 +0,0 @@ -# comments -batch_size: 8 -gradient_accumulation_steps: 1 -lr: 6.0e-5 -warm_up: 0.15 - -bmt_cpu_offload: False -bmt_pre_load: False - -save_optim: True -save_rng: True diff --git a/examples/aquila/aquila-pretrain/README_Aquila-33B.md b/examples/aquila/aquila-pretrain/README_Aquila-33B.md deleted file mode 100755 index b88f9be7..00000000 --- a/examples/aquila/aquila-pretrain/README_Aquila-33B.md +++ /dev/null @@ -1,155 +0,0 @@ -license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement) - - -# AquilaChat-33B - -## 简介/Overview -Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Magtron+DeepSpeed zero-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。 - -The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replacing a batch of more efficient underlying operator implementations and redesigning the tokenizer for Chinese-English bilingual support. It upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Magtron+DeepSpeed ZeRO-2 in the training process of Aquila. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. 
Through data quality control and various training optimization methods, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports Chinese-English-Knowledge, commercial licensing, and complies with domestic data regulations. - -Aquila-33B模型由智源研究院研发,其在主流评测数据集上的评测结果如下 - -AquilaChat-33B model was developed by Beijing Academy of Artificial Intelligence. The evaluation results on mainstream benchmark datasets are as follows: - -| 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM| -| ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- | -| [Acuila-33B](https://model.baai.ac.cn/model-detail/xxxxx) | 0.292 | 0.385|0.269 | 0.731|0.347 |0.939| 0.443| -| [BiLLa-7B-LLM](https://model.baai.ac.cn/model-detail/xxxxx) | 0.279 | 0.374|0.257 | 0.76|0.205 |0.864| 0.514| -| [Ziya-LLaMA-13B-v1](https://model.baai.ac.cn/model-detail/xxxxx) | 0.273 | 0.404|0.406 | 0.786|0.284 |0.762| 0.191| - -您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标 - -You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details - - - -我们的模型也同时支持[Huggingface平台](hflink) - -We also support [Huggingface](hflink) - - - -## 模型细节/Model details - -我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。 - -Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表: - -我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。 - - -We used a series of more efficient low-level operators to assist with model training, including methods referenced from [flash-attention](https://github.com/HazyResearch/flash-attention) and replacing some intermediate calculations, as well as using RMSNorm. 
Building upon this foundation, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses techniques such as data parallelism, ZeRO (Zero Redundancy Optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the model training process. - -We trained the Aquila tokenizer from scratch; it supports both English and Chinese. The table below compares its parameters with those of other tokenizers: - -We sampled ten thousand examples each from English, Chinese, and code data, tokenized them with each tokenizer, and recorded the average number of tokens per sample in the table. - -| 模型/Model | 词表大小/Vocab size | 说明/Note |英文平均tokens量/Avg tokens(English)| 中文平均tokens量/Avg tokens(Chinese)|代码平均tokens量/Avg tokens(code) | -| ----- | ---- | ----- | ---- | ----- | ---- | -| gpt2 | 50527 | bpe|1717 | 1764|2323 | -| llama | 32000 | sp(bpe)|1805| 1257|1970 | -| gpt2_new_100k | 100000 | bpe|1575 | 477|1679 | - - - -## 训练数据集/Training data - -我们采用了一系列高质量中英文数据集来训练和微调我们的对话语言模型,并且在不断更新迭代 - -We used a series of high-quality Chinese and English datasets to train and fine-tune our conversational language model, and we continue to update them iteratively. - -![Screenshot](../img/data.jpg) - - -## 使用方式/How to use - -### 1. 
预训练/Pre-training -#### Step 1: 修改参数/Modify Parameters - -* `cd /examples/aquila/aquila-pretrain` -* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) -* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_pretrain.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_pretrain.py` -* 在`Aquila-pretrain.yaml`文件里更改参数 (可选) -* 我们的演示数据集放在`../indexed_dataset/data/demo_text_document`里。 如果想修改预训练数据集,可更改`aquila_pretrain.py`里的`data_prefix`参数 -#### Step 2: 启动训练/Start training -``` -bash dist_trigger_docker.sh hostfile aquila-pretrain.yaml aquila-30b [实验名] -``` -接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. - -![Screenshot](../img/info.jpg) - -成功训练之前能看到如下信息(具体参数可能不同); Before successful training, you may see the following information with parameters that may differ: - -![Screenshot](../img/info2.jpg) - -### 2. 
可监督微调/Supervised Fine-tuning(SFT) -#### Step 1: 修改参数/Modify Parameters -* `cd /examples/aquila/aquila-pretrain` -* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file; see [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) -* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_pretrain.py`; Configure the `bmtrain_mgpu.sh` file and change `SCRIPT_FILE` to `aquila_pretrain.py` -* (可选) 在`Aquila-pretrain.yaml`文件里更改参数 ; (Optional) Change parameters in `Aquila-pretrain.yaml` - -| 参数名 Parameter | 类型 Type | 描述 Description | -|--------------------------------|------------|-------------------------------------------------------| -| batch_size | int | 每次迭代训练时,从数据集中抽取的样本数。一般来说,它越大,处理速度越快,但会占用更多的内存; The number of samples drawn from the dataset for each training iteration. Generally, a larger batch size speeds up processing but consumes more memory | -| gradient_accumulation_steps | int | 在更新模型权重之前,要对多个小批次进行梯度计算的次数。主要应用于GPU显存较小的情况下,可以使用小的batch_size,通过梯度累积达到与大batch_size相同的效果; The number of mini-batches over which gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size can be used, and gradient accumulation achieves the same effect as a large batch_size | -| lr | float | 指控制模型更新参数时的步长或速率。学习率过高可能导致模型不收敛,而学习率过低则可能导致训练时间过长或者陷入局部最优解; The step size or rate at which the model updates its parameters during training. A learning rate that is too high may prevent the model from converging, while one that is too low may lead to long training times or getting stuck in a local optimum | -| warm_up | float | 初始学习率与原始学习率的比例; The ratio between the initial learning rate and the original learning rate | -| save_interval | int | 模型保存的间隔,即每训练多少个iteration保存一次模型。当训练时间较长时,保存间隔可以避免因突然中断或出现错误导致训练成果全部丢失; The interval at which the model is saved, i.e., the number of training iterations between saves. 
When training takes a long time, saving intervals can prevent all training achievements from being lost due to sudden interruptions or errors. - -#### Step 2: 启动微调 -``` -bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [实验名] -``` -接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. -![Screenshot](../img/info.jpg) - -成功训练之前能看到如下信息(具体参数可能不同); Before successful training, you may see the following information with parameters that may differ: - -![Screenshot](../img/info2.jpg) -### 推理/Inference - -```python -import os -import torch -from flagai.auto_model.auto_loader import AutoLoader -from flagai.model.predictor.predictor import Predictor -from flagai.data.tokenizer import Tokenizer -import bminf - -state_dict = "./checkpoints_in/" -model_name = 'aquila-33b' - -loader = AutoLoader( - "lm", - model_dir=state_dict, - model_name=model_name, - use_cache=True) -model = loader.get_model() -tokenizer = loader.get_tokenizer() - -model.eval() -model.half() -model.cuda() - -predictor = Predictor(model, tokenizer) - -text = "北京在哪儿?" -text = f'{text}' -print(f"text is {text}") -with torch.no_grad(): - out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0) - print(f"pred is {out}") - -``` - - - -## 证书/License - -Aquila-33B开源模型使用 [智源Aquila系列模型许可协议](linkhere), 原始代码基于[Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) - - -Aquila-33B open-source model is licensed under [ BAAI Aquila Model Licence Agreement](linkhere). 
The source code is under [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) \ No newline at end of file diff --git a/examples/aquila/aquila-pretrain/aquila_sft.py b/examples/aquila/aquila-pretrain/aquila_sft.py deleted file mode 100755 index 829239bf..00000000 --- a/examples/aquila/aquila-pretrain/aquila_sft.py +++ /dev/null @@ -1,438 +0,0 @@ -# Copyright © 2022 BAAI. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License") -import os -import torch -from torch.utils.data import Dataset -import gc -gc.collect() -torch.cuda.empty_cache() - -from flagai.auto_model.auto_loader import AutoLoader -from flagai.data.tokenizer import Tokenizer -from flagai.env_args import EnvArgs -from flagai.env_trainer_v1 import EnvTrainer - -device = torch.device("cuda" if torch.cuda.is_available() else "cpu") - -# You can input all parameters by the command line. -# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch -env_args = EnvArgs( - env_type="bmtrain", - batch_size=1, - gradient_accumulation_steps=1, - lr=2e-4, - weight_decay=1e-3, - epochs=100, - log_interval=10, - eval_interval=5000, - num_gpus=1, - load_dir=None, - pytorch_device=device, - save_dir="checkpoints_aquila", - checkpoint_activations=False, - save_interval=5000, - fp16=True, - training_script=__file__, -) -env_args = env_args.parse_args() -#env_args.wandb = False - -# overwrite -if env_args.yaml_config: - import yaml - file_data = open(env_args.yaml_config, 'r', encoding="utf-8").read() - data = yaml.load_all(file_data) - delattr(env_args, 'yaml_config') - arg_dict = env_args.__dict__ - for subdata in data: - for key, value in subdata.items(): - if isinstance(value, list): - for v in value: - arg_dict[key].append(v) - else: - arg_dict[key] = value -trainer = EnvTrainer(env_args) - -# Trainer as Trigger -if not env_args.not_call_launch: - import sys - sys.exit(0) - -print(f"Trainer effective env_args={env_args} 
local_rank={trainer.local_rank}", flush=True) - -#checkpoints = "/share/project/ldwang/sft/state_dict/" -checkpoints = env_args.pre_load_dir -# model_name = env_args.model_name - -# checkpoints = "/data/yzd/FlagAI/examples/aquila/checkpoints_in/" -model_name = env_args.model_name -# model_name = "aquila-7b" -env_args.enable_sft_conversations_dataset_v3 = True - - -print('*'*20, "model_name", model_name, flush=True) - -''' -auto_loader = AutoLoader( - "lm", - model_name=model_name, - model_dir=checkpoints, - only_download_config=True, -) -model = auto_loader.get_model() -tokenizer = auto_loader.get_tokenizer() -print('*'*20, "model", model) -trainer.pre_train(model) -print('*'*20, "model", model) - -''' - -cache_dir = os.path.join(checkpoints, model_name) -print('*'*20, "cache_dir", cache_dir) -tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir) -print('*'*20, "tokenizer", tokenizer) - -# avoid sync loading models in case of Mem OOM -if env_args.bmt_async_load: - import time - time.sleep(10*60*(trainer.local_rank%4)) - - -config_file = os.path.join(cache_dir, 'config.json') -from flagai.model.aquila_model import AQUILAModel -model = AQUILAModel.init_from_json(config_file=config_file) -print('*'*20, "model", model) - -## bmt_pre_load -checkpoint_path = os.path.join(cache_dir, "pytorch_model.bin") -if env_args.bmt_pre_load: - model.load_weights(checkpoint_path) - -trainer.pre_train(model) - -print('*'*20, "model", model, flush=True) - - -if env_args.enable_sft_conversations_dataset_v3: - assert env_args.enable_sft_dataset_dir is not None and \ - env_args.enable_sft_dataset_file is not None - - cur_dir = env_args.enable_sft_dataset_dir - jsonl_data = os.path.join(cur_dir, env_args.enable_sft_dataset_file) - max_seq_len = 2048 - - import jsonlines - import numpy as np - def read_file(): - conversations = [] - with jsonlines.open(jsonl_data) as reader: - for line in reader: - conversations.append(line) - return conversations - - from examples.aquila 
import cyg_conversation as conversation_lib - """Add speaker and start/end signal on each round.""" - BEGIN_SIGNAL = "### " - END_SIGNAL = "\n" - unknown_role = "unknown" # use default unknown role - roles = { - "human": conversation_lib.default_conversation.roles[0], # human role - "gpt": conversation_lib.default_conversation.roles[1], # gpt role - } - - def _add_speaker_and_signal(header, source, get_conversation=True): - conversation = header - - if "instruction" in source and source["instruction"] is not None and len(source["instruction"]) > 0: - source["instruction"] = ( - BEGIN_SIGNAL - + conversation_lib.default_conversation.roles[2] - + ": " - + source["instruction"] - + END_SIGNAL - ) - if get_conversation: - conversation += source["instruction"] - for sentence in source["conversations"]: - sentence_from = sentence["from"].lower() - sentence["value"] = ( - BEGIN_SIGNAL - + roles.get(sentence_from, unknown_role) - + ": " - + sentence["value"] - + END_SIGNAL - ) - if get_conversation: - conversation += sentence["value"] - return conversation - - class ConversationDatasetV3(Dataset): - def __init__(self, conversations, tokenizer, maxlen=512): - super(ConversationDatasetV3, self).__init__() - self.conversations = conversations - self.tokenizer = tokenizer - self.maxlen = maxlen - - def __getitem__(self, i): - header = f"{conversation_lib.default_conversation.system}\n\n" - source = self.conversations[i] - _add_speaker_and_signal(header, source) - - source["chat_desc"] = header - chat_desc = source['chat_desc'] - instruction = source['instruction'] - conversations = source['conversations'] - - # chat_desc - example = self.tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids'] - EOS_TOKEN = example[-1] - example = example[:-1] # remove eos - # instruction - instruction = self.tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids'] - instruction = instruction[1:-1] # remove bos & eos - example += instruction - - import copy 
- labels = copy.deepcopy(example) - - for conversation in conversations: - role = conversation['from'] - content = conversation['value'] - - if role == 'gpt': - prefix_gpt = BEGIN_SIGNAL + roles.get(role, unknown_role) + ": " - content_gpt = content[len(prefix_gpt):] - - prefix_gpt = self.tokenizer.encode_plus(f"{prefix_gpt}", None, max_length=None)['input_ids'] - prefix_gpt = prefix_gpt[1:-1] # remove bos & eos - example += prefix_gpt - role_labels = [env_args.IGNORE_INDEX] * len(prefix_gpt) - - content_gpt = self.tokenizer.encode_plus(f"{content_gpt}", None, max_length=None)['input_ids'] - content_gpt = content_gpt[1:-1] # remove bos & eos - example += content_gpt - role_labels += copy.deepcopy(content_gpt) - else: - content = self.tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids'] - content = content[1:-1] # remove bos & eos - example += content - # masking - role_labels = [env_args.IGNORE_INDEX] * len(content) - labels += role_labels - - example.append(EOS_TOKEN) - labels.append(EOS_TOKEN) - assert len(example) == len(labels) - - ## maxlen - example = example[:self.maxlen] - labels = labels[:self.maxlen] - - output = { - "input_ids": example, - "labels": labels, - } - return output - - def __len__(self): - return len(self.conversations) - - @staticmethod - def collate_fn(batch): - def padding(indice, max_length, pad_idx=0): - pad_indice = [ - item + [pad_idx] * max(0, max_length - len(item)) for item in indice - ] - return torch.tensor(pad_indice) - - input_ids = [data["input_ids"] for data in batch] - labels = [data["labels"] for data in batch] - max_length = max_seq_len - input_ids = padding(input_ids, max_length)[:,:max_length] - labels = padding(labels, max_length, pad_idx=env_args.IGNORE_INDEX)[:,:max_length] - - data = { - "input_ids": input_ids, - "labels": labels - } - return data - - conversations = read_file() - data_len = len(conversations) - #train_size = int(data_len * 0.95) - train_size = data_len - train_conversations = 
conversations[:train_size] - - train_dataset = ConversationDatasetV3(train_conversations, - tokenizer=tokenizer, - maxlen=max_seq_len) - #print(f"train_dataset \n {train_dataset[0]}") - - trainer.do_train( - train_dataset=train_dataset, - valid_dataset=None, - collate_fn=ConversationDatasetV3.collate_fn, - optimizer=None, - rank_split=False) -elif env_args.enable_sft_conversations_dataset_v3: - assert env_args.enable_sft_dataset_dir is not None and \ - env_args.enable_sft_dataset_file is not None - - cur_dir = env_args.enable_sft_dataset_dir - jsonl_data = os.path.join(cur_dir, env_args.enable_sft_dataset_file) - max_seq_len = 2048 - - import jsonlines - import numpy as np - def read_file(): - conversations = [] - with jsonlines.open(jsonl_data) as reader: - for line in reader: - conversations.append(line) - return conversations - - from examples.gpt3_pretrain.llama import ym_conversation as conversation_lib - """Add speaker and start/end signal on each round.""" - BEGIN_SIGNAL = "### " - END_SIGNAL = "\n" - unknown_role = "unknown" # use default unknown role - roles = { - "human": conversation_lib.default_conversation.roles[0], # human role - "gpt": conversation_lib.default_conversation.roles[1], # gpt role - } - - def _add_speaker_and_signal(header, source, get_conversation=True): - conversation = header - - if "instruction" in source and source["instruction"] is not None and len(source["instruction"]) > 0: - source["instruction"] = ( - BEGIN_SIGNAL - + conversation_lib.default_conversation.roles[2] - + ": " - + source["instruction"] - + END_SIGNAL - ) - if get_conversation: - conversation += source["instruction"] - for sentence in source["conversations"]: - sentence_from = sentence["from"].lower() - sentence["value"] = ( - BEGIN_SIGNAL - + roles.get(sentence_from, unknown_role) - + ": " - + sentence["value"] - + END_SIGNAL - ) - if get_conversation: - conversation += sentence["value"] - return conversation - - class ConversationDatasetV3(Dataset): - def 
__init__(self, conversations, tokenizer, maxlen=512): - super(ConversationDatasetV3, self).__init__() - self.conversations = conversations - self.tokenizer = tokenizer - self.maxlen = maxlen - - def __getitem__(self, i): - header = f"{conversation_lib.default_conversation.system}\n\n" - source = self.conversations[i] - _add_speaker_and_signal(header, source) - - source["chat_desc"] = header - chat_desc = source['chat_desc'] - instruction = source['instruction'] - conversations = source['conversations'] - - # chat_desc - example = self.tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids'] - EOS_TOKEN = example[-1] - example = example[:-1] # remove eos - # instruction - instruction = self.tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids'] - instruction = instruction[1:-1] # remove bos & eos - example += instruction - - import copy - labels = copy.deepcopy(example) - - for conversation in conversations: - role = conversation['from'] - content = conversation['value'] - - if role == 'gpt': - prefix_gpt = BEGIN_SIGNAL + roles.get(role, unknown_role) + ": " - content_gpt = content[len(prefix_gpt):] - - prefix_gpt = self.tokenizer.encode_plus(f"{prefix_gpt}", None, max_length=None)['input_ids'] - prefix_gpt = prefix_gpt[1:-1] # remove bos & eos - example += prefix_gpt - role_labels = [env_args.IGNORE_INDEX] * len(prefix_gpt) - - content_gpt = self.tokenizer.encode_plus(f"{content_gpt}", None, max_length=None)['input_ids'] - content_gpt = content_gpt[1:-1] # remove bos & eos - example += content_gpt - role_labels += copy.deepcopy(content_gpt) - else: - content = self.tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids'] - content = content[1:-1] # remove bos & eos - example += content - # masking - role_labels = [env_args.IGNORE_INDEX] * len(content) - labels += role_labels - - example.append(EOS_TOKEN) - labels.append(EOS_TOKEN) - assert len(example) == len(labels) - - ## maxlen - example = 
example[:self.maxlen] - labels = labels[:self.maxlen] - - output = { - "input_ids": example, - "labels": labels, - } - return output - - def __len__(self): - return len(self.conversations) - - @staticmethod - def collate_fn(batch): - def padding(indice, max_length, pad_idx=0): - pad_indice = [ - item + [pad_idx] * max(0, max_length - len(item)) for item in indice - ] - return torch.tensor(pad_indice) - - input_ids = [data["input_ids"] for data in batch] - labels = [data["labels"] for data in batch] - max_length = max_seq_len - input_ids = padding(input_ids, max_length)[:,:max_length] - labels = padding(labels, max_length, pad_idx=env_args.IGNORE_INDEX)[:,:max_length] - - data = { - "input_ids": input_ids, - "labels": labels - } - return data - - conversations = read_file() - data_len = len(conversations) - #train_size = int(data_len * 0.95) - train_size = data_len - train_conversations = conversations[:train_size] - - train_dataset = ConversationDatasetV3(train_conversations, - tokenizer=tokenizer, - maxlen=max_seq_len) - #print(f"train_dataset \n {train_dataset[0]}") - - trainer.do_train( - train_dataset=train_dataset, - valid_dataset=None, - collate_fn=ConversationDatasetV3.collate_fn, - optimizer=None, - rank_split=False) diff --git a/examples/aquila/aquila-sft/Aquila-sft.yaml b/examples/aquila/aquila-sft/Aquila-sft.yaml deleted file mode 100755 index a7ffabe6..00000000 --- a/examples/aquila/aquila-sft/Aquila-sft.yaml +++ /dev/null @@ -1,20 +0,0 @@ -# comments -batch_size: 10 -gradient_accumulation_steps: 1 -lr: 2.e-4 -warm_up: 0.001 -save_interval: 500 - -bmt_cpu_offload: False -bmt_pre_load: False -bmt_async_load: False -bmt_loss_scale: 65536 - -save_optim: True -save_rng: True - -load_optim: False - -enable_sft_conversations_dataset_v3: true -enable_sft_dataset_dir: './datasets/' -enable_sft_dataset_file: 'convo_v2.jsonl' diff --git a/examples/aquila/aquila-sft/README_AquilaChat-33B.md b/examples/aquila/aquila-sft/README_AquilaChat-33B.md deleted file mode 
100755 index 52550ee4..00000000 --- a/examples/aquila/aquila-sft/README_AquilaChat-33B.md +++ /dev/null @@ -1,223 +0,0 @@ -license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement) - - -# AquilaChat-33B - -## 简介/Overview -Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Megatron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。 - -The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replaces a batch of underlying operators with more efficient implementations, and redesigns the tokenizer for Chinese-English bilingual support. It upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2 when training Aquila. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data quality control and various training optimization methods, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports Chinese-English bilingual knowledge, permits commercial licensing, and complies with domestic data regulations. - -AquilaChat-33B是在Aquila-33B模型的基础上,进行SFT微调后的支持中英双语的对话式语言模型。AquilaChat-33B模型由智源研究院研发,其在主流评测数据集上的评测结果如下 - -AquilaChat-33B is a conversational language model that supports Chinese-English dialogue. It is based on the Aquila-33B model and fine-tuned using SFT. The AquilaChat-33B model was developed by the Beijing Academy of Artificial Intelligence. 
The evaluation results on mainstream benchmark datasets are as follows: - -| 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM| -| ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- | -| [AquilaChat-33B](https://model.baai.ac.cn/model-detail/xxxxx) | 0.292 | 0.385|0.269 | 0.731|0.347 |0.939| 0.443| -| [BiLLa-7B-LLM](https://model.baai.ac.cn/model-detail/xxxxx) | 0.279 | 0.374|0.257 | 0.76|0.205 |0.864| 0.514| -| [Ziya-LLaMA-13B-v1](https://model.baai.ac.cn/model-detail/xxxxx) | 0.273 | 0.404|0.406 | 0.786|0.284 |0.762| 0.191| - -您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标 - -See the [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more evaluation metrics. - - - -我们的模型也同时支持[Huggingface平台](hflink) - -Our model also supports the [Hugging Face](hflink) platform. - - - -## 模型细节/Model details - -我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。 - -Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表: - -我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。 - - -We used a series of more efficient low-level operators to assist with model training, including an approach adapted from [flash-attention](https://github.com/HazyResearch/flash-attention) that replaces some intermediate calculations, as well as RMSNorm. Building upon this foundation, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses techniques such as data parallelism, ZeRO (Zero Redundancy Optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the model training process. - -We trained the Aquila tokenizer from scratch; it supports both English and Chinese. 
The table below compares its parameters with those of other tokenizers: - -We sampled ten thousand examples each from English, Chinese, and code data, tokenized them with each tokenizer, and recorded the average number of tokens per sample in the table. - -| 模型/Model | 词表大小/Vocab size | 说明/Note |英文平均tokens量/Avg tokens(English)| 中文平均tokens量/Avg tokens(Chinese)|代码平均tokens量/Avg tokens(code) | -| ----- | ---- | ----- | ---- | ----- | ---- | -| gpt2 | 50527 | bpe|1717 | 1764|2323 | -| llama | 32000 | sp(bpe)|1805| 1257|1970 | -| gpt2_new_100k | 100000 | bpe|1575 | 477|1679 | - - - -模型在一台8卡Nvidia A100上训练8小时,总共对15万条数据训练了3个epoch。 - -The model was trained on a single machine with 8 Nvidia A100 GPUs for 8 hours, running 3 epochs over 150,000 samples. - -## 训练数据集/Training data - -我们采用了一系列高质量中英文数据集来训练和微调我们的对话语言模型,并且在不断更新迭代 - -We used a series of high-quality Chinese and English datasets to train and fine-tune our conversational language model, and we continue to update them iteratively. - -![Screenshot](../img/data.jpg) - - -## 使用方式/How to use - -### 1. 推理/Inference - -```python -import os -import torch -from flagai.auto_model.auto_loader import AutoLoader -from flagai.model.predictor.predictor import Predictor -from flagai.model.predictor.aquila import aquila_generate -from flagai.data.tokenizer import Tokenizer -import bminf - -state_dict = "./checkpoints_in" -model_name = 'aquilachat-30b' - -loader = AutoLoader( - "lm", - model_dir=state_dict, - model_name=model_name, - use_cache=True) - -model = loader.get_model() -tokenizer = loader.get_tokenizer() -cache_dir = os.path.join(state_dict, model_name) -model.eval() -model.half() -model.cuda() - -predictor = Predictor(model, tokenizer) - -text = "北京为什么是中国的首都?" 
- -def pack_obj(text): - obj = dict() - obj['id'] = 'demo' - - obj['conversations'] = [] - human = dict() - human['from'] = 'human' - human['value'] = text - obj['conversations'].append(human) - # dummy bot - bot = dict() - bot['from'] = 'gpt' - bot['value'] = '' - obj['conversations'].append(bot) - - obj['instruction'] = '' - - return obj - -def delete_last_bot_end_singal(convo_obj): - conversations = convo_obj['conversations'] - assert len(conversations) > 0 and len(conversations) % 2 == 0 - assert conversations[0]['from'] == 'human' - - last_bot = conversations[len(conversations)-1] - assert last_bot['from'] == 'gpt' - - ## from _add_speaker_and_signal - END_SIGNAL = "\n" - len_end_singal = len(END_SIGNAL) - len_last_bot_value = len(last_bot['value']) - last_bot['value'] = last_bot['value'][:len_last_bot_value-len_end_singal] - return - -def convo_tokenize(convo_obj, tokenizer): - chat_desc = convo_obj['chat_desc'] - instruction = convo_obj['instruction'] - conversations = convo_obj['conversations'] - - # chat_desc - example = tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids'] - EOS_TOKEN = example[-1] - example = example[:-1] # remove eos - # instruction - instruction = tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids'] - instruction = instruction[1:-1] # remove bos & eos - example += instruction - - for conversation in conversations: - role = conversation['from'] - content = conversation['value'] - print(f"role {role}, raw content {content}") - content = tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids'] - content = content[1:-1] # remove bos & eos - print(f"role {role}, content {content}") - example += content - return example - -print('-'*80) -print(f"text is {text}") - -from examples.aquila.cyg_conversation import default_conversation - -conv = default_conversation.copy() -conv.append_message(conv.roles[0], text) -conv.append_message(conv.roles[1], None) - -tokens = 
tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids'] -tokens = tokens[1:-1] - -with torch.no_grad(): - out = aquila_generate(tokenizer, model, [text], max_gen_len:=200, top_p=0.95, prompts_tokens=[tokens]) - print(f"pred is {out}") - - -``` - -### 2. 可监督微调/Supervised Fine-tuning(SFT) -#### Step 1: 配置模型/ Setup Checkpoints -在`./checkpoints_in`里新建`aquilachat-33b`目录。将微调后的checkpoint,以及原始`aquilachat-33b`模型里的其余文件,包括`config.json`, `mergex.txt`, `vocab.json`, `special_tokens_map.json`放进去 - -Create a new directory named `aquilachat-33b` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilachat-33b` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. - -#### Step 2: 修改参数/ Modify Parameters -* `cd /examples/aquila` -* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file; see [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) -* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft.py`; Configure the `bmtrain_mgpu.sh` file and change `SCRIPT_FILE` to `aquila_sft.py` -* (可选) 在`Aquila-sft.yaml`文件里更改参数 ; (Optional) Change parameters in `Aquila-sft.yaml` - -| 参数名 Parameter | 类型 Type | 描述 Description | -|--------------------------------|------------|-------------------------------------------------------| -| batch_size | int | 每次迭代训练时,从数据集中抽取的样本数。一般来说,它越大,处理速度越快,但会占用更多的内存; The number of samples drawn from the dataset for each training iteration. Generally, a larger batch size speeds up processing but consumes more memory | -| gradient_accumulation_steps | int | 在更新模型权重之前,要对多个小批次进行梯度计算的次数。主要应用于GPU显存较小的情况下,可以使用小的batch_size,通过梯度累积达到与大batch_size相同的效果; The number of mini-batches over which gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size can be used, and gradient accumulation achieves the same effect as a large batch_size |
-| lr | float | 指控制模型更新参数时的步长或速率。学习率过高可能导致模型不收敛,而学习率过低则可能导致训练时间过长或者陷入局部最优解; The step size or rate at which the model updates its parameters during training. A learning rate that is too high may prevent the model from converging, while one that is too low may lead to long training times or getting stuck in a local optimum | -| warm_up | float | 初始学习率与原始学习率的比例; The ratio between the initial learning rate and the original learning rate | -| save_interval | int | 模型保存的间隔,即每训练多少个iteration保存一次模型。当训练时间较长时,保存间隔可以避免因突然中断或出现错误导致训练成果全部丢失; The interval at which the model is saved, i.e., the number of training iterations between saves. When training takes a long time, periodic saving prevents all training progress from being lost due to sudden interruptions or errors. | -| enable_sft_conversations_dataset_v3 | bool | 数据处理方式; Data preprocessing method | -| enable_sft_dataset_dir | str | 可监督微调的数据集目录; Directory of the SFT dataset | -| enable_sft_dataset_file | str | 可监督微调的数据集文件名; Filename of the SFT dataset | - - - - -#### Step 3: 启动可监督微调/Start SFT -``` -bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquilachat-33b [实验名] -``` -接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run. - -![Screenshot](../img/info.jpg) - -成功训练之前能看到如下信息(具体参数可能不同); Before training starts successfully, you will see the following information (the specific parameters may differ): - -![Screenshot](../img/info2.jpg) - - -## 证书/License - -Aquila-33B开源模型使用 [智源Aquila系列模型许可协议](linkhere), 原始代码基于[Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) - - -The Aquila-33B open-source model is licensed under the [BAAI Aquila Model Licence Agreement](linkhere). 
The source code is under [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0) \ No newline at end of file diff --git a/examples/aquila/aquila-sft/aquila_sft.py b/examples/aquila/aquila_sft.py old mode 100755 new mode 100644 similarity index 99% rename from examples/aquila/aquila-sft/aquila_sft.py rename to examples/aquila/aquila_sft.py index 829239bf..032d146c --- a/examples/aquila/aquila-sft/aquila_sft.py +++ b/examples/aquila/aquila_sft.py @@ -292,7 +292,7 @@ def read_file(): conversations.append(line) return conversations - from examples.gpt3_pretrain.llama import ym_conversation as conversation_lib + from examples.gpt3_pretrain.aquila import ym_conversation as conversation_lib """Add speaker and start/end signal on each round.""" BEGIN_SIGNAL = "### " END_SIGNAL = "\n" diff --git a/examples/aquila/generate_sft_code.py b/examples/aquila/generate_sft_code.py new file mode 100644 index 00000000..c3ef7ae8 --- /dev/null +++ b/examples/aquila/generate_sft_code.py @@ -0,0 +1,67 @@ +import torch +import os +import argparse +import sys +sys.path.append("/data2/yzd/workspace/FlagAI") +from flagai import mpu +from flagai.auto_model.auto_loader import AutoLoader +import random +import numpy as np +from flagai.model.predictor.predictor import Predictor +from pathlib import Path +from flagai.data.tokenizer import Tokenizer +import time +import torch.distributed as dist +import json +import json, datetime + +import os + +model_dir = "./checkpoints_in" +device = "cuda" + +print(f"building model...") +loader = AutoLoader("lm", model_name="aquilacode-7b-nv", + only_download_config=True, + use_cache=True, + fp16=True, + model_dir=model_dir) + +model = loader.get_model() +tokenizer = loader.get_tokenizer() + +# import pdb;pdb.set_trace() +# ckpt = torch.load('./checkpoints_in/aquilacode-7b-nv/pytorch_model.bin', map_location=torch.device('cpu')) +# # print(ckpt) +# model.load_state_dict(ckpt, strict=True) + +model.eval() + +model.to(device) + +vocab = 
tokenizer.get_vocab() + +id2word = {v:k for k, v in vocab.items()} + +predictor = Predictor(model, tokenizer) + +max_new_tokens = 256 + +test_file = "./datasets/code_test.txt" +with open(test_file) as fin: + prompt = '\n'+fin.read()+'\n' + +input_ids = tokenizer.encode_plus_non_glm(prompt)["input_ids"][:-1] +input_length = len(input_ids) + +max_length = input_length+max_new_tokens +with torch.no_grad(): + + # prompt = "#用户#" + prompt + " " + "#ai助手#" + + prompt = '''"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."''' + '''### Human: ''' + prompt.strip() + '''### Assistant:''' + res = predictor.predict_generate_randomsample(prompt, + out_max_length=max_length, + top_p=0.95, + temperature=0.7) + print(res) \ No newline at end of file diff --git a/examples/gpt2_title_generation/debug.txt b/examples/gpt2_title_generation/debug.txt old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/deepspeed.json b/examples/gpt2_title_generation/deepspeed.json old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/dev.txt b/examples/gpt2_title_generation/dev.txt old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/generate.py b/examples/gpt2_title_generation/generate.py old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/hostfile b/examples/gpt2_title_generation/hostfile old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/run_multi.sh b/examples/gpt2_title_generation/run_multi.sh old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/run_train.sh b/examples/gpt2_title_generation/run_train.sh old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/tokens_stat.py b/examples/gpt2_title_generation/tokens_stat.py old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/train.py
b/examples/gpt2_title_generation/train.py old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/train_bmtrain.py b/examples/gpt2_title_generation/train_bmtrain.py old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/train_env_xl_bmtrain.py b/examples/gpt2_title_generation/train_env_xl_bmtrain.py old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/train_multi_gpu.py b/examples/gpt2_title_generation/train_multi_gpu.py old mode 100644 new mode 100755 index d3788469..76457691 --- a/examples/gpt2_title_generation/train_multi_gpu.py +++ b/examples/gpt2_title_generation/train_multi_gpu.py @@ -48,7 +48,7 @@ maxlen = 1024 auto_loader = AutoLoader( "lm", - model_name="llama-7b-en", + model_name="aquila-7b", model_dir=model_dir, only_download_config=True, use_cache=False diff --git a/examples/gpt2_title_generation/train_xl.py b/examples/gpt2_title_generation/train_xl.py old mode 100644 new mode 100755 diff --git a/examples/gpt2_title_generation/train_xl_bmtrain.py b/examples/gpt2_title_generation/train_xl_bmtrain.py old mode 100644 new mode 100755 diff --git a/examples/aquila/build_index_mappings.py b/flagai/data/dataset/indexed_dataset/build_index_mappings.py similarity index 100% rename from examples/aquila/build_index_mappings.py rename to flagai/data/dataset/indexed_dataset/build_index_mappings.py diff --git a/flagai/model/base_model.py b/flagai/model/base_model.py index bc507f2a..b925bc0b 100755 --- a/flagai/model/base_model.py +++ b/flagai/model/base_model.py @@ -188,8 +188,6 @@ def load_diffusion_local(yaml_path, only_download_config=False, **kwargs): model_id = _get_model_id(model_name) except: print("Model hub is not reachable!") - - import pdb;pdb.set_trace() # prepare the download path # downloading the files model: Union[Module, None]