modified docs and files
Signed-off-by: ftgreat <[email protected]>
ftgreat committed Jun 8, 2023
1 parent e4b9830 commit 5af62c6
Showing 56 changed files with 689 additions and 1,314 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -10,13 +10,7 @@

FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality.

<p align="center">
Platforms supported
</p>
Put this at the end and add a logo------------------
****
Tianshu Nvidia
****


## Why should I use FlagAI?

@@ -299,6 +293,12 @@ The majority of FlagAI is licensed under the [Apache 2.0 license](LICENSE), however
- [29 Jun 2022] release v1.1.0, support OPTs downloading and inference/fine-tuning [#63](https://github.com/FlagAI-Open/FlagAI/pull/63)
- [17 May 2022] made our first contribution in [#1](https://github.com/FlagAI-Open/FlagAI/pull/1)

## Platforms supported

<div align="center">
<img src="./examples/aquila/img/merged_platform.jpg" height="100" align="center" />
</div>



## Misc
5 changes: 5 additions & 0 deletions README_zh.md
@@ -289,6 +289,11 @@ FlagAI飞智大部分项目基于 [Apache 2.0 license](LICENSE),但是请注
* GLM 是基于协议 [MIT license](https://github.com/THUDM/GLM/blob/main/LICENSE)
* AltDiffusion 是基于协议 [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license)

## 平台支持

<div align="center">
<img src="./examples/aquila/img/merged_platform.jpg" height="100" align="center" />
</div>


## Misc
16 changes: 16 additions & 0 deletions examples/Aquila/Aquila-code/Aquila-code.yaml
@@ -0,0 +1,16 @@
batch_size: 10
gradient_accumulation_steps: 1
lr: 2.0e-5
warm_up: 0.01
save_interval: 1000

bmt_cpu_offload: False
bmt_pre_load: False
bmt_async_load: False
bmt_loss_scale: 524288

save_optim: True
save_rng: True

load_optim: False
resume_dataset: False
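
These keys are merged into the trainer's `EnvArgs` at launch (see the override loop in `aquila_code.py` later in this commit). A minimal sketch of that behavior, assuming PyYAML; the file path and the stand-in defaults are illustrative:

```python
import yaml

# Read the overrides from the yaml shown above (path is illustrative).
with open("examples/Aquila/Aquila-code/Aquila-code.yaml", encoding="utf-8") as f:
    overrides = yaml.safe_load(f)

# Stand-in for a few EnvArgs defaults from aquila_code.py.
args = {"batch_size": 1, "lr": 2e-4, "save_interval": 5000}
args.update(overrides)  # yaml values win over the defaults

print(args["batch_size"], args["lr"], args["save_interval"])  # -> 10 2e-05 1000
```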
@@ -1,25 +1,25 @@
license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement)


# AquilaCode-7B-nv
# AquilaCode-7B

## 简介/Overview
Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Megatron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。

The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replaces a set of underlying operators with more efficient implementations, redesigns the tokenizer for Chinese-English bilingual support, and upgrades the BMTrain parallel training method, achieving nearly 8x the training efficiency of Megatron+DeepSpeed ZeRO-2 during Aquila's training. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data-quality control and various training optimizations, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports bilingual Chinese-English knowledge, commercial licensing, and compliance with domestic data regulations.

AquilaCode-7B-nv是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下
<!-- AquilaCode-7B-NV是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下
AquilaCode-7B-nv is a foundational code model obtained by continued pretraining on code data on top of the Aquila-7B model. It was developed by the Beijing Academy of Artificial Intelligence. The evaluation results on mainstream benchmark datasets are as follows:
| 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM|
| ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- |
| [AquilaCode-7B-nv](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx |
| [AquilaCode-7B-nv](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx | -->


您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标
<!-- 您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标
You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details
You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details -->



@@ -49,17 +49,11 @@ We used different tokenizers to extract ten thousand data samples from English,
| gpt2_new_100k | 100000 | bpe | 1575 | 477 | 1679 |


模型在8台8卡Nvidia A100-40G上训练14天,数据集规模为2350亿。

The model was trained for 14 days on 8 nodes with 8 Nvidia A100-40G GPUs each; the training set contains 235B tokens.

## 训练数据集/Training data
AquilaCode-7B-nv训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据
`AquilaCode-7B-NV`和`AquilaCode-7B-TS`训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据

Continued pretraining was performed on top of our model--------
The AquilaCode-7B-nv model underwent supervised fine-tuning on [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) (shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text).

![Screenshot](../img/data.jpg)
The AquilaCode-7B-NV model underwent supervised fine-tuning on [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) (shell, sql, C, C++, Java, JavaScript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text).

## 使用方式/How to use

@@ -125,12 +119,12 @@ with torch.no_grad():

### 2. 可监督微调/Supervised Fine-tuning(SFT)
#### Step 1: 配置模型/ Setup Checkpoints
`./checkpoints_in`里新建`aquilacode-7b-nv`目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `merges.txt`, `vocab.json`, `special_tokens_map.json`放进去
`./checkpoints_in`里新建`aquilacode-7b-nv`(或`aquilacode-7b-ts`)目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `merges.txt`, `vocab.json`, `special_tokens_map.json`放进去

Create a new directory named `aquilacode-7b-nv` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `merges.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory.
Create a new directory named `aquilacode-7b-nv` (or `aquilacode-7b-ts`) inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `merges.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory.

#### Step 2: 修改参数/Modify Parameters
* `cd /examples/aquila`
* `cd /examples/Aquila/Aquila-code`
* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft_code.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_sft_code.py`
* (可选) 在`Aquila-sft.yaml`文件里更改参数 ; (optional) change parameters in `Aquila-sft.yaml`
@@ -148,7 +142,7 @@ Create a new directory named `aquilacode-7b-nv` inside `./checkpoints_in`. Place

#### Step 3: 启动可监督微调/Start SFT
```
bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [实验名]
bash dist_trigger_docker.sh hostfile Aquila-sft.yaml [aquilacode-7b-nv/aquilacode-7b-ts] [实验名]
```
接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run.

224 changes: 224 additions & 0 deletions examples/Aquila/Aquila-code/aquila_code.py
@@ -0,0 +1,224 @@
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import os
import torch
from torch.utils.data import Dataset
import gc
gc.collect()              # release leftover Python objects
torch.cuda.empty_cache()  # and any cached CUDA memory before building the model
import sys
sys.path.append("/data2/yzd/workspace/FlagAI")  # developer-local checkout; adjust or remove for your environment
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.tokenizer import Tokenizer
from flagai.env_args import EnvArgs
from flagai.env_trainer_v1 import EnvTrainer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# You can pass all parameters on the command line.
# For example: python aquila_code.py --epochs=300 --batch_size=4 --env_type=pytorch
env_args = EnvArgs(
    env_type="bmtrain",
    experiment_name="aquila",
    batch_size=1,
    gradient_accumulation_steps=1,
    lr=2e-4,
    weight_decay=1e-3,
    epochs=100,
    log_interval=10,
    eval_interval=5000,
    num_gpus=1,
    load_dir=None,
    pytorch_device=device,
    save_dir="checkpoints_aquila",
    checkpoint_activations=False,
    save_interval=5000,
    fp16=True,
    training_script=__file__,
)
env_args = env_args.parse_args()
#env_args.wandb = False

# overwrite defaults with values from the yaml config, if one was given
if env_args.yaml_config:
    import yaml
    with open(env_args.yaml_config, 'r', encoding="utf-8") as f:
        file_data = f.read()
    data = yaml.safe_load_all(file_data)
    delattr(env_args, 'yaml_config')
    arg_dict = env_args.__dict__
    for subdata in data:
        for key, value in subdata.items():
            if isinstance(value, list):
                for v in value:
                    arg_dict[key].append(v)
            else:
                arg_dict[key] = value
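# Effective precedence from here on: yaml values override command-line flags,
# which override the EnvArgs defaults above; list-valued yaml keys are
# appended to the existing list rather than replaced.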
trainer = EnvTrainer(env_args)

# Trainer as trigger: the initial launcher process exits here; only the
# re-launched worker processes (started with --not_call_launch) continue.
if not env_args.not_call_launch:
    import sys
    sys.exit(0)

print(f"Trainer effective env_args={env_args} local_rank={trainer.local_rank}", flush=True)

checkpoints = env_args.pre_load_dir

model_name = env_args.model_name

env_args.enable_sft_conversations_dataset_v3 = True


print('*'*20, "model_name", model_name, flush=True)

'''
auto_loader = AutoLoader(
"lm",
model_name=model_name,
model_dir=checkpoints,
only_download_config=True,
)
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()
print('*'*20, "model", model)
trainer.pre_train(model)
print('*'*20, "model", model)
'''

cache_dir = os.path.join(checkpoints, model_name)
print('*'*20, "cache_dir", cache_dir)
tokenizer = Tokenizer.from_pretrained(model_name, cache_dir=cache_dir)
print('*'*20, "tokenizer", tokenizer)

# avoid all ranks loading the model at once, which can exhaust host memory
if env_args.bmt_async_load:
    import time
    # stagger local ranks 0-3 by 0/10/20/30 minutes
    time.sleep(10 * 60 * (trainer.local_rank % 4))


config_file = os.path.join(cache_dir, 'config.json')
from flagai.model.aquila_model import AQUILAModel
model = AQUILAModel.init_from_json(config_file=config_file)
print('*'*20, "model", model)

## bmt_pre_load
checkpoint_path = os.path.join(cache_dir, "pytorch_model.bin")
if env_args.bmt_pre_load:
    model.load_weights(checkpoint_path)

trainer.pre_train(model)

print('*'*20, "model", model, flush=True)

assert env_args.enable_sft_dataset_dir is not None and \
       env_args.enable_sft_dataset_file is not None

cur_dir = env_args.enable_sft_dataset_dir
jsonl_data = os.path.join(cur_dir, env_args.enable_sft_dataset_file)
max_seq_len = 2048

import jsonlines

def read_file():
    conversations = []
    with jsonlines.open(jsonl_data) as reader:
        for line in reader:
            if 'chat_desc' not in line or 'instruction' not in line or 'conversations' not in line:
                continue
            obj = dict()
            obj['chat_desc'] = line['chat_desc']
            obj['conversations'] = line['conversations']
            obj['instruction'] = line['instruction']
            conversations.append(obj)
    return conversations
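
# Hypothetical shape of one JSONL record (field names taken from the checks
# above; the values are purely illustrative):
# {"chat_desc": "A chat between a curious human and an AI assistant.",
#  "instruction": "Write clean Python code.",
#  "conversations": [{"from": "human", "value": "Reverse a string."},
#                    {"from": "gpt", "value": "def rev(s): return s[::-1]"}]}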

class ConversationDataset(Dataset):
    def __init__(self, conversations, tokenizer, maxlen=512):
        super(ConversationDataset, self).__init__()
        self.conversations = conversations
        self.tokenizer = tokenizer
        self.maxlen = maxlen

    def __getitem__(self, i):
        chat_desc = self.conversations[i]['chat_desc']
        instruction = self.conversations[i]['instruction']
        conversations = self.conversations[i]['conversations']

        # chat_desc
        example = self.tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
        EOS_TOKEN = example[-1]
        example = example[:-1]  # remove eos
        # instruction
        instruction = self.tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
        instruction = instruction[1:-1]  # remove bos & eos
        example += instruction

        import copy
        labels = copy.deepcopy(example)

        for conversation in conversations:
            role = conversation['from']
            content = conversation['value']
            content = self.tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
            content = content[1:-1]  # remove bos & eos
            example += content
            if role == 'gpt':
                role_labels = copy.deepcopy(content)
            else:
                # mask non-gpt turns so the loss is computed only on replies
                role_labels = [env_args.IGNORE_INDEX] * len(content)
            labels += role_labels

        example.append(EOS_TOKEN)
        labels.append(EOS_TOKEN)

        # truncate to maxlen (this may also drop the EOS appended above)
        example = example[:self.maxlen]
        labels = labels[:self.maxlen]

        output = {
            "input_ids": example,
            "labels": labels,
        }
        return output

    def __len__(self):
        return len(self.conversations)

    @staticmethod
    def collate_fn(batch):
        def padding(indice, max_length, pad_idx=0):
            pad_indice = [
                item + [pad_idx] * max(0, max_length - len(item)) for item in indice
            ]
            return torch.tensor(pad_indice)

        input_ids = [data["input_ids"] for data in batch]
        labels = [data["labels"] for data in batch]
        max_length = max_seq_len
        # pad input_ids with 0 and labels with IGNORE_INDEX so that padded
        # positions are excluded from the loss
        input_ids = padding(input_ids, max_length)[:, :max_length]
        labels = padding(labels, max_length, pad_idx=env_args.IGNORE_INDEX)[:, :max_length]

        data = {
            "input_ids": input_ids,
            "labels": labels
        }
        return data

conversations = read_file()
data_len = len(conversations)
#train_size = int(data_len * 0.95)
train_size = data_len
train_conversations = conversations[:train_size]

train_dataset = ConversationDataset(train_conversations,
                                    tokenizer=tokenizer,
                                    maxlen=max_seq_len)

trainer.do_train(
    train_dataset=train_dataset,
    valid_dataset=None,
    collate_fn=ConversationDataset.collate_fn,
    optimizer=None,
    rank_split=False)
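
For a quick offline sanity check of the dataset logic above, here is a hypothetical smoke test with a stub tokenizer (the class, token ids, and record are all illustrative; the real script uses the Aquila tokenizer and the parsed `env_args`):

```python
# Hypothetical smoke test; not part of the training script.
class StubTokenizer:
    """Fake tokenizer: BOS=1, EOS=2, one synthetic id per character."""
    def encode_plus(self, text, second=None, max_length=None):
        return {"input_ids": [1] + [ord(c) % 1000 + 10 for c in text] + [2]}

record = {
    "chat_desc": "A chat between a human and an AI assistant.",
    "instruction": "Write a Python function.",
    "conversations": [
        {"from": "human", "value": "Reverse a string."},
        {"from": "gpt", "value": "def rev(s): return s[::-1]"},
    ],
}

ds = ConversationDataset([record], tokenizer=StubTokenizer(), maxlen=128)
item = ds[0]
assert len(item["input_ids"]) == len(item["labels"])
# The human turn should be masked with IGNORE_INDEX in the labels.
print(sum(l == env_args.IGNORE_INDEX for l in item["labels"]), "masked positions")
```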
File renamed without changes.
@@ -11,13 +11,7 @@
import random
import numpy as np
from flagai.model.predictor.predictor import Predictor
from pathlib import Path
from flagai.data.tokenizer import Tokenizer
import torch.distributed as dist
import json
import json, datetime

import os

model_dir = "./checkpoints_in"
device = "cuda"
@@ -32,11 +26,6 @@
model = loader.get_model()
tokenizer = loader.get_tokenizer()

# import pdb;pdb.set_trace()
# ckpt = torch.load('./checkpoints_in/aquilacode-7b-nv/pytorch_model.bin', map_location=torch.device('cpu'))
# # print(ckpt)
# model.load_state_dict(ckpt, strict=True)

model.eval()

model.to(device)
Expand All @@ -61,5 +50,5 @@
res = predictor.predict_generate_randomsample(prompt,
out_max_length=max_length,
top_p=0.95,
temperature=t0.7)
temperature=0.7)
print(res)
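
For reference, a hypothetical sweep over the sampling temperature using the same predictor call shown above (signature as in the diff; the temperature values are illustrative):

```python
# Hypothetical: compare generations at a few sampling temperatures.
for t in (0.3, 0.7, 1.0):
    res = predictor.predict_generate_randomsample(prompt,
                                                  out_max_length=max_length,
                                                  top_p=0.95,
                                                  temperature=t)
    print(f"temperature={t}:\n{res}\n")
```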
File renamed without changes.