From 1cc73836d4750272ed3be41ef474078fad5b7d26 Mon Sep 17 00:00:00 2001
From: gongjy <2474590974@qq.com>
Date: Fri, 27 Sep 2024 16:38:18 +0800
Subject: [PATCH] update readme info

---
 README.md    | 5 ++---
 README_en.md | 5 ++---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index d57023a..5515b81 100644
--- a/README.md
+++ b/README.md
@@ -80,7 +80,7 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
 
 2024-09-27
 
-- 09-27更新pretrain数据集的预处理方式,为了保证文本完整性,放弃预处理成.bin训练的形式(轻微牺牲训练速度)。
+- 👉09-27更新pretrain数据集的预处理方式,为了保证文本完整性,放弃预处理成.bin训练的形式(轻微牺牲训练速度)。
 
 - 目前pretrain预处理后的文件命名为:pretrain_data.csv。
 
@@ -252,8 +252,7 @@ streamlit run fast_inference.py
 
 | minimind tokenizer | 6,400   | 自定义 |
 
-> [!TIP]
-> 2024-09-17更新:为了防止过去的版本歧义&控制体积,minimind所有模型均使用minimind_tokenizer分词,废弃所有mistral_tokenizer版本。
+> 👉2024-09-17更新:为了防止过去的版本歧义&控制体积,minimind所有模型均使用minimind_tokenizer分词,废弃所有mistral_tokenizer版本。
 
 > 尽管minimind_tokenizer长度很小,编解码效率弱于qwen2、glm等中文友好型分词器。
 > 但minimind模型选择了自己训练的minimind_tokenizer作为分词器,以保持整体参数轻量,避免编码层和计算层占比失衡,头重脚轻,因为minimind的词表大小只有6400。

diff --git a/README_en.md b/README_en.md
index fd0f86f..3fc8f35 100644
--- a/README_en.md
+++ b/README_en.md
@@ -87,7 +87,7 @@ We hope this open-source project helps LLM beginners get started quickly!
 
 2024-09-27
 
-- Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon the preprocessing into .bin training format (slightly sacrificing training speed).
+- 👉Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon the preprocessing into .bin training format (slightly sacrificing training speed).
 
 - The current filename for the pretrain data after preprocessing is: pretrain_data.csv.
 
@@ -282,8 +282,7 @@ git clone https://github.com/jingyaogong/minimind.git
 
 | minimind tokenizer | 6,400   | Custom |
 
-> [!IMPORTANT]
-> Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
+> 👉Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
   use the Minimind_tokenizer for tokenization, and all versions of the Mistral_tokenizer have been deprecated.
 
 > Although the Minimind_tokenizer has a small length and its encoding/decoding efficiency is weaker compared to
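Editor's note (not part of the patch): the 09-27 change above keeps pretrain text intact in pretrain_data.csv and gives up the pre-packed .bin format, which implies tokenizing during training instead of during preprocessing. Below is a minimal sketch of what such on-the-fly loading could look like; the file paths, the `text` column name, and the 512-token cutoff are assumptions for illustration, not values taken from this diff.

```python
import pandas as pd
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class PretrainCSVDataset(Dataset):
    """Reads raw text from pretrain_data.csv and tokenizes each sample on the fly,
    rather than loading pre-packed token ids from a .bin file."""

    def __init__(self, csv_path="pretrain_data.csv",
                 tokenizer_dir="./model/minimind_tokenizer",  # assumed location
                 max_length=512):                              # assumed cutoff
        self.rows = pd.read_csv(csv_path)          # one document per row
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
        self.max_length = max_length

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        text = str(self.rows.iloc[idx]["text"])    # assumed column name
        ids = self.tokenizer(text, truncation=True, max_length=self.max_length,
                             return_tensors="pt")["input_ids"].squeeze(0)
        # standard next-token objective: inputs are ids[:-1], targets are ids[1:]
        return ids[:-1], ids[1:]
```

This matches the trade-off the changelog names: tokenization is now paid per sample at training time (the slight speed loss), in exchange for keeping the original documents whole on disk.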
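The tokenizer note in both READMEs argues that the 6,400-entry minimind_tokenizer keeps the embedding layer from outweighing the compute layers of a very small model. A back-of-the-envelope check of that reasoning follows; the hidden size of 512 and the ~150k comparison vocabulary are assumed round numbers for a small MiniMind-style configuration, not figures stated in this patch.

```python
hidden_size = 512  # assumed embedding width for a small model

for name, vocab in [("minimind tokenizer", 6_400), ("~150k-vocab tokenizer", 150_000)]:
    embed_params = vocab * hidden_size  # token-embedding matrix only
    print(f"{name}: {embed_params / 1e6:.1f}M embedding parameters")

# minimind tokenizer: 3.3M embedding parameters
# ~150k-vocab tokenizer: 76.8M embedding parameters
# A ~150k vocabulary alone would dwarf the rest of a model in the
# tens-of-millions-of-parameters range — the "top-heavy" imbalance
# the README says it wants to avoid.
```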