UbiquitousLearning/PhoneLM

This repository contains the code and documentation for pre-training, fine-tuning, and evaluating PhoneLM, a highly capable and efficient small language model family. An end-to-end demo of PhoneLM running on a smartphone is available at mllm.

Model Downloads

HuggingFace
PhoneLM-1.5B
PhoneLM-1.5B-Instruct
PhoneLM-1.5B-Call
PhoneLM-0.5B
PhoneLM-0.5B-Instruct

Evaluation Results

Comprehensive Evaluation

| Model | HellaSwag | WinoGrande | PIQA | SciQ | BoolQ | ARC Easy | ARC Challenge | Average |
|---|---|---|---|---|---|---|---|---|
| PhoneLM-1.5B | 66.9 | 63.0 | 77.3 | 88.8 | 65.5 | 69.7 | 39.9 | 67.31 |
| Pythia-1.4B | 52.0 | 57.2 | 71.1 | 79.2 | 63.2 | 53.9 | 28.3 | 57.84 |
| OPT-1.3B | 53.7 | 59.0 | 71.0 | 78.1 | 57.2 | 51.3 | 28.0 | 56.90 |
| BLOOM-1.1B | 43.0 | 54.9 | 67.2 | 74.6 | 59.1 | 45.4 | 25.6 | 52.83 |
| TinyLlama-1.1B | 59.1 | 58.9 | 73.0 | 82.3 | 58.6 | 55.7 | 31.0 | 59.80 |
| MobileLLaMA-1.4B | 56.1 | 59.4 | 73.0 | 81.9 | 56.7 | 55.8 | 30.3 | 59.03 |
| MobiLlama-1B | 62.2 | 59.3 | 74.8 | 82.8 | 60.3 | 56.4 | 31.7 | 61.07 |
| OpenELM-1.1B | 64.8 | 61.7 | 75.6 | 83.6 | 63.6 | 55.4 | 32.3 | 62.43 |
| DCLM-1.4B | 53.6 | 66.3 | 77.0 | 94.0 | 71.4 | 74.8 | 41.2 | 68.33 |
| SmolLM-1.7B | 49.6 | 60.9 | 75.8 | 93.2 | 66.0 | 76.4 | 43.5 | 66.49 |
| Qwen 1.5-1.8B | 60.9 | 60.5 | 74.2 | 89.4 | 66.5 | 59.1 | 34.7 | 63.61 |
| Galactica-1.3B | 41.0 | 54.4 | 63.8 | 87.7 | 62.0 | 58.6 | 30.5 | 56.86 |
| StableLM 2-1.6B | 68.8 | 64.1 | 75.1 | 76.9 | 80.0 | 60.3 | 39.2 | 66.34 |
| Cerebras-GPT-1.3B | 38.4 | 51.9 | 66.8 | 73.0 | 59.3 | 45.8 | 25.3 | 51.50 |
| MiniCPM-1B | 67.5 | 63.7 | 75.1 | 91.0 | 70.5 | 62.9 | 38.1 | 66.97 |
| MiniCPM-2B | 67.2 | 63.9 | 76.1 | 92.5 | 74.6 | 69.0 | 42.7 | 69.43 |
| Gemma-2B | 71.4 | 65.2 | 78.4 | 91.4 | 69.9 | 72.3 | 42.0 | 70.09 |
| Gemma 2-2B | 55.0 | 68.7 | 78.7 | 96.0 | 73.6 | 80.3 | 46.9 | 71.31 |

Android Function Call

To enhance the model's capability in smartphone operation, we fine-tuned PhoneLM on DroidCall, a synthetic dataset generated by GPT-4 that focuses specifically on Android intent invocations.

Currently, we use two simple metrics to measure function-calling ability:

  • Accuracy: A sample contains a user query and its corresponding ground-truth function calls. A sample is considered correct only if the model generates all function calls with both correct functions and parameters. Accuracy is defined as the ratio of correctly predicted samples to the total number of samples.
  • Soft Accuracy: To provide a more fine-grained evaluation when the model generates partially correct results (i.e., correct functions with partially correct parameters), we define soft accuracy. For each function call, a score is calculated as the ratio of correctly predicted parameters to the total number of parameters. Soft accuracy is then computed as the average of these scores across all function calls.
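
The two metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation script used for the results here: the call representation (a dict with `name` and `arguments`), the matching of predicted to ground-truth calls by function name, and the function names `send_sms` and `set_alarm` are all assumptions made for the example.

```python
def score_call(pred, gold):
    """Share of the ground-truth parameters that the predicted call reproduces."""
    if pred is None or pred["name"] != gold["name"]:
        return 0.0
    gold_args = gold["arguments"]
    if not gold_args:
        return 1.0  # nothing to get wrong when the call takes no parameters
    pred_args = pred["arguments"]
    correct = sum(1 for key, value in gold_args.items()
                  if key in pred_args and pred_args[key] == value)
    return correct / len(gold_args)


def evaluate(samples):
    """samples: list of (predicted_calls, ground_truth_calls) pairs.

    Assumption: calls are matched by function name, so each function is
    expected at most once per sample.
    """
    exact_hits = 0
    call_scores = []
    for preds, golds in samples:
        preds_by_name = {p["name"]: p for p in preds}
        matched = [preds_by_name.get(g["name"]) for g in golds]
        call_scores.extend(score_call(p, g) for p, g in zip(matched, golds))
        # Accuracy: every call must match exactly, with no missing or extra calls.
        if len(preds) == len(golds) and all(p == g for p, g in zip(matched, golds)):
            exact_hits += 1
    accuracy = exact_hits / len(samples)
    soft_accuracy = sum(call_scores) / len(call_scores)
    return accuracy, soft_accuracy


# Hypothetical samples: the first is fully correct, the second gets one of two parameters right.
samples = [
    ([{"name": "send_sms", "arguments": {"to": "Bob", "body": "hi"}}],
     [{"name": "send_sms", "arguments": {"to": "Bob", "body": "hi"}}]),
    ([{"name": "set_alarm", "arguments": {"hour": 7, "minute": 0}}],
     [{"name": "set_alarm", "arguments": {"hour": 7, "minute": 30}}]),
]
accuracy, soft_accuracy = evaluate(samples)
print(accuracy, soft_accuracy)
```

In this toy case the second sample counts as wrong for accuracy but still contributes a score of 0.5 to soft accuracy, which is why soft accuracy is never lower than accuracy in the table.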

| Model | Accuracy | Soft Accuracy |
|---|---|---|
| PhoneLM-1.5B-Instruct | 17.5 | 17.8 |
| PhoneLM-1.5B-Call | 76.5 | 89.3 |
| Qwen2.5-Coder-1.5B | 50.0 | 63.5 |
| Qwen2.5-1.5B-Instruct | 58.5 | 75.3 |
| Phi-3.5-mini-instruct | 62.0 | 77.7 |
| MiniCPM3-4B | 70.0 | 85.7 |
| Gemma-2-2b-it | 56.5 | 75.8 |
| TinyLlama-1.1B-Chat-v1.0 | 18.0 | 18.7 |
| Llama-3.2-1B-Instruct | 36.0 | 43.8 |
| Llama-3.2-3B-Instruct | 47.5 | 57.9 |
| GPT-4o-mini | 71.0 | 86.1 |

Running PhoneLM

HuggingFace

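from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'mllmTeam/PhoneLM-1.5B-Instruct'
question = "Hello, who are you?"
prompt = [{"role": "user", "content": question}]

# Load the model onto the GPU; trust_remote_code=True allows the repository's custom model code to run.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda', trust_remote_code=True)

# Render the chat messages into a single prompt string using the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt and move the tensors to the GPU.
inp = tokenizer(input_text, return_tensors="pt")
inp = {k: v.to('cuda') for k, v in inp.items()}

# Sample a reply; max_length counts the prompt tokens as well as the generated ones.
out = model.generate(**inp,
                     max_length=256,
                     do_sample=True,
                     temperature=0.7,
                     top_p=0.7
                     )
# Decode the full sequence (prompt plus generated reply) back to text.
text = tokenizer.decode(out[0], skip_special_tokens=True)
print(text)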

mllm

We also provide PhoneLM in the mllm format, which can be used with mllm.

Install mllm

git clone https://github.com/UbiquitousLearning/mllm.git
cd mllm/scripts/
./build.sh

Inference

cd ../bin
./demo_phonelm -m /path/to/model.mllm 

Training PhoneLM

Install Python Environment

pip install -r requirement.txt

Stable Training Stage

We use the following datasets in the stable training stage.

| Type | Dataset | Tokens |
|---|---|---|
| web | DCLM-baseline | 1.35T |
| code | StarCoderData | 112.75B |
| math | OpenWebMath | 13.25B |
| academic | Dolma-algebraic | 12.75B |
| academic | Dolma-arxiv | 29B |
| total | | 1.5T |
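
To see how heavily each source weighs in this mixture, the token counts can be converted into proportions. The sketch below is plain arithmetic on the figures in the table. Whether the training run samples the datasets in exactly these proportions is an assumption, not something stated here.

```python
# Token counts in billions, copied from the stable-stage table (1.35T = 1350B).
stable_mix = {
    "DCLM-baseline": 1350.0,
    "StarCoderData": 112.75,
    "OpenWebMath": 13.25,
    "Dolma-algebraic": 12.75,
    "Dolma-arxiv": 29.0,
}


def mixture_shares(tokens):
    """Return each dataset's fraction of the total token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


total_b = sum(stable_mix.values())
shares = mixture_shares(stable_mix)

print(f"total: {total_b:.2f}B tokens")
for name, share in shares.items():
    print(f"{name:16s} {share:6.2%}")
```

The rows sum to 1517.75B tokens, which the table rounds to 1.5T, and web data accounts for close to 89% of the total.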

Download The Original Data

You can download the datasets from the links provided in the table above using any method. As an example, we use huggingface-cli to download DCLM-baseline:

huggingface-cli download --repo-type dataset --local-dir ./dclm-baseline --local-dir-use-symlinks False --resume-download mlfoundations/dclm-baseline-1.0-parquet

For the remaining datasets, use whichever download method suits the source linked in the table above.

Preprocess the dataset

Before pretraining, the dataset must be tokenized in advance. To do so, you first need to know the dataset's file format and which field is used for pretraining. Take dclm-baseline as an example: its data files are in parquet format, and its Dataset Card shows that the text field of each data entry is used for pretraining. Once you know the format, you can tokenize the data with the following command:

python pretokenize.py path/to/dataset path/to/output_dir \
  --prefix prefix_of_output_file \
  --handler file_format \
  --field field_used_to_pretrain \
  --num_workers workers_to_process \
  --tokenizer_path path/to/tokenizer \
  --max_size max_tokens_of_each_output_file

For example, to tokenize dclm-baseline, run the following command from the PhoneLM directory:

python pretokenize.py path/to/dclm-baseline ./train_datasets/dclm-baseline \
  --prefix dclm-baseline \
  --handler parquet \
  --field text \
  --tokenizer_path tokenizer

The output will look like:

train_datasets/
└── dclm-baseline
    ├── dclm-baseline-000-00000.data
    ├── dclm-baseline-001-00000.data
    ├── dclm-baseline-002-00000.data
    ├── dclm-baseline-003-00000.data
    ...

Train

After performing the same operation on all datasets, the tokenized datasets are stored in train_datasets. Subsequently, you can start pretraining with the following command:

deepspeed train.py --config config_phonelm_1.5b.yaml

Decay Stage

In the decay stage, the data mix includes some datasets from the stable training stage (DCLM-baseline, StarCoderData, and Dolma) as well as high-quality fine-tuning data that is also used in the fine-tuning stage. The following table shows the data:

| Type | Dataset | Tokens |
|---|---|---|
| web | DCLM-baseline | 10B |
| code | StarCoderData | 1.575B |
| code | The Stack Smol | 0.95B |
| academic | Dolma-arxiv | 2.325B |
| academic | Dolma-pes2o | 2.35B |
| math instruct | MathInstruct | 65.25M |
| chat instruct | UltraChat | 1.775B |
| chat instruct | OpenAssistant 2 | 42.25M |
| chat instruct | OpenHermes | 77.25M |
| code instruct | Magicoder Evol Instruct | 30.25M |
| code instruct | CommitPackFT | 0.35B |
| code instruct | Magicoder OSS Instruct | 43.5M |
| function calling | SlimOrca | 209.75M |
| function calling | APIGen | 48.25M |
| function calling | Glaive Function Calling | 57.5M |
| total | | 20B |

The datasets in the table above, other than those used for pretraining, each have their own format. To standardize them in this phase, we convert all SFT data into a chat format and render it as text using a unified template.

As an example, first download the datasets as shown above. Then process them with the following commands:

python prepare_chat.py path/to/MathInstruct chat/MathInstruct --dataset_name MathInstruct # process MathInstruct

python prepare_chat.py ../datasets/Magicoder-OSS-Instruct-75K/ chat/Magicoder --dataset_name Magicoder # process Magicoder

After processing the datasets, the chat directory will look like this:

chat/
├── Magicoder
│   └── 000_Magicoder_00000.parquet
└── MathInstruct
    └── 000_MathInstruct_00000.parquet

The processed data has the following format:

{
  "text": "pretrain data",
  "chat": [
    {"role": "...", "content": "..."},
    ...
  ]
}
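
As an illustration of this record layout, the sketch below builds one record from a list of chat turns. The role prefixes and the newline-joined layout are a hypothetical template chosen for the example, and `render_chat` and `make_record` are illustrative helper names. The template actually applied by prepare_chat.py may differ.

```python
import json

# Hypothetical role markers; the unified template used by prepare_chat.py may differ.
ROLE_PREFIX = {"system": "System: ", "user": "User: ", "assistant": "Assistant: "}


def render_chat(chat):
    """Flatten a list of {"role", "content"} turns into a single training string."""
    return "\n".join(ROLE_PREFIX[turn["role"]] + turn["content"] for turn in chat)


def make_record(chat):
    """Build a record with the rendered text alongside the original chat turns."""
    return {"text": render_chat(chat), "chat": chat}


example_chat = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": "2 + 3 = 5."},
]
record = make_record(example_chat)
print(json.dumps(record, indent=2))
```

Keeping both fields in one record means the same file can feed two stages: the text field for pretraining-style tokenization and the chat field for instruction tuning.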

Then you can tokenize the text field to get the Decay Stage pretrain data using pretokenize.py.

Train

Subsequently, you can start decay stage training with the following command:

deepspeed train.py --config config_phonelm_1.5b_stage2.yaml

Instruction-Following Tuning

In this stage, set up the dataset directory structure as follows:

train_datasets_instructs/
├── commitpackft
│   ├── 000_commitpackft_00000.parquet
│   └── ...
└── ...

The dataset construction is the same as in Decay Stage.

Train

Launch training with the following command:

deepspeed train_instruct.py --config config_phonelm_1.5b_instruct.yaml

The first time train_datasets_instruct is loaded, two directories, train_dataset_test and val_dataset_test, are generated inside the train_datasets_instruct directory. Subsequent runs read data directly from these two directories.

Function Call Tuning

We fine-tuned our model on the DroidCall dataset to equip it with the capability to operate Android phones. We provide an example of fine-tuning on DroidCall; you can also fine-tune in your own way.

First, download the DroidCall dataset and rename it to train_datasets_DroidCall. The dataset structure is as follows:

train_datasets_DroidCall/
└── DroidCall_code_short.jsonl
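
Before training, it can help to check that the file is where the config expects it and to look at a few records. The helper below is a generic sketch for JSON Lines files: `peek_jsonl` is an illustrative name, and the code assumes only that each non-empty line holds one JSON value. It makes no assumption about which fields DroidCall records contain.

```python
import json


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file (one JSON value per line)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) == n:
                break
    return records
```

For example, `peek_jsonl("train_datasets_DroidCall/DroidCall_code_short.jsonl")` returns the first three records, whose keys can then be inspected.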

Train

We provide a simple config for fine-tuning on DroidCall. Start training with the following command:

deepspeed train_instruct.py --config config_phonelm_1.5b_call.yaml

License

The source code of PhoneLM is licensed under GPL-2.0.

Citation

@misc{yi2024phonelmanefficientcapablesmall,
      title={PhoneLM: an Efficient and Capable Small Language Model Family through Principled Pre-training},
      author={Rongjie Yi and Xiang Li and Weikai Xie and Zhenyan Lu and Chenghua Wang and Ao Zhou and Shangguang Wang and Xiwen Zhang and Mengwei Xu},
      year={2024},
      eprint={2411.05046},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.05046}, 
}