diff --git a/yoda/README.md b/yoda/README.md
index 8efadd46..ae20f905 100644
--- a/yoda/README.md
+++ b/yoda/README.md
@@ -135,26 +135,20 @@ source/yoda_env/bin/activate
 To generate pretraining data, run this script:
 ```bash
-python -m src/gen_data.py
- --config ./sn_expert_conf.yaml
- --purpose pretrain
+python src/gen_data.py --config ./sn_expert_conf.yaml --purpose pretrain
 ```

 ### Generate finetuning data

 To generate finetuning data, run this script:
 ```bash
-python src/gen_data.py
- --config ./sn_expert_conf.yaml
- --purpose finetune
+python src/gen_data.py --config ./sn_expert_conf.yaml --purpose finetune
 ```

 ### Generate both pretraining and finetuning data

 Run this script:
 ```bash
-python -m src.gen_data
- --config ./sn_expert_conf.yaml
- --purpose both
+python src/gen_data.py --config ./sn_expert_conf.yaml --purpose both
 ```

 ## Preprocess the data
@@ -166,6 +160,10 @@ To preprocess the data:
 2. Replace the variables `ROOT_GEN_DATA_PREP_DIR` with the path to your [generative data preparation](https://github.com/sambanova/generative_data_prep) directory. Also note that `PATH_TO_TOKENIZER` is the path to either a downloaded tokenizer or the huggingface name of the model. For example, `meta-llama/Llama-2-7b-chat-hf`.

+> Note: if you ran with `--purpose pretrain`, the JSONL to use as input is `article_data.jsonl`;
+> if you ran with `--purpose finetune`, the JSONL to use as input is `synthetic_qa_train.jsonl`;
+> if you ran with `--purpose both` (pretraining and finetuning in the same training job), the JSONL to use as input is `qa_article_mix.jsonl`.
+
 3. In `scripts/preprocess.sh`, set the `INPUT_FILE` parameter to the absolute path of the output JSONL from [pretraining/finetuning](#data-generation-1) and set `OUTPUT_DIR` to the location where you want your hdf5 files to be dumped before you upload them to SambaStudio Datasets.
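
As a quick reference for the last step above, here is a minimal sketch of the variable values `scripts/preprocess.sh` expects; the paths below are placeholders, and the input JSONL depends on the `--purpose` used during data generation:

```bash
# Illustrative placeholder values for scripts/preprocess.sh -- adjust to your environment.
ROOT_GEN_DATA_PREP_DIR=/path/to/generative_data_prep   # clone of sambanova/generative_data_prep
PATH_TO_TOKENIZER=meta-llama/Llama-2-7b-chat-hf        # or a path to a downloaded tokenizer
INPUT_FILE=/path/to/output/qa_article_mix.jsonl        # article_data.jsonl (pretrain) or synthetic_qa_train.jsonl (finetune)
OUTPUT_DIR=/path/to/hdf5_output                        # hdf5 files are written here before upload to SambaStudio Datasets
```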