Commit

added a note about how to add information about the tokenizer field
snova-imranr committed Apr 23, 2024
1 parent fea73d8 commit a7f6d70
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion yoda/README.md
@@ -163,7 +163,8 @@ To pretrain and finetune on SambaStudio, the data must be hdf5 files that you ca
To preprocess the data:
1. open `scripts/preprocess.sh`
2. Replace the variables `ROOT_GEN_DATA_PREP_DIR` with the path to your [generative data preparation](https://github.com/sambanova/generative_data_prep)
- directory
+ directory. Also note that `PATH_TO_TOKENIZER` is the path to either a downloaded tokenizer or the huggingface name of
+ the model. For example, `meta-llama/Llama-2-7b-chat-hf`.
3. In `scripts/preprocess.sh`, set the `INPUT_FILE` parameter to the absolute path of the output JSONL from [pretraining/finetuning](#data-generation-1) and
set `OUTPUT_DIR` to the location where you want your hdf5 files to be dumped before you upload them to
SambaStudio Datasets.
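
For reference, the variables named in the steps above might be filled in roughly as follows. This is only a sketch: the four variable names and the example model name come from the README text, but every path is a hypothetical placeholder, and the `TOKENIZER_KIND` check is an illustrative addition that is not part of `scripts/preprocess.sh`.

```shell
# Hypothetical placeholder paths; replace them with your own locations.
ROOT_GEN_DATA_PREP_DIR="/path/to/generative_data_prep"

# Either a Hugging Face model name or a directory holding a downloaded tokenizer.
PATH_TO_TOKENIZER="meta-llama/Llama-2-7b-chat-hf"

# Absolute path of the JSONL produced by the data generation step (assumed file name).
INPUT_FILE="/path/to/yoda_data/finetune.jsonl"

# Directory where the hdf5 files are written before upload (assumed location).
OUTPUT_DIR="/path/to/yoda_data/hdf5_out"

# Illustrative check only: a value that is an existing directory is treated as
# a downloaded tokenizer, anything else as a Hugging Face model name.
if [ -d "$PATH_TO_TOKENIZER" ]; then
  TOKENIZER_KIND="local-directory"
else
  TOKENIZER_KIND="hub-name"
fi
echo "tokenizer source: $TOKENIZER_KIND"
```

If the tokenizer was downloaded manually, `PATH_TO_TOKENIZER` would instead point at that directory.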