YoDa is an acronym for Your Data, Your Model. This project aims to train a Large Language Model (LLM) using a customer's private data. The goal is to compete with general solutions on tasks that are related to the customer's data.
This phase involves the generation of synthetic data relevant to the customer's domain. Two main data generation methods are employed, which may vary depending on the task requirements:
Pretraining Generation: This method generates a JSONL file containing sections of the provided data. It will enable the model to do completion over queries.
Finetuning Generation: Using a powerful LLM (Llama 2 70B) and a pipeline composed of prompting and postprocessing techniques, this step processes each document to create a series of synthetic questions and answers based on the content. The generated data is stored in JSONL files. This teaches the model to follow instructions and solve questions beyond mere completion.
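As a rough illustration of the two kinds of JSONL output described above, the sketch below writes and reads one record per line. The field names (`text`, `question`, `answer`, `context`) are assumptions made for this example; the files produced by the YoDa scripts may use a different schema.

```python
import json

# Hypothetical records: the field names are illustrative, not the kit's schema.
pretrain_records = [
    {"text": "Section 1 of a customer document ..."},
]
finetune_records = [
    {
        "question": "What does section 1 describe?",
        "answer": "It describes ...",
        "context": "Section 1 of a customer document ...",
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line, which is all the JSONL format requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping one self-contained JSON object per line means the files can be streamed and checked record by record before they are preprocessed.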
Data preparation involves preprocessing and formatting the generated data to make it suitable for training. This step transforms the data into the required format and structure necessary for training the large language model.
In this stage, the large language model is finetuned in SambaStudio using your data. Finetuning includes updating the model's parameters to adapt it to the specific characteristics and patterns present in the prepared dataset.
The evaluation phase creates a set of responses to assess the performance of the finetuned language model on relevant queries.
It involves using the set of evaluation queries for:
- Obtaining responses from a baseline model.
- Obtaining responses from your custom model.
- Obtaining responses from your custom model when the prompt includes the exact context used to generate each evaluation query.
- Obtaining responses from your custom model employing a simple RAG pipeline for response generation.
This will facilitate further analysis of your model's effectiveness in solving domain-specific tasks.
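The four evaluation settings listed above differ only in which model is called and what context is placed in the prompt. A minimal sketch of that idea follows. The function names, the prompt template, and the toy keyword retriever are all assumptions made for illustration; they are not the prompts or the RAG pipeline used by the starter kit.

```python
import re
from typing import Callable, Dict, List


def build_eval_prompts(
    question: str,
    golden_context: str,
    retrieve: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Return one prompt per evaluation setting for a single query."""
    template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    retrieved = "\n".join(retrieve(question))
    return {
        # Same plain question, sent to two different models.
        "baseline": question,
        "custom": question,
        # Custom model, given the context the question was generated from.
        "custom_golden_context": template.format(context=golden_context, question=question),
        # Custom model, given whatever the retriever finds.
        "custom_rag": template.format(context=retrieved, question=question),
    }


def keyword_retriever(corpus: List[str], top_k: int = 1) -> Callable[[str], List[str]]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    def words(text: str) -> set:
        return set(re.findall(r"\w+", text.lower()))

    def retrieve(query: str) -> List[str]:
        query_words = words(query)
        ranked = sorted(corpus, key=lambda doc: len(query_words & words(doc)), reverse=True)
        return ranked[:top_k]

    return retrieve


corpus = ["The warranty lasts two years.", "Shipping takes five days."]
prompts = build_eval_prompts(
    "How long is the warranty?",
    "The warranty lasts two years.",
    keyword_retriever(corpus),
)
```

Comparing the golden-context and RAG answers helps separate what the model learned during finetuning from what it can only answer when the right passage is supplied.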
These instructions will guide you on how to generate training data, preprocess it, train the model, launch the online inference service, and evaluate it.
Begin by deploying a powerful LLM (e.g. Llama 2 70B chat) to an endpoint for inference in SambaStudio, either through the GUI or CLI, as described in the SambaStudio endpoint documentation.
Then deploy your baseline model (e.g. Llama 2 7B) to an endpoint for inference in SambaStudio, either through the GUI or CLI.
Optional: In this starter kit you can use the SambaNova SDK (SNSDK) to run training and inference jobs in SambaStudio. You only need to set your environment API Authorization Key, which is used to access the API resources on SambaStudio. The steps for getting this key are described here.
- Clone the repo.
git clone https://github.com/sambanova/ai-starter-kit.git
- Update the API information for the SambaNova LLM and your environment SambaStudio key.
These are represented as configurable variables in the environment variables file in the root repo directory, ai-starter-kit/.env. For example, a Llama 70B chat endpoint, a Llama 7B baseline model endpoint, and a SambaStudio key "1234567890abcdef987654321fedcba0123456789abcdef" would be entered in the environment file (with no spaces) as:
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
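Because a missing variable or a stray space around `=` is easy to overlook, it can help to check the environment file before running anything. The sketch below is a standalone check written for this example. It is not part of the starter kit, and the helper names `parse_env` and `missing_keys` are hypothetical. The required keys are taken from the example values shown above.

```python
REQUIRED_KEYS = [
    "BASE_URL", "PROJECT_ID", "ENDPOINT_ID", "API_KEY",
    "YODA_BASE_URL", "YODA_PROJECT_ID",
    "BASELINE_ENDPOINT_ID", "BASELINE_API_KEY",
    "SAMBASTUDIO_KEY",
]


def parse_env(text):
    """Parse KEY="value" lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"not a KEY=value line: {raw!r}")
        key, _, value = line.partition("=")
        # The README asks for entries with no spaces, so reject 'KEY = "value"'.
        if key != key.strip() or value != value.strip():
            raise ValueError(f"spaces around '=' in line: {raw!r}")
        values[key] = value.strip('"')
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

A typical use would be `missing_keys(parse_env(open(".env").read()))`, which returns an empty list when every expected variable is set.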
- Install requirements. It is recommended to use a virtualenv or conda environment for installation, and to update pip.
cd ai-starter-kit/yoda
python3 -m venv yoda_env
source yoda_env/bin/activate
pip install -r requirements.txt
- Download the example dataset from here and update the src_folder variable in your sn expert config file with the path of the folder, and list its subfolders in src_subfolders. To include your own data, follow the same step.
- Optionally, download and install the SambaNova SNSDK. Follow the instructions in this guide for installing SambaNova SNSDK and SNAPI (you can omit the "Create a virtual environment" step since you are using the just-created yoda_env environment).
- Download the SambaNova data preparation repository.
deactivate
cd ../..
git clone https://github.com/sambanova/generative_data_prep
cd generative_data_prep
python3 -m venv generative_data_prep_env
source generative_data_prep_env/bin/activate
Then follow the installation guide.
For domain-adaptive pre-training and instruction finetuning data generation, run one of the following scripts.
Note: You will need a SambaStudio endpoint to the Llama 2 70B Chat model, which is used for synthetic data generation, and you must add its configuration to your env file.
You should have requested access to the Meta Llama 2 tokenizer and have either a local copy or granted access to the Hugging Face model. Then put the path of the tokenizer or the name of the HF model in the config file.
Please replace the value of the --config parameter with your actual config file path. An example config is shown in ./sn_expert_conf.yaml, and this is set as the default parameter for the data generation scripts below.
Set the dest_folder, tokenizer and n_eval_samples parameters in your config file.
Activate your YoDa starter kit environment:
deactivate
cd ../..
cd ai-starter-kit/yoda
source yoda_env/bin/activate
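To generate pretraining data:
python -m src.gen_data \
    --config ./sn_expert_conf.yaml \
    --purpose pretrain
To generate finetuning data:
python -m src.gen_data \
    --config ./sn_expert_conf.yaml \
    --purpose finetune
To generate both pretraining and finetuning data:
python -m src.gen_data \
    --config ./sn_expert_conf.yaml \
    --purpose both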
To pretrain and finetune on SambaStudio, the data first needs to be in the form of hdf5 files that can be uploaded as a dataset in SambaStudio.
To preprocess the data, open scripts/preprocess.sh and set the following variables: ROOT_GEN_DATA_PREP_DIR, the path to your generative data preparation directory; INPUT_FILE, the absolute path of the output JSONL from pretraining/finetuning generation; and OUTPUT_DIR, the directory where you want your hdf5 files to be dumped before you upload them to SambaStudio Datasets.
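Before pointing INPUT_FILE at a generated file, it may be worth confirming that every line parses as JSON. The following sketch is an optional sanity check written for this example. It assumes each line should be a JSON object, and `check_jsonl` is a hypothetical helper, not part of generative_data_prep.

```python
import json


def check_jsonl(path):
    """Return (number of valid records, list of (line number, error message))."""
    valid, errors = 0, []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((number, exc.msg))
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                errors.append((number, "record is not a JSON object"))
    return valid, errors
```

If the returned error list is non-empty, the reported line numbers show where the generated file needs attention before it is converted to hdf5.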
Activate the generative_data_prep_env, which was created inside the generative_data_prep directory:
deactivate
source ../../generative_data_prep/generative_data_prep_env/bin/activate
Then run the script:
sh scripts/preprocess.sh
Next, create and host your model checkpoints, which needs to be done on SambaStudio. This can be done in the SambaStudio GUI with the following steps:
- First, upload your generated dataset from the gen_data_prep step.
- Create a project.
- Run a training job.
- Create an endpoint for your trained model.
- Add the endpoint details to the .env file. Your .env file should now look like this:
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
#finetuned model endpoint details
FINETUNED_ENDPOINT_ID="your endpoint ID"
FINETUNED_API_KEY="your endpoint API key"
SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
This training process can also be done with SNAPI and SNSDK. If you are interested in how this is done via SNSDK, please have a look at the WIP notebook using the yoda env.
For our evaluation, we pose to the finetuned model questions from the held-out synthetic question-answer pairs we procured when generating the finetuning data. We benchmark the approach against responses obtained using RAG as well as from a golden context.
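Once responses have been collected for the held-out questions, they need to be compared with the reference answers. The sketch below uses a simple token-overlap F1 as a stand-in metric. This is an assumption made for illustration and is not necessarily what src/evaluate.py computes; the requirements list includes rouge_score and sacrebleu, which offer more established metrics.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score(responses, references):
    """Average token F1 over paired responses and reference answers."""
    scores = [token_f1(p, r) for p, r in zip(responses, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Computing the same score for each evaluation setting (baseline, finetuned, golden context, RAG) gives a quick first comparison before a closer qualitative review.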
Reactivate the YoDa env:
deactivate
source yoda_env/bin/activate
To assess the trained model, execute the following script:
python src/evaluate.py \
    --config sn_expert_conf.yaml
Please replace the value of the --config parameter with your actual config file path.
All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:
- scikit-learn (version 1.4.1.post1)
- jsonlines (version 4.0.0)
- transformers (version 4.33)
- wordcloud (version 1.9.3)
- sacrebleu (version 2.4.0)
- datasets (version 2.18.0)
- sqlitedict (version 2.1.0)
- accelerate (version 0.27.2)
- omegaconf (version 2.3.0)
- evaluate (version 0.4.1)
- pycountry (version 23.12.11)
- rouge_score (version 0.1.2)
- parallelformers (version 1.2.7)
- peft (version 0.9.0)
- plotly (version 5.18.0)
- langchain (version 0.1.2)
- pydantic (version 1.10.13)
- python-dotenv (version 1.0.0)
- sseclient (version 0.0.27)