YoDa is an acronym for Your Data, Your Model. This starter kit trains a Large Language Model (LLM) on your private data. The goal is to compete with general-purpose solutions on tasks that are related to that private data.
When you work with YoDa, you'll go through several phases until you arrive at a trained and tested model.
- Data generation. Generate synthetic data relevant to your domain. You can use two main data generation methods, depending on the task requirements:
  - Pretraining Generation: Generate a JSONL file containing sections of the provided data. This enables the model to do completion over queries.
  - Finetuning Generation: Process each document to create a series of synthetic questions and answers based on the content. This method uses a powerful LLM (Llama 2 70B) and a pipeline composed of prompting and postprocessing techniques. The generated data is stored in JSONL files. This method teaches the model to follow instructions and answer questions.
- Data preparation. Preprocessing and formatting the generated data to make it suitable for training. This step transforms the data into the required format and structure necessary for training the large language model.
- Training / Finetuning. In this stage, you finetune the model in SambaStudio using your data. Finetuning updates the model's parameters to adapt it to the specific characteristics and patterns present in the prepared dataset. Note that this starter kit does not support Sambaverse because the model needs to be finetuned.
- Evaluation. The evaluation phase creates a set of responses to assess the performance of the finetuned language model. It uses the set of evaluation queries to:
  - Obtain responses from a baseline model.
  - Obtain responses from your custom model.
  - Obtain responses from your custom model when it is given the exact context that was used to generate the evaluation queries.
  - Obtain responses from your custom model using a simple RAG pipeline for response generation.

  Evaluation facilitates further analysis of your model's effectiveness in solving domain-specific tasks.
These instructions will guide you on how to generate training data, preprocess it, train the model, launch the online inference service, and evaluate it.
SambaStudio includes a rich set of open source models that have been customized to run efficiently on RDU. Deploy the LLM of your choice (for example, Llama 2 13B chat) to an endpoint for inference in SambaStudio, either through the GUI or the CLI. See the SambaStudio endpoint documentation.
(Optional) In this starter kit you can use the SambaNova SDK (SNSDK) to run training and inference jobs in SambaStudio. You only need to set your API Authorization Key in your environment. This key is used to access the API resources on SambaStudio. The steps for getting this key are described here.
1. Clone the repo.
git clone https://github.com/sambanova/ai-starter-kit.git
2. Update the LLM API information for SambaStudio.
(Step 1) Update the environment variables file in the root repo directory, ai-starter-kit/.env, to point to the SambaStudio endpoint. For example, for an endpoint with the URL "https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/456789ab-cdef-0123-4567-89abcdef012", update the env file (with no spaces) as:
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
(Step 2) In the config file, set the variable api to "sambastudio".
3. (Optional) Set up a virtual environment.
We recommend that you use virtualenv or a conda environment for installation and that you upgrade pip first.
cd ai-starter-kit/yoda
python3 -m venv yoda_env
source yoda_env/bin/activate
pip install -r requirements.txt
4. Download your dataset and update the path to the data source folder in src_folder and the list of subfolders in the src_subfolders variable in your sn expert config file. The dataset structure consists of the src_folder (str), which contains one or more subfolders, each representing a different file. Each subfolder should contain at least one txt file with the content of that file. The txt files are used as context retrievals for RAG. As an illustration of the data structure, the data folder acts as our src_folder and ['sambanova_resources_blogs','sambastudio'] are our src_subfolders.
5. (Optional) Download and install the SambaNova SNSDK. Follow the instructions in this guide for installing the SambaNova SNSDK and SNAPI. You can skip the Create a virtual environment step since you are using the yoda_env environment you just created.
6. Clone the SambaNova data preparation repository.
deactivate
cd ../..
git clone https://github.com/sambanova/generative_data_prep
cd generative_data_prep
python3 -m venv generative_data_prep_env
source generative_data_prep_env/bin/activate
7. Install the data prep tools following the installation instructions.
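When you update the LLM API information above, the endpoint URL has to be split into BASE_URL, PROJECT_ID, and ENDPOINT_ID. The following is a minimal Python sketch of that split, assuming the URL always ends with the project ID followed by the endpoint ID, as in the example URL. The helper name split_endpoint_url is hypothetical and is not part of the starter kit.

```python
from typing import Dict
from urllib.parse import urlparse


def split_endpoint_url(url: str) -> Dict[str, str]:
    """Split a SambaStudio endpoint URL into the .env values it encodes.

    Assumption: the URL path ends with .../<project_id>/<endpoint_id>,
    as in the example in this README. This helper is hypothetical.
    """
    parsed = urlparse(url)
    # Drop empty segments produced by leading or trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if not parsed.scheme or not parsed.netloc or len(parts) < 2:
        raise ValueError(f"unexpected endpoint URL: {url}")
    return {
        "BASE_URL": f"{parsed.scheme}://{parsed.netloc}",
        "PROJECT_ID": parts[-2],
        "ENDPOINT_ID": parts[-1],
    }


if __name__ == "__main__":
    example = (
        "https://api-stage.sambanova.net/api/predict/nlp/"
        "12345678-9abc-def0-1234-56789abcdef0/"
        "456789ab-cdef-0123-4567-89abcdef0123"
    )
    # Print the values in the KEY="value" form used by the .env file.
    for key, value in split_endpoint_url(example).items():
        print(f'{key}="{value}"')
```

The API keys are not part of the URL, so you still need to copy them from SambaStudio by hand.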
Prerequisites for data generation:
- Follow the steps above to set up a SambaStudio endpoint for the Llama 2 70B Chat model and update the env file.
- Request access to the Meta Llama 2 tokenizer or download a copy, then put the path of the tokenizer or the name of the Hugging Face model in the config file.
- Replace the value of the --config param with your actual config file path. An example config is shown in ./sn_expert_conf.yaml, and this is set as the default parameter for the data generation scripts below.
- In your config file, set the dest_folder, tokenizer, and n_eval_samples parameters.
- Activate your YoDa starter kit environment:
deactivate
cd ..
cd ai-starter-kit/yoda
source yoda_env/bin/activate
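Before you generate data, it can help to confirm that the source folder matches the layout described in the setup steps: one src_folder with the listed src_subfolders, each holding at least one txt file. The following is a minimal Python sketch of such a check. The function name check_source_layout is hypothetical, and the sketch assumes the txt files sit directly inside each subfolder rather than in deeper directories.

```python
from pathlib import Path
from typing import Dict, List


def check_source_layout(src_folder: str, src_subfolders: List[str]) -> Dict[str, int]:
    """Return the number of .txt files found in each expected subfolder.

    Raises FileNotFoundError if a subfolder is missing and ValueError if a
    subfolder has no .txt files. Hypothetical helper, not part of the kit.
    """
    root = Path(src_folder)
    counts: Dict[str, int] = {}
    for name in src_subfolders:
        sub = root / name
        if not sub.is_dir():
            raise FileNotFoundError(f"missing subfolder: {sub}")
        # Assumption: only files directly inside the subfolder are counted.
        n_txt = sum(1 for p in sub.glob("*.txt") if p.is_file())
        if n_txt == 0:
            raise ValueError(f"no .txt files in: {sub}")
        counts[name] = n_txt
    return counts
```

For the illustration shipped with the kit, the call would be check_source_layout("data", ["sambanova_resources_blogs", "sambastudio"]), run from the directory that contains the data folder.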
To generate pretraining data, run this script:
python src/gen_data.py --config ./sn_expert_conf.yaml --purpose pretrain
To generate finetuning data, run this script:
python src/gen_data.py --config ./sn_expert_conf.yaml --purpose finetune
To generate both pretraining and finetuning data, run this script:
python src/gen_data.py --config ./sn_expert_conf.yaml --purpose both
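The generated data is stored in JSONL files, so a quick sanity check before preprocessing is to count the records and see which top-level keys they carry. The following is a minimal Python sketch of that check. The function name summarize_jsonl is hypothetical, and the sketch makes no assumption about which keys the generation scripts actually write.

```python
import json
from collections import Counter
from typing import Tuple


def summarize_jsonl(path: str) -> Tuple[int, Counter]:
    """Count records and tally the top-level keys in a JSONL file.

    Hypothetical helper: it only inspects the file and does not depend on
    the schema produced by the data generation scripts.
    """
    n_records = 0
    key_counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from err
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, key_counts
```

If a key appears in fewer records than the total count, some records are missing that field and are worth inspecting before training.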
To pretrain and finetune on SambaStudio, the data must be in hdf5 files that you can upload to SambaStudio as a dataset.
To preprocess the data:
- Open scripts/preprocess.sh.
- Replace the variable ROOT_GEN_DATA_PREP_DIR with the path to your generative data preparation directory. Also note that PATH_TO_TOKENIZER is the path to either a downloaded tokenizer or the Hugging Face name of the model, for example meta-llama/Llama-2-7b-chat-hf.
Note: If you only want to pretrain, the JSONL file to use as input is article_data.jsonl. If you used finetune as --purpose, the JSONL file to use as input is synthetic_qa_train.jsonl. If you want to do both in the same training job, the JSONL file to use as input is qa_article_mix.jsonl.
- In scripts/preprocess.sh, set the INPUT_FILE parameter to the absolute path of the output JSONL from pretraining/finetuning and set OUTPUT_DIR to the location where you want your hdf5 files to be dumped before you upload them to SambaStudio Datasets.
- Activate generative_data_prep_env:
deactivate
source ../../generative_data_prep/generative_data_prep_env/bin/activate
- Then run the script to preprocess the data.
sh scripts/preprocess.sh
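The choice of INPUT_FILE depends on the --purpose you used for data generation, as the note above describes. The following is a minimal Python sketch of that mapping. The file names come from the note, while the function name input_file_for and the assumption that the files sit directly under your dest_folder are hypothetical; check the actual output location on your machine.

```python
from pathlib import Path

# File names as listed in the note above; the lookup itself is illustrative.
INPUT_FILE_BY_PURPOSE = {
    "pretrain": "article_data.jsonl",
    "finetune": "synthetic_qa_train.jsonl",
    "both": "qa_article_mix.jsonl",
}


def input_file_for(purpose: str, dest_folder: str) -> str:
    """Return the absolute path to use as INPUT_FILE for a given --purpose.

    Assumption: the generated JSONL files are written directly under
    dest_folder. This helper is hypothetical and not part of the kit.
    """
    if purpose not in INPUT_FILE_BY_PURPOSE:
        valid = ", ".join(sorted(INPUT_FILE_BY_PURPOSE))
        raise ValueError(f"unknown purpose {purpose!r}; expected one of: {valid}")
    # resolve() yields an absolute path, which preprocess.sh expects.
    return str((Path(dest_folder) / INPUT_FILE_BY_PURPOSE[purpose]).resolve())
```

The returned string can be pasted into scripts/preprocess.sh as the value of INPUT_FILE.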
In SambaStudio, you need to create and host your model checkpoints. Connect to the SambaStudio GUI and follow these steps:
1. Upload your generated dataset from the gen_data_prep step.
2. Create a project.
3. Run a training job.
4. Create an endpoint for your trained model.
5. Add the endpoint details to the .env file. Now your .env file should look like this:
```yaml
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
#finetuned model endpoint details
FINETUNED_ENDPOINT_ID="your endpoint ID"
FINETUNED_API_KEY="your endpoint API key"
SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
```
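Because evaluation needs the baseline and the finetuned endpoint details at the same time, it can be useful to check that no entry in the .env file is missing or empty. The following is a minimal Python sketch of such a check. The key list mirrors the example above, the helper names are hypothetical, and the small parser is a simplification for illustration. It is not necessarily how the starter kit itself loads the file.

```python
from typing import Dict, List

# Keys taken from the example .env file above.
REQUIRED_KEYS = [
    "BASE_URL", "PROJECT_ID", "ENDPOINT_ID", "API_KEY",
    "YODA_BASE_URL", "YODA_PROJECT_ID",
    "BASELINE_ENDPOINT_ID", "BASELINE_API_KEY",
    "FINETUNED_ENDPOINT_ID", "FINETUNED_API_KEY",
    "SAMBASTUDIO_KEY",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env_text(text)
    return [key for key in required if not values.get(key)]
```

To use it, read the file contents and pass them in, for example missing_keys(open(".env").read()); an empty list means every key in the example is present.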
For evaluation, you can ask the finetuned model questions from the synthetic question-answer pairs that you produced when you generated the finetuning data. You benchmark the finetuned model's responses against responses obtained using RAG as well as from a golden context.
Reactivate the YoDa environment:
deactivate
source yoda_env/bin/activate
To assess the trained model, run the following script, passing in your config file:
python src/evaluate.py --config <sn_expert_conf.yaml>
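The evaluation compares four settings for each query: the baseline model, the finetuned model, the finetuned model with the golden context, and the finetuned model with RAG. The following is a minimal Python sketch of how the prompts for those settings could differ. The prompt template, the function names, and the dictionary keys are all hypothetical; the prompts built by src/evaluate.py may look different.

```python
from typing import Dict, List, Tuple


def build_prompt(question: str, context: str = "") -> str:
    """Build a prompt, prepending context when one is given (hypothetical template)."""
    if context:
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"


def evaluation_variants(
    question: str, golden_context: str, retrieved_chunks: List[str]
) -> Dict[str, Tuple[str, str]]:
    """Map each evaluation setting to (model label, prompt).

    golden_context is the passage the question was generated from;
    retrieved_chunks stands in for the output of a RAG retriever.
    """
    rag_context = "\n\n".join(retrieved_chunks)
    return {
        "baseline": ("baseline", build_prompt(question)),
        "finetuned": ("finetuned", build_prompt(question)),
        "finetuned_golden_context": ("finetuned", build_prompt(question, golden_context)),
        "finetuned_rag": ("finetuned", build_prompt(question, rag_context)),
    }
```

The first two settings send the same prompt to different endpoints, so any difference in their responses comes from finetuning alone. The last two add context, which separates what the model has learned from what it can read at inference time.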
All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:
- scikit-learn (version 1.4.1.post1)
- jsonlines (version 4.0.0)
- transformers (version 4.33)
- wordcloud (version 1.9.3)
- sacrebleu (version 2.4.0)
- datasets (version 2.18.0)
- sqlitedict (version 2.1.0)
- accelerate (version 0.27.2)
- omegaconf (version 2.3.0)
- evaluate (version 0.4.1)
- pycountry (version 23.12.11)
- rouge_score (version 0.1.2)
- parallelformers (version 1.2.7)
- peft (version 0.9.0)
- plotly (version 5.18.0)
- langchain (version 0.1.2)
- pydantic (version 1.10.13)
- python-dotenv (version 1.0.0)
- sseclient (version 0.0.27)