YoDa

Overview
- About this kit
Workflow
Getting Started
Starterkit: Usage
Third-party tools and data sources

Overview

About this kit

YoDa is an acronym for Your Data, Your Model. This project aims to train a Large Language Model (LLM) using a customer's private data. The goal is to compete with general-purpose solutions on tasks related to the customer's data.

Workflow

Data generation

This phase involves the generation of synthetic data relevant to the customer's domain. Two main data generation methods are employed, which may vary depending on the task requirements:

Pretraining Generation: This method generates a JSONL file containing sections of the provided data. It enables the model to do completion over queries.

Finetuning Generation: Using a powerful LLM (Llama 2 70B) and a pipeline composed of prompting and postprocessing techniques, this step processes each document to create a series of synthetic questions and answers based on its content. The generated data is stored in JSONL files. This teaches the model to follow instructions and answer questions, beyond mere completion.
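
As a rough illustration of the two kinds of JSONL output described above, the sketch below writes one record per line. The `prompt` and `completion` field names and the sample text are assumptions made for this example; check the files produced by the generation scripts for the actual schema.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Pretraining-style record (assumed layout): a raw section of a document.
pretrain_records = [
    {"prompt": "", "completion": "Example section text taken from one of your documents."},
]

# Finetuning-style record (assumed layout): a synthetic question and its answer.
finetune_records = [
    {
        "prompt": "Example question about the section?",
        "completion": "Example answer grounded in the section.",
    },
]

print(to_jsonl(pretrain_records))
print(to_jsonl(finetune_records))
```

Each printed line is a standalone JSON object, which is what makes the format easy to stream and to split into training and evaluation sets.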

Data preparation

Data preparation involves preprocessing and formatting the generated data to make it suitable for training. This step transforms the data into the required format and structure necessary for training the large language model.

Training / Finetuning

In this stage, the large language model is finetuned in SambaStudio using your data. Finetuning includes updating the model's parameters to adapt it to the specific characteristics and patterns present in the prepared dataset.

Evaluation

The evaluation phase creates a set of responses to assess the performance of the finetuned language model on relevant queries.

It involves using the set of evaluation queries for:

Obtaining responses from a baseline model.
Obtaining responses from your custom model.
Obtaining responses from your custom model when the prompt includes the exact context that was used to generate the evaluation queries.
Obtaining responses from your custom model employing a simple RAG pipeline for response generation.

This will facilitate further analysis of your model's effectiveness in solving domain-specific tasks.
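
The four evaluation settings listed above differ only in which model answers and whether a context is added to the prompt. The sketch below makes that explicit. The prompt template, the setting names, and the function names are hypothetical and are not taken from the kit's evaluation script.

```python
def build_eval_prompt(question, context=None):
    """Build a prompt, optionally prefixed with a context passage."""
    if context:
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"


def eval_settings(question, golden_context, retrieved_context):
    """Map each evaluation setting to a (model name, prompt) pair."""
    return {
        "baseline": ("baseline", build_eval_prompt(question)),
        "finetuned": ("finetuned", build_eval_prompt(question)),
        # Golden context: the exact passage the question was generated from.
        "finetuned_golden_context": ("finetuned", build_eval_prompt(question, golden_context)),
        # RAG: a passage returned by a retriever at query time.
        "finetuned_rag": ("finetuned", build_eval_prompt(question, retrieved_context)),
    }


settings = eval_settings("Example question?", "golden passage", "retrieved passage")
for name, (model, prompt) in settings.items():
    print(name, "->", model)
```

Comparing the golden-context and RAG settings against the plain finetuned setting shows how much of the answer quality comes from the model's weights and how much from the supplied context.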

Getting Started

These instructions will guide you on how to generate training data, preprocess it, train the model, launch the online inference service, and evaluate it.

Deploy your models in SambaStudio

Begin by deploying a powerful LLM (e.g. Llama 2 70B chat) to an endpoint for inference in SambaStudio, either through the GUI or CLI, as described in the SambaStudio endpoint documentation.

Then deploy your baseline model (e.g. Llama 2 7B) to an endpoint for inference in SambaStudio, either through the GUI or CLI.

Get your SambaStudio API key

Optional: In this starter kit you can use the SambaNova SDK (SNSDK) to run training and inference jobs in SambaStudio. You only need to set your environment API authorization key, which is used to access the API resources on SambaStudio. The steps for getting this key are described here.

Set the starter kit environment

Clone repo.

git clone https://github.com/sambanova/ai-starter-kit.git

Update the API information for the SambaNova LLM and your environment SambaStudio key.

These are configurable variables in the environment file .env in the root directory of the repo (ai-starter-kit/.env). For example, a Llama 2 70B chat endpoint with the URL

"https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/456789ab-cdef-0123-4567-89abcdef0123"

a Llama 2 7B baseline model with the URL

"https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/987654ef-fedc-9876-1234-01fedbac9876"

and a SambaStudio key "1234567890abcdef987654321fedcba0123456789abcdef" would be entered in the environment file (with no spaces) as:
```
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"

YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"

SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
```
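
The example URLs above are the base URL, the project ID, and the endpoint ID joined under `/api/predict/nlp/`. The sketch below parses a few of those variables and reassembles the URL as a quick sanity check. The `parse_env` and `endpoint_url` helpers are hypothetical and written only for this illustration; the kit itself may read the file differently, for instance with python-dotenv.

```python
def parse_env(text):
    """Parse KEY="value" lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


def endpoint_url(base_url, project_id, endpoint_id):
    """Rebuild the prediction URL from its three parts."""
    return f"{base_url}/api/predict/nlp/{project_id}/{endpoint_id}"


sample = """
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
"""

env = parse_env(sample)
print(endpoint_url(env["BASE_URL"], env["PROJECT_ID"], env["ENDPOINT_ID"]))
```

If the printed URL does not match the one shown for your endpoint in SambaStudio, one of the three values was probably copied incorrectly.
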
Install requirements. It is recommended to use virtualenv or conda environment for installation, and to update pip.
```
cd ai-starter-kit/yoda
python3 -m venv yoda_env
source yoda_env/bin/activate
pip install -r requirements.txt
```
Download the example dataset from here, then update the src_folder variable in your sn expert config file with the path of the folder, and src_subfolders with its subfolders. To include your own data, follow the same step.
Optionally, download and install the SambaNova SNSDK. Follow the instructions in this guide for installing SambaNova SNSDK and SNAPI. You can omit the "Create a virtual environment" step, since you are using the yoda_env environment you just created.

Download the SambaNova data preparation repository

 deactivate
 cd ../..
 git clone https://github.com/sambanova/generative_data_prep
 cd generative_data_prep
 python3 -m venv generative_data_prep_env
 source generative_data_prep_env/bin/activate

Then follow the installation guide

Starterkit: Usage

Data Generation

For domain-adaptive pretraining and instruction finetuning data generation, run one of the following scripts.

Note: You need a SambaStudio endpoint for the Llama 2 70B chat model, which is used for synthetic data generation, and its configuration must be added to your .env file.

You should have requested access to the Meta Llama 2 tokenizer and have either a local copy or granted access to the Hugging Face model. Then put the path of the tokenizer, or the name of the HF model, in the config file.

Please replace the value of --config param with your actual config file path. An example config is shown in ./sn_expert_conf.yaml and this is set as the default parameter for the data generation scripts below.

In your config file, set the dest_folder, tokenizer and n_eval_samples parameters.
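
Before launching a generation run, it can help to confirm that the parameters named in this README are filled in. The sketch below checks a config that has already been loaded into a dictionary. The flat key layout, the helper name, and the sample values are assumptions; the real sn_expert_conf.yaml may nest these keys differently.

```python
# Parameter names taken from this README; the flat layout is an assumption.
REQUIRED_KEYS = ("src_folder", "src_subfolders", "dest_folder", "tokenizer", "n_eval_samples")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "", [])]


# Hypothetical values, for illustration only.
example_config = {
    "src_folder": "/path/to/dataset",
    "src_subfolders": ["docs"],
    "dest_folder": "/path/to/output",
    "tokenizer": "",
    "n_eval_samples": 50,
}

print(missing_config_keys(example_config))  # ['tokenizer']
```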

Activate your YoDa starter kit environment

deactivate
cd ..
cd ai-starter-kit/yoda
source yoda_env/bin/activate

To generate pretraining data

python -m src.gen_data \
    --config ./sn_expert_conf.yaml \
    --purpose pretrain

To generate finetuning data

python -m src.gen_data \
    --config ./sn_expert_conf.yaml \
    --purpose finetune

To generate both pretraining and finetuning data

python -m src.gen_data \
    --config ./sn_expert_conf.yaml \
    --purpose both
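
The three commands above differ only in the value of --purpose. The sketch below shows how such a command line could be parsed with argparse and how "both" might expand into the two generation steps. It is a hypothetical stand-in and not the actual src/gen_data.py; the `build_parser` and `steps_for` names are invented for this example.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(description="Sketch of a data generation CLI")
    parser.add_argument("--config", default="./sn_expert_conf.yaml",
                        help="path to the config file")
    parser.add_argument("--purpose", choices=["pretrain", "finetune", "both"],
                        required=True, help="which data to generate")
    return parser


def steps_for(purpose):
    """Expand 'both' into the two individual generation steps."""
    return ["pretrain", "finetune"] if purpose == "both" else [purpose]


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--purpose", "both"])
print(args.config, steps_for(args.purpose))
```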

Data Preprocessing

To pretrain and finetune on SambaStudio, the data first needs to be in HDF5 format so that it can be uploaded as a dataset in SambaStudio. To preprocess the data, open scripts/preprocess.sh and set ROOT_GEN_DATA_PREP_DIR to the path of your generative data preparation directory, INPUT_FILE to the absolute path of the output JSONL file from pretraining/finetuning data generation, and OUTPUT_DIR to the directory where the HDF5 files should be written before you upload them to SambaStudio Datasets.

Activate the generative_data_prep_env

deactivate
source ../../generative_data_prep/generative_data_prep_env/bin/activate

Then run the script

sh scripts/preprocess.sh
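
A malformed input file is a common reason for a preprocessing run to fail late. As a hedged pre-flight check, the sketch below reports which lines of a JSONL input are not valid JSON objects. The helper name is hypothetical and the check is independent of generative_data_prep, which performs its own validation.

```python
import json


def invalid_jsonl_lines(lines):
    """Return the 1-based numbers of non-blank lines that are not JSON objects."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(record, dict):
            bad.append(number)
    return bad


sample = ['{"prompt": "q", "completion": "a"}', "not json", "[1, 2]"]
print(invalid_jsonl_lines(sample))  # [2, 3]
```

To check a real file, pass an open file handle, for example `invalid_jsonl_lines(open(path, encoding="utf-8"))`, where `path` is the value you set for INPUT_FILE.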

Launching pretraining/finetuning and hosting endpoints on SambaStudio

Next, create and host your model checkpoints. This needs to be done on SambaStudio, and can be done in the SambaStudio GUI with the following steps:

Upload your generated dataset from the generative_data_prep step
Create a project
Run a training job
Create an endpoint for your trained model

Add the endpoint details to the .env file. Your .env file should now look like this:

BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"

YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"

#finetuned model endpoint details
FINETUNED_ENDPOINT_ID="your endpoint ID"
FINETUNED_API_KEY="your endpoint API key"

SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"

This training process can also be done with SNAPI and SNSDK. If you are interested in how this is done via SNSDK, have a look at the WIP notebook, using the yoda_env environment.

Evaluation

For the evaluation, we ask the finetuned model questions from the held-out synthetic question-answer pairs that were set aside when generating the finetuning data. We benchmark these responses against those obtained with RAG and with a golden context.

Reactivate the YoDa environment

deactivate 
source yoda_env/bin/activate

To assess the trained model, execute the following script:

python src/evaluate.py \
    --config sn_expert_conf.yaml

Please replace the value of the --config parameter with your actual config file path.

Third-party tools and data sources

All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:

scikit-learn (version 1.4.1.post1)
jsonlines (version 4.0.0)
transformers (version 4.33)
wordcloud (version 1.9.3)
sacrebleu (version 2.4.0)
datasets (version 2.18.0)
sqlitedict (version 2.1.0)
accelerate (version 0.27.2)
omegaconf (version 2.3.0)
evaluate (version 0.4.1)
pycountry (version 23.12.11)
rouge_score (version 0.1.2)
parallelformers (version 1.2.7)
peft (version 0.9.0)
plotly (version 5.18.0)
langchain (version 0.1.2)
pydantic (version 1.10.13)
python-dotenv (version 1.0.0)
sseclient (version 0.0.27)
