SambaNova logo

YoDa

Overview

About this kit

YoDa is an acronym for Your Data, Your Model. This project aims to train a large language model (LLM) using a customer's private data. The goal is to compete with general-purpose solutions on tasks that are related to the customer's data.

Workflow

Data generation

This phase involves the generation of synthetic data relevant to the customer's domain. Two main data generation methods are employed, which may vary depending on the task requirements:

Pretraining Generation: This method generates a JSONL file containing sections of the provided data. It enables the model to perform completion over queries.

Finetuning Generation: Using a powerful LLM (Llama 2 70B) and a pipeline of prompting and postprocessing techniques, this step processes each document to create a series of synthetic questions and answers based on its content. The generated data is stored in JSONL files and teaches the model to follow instructions and answer questions, going beyond mere completion.
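As a rough illustration of what these files contain, records produced by the two modes look roughly like the following. This is a minimal sketch only: the field names, example text, and file names are assumptions, and the kit's actual schema may differ.

import json

# Assumed record shapes; check the files produced by gen_data for the real schema.
pretrain_record = {
    "prompt": "",                      # empty prompt: plain completion over a document section
    "completion": "A raw section of text taken from the customer's documents...",
}

finetune_record = {
    "prompt": "A synthetic question generated by Llama 2 70B about the document?",
    "completion": "The corresponding synthetic answer, grounded in that document.",
}

with open("pretrain_example.jsonl", "a") as f:
    f.write(json.dumps(pretrain_record) + "\n")
with open("finetune_example.jsonl", "a") as f:
    f.write(json.dumps(finetune_record) + "\n")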

Data preparation

Data preparation involves preprocessing and formatting the generated data to make it suitable for training. This step transforms the data into the required format and structure necessary for training the large language model.

Training / Finetuning

In this stage, the large language model is finetuned in SambaStudio using your data. Finetuning includes updating the model's parameters to adapt it to the specific characteristics and patterns present in the prepared dataset.

Evaluation

The evaluation phase creates a set of responses to assess the performance of the finetuned language model on relevant queries.

It involves using the set of evaluation queries for:

  • Obtaining responses from a baseline model.
  • Obtaining responses from your custom model.
  • Obtaining responses from your custom model, providing in the prompt the exact context that was used to generate the evaluation queries.
  • Obtaining responses from your custom model employing a simple RAG pipeline for response generation.

This facilitates further analysis of your model's effectiveness at solving domain-specific tasks.
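The comparison logic can be sketched as follows. This is a minimal illustration rather than the kit's actual evaluation code: query_endpoint is a hypothetical stand-in for whatever SambaStudio client you use, retrieve_context stands in for the RAG retrieval step, and the evaluation-sample field names are assumptions.

def query_endpoint(url: str, api_key: str, prompt: str) -> str:
    """Hypothetical stand-in for a SambaStudio inference call."""
    raise NotImplementedError

def retrieve_context(question: str) -> str:
    """Hypothetical stand-in for a simple RAG retrieval step."""
    raise NotImplementedError

def collect_responses(eval_samples, baseline, finetuned):
    """Gather the four response sets described above for later analysis.

    baseline and finetuned are (url, api_key) pairs for the two endpoints.
    """
    results = []
    for sample in eval_samples:  # held-out synthetic question/answer pairs
        question, golden_context = sample["question"], sample["context"]
        results.append({
            "question": question,
            "reference_answer": sample["answer"],
            "baseline": query_endpoint(*baseline, question),
            "finetuned": query_endpoint(*finetuned, question),
            "finetuned_golden_context": query_endpoint(
                *finetuned, f"Context: {golden_context}\n\nQuestion: {question}"),
            "finetuned_rag": query_endpoint(
                *finetuned, f"Context: {retrieve_context(question)}\n\nQuestion: {question}"),
        })
    return results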

Getting Started

These instructions will guide you on how to generate training data, preprocess it, train the model, launch the online inference service, and evaluate it.

Deploy your models in SambaStudio

Begin by deploying a powerful LLM (e.g. Llama 2 70B chat) to an endpoint for inference in SambaStudio, either through the GUI or the CLI, as described in the SambaStudio endpoint documentation.

Then deploy your baseline model (e.g. Llama 2 7B) to an endpoint for inference in SambaStudio, either through the GUI or the CLI.

Get your SambaStudio API key

Optional: In this starter kit you can use the SambaNova SDK (SNSDK) to run training and inference jobs in SambaStudio. You will only need to set your environment API Authorization Key (the Authorization Key is used to access the API resources on SambaStudio). The steps for getting this key are described here.

Set the starter kit environment

  1. Clone repo.

    git clone https://github.com/sambanova/ai-starter-kit.git
  2. Update the API information for the SambaNova LLMs and your SambaStudio environment key.

    These are represented as configurable variables in the environment file sn-ai-starter-kit/.env in the root repo directory. For example, a Llama 2 70B chat endpoint with the URL

    "https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/456789ab-cdef-0123-4567-89abcdef0123"

    a Llama 2 7B baseline model with the URL

    "https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/987654ef-fedc-9876-1234-01fedbac9876"

    and a SambaStudio key "1234567890abcdef987654321fedcba0123456789abcdef" would be entered in the environment file (with no spaces) as follows (a quick check that these variables load correctly is sketched after this list):

    BASE_URL="https://api-stage.sambanova.net"
    PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
    ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
    API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
    
    YODA_BASE_URL="https://api-stage.sambanova.net"
    YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
    BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
    BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
    
    SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
  3. Install requirements. It is recommended to use a virtualenv or conda environment for installation, and to update pip.

    cd ai-starter-kit/yoda
    python3 -m venv yoda_env
    source yoda_env/bin/activate
    pip install -r requirements.txt
  4. Download the example dataset from here and update the src_folder variable in your sn expert config file with the path of the folder, and the subfolders in src_subfolders. To include your own data, follow the same steps.

  5. Optionally, download and install the SambaNova SNSDK. Follow the instructions in this guide for installing the SambaNova SNSDK and SNAPI (you can omit the Create a virtual environment step since you are using the just-created yoda_env environment).

  6. Download the SambaNova data preparation repository:

     deactivate
     cd ../..
     git clone https://github.com/sambanova/generative_data_prep
     cd generative_data_prep
     python3 -m venv generative_data_prep_env
     source generative_data_prep_env/bin/activate

    Then follow the installation guide.
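Once the .env file from step 2 is filled in, a quick way to confirm the variables are picked up is the small check below. It is a sketch only: it relies on python-dotenv (already in requirements.txt) and assumes you run it from ai-starter-kit/yoda, so adjust the path to .env otherwise.

import os
from dotenv import load_dotenv

# Load the environment file from the repo root; adjust the path if needed.
load_dotenv("../.env")

for var in (
    "BASE_URL", "PROJECT_ID", "ENDPOINT_ID", "API_KEY",
    "YODA_BASE_URL", "YODA_PROJECT_ID",
    "BASELINE_ENDPOINT_ID", "BASELINE_API_KEY",
    "SAMBASTUDIO_KEY",
):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")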

Starter kit: Usage

Data Generation

For domain-adaptive pretraining and instruction finetuning data generation, run one of the following scripts.

Note: You will need a SambaStudio endpoint for the Llama 2 70B Chat model, which is used for synthetic data generation, and you must add its configuration to your .env file.

You should have requested access to the Meta Llama 2 tokenizer and either have a local copy or have been granted access to the Hugging Face model; then put the path of the tokenizer or the name of the HF model in the config file.

Please replace the value of the --config param with your actual config file path. An example config is shown in ./sn_expert_conf.yaml, and this is set as the default parameter for the data generation scripts below.

Set the dest_folder, tokenizer and n_eval_samples parameters in your config file.
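A quick way to sanity-check the config before launching generation is to load it and the tokenizer it points to. This is a sketch only: it assumes the parameter names mentioned above (src_folder, src_subfolders, dest_folder, tokenizer, n_eval_samples) and uses omegaconf and transformers, both already in requirements.txt.

from omegaconf import OmegaConf
from transformers import AutoTokenizer

cfg = OmegaConf.load("sn_expert_conf.yaml")

# Parameter names assumed from the instructions above; adjust to your config.
for key in ("src_folder", "src_subfolders", "dest_folder", "tokenizer", "n_eval_samples"):
    print(key, "->", cfg.get(key, "MISSING"))

# This will fail if you have not been granted access to the Llama 2 tokenizer
# (local path or gated Hugging Face model).
tokenizer = AutoTokenizer.from_pretrained(cfg.tokenizer)
print("Tokenizer loaded, vocab size:", tokenizer.vocab_size)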

Activate your YoDa starter kit environment

deactivate
cd ../..
cd ai-starter-kit/yoda
source yoda_env/bin/activate

To generate pretraining data

python -m src.gen_data
    --config ./sn_expert_conf.yaml
    --purpose pretrain 

To generate finetuning data

python -m src.gen_data
    --config ./sn_expert_conf.yaml
    --purpose finetune 

Both pretraining and finetuning data generation

python -m src.gen_data
    --config ./sn_expert_conf.yaml
    --purpose both 

Data Preprocessing

In order to pretrain and finetune on SambaStudio, we first need the data to be in the form of hdf5 files that we can upload as a dataset to SambaStudio. To preprocess the data, open scripts/preprocess.sh and replace the ROOT_GEN_DATA_PREP_DIR variable with the path to your generative data preparation directory, set the absolute path of the output JSONL from pretraining/finetuning in the INPUT_FILE parameter of scripts/preprocess.sh, and set an OUTPUT_DIR where you want your hdf5 files to be dumped before you upload them to SambaStudio Datasets.
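Before switching environments and running the script, it can help to confirm that the generated JSONL parses cleanly. The check below is a sketch, run from the yoda_env (which has jsonlines installed): the file path is a placeholder for your INPUT_FILE, and the prompt/completion field names are assumptions about what gen_data produced.

import jsonlines

INPUT_FILE = "path/to/your/generated_data.jsonl"  # same path you set in scripts/preprocess.sh

count = 0
with jsonlines.open(INPUT_FILE) as reader:
    for record in reader:
        # Assumed prompt/completion schema; adjust if your records differ.
        assert "prompt" in record and "completion" in record, record
        count += 1
print(f"{INPUT_FILE}: {count} valid records")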

Activate the generative_data_prep_env

deactivate
source ../../generative_data_prep_env/bin/activate

Then run the script

sh scripts/preprocess.sh

Launching pretraining/finetuning and hosting endpoints on SambaStudio

Next, you need to create and host your model checkpoints, which is done in SambaStudio. This can be done in the SambaStudio GUI following these steps:

  1. First, upload your generated dataset from the data preparation step.

  2. Create a project

  3. Run a training job

  4. Create an endpoint for your trained model

  5. Add the endpoint details to the .env file; now your .env file should look like this:

    BASE_URL="https://api-stage.sambanova.net"
    PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
    ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
    API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
    
    YODA_BASE_URL="https://api-stage.sambanova.net"
    YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
    BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
    BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
    
    #finetuned model endpoint details
    FINETUNED_ENDPOINT_ID="your endpoint ID"
    FINETUNED_API_KEY="your endpoint API key"
    
    SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"

This training process can also be done with SNAPI and SNSDK. If you are interested in how this is done via SNSDK, please have a look at the WIP notebook, using the yoda_env.

Evaluation

For our evaluation, we pose to the finetuned model questions from the held-out synthetic question-answer pairs we procured while generating the finetuning data. We benchmark the approach against responses obtained with RAG as well as with a golden context.

Reactivate the YoDa env

deactivate 
source yoda_env/bin/activate

To assess the trained model, execute the following script:

python src/evaluate.py 
    --config sn_expert_conf.yaml

Please replace the --config parameter with your actual config file path.

Third-party tools and data sources

All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:

  • scikit-learn (version 1.4.1.post1)
  • jsonlines (version 4.0.0)
  • transformers (version 4.33)
  • wordcloud (version 1.9.3)
  • sacrebleu (version 2.4.0)
  • datasets (version 2.18.0)
  • sqlitedict (version 2.1.0)
  • accelerate (version 0.27.2)
  • omegaconf (version 2.3.0)
  • evaluate (version 0.4.1)
  • pycountry (version 23.12.11)
  • rouge_score (version 0.1.2)
  • parallelformers (version 1.2.7)
  • peft (version 0.9.0)
  • plotly (version 5.18.0)
  • langchain (version 0.1.2)
  • pydantic (version 1.10.13)
  • python-dotenv (version 1.0.0)
  • sseclient (version 0.0.27)