SambaNova logo

YoDa

Overview

YoDa is an acronym for Your Data, Your Model. This starter kit trains a large language model (LLM) on your private data, with the goal of competing with general solutions on tasks related to that data.

Workflow overview

When you work with YoDa, you'll go through several phases until you arrive at a trained and tested model.

  1. Data generation. Generation of synthetic data relevant to your domain. Two main data generation methods, which vary depending on the task requirements, can be used:
    • Pretraining Generation: Generate a JSONL file containing sections of the provided data. This enables the model to do completion over queries.
    • Finetuning Generation: Process each document to create a series of synthetic questions and answers based on the content. This method uses a powerful LLM (Llama 2 70B) and a pipeline composed of prompting and postprocessing techniques. The generated data is stored in JSONL files. This method teaches the model to follow instructions and answer questions. (A sketch of both record shapes appears after this list.)
  2. Data preparation. Preprocessing and formatting the generated data to make it suitable for training. This step transforms the data into the required format and structure necessary for training the large language model.
  3. Training / Finetuning. In this stage, you fine-tune the model in SambaStudio using your data. Finetuning includes updating the model's parameters to adapt it to the specific characteristics and patterns present in the prepared dataset. Note that this starter kit does not support Sambaverse because the model needs to be finetuned.
  4. Evaluation. The evaluation phase creates a set of responses to assess the performance of the finetuned language model. It involves using the set of evaluation queries for:
    • Obtaining responses from a baseline model.
    • Obtaining responses from your custom model.
    • Obtaining responses from your custom model when it is given the exact context that was used to generate the evaluation queries.
    • Obtaining responses from your custom model employing a simple RAG pipeline for response generation.
  Evaluation facilitates further analysis of your model's effectiveness in solving the domain-specific tasks.
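
To make the two generation modes concrete, here is a minimal sketch of the kind of JSONL records they might produce, written with the jsonlines package from requirements.txt. The field names are illustrative assumptions only, not the exact schema emitted by the kit's scripts.

```python
# Minimal sketch: illustrative record shapes only; the kit's actual schema may differ.
import jsonlines

# Pretraining-style record: a raw section of one of your documents.
pretrain_record = {"text": "SambaStudio lets you deploy open source models to endpoints ..."}

# Finetuning-style record: a synthetic question/answer pair grounded in a document.
finetune_record = {
    "question": "What can you deploy with SambaStudio?",
    "answer": "Open source models, served from inference endpoints.",
}

with jsonlines.open("example_generated_data.jsonl", mode="w") as writer:
    writer.write(pretrain_record)
    writer.write(finetune_record)
```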

Getting Started

These instructions will guide you on how to generate training data, preprocess it, train the model, launch the online inference service, and evaluate it.

Deploy a SambaStudio inference endpoint

SambaStudio includes a rich set of open source models that have been customized to run efficiently on RDU. Deploy the LLM of your choice (e.g., Llama 2 13B Chat) to an endpoint for inference in SambaStudio, either through the GUI or the CLI. See the SambaStudio endpoint documentation.

Get your SambaStudio API key

(Optional) In this starter kit you can use the SambaNova SDK (SNSDK) to run training and inference jobs in SambaStudio. You only need to set your environment API authorization key (the authorization key is used to access the API resources on SambaStudio); the steps for getting this key are described here.

Set the starter kit environment

  1. Clone the repo.
    git clone https://github.com/sambanova/ai-starter-kit.git
  2. Update the LLM API information for SambaStudio. (Step 1) Update the environment variables file in the root repo directory sn-ai-starter-kit/.env to point to the SambaStudio endpoint. For example, for an endpoint with the URL "https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/456789ab-cdef-0123-4567-89abcdef0123", update the env file (with no spaces) as follows (a minimal sketch showing how these variables can be read appears after this list):
    BASE_URL="https://api-stage.sambanova.net"
    PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
    ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
    API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
    
    YODA_BASE_URL="https://api-stage.sambanova.net"
    YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
    BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
    BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"
    
    SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
    

(Step 2) In the config file, set the variable api to "sambastudio".
  3. (Optional) Set up a virtual environment. We recommend that you use virtualenv or a conda environment for installation and run pip update.
    cd ai-starter-kit/yoda
    python3 -m venv yoda_env
    source yoda_env/bin/activate
    pip install -r requirements.txt
  4. Download your dataset and update the path to the data source folder in the src_folder variable and the list of subfolders in the src_subfolders variable in your sn expert config file. The dataset structure consists of the src_folder (str), which contains one or more subfolders, each representing a different source. Each subfolder should contain at least one txt file with the content of that source; the txt files will be used as context retrievals for RAG. We have added an illustration of this data structure: the data folder acts as our src_folder, and ['sambanova_resources_blogs', 'sambastudio'] are our src_subfolders.
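
The snippet below is a minimal sketch of how the endpoint variables above can be read with python-dotenv (listed in requirements.txt). The kit's own modules may load them differently, and the relative .env path is an assumption.

```python
# Minimal sketch, not the kit's actual loading code.
# Assumes you run it from ai-starter-kit/yoda and the .env sits in the repo root.
import os
from dotenv import load_dotenv

load_dotenv("../.env")

base_url = os.getenv("BASE_URL")
project_id = os.getenv("PROJECT_ID")
endpoint_id = os.getenv("ENDPOINT_ID")
api_key = os.getenv("API_KEY")

# A SambaStudio predict URL follows the pattern shown in the example above.
print(f"{base_url}/api/predict/nlp/{project_id}/{endpoint_id}")
```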

  1. (Optional) Download and install the SambaNova SNSDK. Follow the instructions in this guide for installing SambaNova SNSDK and SNAPI (you can skip the Create a virtual environment step since you are using the yoda_env environment you just created).

  2. Clone the SambaNova data preparation repository

     deactivate
     cd ../..
     git clone https://github.com/sambanova/generative_data_prep
     cd generative_data_prep
     python3 -m venv generative_data_prep_env
     source generative_data_prep_env/bin/activate
  3. Install the data prep tools following the installation instructions.

Starter kit usage

Data preparation

Prerequisites for data generation:

  1. Follow the steps above to set up a SambaStudio endpoint for the Llama 2 70B Chat model, and update the env file accordingly.
  2. Request access to the Meta Llama 2 tokenizer or download a copy, then put the path to the tokenizer or the name of the Hugging Face model in the config file.
  3. Replace the value of the --config param with your actual config file path. An example config is shown in ./sn_expert_conf.yaml, and it is set as the default parameter for the data generation scripts below.
  4. In your config file, set the dest_folder, tokenizer and n_eval_samples parameters (a config-check sketch appears after these steps).
  5. Activate your YoDa starter kit environment:
deactivate
cd ../..
cd ai-starter-kit/yoda
source yoda_env/bin/activate
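
As a quick check that the prerequisites above are in place, the following sketch loads the config with omegaconf (listed in requirements.txt) and prints the keys this README refers to; the actual config file may contain additional fields.

```python
# Minimal sketch: sanity-check the YoDa config file before generating data.
# Key names below are the ones mentioned in this README; others may exist.
from omegaconf import OmegaConf

cfg = OmegaConf.load("./sn_expert_conf.yaml")

for key in ["api", "src_folder", "src_subfolders", "dest_folder", "tokenizer", "n_eval_samples"]:
    print(f"{key}: {cfg.get(key, '<missing>')}")
```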

Generate pretraining data

To generate pretraining data, run this script:

python src/gen_data.py  --config ./sn_expert_conf.yaml --purpose pretrain 

Generate finetuning data

To generate finetuning data, run this script:

python src/gen_data.py --config ./sn_expert_conf.yaml --purpose finetune 

Generate both pretraining and finetuning data

Run this script:

python src/gen_data.py --config ./sn_expert_conf.yaml --purpose both

Preprocess the data

To pretrain and finetune on SambaStudio, the data must be provided as HDF5 files that you can upload to SambaStudio as a dataset.

To preprocess the data:

  1. Open scripts/preprocess.sh.
  2. Replace the variable ROOT_GEN_DATA_PREP_DIR with the path to your generative data preparation directory. Also note that PATH_TO_TOKENIZER is the path to either a downloaded tokenizer or the Hugging Face name of the model, for example meta-llama/Llama-2-7b-chat-hf (a tokenizer-check sketch appears after these steps).

Note: if you only want to pretrain, the JSONL to use as input is article_data.jsonl; if you used finetune as --purpose, the JSONL to use as input is synthetic_qa_train.jsonl; if you want to do both in the same training job, the JSONL to use as input is qa_article_mix.jsonl.

  3. In scripts/preprocess.sh, set the INPUT_FILE parameter to the absolute path of the output JSONL from pretraining/finetuning and set OUTPUT_DIR to the location where you want your HDF5 files to be dumped before you upload them to SambaStudio Datasets.
  4. Activate generative_data_prep_env:
deactivate
source ../../generative_data_prep/generative_data_prep_env/bin/activate
  5. Then run the script to preprocess the data:
sh scripts/preprocess.sh
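
Before running the preprocessing script, you can verify that the tokenizer referenced by PATH_TO_TOKENIZER is accessible. This is a minimal sketch using transformers (in requirements.txt); loading the gated meta-llama repository requires that your access request was approved, and a local tokenizer path works equally well.

```python
# Minimal sketch: check that the tokenizer used by preprocess.sh can be loaded.
from transformers import AutoTokenizer

# Use either the Hugging Face model name or the path to a downloaded tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print(tokenizer("YoDa tokenizer sanity check")["input_ids"])
```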

Perform pretraining/finetuning and host endpoints on SambaStudio

In SambaStudio, you need to create and host your model checkpoints. Connect to the SambaStudio GUI and follow these steps:

  1. Upload your generated dataset from the generative_data_prep step.

  2. Create a project.

  3. Run a training job.

  4. Create an endpoint for your trained model.

  5. Add the endpoint details to the .env file. Now your .env file should look like this:

```yaml

BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"

YODA_BASE_URL="https://api-stage.sambanova.net"
YODA_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
BASELINE_ENDPOINT_ID="987654ef-fedc-9876-1234-01fedbac9876"
BASELINE_API_KEY="12fedcba-9876-1234-abcd76543"

#finetuned model endpoint details
FINETUNED_ENDPOINT_ID="your endpoint ID"
FINETUNED_API_KEY="your endpoint API key"

SAMBASTUDIO_KEY="1234567890abcdef987654321fedcba0123456789abcdef"
```

Evaluation

For evaluation, you can ask the finetuned model questions from the synthetic question-answer pairs that you created while generating the finetuning data. You benchmark this approach against responses obtained using RAG and responses obtained with the golden context.

Reactivate the YoDa environment:

deactivate 
source yoda_env/bin/activate

To assess the trained model, run the following script, passing in your config file:

python src/evaluate.py --config <sn_expert_conf.yaml>
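
If you want a feel for the kind of comparison the evaluation produces, the sketch below scores a single response against a reference answer with rouge_score and sacrebleu (both in requirements.txt). It is only an illustration; evaluate.py performs its own, more complete comparison across the baseline, finetuned, golden-context, and RAG responses.

```python
# Minimal sketch, not the kit's evaluation code: score one response against a reference.
import sacrebleu
from rouge_score import rouge_scorer

reference = "SambaStudio lets you deploy models to inference endpoints."
response = "You can deploy models to endpoints with SambaStudio."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", scorer.score(reference, response)["rougeL"].fmeasure)
print("BLEU:", sacrebleu.sentence_bleu(response, [reference]).score)
```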

Third-party tools and data sources

All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:

  • scikit-learn (version 1.4.1.post1)
  • jsonlines (version 4.0.0)
  • transformers (version 4.33)
  • wordcloud (version 1.9.3)
  • sacrebleu (version 2.4.0)
  • datasets (version 2.18.0)
  • sqlitedict (version 2.1.0)
  • accelerate (version 0.27.2)
  • omegaconf (version 2.3.0)
  • evaluate (version 0.4.1)
  • pycountry (version 23.12.11)
  • rouge_score (version 0.1.2)
  • parallelformers (version 1.2.7)
  • peft (version 0.9.0)
  • plotly (version 5.18.0)
  • langchain (version 0.1.2)
  • pydantic (version 1.10.13)
  • python-dotenv (version 1.0.0)
  • sseclient (version 0.0.27)