Commit

yoda tutorial with all scripts

snova-imranr committed Feb 28, 2024
1 parent 170d6a9 commit 7233d08
Showing 16 changed files with 1,290 additions and 0 deletions.
103 changes: 103 additions & 0 deletions yoda/README.md
@@ -0,0 +1,103 @@
# YoDa
YoDa is an acronym for "Your Data, Your Model". This project aims to train a Large Language Model (LLM) on a customer's private data. The goal is to compete with general-purpose solutions on tasks related to that customer's data.

<p align="center">
<img src="YoDa.png" alt="YoDa" width="300">
</p>

## Getting Started

These instructions guide you through generating training data, preprocessing it, training the model, evaluating it, and finally launching the online service.

1. Update the API information for the SambaNova LLM in your `.env` file,
as in the example below (a sanity-check sketch for steps 1 and 3 follows this list):
```
BASE_URL="https://sjc3-demo2.sambanova.net"
PROJECT_ID="60774d44-3cc3-47eb-aa91-87fae2e8655e"
ENDPOINT_ID="b0e414eb-4863-4a8c-9839-3c2dfa718ae5"
API_KEY=""
FINETUNED_BASE_URL="https://sjc1-demo1.sambanova.net"
FINETUNED_PROJECT_ID=""
FINETUNED_ENDPOINT_ID=""
FINETUNED_API_KEY=""
DEMO1_API_KEY=""
```


2. Create and activate a Python virtual environment:
```
conda create -n yoda python=3.10
conda activate yoda
pip install -r requirements.txt
```

3. Download the dataset from [here](https://drive.google.com/drive/folders/10chGQIgJJgBNvIdj8RL2sVwh8txnNkpO) and update
the `src_folder` variable in your config with this path (the sketch below checks it).
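
A minimal sanity-check sketch for steps 1 and 3 above. It assumes `python-dotenv` (already used by the tutorial notebook) and `PyYAML` are installed, and that your config follows `configs/sn_expert_conf.yaml` in exposing a `src_folder` key:

```
import os

import yaml
from dotenv import load_dotenv

load_dotenv('.env')

# Keys from the example .env above; the FINETUNED_* values may stay empty for now.
for key in ('BASE_URL', 'PROJECT_ID', 'ENDPOINT_ID', 'API_KEY'):
    assert os.getenv(key), f'{key} is missing from .env'

# Replace with the path to your own copy of the config.
with open('configs/sn_expert_conf.yaml') as f:
    config = yaml.safe_load(f)

# src_folder should point at the dataset downloaded in step 3.
assert os.path.isdir(config['src_folder']), 'src_folder does not exist'
```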


#### For domain-adaptive pretraining and instruction finetuning

Note: You will need a SambaStudio endpoint for the LLAMA 70B Chat model, which is used for synthetic data generation, and you must add its configuration to your `.env` file.
Please replace `/path/to/config` with your actual path. An example config is shown in `configs/sn_expert_conf.yaml`,
which is set as the default parameter for the data generation scripts below.

#### To generate pretraining data
```
python -m src.gen_data \
    --config /path/to/config \
    --purpose pretrain
```

#### To generate finetuning data
```
python -m src.gen_data \
    --config /path/to/config \
    --purpose finetune
```

#### Or to do both in one go
```
python -m src.gen_data \
    --config /path/to/config \
    --purpose both
```

### Preprocessing
In order to pretrain and finetune on SambaStudio,
we first need the data in the form of hdf5 files that we can upload.
To preprocess the data, open `scripts/preprocess.sh` and set
the variable `ROOT_GEN_DATA_PREP_DIR` to the path of your [generative data preparation](https://github.com/sambanova/generative_data_prep)
directory, `INPUT_FILE` to the output json from the pretraining/finetuning data generation, and
`OUTPUT_DIR` to where you want the hdf5 files dumped before uploading them to
SambaStudio Datasets. Then run:

```
sh scripts/preprocess.sh
```
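
Optionally, you can peek inside the generated shards before uploading. This sketch assumes the `h5py` package is installed and that `/path/to/OUTPUT_DIR` (and the `*.hdf5` pattern) stand in for the `OUTPUT_DIR` you set in `scripts/preprocess.sh`:

```
import glob

import h5py

# List the datasets stored in each generated hdf5 shard.
for path in sorted(glob.glob('/path/to/OUTPUT_DIR/*.hdf5')):
    with h5py.File(path, 'r') as f:
        print(path, list(f.keys()))
```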

### Launching pretraining/finetuning and hosting endpoints on SambaStudio

In our tutorial, we create and host checkpoints, which must be done on SambaStudio.
This can be done in the **SambaStudio GUI** as well as with **snapapi** and **snsdk**. For those
interested in what this looks like with **snsdk**, please have a look at the work-in-progress notebook `SambaStudio_job_spinup.ipynb`.

### Evaluation

For our evaluation, we pose to the finetuned model questions from the held-out synthetic question-answer pairs we procured
while generating the finetuning data. We benchmark the approach against responses obtained by also using RAG (not too dissimilar to the approach in ChipNeMo), as well as against responses given
a golden context.

To assess the trained model, execute the following script:
```
python -m src.evaluate \
    --config /path/to/config.yaml
```
Please replace `/path/to/config.yaml` with your actual path.
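
As a toy illustration of scoring a model answer against its golden answer (a stand-in metric; `src.evaluate` defines the actual evaluation), a token-overlap F1 looks like this:

```
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1('a droid and a jedi', 'a jedi and a droid'))  # 1.0
```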


#### TODOs

- add a baseline for the normal llama 7b model
- light refactoring: tokenizer; can helper.py safely be deleted?
- don't push datasets yet; upload to gdrive
- lower top_k and chunk within articles
295 changes: 295 additions & 0 deletions yoda/SambaStudio_job_spinup.ipynb
@@ -0,0 +1,295 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4854495d-93fe-4ce2-b6aa-d92d7ce2a1e0",
"metadata": {},
"source": [
"### The aim of this notebook is to create our SambaStudio jobs and endpoints"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42b7be23-13cb-4b5b-a3ed-a698647bd590",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"from pprint import pprint\n",
"\n",
"from dotenv import load_dotenv\n",
"load_dotenv('.env')\n",
"\n",
"import json\n",
"from snsdk import SnSdk"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6582fd7-5baa-475e-8996-5af2f6e1382f",
"metadata": {},
"outputs": [],
"source": [
"!pwd"
]
},
{
"cell_type": "markdown",
"id": "ad503230-b644-45fd-a732-ec0ca43fbfdf",
"metadata": {},
"source": [
"For our tutorial we are going to be interacting with SambaStudio at a range of points:\n",
"- source the LLAMA 70B Chat endpoint already hosted on our environment to run inference\n",
"- Upload our target dataset to SambaStudio env1\n",
"- Create a project and a job for domain-adaptive pretraining with our target dataset\n",
"- Finetune the latest checkpoint of the previous job\n",
"- Host the finetuned model at an endpoint\n",
"\n",
"The first of these points is better handled through our `SambaNovaEndpoint` helper function and the others can be done directly on\\\n",
"the SambaStudio GUI or through **snapapi** and **snsdk**.\n",
"\n",
"We will walk you through how to use **snsdk** for our key functions.\n",
"\n",
"To begin with, your `.env` file will have some missing environment variables. Namely, `FINETUNED_PROJECT_ID`, `FINETUNED_ENDPOINT_ID`, and `FINETUNED_API_KEY` which we will create as we go through the tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc11dc51-30ad-4a54-a715-6e592b6349c8",
"metadata": {},
"outputs": [],
"source": [
"!cat .env"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "427cfde1-750e-480e-b621-aefd01e2b095",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from snsdk import SnSdk\n",
"\n",
"sdk = SnSdk(host_url=os.getenv('FINETUNED_BASE_URL'),\n",
" access_key=os.getenv('DEMO1_API_KEY'))"
]
},
{
"cell_type": "markdown",
"id": "055b3b18-cbdc-4c59-bdd2-2793a89a9c0f",
"metadata": {},
"source": [
"If you haven't received an error at this point, it means that you're connected. Well done!"
]
},
{
"cell_type": "markdown",
"id": "24b879aa-44f2-4a3e-a3b3-ecbdf049370f",
"metadata": {},
"source": [
"### Create a project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42185303-c8b7-4e94-82fa-c0e25a95be57",
"metadata": {},
"outputs": [],
"source": [
"response = sdk.create_project(project_name = 'yoda_tutorial2', description = \"A tutorial on using the YODA recipe\")\n",
"response"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "329c0541-f06e-4816-b8d7-fe49de81477f",
"metadata": {},
"outputs": [],
"source": [
"project_id = response['data']['project_id']\n",
"project_id"
]
},
{
"cell_type": "markdown",
"id": "aa0b8072-d881-4009-bd56-9da0fd80d98b",
"metadata": {},
"source": [
"You can fill in `FINETUNED_PROJECT_ID` in your environment variable with this project id."
]
},
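{
"cell_type": "markdown",
"id": "env-write-note",
"metadata": {},
"source": [
"A small convenience sketch: append the new id to `.env` so later steps can read `FINETUNED_PROJECT_ID` (or simply edit the file by hand)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "env-write-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: persist the new project id for later cells/scripts.\n",
"with open('.env', 'a') as f:\n",
"    f.write(f'\\nFINETUNED_PROJECT_ID=\"{project_id}\"\\n')"
]
},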
{
"cell_type": "markdown",
"id": "06fc04ae-f9b2-42c2-a163-303e68c3d666",
"metadata": {},
"source": [
"## Upload our dataset [later]"
]
},
{
"cell_type": "markdown",
"id": "0880b0f6-62be-4955-b204-306f7b566d3b",
"metadata": {},
"source": [
"## DAPT/Finetune the llama7b model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8afbfe8-d22f-4505-93aa-1386ad9cb977",
"metadata": {},
"outputs": [],
"source": [
"# We can check the datasets we have available - we're looking for yoda_qamixed_7btokenized\n",
"sdk.list_datasets()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b21da73-e2b9-4f99-a58f-4bbe8c0f5463",
"metadata": {},
"outputs": [],
"source": [
"dataset_id = sdk.search_dataset('yoda_qamixed_7btokenized')['data']['dataset_id']\n",
"dataset_id"
]
},
{
"cell_type": "markdown",
"id": "e80b79ed-4a28-45cc-8b14-c43b7e53d9b1",
"metadata": {},
"source": [
"We've got our dataset ID which we'll need to reference for finetuning. We also need the model_id for the llama7b model...."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "deea56bc-c300-459e-bd52-7e4913380907",
"metadata": {},
"outputs": [],
"source": [
"model_id = sdk.search_model('Llama-2-7b-chat-hf')['data']['model_id']\n",
"model_id"
]
},
{
"cell_type": "markdown",
"id": "5181b59c-2306-45df-a1bb-8b43a124bb5a",
"metadata": {},
"source": [
"We now have everything to create the training job. TODO: get more infor on the hparams dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32867cc5-2e73-40da-af98-b7c273bb62a7",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"response = sdk.create_job(\n",
" job_type=\"train\",\n",
" project= project_id,\n",
" model_checkpoint= model_id,\n",
" job_name= \"firstjob\",\n",
" description= \"empty description\",\n",
" dataset= dataset_id,\n",
" hyperparams= \"\",\n",
" load_state= True, \n",
" sub_path= \"\",\n",
" parallel_instances= 1,\n",
" )\n",
"response\n",
"\"\"\"\n"
]
},
{
"cell_type": "markdown",
"id": "c53551e2-1fd1-4369-a4b8-aa14c27f9afd",
"metadata": {},
"source": [
"To get the job_id browse through the list of jobs in your project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35e261d4-960f-4018-b4f5-f704fb365175",
"metadata": {},
"outputs": [],
"source": [
"response = sdk.job_info(project=project_id,job=job_id)\n",
"response"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d78401ce-dda0-456d-977e-d321422f4b7d",
"metadata": {},
"outputs": [],
"source": [
"response = sdk.job_info(project=project_id,job=job_id)\n",
"job_status = response['data']['status']\n",
"job_status"
]
},
{
"cell_type": "markdown",
"id": "22d73b33-b6a7-461c-a31b-befa8109e4e9",
"metadata": {},
"source": [
"The job status will print out **'TRAINING'** while it's training and when it is completed it will dosplay **'EXIT_WITH_0'**"
]
},
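{
"cell_type": "markdown",
"id": "poll-status-note",
"metadata": {},
"source": [
"A minimal polling sketch built on the statuses above: re-query `sdk.job_info` until the job leaves **'TRAINING'**."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "poll-status-code",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Poll until the job is no longer training; a completed job reports 'EXIT_WITH_0'.\n",
"while True:\n",
"    job_status = sdk.job_info(project=project_id, job=job_id)['data']['status']\n",
"    print(job_status)\n",
"    if job_status != 'TRAINING':\n",
"        break\n",
"    time.sleep(60)"
]
},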
{
"cell_type": "markdown",
"id": "8a0cea46-3f4c-4d3e-8828-49f3b103d0be",
"metadata": {},
"source": [
"## HOST THE LATEST CHECKPOINT AS AN ENDPOINT [LATER]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3aad072d-80cd-4bd2-8ee6-03b716a70e27",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"language": "python",
"name": "py310"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Binary file added yoda/YoDa.png
