forked from sambanova/ai-starter-kit
Commit 7233d08 (1 parent: 170d6a9), showing 16 changed files with 1,290 additions and 0 deletions.
# YoDa

YoDa is an acronym for "Your Data, Your Model". This project aims to train a language model (LLM) on a customer's private data, with the goal of competing with general-purpose solutions on tasks related to that data.

<p align="center">
<img src="YoDa.png" alt="YoDa" width="300">
</p>

## Getting Started

These instructions will guide you through generating training data, preprocessing it, training the model, evaluating it, and finally launching the online service.

1. Update the API information for the SambaNova LLM in your `.env` file, as in the example below:

```
BASE_URL="https://sjc3-demo2.sambanova.net"
PROJECT_ID="60774d44-3cc3-47eb-aa91-87fae2e8655e"
ENDPOINT_ID="b0e414eb-4863-4a8c-9839-3c2dfa718ae5"
API_KEY=""
FINETUNED_BASE_URL="https://sjc1-demo1.sambanova.net"
FINETUNED_PROJECT_ID=""
FINETUNED_ENDPOINT_ID=""
FINETUNED_API_KEY=""
DEMO1_API_KEY=""
```
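
These values are read at runtime with python-dotenv, which is also how the tutorial notebook loads them. Here is a minimal check that your file is being picked up (a sketch; it only assumes the variable names shown above):

```
# Sanity check that .env is visible to Python; python-dotenv is the same
# mechanism the tutorial notebook uses.
import os

from dotenv import load_dotenv

load_dotenv('.env')
print(os.getenv('BASE_URL'))             # should echo the value from .env
print(os.getenv('API_KEY') is not None)  # True once the key is filled in
```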

2. Create and activate a Python virtual environment:

```
conda create -n yoda python=3.10
conda activate yoda
pip install -r requirements.txt
```

3. Download the dataset from [here](https://drive.google.com/drive/folders/10chGQIgJJgBNvIdj8RL2sVwh8txnNkpO) and update the `src_folder` variable in your config with this path (a quick check is sketched below).
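
To confirm the edit took effect, you can load the config and print the value. This sketch assumes `src_folder` is a top-level key in `configs/sn_expert_conf.yaml`; adjust the lookup if the file nests it differently:

```
# Sanity-check the config edit. Assumes src_folder is a top-level key,
# which may not match the actual layout of sn_expert_conf.yaml.
import yaml

with open('configs/sn_expert_conf.yaml') as f:
    config = yaml.safe_load(f)

print(config['src_folder'])  # should print the path to the downloaded dataset
```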

#### Domain-adaptive pretraining and instruction finetuning

Note: you will need a SambaStudio endpoint for the LLAMA 70B Chat model, which is used for synthetic data generation, and you will need to add its configuration to your `.env` file. In the commands below, replace `/path/to/config` with your actual path. An example config is provided in `configs/sn_expert_conf.yaml`, which is set as the default for the data generation scripts below.

#### To generate pretraining data

```
python -m src.gen_data \
    --config /path/to/config \
    --purpose pretrain
```

#### To generate finetuning data

```
python -m src.gen_data \
    --config /path/to/config \
    --purpose finetune
```

#### To do both in one go

```
python -m src.gen_data \
    --config /path/to/config \
    --purpose both
```
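
For intuition, `--purpose both` plausibly just runs the two generation passes back to back. The sketch below is hypothetical: the argument names match the commands above, but the two `generate_*` functions are placeholders, not the real internals of `src.gen_data`:

```
# Hypothetical sketch of the --purpose dispatch; the two generate_* functions
# are placeholders, not the actual internals of src.gen_data.
import argparse

def generate_pretraining_data(config_path):
    print(f'generating pretraining data with config {config_path}')

def generate_finetuning_data(config_path):
    print(f'generating finetuning data with config {config_path}')

parser = argparse.ArgumentParser()
parser.add_argument('--config', required=True)
parser.add_argument('--purpose', choices=['pretrain', 'finetune', 'both'], required=True)
args = parser.parse_args()

if args.purpose in ('pretrain', 'both'):
    generate_pretraining_data(args.config)
if args.purpose in ('finetune', 'both'):
    generate_finetuning_data(args.config)
```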

### Preprocessing

To pretrain and finetune on SambaStudio, we first need the data in the form of HDF5 files that we can upload. To preprocess the data, open `scripts/preprocess.sh` and set three variables: `ROOT_GEN_DATA_PREP_DIR`, the path to your [generative data preparation](https://github.com/sambanova/generative_data_prep) directory; `INPUT_FILE`, the output JSON from pretraining/finetuning data generation; and `OUTPUT_DIR`, where you want the HDF5 files written before you upload them to SambaStudio Datasets. Then run:

```
sh scripts/preprocess.sh
```

### Launching pretraining/finetuning and hosting endpoints on SambaStudio

In our tutorial, we create and host checkpoints, which needs to be done on SambaStudio. This can be done in the **SambaStudio GUI** as well as with **snapapi** and **snsdk**. For those interested in what this looks like with **snsdk**, please have a look at the WIP notebook `SambaStudio_job_spinup.ipynb`.

### Evaluation

For our evaluation, we pose to the finetuned model questions from the held-out synthetic question-answer pairs we produced while generating the finetuning data. We benchmark the approach against responses obtained using RAG (not too dissimilar to the approach in ChipNeMo), as well as against responses given a golden context.

To assess the trained model, execute the following script:

```
python -m src.evaluate \
    --config /path/to/config.yaml
```

Please replace `/path/to/config.yaml` with your actual path.
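
As one concrete way to score answers against the held-out references, here is a token-overlap F1 sketch, a common QA metric. It is illustrative only; the metric `src.evaluate` actually uses may differ:

```
# Token-overlap F1 between a model answer and a reference answer.
# Illustrative only; not necessarily the metric src.evaluate uses.
from collections import Counter

def f1_score(prediction, reference):
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score('YoDa trains a model on your data', 'YoDa trains your model on your data'))
```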

TODO:

- add a baseline for the normal llama 7b model
- light refactoring - tokenizer - can I safely delete helper.py?
- don't push datasets yet - upload to gdrive
- lower top_k and chunk within each article

---

**`SambaStudio_job_spinup.ipynb`** (new file):

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4854495d-93fe-4ce2-b6aa-d92d7ce2a1e0",
   "metadata": {},
   "source": [
    "### The aim of this notebook is to create our SambaStudio jobs and endpoints"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42b7be23-13cb-4b5b-a3ed-a698647bd590",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "from pprint import pprint\n",
    "\n",
    "from dotenv import load_dotenv\n",
    "load_dotenv('.env')\n",
    "\n",
    "import json\n",
    "from snsdk import SnSdk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6582fd7-5baa-475e-8996-5af2f6e1382f",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad503230-b644-45fd-a732-ec0ca43fbfdf",
   "metadata": {},
   "source": [
    "For our tutorial, we will interact with SambaStudio at several points:\n",
    "- Source the LLAMA 70B Chat endpoint already hosted on our environment to run inference\n",
    "- Upload our target dataset to SambaStudio env1\n",
    "- Create a project and a job for domain-adaptive pretraining with our target dataset\n",
    "- Finetune the latest checkpoint of the previous job\n",
    "- Host the finetuned model at an endpoint\n",
    "\n",
    "The first of these is better handled through our `SambaNovaEndpoint` helper function; the others can be done directly in the SambaStudio GUI or through **snapapi** and **snsdk**.\n",
    "\n",
    "We will walk you through how to use **snsdk** for our key functions.\n",
    "\n",
    "To begin with, your `.env` file will be missing some environment variables, namely `FINETUNED_PROJECT_ID`, `FINETUNED_ENDPOINT_ID`, and `FINETUNED_API_KEY`, which we will fill in as we go through the tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc11dc51-30ad-4a54-a715-6e592b6349c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat .env"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "427cfde1-750e-480e-b621-aefd01e2b095",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from snsdk import SnSdk\n",
    "\n",
    "sdk = SnSdk(host_url=os.getenv('FINETUNED_BASE_URL'),\n",
    "            access_key=os.getenv('DEMO1_API_KEY'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "055b3b18-cbdc-4c59-bdd2-2793a89a9c0f",
   "metadata": {},
   "source": [
    "If you haven't received an error at this point, it means that you're connected. Well done!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24b879aa-44f2-4a3e-a3b3-ecbdf049370f",
   "metadata": {},
   "source": [
    "### Create a project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42185303-c8b7-4e94-82fa-c0e25a95be57",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = sdk.create_project(project_name='yoda_tutorial2', description=\"A tutorial on using the YODA recipe\")\n",
    "response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "329c0541-f06e-4816-b8d7-fe49de81477f",
   "metadata": {},
   "outputs": [],
   "source": [
    "project_id = response['data']['project_id']\n",
    "project_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa0b8072-d881-4009-bd56-9da0fd80d98b",
   "metadata": {},
   "source": [
    "You can fill in `FINETUNED_PROJECT_ID` in your `.env` file with this project ID."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06fc04ae-f9b2-42c2-a163-303e68c3d666",
   "metadata": {},
   "source": [
    "## Upload our dataset [later]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0880b0f6-62be-4955-b204-306f7b566d3b",
   "metadata": {},
   "source": [
    "## DAPT/Finetune the llama7b model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8afbfe8-d22f-4505-93aa-1386ad9cb977",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can check the datasets we have available - we're looking for yoda_qamixed_7btokenized\n",
    "sdk.list_datasets()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b21da73-e2b9-4f99-a58f-4bbe8c0f5463",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_id = sdk.search_dataset('yoda_qamixed_7btokenized')['data']['dataset_id']\n",
    "dataset_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e80b79ed-4a28-45cc-8b14-c43b7e53d9b1",
   "metadata": {},
   "source": [
    "We've got our dataset ID, which we'll need to reference for finetuning. We also need the `model_id` for the llama7b model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "deea56bc-c300-459e-bd52-7e4913380907",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_id = sdk.search_model('Llama-2-7b-chat-hf')['data']['model_id']\n",
    "model_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5181b59c-2306-45df-a1bb-8b43a124bb5a",
   "metadata": {},
   "source": [
    "We now have everything we need to create the training job. TODO: get more info on the hyperparams dict."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32867cc5-2e73-40da-af98-b7c273bb62a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "response = sdk.create_job(\n",
    "    job_type=\"train\",\n",
    "    project=project_id,\n",
    "    model_checkpoint=model_id,\n",
    "    job_name=\"firstjob\",\n",
    "    description=\"empty description\",\n",
    "    dataset=dataset_id,\n",
    "    hyperparams=\"\",\n",
    "    load_state=True,\n",
    "    sub_path=\"\",\n",
    "    parallel_instances=1,\n",
    ")\n",
    "response\n",
    "\"\"\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c53551e2-1fd1-4369-a4b8-aa14c27f9afd",
   "metadata": {},
   "source": [
    "To get the `job_id`, browse the list of jobs in your project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35e261d4-960f-4018-b4f5-f704fb365175",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = sdk.job_info(project=project_id, job=job_id)\n",
    "response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d78401ce-dda0-456d-977e-d321422f4b7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = sdk.job_info(project=project_id, job=job_id)\n",
    "job_status = response['data']['status']\n",
    "job_status"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22d73b33-b6a7-461c-a31b-befa8109e4e9",
   "metadata": {},
   "source": [
    "The job status will read **'TRAINING'** while the job is training; when it has completed, it will display **'EXIT_WITH_0'**."
   ]
  },
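  {
   "cell_type": "markdown",
   "id": "22d73b33-poll-leadin",
   "metadata": {},
   "source": [
    "Rather than re-running the cell above by hand, you can poll until the status changes. This is a minimal added sketch; it assumes the two status strings above are the ones that matter, and other terminal states (e.g. failures) may exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22d73b33-poll-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal polling sketch (added for illustration): wait for the job to\n",
    "# leave the 'TRAINING' state, assuming the status strings shown above.\n",
    "import time\n",
    "\n",
    "while True:\n",
    "    job_status = sdk.job_info(project=project_id, job=job_id)['data']['status']\n",
    "    print(job_status)\n",
    "    if job_status != 'TRAINING':\n",
    "        break\n",
    "    time.sleep(60)\n"
   ]
  },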
  {
   "cell_type": "markdown",
   "id": "8a0cea46-3f4c-4d3e-8828-49f3b103d0be",
   "metadata": {},
   "source": [
    "## Host the latest checkpoint as an endpoint [later]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3aad072d-80cd-4bd2-8ee6-03b716a70e27",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py310",
   "language": "python",
   "name": "py310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}