# AISK: SQL Model using SambaStudio

<a href="https://sambanova.ai/">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="../images/SambaNova-light-logo-1.png" height="60">
<img alt="SambaNova logo" src="../images/SambaNova-dark-logo-1.png" height="60">
</picture>
</a>

Fine-tuning SQL
======================

Questions? Just <a href="https://discord.gg/XF5Sf2sa" target="_blank">message us</a> on Discord <a href="https://discord.gg/XF5Sf2sa" target="_blank"><img src="https://github.com/sambanova/ai-starter-kit/assets/150964187/aef53b52-1dc0-4cbf-a3be-55048675f583" alt="Discord" width="22"/></a> or <a href="https://github.com/sambanova/ai-starter-kit/issues/new/choose" target="_blank">create an issue</a> in GitHub. We're happy to help live!

<!-- TOC -->

- [Fine-tuning SQL](#fine-tuning-sql)
- [Overview](#overview)
- [Before you begin](#before-you-begin)
    - [Clone this repository](#clone-this-repository)
    - [Install dependencies](#install-dependencies)
- [Use the starter kit](#use-the-starter-kit)
    - [Data download](#data-download)
    - [Data preparation](#data-preparation)
        - [Pretraining data prep](#pretraining-data-prep)
        - [Fine-tuning data prep](#fine-tuning-data-prep)
    - [Basic QA-QC](#basic-qa-qc)
    - [Load the dataset on SambaStudio](#load-the-dataset-on-sambastudio)
    - [Training](#training)
        - [Pre-training](#pre-training)
        - [Fine-tuning](#fine-tuning)
    - [Hosting](#hosting)
    - [Inference](#inference)
    - [Benchmarking](#benchmarking)
- [Third-party tools and data sources](#third-party-tools-and-data-sources)

<!-- /TOC -->

# Overview

This starter kit is an example of an LLM fine-tuning process that leverages the SambaStudio platform. The workflow shows how to fine-tune a SQL model for question answering, improving performance on SQL generation tasks. The kit includes:

- A Jupyter Notebook for downloading pre-training and fine-tuning SQL datasets
- A detailed guide for generating the training files
- A Notebook for quality control and evaluation of the generated training files
- A guide on uploading datasets and fine-tuning a model of choice using the SambaStudio graphical user interface
- A Notebook for performing inference with the trained model

# Before you begin

You have to set up your environment before you can run the starter kit.

## Clone this repository

Clone the starter kit repo.

```bash
git clone --recurse-submodules https://github.com/sambanova/ai-starter-kit.git
```

## Install dependencies

We recommend that you run the starter kit in a virtual environment.

```bash
cd ai-starter-kit/
git submodule update --init
cd fine_tuning_sql
python3 -m venv fine_tuning_sql_env
source fine_tuning_sql_env/bin/activate
pip install -r requirements.txt
```

# Use the starter kit

## Data download
Follow the notebook [1_download_data.ipynb](notebooks/1_download_data.ipynb) to download and store pre-training and fine-tuning datasets.
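
As a quick illustration of what that notebook does, the sketch below pulls a public text-to-SQL dataset with the Hugging Face `datasets` library (listed in `requirements.txt`). The dataset name and output path here are illustrative, not necessarily the kit's actual sources.

```python
# Sketch of the download step, assuming a public text-to-SQL dataset on
# the Hugging Face Hub. Dataset name and output path are illustrative.
from datasets import load_dataset

sql_ds = load_dataset("b-mc2/sql-create-context", split="train")
sql_ds.to_json("data/downloaded/sql_qa.jsonl")  # store locally for data prep
print(sql_ds[0])  # one question/context/SQL record
```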

## Data preparation

Data preparation uses the `generative_data_prep` pipeline (the input and output path arguments are omitted here); the `--keep_split_jsonls` flag keeps the per-split JSONL files alongside the prepared HDF5 output:

```bash
python -m generative_data_prep pipeline \
    --keep_split_jsonls
```
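
For reference, `generative_data_prep` consumes JSONL input, one training record per line. Below is a minimal sketch of writing fine-tuning pairs in the prompt/completion field layout described in that tool's documentation; the paths and example content are illustrative.

```python
import json

# Write one JSON object per line; "prompt"/"completion" is the field
# layout generative_data_prep's docs describe. Content is illustrative.
examples = [
    {
        "prompt": "Schema: employees(id, name, dept)\nQuestion: How many employees are in each department?\nSQL: ",
        "completion": "SELECT dept, COUNT(*) FROM employees GROUP BY dept;",
    },
]
with open("data/fine_tuning/train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```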

## Basic QA-QC

One can do basic QA-QC by loading the HDF5 and JSONL files, as shown in the notebook [2_qa_data.ipynb](notebooks/2_qa_data.ipynb).
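
For a quick standalone spot check outside the notebook, something like the following works, assuming `h5py` is available and that the prepared HDF5 shards expose an `input_ids` dataset, as `generative_data_prep` output does; the file paths are illustrative.

```python
import json

import h5py  # assumption: installed in the environment

# Spot-check a prepared HDF5 shard: one row per packed training sequence.
with h5py.File("data/prepared/train_0.h5", "r") as f:
    print(f["input_ids"].shape)  # (num_sequences, max_seq_length)

# Spot-check a kept JSONL split: every line must parse and have
# non-empty prompt/completion fields.
with open("data/prepared/splits/train_0.jsonl") as f:
    for line in f:
        record = json.loads(line)
        assert record["prompt"] and record["completion"]
```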

## Load the dataset on SambaStudio

Once data preparation is done, the datasets can be uploaded directly to SambaStudio. Please refer to the [SambaStudio documentation on adding datasets](https://docs.sambanova.ai/sambastudio/latest/add-datasets.html).

## Training

### Pre-training
Once the datasets are uploaded, one can pre-train and fine-tune base models on these datasets.
We use **Llama7B Base** as the starting model for further training. Below is a snapshot showing the hyperparameters for the training job.

![](images/Pretraining_SN.png)

### Fine-tuning
We start with a continuously pretrained model and fine-tune it further. Below is a snapshot showing the hyperparameters for the fine-tuning job.

![](images/Fine_tuning_SN.png)
The training loss curve for the fine-tuning job is shown below.

![](images/Fine_tuning_loss_SN.png)

## Hosting

The final fine-tuned model can then be hosted on SambaStudio. Once it is hosted, the API information, including environment variables such as `BASE_URL`, `PROJECT_ID`, `ENDPOINT_ID`, and `API_KEY`, can be used to run inference.

## Inference

The notebook [3_inference__model.ipynb](notebooks/3_inference__model.ipynb) uses that API information to run inference with the hosted model.
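
As a rough illustration of how those environment variables come together, the sketch below posts a prompt to a SambaStudio generic predict endpoint. The exact URL path and payload schema depend on your SambaStudio version and endpoint type, so treat both as assumptions and confirm them against your endpoint's API details.

```python
import os

import requests

# Assumption: a generic predict endpoint of the form
# {BASE_URL}/api/predict/generic/{PROJECT_ID}/{ENDPOINT_ID};
# check the endpoint's API tab for the exact path and payload schema.
url = (
    f"{os.environ['BASE_URL']}/api/predict/generic/"
    f"{os.environ['PROJECT_ID']}/{os.environ['ENDPOINT_ID']}"
)
payload = {"instances": ["Question: List all customers from France.\nSQL:"]}
response = requests.post(url, headers={"key": os.environ["API_KEY"]}, json=payload)
print(response.json())
```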

## Benchmarking

The [Archerfish benchmark repository](https://github.com/archerfish-bench/benchmark) can be used to benchmark your fine-tuned SQL model.

# Third-party tools and data sources

All the packages and tools are listed in the `requirements.txt` file in the project directory. Some of the main packages are:

- langchain (version 0.2.6)
- langchain-community (version 0.2.6)
- transformers (version 4.41.2)
- datasets (version 2.20.0)
- jupyter_client (version 8.6.0)
- jupyter_core (version 5.7.1)
- jupyterlab-widgets (version 3.0.9)
- SQLAlchemy (version 2.0.30)
