
Commit

update readme
agarwalishika committed Oct 29, 2024
1 parent 4bf9f78 commit 56d7da5
Showing 5 changed files with 59 additions and 48 deletions.
41 changes: 41 additions & 0 deletions README.md
@@ -0,0 +1,41 @@
# DELIFT: Data Efficient Language model Instruction Fine Tuning

This repo contains the code for [DELIFT: Data Efficient Language model Instruction Fine Tuning](https://arxiv.org/abs/LINK). DELIFT is a unified fine-tuning algorithm that optimizes data subset selection across the three stages of fine-tuning:
- **Stage 1 (Instruction Tuning)**: enhancing a model's ability to follow general instructions
- **Stage 2 (Task-Specific Fine-Tuning)**: refining a model's expertise in specific domains
- **Stage 3 (Continual Fine-Tuning)**: integrating new information into a model while mitigating catastrophic forgetting

## Running DELIFT
### Datasets
Dataset specifics can be defined in `huggingface_datasets.json` (used for Stage 1) and `benchmark_datasets.json` (used for Stage 2; Stage 3 uses a combination of these datasets). The following attributes can be defined for each dataset:
- "input": the column name that corresponds to the input. This is mandatory.
- "output": the column name that corresponds to the output. This is also mandatory.
- "split_name": the names of the training, validation, and testing splits.
- "instruction": [optional] the column name that corresponds to the instruction. If left out, an empty instruction will be added.
- "subset": [optional] the specific subset of the data that needs to be loaded.
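As a sketch, an entry in `huggingface_datasets.json` might look like the following (hypothetical: the dataset key `squad`, the column names, and the exact JSON shape are illustrative assumptions, not taken from the repository):

```json
{
  "squad": {
    "input": "question",
    "output": "answers",
    "split_name": ["train", "validation", "test"],
    "instruction": "instruction",
    "subset": "plain_text"
  }
}
```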

### Files to run

Any of the `run_*.sh` files can be used to reproduce our results. Each file follows the structure below.

First, run data pre-processing with:
```
python visualization/create_embeddings.py
```

Next, load all the experimental results (this will take time):
```
python visualization/load_all_experiments.py
```

Finally, load the visualization:
```
python visualization/visualization.py
```

Note: the middle step can be skipped, since `visualization.py` contains the same result-loading code. Still, it is recommended to load all experiments before rendering the visualization.
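The three steps above can be chained in a small wrapper (a hypothetical sketch, not part of the repository; the `run_pipeline` function name and the `SKIP_LOAD` variable are illustrative):

```shell
#!/bin/sh
# Hypothetical wrapper chaining the three visualization steps.
# Set SKIP_LOAD=1 to skip the slow middle step; visualization.py can
# load the experimental results itself, at the cost of a longer startup.
run_pipeline() {
    python visualization/create_embeddings.py &&
    { [ "${SKIP_LOAD:-0}" = "1" ] || python visualization/load_all_experiments.py; } &&
    python visualization/visualization.py
}
```

After sourcing the script, run `run_pipeline`, or `SKIP_LOAD=1 run_pipeline` to go straight to the visualization.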

## Citation
Please cite our paper:

(Please be patient: our work is under submission, and we'd like to remain anonymous until after the review period. Thank you!)
15 changes: 15 additions & 0 deletions installs.sh
@@ -0,0 +1,15 @@
pip install streamlit
pip install scikit-learn
pip install plotly
export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
pip install sklearn
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
pip install sentence-transformers
pip install faiss-gpu
pip install peft
pip install evaluate
pip install torch
pip install transformers
pip install trl
pip install bert-score
pip install numpy
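For reproducibility, most of these dependencies could equivalently be captured in a `requirements.txt` (a sketch, not part of this commit; packages are left unpinned exactly as in `installs.sh`):

```text
streamlit
scikit-learn
plotly
sentence-transformers
faiss-gpu
peft
evaluate
torch
transformers
trl
bert-score
numpy
```

The deprecated `sklearn` shim and `submodlib` are omitted here, since they need the environment variable and the Test PyPI index flags from `installs.sh` and are best installed by that script.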
17 changes: 1 addition & 16 deletions run_version.sh → run_continual_fine_tuning.sh
@@ -1,19 +1,4 @@
-pip installs
-pip install streamlit
-pip install scikit-learn
-pip install plotly
-export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
-pip install sklearn
-pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
-pip install sentence-transformers
-pip install faiss-gpu
-pip install peft
-pip install evaluate
-pip install torch
-pip install transformers
-pip install trl
-pip install bert-score
-pip install numpy
+source installs.sh

# use case 3: Given a model, its training data, and a new dataset, fine-tune a model on a subset of points from the new dataset that adds new knowledge to the existing dataset
python3 visualization/create_embeddings.py --use_case 3
17 changes: 1 addition & 16 deletions run_same_dataset.sh → run_instruction_tuning.sh
@@ -1,19 +1,4 @@
-pip installs
-pip install streamlit
-pip install scikit-learn
-pip install plotly
-export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
-pip install sklearn
-pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
-pip install sentence-transformers
-pip install faiss-gpu
-pip install peft
-pip install evaluate
-pip install torch
-pip install transformers
-pip install trl
-pip install bert-score
-pip install numpy
+source installs.sh

# use case 1: Given a dataset, fine-tune a model on a subset of points that improves the performance on the entire dataset.
python3 visualization/create_embeddings.py --use_case 1
17 changes: 1 addition & 16 deletions run_benchmark.sh → run_task_specific_fine_tuning.sh
@@ -1,19 +1,4 @@
-pip installs
-pip install streamlit
-pip install scikit-learn
-pip install plotly
-export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
-pip install sklearn
-pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
-pip install sentence-transformers
-pip install faiss-gpu
-pip install peft
-pip install evaluate
-pip install torch
-pip install transformers
-pip install trl
-pip install bert-score
-pip install numpy
+source installs.sh

# use case 2: Given a model and a new dataset, fine-tune a model on a subset of points that improves the performance on a benchmark.
python3 visualization/create_embeddings.py --use_case 2
