
Commit

update readme
agarwalishika committed Oct 29, 2024
1 parent 4bf9f78 commit 56d7da5
Showing 5 changed files with 59 additions and 48 deletions.
41 changes: 41 additions & 0 deletions README.md
@@ -0,0 +1,41 @@
# DELIFT: Data Efficient Language model Instruction Fine Tuning

This repo contains the code for [DELIFT: Data Efficient Language model Instruction Fine Tuning](https://arxiv.org/abs/LINK). DELIFT is a unified fine-tuning algorithm that optimizes data subset selection across the three stages of fine-tuning:
- **Stage 1 (Instruction Tuning)**: enhancing a model's ability to follow general instructions
- **Stage 2 (Task-Specific Fine-Tuning)**: refining a model's expertise in specific domains
- **Stage 3 (Continual Fine-Tuning)**: integrating new information into a model while mitigating catastrophic forgetting

## Running DELIFT
### Datasets
Dataset specifics can be defined in `huggingface_datasets.json` (used for Stage 1) and `benchmark_datasets.json` (used for Stage 2; Stage 3 uses a combination of these datasets). The following attributes can be defined for each dataset:
- "input": the column name that corresponds to the input. This is mandatory.
- "output": the column name that corresponds to the output. This is also mandatory.
- "split_name": the names of the training, validation, and testing splits.
- "instruction": [optional] the column name that corresponds to the instruction. If left out, an empty instruction will be added.
- "subset": [optional] the specific subset of the data that needs to be loaded.
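As a sketch, an entry in `huggingface_datasets.json` might look like the following (hypothetical: the dataset key `squad`, the column names, and the exact JSON shape are illustrative assumptions, not taken from the repository):

```json
{
  "squad": {
    "input": "question",
    "output": "answers",
    "split_name": ["train", "validation", "test"],
    "instruction": "instruction",
    "subset": "plain_text"
  }
}
```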

### Files to run

Any of the `run_*.sh` files can be used to reproduce our results. Each file follows the structure below.

First, run data pre-processing with:
```
python visualization/create_embeddings.py
```

Next, load all the experimental results (this will take time):
```
python visualization/load_all_experiments.py
```

Finally, load the visualization:
```
python visualization/visualization.py
```

Note: the middle step can be skipped, since `visualization.py` contains the same result-loading code. Still, it is recommended to load all experiments before rendering the visualization.
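The three steps above can be chained in a small wrapper (a hypothetical sketch, not part of the repository; the `run_pipeline` function name and the `SKIP_LOAD` variable are illustrative):

```shell
#!/bin/sh
# Hypothetical wrapper chaining the three visualization steps.
# Set SKIP_LOAD=1 to skip the slow middle step; visualization.py can
# load the experimental results itself, at the cost of a longer startup.
run_pipeline() {
    python visualization/create_embeddings.py &&
    { [ "${SKIP_LOAD:-0}" = "1" ] || python visualization/load_all_experiments.py; } &&
    python visualization/visualization.py
}
```

After sourcing the script, run `run_pipeline`, or `SKIP_LOAD=1 run_pipeline` to go straight to the visualization.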

## Citation
Please cite our paper:

(Please be patient: our work is under submission, and we'd like to remain anonymous until after the review period. Thank you!)
15 changes: 15 additions & 0 deletions installs.sh
@@ -0,0 +1,15 @@
pip install streamlit
pip install scikit-learn
pip install plotly
export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
pip install sklearn
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
pip install sentence-transformers
pip install faiss-gpu
pip install peft
pip install evaluate
pip install torch
pip install transformers
pip install trl
pip install bert-score
pip install numpy
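For reproducibility, most of these dependencies could equivalently be captured in a `requirements.txt` (a sketch, not part of this commit; packages are left unpinned exactly as in `installs.sh`):

```text
streamlit
scikit-learn
plotly
sentence-transformers
faiss-gpu
peft
evaluate
torch
transformers
trl
bert-score
numpy
```

The deprecated `sklearn` shim and `submodlib` are omitted here, since they need the environment variable and the Test PyPI index flags from `installs.sh` and are best installed by that script.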
17 changes: 1 addition & 16 deletions run_version.sh → run_continual_fine_tuning.sh
@@ -1,19 +1,4 @@
-pip installs
-pip install streamlit
-pip install scikit-learn
-pip install plotly
-export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
-pip install sklearn
-pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
-pip install sentence-transformers
-pip install faiss-gpu
-pip install peft
-pip install evaluate
-pip install torch
-pip install transformers
-pip install trl
-pip install bert-score
-pip install numpy
+source installs.sh

# use case 3: Given a model, its training data, and a new dataset, fine-tune a model on a subset of points from the new dataset that adds new knowledge to the existing dataset
python3 visualization/create_embeddings.py --use_case 3
17 changes: 1 addition & 16 deletions run_same_dataset.sh → run_instruction_tuning.sh
@@ -1,19 +1,4 @@
-pip installs
-pip install streamlit
-pip install scikit-learn
-pip install plotly
-export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
-pip install sklearn
-pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
-pip install sentence-transformers
-pip install faiss-gpu
-pip install peft
-pip install evaluate
-pip install torch
-pip install transformers
-pip install trl
-pip install bert-score
-pip install numpy
+source installs.sh

# use case 1: Given a dataset, fine-tune a model on a subset of points that improves the performance on the entire dataset.
python3 visualization/create_embeddings.py --use_case 1
17 changes: 1 addition & 16 deletions run_benchmark.sh → run_task_specific_fine_tuning.sh
@@ -1,19 +1,4 @@
-pip installs
-pip install streamlit
-pip install scikit-learn
-pip install plotly
-export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
-pip install sklearn
-pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
-pip install sentence-transformers
-pip install faiss-gpu
-pip install peft
-pip install evaluate
-pip install torch
-pip install transformers
-pip install trl
-pip install bert-score
-pip install numpy
+source installs.sh

# use case 2: Given a model and a new dataset, fine-tune a model on a subset of points that improves the performance on a benchmark.
python3 visualization/create_embeddings.py --use_case 2
