Authors: Yutao Zhu, Peitian Zhang, Chenghao Zhang, Yifei Chen, Binyu Xie, Zhicheng Dou, Zheng Liu, and Ji-Rong Wen
🤗 HuggingFace Model List
| Model | Backbone Model |
|---|---|
| INTERS-LLaMA-7b-Chat | LLaMA-2-7b-chat |
| INTERS-LLaMA-7b-Base | LLaMA-2-7b |
| INTERS-Mistral-7b | Mistral-7b |
| INTERS-Minima-3b | Minima-2-3b |
| INTERS-Falcon-1b | Falcon-rw-1b |
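For a quick sanity check, any of these checkpoints can be loaded with Hugging Face `transformers`. The sketch below is only illustrative: the repository ID is a placeholder (use the actual ID from the Hub), and the prompt is an example IR-style instruction, not one of the official INTERS templates.

```python
# Minimal loading/generation sketch. "ORG_NAME/INTERS-LLaMA-7b-Chat" is a
# placeholder repository ID; replace it with the real ID from the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ORG_NAME/INTERS-LLaMA-7b-Chat"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# An illustrative IR-style instruction (not an official INTERS template).
prompt = "Rewrite the following query to make it clearer: how 2 fix a flat bike tire"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```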
- May, 2024: We are happy that INTERS has been accepted by ACL 2024 main conference!
- Feb, 2024: We have released the dataset, instruction templates, fine-tuned models, and evaluation scripts.
Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating a comprehensive understanding and execution of IR tasks, thereby limiting LLMs' applicability. To address this gap, in this work, we explore the potential of instruction tuning to enhance LLMs' proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, in IR tasks. Furthermore, we conduct extensive experiments to analyze the effects of instruction design, template diversity, few-shot demonstrations, and the volume of instructions on performance.
We consider tasks under the categories of query understanding, document understanding, and query-document understanding. Our dataset consists of 20 tasks derived from 43 datasets. All tasks and datasets we used are shown in the figure below.
The evaluation script is under the `evaluation` directory. The required package versions are:
```
torch 2.0.0
transformers 4.36.2
numpy 1.26.3
tqdm 4.66.1
scikit-learn 1.4.0
rouge_score 0.1.2
nltk 3.8.1
accelerate 0.26.1
```
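If you need to set this environment up from scratch, a single pip command pinning the versions above should work (adjust the `torch` build to match your CUDA setup):

```bash
pip install torch==2.0.0 transformers==4.36.2 numpy==1.26.3 tqdm==4.66.1 \
    scikit-learn==1.4.0 rouge_score==0.1.2 nltk==3.8.1 accelerate==0.26.1
```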
This evaluation script uses PyTorch DDP for text generation; an illustrative sketch of the sharded-generation pattern follows the run command below.
- Download the test data and save it to `data/in-domain/zero_shot/`. The directory structure should look like this:
```
qu-du-tasks
├── eval_sampling.py
├── inference_dataset.py
├── inference_qu_du.py
├── inference_tasks
│   ├── conversational_qa.py
│   ├── fact_verification.py
│   └── ...
└── data
    └── in-domain
        └── zero-shot
            ├── conversational_qa_coqa.zero_shot.test.jsonl
            ├── conversational_qa_quac.zero_shot.test.jsonl
            ├── fact_verification_climate_fever.zero_shot.test.jsonl
            ├── fact_verification_fever.zero_shot.test.jsonl
            ├── fact_verification_scifact.zero_shot.test.jsonl
            └── ...
```
- If you choose to place the test files in other directories, modify the path in each task file under the `inference_tasks` directory (in the `get_path()` function).
- Run the evaluation as follows:
```bash
TOKENIZERS_PARALLELISM=True python3 inference_qu_du.py \
    --model_name_or_path your/model/path \
    --tokenizer_name your/tokenizer/path \
    --setting in-domain \
    --n_shots zero_shot
```
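As noted above, the script distributes generation across GPUs with PyTorch DDP, which in practice means sharding the test file across ranks, generating per rank, and gathering the outputs. The following is a minimal, self-contained sketch of that general pattern, not the repository's actual code; the `prompt` field name and the generation settings are assumptions.

```python
# Illustrative sketch of DDP-style sharded generation (NOT the repository's code).
# Launch with, e.g.: torchrun --nproc_per_node=4 ddp_generation_sketch.py
import json
import os

import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "your/model/path"  # placeholder, as in the command above
# File name taken from the directory listing above; the "prompt" key used
# below is an assumption, the actual JSONL schema may differ.
TEST_FILE = "data/in-domain/zero_shot/conversational_qa_coqa.zero_shot.test.jsonl"


def main():
    dist.init_process_group(backend="nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}")
    torch.cuda.set_device(device)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, torch_dtype=torch.float16
    ).to(device).eval()

    # Shard the test set: each rank handles every world_size-th example.
    with open(TEST_FILE) as f:
        examples = [json.loads(line) for line in f]
    shard = examples[rank::world_size]

    outputs = []
    for example in shard:
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(device)
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=128)
        outputs.append(tokenizer.decode(generated[0], skip_special_tokens=True))

    # Collect every rank's generations; rank 0 could then run the metrics.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, outputs)
    if rank == 0:
        print(f"Generated {sum(len(g) for g in gathered)} outputs in total.")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```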
- Download the test data and save it to `data/`. The directory structure should look like this:
```
qdu-tasks
├── cqa.sh
├── eval_rank.py
├── postprocess_cqa.py
├── run_eval.sh
└── data
    ├── cqadupstack
    │   ├── android
    │   │   └── test.pt.key.do-not-overwrite.json
    │   ├── english
    │   │   └── test.pt.key.do-not-overwrite.json
    │   └── ...
    ├── arguana.bm25.100.jsonl
    ├── climate_fever.bm25.100.jsonl
    └── ...
```
- For datasets other than cqadupstack, modify the paths in `run_eval.sh`:

```bash
MODEL_PATH="your/model/path"
TOKENIZER_PATH="your/tokenizer/path"
RESULT_PATH="your/result/path"
EVAL_DATA_PATH="data"
```

then run the script:

```bash
bash run_eval.sh
```
- For the cqadupstack dataset, modify the paths in `cqa.sh`:

```bash
MODEL_PATH="your/model/path"
TOKENIZER_PATH="your/tokenizer/path"
RESULT_PATH="your/result/path"
```

then run the script:

```bash
bash cqa.sh
```
- This script supports pointwise, pairwise, and listwise reranking. Modify the parameters of `eval_rerank.py` in `run_eval.sh` or `cqa.sh`:

```bash
# pointwise (default):
--rerank_method pointwise

# pairwise:
--rerank_method pairwise

# listwise:
--rerank_method listwise \
--listwise_window 5 \
--listwise_stride 5
```
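For intuition about the listwise options: listwise reranking typically slides a window over the ranked candidate list and reranks each window in turn. The sketch below is not the repository's implementation; it only illustrates which candidate index ranges a window size of 5 and a stride of 5 would cover.

```python
# Illustrative only: how a sliding window with the given size and stride
# partitions a ranked candidate list for listwise reranking.
def listwise_windows(num_candidates: int, window: int = 5, stride: int = 5):
    """Return the [start, end) index ranges each listwise pass would rerank."""
    windows = []
    start = 0
    while start < num_candidates:
        end = min(start + window, num_candidates)
        windows.append((start, end))
        if end == num_candidates:
            break
        start += stride
    return windows

# With 12 candidates, window=5 and stride=5 gives non-overlapping blocks:
# [(0, 5), (5, 10), (10, 12)]. A stride smaller than the window would make
# consecutive windows overlap, letting documents move across block boundaries.
print(listwise_windows(12))
```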
Please kindly cite our paper if it helps your research:
```bibtex
@inproceedings{INTERS,
  author    = {Yutao Zhu and
               Peitian Zhang and
               Chenghao Zhang and
               Yifei Chen and
               Binyu Xie and
               Zheng Liu and
               Ji{-}Rong Wen and
               Zhicheng Dou},
  editor    = {Lun{-}Wei Ku and
               Andre Martins and
               Vivek Srikumar},
  title     = {{INTERS:} Unlocking the Power of Large Language Models in Search with Instruction Tuning},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), {ACL} 2024, Bangkok, Thailand, August 11-16, 2024},
  pages     = {2782--2809},
  publisher = {Association for Computational Linguistics},
  year      = {2024},
  url       = {https://doi.org/10.18653/v1/2024.acl-long.154},
  doi       = {10.18653/V1/2024.ACL-LONG.154}
}
```