We propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data, enabling more precise predictions through annotation fusion. Based on TransFusion, we introduce GoLLIE-TF, a cross-lingual instruction-tuned LLM for IE tasks, designed to close the performance gap between high and low-resource languages.
- 📖 Paper: Translation and Fusion Improves Zero-shot Cross-lingual Information Extraction
- 🤗 Model: GoLLIE-7B-TF
- 🚀 Example Jupyter Notebooks: GoLLIE-TF Notebooks
The labels are represented as Python classes, and the guidelines or instructions are introduced as docstrings.
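As a sketch of that representation (the class names, fields, and guideline text below are illustrative, not the exact task definitions shipped with GoLLIE-TF), a named-entity task might be defined like this:

```python
# A hypothetical GoLLIE-style task definition: each label is a Python
# dataclass, and the annotation guideline is carried in its docstring.
from dataclasses import dataclass

@dataclass
class Person:
    """A person entity: names of people, including fictional characters."""
    span: str  # the text span annotated with this label

@dataclass
class Location:
    """A location entity: countries, cities, and other geographic areas."""
    span: str

# An annotated example pairs the input text with label instances.
text = "Ada Lovelace was born in London."
annotations = [Person(span="Ada Lovelace"), Location(span="London")]
```

The model is prompted with the class definitions (guidelines included) and the input text, and is trained to emit the list of label instances as output.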
You will need to install the following dependencies to run the GoLLIE codebase:
PyTorch >= 2.0.0 | https://pytorch.org/get-started
We recommend installing version 2.1.0 or newer, as it includes important bug fixes.
transformers >= 4.33.1
pip install --upgrade transformers
PEFT >= 0.4.0
pip install --upgrade peft
bitsandbytes >= 0.40.0
pip install --upgrade bitsandbytes
Flash Attention 2.0
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
You will also need these dependencies:
pip install numpy black Jinja2 tqdm rich psutil datasets ruff wandb fschat
First, we initialize the model from GoLLIE-7B and set the training data path in gollie-tf.yaml (dataset_ep_dir:). A copy of the data can be found at link.
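The relevant entry in `gollie-tf.yaml` might look like the following sketch (only the `dataset_ep_dir` field comes from the text above; the path is a placeholder):

```yaml
# Excerpt of configs/model_configs/gollie-tf.yaml (path is a placeholder)
dataset_ep_dir: /path/to/transfusion_training_data
```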
Second, we continue training the model on English and TransFusion data using QLoRA. Check bash_scripts/run_training.sh:
CONFIGS_FOLDER="configs/model_configs"
python3 -m src.run ${CONFIGS_FOLDER}/gollie-tf.yaml
Finally, we run inference by loading the LoRA weights and merging them with GoLLIE-7B. Check bash_scripts/run_inference.sh. A copy of the processed test data can be found at link.
python3 -m src.hf_inference --dataset_path $DATASET_PATH --task_name_list $DATASET_NAME --num_size $NUM_SIZE --output_path $OUTPUT_PATH --batch_size 8 --model_name $MODEL_NAME
Please check the code at edchengg/transfusion for the multilingual BERT-based TransFusion experiments.
@article{chen2023better,
title={Translation and Fusion Improves Zero-shot Cross-lingual Information Extraction},
author={Chen, Yang and Shah, Vedaant and Ritter, Alan},
journal={arXiv preprint arXiv:2305.13582},
year={2023}
}
This material is based in part on research sponsored by IARPA via the BETTER program (2019-19051600004).
The GoLLIE-TF codebase is adapted from the GoLLIE project. We thank the authors for their discussions on the model implementation. We extend the codebase by adding multilingual IE evaluation tasks and extending the dataset class. Please cite GoLLIE as well if you use the model.