
Constrained Decoding for Cross-lingual Label Projection

This repo contains the code for our ICLR 2024 paper: Constrained Decoding for Cross-lingual Label Projection.

Installation

This project uses Python 3.9.16.

pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r requirements.txt

Quick usage

Important

We have noticed strange behavior in the transformers tokenizer when it is loaded from fine-tuned NLLB checkpoints. To reproduce the results in the paper, please load the tokenizer from the original NLLB checkpoints (e.g., facebook/nllb-200-distilled-600M).
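As an illustration, here is a minimal sketch of that loading pattern using standard transformers calls (the model names are taken from the quick-usage example below; the exact arguments used inside example.py may differ):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Model weights come from the fine-tuned checkpoint ...
model = AutoModelForSeq2SeqLM.from_pretrained("ychenNLP/nllb-200-distilled-1.3B-easyproject")
# ... but the tokenizer is loaded from the original NLLB checkpoint to avoid the issue above.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", src_lang="eng_Latn")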

We provide a simple example of running Codec in example.py:

MODEL="ychenNLP/nllb-200-distilled-1.3B-easyproject"
NLLB_TOKENIZER="facebook/nllb-200-distilled-1.3B"  # <--- Load from the original NLLB checkpoint

SRC_TEXT="Only France and [ Britain ] backed Fischler 's proposal ."
TEMPLATE="Faransi ni Angiletɛri dɔrɔn de ye Fischler ka laɲini dɛmɛ ."

python example.py \
           --src_text "${SRC_TEXT}" \
           --template "${TEMPLATE}" \
           --model_name_or_path "${MODEL}" \
           --tokenizer_path "${NLLB_TOKENIZER}"

The current codebase supports mBART, M2M-100, and NLLB model checkpoints. The fine-tuned version of NLLB-600M can be downloaded here (please load the NLLB tokenizer from facebook/nllb-200-distilled-600M when using this checkpoint).

Cross-lingual NER

Translate train

  • Download the CoNLL-2003 and MasakhaNER2.0 datasets.
  • Run the scripts/translate_conll.sh script; you will need to edit the input and output paths inside the script first.
    • This script processes the CoNLL-2003 dataset and translates the training data into 18 African languages.
  • Run the scripts/augment_ner.sh script; you will need to edit the target language and the input and output paths first.
    • This script prepares the input to Codec, runs Codec, and creates the augmented training data in the target language (an illustrative sketch of the entity-marking format is given after this list).
  • To train the NER model, we use the code and script from this repo.
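Codec's source input marks the span to project with bracket tokens, as in the SRC_TEXT example above. The snippet below is only an illustrative sketch of that marking step; mark_span is a hypothetical helper, and the actual preprocessing is performed by scripts/augment_ner.sh:

def mark_span(tokens, start, end):
    # Insert "[" before tokens[start] and "]" after tokens[end - 1].
    return " ".join(tokens[:start] + ["["] + tokens[start:end] + ["]"] + tokens[end:])

tokens = "Only France and Britain backed Fischler 's proposal .".split()
print(mark_span(tokens, 3, 4))
# Only France and [ Britain ] backed Fischler 's proposal .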

Translate test

  • Train an English NER model on the CoNLL-2003 dataset (code: this repo). We convert all MISC tags to O before training (see the sketch after this list).
  • Translate the test data into English (code: pipelines/nllb_translation_tasks.py), then use the English NER model to annotate it.
  • Run the scripts/masakhaner_translate_test.sh script; you will need to edit the target language and the input and output paths first.
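For clarity, here is a minimal sketch of the MISC-to-O conversion mentioned in the first step (drop_misc is a hypothetical helper name; the actual preprocessing is part of the NER training code):

def drop_misc(tags):
    # Map B-MISC / I-MISC to O; leave all other CoNLL-2003 tags unchanged.
    return ["O" if tag.endswith("-MISC") else tag for tag in tags]

print(drop_misc(["B-ORG", "O", "B-MISC", "I-MISC", "B-PER"]))
# ['B-ORG', 'O', 'O', 'O', 'B-PER']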

Citation

If you use this codebase in your work, please consider citing our paper:

@inproceedings{le2024constrained,
  title={Constrained Decoding for Cross-lingual Label Projection},
  author={Duong Minh Le and Yang Chen and Alan Ritter and Wei Xu},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=DayPQKXaQk}
}
