This repo contains code and data associated with the EMNLP 2024 Findings paper "Schema-Driven Information Extraction from Heterogeneous Tables".
```bibtex
@misc{bai2024schemadriveninformationextractionheterogeneous,
      title={Schema-Driven Information Extraction from Heterogeneous Tables},
      author={Fan Bai and Junmo Kang and Gabriel Stanovsky and Dayne Freitag and Mark Dredze and Alan Ritter},
      year={2024},
      eprint={2305.14336},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2305.14336},
}
```
- Create conda environment.

  ```shell
  git clone https://github.com/bflashcp3f/schema-to-json.git
  cd schema-to-json
  conda env create -f environment.yml
  conda activate s2j
  ```

- Set up the OpenAI API key with the environment variable `OPENAI_API_KEY`. If you want to use Azure, set the environment variable `AZURE_API_KEY` instead.

- Install from source.

  ```shell
  pip install -e .
  ```
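As a minimal sketch of the pattern the setup step above implies, code can select between the two keys based on the chosen provider. The helper name and structure here are hypothetical, not the repository's actual implementation:

```python
import os

def get_api_key(api_source: str = "openai") -> str:
    """Hypothetical helper: read the API key for the chosen provider
    from the environment, as described in the setup instructions."""
    var = "OPENAI_API_KEY" if api_source == "openai" else "AZURE_API_KEY"
    key = os.environ.get(var)
    if key is None:
        raise RuntimeError(f"Environment variable {var} is not set.")
    return key
```

If the relevant variable is missing, the helper fails fast with a clear message rather than letting an API call fail later with a less obvious authentication error.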
Four datasets (MlTables, ChemTables, DiSCoMaT and SWDE) in our benchmark are available under the `data` directory.

Below are the commands to reproduce the paper results. Make sure you set `API_SOURCE` (`openai` or `azure`) and `BACKEND` (model name) in the script. For open-source models, use the scripts with the suffix `_os.sh`.
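The exact variables differ per script, but the configuration block the text refers to presumably resembles the following (values here are illustrative placeholders; check the actual script before running):

```shell
# Illustrative configuration inside a prompting script
# (values are placeholders, not the repository's defaults).
API_SOURCE=openai      # "openai" or "azure"
BACKEND=gpt-3.5-turbo  # model name for the chosen backend
```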
```shell
# MlTables
# Prompt (w/ error recovery)
bash scripts/mltables/prompt_error_recovery.sh
# Evaluation
bash scripts/mltables/eval.sh

# ChemTables
# Prompt (w/ error recovery)
bash scripts/chemtables/prompt_error_recovery.sh
# Evaluation
bash scripts/chemtables/eval.sh

# DiSCoMaT
# Prompt (w/ error recovery)
bash scripts/discomat/prompt_error_recovery.sh
# Evaluation
bash scripts/discomat/eval.sh

# SWDE
# Prompt
bash scripts/swde/prompt.sh
# Evaluation
bash scripts/swde/eval.sh
```