This repo contains code and data associated with the EMNLP 2024 Findings paper "Schema-Driven Information Extraction from Heterogeneous Tables".
```bibtex
@misc{bai2024schemadriveninformationextractionheterogeneous,
      title={Schema-Driven Information Extraction from Heterogeneous Tables},
      author={Fan Bai and Junmo Kang and Gabriel Stanovsky and Dayne Freitag and Mark Dredze and Alan Ritter},
      year={2024},
      eprint={2305.14336},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2305.14336},
}
```
- Create conda environment.

  ```shell
  git clone https://github.com/bflashcp3f/schema-to-json.git
  cd schema-to-json
  conda env create -f environment.yml
  conda activate s2j
  ```

- Set up the OpenAI API key with the environment variable `OPENAI_API_KEY`. If you want to use Azure, set the environment variable `AZURE_API_KEY` instead.

- Install from source.

  ```shell
  pip install -e .
  ```
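As a minimal sketch of the pattern the setup step above implies, code can select between the two keys based on the chosen provider. The helper name and structure here are hypothetical, not the repository's actual implementation:

```python
import os

def get_api_key(api_source: str = "openai") -> str:
    """Hypothetical helper: read the API key for the chosen provider
    from the environment, as described in the setup instructions."""
    var = "OPENAI_API_KEY" if api_source == "openai" else "AZURE_API_KEY"
    key = os.environ.get(var)
    if key is None:
        raise RuntimeError(f"Environment variable {var} is not set.")
    return key
```

If the relevant variable is missing, the helper fails fast with a clear message rather than letting an API call fail later with a less obvious authentication error.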
Four datasets (MlTables, ChemTables, DiSCoMaT and SWDE) in our benchmark are available under the `data` directory.

Below are the commands to reproduce the paper results. Make sure you set `API_SOURCE` (`openai` or `azure`) and `BACKEND` (model name) in the script. For open-source models, use the scripts with the suffix `_os.sh`.
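The exact variables differ per script, but the configuration block the text refers to presumably resembles the following (values here are illustrative placeholders; check the actual script before running):

```shell
# Illustrative configuration inside a prompting script
# (values are placeholders, not the repository's defaults).
API_SOURCE=openai      # "openai" or "azure"
BACKEND=gpt-3.5-turbo  # model name for the chosen backend
```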
```shell
# MlTables
# Prompt (w/ error recovery)
bash scripts/mltables/prompt_error_recovery.sh
# Evaluation
bash scripts/mltables/eval.sh

# ChemTables
# Prompt (w/ error recovery)
bash scripts/chemtables/prompt_error_recovery.sh
# Evaluation
bash scripts/chemtables/eval.sh

# DiSCoMaT
# Prompt (w/ error recovery)
bash scripts/discomat/prompt_error_recovery.sh
# Evaluation
bash scripts/discomat/eval.sh

# SWDE
# Prompt
bash scripts/swde/prompt.sh
# Evaluation
bash scripts/swde/eval.sh
```