EMNLP 2024 Findings "Schema-Driven Information Extraction from Heterogeneous Tables"

bflashcp3f/schema-to-json

Schema-Driven Information Extraction from Heterogeneous Tables

This repo contains code and data associated with the EMNLP 2024 Findings paper "Schema-Driven Information Extraction from Heterogeneous Tables".

@misc{bai2024schemadriveninformationextractionheterogeneous,
      title={Schema-Driven Information Extraction from Heterogeneous Tables}, 
      author={Fan Bai and Junmo Kang and Gabriel Stanovsky and Dayne Freitag and Mark Dredze and Alan Ritter},
      year={2024},
      eprint={2305.14336},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2305.14336}, 
}

Task: Schema-to-JSON

Installation

  1. Clone the repository and create the conda environment.
git clone https://github.com/bflashcp3f/schema-to-json.git
cd schema-to-json
conda env create -f environment.yml
conda activate s2j
  2. Set the OPENAI_API_KEY environment variable to your OpenAI API key. To use Azure instead, set the AZURE_API_KEY environment variable.

  3. Install from source.

pip install -e .
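
Before running any experiment, it can help to confirm that the key variable for the chosen API source is actually set. The following is a minimal sketch, not part of the repository: `check_api_key` is a hypothetical helper, and the only names taken from this README are OPENAI_API_KEY, AZURE_API_KEY, and the source names openai and azure.

```shell
# Hypothetical helper (not shipped with the repo): check that the key
# variable matching the chosen API source is set in the environment.
check_api_key() {
    case "$1" in
        openai) var_name="OPENAI_API_KEY" ;;
        azure)  var_name="AZURE_API_KEY" ;;
        *) echo "unknown API source: $1" >&2; return 2 ;;
    esac
    # Indirect lookup of the variable whose name is stored in var_name.
    eval "value=\${$var_name:-}"
    if [ -z "$value" ]; then
        echo "$var_name is not set" >&2
        return 1
    fi
    echo "$var_name is set"
}

# Example usage (replace the placeholder with your own key):
# export OPENAI_API_KEY="your-key-here"
# check_api_key openai
```

The function only reports whether the variable is non-empty; it does not validate the key against the API.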

Data

The four benchmark datasets (MlTables, ChemTables, DiSCoMat, and SWDE) are available under the data directory.

Experiments

The commands below reproduce the paper's results. Before running them, set API_SOURCE (openai or azure) and BACKEND (the model name) in each script. For open-source models, use the scripts with the _os.sh suffix.
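
To see what a script is currently configured with before launching it, something like the sketch below may be useful. It assumes that API_SOURCE and BACKEND appear in the scripts as plain shell assignments at the start of a line, which this README does not confirm; `show_script_settings` is a hypothetical helper.

```shell
# Hypothetical helper: print the API_SOURCE and BACKEND assignments in a script,
# with line numbers. Assumes plain shell assignments at the start of a line
# (optionally preceded by "export").
show_script_settings() {
    grep -nE '^(export[[:space:]]+)?(API_SOURCE|BACKEND)=' "$1"
}

# Example usage, from the repository root:
# show_script_settings scripts/mltables/prompt_error_recovery.sh
```

If the command prints nothing, the variables are probably defined in a different form, and the script should be inspected by hand.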

MlTables

# Prompt (w/ error recovery)
bash scripts/mltables/prompt_error_recovery.sh

# Evaluation
bash scripts/mltables/eval.sh

ChemTables

# Prompt (w/ error recovery)
bash scripts/chemtables/prompt_error_recovery.sh

# Evaluation
bash scripts/chemtables/eval.sh

DiSCoMat

# Prompt (w/ error recovery)
bash scripts/discomat/prompt_error_recovery.sh

# Evaluation
bash scripts/discomat/eval.sh

SWDE

# Prompt
bash scripts/swde/prompt.sh

# Evaluation
bash scripts/swde/eval.sh
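
The per-dataset commands above can also be chained into a single run. The following is a rough sketch rather than a script shipped with the repository: `prompt_script_for` and `run_all` are hypothetical names, and the script paths are copied from the commands listed above (note that SWDE uses prompt.sh while the other datasets use prompt_error_recovery.sh).

```shell
# Map a dataset directory name to its prompt script, following the
# commands listed in this README.
prompt_script_for() {
    case "$1" in
        mltables|chemtables|discomat) echo "scripts/$1/prompt_error_recovery.sh" ;;
        swde) echo "scripts/swde/prompt.sh" ;;
        *) echo "unknown dataset: $1" >&2; return 1 ;;
    esac
}

# Run prompting, then evaluation, for every dataset; stop at the first failure.
run_all() {
    for dataset in mltables chemtables discomat swde; do
        script=$(prompt_script_for "$dataset") || return 1
        bash "$script" || return 1
        bash "scripts/$dataset/eval.sh" || return 1
    done
}

# Example usage, from the repository root, once API_SOURCE and BACKEND are set:
# run_all
```

Stopping at the first failure avoids evaluating a dataset whose prompting step did not finish.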
