Here is our paper on arXiv: [link]
The structure of the project:
- dataset: the FRANK and REALSumm datasets (in JSON format)
- reproduce: the code to reproduce the FineSurE results in Table 1 and Table 2
- finesure: the code to run our FineSurE method to evaluate summaries generated by language models
FineSurE is a multi-dimensional, fine-grained automated evaluation framework for text summarization. It covers three distinct evaluation dimensions, namely faithfulness, completeness, and conciseness. These dimensions are crucial for assessing the summarization capability of modern language models, which are susceptible to incorrect statements, information omission, and verbosity.
The FineSurE framework breaks down a complicated evaluation process into two simple, human-like evaluation tasks using LLMs:
- Fact Checking: a categorization task over nine categories: the seven factuality error types, plus an additional category "other error" for errors outside the seven types, and a category "no error" for cases where no error is detected. Given a pair of input text and model summary, the LLM outputs, for each summary sentence, the error type classified into one of the nine categories, along with a concise reason.
- Keyfact Alignment: an alignment task that matches each keyfact to the summary sentences from which it is inferable. Given a pair of keyfact list and model summary, the output is a binary label (inferable or not) and the list of line numbers of all summary sentences matched to each keyfact. A sketch of how the evaluation scores are derived from these two task outputs is given below.
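As a rough illustration, here is a minimal sketch of how the percentage scores reported later in this README can be derived from the two task outputs. The function names and data layout are our own, not the repository's API; faithfulness is taken as the fraction of error-free summary sentences, completeness as the fraction of inferable keyfacts, and conciseness as the fraction of summary sentences matched to at least one keyfact.

def faithfulness_score(sentence_labels):
    # Fraction of summary sentences labeled "no error" (summary-level faithfulness).
    return sum(1 for label in sentence_labels if label == "no error") / len(sentence_labels)

def completeness_score(keyfact_inferable):
    # Fraction of keyfacts that are inferable from the summary.
    return sum(keyfact_inferable) / len(keyfact_inferable)

def conciseness_score(matched_lines, num_summary_sentences):
    # Fraction of summary sentences matched to at least one keyfact.
    matched = {line for lines in matched_lines for line in lines}
    return len(matched) / num_summary_sentences

# Toy example: a 4-sentence summary checked against 3 keyfacts.
labels = ["no error", "entity error", "no error", "no error"]
inferable = [True, True, False]       # per-keyfact binary labels
matches = [[1], [2, 3], []]           # matched summary line numbers per keyfact
print(faithfulness_score(labels))     # 0.75
print(completeness_score(inferable))  # 0.666...
print(conciseness_score(matches, 4))  # 0.75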
We provide sample datasets with 10 examples each for the fact-checking and keyfact-alignment tasks.
Please replace the OpenAI API key with your own API key in finesure/fact-checking.py and finesure/keyfact-alignment.py.
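If you prefer not to hardcode the key, one option is to export it as an environment variable and read it at the top of each script. A minimal sketch, assuming the scripts use the official openai Python client (>= 1.0); the environment variable name is conventional but our own choice:

import os
from openai import OpenAI

# Read the API key from the environment instead of embedding it in the source.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])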
cd CodeRelease
# usage: python finesure/fact-checking.py [input-path] [output-folder]
# example: fact checking on the sampled data
python finesure/fact-checking.py dataset/frank/frank-data-sample-10.json result/fact-checking
cd CodeRelease
# usage: python finesure/keyfact-alignment.py [input-path] [keyfact-path] [output-folder]
# example: keyfact alignment on the sampled data
python finesure/keyfact-alignment.py dataset/realsumm/realsumm-data-sample-10.json dataset/realsumm/human-keyfact-list.json result/keyfact-alignment
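Before running, you can sanity-check a sample input file with a few lines of Python. This only assumes the files contain JSON (with a fallback for JSON Lines); the actual record schema is whatever the files define:

import json

path = "dataset/frank/frank-data-sample-10.json"
with open(path) as f:
    text = f.read()
try:
    data = json.loads(text)  # standard JSON
except json.JSONDecodeError:
    # fall back to JSON Lines: one JSON object per line
    data = [json.loads(line) for line in text.splitlines() if line.strip()]
print(type(data).__name__, len(data))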
The results are saved in the result directory. Example results on the sampled data are shown below:
- Fact Checking Task:
[Evaluation Results]
* sentence-level factuality error ratio per model (lower is better)
bert_sum 0.0%
bus 33.3%
pgn 16.7%
s2s 83.3%
bart 33.3%
* summary-level faithfulness score per model (higher is better)
bert_sum 100.0%
bus 66.7%
pgn 83.3%
s2s 16.7%
bart 75.0%
* system-level model ranking (left is better)
['bert_sum', 'pgn', 'bart', 'bus', 's2s']
* success rate: 100.0%
- Keyfact Alignment Task:
[Evaluation Results]
* completeness score per model (higher is better)
unilm_out_v2 45.5%
t5_out_large 59.0%
* completeness model ranking (left is better)
['t5_out_large', 'unilm_out_v2']
* conciseness score per model (higher is better)
unilm_out_v2 76.0%
t5_out_large 81.7%
* conciseness model ranking (left is better)
['t5_out_large', 'unilm_out_v2']
* success rate: 100.0%
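The system-level rankings above appear to be the models sorted by their aggregate scores; as a short illustration (ours, not necessarily the repository's exact code), sorting the summary-level faithfulness scores reproduces the fact-checking ranking:

# Rank models by per-model score, highest first.
scores = {"bert_sum": 1.000, "bus": 0.667, "pgn": 0.833, "s2s": 0.167, "bart": 0.750}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['bert_sum', 'pgn', 'bart', 'bus', 's2s']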
To reproduce the main results in Table 1 and Table 2 from the saved GPT-4 outputs:
cd CodeRelease/reproduce
python reproduce-main-results.py results/frank-result-by-gpt4-w-finesure.json results/realsumm-result-by-gpt4-w-finesure.json
Please consider citing our paper if it is useful in your research.
@inproceedings{song2024finesure,
  title={FineSurE: Fine-grained Summarization Evaluation using LLMs},
  author={Song, Hwanjun and Su, Hang and Shalyminov, Igor and Cai, Jason and Mansour, Saab},
  booktitle={ACL},
  year={2024}
}