An implementation of "Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization".
Follow the steps in this README for:
- Generating the XSum dataset starting from BBC URLs
- Training the LDA model from scratch (a minimal illustrative sketch follows this list)
- Decoding word-topics and doc-topics using the trained LDA model
- Data processing
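The LDA steps above are handled by the repository's own scripts; purely as an illustration of what training a topic model and decoding word-topics/doc-topics involves, here is a minimal gensim sketch. The toy documents and the topic count are placeholders, not the repository's actual settings.

# Minimal gensim LDA sketch -- an illustration only, not the repository's script.
# `documents` is a toy list of token lists; the topic count is a placeholder.
from gensim import corpora, models

documents = [["police", "appeal", "witnesses"], ["new", "hospital", "opens"]]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=5)

# Word-topics: top words per topic; doc-topics: topic distribution for a document.
print(lda.show_topics(num_topics=2, num_words=3))
print(lda.get_document_topics(dictionary.doc2bow(documents[0])))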
Data processing: generate the source and target dictionary files. With "--joined-dictionary", both files are identical. Preprocessing operates on the raw-format data.
TEXT = "{path to xsum_data_topic_convs2s dir}"
!python ./XSum-Topic-ConvS2S/preprocess.py --source-lang document \
--target-lang summary \
--trainpref $TEXT/train \
--validpref $TEXT/validation \
--testpref $TEXT/test \
--destdir $TEXT \
--joined-dictionary \
--nwordstgt 50000 \
--nwordssrc 50000 \
--output-format raw
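After preprocessing, the data directory should contain the split files plus the dictionaries. As a quick, hypothetical sanity check that the joined dictionary was built, the two dictionary files can be compared; dict.document.txt and dict.summary.txt are the names fairseq's preprocess.py typically writes to --destdir, so adjust them if your copy differs.

# Hypothetical post-preprocessing check; dictionary file names are assumptions.
import os

data_dir = "{path to xsum_data_topic_convs2s dir}"  # same directory as $TEXT above
src_dict = open(os.path.join(data_dir, "dict.document.txt"), encoding="utf-8").read()
tgt_dict = open(os.path.join(data_dir, "dict.summary.txt"), encoding="utf-8").read()
print("joined dictionary:", src_dict == tgt_dict)   # expected: True with --joined-dictionary
print("vocabulary size:", len(src_dict.splitlines()))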
The model requires a GPU for training. Run train.py with -h to see how to change the model variant and other hyperparameters.
Model variants:
- TCONVS2S enc(t',tD) dec(tD)
- TCONVS2S enc(t') dec(tD)
save_directory = "./checkpoints-topic-convs2s"
%env CUDA_VISIBLE_DEVICES=1
!python ./XSum-Topic-ConvS2S/train.py $TEXT --source-lang document \
--target-lang summary \
--doctopics doc-topics \
--max-sentences 32 \
--arch fconv \
--variant 1 \
--criterion label_smoothed_cross_entropy \
--max-epoch 200 \
--clip-norm 0.1 \
--lr 0.10 \
--dropout 0.2 \
--save-dir {save_directory} \
--no-progress-bar \
--log-interval 10
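Training writes checkpoints into save_directory, typically checkpoint_best.pt (lowest validation loss) and checkpoint_last.pt (most recent epoch); decoding is usually done with checkpoint_best.pt. As a rough way to confirm a run produced a usable checkpoint, it can be inspected with torch; the key names shown are typical for fairseq-py checkpoints of this era and may differ in this fork.

# Hedged sketch: peek inside a saved checkpoint; key names are assumptions.
import torch

ckpt = torch.load("./checkpoints-topic-convs2s/checkpoint_best.pt", map_location="cpu")
print(list(ckpt.keys()))   # e.g. ['args', 'model', 'optimizer_history', ...]
print(ckpt.get("epoch"))   # last completed epoch, if stored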
Download the pretrained model from the link: Pretrained Topic-ConvS2S model and dictionary files (1.2 GB).
Make sure that ./xsum-data-topic-convs2s contains the test files to decode as well as the source and target dictionary files.
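A hypothetical pre-flight check along these lines can catch missing files before decoding; the doc-topics file name (test.doc-topics) mirrors the --doctopics prefix used below and is an assumption.

# Hypothetical pre-flight check; expected file names are assumptions.
import os

data_dir = "./xsum-data-topic-convs2s"
expected = ["test.document", "test.summary", "test.doc-topics",
            "dict.document.txt", "dict.summary.txt"]
missing = [name for name in expected if not os.path.exists(os.path.join(data_dir, name))]
print("missing files:", missing or "none")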
!python ./XSum-Topic-ConvS2S/generate.py ./xsum-data-topic-convs2s-output --path ../checkpoints-topic-convs2s/checkpoint_last.pt \
--batch-size 1 \
--beam 10 \
--replace-unk \
--source-lang document \
--target-lang summary \
--doctopics doc-topics \
--encoder-embed-dim 512 > ./test-output-topic-convs2s-checkpoint-best.pt
To extract the generated summaries (hypotheses) from the decoder output file, run the following:
!python ./extract-hypothesis-fairseq.py -o ./test-output-topic-convs2s-checkpoint-best.pt \
-f ./final-test-output-topic-convs2s-checkpoint-best.pt
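extract-hypothesis-fairseq.py is the repository's own script; as a rough sketch of what the step amounts to under the old fairseq output format (hypothesis lines look like "H-<id>\t<score>\t<tokens>"), it boils down to something like the following. The exact field layout is an assumption about this fairseq version.

# Hedged sketch of hypothesis extraction from fairseq generate output.
# Assumed line format: "H-<id>\t<score>\t<hypothesis tokens>".
def extract_hypotheses(output_file, final_file):
    hyps = {}
    with open(output_file, encoding="utf-8") as fin:
        for line in fin:
            if line.startswith("H-"):
                fields = line.rstrip("\n").split("\t")
                idx = int(fields[0][2:])   # example index after the "H-" prefix
                hyps[idx] = fields[-1]     # hypothesis text is the last field
    with open(final_file, "w", encoding="utf-8") as fout:
        for idx in sorted(hyps):
            fout.write(hyps[idx] + "\n")

extract_hypotheses("./test-output-topic-convs2s-checkpoint-best.pt",
                   "./final-test-output-topic-convs2s-checkpoint-best.pt")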
!python path/eval_rouge.py --summary {system_summary_file} --mod_sum {model_summary_file}
This takes a text file with the generated summaries and a file with the corresponding gold (model-written) summaries, and reports precision (P), recall (R), and F1 for ROUGE-1, ROUGE-2, and ROUGE-L. Sample output:
rouge-1: P: 30.00 R: 37.50 F1: 33.33
rouge-2: P: 11.11 R: 14.29 F1: 12.50
rouge-l: P: 26.15 R: 31.50 F1: 28.58
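As an independent sanity check on the numbers, the pip-installable rouge-score package can score a single summary pair; it is not the repository's eval_rouge.py, so the exact values may differ slightly from the output above.

# Hedged sketch using the rouge-score package (pip install rouge-score),
# not the repository's eval_rouge.py.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the gold summary text", "the generated summary text")
for name, s in scores.items():
    print(f"{name}: P: {s.precision * 100:.2f} R: {s.recall * 100:.2f} F1: {s.fmeasure * 100:.2f}")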
@InProceedings{xsum-emnlp,
author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
year = "2018",
address = "Brussels, Belgium",
}