Code for Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT (COLING 2020)
The CNN/DailyMail dataset we use comes directly from the chunked data in https://github.com/JafferWilson/Process-Data-of-CNN-DailyMailv (download the FINISHED FILES). Put the chunked data in /data/DMCNN/...
If you are interested in the fact-level CNN/DailyMail dataset described in our paper, you can download it here: https://drive.google.com/file/d/1ma0uuXd5b2EgMUslRIGGF6pVPFHBCIs-/view?usp=sharing.
Overview of the files:
/data/DMCNN/...: stores the chunked CNN/DailyMail dataset.
/data/raw_data_loader.py: extracts article-summary pairs from the chunked data.
/data_file/DMCNN/...: stores the pickle files of processed data generated by make_data.py; the folder includes a few examples. The complete fact-level data is available from the Google Drive link above.
/model/BERT.py: contains the BERT encoder with the Hierarchical Graph Mask and the classifier for extractive summarization.
/utility/pyrougex.py: evaluates the results with ROUGE.
/utility/utility.py: contains helper functions used by make_data.py.
call_rouge.py: evaluates the results with ROUGE.
data_loader.py: data loader for training and testing the model. It converts the data in the pickle files into the input format used by BERT and constructs the mask matrix.
make_data.py: splits the chunked data into facts and processes it. The output is a set of pickle files stored in data_file.
run.py: trains and tests the model.
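To give a rough idea of what a mask matrix built from fact boundaries can look like, here is a minimal sketch. It is not the code in data_loader.py: the function name build_fact_mask and the input format (one fact id per token) are assumptions made for illustration, and it covers only a single level (tokens attend to tokens in the same fact), whereas the mask in the paper is hierarchical.

```python
def build_fact_mask(fact_ids):
    """Return an n x n 0/1 matrix for n tokens.

    Entry [i][j] is 1 when tokens i and j carry the same fact id,
    so attention would be allowed only inside a fact.
    Hypothetical helper, not part of this repository.
    """
    n = len(fact_ids)
    return [
        [1 if fact_ids[i] == fact_ids[j] else 0 for j in range(n)]
        for i in range(n)
    ]


# Assumed toy input: five tokens, the first three in fact 0, the last two in fact 1.
mask = build_fact_mask([0, 0, 0, 1, 1])
for row in mask:
    print(row)
```

The result is block-diagonal: one block of ones per fact. Please refer to data_loader.py and /model/BERT.py for the mask actually used by the model.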
The output summaries of our model "our s+f" are in the result folder. our s+f_cand corresponds to the standard setting described in our paper, and our s+f 6_cand contains the results of extracting 6 facts rather than 4.
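The difference between the two settings can be pictured with the sketch below. It assumes that each fact receives a score from the classifier and that the k highest-scoring facts are kept in document order; this selection rule, the function name select_top_k and the scores are illustrative assumptions, not taken from run.py.

```python
def select_top_k(scores, k=4):
    """Return the indices of the k highest-scoring facts, in document order.

    Hypothetical helper: assumes one score per fact, higher is better.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up scores for seven facts of one article.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.6]
four_facts = select_top_k(scores, k=4)  # 4 facts, as in the standard setting
six_facts = select_top_k(scores, k=6)   # 6 facts, as in the second result file
print(four_facts, six_facts)
```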