Summarization is the task of producing a shorter version of a document that preserves most of the original document's meaning.
The CNN / Daily Mail dataset as processed by Nallapati et al. (2016) has been used for evaluating summarization. The dataset contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average). The processed version contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs. Models are evaluated based on ROUGE-1, ROUGE-2, ROUGE-L, and METEOR (optional). * indicates that models were trained and evaluated on the anonymized version of the dataset.
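To give a rough idea of what ROUGE-N measures, the following is a minimal sketch of n-gram overlap between a reference summary and a generated one. It is a simplification and not the official ROUGE toolkit: it assumes lowercased whitespace tokenization, applies no stemming or stopword removal, and does not cover ROUGE-L. The function names `ngram_counts` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams occurring in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N (assumption: lowercased whitespace tokens, no stemming)."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

Published results are normally computed with an established ROUGE implementation, so scores from a sketch like this are not directly comparable to the numbers reported in papers.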
Sentence compression produces a shorter sentence by removing redundant information while preserving the grammaticality and the important content of the original sentence.
The Google Dataset was built by Filippova et al., 2013 (Overcoming the Lack of Parallel Data in Sentence Compression). The first release contained only 10,000 sentence-compression pairs; an additional 200,000 pairs were released last year.
Example of a sentence-compression pair:
Sentence: Floyd Mayweather is open to fighting Amir Khan in the future, despite snubbing the Bolton-born boxer in favour of a May bout with Argentine Marcos Maidana, according to promoters Golden Boy
Compression: Floyd Mayweather is open to fighting Amir Khan in the future.
In short, this is a deletion-based task in which the compression is a subsequence of the original sentence. Of the 10,000 pairs in the eval portion (repository), the first 1,000 sentences are used for automatic evaluation, and the 200,000 pairs are used for training.
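Because the compression is a subsequence of the sentence, each pair can be turned into per-token keep/delete labels. The sketch below shows one way to do this with a greedy left-to-right alignment. It assumes a simple regex tokenizer that drops punctuation, which is not necessarily how the dataset is tokenized, and the helper names `tokenize` and `keep_mask` are hypothetical.

```python
import re


def tokenize(text):
    """Assumed tokenizer: word characters only, punctuation dropped."""
    return re.findall(r"\w+", text)


def keep_mask(sentence_tokens, compression_tokens):
    """Return a 0/1 label per sentence token (1 = kept in the compression).

    Uses greedy left-to-right matching. Returns None when the compression
    is not a subsequence of the sentence.
    """
    mask = [0] * len(sentence_tokens)
    j = 0  # index of the next compression token still to be matched
    for i, token in enumerate(sentence_tokens):
        if j < len(compression_tokens) and token == compression_tokens[j]:
            mask[i] = 1
            j += 1
    return mask if j == len(compression_tokens) else None


if __name__ == "__main__":
    sentence = (
        "Floyd Mayweather is open to fighting Amir Khan in the future, "
        "despite snubbing the Bolton-born boxer in favour of a May bout "
        "with Argentine Marcos Maidana, according to promoters Golden Boy"
    )
    compression = "Floyd Mayweather is open to fighting Amir Khan in the future."
    print(keep_mask(tokenize(sentence), tokenize(compression)))
```

When a token occurs more than once in the sentence, the greedy alignment picks the earliest possible match, which may differ from the alignment the dataset authors intended.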
Models are evaluated using the following metrics:
- F1 - the harmonic mean of precision and recall, computed over the tokens kept in the gold and the generated compressions.
- Compression rate (CR) - the length of the compression in characters divided by the length of the sentence.
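The two metrics above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used in the cited papers: compressions are represented as 0/1 keep masks over the source tokens, and the function names `token_f1` and `compression_rate` are hypothetical.

```python
def token_f1(gold_mask, pred_mask):
    """F1 over kept tokens, given 0/1 keep masks aligned to the same sentence."""
    if len(gold_mask) != len(pred_mask):
        raise ValueError("masks must cover the same sentence tokens")
    kept_in_both = sum(1 for g, p in zip(gold_mask, pred_mask) if g and p)
    gold_kept = sum(gold_mask)
    pred_kept = sum(pred_mask)
    precision = kept_in_both / pred_kept if pred_kept else 0.0
    recall = kept_in_both / gold_kept if gold_kept else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def compression_rate(sentence, compression):
    """Length of the compression in characters divided by the sentence length."""
    return len(compression) / len(sentence)


if __name__ == "__main__":
    gold = [1, 1, 1, 0, 0]  # gold compression keeps the first three tokens
    pred = [1, 1, 0, 1, 0]  # system keeps tokens 1, 2 and 4
    print(token_f1(gold, pred))
    print(compression_rate("abcdefghij", "abcd"))
```

Corpus-level scores are typically obtained by averaging these per-sentence values over the evaluation set, although the exact aggregation may differ between papers.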
Model | F1 | CR | Paper / Source | Code |
---|---|---|---|---|
LSTM (Filippova et al., 2015) | 0.82 | 0.38 | Sentence Compression by Deletion with LSTMs | |
BiLSTM (Wang et al., 2017) | 0.8 | 0.43 | Can Syntax Help? Improving an LSTM-based Sentence Compression Model for New Domains | |