This repository contains data and code associated with the LREC 2022 paper The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts.
Note: The official repository for the Project Dialogism Novel Corpus has been moved to here, and will be updated with new novels as they are annotated.
Please cite the following work if you use the PDNC dataset or the code associated with the paper:
@article{vishnubhotla2022project,
title={The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts},
author={Vishnubhotla, Krishnapriya and Hammond, Adam and Hirst, Graeme},
journal={arXiv preprint arXiv:2204.05836},
year={2022}
}
The PDN Corpus contains annotations for speaker, addressees, referring expression, and pronominal mentions for all quotations in 22 novels. The list of novels can be found in the file ListOfNovels.txt
.
In the data
folder, for each novel, there are three files:
-
text.txt
: The text of the novel -
quotations.csv
: This is a CSV file where each row contains, for a quotation:- The text of the quotation
- The corresponding character-byte spans from the novel text
- The name of the speaker
- The names of the addressees
- Texts of the mentions annotated within the quotation
- Character-byte spans of the mentions from the novel text
- The entities referred to by the above mentions
- The type of the quotation (implicit, anaphoric, or explict)
- The referring expression associated with the quotation, if any
-
charDict.pkl
: Each character-entity in a novel is assigned a unique ID. This pickle file is a dictionary with the following key-value pairs:- id2names: The list of names (aliases) associated with each ID
- name2id: A reverse-mapping of each character alias to the corresponding ID
- id2parent: The main character name associated with each ID
The IPython notebook load_data.ipynb
shows how to load and read the data files for a novel.
The full text of the annotation guidelines that were used to annotate this corpus can be viewed at this link.
The code
folder contains scripts needed to run the semi-supervised classification approach described in Section 5.1.3 of the paper.
You can run the classifier for a novel with the following command:
python semi_sup_clf.py --novel <novel-name> --save_path outputs/
where <novel-name>
should be substituted with the corresponding folder name in the data
folder for a novel.
Please contact the authors of the paper with any queries:
- Krishnapriya Vishnubhotla (University of Toronto)
- Adam Hammond (University of Toronto)
- Graeme Hirst (University of Toronto)
Contact: [email protected], [email protected], [email protected]