The Project Dialogism Novel Corpus

This repository contains data and code associated with the LREC 2022 paper The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts.

Note: The official repository for the Project Dialogism Novel Corpus has moved here and will be updated with new novels as they are annotated.

Please cite the following work if you use the PDNC dataset or the code associated with the paper:

@article{vishnubhotla2022project,
  title={The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts},
  author={Vishnubhotla, Krishnapriya and Hammond, Adam and Hirst, Graeme},
  journal={arXiv preprint arXiv:2204.05836},
  year={2022}
}

Data and Annotation

The PDNC contains annotations of the speaker, addressees, referring expression, and pronominal mentions for every quotation in 22 novels. The novels are listed in the file ListOfNovels.txt.

In the data folder, for each novel, there are three files:

text.txt: The text of the novel
quotations.csv: A CSV file in which each row describes one quotation:
- The text of the quotation
- The corresponding character-byte spans from the novel text
- The name of the speaker
- The names of the addressees
- Texts of the mentions annotated within the quotation
- Character-byte spans of the mentions from the novel text
- The entities referred to by the above mentions
- The type of the quotation (implicit, anaphoric, or explicit)
- The referring expression associated with the quotation, if any
charDict.pkl: Each character-entity in a novel is assigned a unique ID. This pickle file is a dictionary with the following key-value pairs:
- id2names: The list of names (aliases) associated with each ID
- name2id: A reverse-mapping of each character alias to the corresponding ID
- id2parent: The main character name associated with each ID
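As a rough illustration of the layout above, the three files for one novel could be read with the Python standard library along these lines. This is a minimal sketch, not the code from the helper notebook: the function names load_novel and main_name are hypothetical, the text is assumed to be UTF-8, quotations.csv is assumed to have a header row, and name2id and id2parent are assumed to be plain dictionaries. The column names are not documented here, so the sketch keeps whatever header the CSV file declares.

```python
import csv
import pickle
from pathlib import Path


def load_novel(novel_dir):
    """Load the three per-novel files from one folder under data/ (hypothetical helper)."""
    novel_dir = Path(novel_dir)
    # Assumption: the novel text is UTF-8 encoded.
    text = (novel_dir / "text.txt").read_text(encoding="utf-8")
    # Assumption: the CSV has a header row; each row becomes a dict keyed by it.
    with open(novel_dir / "quotations.csv", newline="", encoding="utf-8") as f:
        quotations = list(csv.DictReader(f))
    # Only unpickle files that come from a source you trust.
    with open(novel_dir / "charDict.pkl", "rb") as f:
        char_dict = pickle.load(f)
    return text, quotations, char_dict


def main_name(char_dict, alias):
    """Resolve an alias to its main character name, or None if the alias is unknown."""
    # Assumption: name2id and id2parent are plain dict objects.
    char_id = char_dict["name2id"].get(alias)
    if char_id is None:
        return None
    return char_dict["id2parent"].get(char_id)
```

A call such as load_novel("data/<novel-name>") would return the text, the quotation rows, and the character dictionary; main_name(char_dict, alias) then maps any alias to the main name of its character.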

Helper File

The IPython notebook load-data.ipynb shows how to load and read the data files for a novel.

Annotation Guidelines

The full text of the annotation guidelines that were used to annotate this corpus can be viewed at this link.

Code

The code folder contains scripts needed to run the semi-supervised classification approach described in Section 5.1.3 of the paper. You can run the classifier for a novel with the following command:

    python semi_sup_clf.py --novel <novel-name> --save_path outputs/

where <novel-name> should be substituted with the corresponding folder name in the data folder for a novel.
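To run the classifier over every novel in turn, the command above can be wrapped in a small loop. The following is a sketch under stated assumptions: build_commands and run_all are hypothetical helpers, not part of the repository, and the sketch assumes that the script is launched from inside the code folder and that each subfolder of the data folder is one novel. Only the --novel and --save_path flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(data_dir, save_path="outputs/"):
    """Return one classifier command per novel folder found in data_dir."""
    commands = []
    for novel_dir in sorted(Path(data_dir).iterdir()):
        # Assumption: every subfolder of the data folder is one novel.
        if novel_dir.is_dir():
            commands.append([
                sys.executable, "semi_sup_clf.py",
                "--novel", novel_dir.name,
                "--save_path", save_path,
            ])
    return commands


def run_all(data_dir, code_dir, save_path="outputs/"):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in build_commands(data_dir, save_path):
        # Assumption: the script expects to be run from inside the code folder.
        subprocess.run(cmd, cwd=code_dir, check=True)
```

Calling run_all("data", "code") from the repository root would then invoke the script once per novel folder. Because check=True is set, a failing run raises an exception instead of being skipped silently.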

Authors

Please contact the authors of the paper with any queries:

Krishnapriya Vishnubhotla (University of Toronto)
Adam Hammond (University of Toronto)
Graeme Hirst (University of Toronto)

