Voices Speaking To and About One Another: Introducing the Project Dialogism Novel Corpus

Long paper
  1. Adam Hammond

    University of Toronto, Canada

  2. Krishnapriya Vishnubhotla

    University of Toronto, Canada

  3. Saif M. Mohammad

    National Research Council Canada, Ottawa, ON, Canada

  4. Graeme Hirst

    University of Toronto, Canada

We introduce a new dataset for the computational analysis of novels: the Project Dialogism Novel Corpus (PDNC). The PDNC currently consists of 22 novels in which all quotations are identified and annotated for speaker, addressee(s), and characters mentioned. PDNC is by an order of magnitude the largest corpus of its kind. Each novel is annotated manually by a pair of annotators using customized software we developed. Alongside this paper, we are releasing the dataset itself, the annotation software (including its source code), and our annotation guidelines. In the discussion section, we present two applications of the PDNC from our own research: quote attribution and emotion dynamics. We argue that the PDNC will promote a more nuanced and accurate view of novelistic discourse: whereas much research currently envisions the novel as expressing the voice of the author, the PDNC presents novels as a polyphonic fabric of characters’ voices.

Overview of the Project Dialogism Novel Corpus
The PDNC currently consists of 22 novels (see Table 1). In selecting novels, our aim has been to annotate texts that (1) span a variety of genres (literary fiction, children’s literature, detective fiction, and science fiction are represented); (2) are also included in the LitBank (Bamman et al. 2019) and QuoteLi (Muzny et al. 2017) corpora, to facilitate comparison and validation; and (3) are of broad interest to a variety of scholars while remaining relevant to our group’s interest in stylistic diversity and dialogism. Further, we have chosen to annotate multiple novels by Jane Austen, in order to facilitate comparative analysis of a single author’s oeuvre (Austen was chosen because she is included in all existing corpora).
The annotation workflow proceeds as follows. First, the novel is pre-processed in GutenTag (Brooke et al. 2015); from this, a provisional character list is built and likely quotations are identified. Next, the novel is manually annotated in our customized software (see Figure 1). This is done separately by two annotators. Working from our guidelines (Hammond et al. 2022), annotators select each quotation, then identify the speaker, the addressee(s), and anyone mentioned in the quotation (whether by name or pronoun). Annotators also identify the referring expression for each quotation, as well as the quotation type: explicit (the referring expression gives the character’s name; for example, “said Emma”), pronominal (the referring expression gives a pronoun; “she said”), or implicit (no referring expression). Once both annotators have completed their work, their annotations are compared for discrepancies. The annotators then meet to resolve any disagreements, in what we call a “consensus exercise.” Once comparison shows no disagreement between the two sets of annotations, the novel is considered annotated.
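The comparison step in this workflow can be illustrated with a minimal Python sketch. It is not the released annotation software: the record layout, the field names, and the function find_discrepancies are hypothetical, chosen only to mirror the annotated properties described above (speaker, addressees, mentions, referring expression, quotation type). The example quotations are invented.

```python
from dataclasses import dataclass
from enum import Enum


class QuoteType(Enum):
    EXPLICIT = "explicit"      # referring expression names the speaker ("said Emma")
    PRONOMINAL = "pronominal"  # referring expression is a pronoun ("she said")
    IMPLICIT = "implicit"      # no referring expression


@dataclass(frozen=True)
class QuoteAnnotation:
    """One annotator's record for one quotation (hypothetical layout)."""
    quote_id: str
    speaker: str
    addressees: frozenset
    mentions: frozenset
    quote_type: QuoteType
    referring_expression: str = ""


COMPARED_FIELDS = ["speaker", "addressees", "mentions", "quote_type", "referring_expression"]


def find_discrepancies(first, second):
    """Return {quote_id: [names of fields that differ]} for two annotation passes."""
    by_id_first = {a.quote_id: a for a in first}
    by_id_second = {a.quote_id: a for a in second}
    discrepancies = {}
    for quote_id in sorted(by_id_first.keys() | by_id_second.keys()):
        a, b = by_id_first.get(quote_id), by_id_second.get(quote_id)
        if a is None or b is None:
            # Only one annotator marked this span as a quotation.
            discrepancies[quote_id] = ["missing"]
            continue
        differing = [f for f in COMPARED_FIELDS if getattr(a, f) != getattr(b, f)]
        if differing:
            discrepancies[quote_id] = differing
    return discrepancies


# Invented example: the annotators disagree on the addressee of q2.
annotator_a = [
    QuoteAnnotation("q1", "Emma", frozenset({"Mr. Knightley"}), frozenset({"Harriet"}),
                    QuoteType.EXPLICIT, "said Emma"),
    QuoteAnnotation("q2", "Mr. Knightley", frozenset({"Emma"}), frozenset(),
                    QuoteType.IMPLICIT),
]
annotator_b = [
    QuoteAnnotation("q1", "Emma", frozenset({"Mr. Knightley"}), frozenset({"Harriet"}),
                    QuoteType.EXPLICIT, "said Emma"),
    QuoteAnnotation("q2", "Mr. Knightley", frozenset({"Harriet"}), frozenset(),
                    QuoteType.IMPLICIT),
]

to_resolve = find_discrepancies(annotator_a, annotator_b)
print(to_resolve)  # {'q2': ['addressees']}
```

Under this sketch, a novel would count as annotated once find_discrepancies returns an empty dictionary for the two annotators' passes.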
The PDNC is by an order of magnitude the largest corpus of its kind (see Table 2). The largest previous corpus of novels annotated in this manner is the QuoteLi corpus, which contains only three novels (Pride and Prejudice and Emma, both in PDNC; and Chekhov’s The Steppe, not in PDNC). The LitBank corpus includes annotations for 100 novels, but only a very small fraction of each is annotated (on average, only 2,000 words). The Columbia Quoted Speech Attribution Corpus (Elson and McKeown 2010) consists of six texts, two of which are compilations of short stories, but they are only partly annotated for quote attribution.

Table 1. PDNC: tokens, quotations, speakers, total number of addressees recorded, and total number of mentions.

Figure 1. Screenshot from our custom annotation software.

Table 2. Comparison of PDNC with previous quotation attribution corpora.

Research Applications
The research applications of the PDNC are multiple, extending well beyond the boundaries of our own research interests. Yet our own research serves to demonstrate some of its possible uses.
We began developing the PDNC primarily to test our quote attribution system (Hammond et al. 2020). The corpus has proven essential to this work, allowing us to compare our systems against state-of-the-art systems like QuoteLi and the BERT-based system in the latest release of BookNLP (see Table 3).

Table 3. Comparison of the performance of our latest quote attribution system with QuoteLi and BookNLP. Numbers reported are accuracy scores; best scores are bolded.
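As an illustration of how an accuracy score of this kind can be computed against gold annotations, here is a minimal Python sketch. The function name attribution_accuracy and the dictionary layout (quotation id mapped to speaker name) are hypothetical, the example data are invented, and treating a quotation with no prediction as an error is an assumption; the source does not state how unattributed quotations were scored.

```python
def attribution_accuracy(gold, predicted):
    """Fraction of gold quotations whose predicted speaker matches the gold speaker.

    Both arguments map quotation ids to speaker names. A quotation missing
    from `predicted` is counted as wrong (an assumption of this sketch).
    """
    if not gold:
        raise ValueError("gold annotations are empty")
    correct = sum(1 for quote_id, speaker in gold.items()
                  if predicted.get(quote_id) == speaker)
    return correct / len(gold)


# Invented example: the system gets three of four quotations right.
gold = {"q1": "Emma", "q2": "Mr. Knightley", "q3": "Harriet", "q4": "Emma"}
system_output = {"q1": "Emma", "q2": "Emma", "q3": "Harriet", "q4": "Emma"}

accuracy = attribution_accuracy(gold, system_output)
print(f"{accuracy:.2f}")  # 0.75
```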

Perhaps the largest aim of PDNC is to reorient computational work away from conceiving novels as undifferentiated lumps of text attributed solely to their authors, and toward conceiving them as complex fabrics of differentiated voices speaking to and about one another, mediated by a narrator. In the paper introducing the tool GutenTag (Hammond and Brooke 2017), one of our authors used a rudimentary version of PDNC to rebut Matthew Jockers’s (2013) claim that female novelists generally write about stereotypically feminine themes. By looking at character voices within novels, rather than attributing all of a novel’s text to its author, we demonstrated that it was female characters who discussed these themes, and that Jockers’s results were a secondary consequence of the fact that female authors tended to include far more female characters in their works. By allowing researchers to look within novels and analyze them through the voices that make them up, PDNC will shift research away from mistaken assumptions and conclusions like Jockers’s.

Our work on “emotion dynamics” (the study of change in emotional states over time) presents another example of new research enabled by the PDNC. Sentiment analysis is among the richest and most vital areas of computational literary research today. Yet major work seeking to plot novels’ sentiment trajectories remains limited by the necessity of assuming a single source for all words: the author (Elsner 2012, Mohammad 2011, Jockers 2014, Reagan et al. 2016). In a pioneering essay on “emotion dynamics” in films, Hipson and Mohammad (2021) show the benefits of considering individual characters’ emotional trajectories. This approach enables researchers to determine each character’s “home base” (typical emotional range) as well as their emotional variability and the speed at which they regulate variations. We are currently working to apply this approach to the novels in PDNC. Figures 2–4 show the emotional trajectory of Jake Barnes in Ernest Hemingway’s The Sun Also Rises, revealing that this reputedly taciturn character in fact experiences one of the most extreme emotional troughs (in terms of valence) of any character in PDNC. We are using this approach to test whether characters’ emotion dynamics track familiar literary-critical categories such as flat vs. round characters (Forster 1927). We are also investigating the extent to which emotional trajectories are gendered, and whether male or female authors are more likely to create characters that diverge from gender norms.
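A per-character valence trajectory of the kind described here can be sketched in a few lines of Python. This is a simplified illustration, not the method of Hipson and Mohammad (2021): the toy lexicon, the window size, the function names, and the definition of the home base as the band within one standard deviation of the trajectory mean are all assumptions made for the example, and the utterances are invented.

```python
from statistics import mean, pstdev

# Hypothetical toy valence lexicon (scores in [0, 1]); not taken from any real lexicon.
TOY_VALENCE = {"happy": 0.9, "good": 0.8, "fine": 0.7,
               "tired": 0.3, "awful": 0.1, "hell": 0.1}


def utterance_valence(utterance, lexicon):
    """Mean valence of the lexicon words in one utterance, or None if none occur."""
    scores = [lexicon[w] for w in utterance.lower().split() if w in lexicon]
    return mean(scores) if scores else None


def valence_trajectory(utterances, lexicon, window=2):
    """Rolling mean of per-utterance valence, taken in narrative order."""
    scored = [v for v in (utterance_valence(u, lexicon) for u in utterances)
              if v is not None]
    return [mean(scored[i:i + window]) for i in range(len(scored) - window + 1)]


def home_base(trajectory):
    """Band within one standard deviation of the mean (an assumed definition)."""
    centre, spread = mean(trajectory), pstdev(trajectory)
    return centre - spread, centre + spread


# Invented utterances for a single character, in the order they are spoken.
example_utterances = ["I feel fine", "good and happy", "so tired",
                      "awful just awful", "feel like hell", "fine now"]

trajectory = valence_trajectory(example_utterances, TOY_VALENCE, window=2)
low, high = home_base(trajectory)
trough_index = min(range(len(trajectory)), key=trajectory.__getitem__)

print([round(v, 3) for v in trajectory])
print(f"home base: {low:.2f} to {high:.2f}; trough at window {trough_index}")
```

Because PDNC attributes every quotation to a speaker, the input list can be built per character rather than per novel, which is what makes trajectories like this possible.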

Figure 2. Emotion dynamics trajectory, valence only, for characters in Ernest Hemingway’s The Sun Also Rises. Jake Barnes’s emotional trajectory is highlighted; the trough three-quarters of the way through the novel (~76%-87%) occurs during and after his fight with Robert Cohn at the Fiesta.

Figure 3. Emotion dynamics, valence only, for all characters in PDNC. Jake Barnes’s trajectory (highlighted) is extreme in the context of the characters in our corpus.

Figure 4. Emotion words (with frequency counts) used by Jake Barnes during the trough (the 76%-87% portion of the novel).


Bamman, D., Popat, S., and Shen, S. (2019). An annotated dataset of literary entities. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2138-2144.

Brooke, J., Hammond, A., and Hirst, G. (2015). GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus. Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pp. 42-47.

Elsner, M. (2012). Character-based kernels for novelistic plot structure. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 634-644.

Elson, D. K. and McKeown, K. R. (2010). Automatic attribution of quoted speech in literary narrative. Twenty-Fourth AAAI Conference on Artificial Intelligence.

Forster, E. M. (1927). Aspects of the Novel. New York: Harcourt, Brace, and Company.

Hammond, A. and Brooke, J. (2017). GutenTag: A User-Friendly, Open-Access, Open-Source System for Reproducible Large-Scale Computational Literary Analysis. Proceedings of the Digital Humanities 2017 Conference, pp. 246-249.

Hammond, A., Vishnubhotla, K., and Hirst, G. (2020). The Words Themselves: A Content-Based Approach to Quote Attribution. Proceedings of the Digital Humanities 2020 Conference.

Hammond, A., Vishnubhotla, K., Duarte, L., Oh, S., Pajovic, J., and Siegal, B. (2022). Annotation Guidelines for the Project Dialogism Novel Corpus. https://tinyurl.com/quoteattribution (accessed April 28, 2022).

Mohammad, S. (2018). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 174-184.

Brooke, J., and Hirst, G. (2013). A multi-dimensional Bayesian approach to lexical style. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 673-679.

He, H., Barbosa, S., and Kondrak, G. (2013). Identification of speakers in novels. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1312-1320.

Hipson, W. E., and Mohammad, S. M. (2021). Emotion dynamics in movie dialogues. PLoS ONE 16(9).

Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press.

Jockers, M. (2014). A novel method for detecting plot. http://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/ (accessed April 28, 2022).

Mohammad, S. (2011). From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales. Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH).

Muzny, F., Fang, M., Chang, A., and Jurafsky, D. (2017). A two-stage sieve approach for quote attribution. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 460-470.

Reagan, A. J., Mitchell, L., Kiley, D., Danforth, C. M., and Sheridan Dodds, P. (2016). The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science 5(31), pp. 1-12.

Conference Info

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO