Project Dialogism: Toward a Computational History of Vocal Diversity in English-Language Literature

paper, specified "short paper"
  1. Adam Hammond

    San Diego State University (SDSU)

  2. Julian Brooke

    University of Melbourne


Over the past several years, we have worked to develop a human-interpretable computational method for quantifying “style” in literary texts. In projects focused on modernist texts, we have demonstrated the usefulness of this approach for studying dialogism (or multi-voicedness) in literature. Now we propose to extend our method to the “big data” scale by using a tool we have created, GutenTag. Our research promises insights into the historical evolution of dialogism in English-language fiction.

Aims and Approach
Dialogism, the literary practice of allowing characters to speak in their own distinctive manners, without altering their speech to suit the particular linguistic practices and prejudices of the author, has been recognized as an ethically and politically significant aspect of fiction since the early twentieth century. Russian critic Mikhail Bakhtin, who coined the term “dialogism,” has been particularly influential in arguing that the dialogic novel could support pluralistic modes of thought that model democratic social systems. Yet despite the widely recognized importance of dialogism as an analytic category in literary studies, it has proven difficult to study computationally, particularly at the “big data” scale. While style has proven a tractable aspect for computational literary study, and while excellent work has been produced on distinguishing character voices within literary texts using style-based methods (Burrows, 1987; McKenna and Antonia, 1996; Rybicki, 2006), existing approaches present two significant drawbacks. First, their reliance on Burrows’s PCA-based methodology means that while this work often produces reliable and insightful results, its computational outputs are generally not human-interpretable; they may be able to show that an author distinguishes characters based on linguistic style, but not to tell us how they are differentiated. Second, these methods tend not to scale to the “big data” level, since they require significant manual annotation of character speech in the texts under investigation. Our method (a human-interpretable quantitative method for analyzing literary style) and our tool (which performs automatic structural tagging of plain text) make such research possible, and thus open the way for the first large-scale investigation of dialogism in literary fiction.

Since 2011, we have been laying the foundations for a computational history of dialogism in English-language fiction. The first step was developing and refining our six-dimensional approach to quantifying literary style. Our method is based on six discrete aspects of style: objectivity (words that project a sense of disinterested authority); abstractness (words denoting concepts that cannot be described in purely physical terms); literariness; colloquialness; concreteness (words referring to events, objects, or properties in the physical world); and subjectivity (words that are strongly personal or reflect a personal opinion). To build our stylistic lexicons, we produce a relatively small set of words carefully selected for their stylistic diversity, which human annotators evaluate in terms of the six stylistic aspects listed above. Next, we use an automatic procedure to collect information on how these words are employed in all English texts in the 2010 image of Project Gutenberg (Brooke et al., 2016). Using this information, we are able to derive stylistic information for any word in our target text, and to build stylistic profiles for any character or speaking voice within a text.
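To make the idea of a stylistic profile concrete, the sketch below averages per-word style scores over one speaker's text. It is only an illustration under stated assumptions: the lexicon format (a word mapped to six scores), the regular-expression tokenizer, and the use of a simple mean are all hypothetical choices made here, not the procedure described in Brooke et al. (2016).

```python
import re
from typing import Dict, Optional

# The six style dimensions named in the text.
STYLES = ("objectivity", "abstractness", "literariness",
          "colloquialness", "concreteness", "subjectivity")


def style_profile(
    text: str,
    lexicon: Dict[str, Dict[str, float]],  # hypothetical format: word -> six scores
) -> Optional[Dict[str, float]]:
    """Average the style scores of every token that appears in the lexicon.

    Returns None when no token is covered, so callers can skip empty voices.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    scored = [lexicon[token] for token in tokens if token in lexicon]
    if not scored:
        return None
    return {
        style: sum(entry[style] for entry in scored) / len(scored)
        for style in STYLES
    }
```

Given a lexicon covering the words of a character's direct speech, `style_profile(speech, lexicon)` would return one number per style dimension, which is the kind of per-voice summary the later steps compare.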

We have demonstrated the usefulness of our six-style approach to literary problems in two projects. One project focused on free indirect discourse (FID) in Woolf’s To the Lighthouse and Joyce’s “The Dead” (Brooke et al., 2016). Our intention was to employ our method to test the long-held hypothesis that FID represents a stylistic middle ground between the neutral style of an objective narrator and the more extreme styles of personalized characters as rendered in direct discourse. Our method confirmed this assumption and, because it produces human-interpretable results, shed some new light on Woolf’s text in particular, finding that while Woolf’s upper-class characters exhibit a conventional power dynamic (they are more authoritative, more literary, more concrete, less objective, and less colloquial than characters of other classes), her female characters reverse these conventional dynamics: they are more objective, more abstract, less colloquial, and less subjective. The other project undertook a quantitative investigation of the problem of voice in T. S. Eliot’s The Waste Land (Brooke et al., 2015b). While it is generally agreed that The Waste Land is composed of many speaking voices, these voices are not explicitly identified, nor are their points of transition provided. Our work explored methods of automatically segmenting and clustering voices in the text; we used the human-interpretable results of our six-style approach to evaluate the performance of various approaches and arrive at a blended human/machine interpretation.

The other key foundation for our work is GutenTag, an open-source software tool released in October 2015 (Brooke et al., 2015a). Two aspects of GutenTag are particularly relevant to the present project. First, it allows users to quickly create large, customized literary corpora. Working from the 2010 image of Project Gutenberg (PG), metadata provided by PG and derived automatically from other sources, and our automatic decision-tree genre classifier, one can, for example, easily assemble a large corpus of nineteenth-century novels in English. Second, it uses our sophisticated rule-based method to automatically generate reliable, genre-specific structural tags in standard TEI XML. Crucially, our tagging system is able to distinguish character speech from narration and identify individual characters in novels; and to separate character speech from stage directions and setting descriptions in plays, associating each speech with a character in the dramatis personae.

Developing a Metric for Dialogism
With the major pieces in place to conduct our research into the history of dialogism, our main task is to develop a reliable quantitative metric for the dialogism of a given literary text. We will proceed by calculating stylistic profiles for each character using methods already established in previous work. We will include a minimum word cutoff to exclude characters for which the stylistic profile is likely to be too noisy due to lack of data. Next, assuming multiple characters, for each style we will treat each character as a datapoint and calculate a weighted variance, where weights are applied based on the relative proportion of speech by each character, to produce a number that indicates overall stylistic variation across characters for that stylistic dimension in the given text. We can average across styles to produce a single metric.
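The calculation described above can be sketched as follows. This is a minimal illustration, not a finished metric: it assumes a population-style weighted variance (no bias correction), uses each character's word count as the weight, and picks an arbitrary cutoff of 500 words; the function names and the `min_words` default are hypothetical. It also assumes the six style scores are on comparable scales, since otherwise a plain average across styles would be dominated by the widest-ranging dimension.

```python
from typing import Dict, List, Optional, Tuple

# The six style dimensions named in the text.
STYLES = ("objectivity", "abstractness", "literariness",
          "colloquialness", "concreteness", "subjectivity")


def weighted_variance(values: List[float], weights: List[float]) -> float:
    """Weighted variance: sum(w * (x - mean)^2) / sum(w), with a weighted mean."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(w * x for x, w in zip(values, weights)) / total
    return sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total


def dialogism_score(
    profiles: Dict[str, Dict[str, float]],  # character -> style -> score
    word_counts: Dict[str, int],            # character -> words of speech
    min_words: int = 500,                   # hypothetical cutoff, not from the text
) -> Optional[Tuple[float, Dict[str, float]]]:
    """Return (average variance across styles, per-style variances).

    Characters below the word cutoff are dropped; with fewer than two
    characters left there is nothing to compare, so None is returned.
    """
    kept = [name for name in profiles if word_counts.get(name, 0) >= min_words]
    if len(kept) < 2:
        return None
    weights = [float(word_counts[name]) for name in kept]
    per_style = {
        style: weighted_variance([profiles[name][style] for name in kept], weights)
        for style in STYLES
    }
    overall = sum(per_style.values()) / len(STYLES)
    return overall, per_style
```

Keeping the per-style variances alongside the single averaged number preserves interpretability: a text can then be described as varied in, say, colloquialness but uniform in abstractness, rather than only as "more" or "less" dialogic.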
We will experiment with calculating dialogism based on stylistic variance between (a) individual characters, (b) groupings of characters (based on gender, social background, age, etc.), and (c) social networks. This will allow us to track (a) texts in which the speech of individual characters is highly varied, (b) texts in which, for example, male characters speak in a highly distinct manner from female characters, and (c) texts in which members of distant branches of a derived social network speak in their own differentiated styles. To carry out this analysis, certain modifications will be necessary to the GutenTag system; namely, methods to detect social networks in texts and to automatically identify character groupings (a method for identifying character gender is already in place; methods for identifying other groups will be more difficult, but we will explore them).
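For variant (b), one possible first step is to pool character profiles into group profiles before measuring variance between groups. The sketch below does this with a word-count-weighted mean; the `groups` mapping (character to group label) and the function name are hypothetical, and pooling by weighted mean is an assumption made here rather than a method stated in the text.

```python
from collections import defaultdict
from typing import Dict, Tuple


def pool_by_group(
    profiles: Dict[str, Dict[str, float]],  # character -> style -> score
    word_counts: Dict[str, int],            # character -> words of speech
    groups: Dict[str, str],                 # character -> group label (hypothetical)
) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Return (group -> style -> pooled score, group -> total word count).

    Characters with no group label are skipped, as are groups whose
    members contribute no words.
    """
    totals: Dict[str, float] = defaultdict(float)
    sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for name, profile in profiles.items():
        group = groups.get(name)
        if group is None:
            continue
        weight = float(word_counts.get(name, 0))
        totals[group] += weight
        for style, score in profile.items():
            sums[group][style] += weight * score
    pooled = {
        group: {style: value / totals[group] for style, value in sums[group].items()}
        for group in totals
        if totals[group] > 0
    }
    return pooled, {group: totals[group] for group in pooled}
```

The pooled profiles and their total word counts can then be treated as datapoints and weights, in the same way as individual characters, when calculating stylistic variance between groups.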

Research Questions
With these technical foundations in place, we will be ready to present our answers to some of the following questions. Our aim is not to cover all of them, but to offer detailed investigations of those that yield the most interesting results.

Which texts in PG are the most stylistically diverse, according to our quantitative definition? Are they texts by authors traditionally celebrated as dialogic (Austen, Woolf, Joyce, translations of Dostoevsky) or are they non-canonical texts? If the latter, do these texts register qualitatively as “dialogic” to a human reader? What does a close reading of these texts reveal about the viability of our automatic method?
How does stylistic diversity map onto historical time? Do periods of political turmoil (wars, revolutions, natural disasters, strikes, etc.) correspond to changes in the average stylistic diversity of fiction? What can we learn about the social role of fiction by studying this relationship?
How is stylistic diversity distributed geographically? Which regions produce the most stylistically varied writing?
How is stylistic diversity distributed among genres? Does our method support or refute Bakhtin’s claim that the novel is the most dialogic of the genres, and poetry the least dialogic? Can dialogism be meaningfully compared across genres?
Are authors of different ages, classes, and genders more or less likely to produce dialogic fiction?
How does the stylistic diversity of fiction track against that of non-fiction? Does non-fiction become more or less dialogic over time, and does it follow a similar curve to that of fiction? Do changes in the dialogism of fiction anticipate changes in non-fiction, or are the two unrelated?


Brooke, J., Hammond, A., and Hirst, G. (2016). Using Models of Lexical Style to Quantify Free Indirect Discourse in Modernist Fiction. Digital Scholarship in the Humanities, 2(2): 1–17.

Brooke, J., Hammond, A., and Hirst, G. (2015a). GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus. Workshop on Computational Linguistics for Literature. Denver: NAACL, pp. 1–6.

Brooke, J., Hammond, A., and Hirst, G. (2015b). Distinguishing Voices in The Waste Land Using Computational Stylistics. Linguistic Issues in Language Technology, 12(2): 1–43.

Burrows, J. F. (1987). Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press.

McKenna, C. W. F., and Antonia, A. (1996). ‘A Few Simple Words’ of Interior Monologue in Ulysses: Reconfiguring the Evidence. Literary and Linguistic Computing, 11(2): 55–66.

Rybicki, J. (2006). Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz’s Trilogy and Its Two English Translations. Literary and Linguistic Computing, 21(1): 91–103.


Conference Info


ADHO - 2016
"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

Series: ADHO (11)

Organizers: ADHO