Relating the Unread: A Data-Rich Approach to the Literary Canon and the “Great Unread”

paper, specified "short paper"
  1. Judith Brottrager

    Technische Universität Darmstadt (Technical University of Darmstadt)


Since the emergence of computational approaches in literary studies, the inclusion of more than established and thoroughly researched literary works has been a crucial argument for using quantitative methods despite the inevitable loss of detail caused by necessary formalisations and operationalisations. Franco Moretti (2013: 48 f.), for example, argues that so-called Distant Reading approaches enable the modelling of an alternative literary history that does not exclude the “Great Unread”, i.e. works of literature which have been excluded from the canonical spheres of literary history. Numerous digitisation projects have been launched with this premise in mind: the more texts are available as high-resolution scans or, even better, transcribed as digital text, the more broadly this alternative literary history can be defined. However, recent contributions to the field (Algee-Hewitt et al., 2016; Underwood and Sellers, 2016; Porter, 2018; Underwood, 2019: 68–110) have shown that, besides an ever-expanding archive of digitally available texts, additional data is needed to embed the analysed texts in what Katherine Bode calls a “data-rich literary history” (2018: 37–57).

This contribution exemplifies how a bespoke context-rich dataset can be compiled by describing the theory-driven creation process of a dataset for the comparative analysis of approximately 1,200 canonised and non-canonised English and German novels and narratives from 1688 to 1914. The description of this use case will focus on the encoding of literary history on two main levels: first, the canon-conscious corpus selection, and second, the data-based operationalisation of canonisation and contemporary reception.
For the corpus compilation, an approach suggested by Algee-Hewitt and McGurl (2015) for the creation of a representative corpus of 20th-century English literature, which focuses on counterbalancing inherent biases in available data, has been systematically adapted for the time frame in question. When research projects rely solely on standard collections of already digitised material for their corpus creation, they work with what Algee-Hewitt and McGurl would call a “found” corpus, which builds on layered selection processes that are not transparent but generally linked to a text’s status in the canon. To compensate for these biases, Algee-Hewitt and McGurl suggest moving from a “found” to a “made” list of texts: by using a predefined list of works to be included in a corpus, gaps in the digitised archive are made visible and can be filled by retro-digitisation (see also Algee-Hewitt et al., 2016).
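
The step from a “found” to a “made” list can be illustrated with a minimal sketch. It assumes that both lists are available as sets of normalised titles; the placeholder titles and the helper `digitisation_gaps` are hypothetical and not part of the described dataset.

```python
# Minimal sketch: compare a predefined ("made") list with the already
# digitised ("found") holdings to make gaps in the archive visible.
# All titles are hypothetical placeholders, not entries from the dataset.

made_list = {"Novel A", "Novel B", "Novel C"}   # works the corpus should contain
found_corpus = {"Novel A", "Novel B"}           # works already available as digital text


def digitisation_gaps(made, found):
    """Return the titles on the made list that lack a digital text."""
    return sorted(set(made) - set(found))


# Titles returned here would be candidates for retro-digitisation.
gaps = digitisation_gaps(made_list, found_corpus)
print(gaps)
```

In practice, matching titles across catalogues would likely require identifiers or fuzzy matching rather than exact string comparison; the set difference only shows the principle.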

Even though a canon-conscious corpus selection encodes some aspects of literary history by reconstructing a text’s “history of transmission” (Bode, 2018: 38) in terms of its availability and accessibility, additional data is needed to operationalise literary categories such as canonisation and contemporary reception as numerical scores that can be used in quantitative analyses. For both operationalisations, categories suggested by Heydebrand and Winko (1996) have been used. Defining a text’s canonisation status as the result of consecutive selective processes, Heydebrand and Winko (1996: 222–23) propose, among others, the following markers of canonisation: continuous scholarly engagement with the text (formalised as student editions), interest in its author (formalised as complete/collected works editions), and its treatment in literary history (formalised as mentions in narrative literary histories and other secondary sources). These proxies encompass blurrier features of canonisation, such as longevity (see Bloom, 1994; Assmann, 2008) and cultural capital (see Bourdieu, 1986; Guillory, 1998), while being more generalisable and widely available than publishing records. Analogously to the canonisation status, contemporary reception can be modelled by collecting instances of value judgements, i.e. reviews, from representative journals (The Monthly Review, The Critical Review, La Belle Assemblée, Flowers of Literature, The Star, The Athenaeum, Allgemeine Literatur-Zeitung, Morgenblatt für gebildete Stände, Blätter für literarische Unterhaltung, Deutsche Literaturzeitung) and implicit markers of audiences’ interests, such as entries in circulating library catalogues, which are, in contrast to sales numbers, more representative of lay audiences’ reading habits (Martino, 1990; Gamer, 2000). Both reviews and circulating library catalogues represent, to a certain extent, samples of convenience, as the existence of digital surrogates was a prerequisite for their inclusion in the dataset.
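
How such markers might be turned into a numerical score can be sketched as follows. This is an illustration under stated assumptions, not the scoring procedure of the described dataset: the equal weighting, the min-max normalisation, the class `CanonMarkers`, and all counts are hypothetical.

```python
# Minimal sketch: combine three canonisation markers into one score in [0, 1].
# Assumptions (not from the source): equal weights, min-max normalisation,
# and invented counts for three placeholder texts.
from dataclasses import dataclass


@dataclass
class CanonMarkers:
    student_editions: int           # proxy for scholarly engagement with the text
    collected_works: int            # proxy for interest in the author
    literary_history_mentions: int  # proxy for treatment in literary history


def min_max(values):
    """Scale values to [0, 1]; return zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def canonisation_scores(corpus):
    """Average the normalised markers per text (equal weights assumed)."""
    titles = list(corpus)
    columns = [
        min_max([corpus[t].student_editions for t in titles]),
        min_max([corpus[t].collected_works for t in titles]),
        min_max([corpus[t].literary_history_mentions for t in titles]),
    ]
    return {
        t: sum(col[i] for col in columns) / len(columns)
        for i, t in enumerate(titles)
    }


corpus = {
    "Text A": CanonMarkers(4, 2, 10),
    "Text B": CanonMarkers(0, 0, 0),
    "Text C": CanonMarkers(2, 1, 5),
}
scores = canonisation_scores(corpus)
print(scores)
```

A reception score could be sketched analogously from counts of reviews and circulating library entries; different weightings or normalisations would be equally defensible and would change the resulting ranking.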

Especially for markers of reception and evaluation, the dataset also draws heavily on existing databases (e.g. British Fiction 1800-1829, English Short Title Catalogue (ESTC), Gelehrte Journale und Zeitungen der Aufklärung, The Athenaeum Projects) and is in turn designed to be sustainable, compatible, and re-usable by adhering to community standards, including international identifiers (for example, VIAF), and by providing open access and documentation.


Algee-Hewitt, M., Allison, S., Gemma, M., Heuser, R., Moretti, F. and Walser, H. (2016). Canon/Archive. Large-scale Dynamics in the Literary Field. Pamphlets of the Stanford Literary Lab, 11.

Algee-Hewitt, M. and McGurl, M. (2015). Between Canon and Corpus: Six Perspectives on 20th-Century Novels. Pamphlets of the Stanford Literary Lab, 8.

Assmann, A. (2008). Canon and Archive. In Erll, A., Nünning, A. and Young, S. B. (eds), Cultural Memory Studies: An International and Interdisciplinary Handbook. Berlin; New York, NY: Walter de Gruyter, pp. 97–107.

Bloom, H. (1994). The Western Canon: The Books and School of the Ages. New York, NY: Harcourt Brace.

Bode, K. (2018). A World of Fiction: Digital Collections and the Future of Literary History. Ann Arbor, MI: University of Michigan Press.

Bourdieu, P. (1986). The Forms of Capital. In Richardson, J. (ed), Handbook of Theory and Research for the Sociology of Education. Westport, CT: Greenwood, pp. 241–58.

Gamer, M. (2000). Romanticism and the Gothic: Genre, Reception, and Canon Formation. Cambridge, UK; New York, NY: Cambridge University Press.

Guillory, J. (1998). Cultural Capital: The Problem of Literary Canon Formation. Chicago, IL: University of Chicago Press.

Heydebrand, R. von and Winko, S. (1996). Einführung in die Wertung von Literatur: Systematik - Geschichte - Legitimation. Paderborn: Schöningh.

Martino, A. (1990). Die Deutsche Leihbibliothek: Geschichte Einer Literarischen Institution (1756-1914). Wiesbaden: Harrassowitz.

Moretti, F. (2013). Distant Reading. London, UK; New York, NY: Verso.

Porter, J. D. (2018). Popularity/Prestige. Pamphlets of the Stanford Literary Lab, 17.

Underwood, T. (2019). Distant Horizons: Digital Evidence and Literary Change. Chicago, IL: University of Chicago Press.

Underwood, T. and Sellers, J. (2016). The Longue Durée of Literary Prestige. Modern Language Quarterly, 77(3): 321–44. doi:10.1215/00267929-3570634.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website:

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO