Topic Modeling the Nineteenth-Century Poetry Canon: English Poetry Reprinted in Anthologies

Short paper

Natalie Houston
University of Massachusetts - Lowell


The contents of poetry anthologies offer scholars a valuable resource for analyzing changes to the literary canon over time. No matter their size, poetry anthologies are necessarily selective, reprinting texts according to the editor’s aesthetic, educational, or political decisions. Anthologies designed for use as textbooks describe and define the field of literary study by providing a representation of a time period or literary movement within their pages. Within the academic context, these choices can have a far-reaching impact, as Wendell Harris suggests: “what is easily available in print tends to be what is being taught and written about” (Harris, 1991: 114). Anthologies also offer us a view into changing literary tastes and values: the poems by William Wordsworth (or any other poet) selected by anthology editors in the 1880s are very different from those selected by editors in the 1980s. As John Guillory suggests, “Canonicity is not a property of the work itself but of its transmission, its relation to other works in a collocation of works” (Guillory, 1993: 55). Poems accrue status and cultural value through being reprinted in anthologies, where they are placed in relation to other poems. This paper applies topic modeling and network analysis to a corpus of nineteenth-century British poems reprinted in British and American anthologies from 1880 to 2010 to understand the impact of those relationships.
Previous work used network analysis to examine the relationships among poets, poems, and anthologies in a corpus of 30 anthologies of nineteenth-century poetry published between 1880 and 2010 (Houston, 2017). A bimodal affiliation network of anthologies and poems reveals the relationships among anthologies that printed the same poems. A co-printing network (modeled on bibliographic co-citation analysis) consists of nodes representing each poem, with an edge drawn between any two poems printed within the same anthology. Modularity analysis of this network reveals clusters of poems that are frequently printed together.
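The two network constructions described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the anthology and poem identifiers are hypothetical placeholders, the helper names `affiliation_edges` and `coprinting_edges` are invented for this example, and weighting each co-printing edge by the number of anthologies that print both poems is one plausible choice that the abstract does not specify. The modularity step is not shown; it would require a community-detection routine applied to the weighted edges.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: anthology -> set of poems it reprints.
anthologies = {
    "anthology_1880s": {"poem_a", "poem_b", "poem_c"},
    "anthology_1950s": {"poem_a", "poem_c", "poem_d"},
    "anthology_1980s": {"poem_b", "poem_c", "poem_d"},
}

def affiliation_edges(contents):
    """Bimodal network: one (anthology, poem) edge per reprinting."""
    return sorted((a, p) for a, poems in contents.items() for p in poems)

def coprinting_edges(contents):
    """Poem-to-poem network: edge weight = number of anthologies printing both poems."""
    weights = Counter()
    for poems in contents.values():
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(poems), 2):
            weights[pair] += 1
    return weights

bimodal = affiliation_edges(anthologies)
coprint = coprinting_edges(anthologies)

for (p, q), w in sorted(coprint.items()):
    print(p, q, w)
```

In this toy example `poem_c` appears in every anthology, so each of its edges carries weight 2, while pairs that share only one anthology carry weight 1.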
In this paper, I apply Latent Dirichlet Allocation (LDA) topic modeling (Blei et al., 2003) to the poems in the corpus and then examine the distribution of topics within the poem-anthology network and the co-printing network. LDA is a generative statistical model which assumes that documents consist of “topics” made up of co-occurring words, and that these topics are present in varying proportions within the documents in the corpus. LDA has been shown to be an effective method for information retrieval and document classification tasks and has been applied to distant reading projects in the digital humanities on diverse materials ranging from novels to newspapers to scholarly articles (Buurma, 2015; Block, 2006; Goldstone and Underwood, 2012). Although the compressed semantic representation of an LDA topic can be seen as limiting the figurative complexity of poetic language (Rhody, 2012), the method has been shown to be effective for exploring and classifying short poetic texts (Navarro-Colorado, 2018; Plecháč and Haider, 2020; Šeļa et al., 2020).
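To make the idea of topics "present in varying proportions" concrete, the following is a minimal sketch of LDA inference by collapsed Gibbs sampling, written in plain Python. It is illustrative only: the abstract does not say which software, inference method, number of topics, or hyperparameters were used, so the function name `lda_gibbs`, the values of `alpha`, `beta`, and `n_iter`, and the tiny tokenized "poems" are all assumptions made for this example.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                      # tokens assigned to topic k overall
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic proportions; each row sums to 1.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta

# Hypothetical tokenized "poems" (illustrative only, not from the corpus).
poems = [
    "sea wave ship sea storm wave".split(),
    "ship storm sea sail wave".split(),
    "rose garden spring bloom rose".split(),
    "garden bloom spring rose leaf".split(),
]
theta = lda_gibbs(poems, n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of `theta` is one document's distribution over topics. A real analysis would more likely rely on an established topic-modeling toolkit and a much larger corpus; the sampler here only shows the mechanics.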
Following Šeļa et al. (2020), I use an LDA topic model of the entire corpus as a representation of its semantic “topic space,” an “abstracted representation of poetic language” (Šeļa et al., 2020: 15). Each poem can then be labeled by its highest-ranking topic (by proportion within the document). Encoding this semantic information as node features within the poem-anthology network reveals how the selections within particular anthologies emphasize or minimize particular themes. Within the co-printing network, this semantic information reveals how strongly thematic connections relate to the structural relationships of the poems’ publication format. Combining the semantic insights of LDA topic modeling with the structural insights of network analysis opens new approaches to understanding the impact of influential anthologies (and their editors) in shaping subsequent generations’ understanding of nineteenth-century British poetry.
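The labeling step can be illustrated with a short Python sketch. It assumes a document-topic matrix is already available from a fitted topic model; the proportions, poem and anthology identifiers, and the helper names `dominant_topic` and `topic_profile` are hypothetical. Summarizing an anthology by the share of its poems carrying each dominant topic is one simple way to compare selections, not necessarily the measure used in the paper.

```python
from collections import Counter

# Hypothetical document-topic proportions, e.g. the output of an LDA model.
theta = {
    "poem_a": [0.70, 0.20, 0.10],
    "poem_b": [0.15, 0.60, 0.25],
    "poem_c": [0.05, 0.15, 0.80],
    "poem_d": [0.55, 0.30, 0.15],
}

# Hypothetical anthology contents.
anthologies = {
    "anthology_1": ["poem_a", "poem_b", "poem_d"],
    "anthology_2": ["poem_b", "poem_c"],
}

def dominant_topic(proportions):
    """Index of the topic with the largest proportion in one document."""
    return max(range(len(proportions)), key=proportions.__getitem__)

# Node feature for each poem: its highest-ranking topic.
labels = {poem: dominant_topic(p) for poem, p in theta.items()}

def topic_profile(poems, labels):
    """Share of an anthology's poems labeled with each dominant topic."""
    counts = Counter(labels[p] for p in poems)
    return {topic: n / len(poems) for topic, n in sorted(counts.items())}

profiles = {name: topic_profile(poems, labels) for name, poems in anthologies.items()}

print(labels)
print(profiles)
```

Here `anthology_1` leans toward topic 0 (two of its three poems), while `anthology_2` splits evenly between topics 1 and 2, which is the kind of emphasis or omission the node features are meant to surface.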


Blei, D. et al. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993–1022.

Block, S. (2006). Doing More with Digitization.

Buurma, R. (2015). The fictionality of topic modeling: Machine reading Anthony Trollope's Barsetshire series. Big Data & Society, 2(2): 1–6.

Goldstone, A. and Underwood, T. (2012). What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship? Journal of Digital Humanities.

Guillory, J. (1993). Cultural Capital: The Problem of Literary Canon Formation. Chicago and London: The University of Chicago Press.

Harris, W. (1991). Canonicity. 106: 110–21.

Houston, N.M. (2017). Measuring Canonicity: a Network Analysis Approach to Poetry Anthologies. Digital Humanities 2017.

Navarro-Colorado, B. (2018). On Poetic Topic Modeling: Extracting Themes and Motifs From a Corpus of Spanish Poetry. Frontiers in Digital Humanities, 5(15). doi: 10.3389/fdigh.2018.00015.

Plecháč, P. and Haider, T. (2020). Mapping Topic Evolution Across Poetic Traditions. Digital Humanities 2020.

Rhody, L. (2012). Topic Modeling and Figurative Language. Journal of Digital Humanities.

Šeļa, A. et al. (2020). Weak Genres: Modeling Association Between Poetic Meter and Meaning in Russian Poetry. CHR 2020: Workshop on Computational Humanities Research.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO