The Augmented Criticism Lab’s Sonnet Database

paper, specified "short paper"
  1. Michael Ullyot

    University of Calgary


Introduction / Importance
The sonnet is a prodigious poetic form. Since its invention in the 13th century by Giacomo da Lentino, hundreds of poets have written many thousands of sonnets in European literary languages. It was popularized by Petrarch in the 14th century, translated by Wyatt and Camões in the 16th, and reformulated by poets from Shakespeare to Rilke to Frost. The experimental poet Raymond Queneau has even written a machine-generated sequence whose lines can be recombined in a hundred trillion different ways.
This project has begun to compile every extant sonnet into a database < >, in order to quantify their features through time. Those features include dates, languages, authors, diction (word choices), sentiments, named entities, and form.
My research question is straightforward: just what is a sonnet? Definitions have tended to focus on its form: 14 lines of rhymed ten-syllable (pentameter) verse. Subtypes, including the Petrarchan and the Shakespearean sonnet, are often based on rhyme scheme. But another definition is based on generic rather than formal features: a first-person reflection or “dialectical self-confrontation,” often with a volta (or turn) from problem to resolution (Oppenheimer: 1989). To what degree, then, is the sonnet a form or a genre? What subtypes will a comprehensive, quantified taxonomy reveal?
I am pursuing these inquiries by gathering as many known specimens of sonnets as possible, and then quantifying my analysis of their metadata. This includes metadata at the level of tokens and lines; of clauses and sentences; of rhyme-units (couplets/quatrains/sestets/octets) and complete sonnets; and of their published sequences. There are many features of these units that can be encoded, largely through automated natural-language processing. Tokens can be lemmatized and tagged with their parts of speech; their order and frequencies can be modelled as topics; their syllables per line can be counted; their rhyme with other tokens can be represented. The only human-dependent encoding the database includes at present leverages the expertise of anthology editors: orthography, punctuation, authors, dates, and copyright.
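Two of the line-level features named above, token counts and syllables per line, can be illustrated with a minimal sketch. It assumes a crude vowel-group heuristic for syllables rather than a pronouncing dictionary, so it will miscount many words. The names `count_syllables` and `line_features` are hypothetical and are not drawn from the project's code.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")
TOKEN = re.compile(r"[A-Za-z']+")


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (a rough heuristic)."""
    w = word.lower().strip("'")
    count = len(VOWEL_GROUP.findall(w))
    # Discount a probable silent final 'e' ("compare"), but not "-le" or "-ee".
    if w.endswith("e") and not w.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def line_features(line: str) -> dict:
    """Collect simple token-level and line-level features for one verse line."""
    tokens = TOKEN.findall(line)
    return {
        "tokens": tokens,
        "token_count": len(tokens),
        "syllables": sum(count_syllables(t) for t in tokens),
        # The final word is the natural unit for later rhyme comparison.
        "end_word": tokens[-1].lower() if tokens else None,
    }
```

A production pipeline would more plausibly take syllable counts from a pronouncing dictionary; the heuristic only shows where such a count attaches to each line record.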
The sonnet genre must be localized in its diction. Some words appear more frequently than others, particularly in the sonnet’s early centuries of first-person lovelorn reflections: words like ‘love’, ‘she’, and ‘suffer’. So, too, do words describing the sonnet’s own composition: words like ‘ink’, ‘lines’, and (simply) ‘this’. But genre can also be quantified at the level of the sentence, as other scholars have discovered by analyzing topics and principal components in Shakespeare’s sentences (Estill and Meneses: 2018; Hope and Witmore: 2010). This project will determine what generic features the sonnet’s words and sentences reveal.
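The claim about diction amounts to a comparison of relative word frequencies. The sketch below is one way to express it, assuming plain-text sonnets and a simple regular-expression tokenizer. The function name `relative_frequencies` and the per-1,000-token scale are illustrative choices, not the project's.

```python
import re
from collections import Counter


def relative_frequencies(texts, targets):
    """Return occurrences per 1,000 tokens for each target word."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tokens)
        total += len(tokens)
    if total == 0:
        return {word: 0.0 for word in targets}
    return {word: 1000 * counts[word] / total for word in targets}
```

Running the same function over a reference corpus of non-sonnet verse would give the baseline against which words like ‘love’ or ‘this’ could be called characteristic.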

The ACL Sonnet Database has standardized its texts according to the TEI guidelines, making them available to basic query functions and JSON object serialization. Thus far it contains 1880 English-language sonnets, including 445 transcribed from a single print anthology (Hirsch and Boland: 2008). My students and I have populated this repository first with English-language sonnets because they are numerous enough to offer a test case for machine-enabled research in any natural language.
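As an illustration of the path from TEI to JSON, the following sketch serializes the verse lines of one TEI fragment. It assumes lines are encoded as `<l>` elements in the standard TEI namespace. The database's actual schema and JSON field names are not given here, so `sonnet_to_json`, `line_count`, and `lines` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; the database's own encoding may differ.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def sonnet_to_json(tei_xml: str) -> str:
    """Serialize the <l> elements of a TEI fragment as a JSON object."""
    root = ET.fromstring(tei_xml)
    lines = [
        "".join(line.itertext()).strip()
        for line in root.iterfind(".//tei:l", TEI_NS)
    ]
    return json.dumps({"line_count": len(lines), "lines": lines},
                      ensure_ascii=False)


# Placeholder fragment, not a text from the database.
SAMPLE = """<lg xmlns="http://www.tei-c.org/ns/1.0" type="sonnet">
  <l n="1">First line of a placeholder poem,</l>
  <l n="2">second line of the same.</l>
</lg>"""

if __name__ == "__main__":
    print(sonnet_to_json(SAMPLE))
```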
The database also maintains a Python class for connecting to its data via the RESTful API, automating much of the data parsing for analysis with software like the Natural Language Toolkit (NLTK). Initial student-driven inquiries began with close readings of ten sonnets from the anthology to identify quantifiable features. Students have charted the frequency distributions of the sonnets’ rhyme schemes; enjambment; rhetorical figures (anaphora and epistrophe); and topics, including rhetorical questions and references to celestial objects and classical muses.
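Of the features the students charted, anaphora and epistrophe lend themselves to a simple check: whether adjacent lines share their first or last word. The sketch below assumes that narrow definition (single-word repetition across consecutive lines), which is stricter than the rhetorical figures themselves. The function name `repeated_edges` is hypothetical.

```python
import re


def _words(line):
    """Lowercase word tokens of a line, ignoring punctuation."""
    return re.findall(r"[a-z']+", line.lower())


def repeated_edges(lines):
    """Find adjacent line pairs sharing a first word (anaphora)
    or a last word (epistrophe)."""
    anaphora, epistrophe = [], []
    for i in range(len(lines) - 1):
        a, b = _words(lines[i]), _words(lines[i + 1])
        if not a or not b:
            continue  # skip blank lines
        if a[0] == b[0]:
            anaphora.append((i, a[0]))
        if a[-1] == b[-1]:
            epistrophe.append((i, a[-1]))
    return {"anaphora": anaphora, "epistrophe": epistrophe}
```

Each hit records the index of the first line in the pair, so counts can be aggregated per sonnet into the kind of frequency distribution described above.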

At this proof-of-concept stage, the database offers results only in these limited domains, and on this limited dataset. By the time of the DH2019 conference, it will contain many thousands more sonnets. I will report on their quantifiable formal characteristics, including rhyme schemes, meter, line lengths, sentence lengths, word frequencies, part-of-speech distributions, and n-grams. I will also report on topic models and the sonnets’ principal components distributed through time, author nationality and gender, and other salient subdivisions.

Anthologies of sonnets are sufficient for preliminary student-driven inquiries, but to generate insights into the sonnet writ large, a wider net is necessary. I have begun conversations with machine-learning specialists to train a neural network to recognize sonnets in undifferentiated text files, based on the formal and generic characteristics of sonnets isolated by anthology editors. To prepare for this phase, my approach will be two-pronged: to give students another dozen anthologies for further transcription, and to use that expanding repository of sonnets as a training set for a machine-learning process that will identify similar poems in a corpus of 70,000 English texts printed before 1700, the Early English Books Online - Text Creation Partnership (EEBO-TCP) corpus. Early sonnets establish conventions to which later English sonnets respond, so they are a valid place to begin this inquiry. That process has already begun with a subset of 18,000 XML files from the EEBO-TCP corpus containing the <l> element, denoting lines of poetry. My lead programmer, who built the database, will write an algorithm that parses these undifferentiated elements into clusters of 14-line sequences, on the assumption that all 14-line stanzas or poems bear a family resemblance to the sonnet. (This assumption is provisional; there are sonnets, including one by Shakespeare, of irregular lengths.) I will begin with 14-line sequences in order to identify the extra-formal characteristics that are twinned with this form; only then can I unshackle the detection algorithm from the constraints of form, to see which other poetic units bear the nearest affinity.
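The first pass over the EEBO-TCP files might be sketched as follows. This is one possible reading of "clusters of 14-line sequences", not the algorithm the project will use: it assumes a cluster is any parent element with exactly 14 direct `<l>` children, and it ignores namespaces. The names `local_name` and `fourteen_line_groups` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def fourteen_line_groups(xml_text: str, length: int = 14):
    """Return the text of every group of exactly `length` sibling <l> elements."""
    root = ET.fromstring(xml_text)
    groups = []
    for parent in root.iter():
        lines = [
            " ".join("".join(child.itertext()).split())  # normalize whitespace
            for child in parent
            if isinstance(child.tag, str) and local_name(child.tag) == "l"
        ]
        if len(lines) == length:
            groups.append(lines)
    return groups
```

Making `length` a parameter leaves room for the later step of loosening the formal constraint, for example by sweeping a range of lengths and comparing the resulting groups against the anthology-derived training set.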


Estill, L., and Meneses, L. (2018). Is Falstaff Falstaff? Is Prince Hal Henry V?: Topic Modeling Shakespeare’s Plays. Digital Studies/Le champ numérique 8(1).

Hirsch, Edward and Boland, Eavan (eds.) (2008). The Making of a Sonnet: A Norton Anthology. New York; London: W. W. Norton.

Hope, J., and Witmore, M. (2010). The Hundredth Psalm to the Tune of “Green Sleeves”: Digital Approaches to Shakespeare’s Language of Genre. Shakespeare Quarterly 61(3).

Oppenheimer, Paul (1989). The Birth of the Modern Mind: Self, Consciousness, and the Invention of the Sonnet. New York; Oxford: Oxford University Press.


Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019


Series: ADHO (14)

Organizers: ADHO