Building the Princeton Prosody Archive

poster / demo / art installation
  1. Meredith Martin

    English Dept - Princeton University

  2. Grant Wythoff

    Society of Fellows - Columbia University

  3. Meagan Wilson

    English Dept - Princeton University

  4. Travis Brown

    Maryland Institute for Technology in the Humanities (MITH) - University of Maryland, College Park


1. Introduction

The Princeton Prosody Archive (PPA) is a full-text searchable database of nearly 10,000 digitized texts – comprising 800 million words – on prosody published in English between 1750 and 1923. During the 2012-2013 academic year, a grant from the Mellon Foundation supported the completion of the PPA’s first phase. This poster will reflect on the outcomes of the start-up stage, as well as some of the challenges and opportunities the PPA anticipates as the digital collection expands. Conference participants are encouraged to visit the Archive’s beta-site.
2. Prosody and Historical Poetics

In the nineteenth century, “prosody” – which refers to both pronunciation and the technicalities of versification – was codified as the fourth section of the grammar book, after “orthography,” “etymology,” and “syntax.” By the early twentieth century, “prosody” referred primarily to versification. In recent years, scholars of English literature have begun questioning the uniformity of poetic terminology, recognizing terms such as “prosody,” “meter,” “tone,” or “rhythm” as culturally determined and fundamentally unstable concepts that have shifted through the centuries. By turning to historical texts, they are tracing how inherited notions of poetic form developed over time, and in turn, painting a more accurate picture of the evolution of English-language discourse on poetics.
As the field of historical poetics has grown, so too has our access to nineteenth-century materials through online archives. The majority of these digital resources, however, are primarily focused on prose works, and thus both technological and scholarly innovations have been made mainly in the field of prose. The PPA is filling the gap as the only digital archive dedicated to the study of poetics, writ large, and allowing scholars to practice the kind of broad-view historical research the field demands. The PPA aggregates foundational texts in the history of poetics, reviews of these texts, debates about poetics in the public press, and grammar books and poetic handbooks that present contrary definitions and views of poetics so that “big questions” about literary movements and culture can be posed. With this large data-set, we can now ask: How did the changing science of linguistics and increased impulse toward education impact discussions of poetry over time? How often were particular poets used as examples in poetic pedagogy? How, when, and why did certain poetic terms and genres come in and out of use?
3. Methodology

The PPA partnered with Google Books and HathiTrust in 2011, and the collection is currently composed of works digitized by Google and Hathi.[1] In 2013, the PPA began to develop a beta-site, which, though still under development, allows users to browse, search, and correct its content. To best serve its user community, the PPA functions as a freely-available, user-friendly repository, a trusted scholarly reference source, and a creative workspace that enhances traditional scholarly practices and pedagogies while enabling new ones.
3.1 Curation: Google Books and the HathiTrust Digital Library

The PPA’s initial corpus was selected from the holdings of the HathiTrust Digital Library. We began by gathering every out-of-copyright text referred to by prosody scholar T.V.F. Brogan in his annotated bibliography English Versification, 1570-1980 that had been digitized.[2] Though the availability of Hathi’s digital facsimiles and transcriptions is incredibly valuable, some aspects of their digitization and description present serious technical obstacles to the kinds of analysis the PPA intends to support. The most obvious is that the transcriptions were prepared by a range of Optical Character Recognition (OCR) systems, and few (if any) were hand corrected. Most were digitized as part of the Google Books program, whose OCR tools are not tailored to the vocabulary, orthographic conventions, or typefaces of eighteenth- and nineteenth-century texts. These tools were generally unable to capture indentation, italicization, or other formatting, variations in font size, or diacritical marks, not to mention musical notation or non-standard marks.
3.2 OCR and Diacritic Correction: Representing Scansion

When dealing with texts on prosody and versification, accurate representation of diacritics and typographical marks is particularly important. How do you render musical annotation, scansion, line spacing, or iambic markings, for example, into plain text? Because of the focus on notation and the transmission of concepts and terms, particular care must be taken to ensure that these issues do not interfere with (or silently distort) scholarly analysis. To that end, we are developing a model for encoding scansion – the non-textual elements such as musical notation, macron, breve, or other diacritics, including non-standard marks created by the many scholars who attempted to invent prosodic systems in English. Moreover, we will employ the kind of OCR that retains document coordinates for individual characters whose position on the page often conveys important information.
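One possible way to carry simple scansion into plain text is to attach Unicode combining marks to the vowel of each syllable. The sketch below is purely illustrative and is not the PPA's encoding model: the function names, the "x" and "/" pattern notation, and the rule of marking the first vowel are all assumptions made for the example. It does not attempt musical notation or the non-standard marks discussed above.

```python
import unicodedata

COMBINING_MACRON = "\u0304"  # conventionally marks a long or stressed syllable
COMBINING_BREVE = "\u0306"   # conventionally marks a short or unstressed syllable
VOWELS = set("aeiouyAEIOUY")


def mark_syllable(syllable: str, stressed: bool) -> str:
    """Attach a macron or breve to the first vowel of a syllable (hypothetical rule)."""
    mark = COMBINING_MACRON if stressed else COMBINING_BREVE
    for i, ch in enumerate(syllable):
        if ch in VOWELS:
            marked = syllable[: i + 1] + mark + syllable[i + 1 :]
            # NFC composes base letter + mark into one code point where one exists.
            return unicodedata.normalize("NFC", marked)
    return syllable  # no vowel found: leave the syllable unmarked


def scan(syllables, pattern):
    """Mark syllables using a pattern string: '/' = stressed, 'x' = unstressed."""
    if len(syllables) != len(pattern):
        raise ValueError("pattern must have one symbol per syllable")
    return " ".join(mark_syllable(s, p == "/") for s, p in zip(syllables, pattern))


# Syllables are split by hand here; automatic syllabification is out of scope.
line = scan(["The", "cur", "few", "tolls"], "x/x/")
print(line)
```

Because the marks are ordinary Unicode characters, the result remains searchable plain text, although any positional information from the page image would still need to be stored separately.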
3.3 Metadata Correction: Scholarly Re-use and Linking Data

The metadata we ingest from HathiTrust also presents challenges. One of the PPA’s goals is to allow researchers to trace the development of prosodic discourse across time and place, and the ability to support this functionality depends on consistent and reliable metadata. While the HathiTrust provides the Machine-Readable Cataloging (MARC) records that have been supplied by contributing libraries, the fields indicating the place and date of publication are free text and vary widely in their conventions of encoding. In the PPA’s start-up phase, we developed an application that assembles text and metadata from the HathiTrust Digital Library, performs some initial automated correction, and loads the text and metadata into a Drupal 7 installation, where it can be browsed, searched, and corrected by scholars working with the Archive. Corrections to metadata can be credited to registered and authenticated users, and metadata fields can now even be versioned, using Drupal 7’s native revision control. In this initial phase, however, these corrections are essentially locked in the Drupal data store; they cannot be returned to the HathiTrust Digital Library or conveniently shared with other scholars working with the same HathiTrust volumes in other contexts. Going forward, the PPA will explore possibilities for enacting a workflow on its own metadata, engaging in the correction of HathiTrust metadata and connecting those corrections to linked data resources by working with the Maryland Institute for Technology in the Humanities and the Foreign Literatures in America project.
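To illustrate the kind of initial automated correction that free-text date and place fields call for, the following is a minimal sketch. It is not the PPA's application: the function names, the sample strings, and the rule of taking the first plausible four-digit year are assumptions chosen for the example, and real catalogue records vary far more widely.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only.
# Matches a standalone four-digit year between 1500 and 2099.
_YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def normalize_pub_year(raw: str) -> Optional[int]:
    """Return the first plausible four-digit year in a free-text date, or None."""
    match = _YEAR.search(raw)
    return int(match.group(1)) if match else None


def normalize_pub_place(raw: str) -> str:
    """Strip brackets, question marks, and trailing punctuation from a place field."""
    cleaned = re.sub(r"[\[\]?]", "", raw)  # drop brackets and question marks
    cleaned = cleaned.strip(" ,:;.")        # drop leading/trailing punctuation
    return " ".join(cleaned.split())        # collapse internal whitespace


# Invented sample values in the style of free-text catalogue fields.
samples = ["[1885?]", "c1901.", "1850-52", "18--"]
years = [normalize_pub_year(s) for s in samples]
print(years)
print(normalize_pub_place("[New York?],"))
```

Values that cannot be resolved automatically, such as "18--", come back as None so that they can be left for a scholar to correct by hand rather than being silently guessed.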
3.4 Connecting Prosody Networks: Topic Modeling and Visualization

Topic modeling, and specifically Latent Dirichlet allocation (LDA), has received attention in the digital humanities community over the past several years, in part because it is an unsupervised method (it does not require expensive training material or elaborate encodings) and in part because it is relatively robust against textual errors. We have begun experimenting with LDA, not only to return a set of “topics” (which are simply distributions over the vocabulary) that often characterize the semantic and thematic composition of the PPA’s corpus in compelling ways, but also as a means of identifying mistranscription, special characters, and even musical notation. We also plan to begin experimenting with visualization tools in the following ways:
1) Plotting temporal and geographical metadata. Tools such as Google Earth, MIT’s SIMILE, and Leaflet offer practical and intuitive ways to allow users to navigate temporally and geographically situated data sets interactively: for example, to view a three-dimensional chart on a globe indicating the relative prominence of cities as places of publication while moving a time slider through several centuries.
2) Mapping the documents in the corpus by their topical or lexical spaces. Here, each document is represented as a point in a high-dimensional space, where the dimensions are features such as counts or frequencies of individual words or n-grams, or the percentage of words allocated to a particular topic in a topic model.
3) Tracking discursive networks by quotation identification and citation extraction. For example, the quotation of exemplars could be represented as a bimodal network, with nodes representing both volumes in the archive and lines of verse, and with edges from the former to the latter indicating instances of quotation.
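The bimodal quotation network described in this section can be sketched in a few lines. This is a toy illustration under stated assumptions, not the PPA's method: the volume identifiers and texts are invented, the function names are hypothetical, and quotation is detected by exact matching on normalized words, which would miss the OCR errors and partial quotations a real pipeline must handle.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def normalize(text: str) -> str:
    """Lowercase, drop apostrophes, and keep only runs of letters separated by spaces."""
    text = re.sub("['\u2019]", "", text.lower())
    return " ".join(re.findall(r"[a-z]+", text))


def build_quotation_network(
    volumes: Dict[str, str], verse_lines: List[str]
) -> Dict[str, Set[str]]:
    """Return edges volume_id -> set of verse lines quoted in that volume."""
    edges = defaultdict(set)
    needles = {line: normalize(line) for line in verse_lines}
    for volume_id, text in volumes.items():
        haystack = f" {normalize(text)} "
        for line, needle in needles.items():
            # Pad with spaces so matches start and end on word boundaries.
            if needle and f" {needle} " in haystack:
                edges[volume_id].add(line)
    return dict(edges)


# Invented toy data standing in for archive volumes and exemplar lines.
MILTON = "Of Man's first disobedience, and the fruit"
GRAY = "The curfew tolls the knell of parting day"
volumes = {
    "vol_a": "Milton writes: Of Man's first disobedience, and the fruit of that tree",
    "vol_b": "Compare 'The curfew tolls the knell of parting day' with "
             "'Of Mans First Disobedience, and the Fruit'.",
    "vol_c": "A grammar with no verse examples.",
}
network = build_quotation_network(volumes, [MILTON, GRAY])

# Degree of each verse-line node: how many volumes quote it.
line_degree = {
    line: sum(line in quoted for quoted in network.values())
    for line in (MILTON, GRAY)
}
print(network)
print(line_degree)
```

Counting the degree of each verse-line node gives a first answer to how often a particular line is used as an exemplar across the corpus.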
3.5 Sharing Results

The PPA is committed to providing models for other digital humanists struggling with the question of how to organize and present their own Hathi collections, whether in their research or in the classroom. Though these scholars might not be subject area experts in prosody or historical poetics, we would like to provide enough information to guide non-specialist visitors through the corpus and to share ideas about how they might build similar archives themselves.

[1] We negotiated a Google Distribution Agreement between the Princeton University Library, Princeton Counsel, and HathiTrust that allowed us to access, download, and host all of this data on our own servers. A spreadsheet of all Archive monographs is available online at “Princeton Prosody Archive Database.” The PPA’s four collections can also be accessed through the HathiTrust site. See: 1) “Brogan's English Versification, 1570-1980” (578 works); 2) “Prosody Archive” (1,308 works); 3) “PPA Subject Search” (6,991 works); and 4) “Graphically/Typographically Unique” (26 works set aside as possessing especially complex page images that would be misread by OCR).
[2] Brogan, Terry V. F. English Versification, 1570-1980: A Reference Guide with a Global Appendix. Baltimore: Johns Hopkins University Press, 1981.


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO