Automatic Extraction of Poetry from Digitally Scanned Books

poster / demo / art installation
  1. John Foley

    Smith College

Work text

We present an automatic, learned model for the extraction of poetry from digitally scanned books. This abstract highlights our recent work on poetry identification from Internet Archive books and the public resources (code, data and models) that exist as a result. We hope that this is the beginning of deeper and richer research into poetry in the digital humanities, because curating custom collections of poetry should now be less expensive.

Poetry in Digital Libraries

Digital libraries have expanded rapidly in quantity and quality of content over the past decade. Out-of-copyright and public domain works are available from the invention of the printing press all the way to the early twentieth century.

Unfortunately, this explosion in content has not reached all genres equally: large collections of poetry are not available because they are typically curated manually.

The intersection of poetry and digital methods is fairly common and has been studied across a diverse set of languages and cultures, e.g., Bangla (Rakshit et al., 2015), Arabic (Ahmed and Trausan-Matu, 2017) and Thai (Promrit and Waijanya, 2017). Features of poetry have also been studied using computational methods, e.g., meter (Hamidi et al., 2009), style (Baumann et al., 2018), authorship and time (Can et al., 2011), emotion (Alsharif et al., 2013; Barros et al., 2013; Kumar and Minz, 2014), and even content (Jamal et al., 2012; Choi et al., 2016; Lou et al., 2015; Kesarwani, 2018). Kaur and Saini's (2017) recent work on classifying Punjabi poems into four categories is not a survey, but it does provide a table of recent work, the language targeted, and the features discussed.

However, most of these works use small datasets (tens to hundreds of poems), because the cost of collecting and curating poetry is so high. There is a lot of poetry available in digital libraries, but it is effectively hidden inside books of other kinds.

Automatic Extraction of Poetry

Underwood et al. (2013) present a study of genre in Hathi Trust books, and one of their genres is poetry, which they extend to page-level labels in later work (Underwood, 2014). Other recent work uses image classification approaches (Lorang et al., 2015), focuses on Australian newspapers (Kilner and Fitch, 2017) or is language-specific on a small collection (Tizhoosh et al., 2008).

These existing approaches cannot be cleanly applied to discover poems such as the poem about "Sweet Peas" that our algorithm identified in the middle of a gardening guide (Figure 1).

Figure 1: A poem printed in the middle of a gardening guide (Rockwell et al., 1917). This is the kind of "hidden" poetry our algorithm was designed to target.

Drawing inspiration and ideas from these works, we formulated the poetry identification problem: does a given scanned book page contain poetry?

Using a few thousand labeled pages as training data and only language-independent features, we developed a new model for poetry identification. This model is both effective (F1 = 0.83) and efficient (500,000 books/hour on a single machine). It runs on DJVU-XML books from the Internet Archive.

Public Resources, Code, & Open Data

We released a variety of public resources. There is a dataset for our identification task as well as a JSON-formatted collection of 600,000 pages identified as containing poetry, drawn from a random selection of 50,000 books. Our model is available, and our methodology is described in more detail in my dissertation (Foley, 2019).

Datasets: & Model:
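
To make the page-level formulation above more concrete, the following is a minimal sketch of what language-independent page features and the F1 measure might look like. It is not the released model: the function names (page_features, f1_score), the four features, and the sample lines are assumptions chosen for illustration, and the actual feature set is the one described in the dissertation.

```python
from statistics import mean


def page_features(lines):
    """Compute simple layout features for one page of OCR text.

    Hypothetical features for illustration only; they depend on line
    shape rather than vocabulary, so they need no language-specific
    resources. They are NOT the feature set of the model in the abstract.
    """
    text_lines = [ln.strip() for ln in lines if ln.strip()]
    if not text_lines:
        return {
            "num_lines": 0,
            "mean_length": 0.0,
            "frac_capitalized": 0.0,
            "frac_end_punct": 0.0,
        }
    n = len(text_lines)
    return {
        "num_lines": n,
        # Poetry tends to have short lines that do not fill the page width.
        "mean_length": mean(len(ln) for ln in text_lines),
        # Verse lines often start with a capital letter.
        "frac_capitalized": sum(ln[0].isupper() for ln in text_lines) / n,
        # Verse lines often end in punctuation.
        "frac_end_punct": sum(ln[-1] in ".,;:!?" for ln in text_lines) / n,
    }


def f1_score(true_labels, predicted_labels):
    """F1 for binary page labels (1 = page contains poetry)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up sample lines, not taken from any scanned book.
sample_page = ["Softly falls the evening rain,", "Over field and lane;", ""]
features = page_features(sample_page)
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

In a sketch like this, each page's feature dictionary would be passed to any standard binary classifier trained on the labeled pages, and F1 would be computed on held-out pages.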


Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at Data for this conference were initially prepared and cleaned by May Ning.

Conference website:


Series: ADHO (15)

Organizers: ADHO