Large-scale text analysis through the HathiTrust Research Center

Peter Organisciak; Sayan Bhattacharyya; Loretta Auvil; J. Stephen Downie; Beth Plale

Authorship

1. Peter Organisciak
Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

2. Sayan Bhattacharyya
Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

3. Loretta Auvil
University of Illinois, Urbana-Champaign

4. J. Stephen Downie
Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

5. Beth Plale
Data to Insight Center - Indiana University, Bloomington

Work text

Introduction
Digitized text, and the tools for making sense of it, are enabling digital humanists to perform exploration and inference at ever-larger scales, from early work analyzing the style of a single writer or a relatively small set of works [1, 2] to the modern-day practice of “distant reading” entire eras [3, 4]. However, “research, even in the digital age, is… limited by the materials that scholars can readily and reliably access” [5]. Factors such as copyright, availability of materials, cost of infrastructure, and the technical capabilities demanded of the researcher tend to limit the access that digital humanists have, in practice, to the texts with which they would like to work.

Overview
The HathiTrust Research Center (HTRC) is a collaborative effort formed in 2011 through a partnership between Indiana University, the University of Illinois at Urbana-Champaign, and the HathiTrust. Its purpose is to meet the challenge that digital humanists confront when they perform “distant reading”: dealing with massive amounts of digital text. We will present recent progress by the HTRC in addressing this ongoing challenge.

HTRC aims to support the natural investigative process of researchers who want to perform text analytics on the HathiTrust corpus by running the analytics algorithms “close to the data,” even when content restrictions do not allow human-reading-level (“consumptive”) access to the text. A popular parallel to this higher-level exploration of digital corpora is the Google Ngram Viewer [6]. HTRC’s tools maintain a conceptually similar non-consumptive separation from the content, while giving researchers more control over which content is examined.

The HathiTrust public domain corpus is an online repository of published works drawn from the collections of over sixty participating major research institutions and libraries. Digital humanists can access digitized works through the HTRC at two levels: the production system and the sandbox. The production level of the HTRC provides access to the public domain HathiTrust corpus (a mix of works digitized by Google and by other digitization projects), secured to comply with the restrictions on content use. In contrast, the sandbox level provides more open access to a smaller corpus of 250,000 volumes that have no known copyright restrictions. Building on the SEASR tools [7], both systems support analytics such as topic modeling, tag clouds, entity extraction, spellcheck reports, and Dunning’s log-likelihood statistic for comparing word distributions between texts. They also include utilities such as a MARC record downloader and a word-frequency tool. The intention is for researchers to design algorithms on the more open sandbox system and then submit them to the production system. Metadata and data APIs exist on both the sandbox and the production system for accessing metadata and token counts.

Methodology
In this poster, we focus on the questions that humanities scholars can address using the HTRC’s front-end tools. Specifically, HTRC offers a workset builder for searching the HathiTrust collection and creating collections of texts (“worksets”), and a portal for analyzing such worksets through a simple web interface.

Fig. 1: Results are collected for a workset of Late 19c Poetry.

Fig. 2: Setting up an algorithm to run on the workset.

Fig. 3: Results from algorithm shown.

As an example of a digital humanities need that HTRC can serve, suppose that an instructor wants her students to discover how the themes addressed by English poets of the eighteenth century differ from those addressed by English poets of the nineteenth century. Using the HTRC’s workset builder, the instructor can prepare different worksets consisting of appropriately large corpora of eighteenth-century English poetry, nineteenth-century English poetry, and a combination of both. Using the Dunning log-likelihood statistic in the HTRC portal, students can compare the distribution of words across these worksets, discovering for themselves that different sets of words (and arguably different thematic concepts) were characteristically preponderant in one century relative to the other. The instructor can also use the HTRC’s metadata capabilities to divide these worksets into finer subsets, such as male eighteenth-century poets, female eighteenth-century poets, male and female poets, etc. By making worksets easy to create and modify, HTRC lets the user run analytics on various unions and intersections of sets, as this description suggests, and has the potential to facilitate hands-on discovery by the instructor’s students.
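
The kind of comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HTRC portal's implementation: it uses the common two-term form of the log-likelihood (G2) statistic, the function names dunning_g2 and compare_worksets are hypothetical, and the two token lists are invented toy data standing in for real worksets.

```python
import math
from collections import Counter


def dunning_g2(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one word across two corpora.

    count_x is the word's frequency in corpus x; total_x is the number
    of tokens in corpus x. Returns 0.0 when the word's relative
    frequency is the same in both corpora.
    """
    combined = count_a + count_b
    if combined == 0:
        return 0.0
    # Expected counts if the word were equally common in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # by convention, 0 * log(0) contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def compare_worksets(tokens_a, tokens_b):
    """Rank words by signed G2: positive means overused in workset A."""
    if not tokens_a or not tokens_b:
        raise ValueError("both worksets must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {}
    for word in set(freq_a) | set(freq_b):
        g2 = dunning_g2(freq_a[word], total_a, freq_b[word], total_b)
        overused_in_a = freq_a[word] / total_a >= freq_b[word] / total_b
        scores[word] = g2 if overused_in_a else -g2
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy data; a real workset would hold many thousands of tokens.
tokens_18c = "reason nature reason virtue nature wit".split()
tokens_19c = "heart nature dream heart sorrow dream".split()

for word, score in compare_worksets(tokens_18c, tokens_19c):
    print(f"{word:8s} {score:+.2f}")
```

Words at the top of the ranking are relatively overused in the first workset and words at the bottom in the second; scores near zero indicate words used at similar rates in both. Signing the score by direction of overuse is a presentation choice made for this sketch.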

HTRC has been developed with digital humanists in mind, to overcome some of the technical, logistical, and accessibility hurdles present in large-scale text analysis. As it moves forward, feedback from scholars is being actively solicited and incorporated so that it continues to meet their needs. Even so, the infrastructure currently in place already makes it a valuable tool for search.

References
1. Milic, Louis Tonko (1967). A Quantitative Approach to the Style of Jonathan Swift. Mouton: Walter de Gruyter.

2. Burrows, John Frederick (1987). Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. Oxford: Clarendon Press.

3. Moretti, Franco (2013). Distant Reading. London: Verso.

4. Jockers, Matthew L. (2013). Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press.

5. Belasco, Susan (2011). Whitman’s Poems in Periodicals: Prospects for Periodicals Scholarship in the Digital Age. In Earhart, Amy E. and Andrew Jewell (eds.), The American Literature Scholar in the Digital Age. Ann Arbor, MI: University of Michigan Press. p. 54.

6. Michel, Jean-Baptiste et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, Vol. 331 No. 6014, 14 January 2011. pp. 176-182.

7. Auvil, Loretta, Boris Capitanu, Matthew Jockers, Ted Underwood, and Ryan Heuser. (2011) SEASR Analytics. Presentation at the Chicago Colloquium on Digital Humanities and Computer Science, Chicago, Illinois. November 19. chicagocolloquium.org/wp-content/uploads/2011/11/dhcs2011_submission_17.pdf. Retrieved: Oct 30, 2013.

Conference Info

Complete

ADHO - 2014

"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO
