Mining Texts with the Extracted Features Dataset

workshop / tutorial
  1. 1. Peter Organisciak

    University of Illinois, Urbana-Champaign

  2. 2. J. Stephen Downie

    University of Illinois, Urbana-Champaign

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The HathiTrust Digital Library holds digitized copies of nearly 15 million scanned volumes from libraries around the world. These volumes are a significant resource for large-scale research: with their scale and breadth of material, a digital humanities scholar can make new inferences on history, language use, cultural trends, or even the structure of the printed word. However, access is complicated by the complexities of navigating copyright laws around the world, while use of the materials is impeded by the effort and technical demands of a researcher. To address both of these issues, the Extracted Features (EF) dataset from the HathiTrust Research Center (HTRC) provides volumes in a format that has already been cleaned, extracted, and tagged for computation use.
In this hands-on tutorial, participants will learn to use the Extracted Features dataset for text analysis alongside the HTRC Feature Reader library, equipping them to perform research on millions of publicly-accessible volumes. Through the HTRC Feature Reader, participants will be make use of popular data science tools in Python for EF dataset analysis, and will be left with demonstrative materials to repurpose in their future work.

The Extracted Features (EF) dataset from the HathiTrust Research Center (Capitanu et al., 2015) provides an open and permissive download of page-level extracted information for every page of 4.8 million volumes from the HathiTrust Digital Library. A “feature” refers broadly to a quantitative measure of some property in a dataset; for example, the number of times a word appear on a page. The EF data features include part-of-speech-tagged term counts, line and sentence counts, counts of the characters occurring on the far left and far right side of a page, and inferred language probabilities. Most notably, this information is provided
for every page. Also, headers, body, and footer have been identified and features are provided separately for each part.

In the tutorial, participants will learn the significance of each feature, such as using line counts and character information to identify the type of content on a page, or using part-of-speech tags for improving topic models based on content.

This tutorial introduces participants to introductory text analysis in Python using the Extracted Features dataset with the HTRC Feature Reader. This includes accessing term counts and other raw information, slicing within that data, visualization trends within or across books, and leading into advanced techniques like topic modeling and sentiment tagging.
The skills taught in this tutorial are underpinned by programming in Python using a popular set of data science libraries. All code examples are provided, though they are most useful if participants are comfortable in tinkering and have a familiarity with Python's basic conventions. Our intention is to make the code examples transparent enough so as to be easily modifiable by beginner users.
A participant completing the workshop will understand:

the structure and possibilities of the HTRC Extracted Features Dataset;
how to access the EF dataset files, both for individual and bulk use;
how to start a Jupyter notebook, an accessible browser-based approach to data science in Python;
the fundamentals of reading volume files, accessing metadata, and slicing and grouping token lists;
basic visualization of EF data; and
an advanced analytic technique modeled on recent digital humanities methods, discussed below.

The first part of the tutorial teaches the fundamental skills for working with the HTRC Feature Reader. For the final exercise of the tutorial, participants have a choice from prepared advanced exercises, which instructors will assist individually. This structure accommodates more intensive approaches in the time given, while also leaving participants with more examples for practicing their newly-acquired skills in the future.
The advanced exercises to be provided are: classification of paratext, using features suggested by Underwood (2014); visualization of sentiment trends in books as a proxy for a plot arc as previously performed by Jockers (2015), and within-book topic modeling using the inference approach presented by Organisciak et al. (2015).

The EF dataset offers a diverse and incredibly large collections of works for analysis in an easily accessible way. By providing these works as already extracted data, the EF dataset covers a large part of the text analysis workflow for researchers and is thus particularly suited for learners. This tutorial will use EF to teach text analysis through Python, using new software called the HTRC Feature Reader. By the end, students will be able to slice, group, and manipulate individual volumes for their needs, and will be familiar with techniques for modeling texts, identifying pertinent pages, and plotting trends across books.

Capitanu, B., Underwood, T., Organisciak, P., Bhattacharyya, S., Auvil, L., Fallaw, C., J. and Downie J. C. (2015).
Extracted Feature Dataset from 4.8 Million HathiTrust Digital Library Public Domain Vol. 2,[Dataset]. HathiTrust Research Center, doi:10.13012/j8td9v7m.

Jockers, M. L. (2015). Revealing Sentiment and Plot Arcs with the Syuzhet Package.
Matthew L. Jockers. Blog.

Organisciak, P., Auvil, L. and Downie J. S. (2015). Remembering books: A within-book topic mapping technique.
Digital Humanities 2015. Sydney, Australia.

Underwood T. (2014).
Understanding Genre in a Collection of a Million Volumes. Interim Report.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2016
"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

454 works by 1072 authors indexed

Conference website:

Series: ADHO (11)

Organizers: ADHO