Workshop: HathiTrust Research Center’s Extracted Features 2.0 Dataset

workshop / tutorial
  1. 1. Ryan Dubnicek

    HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA

  2. 2. Jennifer Christie

    HathiTrust Research Center, Pervasive Technology Institute / Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN USA

  3. 3. Deren Kudeki

    HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA

  4. 4. Glen Layne-Worthey

    HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA

  5. 5. John A. Walsh

    HathiTrust Research Center, Pervasive Technology Institute / Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN USA

  6. 6. J. Stephen Downie

    HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

This workshop will introduce participants to the HathiTrust Research Center’s Extracted Features Dataset, and demo new data fields and functionality introduced in the latest version, 2.0. Generated from the over 17 million volumes (over 60% still in copyright) in the HathiTrust Digital Library, the EF 2.0 Dataset supports text and data mining in this corpus while still being distributed as open, restriction-free data. This tutorial will introduce the EF 2.0 Dataset, the key concepts behind its creation, and hands-on research use cases for the Dataset using IPython notebooks.
Index Terms—digital libraries, text and data mining, HathiTrust, HTRC Extracted Features Dataset, collections as data

Digital libraries are in a state of flux: as institutional collections have grown, their purposes have been transformed from “preservation and access” into “big data” repositories. “The Santa Barbara Statement on Collections as Data” identifies foundational principles for this new focus of digital libraries, encouraging library providers and researchers to view these vast digital collections as sources of data for computational research (Santa Barbara Statement on Collections as Data, 2019). This approach opens new possibilities of inquiry and discovery, but also carries with it the weight of copyright limitations and implications; questions of bias, completeness and representation in data; and additional requirements for new skills, sophisticated infrastructures, and increased funding from stakeholders at all levels.
Having foreseen some of these issues, the HathiTrust Research Center (HTRC) developed an approach to “non-consumptive research,” a framework that supports computational research of digital items in the HathiTrust Digital Library (HTDL) while complying with copyright limitations on data access. Similarly, through its Extracted Features (EF) Dataset, HTRC supports an approach to computational research that seeks to minimize skill and infrastructure requirements while also aligning with the broader considerations mentioned above. First released in 2015 with version 0.2, the EF Dataset blends MARC-derived, volume-level metadata with computationally generated, page-level metadata and data (Capitanu et al., 2015). The EF Dataset is open (available under a Creative Commons Attribution 4.0 International License), and includes data for each volume in the HathiTrust Digital Library (HTDL) in a bag-of-words format, supporting a multitude of computational use cases.
This workshop will focus on introducing the context and application of the EF Dataset version 2.0, for the first time published in a linked-data-compliant format that includes new and updated data fields, a revised structure, and meaningful uniform resource identifiers (URIs) (Jett, et al., 2020). Attendees will learn about the data from which the EF Dataset is generated, the motivation for the dataset’s creation, its structure and format, and its potential applications and limitations. Additionally, workshop attendees will work hands-on programmatically with EF 2.0 data using IPython notebooks, and will use the HTRC FeatureReader Python library, developed specifically to facilitate use of the EF Dataset (Organisciak and Capitanu, 2016).

Workshop Description

Section In this three-hour workshop, attendees will first be introduced to HathiTrust, the data of the HathiTrust Digital Library, and the HathiTrust Research Center, followed by context and motivation of the Extracted Features Dataset, the specifics features (or “fields”) included in the dataset, its format, and a discussion of its potential uses and limitations. Finally, attendees will have a chance to work directly with Python code to analyze HTRC EF data and to use the HTRC FeatureReader Python library. Hands-on work will also present new linked data fields and their potential application for research with the EF Dataset. A more detailed breakdown of the workshops modules follows:
Section 1: Intro to HathiTrust Digital Library and HTRC

What is HathiTrust?
What is/isn’t the HathiTrust Digital Library?
What is the HathiTrust Research Center?

Section 2: Context and motivation for the HTRC EF Dataset

Non-consumptive research
What is in the data?
Data models and analysis techniques

Section 3: Ethical considerations of text datasets

Bias in libraries, datasets, data, and algorithms

Section 4: Getting and Exploring EF data

Hands on with EF data, Python notebooks and the HTRC FeatureReader library

The workshop will be a mix of presentation, discussion, and hands-on activity, with an emphasis on open discussion. Our discussion on the ethics of dataset construction, big data and algorithmic analysis will highlight work from Katherine Bode (2020), Catherine D’Ignazio and Lauren Klein (2020), Mimi Ọnuọha (2021) and Roopika Risam (2019).

The workshop is open to all, and is targeted at digital library and digital humanities scholars of all levels, library and information professionals, and anyone interested in computational research using HathiTrust Digital Library data. Attendees will develop an understanding of HathiTrust Digital Library, the HathiTrust Research Center’s services, tools and platforms, selected ethical issues associated with digital libraries, dataset and data analysis, and the Extracted Features model, its suitability for various text and data mining applications, and hands-on familiarity with using the dataset for exploratory data analysis. The workshop can accommodate 10 - 40 attendees.
This workshop builds on years of HTRC workshops ranging from general introductions to text and data mining to more advanced work with HTRC’s tools, services, and data. This workshop will provide an in-depth focus on the Extracted Features Dataset version 2.0, demo new ways for exploring and using the dataset and also engage with the ethics of datasets, data and algorithms.
The general objectives of this workshop are to introduce the HathiTrust context, motivation for, and development and release of the Extracted Features Dataset, and to familiarize participants with the data format, its potential applications, and the latest additions in the 2.0 version. Topics of instruction and potential discussion will include:
• How does the Extracted Features Dataset help make the HTDL more accessible for text and data mining?
• What is the EF Dataset model and the structure of its files?
• What research use cases or exploratory data analysis can be supported using HTRC EF data, especially using the new features of the 2.0 dataset”?
• What tools are available for working with EF data, and hands-on experience using them in Python notebooks?
Our goal is for attendees to leave this workshop with a general understanding of the utility of derived datasets and to be comfortable beginning exploratory data analysis using the EF Dataset.
In addition to more HTRC-centric learning objectives, hands-on activities will have added bonuses of an introduction to common cultural analytics tasks in Python, and the associated software libraries used for such tasks, including Pandas, NLTK and Gensim.

Instructor Biographical Information
Ryan Dubnicek is a Digital Humanities Specialist with HTRC, where he works on external and internal research support and outreach and education. He has a Master of Science in Library and Information Science from the University of Illinois at Urbana-Champaign.
Jennifer Christie is an Associate UX Specialist at the HathiTrust Research Center. She is interested in user-centered interaction design and front-end web development. Her research is grounded in qualitative and quantitative assessments of HTRC’s user base, as well as assisting with outreach and education efforts to engage with HTRC’s growing user community.
The remaining listed authors have contributed to the development of this curriculum and its associated instructional materials and concepts, but will not be part of the instructional team.

The curriculum presented in this workshop was developed with support from HathiTrust, University of Illinois, Indiana University, and the Institute of Museum and Library Services, award number RE-00-15-0112-15. Additionally, former Associate Director of Outreach and Education with HTRC, Eleanor Koehl, contributed heavily to the development and design of the teaching materials and activities.

Bode, K. (2020). Why You Can’t Model Away Bias. Modern Language Quarterly, 81(1): 95–124 doi:

D’Ignazio, C. and Klein, L. (2020). ‘What Gets Counted Counts’. Data Feminism (accessed 10 December 2021).

Jett, J., Capitanu, B., Kudeki, D., Cole, T., Hu, Y., Organisciak, P., Underwood, T., Dickson Koehl, E., Dubnicek, R. and Downie, J. S. (2020). The HathiTrust Research Center Extracted Features Dataset (2.0) HathiTrust Research Center doi:
10.13012/R2TE-C227. (accessed 2 June 2022).

Organisciak, P. and Capitanu, B. (2016). Text Mining in Python through the HTRC Feature Reader. Programming Historian (accessed 2 June 2022).

Risam, R. (2019). The Stakes of Postcolonial Digital Humanities. New Digital Worlds. (Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy). Northwestern University Press, pp. 23–46 doi:
10.2307/j.ctv7tq4hg.5. (accessed 10 December 2021).

Partners Always Already Computational - Collections as Data (accessed 30 November 2021).

The Library of Missing Datasets — MIMI ONUOHA MIMI  ONUOHA (accessed 10 December 2021).

The Santa Barbara Statement on Collections as Data Always Already Computational - Collections as Data (accessed 30 November 2021).

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website:

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO