Extending the Utility of the HTRC Extracted Features Dataset Through Linked Data

  1. Jacob Jett

    University of Illinois, Urbana-Champaign

  2. Boris Capitanu

    University of Illinois, Urbana-Champaign

  3. Deren Kudeki

    University of Illinois, Urbana-Champaign

  4. Timothy W. Cole

    University of Illinois, Urbana-Champaign

  5. J. Stephen Downie

    University of Illinois, Urbana-Champaign

This poster describes the latest version of the Extracted Features (EF) dataset, derived from the HathiTrust Digital Library's corpus of more than 17 million volumes. This version employs Linked Data standards both to make the dataset more accessible and to incorporate richer metadata describing the volumes from which the data were derived. The dataset is organized by volume, and the data (tokens, part-of-speech tags, language tags, line counts, etc.) are directly associated with the metadata describing each volume in the form of individual JSON-LD documents. The EF dataset provides a ready means of interacting with volumes whose intellectual content remains under copyright and allows a variety of analytics, such as visualizing word usage over time, to be carried out on data that would not otherwise be accessible.
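
As a rough illustration of the per-volume arrangement described above, the following Python sketch reads one small JSON-LD-style volume document and totals its token counts, then groups the count of a single word by publication year. The sample document and every field name in it (`metadata`, `pubDate`, `features`, `pages`, `body`, `tokenPosCount`) are assumptions made for this example, as are the helper names `token_counts` and `usage_by_year`; the published EF schema should be consulted before adapting it.

```python
import json
from collections import Counter

# Hypothetical, heavily simplified volume document. The field names are
# assumptions for illustration and may not match the published EF schema.
SAMPLE_VOLUME = json.loads("""
{
  "@context": "https://schema.org/",
  "metadata": {"title": "Example Volume", "pubDate": 1895},
  "features": {
    "pages": [
      {"seq": "00000001", "lineCount": 2,
       "body": {"tokenPosCount": {"whale": {"NN": 3}, "the": {"DT": 5}}}},
      {"seq": "00000002", "lineCount": 1,
       "body": {"tokenPosCount": {"whale": {"NN": 1, "VB": 1}, "sea": {"NN": 2}}}}
    ]
  }
}
""")


def token_counts(volume):
    """Sum per-page token counts across all part-of-speech tags."""
    counts = Counter()
    for page in volume["features"]["pages"]:
        body = page.get("body") or {}
        for token, pos_counts in body.get("tokenPosCount", {}).items():
            counts[token.lower()] += sum(pos_counts.values())
    return counts


def usage_by_year(volumes, word):
    """Total occurrences of `word` per publication year (assumed `pubDate`)."""
    by_year = Counter()
    for volume in volumes:
        year = volume["metadata"]["pubDate"]
        by_year[year] += token_counts(volume)[word.lower()]
    return dict(by_year)


if __name__ == "__main__":
    print(token_counts(SAMPLE_VOLUME).most_common(3))
    print(usage_by_year([SAMPLE_VOLUME], "whale"))
```

Because only counts are stored rather than running text, an aggregation of this kind can be run over in-copyright volumes, and the per-year totals could then be passed to any plotting library to visualize word usage over time.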

Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO