A poster describing the latest version of the Extracted Features (EF) dataset derived from the HathiTrust Digital Library's 17+ million volume corpus. This version employs Linked Data standards both to make the dataset more accessible and to incorporate richer metadata describing the volumes from which the data was derived. The dataset is arranged by volume, and the data (tokens, part-of-speech tags, language tags, line counts, etc.) is directly associated with the metadata describing the volume in the form of individual JSON-LD documents. The EF dataset provides a ready means of interacting with volumes whose intellectual content remains under copyright and allows a variety of analytics, such as visualizing word usage over time, to be carried out on data that would not otherwise be accessible.
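To make the per-volume arrangement concrete, the following is a minimal Python sketch of how token counts might be read from one such document and paired with a publication year for a word-usage-over-time analysis. The sample document is hand-written for illustration, and its field names (`metadata`, `pubDate`, `features`, `pages`, `tokenPosCount`) are assumptions rather than a confirmed description of the dataset's schema.

```python
import json
from collections import Counter

# Hypothetical, hand-written stand-in for one volume's JSON-LD document.
# The field names and values are illustrative assumptions, not real EF data.
SAMPLE_VOLUME = """
{
  "@context": "https://schema.org",
  "metadata": {"title": "An Example Volume", "pubDate": 1895},
  "features": {
    "pages": [
      {"seq": "00000001", "lineCount": 2,
       "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
      {"seq": "00000002", "lineCount": 1,
       "body": {"tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 4}}}}
    ]
  }
}
"""


def token_counts(volume: dict) -> Counter:
    """Sum per-page token counts across all part-of-speech tags."""
    counts = Counter()
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            counts[token.lower()] += sum(pos_counts.values())
    return counts


volume = json.loads(SAMPLE_VOLUME)
counts = token_counts(volume)
year = volume["metadata"]["pubDate"]

# One (year, count) point for the word "whale"; repeating this over many
# volumes would give the series needed to plot usage over time.
print(year, counts["whale"])
```

Because only counts are stored, and not the running text, this kind of aggregation remains possible for volumes whose content is under copyright.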
Hosted at Carleton University, Université d'Ottawa (University of Ottawa)
Ottawa, Ontario, Canada
July 20, 2020 - July 25, 2020
475 works by 1078 authors indexed
Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.
Conference website: https://dh2020.adho.org/
Series: ADHO (15)