Evaluating a Machine Learning Approach to Identifying Expressive Content at Page Level in HathiTrust

poster / demo / art installation
Authorship
  1. Nikolaus Nova Parulian

    Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

  2. Kristina Hall

    HathiTrust

  3. Ryan Dubnicek

    Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

  4. Yuerong Hu

    Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

  5. J. Stephen Downie

    Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Evaluating a Machine Learning Approach to Identifying Expressive Content at Page Level in HathiTrust

Nikolaus Parulian (1), Kristina Hall (2), Ryan Dubnicek (1), Yuerong Hu (1), Stephen Downie (1)

(1) HathiTrust Research Center, School of Information Sciences, University of Illinois at Urbana-Champaign
(2) HathiTrust, University of Michigan

Introduction

HathiTrust provides scanned images, plain text, and metadata in support of its mission to contribute to research, scholarship, and the sharing of human knowledge. Since facts, unlike expressive content, are not protected by copyright, this project seeks to use machine learning approaches to evaluate how often expressive content appears in the first 20 pages of a given HathiTrust volume, with an eye to potentially making this data open.

Information contained in the first 20 pages of a volume can be useful to scholars. For example, the title page, table of contents, or acknowledgment page may contain information that helps in understanding the volume. However, it is likely that some volumes include copyright-protected material in this same range. Observed examples include illustrations and even the main text itself.

One way to determine whether expressive content is exposed in the first 20 pages is manual page labeling, which is time-intensive. A machine learning approach is more efficient and could be well suited to this type of prediction task. We seek to answer two research questions:

  1. Can we develop a machine learning approach to help detect expressive content in the first 20 pages of HathiTrust volumes?
  2. How reliably does this approach match manual labeling data?

Methodology

Providing a high-quality dataset for training the machine learning model is essential, and human expertise is required.
We manually sampled 900 volumes from HathiTrust and labeled each of the first 20 pages as either 'factual', for a page whose contents lack creative expression, or 'creative', if there is protected material on the page. We then developed a workflow that uses the statistical features of each page from the HathiTrust Research Center (HTRC) Extracted Features Dataset as additional data to train our model. The features used included token and line counts, tokens per line, and begin and end line characters.

Using the features above, we trained and compared four basic classification models on our feature set: Random Forest, Logistic Regression, Support Vector Machine, and Stochastic Gradient Descent. Through this comparison we hope both to find the most accurate model and to evaluate more generally whether a machine learning approach can be accurate for this task. The preliminary results of our prediction model can be seen in Figure 1.

Figure 1: Confusion Matrix for four models for predicting creative (protected) content

Conclusion and Future Work

Results suggest that the Random Forest model performs best in both accuracy for predicting all labels (86%) and recall (0.88) for predicting creative content. For this project, we give more attention to recall on the 'creative' label because a false negative on this label (a creative page predicted as factual) is the less desirable outcome. Future goals of this work are to pilot different methods that could increase confidence in determining creative content, such as deep learning and the use of page text, and to extend the scope of this prediction beyond our test set to a larger set of HathiTrust volumes.
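
The page-level features and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not read the HTRC Extracted Features Dataset. It assumes raw page text is available and uses whitespace tokenization as a stand-in. The function names `page_features` and `accuracy_and_recall` are hypothetical, and the sample page and labels are invented for illustration.

```python
from collections import Counter


def page_features(page_text):
    """Compute simple layout statistics for one page of plain text.

    Hypothetical helper: it mirrors the kinds of features named in the
    abstract, not the actual Extracted Features schema.
    """
    # Ignore blank lines so they do not distort the per-line statistics.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    # Whitespace tokenization is a simplifying assumption.
    tokens_per_line = [len(ln.split()) for ln in lines]
    token_count = sum(tokens_per_line)
    line_count = len(lines)
    return {
        "token_count": token_count,
        "line_count": line_count,
        "tokens_per_line": token_count / line_count if line_count else 0.0,
        # How often each character begins or ends a line on the page.
        "begin_line_chars": Counter(ln[0] for ln in lines),
        "end_line_chars": Counter(ln[-1] for ln in lines),
    }


def accuracy_and_recall(true_labels, predicted_labels, positive="creative"):
    """Return overall accuracy and recall on the positive label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    # A false negative is a creative page predicted as something else.
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = correct / len(pairs) if pairs else 0.0
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Toy data, invented for illustration only.
toc_page = "CONTENTS\nChapter I . . . 1\nChapter II . . . 15"
features = page_features(toc_page)

truth = ["creative", "creative", "factual", "factual", "creative"]
preds = ["creative", "factual", "factual", "factual", "creative"]
accuracy, recall = accuracy_and_recall(truth, preds)
```

In this toy example one creative page is predicted as factual, so recall on 'creative' drops to 2/3 while overall accuracy is 4/5. This shows why the two numbers are reported separately when missed creative pages are the costlier error.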


Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO