A Study on the Accuracy of OCR-based and NLP-based detection of Japanese Text in the HathiTrust Extracted Features v2.0 Dataset

poster / demo / art installation
  1. David Bainbridge

    Department of Computer Science, University of Waikato, New Zealand

  2. Genna Hilbing

    HathiTrust Research Center, iSchool, University of Illinois Urbana-Champaign, USA

  3. Ming Jiang

    HathiTrust Research Center, iSchool, University of Illinois Urbana-Champaign, USA

  4. Yuerong Hu

    HathiTrust Research Center, iSchool, University of Illinois Urbana-Champaign, USA

  5. Glen Layne-Worthey

    HathiTrust Research Center, iSchool, University of Illinois Urbana-Champaign, USA

  6. J Stephen Downie

    HathiTrust Research Center, iSchool, University of Illinois Urbana-Champaign, USA

The HathiTrust Research Center (HTRC) Extracted Features (EF) Dataset [1] consists of volume-, page-, and word-level data for more than 17 million volumes in a wide array of languages. Every volume is described by a library catalogue record, which includes at least one cataloguer-determined primary language for that volume. Although it is generally accurate, volume-level language information does not tell the whole story of a book: such description likely disregards substantial but incidental additional language material at the page level. Accompanying and supplementing this human-created, volume-level language metadata in the HTRC EF Dataset is page-level, machine-generated language metadata for each of the 6.2 billion pages—a design decision we consider appropriate, given the overwhelmingly daunting task that page-level manual cataloguing would be.
Machine-generated language detection occurs at two different stages of the EF production process: during initial OCR, and as part of a complex pipeline of other natural language processes [5, 6]. This poster reports on a set of related studies to assess the quality and usability of this machine-generated metadata, and to suggest means to improve them. In recognition of DH2022’s host country, and acknowledging that both NLP and OCR are notoriously problematic for Asian languages [2, 3, 4], we have narrowed our focus here to texts identified by either human or algorithm as being in Japanese.
Both page-level and volume-level metadata are searchable in HTRC’s Solr-based search interface, the “Workset Builder,” which, in an ideal scenario, allows scholars to unearth pages written in their language of study that would otherwise go undiscovered (or at least would be much more difficult to find) because they are “masked” by appearing in a volume identified as being in a different language. We focused our study precisely on these cases.
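
The selection of "masked" pages described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PageRecord` fields, the two-letter language codes, and the helper names are hypothetical and do not reflect the actual EF schema or the Workset Builder's Solr fields.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical page-level record; field names are assumptions."""
    volume_id: str
    page_seq: int
    volume_lang: str  # cataloguer-assigned, volume-level language
    page_lang: str    # machine-assigned, page-level language


def masked_pages(pages, target="ja"):
    """Return pages tagged `target` by the algorithm inside volumes catalogued otherwise."""
    return [p for p in pages if p.page_lang == target and p.volume_lang != target]


def sample_for_review(pages, k, seed=0):
    """Draw up to k pages without replacement, reproducibly, for human review."""
    pool = list(pages)
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Toy records standing in for real page-level metadata.
pages = [
    PageRecord("vol-a", 1, "en", "en"),
    PageRecord("vol-a", 2, "en", "ja"),
    PageRecord("vol-b", 1, "ja", "ja"),
    PageRecord("vol-c", 7, "zh", "ja"),
]
candidates = masked_pages(pages)
review_set = sample_for_review(candidates, k=400)
```

Swapping the two comparisons in `masked_pages` would select the opposite kind of mismatch: pages not tagged Japanese inside volumes catalogued as Japanese.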
Having randomly sampled 400 items where the volume-level language metadata was not Japanese but the NLP language identification tool had classified a page as having Japanese text, we relied on human classification to determine the actual language of each page. Overall, the accuracy of the NLP-based Japanese identification was poor. Examples of pages erroneously identified as Japanese included illustrations, blank pages with a few “noise” marks on them, handwritten texts, mathematical or musical notation, and pages with a substantial portion of characters misidentified as Japanese kanji. We found the largest category of error to be scanned images that included kanji characters that the NLP tool had classified as being Japanese when they were actually Chinese. In fact, out of the 400 sampled pages, only 1 was found to be actually Japanese text. (Keep in mind that our sample set consisted purposefully of volume-page “mismatches” of language identification.) We then studied the opposite phenomenon, sampling pages identified as anything other than Japanese, from volumes human-cataloged as Japanese. This second study also surfaced a substantial number of algorithmically introduced errors, assignable to a different set of error categories.
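
To put the 1-in-400 result in perspective, the share of confirmed Japanese pages in the sample can be given an approximate uncertainty range. The sketch below uses a Wilson score interval, which is our choice for illustration and not a method reported in the study; it assumes simple random sampling, and the figure applies only to the deliberately selected mismatch cases, not to Japanese detection across the dataset as a whole.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


confirmed, sampled = 1, 400  # counts reported for the first study
share = confirmed / sampled
low, high = wilson_interval(confirmed, sampled)
print(f"{share:.2%} confirmed; approximate 95% interval {low:.2%} to {high:.2%}")
```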

What is the research cost of these errors in terms of misidentified language materials? Figure 1 summarizes our initial calculations. HathiTrust contains 559,718 volumes human-identified as Japanese, consisting of 249,252,918 pages. There are also 176,300,305 pages algorithmically identified as Japanese, spanning 623,623 volumes. The intersection of these sets is the degree of agreement between these two methods of identifying Japanese language materials: there are 168,026,395 pages in common, coming from 501,150 volumes. This mismatch indicates that scholars are likely to miss a substantial amount of text with either search methodology.

Figure 1. Summary of potentially “missing” Japanese-language materials between two methods of retrieval.
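
The size of each "missing" region follows by subtraction from the counts reported above. The sketch below works under the assumption that the reported overlap is a strict subset of both the human-identified and the algorithmically identified sets; the variable and function names are ours.

```python
# Counts as reported in the text.
human = {"volumes": 559_718, "pages": 249_252_918}      # catalogued as Japanese
algorithm = {"volumes": 623_623, "pages": 176_300_305}  # machine-tagged as Japanese
both = {"volumes": 501_150, "pages": 168_026_395}       # found by both methods


def only_in(method, overlap):
    """Counts found by one method but absent from the overlap."""
    return {unit: method[unit] - overlap[unit] for unit in method}


human_only = only_in(human, both)
algorithm_only = only_in(algorithm, both)

for label, counts in (("human only", human_only), ("algorithm only", algorithm_only)):
    print(f"{label}: {counts['volumes']:,} volumes, {counts['pages']:,} pages")
```

Under that reading, 81,226,523 pages in 58,568 volumes are reachable only through the volume-level catalogue language, and 8,273,910 pages in 122,473 volumes only through the page-level tags.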

The error rates found through both these analyses are high enough that we are considering changes both in the Workset Builder interface (to provide caveats for researchers upon executing page-level language searches) and in the production pipeline for the next release of the EF dataset. For the pipeline, we are considering employing newer and different language-detection packages (an approach that appears promising in pilot tests) and seeking access to an altogether new source of language detection: information that is often, but not always, produced during the initial OCR process and encoded in metadata not previously available to us.
While we stand by the decisions that led us to favor human language identification at the volume level, and algorithmic language identification at the page level, we are nonetheless inspired to refine and qualify both process and presentation of this important dataset.

[1] Jett, J., Capitanu, B., Kudeki, D., Cole, T., Hu, Y., Organisciak, P., Underwood, T., Dickson Koehl, E., Dubnicek, R., & Downie, J. S. (2020). The HathiTrust Research Center Extracted Features Dataset (2.0). HathiTrust Research Center. https://doi.org/10.13012/R2TE-C227

[2] Meknavin, S., Kijsirikul, B., Chotimongkol, A., & Nuttee, C. (1998). Combining trigram and winnow in Thai OCR error correction. COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics.

[3] Ikeda, K., Hayashi, R., Nagasaki, K., & Morishima, A. (2017). Human-assisted OCR of Japanese books with different kinds of microtasks. iConference 2017 Proceedings Vol. 2.

[4] Yin, Y., Zhang, W., Hong, S., Yang, J., Xiong, J., & Gui, G. (2019). Deep learning-aided OCR techniques for Chinese uppercase characters in the application of Internet of Things. IEEE Access, 7, 47043–47049.

[5] The Optimaize Language Detector. https://github.com/optimaize/language-detector.

[6] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations (pp. 55–60).

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO