An Open Data Approach to Revealing Indigenous Texts in Large-Scale Digital Repositories: A Case-Study of Locating Pages of Māori Text in the HathiTrust

David Bainbridge; J Stephen Downie; Hemi Whaanga

Authorship

1. David Bainbridge (University of Waikato)

2. J Stephen Downie (University of Illinois, Urbana-Champaign)

3. Hemi Whaanga (University of Waikato)

Work text


In this case study we report on our experiences in locating pages of Māori text in the HathiTrust Digital Library (HTDL). A search using traditional bibliographic metadata, i.e., the language field, returned only 182 items out of HTDL’s 17.1 million volumes. Our Open Data approach is based on the freely available HathiTrust Extracted Features Dataset. We establish a collection of high-frequency terms in Te Reo Māori, which we use iteratively as search terms to identify a group of candidate texts. We then apply NLP analysis to verify which of those texts contain substantial amounts of the Māori language. Using this approach we were able to increase the number of volumes returned to 598. This positive result suggests that scholars who want to analyse other low-resourced languages should be able to adopt our workflow to reveal otherwise hidden texts in their desired languages.
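The candidate-selection step described above can be illustrated with a minimal sketch. It assumes page-level token counts of the kind a feature dataset provides, and flags pages where a large share of tokens are high-frequency Māori terms. The term list, the function names `maori_term_ratio` and `candidate_pages`, and the 0.3 threshold are all hypothetical choices made for illustration; they are not the term collection, code, or cut-off used in the study.

```python
from collections import Counter

# Illustrative, hand-picked set of common Māori function words.
# This is an assumption for the sketch, not the study's term list.
# Words that collide with English ("a", "i", "me", "he") are left out on purpose.
MAORI_TERMS = frozenset({
    "te", "ki", "ka", "ko", "nga", "ngā", "kua", "hoki", "kei", "hei",
})


def maori_term_ratio(token_counts):
    """Return the share of a page's tokens that appear in MAORI_TERMS."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for tok, n in token_counts.items() if tok.lower() in MAORI_TERMS)
    return hits / total


def candidate_pages(pages, threshold=0.3):
    """Return ids of pages whose Māori-term share meets the (hypothetical) threshold."""
    return [
        page_id
        for page_id, counts in pages.items()
        if maori_term_ratio(counts) >= threshold
    ]


# Toy input: page id -> token counts, standing in for per-page features.
pages = {
    "p1": Counter("ka haere te tangata ki te whare".split()),
    "p2": Counter("the man went to the house".split()),
}
print(candidate_pages(pages))
```

A frequency filter like this only nominates candidates. Short function words can also occur in other languages, which is why the workflow follows it with a separate verification step.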

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ADHO - 2020

"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO
