Taming the Data: Web-Scraping and De-Duplicating Messy Multilingual Philosophy Corpora

Raluca A. Tanasescu; Cristian A. Marocico

Authorship

1. Raluca A. Tanasescu

Rijksuniversiteit Groningen (University of Groningen)

2. Cristian A. Marocico

Rijksuniversiteit Groningen (University of Groningen)

Work text


This poster presents a technical report and a method for corpus expansion in the humanities, applied to early modern philosophy, together with a case study in handling heavy data redundancy across several Latin, English, and French title corpora. It expands on the steps taken during the initial stages of a data-intensive research project that aims to go beyond established writers and views in natural philosophy between 1600 and 1800, and it reflects on the collaboration between a humanist and a data scientist on web-scraping and on taming redundant multilingual data in Python.
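The abstract does not describe the de-duplication procedure itself, so the following is only a minimal illustrative sketch of how redundant titles in Latin, English, and French might be collapsed in Python. It is not the authors' method. The function names `normalize_title` and `deduplicate`, the similarity threshold of 0.9, and the sample titles are all assumptions introduced here for illustration; the approach shown (Unicode normalization followed by pairwise fuzzy matching with the standard library's `difflib`) is one common choice among several.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Return a comparison key: case-folded, diacritics and punctuation removed."""
    # Decompose accented characters so the accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop the combining marks (e.g. the acute accent in "méthode").
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.casefold())
    return " ".join(cleaned.split())


def deduplicate(titles, threshold=0.9):
    """Keep the first occurrence of each title; drop later near-duplicates.

    `threshold` is a hypothetical cut-off on difflib's similarity ratio and
    would need tuning on real data. The pairwise comparison is quadratic,
    so this sketch only suits small corpora.
    """
    kept = []  # pairs of (original title, normalized key)
    for title in titles:
        key = normalize_title(title)
        if not key:
            continue  # skip entries that are empty after normalization
        is_duplicate = any(
            SequenceMatcher(None, key, kept_key).ratio() >= threshold
            for _, kept_key in kept
        )
        if not is_duplicate:
            kept.append((title, key))
    return [original for original, _ in kept]


if __name__ == "__main__":
    sample = [
        "Principia Philosophiae",
        "PRINCIPIA  philosophiae.",
        "Discours de la méthode",
        "Discours de la methode",
        "An Essay Concerning Human Understanding",
    ]
    for title in deduplicate(sample):
        print(title)
```

Keeping the first-seen spelling as the canonical record is a design choice made for brevity; a real pipeline might instead prefer the most complete record or merge metadata across duplicates.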

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ADHO - 2020

"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO
