Semi-Automatic Identification of Travelogues

paper, specified "short paper"
Authorship
  1. Jan Rörden

    AIT Austrian Institute of Technology GmbH

  2. Rainer Simon

    AIT Austrian Institute of Technology GmbH

  3. Doris Gruber

    OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

  4. Martin Krickl

    Österreichische Nationalbibliothek (Austrian National Library)

  5. Bernhard Haslhofer

    AIT Austrian Institute of Technology GmbH

Work text

Travel literature is a rich source of information about the past and has attracted increasing scholarly interest (cf. Salzani & Tötösy de Zepetnek, 2010; Belgum et al., 2018). The Travelogues project studies what we can learn from past views of foreign regions, cultures and religions in the light of present-day challenges such as mass tourism, migration and globalization. Comprising a team of historians, librarians and data scientists, Travelogues takes a transdisciplinary approach, combining quantitative and qualitative analytical methods to study a large-scale corpus of German-language travelogues.

The project focuses on German-language holdings of the Austrian National Library printed between 1500 and 1876, including 167,570 digitized volumes. Those volumes have previously been digitized and processed by Optical Character Recognition (OCR) in the Austrian Books Online project, a public-private partnership of the Library and Google Books. Analyzing this vast and heterogeneous collection poses a number of challenges. The first challenge is to compile a corpus that includes as many travelogues from the inventory as possible. The second challenge is to profile the corpus at scale, analyzing it specifically for geographical coverage, salient terms over time and intertextuality. Finally, the key challenge lies in identifying depictions of otherness in the corpus and tracing how those depictions evolved over time.

Travelogues is a work in progress. In this paper we describe our results so far (how we created the corpus and approached the task of profiling) and present our plans for the upcoming steps required for a detailed analysis of the corpus.

As the intellectual basis for the project, historians first established a definition for this project’s use of the term travelogue.
In the context of the project, a travelogue is defined as a specific form of media that records the experiences of a factually undertaken journey. Applying this definition, we created a balanced ground truth of digitized travelogues and non-travelogues (works that could belong to any other genre). To account for variations in the data such as document length, OCR quality or orthographic differences, we created separate ground-truth datasets for different time periods: the 16th, 17th and 18th centuries and 1800–1876. This is a manual and time-consuming process involving several steps, including keyword and metadata searches of the collection, cleansing and enrichment of heterogeneous metadata, and comparisons with both contemporary and modern travelogue bibliographies (e.g. Chatzipanagioti-Sangmeister, 2006; Griep & Luber, 1990; Treue, 2014; Yerasimos, 1991). Every book we identified using this method was independently verified by a historian and a librarian.

Based on the ground truth for each period, we trained and evaluated different machine learning algorithms to classify works as either travelogues or non-travelogues. This evaluation used five-fold cross-validation on a training set and a validation set. Using the best-performing approach, we applied the models for each time period and classified all documents not part of the ground truth (a total of 161,522 books). Our model returned a confidence score, essentially quantifying the likelihood that a given document is or includes a travelogue. The top 200 findings for each time period (800 in total) were then manually evaluated by our domain experts to confirm the validity of the automated results. Our process revealed 345 previously unknown volumes of travelogues that were not listed as travelogues in any bibliography we have consulted so far, and that could not be found using conventional metadata search methods (e.g. searching for different spelling variations of the German word for travel).
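
The classification workflow described above (per-period ground truth, five-fold evaluation, confidence scores, manual review of the top-ranked findings) can be illustrated with a minimal sketch. The abstract does not state which algorithm or features performed best, so the naive Bayes classifier, the whitespace tokenizer, the toy German snippets and all function names below are assumptions made purely for illustration, not the project's implementation.

```python
import math
from collections import Counter

# Toy ground truth (invented for illustration): 1 = travelogue, 0 = non-travelogue.
DOCS = [
    "reise konstantinopel schiff", "predigt glaube gebet",
    "reise italien gebirge", "predigt sünde gnade",
    "reise ägypten wüste", "predigt taufe kirche",
    "reise russland winter", "predigt fasten beichte",
    "reise indien hafen", "predigt himmel engel",
]
LABELS = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
# Hypothetical documents outside the ground truth, keyed by an invented volume id.
UNLABELED = {
    "vol_a": "reise persien karawane",
    "vol_b": "predigt ostern auferstehung",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real OCR text needs more care)."""
    return text.lower().split()


def train_nb(docs, labels):
    """Collect per-class token counts for a multinomial naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(docs, labels):
        counts[label].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, Counter(labels), vocab


def confidence(model, text):
    """Return the model's estimate of P(travelogue | text), between 0 and 1."""
    counts, n_docs, vocab = model
    log_p = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        # Smoothed class prior, so an empty class never yields log(0).
        score = math.log((n_docs[label] + 1) / (sum(n_docs.values()) + 2))
        for token in tokenize(text):
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((counts[label][token] + 1) / (total + len(vocab)))
        log_p[label] = score
    peak = max(log_p.values())
    odds = {label: math.exp(value - peak) for label, value in log_p.items()}
    return odds[1] / (odds[0] + odds[1])


def cross_validate(docs, labels, k=5):
    """Mean accuracy over k folds; fold i holds out items i, i + k, i + 2k, ..."""
    accuracies = []
    for i in range(k):
        held_out = set(range(i, len(docs), k))
        train = [(d, l) for j, (d, l) in enumerate(zip(docs, labels)) if j not in held_out]
        model = train_nb([d for d, _ in train], [l for _, l in train])
        correct = sum((confidence(model, docs[j]) >= 0.5) == bool(labels[j]) for j in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def rank_candidates(model, unlabeled, top_n=200):
    """Score documents outside the ground truth and keep the top_n for expert review."""
    scored = [(confidence(model, text), doc_id) for doc_id, text in unlabeled.items()]
    return sorted(scored, reverse=True)[:top_n]


mean_accuracy = cross_validate(DOCS, LABELS, k=5)
review_list = rank_candidates(train_nb(DOCS, LABELS), UNLABELED, top_n=200)
```

In the project's setting, one such model would be trained per time period, and `top_n` would be 200, matching the 200 findings per period that the domain experts reviewed.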

Although the 345 newly discovered travelogues did not noticeably differ in content from the previously known canon, their materiality was distinctive. A large number of them were originally published as part of larger documents (such as serial publications, collected volumes or diaries) that, due to the lack of metadata in the library system, usually cannot be found with the traditional methods of the humanities described above. Although we did not segment the documents into smaller entities (e.g. pages or chapters) for classification (Underwood et al., 2013), this shows that our methodology yields robust results for documents that only partially belong to a genre, as in our case with travelogues. Additionally, we demonstrated that our methodology can expand traditional bibliographic research and help save time for domain experts. We have already described our classification task in detail (Rörden et al., 2020).

Our next steps concern the analysis and historical contextualization of the corpus. We are currently creating a searchable index of the entire document inventory (travelogues as well as non-travelogues) to enable exploratory searches. Key exploration scenarios include plotting the number of travelogues published over time, optionally filtered by various facets (such as authors, publishers, keywords or catalogue subject classifications), and exploring salient terms that feature more prominently in travelogues than in non-travelogues, or in travelogues of one period than in those of another. We have also begun to generate maps of travelogues’ geographical coverage by performing Named Entity Recognition and resolving coordinates against the GeoNames gazetteer.

Furthermore, we have taken the first steps to analyze intertextual relations between documents (e.g. Dörr & Kurwinkel, 2014; Rajewsky, 2002), experimenting with a mix of approaches including n-gram fingerprinting (cf. Stein, 2007) and paragraph vectors for document representation (Le & Mikolov, 2014). This has been combined with clustering and text passage alignment using the BLAST-based text reuse algorithm developed by Vierthaler and Gelein (2019). By applying the algorithm corpus-wide, we hope to learn more about the relationships between the documents and their authors, as well as how descriptions of and references to people, places and customs propagate through literature over extended time periods. Preliminary results seem promising and have revealed potential cases of previously undocumented text reuse. However, deeper analysis of the candidates and refinement of the methods are still ongoing. This method for detecting intertextual relations is also a promising tool for clustering and relating works based not only on literal title strings but also on indexed full texts, thus following the suggested implementation of the International Federation of Library Associations and Institutions Library Reference Model (IFLA LRM) in library catalogs (Decourselle et al., 2015; Rafferty, 2015; Riva et al., 2017).

With our method, we were able to create the largest curated corpus of German-language travelogues to date: 3,595 volumes, 345 of which were, to the best of our knowledge, not previously identified or findable as travelogues. This shows that methods like ours can successfully expand the classic bibliographic methodology of the humanities and library sciences. We have made first steps towards enabling the interactive exploration of a number of relevant properties of the corpus. The next year will be dedicated to addressing the main research goals: the identification of intertextual relations between the travelogues in our corpus to deepen our understanding of how they depended on each other, and what this tells us about the circulation of knowledge, stereotypes and prejudices.
This will ultimately lead to the question of how notions of otherness were depicted, how and why they changed over time, and what conclusions this allows concerning today’s perceptions and biases. We have already published large parts of our corpus under a Creative Commons license. Beyond the goals of our own project, we feel that the open availability of the corpus marks a significant contribution to the research community at large, and will invite further scholarship and collaboration around this exciting resource.
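
As a rough illustration of n-gram fingerprinting for finding text-reuse candidates, the sketch below hashes overlapping word n-grams per document and compares documents by the share of fingerprints they have in common. It is a deliberate simplification: it uses plain Jaccard overlap rather than the BLAST-based alignment the project applies, and the toy English sentences, the trigram size, the threshold and all function names are hypothetical.

```python
import hashlib
import re
from itertools import combinations

# Invented toy corpus; the ids and sentences are placeholders, not project data.
CORPUS = {
    "vol_a": "The city lies on a great river and the people trade in silk",
    "vol_b": "We found that the city lies on a great river and the markets were full",
    "vol_c": "A sermon on the virtues of patience and humility",
}


def fingerprint(text, n=3):
    """Return the set of hashed word n-grams (shingles) of one document."""
    tokens = re.findall(r"\w+", text.lower())
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # A stable hash keeps fingerprints comparable across runs and machines.
    return {hashlib.sha256(g.encode("utf-8")).hexdigest()[:16] for g in grams}


def jaccard(a, b):
    """Share of fingerprints that two documents have in common."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_candidates(corpus, n=3, threshold=0.2):
    """List document pairs whose fingerprint overlap reaches the threshold."""
    prints = {doc_id: fingerprint(text, n) for doc_id, text in corpus.items()}
    pairs = []
    for left, right in combinations(sorted(prints), 2):
        score = jaccard(prints[left], prints[right])
        if score >= threshold:
            pairs.append((left, right, score))
    # Highest overlap first, so the strongest candidates are reviewed first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


candidates = reuse_candidates(CORPUS)
```

Pairs flagged this way are only candidates. As with the preliminary results reported above, each one would still need passage-level alignment and expert review before it counts as documented text reuse.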

Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO