Dehmel digital – Algorithmic-driven indexing of historical letters

Marie Flüh; Sandra Bläß

Authorship

1. Marie Flüh

Universität Hamburg (University of Hamburg)
2. Sandra Bläß

Universität Hamburg (University of Hamburg)

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

On our poster, we present the
Dehmel digital project (Hamburg University and Hamburg State and University Library; cf. Nantke, 2022) and its automatisation-oriented workflow for indexing historical letters. The project aims to digitise and index the letters of the correspondence network of Ida and Richard Dehmel, consisting of approximately 35,000 original handwritten letters. To be able to manage this huge amount of material, manual, semi-automatic and automatic/algorithm-driven work steps are interlinked in the project's workflow. Therefore, different machine learning techniques that belong to the methodological repertoire of the Digital Humanities are combined with each other. The goal is a digital network scholarly edition of the letters that makes the personal, cultural, and social dynamics contained in the correspondences tangible and explorable (cf. Nantke et al., 2022). There, the users can choose from various medial perspectives on the documents, from facsimiles and transcriptions to registers and visualizations, to pursue their diverse (research) interests.

The workflow
The process of converting the handwritten originals into machine-readable representations is divided into a series of successive work stages that produce different data. Each result of the respective work steps is another layer of abstraction from the original material (see fig. 1).

Figure 1: Different results of the worklow
In order to establish a procedure that enables the indexing of such large manuscript corpora, but also keeps a philological standard of scholarly editions in mind, we combine several procedures already established within the Digital Humanities and modify them for the use in a digital edition (see fig. 2).

Figure 2: The individual work steps in in the Dehmel digital project
After the original letters have been translated into high-resolution image digitisations and enriched with basic metadata, we transcribe about fifty handwritten document pages of one writer manually using Transkribus (cf. READ-COOP, 2021) until the data basis for training a Handwritten Text Recognition (HTR) model has been created. Following the successful training, further transcription is carried out in iterative alternation between automated transcription and manual post-correction, whereby the corrected letters are again fed into the HTR workflow as training data for a new model.
An XML processing chain converts these transcripts into a TEI-based format and applies a classifier of the Stanford Named Entity Recognizer (cf. Manning et al., 2014) to tag persons, places, institutions and artworks. The NER-classifier is adapted to the project’s material base since it is trained manually to recognise these four different types of entities in letters from artists from the period around 1900. Recognised entities are introduced inline while retaining the references to the layout and structure of the document. For this purpose, a simplified implementation of the Separated Markup API for XML (cf. Verwer, 2020) is used (cf. Maus, 2021) which integrates classified tokens of NER and annotations of textual characteristics. The entities recognised in this way are used as the basis for the semi-automated generation of registers. These enlist all of the included people, places, institutions, and artworks and direct them back to the relevant documents. To convert the entities into stable register entries, a machine-supported data reconciliation is carried out with norm data systems and local knowledge bases. For reconciliation, we use OpenRefine (cf. OpenRefine, 2022) and the Gemeinsame Normdatei (GND) (cf. Deutsche Nationalbibliothek, 2022) as a norm data system.

The output
The computer-assisted disambiguation of entity types with OpenRefine (cf. OpenRefine, 2022) forms the basis for registers as well as for manually created macro comments on the central actors, institutions, and artworks of the corpus (see fig. 3).

Figure 3: The homepage of the
Dehmel digital project

The algorithm-driven approach explained above ultimately enables both a basic overview of the entire corpus and partial correspondences. These are visualized as a dynamic, interactive network, which includes the letter’s senders, recipients, as well as everyone mentioned within the letters, and directs back to the respective documents (see fig. 4). By linking central entities, visualisations, and digitised letters, it becomes possible to gain precise insights into the extensive corpus and use it as a scholarly edited historical source.

Figure 4: Dynamic network visualisation of the correspondences as a network created with Graph Commons (cf. Arıkan et al., 2016)

Bibliography

Arıkan, B., Üstün, Z., Kızılay, A., Badur, A., Erikli, F., Zıngıl, Ö., Kılıçoğlu, D., Aldatmaz, A. and Dölec, G. (2016). Graph Commons. https://graphcommons.com (accessed 22 November 2021).

Deutsche Nationalbibliothek (2022): GND. Gemeinsame Normdatei.

https://gnd.network/Webs/gnd/DE/Home/home_node.html
(accessed 12 April 2022).

Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J. and McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In:
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland: ACL, pp.55–60.

Maus, D. (2021). XML NLP Pipeline, July 23, 2021. DOI: 10.25592/uhhfdm.9443.

Nantke, Julia (ed.) with the collaboration of Sandra Bläß and Marie Flüh (2022). Dehmel digital.
https://dehmel-digital.de/ (accessed 19 April 2022).

Nantke, Julia, Sandra Bläß, and Marie Flüh (2022). Literatur als Praxis: Neue Perspektiven auf Brief-Korrespondenzen durch digitale Verfahren. In: Fischer, F. and Horstmann, J. (eds.):
Sonderband Text Praxis. Digitales Journal für Philologie. Digitale Verfahren in der Literaturwissenschaft.
DOI: https://doi.org/10.17879/64059432335.

OpenRefine (2022). OpenRefine.

https://openrefine.org/
(accessed 22nd November 2021).

READ-COOP (2021). Transkribus. KI-gestützte Handschriftenerkennung.

https://readcoop.eu/de/transkribus/
(accessed 22 November 2021).

Verwer, N. (2020). Plain text processing in structured documents. In: Proceedings of Declarative Amsterdam 2020. DOI: 10.1075/da.2020.verwer.plain-text-processing.

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022

"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO

Dehmel digital – Algorithmic-driven indexing of historical letters

1. Marie Flüh

2. Sandra Bläß

ADHO - 2022

"Responding to Asian Diversity"