1. Research question
In Computational Literary Studies texts are typically pre-processed with Natural Language Processing (NLP) tools. However, due to historical and/or aesthetic characteristics, literary texts sometimes deviate notably from the data the tools are trained on. Due to this difference in domain, the performance of the tools drops (Scheible et al., 2011; Rayson et al., 2007; Herrmann, 2018; Bamman, 2020). Instead of considering this to be a problem, the ‘erroneousness’ of the tools could provide a computational understanding of the ‘deviance of literary texts’; produced errors might reveal something about the characteristics of literature.
In the following, we report on a
Tool Misuse experiment on German lyric poetry – a genre that is usually associated with a high degree of deviance (Müller-Zettelmann, 2000: 100; Zymner, 2019: 29–30) – in which we develop a pipeline that provokes tokenization, lemmatization and POS tagging 'errors' of NLP tools and typologises these 'errors' in a rule-based way.
Pipeline for error typing of the corpora.
Since gold standard annotations are not available for our scenario, we base our evaluation on the assumption that correctly produced lemmas can be found in dictionaries of German language. Based on the
TextGridRepository, we build a canon-based corpus of 'prototypical' German-language poetry comprising 5,144 poems. For comparison, we use a prose corpus of 100 German-language novels from the 19th century, compiled from the TextGridRepository and
Project Gutenberg. As dictionaries we use ‘GermaNet‘ (Hamp & Feldweg, 1997; Henrich & Hinrichs, 2010) and the ‘Digitales Wörterbuch der deutschen Sprache‘ (Klein & Geyken, 2010). To ensure that the resulting errors are not tagger specific, we use several NLP tools for tokenization, lemmatization and POS tagging of the corpora (fig. 01) and consider all content word types as potential errors that are lemmatized by at least two tools and for which none of the produced lemmas are found in the dictionaries (fig. 02, column "pFail").
Our 'error pipeline' prefers recall over precision, thus it produces only circumstantial evidence of potential errors. A larger number of false positives is to be expected, because we process out-of-vocabulary words of the dictionaries.
Number of word types (NOUN, VERB, ADJECTIVE) for the entire corpus ("all") and for the sets with potential errors ("pFail").
Based on manual inspections of the pFail set, we postulate 13 error types described in figure 03. For each type we formulate a rule
For the rules see:
which is then applied to the pFail set following the order of the error types listed below. Multiple typings are not possible.
Description of error types.
Relative frequency for the types of potential errors for the two pFail sets.
53.33 % of the word types in the pFail set for poetry and 59.88 % of the word types in the pFail set for prose are identified. PUNC and SHORT are predominantly sub-word level characters, mostly noise which appears to a comparable extent in poetry and prose. ORTH_SZ reflects the effect of Historical Orthography which a normalisation step could remedy.
The ten remaining types can be combined into three groups:
COMP_DASH, COMP, PART_ ADJECTIVE, PREFIXED gather
Creative Lexis, i.e. word formation mechanisms (composition, derivation); these are often out-of-vocabulary words and therefore pipeline errors, not tool errors. In poetry, 45.25 % of the "pFail" set can be assigned to this group, in prose 57.09 %.
As expected, the pipeline produces a higher error rate for poetry (0.62 %) than for prose (0.02 %) for ORTH_UPPER, which identifies a characteristic of
Lyric Typography (capitalizing first letters in lines).
The error rate of
Prosodic Deformation consisting of ELISION_APO, ELISION_SIMPLE, ELISION_END, EPITHESIS and CONTRACT is also higher for poetry than for prose (6.62 % compared to 1.93 %). We assume that the deformations are due to the addition or deletion of vowels for metric reasons.
Our pipeline identifies
Lyric Typography and
Creative Lexis as typical sources of error when processing poetry with NLP tools. However, our pipeline needs to be optimized: too many potential errors are, as in the case of
Creative Lexis, in fact not tool errors but pipeline errors. Additionally, our rule-based typology is only able to describe 53.33 % of the pFail set. This reveals two areas for follow-up research: the pipeline could be improved on to decrease the number of pipeline errors and the rule-based typologisation procedure could be optimized against our baseline.
Bamman, D. (2020). LitBank: Born-Literary Natural Language Processing. [Preprint]. https://people.ischool.berkeley.edu/~dbamman/pubs/pdf/Bamman_DH_Debates_CompHum.pdf [Last accessed November 16, 2021].
Braam, H. (2019).
Die berühmtesten deutschen Gedichte. Auf der Grundlage von 300 Gedichtsammlungen. Stuttgart: 2. Aufl., Kröner.
Hamp, B. and Feldweg, H. (1997). GermaNet - a Lexical-Semantic Net for German. In
Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. Madrid, Spain, pp. 9–15. https://aclanthology.org/W97-0802.pdf [Last accessed November 16, 2021].
Henrich, V. and Hinrichs, E. (2010). GernEdiT - The GermaNet Editing Tool. In
Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), Valletta, Malta, pp. 2228-35. http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf [Last accessed November 16, 2021].
Herrmann, J. B. (2018). Praktische Tagger-Kritik. Zur Evaluation des PoS-Tagging des Deutschen Textarchivs. In
DHd2018: Kritik der digitalen Vernunft. Book of Abstracts. Cologne, Germany, pp. 287-90. https://zenodo.org/record/3684897#.YO_x1W5CTOQ [Last accessed November 16, 2021].
Honnibal, M. et al. (2020). spaCy: Industrial-strength Natural Language Processing in Python. Zenodo. https://doi.org/10.5281/zenodo.1212303 [Last accessed November 16, 2021].
Klein, W. and Geyken, A. (2010). Das ‘Digitale Wörterbuch der Deutschen Sprache DWDS’, in: Lexicographica 26: 79–96.
Müller-Zettelmann, E. (2000).
Lyrik und Metalyrik. Theorie einer Gattung und ihrer Selbstbespiegelung anhand von Beispielen aus der englisch- und deutschsprachigen Dichtkunst. Heidelberg, Germany, Winter.
Qi, P. et al. (2018). Universal dependency parsing from scratch, in:
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 160-70.
Rayson, P. et al. (2007). Tagging the bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In
Proceedings of Corpus Linguistics (CL2007). https://eprints.lancs.ac.uk/id/eprint/13011/1/192_Paper.pdf [Last accessed November 16, 2021].
Schmid, H. (1994). Probabilistic part-of speech Tagging using decision trees. In
Proceedings of International Conference on New Methods in Language Processing, Manchester, UK, pp. 154-62.
Schmid, H. (1995). Improvements in Part-of-Speech Tagging with an Application to German. In
Proceedings of the ACL SIGDAT-Workshop, Dublin, Ireland, 13-25. https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger2.pdf [Last accessed November 16, 2021].
Schmid, H. (2019). Deep learning-based morphological taggers and lemmatizers for annotating historical texts.
In Proceedings of the 3rd international conference on digital access to textual cultural heritage, Brussels, Belgium, 133-37.
Scheible, S. et al. (2011). A gold standard corpus of Early Modern German. In:
Proceedings of the 5th Linguistic Annotation Workshop, pp. 124–28. https://dl.acm.org/doi/abs/10.5555/2018966.2018981 [Last accessed November 16, 2021].
Zymner, R. (2019). Begriffe der Lyrikologie. In: Hildebrandt, Claudia et al. (eds.) Lyrisches Ich, Textsubjekt, Sprecher? (= Grundfragen der Lyrikologie, Bd. 1). Berlin, Germany: De Gruyter 25–50.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
July 25, 2022 - July 29, 2022
361 works by 945 authors indexed
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Contributors: Scott B. Weingart, James Cummings
Series: ADHO (16)