The long-term goal of this work is a diachronic, or historical, lexicon, i.e., a mapping of word forms from one (loosely defined) time period to another. Such a lexicon can open new possibilities for information retrieval in historical and cultural heritage collections, and provide a new foundation for quantitative methods on such material. For example, a modern spelling variant can be used for search and retrieval by expanding it with its period-specific counterparts; conversely, word forms found at different time periods can be grouped together under a modern variant.
To induce the lexicon we use a Transformer-based pipeline (Vaswani et al. 2017) for bitext mining (i.e., finding corresponding text in at least two languages or language varieties), as in Reimers & Gurevych (2019), and for word alignment, as in Jalili Sabet et al. (2020). The primary data in this pilot study are two Norwegian translations of Goethe's The Sorrows of Young Werther, from 1820 and 1998 respectively, both found in the collection of the National Library of Norway.
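The bitext-mining step can be sketched as nearest-neighbour matching of sentence embeddings. The sketch below assumes embeddings have already been produced by a multilingual sentence encoder (e.g., Sentence-BERT, as in Reimers & Gurevych 2019); the greedy argmax matching and the similarity threshold are illustrative simplifications, not the specific scoring used in this study:

```python
import numpy as np

def mine_bitext(src_emb: np.ndarray, tgt_emb: np.ndarray,
                threshold: float = 0.7):
    """Match each source sentence to its most similar target sentence
    by cosine similarity, keeping pairs above a confidence threshold."""
    # L2-normalize rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # pairwise cosine similarity matrix

    pairs = []
    for i, row in enumerate(sim):
        j = int(np.argmax(row))          # greedy nearest neighbour
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs
```

In practice, margin-based scoring (comparing the best match against the runner-up) is often preferred over a fixed threshold, since absolute cosine values vary between encoders and text domains.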
Just as spoken language changes over time, so does written language. Changes in spelling constitute one such change, sometimes alongside semantic changes; our main focus is on spelling variation. Since writing is, so to speak, our only encounter with languages of the not-so-distant past, understanding and mapping this change is important on several levels. One good source of diachronic bitext is collections of an author's complete works, which are often modernized from time to time in honor of birthdays and other celebrations. Such bitexts have, however, often been modernized in a particular way, according to specific preconceptions or stylistics. Since variation between time periods poses serious challenges to search and retrieval in large collections, and to quantitative analysis in general, there is a great need for linking such variation. A canonical example: if a word form or spelling variant changes by just one character, we no longer have identical forms across time, as in Norwegian, where a single word has had at least three forms from the early 1800s to the present.
Recent developments in Transformer-based text processing have reduced the amount of specialized data needed for machine learning. Reimers & Gurevych (2019) show that a multilingual model is preferable to monolingual models for aligning sentences, and Jalili Sabet et al. (2020) give us the flexibility to make use of efforts like Kummervold et al. (2021) for aligning words. With such a pipeline, inspired by Shi et al. (2021), we obtain the aforementioned mapping or lexicon, in addition to linguistically interesting word pairs. The resulting list of word pairs can tell us something about language change as well as language policy, and it is also useful to search engines, where every user will benefit from being able to access texts from earlier stages of the language without domain knowledge or linguistic expertise. Beyond a historical lexicon, one desired offshoot of this pilot study is the possibility of automatic modernisation or paraphrasing of text, or, put differently, of making texts more readable.
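The word-alignment step yields pairs of aligned tokens across the two editions. A minimal sketch of how spelling-variant candidates might be filtered from such pairs follows; the edit-distance cutoff is our illustrative assumption, not a step specified by the pipeline above:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def variant_pairs(aligned, max_dist: int = 2):
    """From (old, new) aligned token pairs, keep likely spelling
    variants: differing surface forms within a small edit distance."""
    return sorted({(old, new) for old, new in aligned
                   if old != new and edit_distance(old, new) <= max_dist})
```

A small edit-distance bound separates probable orthographic variants from aligned pairs that reflect lexical replacement rather than spelling change; true lexical substitutions would need different handling.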
Masoud Jalili Sabet, Philipp Dufter, François Yvon, and Hinrich Schütze. 2020. SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online. Association for Computational Linguistics.
Per E Kummervold, Javier De la Rosa, Freddy Wetjen, and Svein Arne Brygfjeld. 2021. Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland (Online).
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.
Haoyue Shi, Luke Zettlemoyer, and Sida I. Wang. 2021. Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
July 25, 2022 - July 29, 2022
361 works by 945 authors indexed
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Contributors: Scott B. Weingart, James Cummings
Series: ADHO (16)