Lemmatizing Low-resource, Less-researched Languages: The Linked Open Text Reader and Annotator

poster / demo / art installation
Authorship
1. Max Ionov

    Johann-Wolfgang-Goethe-Universität Frankfurt am Main (Goethe University of Frankfurt)

2. Christian Chiarcos

    Johann-Wolfgang-Goethe-Universität Frankfurt am Main (Goethe University of Frankfurt)



Background
Lemmatization is the basis for many experiments and analyses in computational philology and corpus linguistics. Although considered solved for major modern languages, producing lemmatized text remains challenging for languages for which few or no language resources currently exist. Morphologically rich languages in particular benefit greatly from the sparsity reduction achieved when automated pre-processing, annotation or distributional analyses (be it collocation graphs or correlation studies) are conducted on lemmata rather than on the original word forms.
We describe experiments on lemmatizing languages whose resources are insufficient for state-of-the-art linguistic or philological work, along with a tool that implements them. We introduce LiOTrea, the Linked Open Text Reader and Annotator: given a text (corpus) and one or more dictionaries or lemma lists, LiOTrea uses the dictionaries to suggest lemmata (resp., links to dictionary entries) and morphological features for words in the corpus. Several linking/lemmatization strategies are implemented. Their suggestions are ranked and can be used either as a pre-processing step in manual annotation or in place of manual annotation. Furthermore, novel dictionary entries can be created during annotation. The technological innovation is three-fold: (1) a number of lemmatization/linking strategies, (2) native support for the Ontolex-lemon vocabulary (McCrae et al., 2017) and lexical resources from the Linguistic Linked Open Data cloud, and (3) language-independent technology.

Ontolex-lemon is a standard of growing importance to the DH community (cf. Bellandi et al., 2018), and tools for creating and publishing such datasets are available, but to the best of our knowledge, no tool currently uses this technology for analysing philological text.
The technology is applicable to any language; for illustration, we use real-world studies on languages from the Caucasus, Eurasia and the Near East, conducted in the context of philological research (Armenian and Sumerian, languages with long and extensive written histories) and the documentation of endangered languages and cultures (Batsbi, a minority language spoken in Georgia). The case studies thus cover two important strands of Digital Humanities.

Methods
Lemmatization is a sub-task of morphological analysis, with three prime strategies:

Rule-based annotation: Building a formal grammar for the language in question using tools such as XFST or Foma. Successfully applied to a myriad of languages with rich morphology. Downside: requires a rich model of the grammar as a basis for formalization.
Lookup-based (dictionary-based) annotation: Given a set of paradigms, annotate only their word forms. This technique (extended to morphs instead of full words) is the basic technique of popular tools in language documentation such as ToolBox/FLEx. Downside: limited coverage.
Pattern induction: Learn morphological regularities from data by means of symbolic techniques (e.g. CSTLemmatizer, https://dighumlab.org/cst-lemmatiser/) or machine learning (e.g. the UniMorph Shared Tasks, https://sigmorphon.github.io/sharedtasks/2018/). Downside: requires large amounts of data.

We focus on lookup-based methods: for languages without annotated corpora, pattern induction is not applicable, and in language documentation, rule-based annotation is normally not possible because the grammar is only superficially understood. Instead, annotation is used as a scientific device to learn more about the grammar of the language.
We generalize over plain lookup techniques by providing fuzzy matching between dictionary entries and text words: words are transliterated to IPA characters and then to a list of feature vectors using the PHOIBLE database (Moran et al., 2014). Transliteration is based on models created by language experts. We provide phonological search with different similarity metrics over these lists, identifying the dictionary forms most similar to the word. Ranked lists are presented to the user to be confirmed or overridden during annotation.
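To make the phonological search concrete, here is a minimal sketch of the idea in Python. It is not LiOTrea's actual implementation: the tiny feature table stands in for PHOIBLE's segment inventory, and all names (FEATURES, seg_dist, word_dist, rank_candidates) are illustrative. Words are treated as sequences of feature vectors, and word similarity is a Levenshtein alignment whose substitution cost is the per-segment feature distance.

    from functools import lru_cache

    # Toy ternary feature table (+1/-1 as in PHOIBLE-style distinctive
    # features); a real system derives these from the PHOIBLE inventory.
    # The three dimensions here are (voiced, continuant, nasal).
    FEATURES = {
        "p": (-1, -1, -1),
        "b": (+1, -1, -1),
        "m": (+1, -1, +1),
        "f": (-1, +1, -1),
        "v": (+1, +1, -1),
        "a": (+1, +1, -1),
    }

    def seg_dist(a, b):
        # Normalized Hamming distance between two segment feature vectors.
        fa, fb = FEATURES[a], FEATURES[b]
        return sum(x != y for x, y in zip(fa, fb)) / len(fa)

    def word_dist(w1, w2):
        # Levenshtein distance with feature-based substitution costs.
        @lru_cache(maxsize=None)
        def d(i, j):
            if i == 0:
                return j
            if j == 0:
                return i
            return min(d(i - 1, j) + 1,                    # deletion
                       d(i, j - 1) + 1,                    # insertion
                       d(i - 1, j - 1) + seg_dist(w1[i - 1], w2[j - 1]))
        return d(len(w1), len(w2))

    def rank_candidates(word, dictionary_forms):
        # Dictionary forms ranked by phonological similarity to the word.
        return sorted(dictionary_forms, key=lambda f: word_dist(word, f))

    print(rank_candidates("pam", ["mab", "bam", "fam"]))  # -> ['bam', 'fam', 'mab']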

We report results for the following lemmatization techniques:

WORD (word as lemma): Baseline.
DICT (exact lookup-based annotation): Use a lookup dictionary of word forms for a set of lexemes. If a word form is not found in it, annotate it as unseen.
DICT+WORD: Use DICT, for unseen words, use WORD.
PHON (phonological search): Given a dictionary, annotate every word with the phonologically most similar lexical entries in the dictionary.
DICT+PHON: Use DICT, for unseen words, use PHON.
DICT+PHON+WORD: Use DICT+PHON as long as an (empirically adjusted) similarity threshold is exceeded; use WORD for the remaining unannotated words. We experimented with different thresholds and report the best results (a sketch of this cascade follows this list).
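As a minimal sketch of the DICT+PHON+WORD cascade (reusing word_dist from the sketch above; the function name, the form-to-lemma dictionary layout and the threshold value are illustrative assumptions, not taken from LiOTrea):

    def lemmatize(word, form_to_lemma, threshold=0.75):
        # Cascade DICT -> PHON -> WORD, returning (lemma, strategy).
        # DICT: exact lookup of the word form in the dictionary.
        if word in form_to_lemma:
            return form_to_lemma[word], "DICT"
        # PHON: phonologically most similar dictionary form, accepted only
        # if the similarity exceeds the (empirically adjusted) threshold.
        if form_to_lemma:
            best = min(form_to_lemma, key=lambda form: word_dist(word, form))
            similarity = 1.0 / (1.0 + word_dist(word, best))  # toy score
            if similarity >= threshold:
                return form_to_lemma[best], "PHON"
        # WORD: fall back to the surface form itself as its lemma.
        return word, "WORD"

    forms = {"bam": "bam", "fam": "fam"}   # hypothetical form -> lemma map
    print(lemmatize("bam", forms))         # -> ('bam', 'DICT')
    print(lemmatize("pam", forms))         # -> ('bam', 'PHON')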

DICT is a simplistic baseline, but it yields acceptable results with a limited amount of annotation time. For PHON, we estimate phonological similarity between the corresponding feature vectors; for brevity, we report results only for the best metric.

Experiments
For brevity, we report results only for the Eastern Armenian National Corpus (http://eanc.net), using two evaluation metrics: generalization (the number of distinct lemmas predicted in the test corpus) and accuracy (the proportion of correctly predicted lemmas), measured as a function of the amount of previously annotated data.
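The two metrics can be stated compactly; the following snippet is an illustrative rendering of these definitions, not the authors' evaluation code:

    def generalization(predicted_lemmas):
        # Number of distinct lemmas predicted over the test corpus.
        return len(set(predicted_lemmas))

    def accuracy(predicted_lemmas, gold_lemmas):
        # Proportion of tokens whose predicted lemma matches the gold lemma.
        correct = sum(p == g for p, g in zip(predicted_lemmas, gold_lemmas))
        return correct / len(gold_lemmas)

    # Hypothetical example: three tokens, two lemmatized correctly.
    print(generalization(["run", "ran", "walk"]))                    # -> 3
    print(accuracy(["run", "ran", "walk"], ["run", "run", "walk"]))  # -> 0.666...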

In terms of generalization (Fig. 1), the WORD baseline fails drastically, whereas the other methods converge towards agreement, but only with a dictionary covering more than 1,000,000 tokens. Combining PHON with WORD provides an analysis that is remarkably robust against overgeneralization.

Fig. 2 shows that both PHON-enhanced dictionary lookup routines also outperform DICT in terms of accuracy.
In our talk, we argue that these methods can significantly simplify the annotation of corpora and the creation of lexical resources for minority languages with very little annotation effort, while seamlessly providing access to them.
Our main contribution, however, is a tool that integrates these lemmatization strategies into an annotation and text analysis workflow that can be extended with more advanced techniques.

LiOTrea: Linked Open Text Reader and Annotator
In order to facilitate linguistic annotation using these methods, we have developed an environment, LiOTrea, that allows a user to display texts and their (automated or manual) annotations while checking and correcting annotations at the same time. It supports dictionaries represented in RDF using the Ontolex-lemon model as a basis for lemmatization, phonological search against any dictionary, and the enrichment/creation of Ontolex-lemon dictionaries during the annotation process.
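To illustrate how an Ontolex-lemon dictionary in RDF can serve as a lemmatization lookup table, the following sketch parses a tiny hand-made Turtle dictionary with rdflib and maps every surface form to its entry's lemma; the toy data and the code are illustrative assumptions, not LiOTrea's actual implementation:

    from rdflib import Graph, Namespace

    ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")

    # A one-entry toy dictionary in the Ontolex-lemon vocabulary.
    TURTLE = """
    @prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
    @prefix ex:      <http://example.org/dict/> .

    ex:run a ontolex:LexicalEntry ;
        ontolex:canonicalForm [ ontolex:writtenRep "run"@en ] ;
        ontolex:otherForm     [ ontolex:writtenRep "ran"@en ] ,
                              [ ontolex:writtenRep "running"@en ] .
    """

    g = Graph()
    g.parse(data=TURTLE, format="turtle")

    # Surface form -> lemma: the lemma is the written representation of an
    # entry's canonical form; every form of the entry maps to that lemma.
    form_to_lemma = {}
    for entry, canon in g.subject_objects(ONTOLEX.canonicalForm):
        lemma = str(g.value(canon, ONTOLEX.writtenRep))
        for pred in (ONTOLEX.canonicalForm, ONTOLEX.otherForm):
            for form_node in g.objects(entry, pred):
                form = str(g.value(form_node, ONTOLEX.writtenRep))
                form_to_lemma[form] = lemma

    print(form_to_lemma.get("ran"))    # -> run   (dictionary hit)
    print(form_to_lemma.get("xyzzy"))  # -> None  (unseen word)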
In our talk, we present this tool, demonstrate its capabilities and discuss scenarios for its use and future extensions. One example application is the comparison with dictionaries of another language, as shown in Fig. 3, where phonological search is applied to detect Georgian loanwords in Batsbi text.

Current applications include Eastern Armenian, Batsbi and Sumerian; future applications will extend to related varieties, such as other North-East Caucasian languages and different historical stages of Armenian.

Acknowledgements
The research presented in this paper was primarily conducted in the context of the independent research group “Linked Open Dictionaries (LiODi)”, funded 2015-2020 within the eHumanities programme of the German Federal Ministry of Education and Research (BMBF). The pilot on Sumerian was conducted in the context of the project “Machine Translation and Automated Analysis of Cuneiform Languages” (MTAAC), funded 2017-2019 through the Trans-Atlantic Platform (T-AP) Digging into Data Challenge by DFG, NEH and SSHRC.

Bibliography

Bellandi, A., Giovannetti, E., and Weingart, A. (2018). Multilingual and Multiword Phenomena in a lemon Old Occitan Medico-Botanical Lexicon. Information, 9(3), 52.

McCrae, J. P., Bosque-Gil, J., Gracia, J., Buitelaar, P., and Cimiano, P. (2017). The Ontolex-lemon Model: Development and Applications. Proceedings of eLex 2017, Leiden, Netherlands, September 2017.

Moran, S., McCloy, D., and Wright, R. (2014). PHOIBLE Online. https://phoible.org (accessed 18 November 2018).

