Less explored multilingual issues in the automatic processing of historical texts - a case study

Cristina Vertan

Authorship

1. Cristina Vertan

Universität Hamburg (University of Hamburg)

Work text


Recently, the collaboration between the Language Technology community and specialists in various areas of the Humanities has become more efficient and fruitful, owing to the common aim of exploring and preserving cultural heritage data. Worth mentioning are the digitisation campaigns of recent years and a series of initiatives in the Digital Humanities, especially those making old manuscripts available in the form of Digital Libraries.

Given the number of contemporary languages and their historical variants, it is practically impossible to develop brand new language resources and tools for processing older texts in each of them. Therefore, the real challenge is to adapt existing language resources and tools, as well as to provide (where necessary) training material in the form of corpora or lexicons for a certain period of time in history.

Another issue regarding historical documents is their use once they are stored in digital libraries. Historical documents are not only browsed; together with adequate tools, they may serve as a basis for the reinterpretation of historical facts and the discovery of new connections, causal relations between events, etc. To make such analyses possible, historical documents should be linked to one another and to modern knowledge bases. Activities in the area of Linked Open Data (LOD) play a major role in this respect.

Most digital libraries are made available not only to researchers in a certain Humanities domain (e.g. classical philologists, historians, historical linguists), but also to general users. This has created new requirements for the functionalities offered by Digital Libraries, and thus calls for methods from Language Technology for content analysis and for content presentation in a form understandable to the end user.

There are several challenges related to the above-mentioned issues:

Lack of adequate training material for real-size applications: although Digital Libraries usually cover a large number of documents, it is difficult to collect a statistically significant corpus for a period of time in which the language remained unchanged.
Historical variants of languages lack firmly established syntactic or morphological structures; thus the definition of a robust set of rules is very difficult.
Historical texts often constitute a mixture of multilingual paragraphs including Latin, Ancient Greek, Slavonic, etc.
Historical texts contain a large number of non-standardized abbreviations.
The conception of the world is somewhat different from ours, which makes it more difficult to build the necessary knowledge bases.
Whilst these issues are generally acknowledged, little research has yet been done on the content annotation of old texts. Language technology tools are mostly adapted for corpus linguistics research (Piotrowski 2012), (Vertan et al. 2012), but are not really used as a means of text processing. One of the main barriers is the readability of the texts in terms of language style and terminology.

The aim of this paper, which describes on-going work, is to demonstrate that multilingual aspects of historical texts are a big challenge but can also serve for building a knowledge base to be used in text presentation.

2. Selected materials
The selected texts are works of Dimitrie Cantemir, prince of Moldavia at the end of the XVIIth century, but also historian, philosopher, composer, musicologist, linguist and much else.

As a member of the Prussian Academy of Sciences he was asked to write a history of the Ottoman Empire as well as a history of Moldavia, both unknown territories for Western Europe at that time. These remarkable works remained the only generally accepted reference until the middle of the XIXth century. Even if some historical aspects are interpreted in a subjective manner, the works of Cantemir represent a unique testimony of that time. He does not describe only the dates and places of historical events but also presents daily life, occupations and the organisation of the country as he himself experienced them, as a prisoner in Istanbul and as prince of Moldavia.

The „History of growth and decay of the Ottoman Empire“ and the „Description of Moldavia“ were initially written in Latin. Later they were translated into German, and the German edition was then translated into English, French, Romanian and Russian. These editions are not one-to-one translations; in many cases they are influenced by the perception of the translator.

We consider that the works of Cantemir are of particular importance for the history of Eastern Europe, about which little is known outside specialist circles.

The current project aims at the presentation and explanation of these works by means of language technology tools. For the moment we concentrate on the German, English and Romanian editions of the „Description of Moldavia“, all available in the library of our institution, and we intend to extend the work to the other language editions.

3. Exploiting Multilinguality
Usually the multilingual problem of old texts is reduced to the fact that they contain Latin and sometimes Ancient Greek passages. The works of Cantemir have the particularity that they also introduce words in the languages of the countries described, namely Romanian and Ottoman Turkish.

Therefore a processing step is needed in order to identify these words. State-of-the-art language identification methods based on n-grams cannot be used, as both old Romanian and Ottoman Turkish were written in other alphabets (Church Slavonic in the case of Romanian) and the transliteration rules were not standardised.

Our method uses the multilingual versions of his works to identify the foreign words.

Explanatory paragraphs differ slightly between the editions, but the terms are preserved. Therefore we compare the editions at sentence level in order to identify common words, which we afterwards mark as Named Entities.

The following example from the „Description of Moldavia“ in German is illustrative:

„Der Watawul de Aprodschi de Tyrg ist Herr über die Gerichtsdiener, welche den Tribut und andere Abgaben der Bürger eintreiben, und an die Schazkammer liefern.“

Here we observe the following: for a language technology tool dealing with German, the expression „Watawul de Aprodschi de Tyrg“ is pure noise. On the other hand, the expression is an approximate transliteration of old Romanian written in Church Slavonic. Nowadays the correct transliteration would be „vatavul de aprozi de târg“.
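The distance between the two spellings can be made concrete with a small character-trigram comparison. The following is only an illustrative sketch, not part of the method described in this paper: the helper names `char_ngrams` and `jaccard` are hypothetical, and Jaccard overlap is just one simple way of comparing n-gram sets. It suggests why n-gram profiles built from modern text are a poor fit for such transliterations.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


historical = "Watawul de Aprodschi de Tyrg"  # spelling in the German edition
modern = "vatavul de aprozi de târg"         # modern transliteration

overlap = jaccard(char_ngrams(historical), char_ngrams(modern))
print(f"shared trigrams (Jaccard): {overlap:.2f}")
```

Well under half of the trigrams are shared, and most of those come from the repeated word „de“, so the content words themselves contribute very little to the overlap.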

For comparison, here is the entry in the Romanian version:

„Vatavul de aprozi de târg; este asupra slujitorilor divanului, cari strâng birul si alte dări ale orăsenilor si le aduc la Visterie;“

Thus the algorithm for processing the text involves the following steps:

1. Sentence splitting of the German, English and Romanian texts
2. Sentence alignment by means of comparable corpora sentence extraction (Smith et al. 2010)
3. Identification of common identical strings
4. Extraction of common passages and normalisation using resources in old Romanian / Ottoman Turkish as well as heuristic rules
5. Marking of identified expressions as named entities so that they are blocked from further processing.
We use the output of this process to build a multilingual knowledge base and to match the terms against Wikipedia entries.
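Steps 3 and 5 can be sketched roughly as follows, assuming that sentence splitting and alignment (steps 1 and 2) have already produced a pair of aligned sentences. This is a minimal illustration, not the project's implementation. Because the quoted editions spell the shared terms differently, the sketch swaps exact string identity for approximate matching with `difflib.SequenceMatcher` from the Python standard library. The function names and the three constants are hypothetical and were chosen only to fit the single example pair quoted above.

```python
import re
from difflib import SequenceMatcher

MIN_LEN = 4       # assumption: only tokens of this length or more act as anchors
THRESHOLD = 0.65  # assumption: similarity cut-off chosen for this one example
MAX_GAP = 1       # assumption: anchors may be separated by one short word


def tokenize(sentence):
    """Split a sentence into word tokens, keeping the original casing."""
    return re.findall(r"\w+", sentence)


def is_anchor(token, other_tokens):
    """True if a long enough token has a close match in the other edition."""
    if len(token) < MIN_LEN:
        return False
    return any(
        len(other) >= MIN_LEN
        and SequenceMatcher(None, token.lower(), other.lower()).ratio() >= THRESHOLD
        for other in other_tokens
    )


def shared_expressions(sentence, aligned_sentence):
    """Return expressions of `sentence` that recur in `aligned_sentence`."""
    toks = tokenize(sentence)
    other = tokenize(aligned_sentence)
    anchors = [i for i, tok in enumerate(toks) if is_anchor(tok, other)]
    spans = []
    for i in anchors:
        if spans and i - spans[-1][1] <= MAX_GAP + 1:
            spans[-1][1] = i      # extend the current expression
        else:
            spans.append([i, i])  # start a new expression
    return [" ".join(toks[start:end + 1]) for start, end in spans]


german = ("Der Watawul de Aprodschi de Tyrg ist Herr über die Gerichtsdiener, "
          "welche den Tribut und andere Abgaben der Bürger eintreiben, "
          "und an die Schazkammer liefern.")
romanian = ("Vatavul de aprozi de târg; este asupra slujitorilor divanului, "
            "cari strâng birul si alte dări ale orăsenilor si le aduc la Visterie;")

# Candidate named entities found in the German sentence
print(shared_expressions(german, romanian))
```

On this pair the sketch isolates „Watawul de Aprodschi de Tyrg“ as the only shared expression. The length filter keeps short function words such as „der“ and „de“ from matching each other, and the gap rule then pulls the intervening „de“ back into the expression. The normalisation of step 4 is not covered here, since it depends on lexical resources that are not described in this abstract.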

4. Conclusions and further work
With this contribution we want to draw attention to less studied aspects of multilinguality in historical texts and to show how methods from multilingual language technology, here the exploitation of comparable corpora, can be used to deal with this issue. In the presentation we will describe in detail the algorithms used for the construction of the multilingual knowledge base and give relevant examples.

As mentioned before, this is an on-going project. After creating the knowledge base we intend to build a presentation interface in which entries in the knowledge base are available while reading the text through a mouse-over function. We also intend to assess how many of the entries found match Wikipedia, and to show that the information is in fact complementary. Finally, we intend to extend the methods to other works.

References
Cantemir, D., Descriptio Moldaviae. Historisch-geographisch und politische Beschreibung der Moldau, nebst dem Leben des Verfassers (lat. 1714-1716); germ. Beschreibung der Moldau, A. V. Büssch (Ed.), 1770.

Cantemir, D., Geschichte des Osmanischen Reichs nach seinem Anwachse und Abnehmen. Beschrieben von Demetrie Kantemir, ehemaligem Fürsten in Moldau. Nebst den Bildern der Türkischen Kaiser, Hamburg, Herold, 1745.

Piotrowski, M., Natural Language Processing for Historical Texts, Synthesis Lectures on Human Language Technologies, 2012.

Smith, J. and Quirk, C. and Toutanova, K., Extracting parallel sentences from comparable corpora using document level alignment, Proceedings of HLT '10, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 403-411, 2010.

Vertan, C. and Osenova, P. and Slavcheva, M. and Piperidis, S., „Adaptation of Language Processing and Tools for cultural heritage“, Proceedings of the Workshop at LREC 2012.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ADHO - 2014

"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014


Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO
