Multilinguality in historical documents – challenges and solutions for digital humanities

Laurent Romary; Stefanie Dipper; Noah Bubenhofer; Cristina Vertan

Authorship

1. Laurent Romary

DARIAH, Institut national de recherche en informatique et en automatique (INRIA)

2. Stefanie Dipper

No affiliation given

3. Noah Bubenhofer

Technische Universität Dresden (TU Dresden)

4. Cristina Vertan

Universität Hamburg (University of Hamburg)

Work text

Recently, the collaboration between the Language Technology community and specialists in various areas of the Humanities has become more efficient and fruitful, driven by the common aim of exploring and preserving cultural heritage data. Notable are the efforts made during the digitisation campaigns of recent years and within a series of Digital Humanities initiatives, especially in making old manuscripts and prints available in the form of Digital Libraries.

The online availability of old texts has produced a revolutionary shift in the way such objects are analysed. They are no longer accessible only to a small number of specialists who know the language of the document, but also to broader groups with various requirements:

non-expert users who would like to know what the document is about, understand the main topics, and locate places and persons. These users have little or no knowledge of old languages and are usually less familiar with toponyms (especially when these belong to geographical areas unknown to the user);
researchers from neighbouring fields, who often have only minimal knowledge of the language but considerable knowledge of the historical context, and who may be familiar with historical toponyms and proper names;
students and researchers specialising in historical data, who have the required language skills but can still profit from additional information accompanying the texts.

These considerations imply that the storage and visualisation of old texts should be accompanied by a collection of tools that enrich the text with suitable information and make it understandable for different user groups. Such tools usually involve automatic language processing methods. In contrast to the processing of modern texts, for which language technology has made great progress in recent years, the automatic processing of old texts is still problematic, mainly for two reasons:

Historical language data is sparse. First, compared to the wealth of documents written in modern languages, only a few documents are available for historical languages. Second, transcribing old manuscripts often requires expert knowledge. Third, due to the absence of a standard language, historical language variants differ from each other in spelling, morphology, syntax, and lexical semantics.
Texts are often multilingual, consisting of mixtures of different languages, such as single words, phrases or entire sentences written in Latin that are intermixed with passages written in the actual language of the text. In texts from areas with a rich cultural mix (e.g. the Balkans), one can also find paragraphs in “exotic” local languages.

The focus of this workshop is on the second aspect. We think that the challenges posed by multilinguality should be tackled by adapting existing multilingual language resources and tools and, where necessary, by providing training data in the form of corpora or lexicons for a certain period of time in history.

The aim of this workshop is to bring together researchers working in this interdisciplinary domain as well as specialists in machine translation and multilinguality working with languages with sparse resources, to analyse problems and brainstorm solutions in order to implement machine(-aided) translation and processing for (multilingual) historical texts. We also envisage networking with European activities in Digital Humanities such as CENDARI, CLARIN and DARIAH.

Topics of interest include but are not limited to:

character-level Machine Translation (MT) for normalisation
historical and modern data as comparable corpora (methods for extracting parallel segments from translations or new editions in a modern language)
historical texts in different languages as parallel or comparable corpora
MT for translation between language versions
OCR for multilingual documents
word- and/or paragraph-level language identification
crosslingual retrieval in historical documents
ontologies as language-independent interfaces between collections of historical texts
particularities of multilingual historical texts and challenges for IT

Cristina Vertan, University of Hamburg

Research Group „Computerphilologie“, University of Hamburg, Vogt-Kölln Strasse 30, 22527 Hamburg, Germany

Cristina.vertan@uni-hamburg.de, Office: +49 40 42838 2319

nats-www.informatik.uni-hamburg.de/CristinaVertan

Cristina Vertan is a senior researcher at the University of Hamburg. Her principal research fields are Machine Translation, Digital Humanities, crosslingual retrieval and less-resourced languages. She has organised several workshops at important conferences (LREC, RANLP) on using language technology for cultural heritage and historical languages. She is a founding member of the SIGHUM ACL special interest group in „Digital Humanities“ and co-organiser of LATECH-2014 (Language Technology for Cultural Heritage, Social Sciences and Humanities), co-located with EACL 2014. Recent research activities include the extraction of parallel corpora from historical translations, with one paper accepted at the Digital Humanities Conference 2014.

Stefanie Dipper, Ruhr-Universität Bochum

Sprachwissenschaftliches Institut, Ruhr-Universität Bochum, D-44780 Bochum,

dipper@linguistics.rub.de, Office: +49 234 32-25112

Stefanie Dipper is Professor of Computational Linguistics at Ruhr-University Bochum, Germany. She has worked on annotation formats, corpus tools, and corpus-based methods for many years. Her primary interests are in automatic analysis of historical texts, including normalization of historical spelling, POS and morphological tagging, and in methods for comparing and clustering historical dialects. She is PI of a DFG-funded project that deals with creating and analyzing a corpus of historical dialects, and Co-PI of two DFG-funded projects for creating reference corpora of historical German.

Noah Bubenhofer, TU Dresden / UZH Zurich

TU Dresden, Institut für Germanistik, Professur für Angewandte Linguistik, Mommsenstr. 13, D-01062 Dresden

noah.bubenhofer@tu-dresden.de, Office: +49 351 46 33 82 19, Mobile D: +49 170 901 17 94, Mobile CH: +41 76 330 66 15

Web: www.bubenhofer.com, linguistik.zih.tu-dresden.de

Dr. Noah Bubenhofer is a member of the academic staff at the Chair of Applied Linguistics, Technische Universität Dresden, and head of the recently opened Dresden Center for Digital Linguistics. In addition, he is co-founder of SEMTRACKS, the „Laboratory for Computer Based Meaning Research“. Since 2014, Noah Bubenhofer has been a guest researcher at the Institute of Computational Linguistics at the University of Zurich. In his PhD thesis „Muster an der sprachlichen Oberfläche“ (patterns at the linguistic surface), he developed corpus linguistic methods for discourse and cultural analysis. As a linguist, he is mainly interested in computer-based semantic text analysis and the relation between text and discourse, society and culture. In the project „Tracking Meaning on the Surface“, categories were modelled for the description of semantic imprints using a data-driven approach. In doing so, the project explored possible applications of these models for the semantization of the Internet and for the methodology of social sciences and cultural studies. Noah Bubenhofer is also co-leader of the project “Text+Berg digital” (www.textberg.ch), in which a series of yearbooks by the Swiss Alpine Club (SAC) is being digitised and transformed into a deeply annotated corpus.

Laurent Romary, laurent.romary@inria.fr

Inria & HUB

Institut für deutsche Sprache und Linguistik

Philosophische Fakultät II

Humboldt-Universität zu Berlin, Unter den Linden 6

D-10099 Berlin

Laurent Romary is Directeur de Recherche at INRIA, France, and guest scientist at Humboldt University in Berlin, Germany. He carries out research on the modelling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He received a PhD degree in computational linguistics in 1989 and his Habilitation in 1999. He launched and for several years directed the Langue et Dialogue team at Loria in Nancy, France, and participated in several national and international projects related to the representation and dissemination of language resources and to man-machine interaction. In particular, he coordinated the MLIS/DHYDRO, IST/MIAMM and eContent/Lirics projects. He has been the editor of ISO standard 16642 (TMF – Terminological Markup Framework) and is the chairman of ISO committee TC 37/SC 4 on Language Resource Management. He was a member (2001-2007) and then chair (2008-2011) of the TEI (Text Encoding Initiative) council. In recent years, he led the Scientific Information directorate at CNRS (2005-2006) and established the Max-Planck Digital Library (Sept. 2006-Dec. 2008). He currently contributes to the establishment and coordination of the European DARIAH infrastructure.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

Complete

ADHO - 2014

"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO
