Ugarit: Translation Alignment Technologies for Under-resourced Languages: Workshop presented at DH2022 Tokyo

workshop / tutorial
Authorship
  1. Chiara Palladino

    Furman University, United States of America

  2. Tariq Yousef

    Universität Leipzig (Leipzig University)

  3. Farnoosh Shamsian

    Universität Leipzig (Leipzig University)

  4. Nadia Kanagawa

    Furman University, United States of America

Work text

In this workshop, participants will learn the fundamentals of translation alignment and how to use Ugarit (http://ugarit.ialigner.com/), an online environment for creating manually aligned datasets in different languages. The goal of the workshop is to introduce participants to an important topic in Digital Humanities, and to expand our community and available datasets by targeting East Asian languages, Japanese in particular.

Translation alignment is one of the most important tasks in Natural Language Processing. It is defined as the comparison of two or more texts in different languages, also called parallel texts or parallel corpora [6][10], by means of automated or semi-automated methods. The result often takes the form of a list of pairs of items, which can be words, sentences, or larger text chunks like paragraphs or documents. The aligned pairs may be one-to-one (one word in the source text corresponds to one word in the translation), but often align as one-to-many, many-to-many, or many-to-one. Each word correspondence may be complete or perfect (with complete overlap between two words), but also possible or incomplete (partial overlap, or both words being a translation of each other only in certain contexts [4]).
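The link types described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that an alignment is stored as a list of (source index, target index) token links; the function names and the sample sentence pair are hypothetical and do not reflect Ugarit's internal data model.

```python
from collections import defaultdict

def group_links(links):
    """Group token links into aligned units (connected components).

    `links` is an iterable of (source_index, target_index) pairs.
    Returns a list of (source_indices, target_indices) tuples,
    ordered by the smallest source index in each unit.
    """
    adjacency = defaultdict(set)
    for s, t in links:
        adjacency[("src", s)].add(("tgt", t))
        adjacency[("tgt", t)].add(("src", s))

    seen = set()
    groups = []
    for node in sorted(adjacency):
        if node in seen:
            continue
        # Depth-first walk collecting every token reachable from `node`.
        stack = [node]
        seen.add(node)
        src, tgt = set(), set()
        while stack:
            side, idx = stack.pop()
            (src if side == "src" else tgt).add(idx)
            for neighbour in adjacency[(side, idx)]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append((sorted(src), sorted(tgt)))
    return groups

def cardinality(src, tgt):
    """Label an aligned unit as one-to-one, one-to-many, and so on."""
    left = "one" if len(src) == 1 else "many"
    right = "one" if len(tgt) == 1 else "many"
    return f"{left}-to-{right}"

# Illustrative sentence pair (not taken from a Ugarit dataset).
source = ["arma", "virumque", "cano"]
target = ["I", "sing", "of", "arms", "and", "the", "man"]
links = [(0, 3), (1, 4), (1, 5), (1, 6), (2, 0), (2, 1)]

for src, tgt in group_links(links):
    print([source[i] for i in src], [target[j] for j in tgt], cardinality(src, tgt))
```

In this toy pair, "arma" and "arms" form a one-to-one unit, while "virumque" and "cano" each map to several English words; "of" is left unaligned, which a list of links represents simply by omission.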
There are numerous methods for automated translation alignment: the most popular ones, such as statistical machine translation, rely to varying degrees on manually aligned training data [2], although newer models, such as neural machine translation, are also being proposed [1]. However, the alignment of texts in different languages is an exceptionally complex task, especially at the word level. It is often difficult to find perfect correspondences across languages that express ideas through different morphosyntactic constructs, with variations in word order and sentence length. In addition, it is notoriously difficult to establish correspondences within wordplay, metaphors, or allusions. For these reasons, manually aligned word pairs are extremely important for establishing gold standards, as sources of training data for machine translation methods, and for many other purposes, including text mining and the creation of dynamic lexica [4][9].
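One common way a manually aligned gold standard is used is to score an automatic aligner against it. The sketch below computes precision, recall, and alignment error rate (AER) as usually defined in the word alignment literature. It assumes that "complete" links are treated as sure links and "possible" links as optional ones; that mapping, the function name, and the sample links are illustrative assumptions, not something the workshop description prescribes.

```python
def alignment_scores(predicted, sure, possible):
    """Score predicted links against a gold standard.

    All arguments are iterables of (source_index, target_index) pairs.
    Sure links are counted as possible links as well.
    Returns (precision, recall, aer).
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")

    hits_sure = len(predicted & sure)          # predicted links that are sure
    hits_possible = len(predicted & possible)  # predicted links that are at least possible

    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    aer = 1 - (hits_sure + hits_possible) / (len(predicted) + len(sure))
    return precision, recall, aer

# Hypothetical gold standard and aligner output.
gold_sure = {(0, 3), (2, 1)}
gold_possible = {(2, 0), (1, 6)}
aligner_output = {(0, 3), (2, 0), (1, 4)}

precision, recall, aer = alignment_scores(aligner_output, gold_sure, gold_possible)
print(f"precision={precision:.2f} recall={recall:.2f} AER={aer:.2f}")
```

Precision rewards any predicted link that the annotators accepted as at least possible, while recall counts only the sure links, so an aligner is not penalized for skipping links that the annotators themselves marked as uncertain.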

Some modern languages, like English, German, and Chinese, have an impressive infrastructure for managing parallel corpora. However, that is not the case for historical and generally under-resourced languages, such as Classical Arabic, Persian, Latin, Ancient Greek, Gaelic, Cherokee, and Georgian, or even for many languages of Asia, including Japanese, Korean, and Sanskrit. Ugarit (http://ugarit.ialigner.com/) is a web-based environment designed to support the needs of these languages, providing a framework for creating and using manually aligned corpora. During the workshop, we will introduce the tool and illustrate the many ways in which parallel corpora aligned with Ugarit are currently used: these include pedagogy and language learning, interlinguistic and translation analysis, dynamic visualization, data mining, dynamic lexica, and training of machine translation models [5][7][8][11]. We will invite participants to test the tool on their own corpus or on a prepared dataset, to try the work of translation alignment first-hand, and to visualize and investigate the results.

Ugarit currently supports most East Asian languages and writing systems, but very few aligned datasets are available for them. With this workshop, we specifically aim to foster the creation of new parallel corpora in Japanese, Chinese, and Korean, and to gather more feedback and requests from this part of the Digital Humanities community.
Instructors:

Tariq Yousef is a research associate at Leipzig University, working on Computational Linguistics, Textual Alignment, and Data Visualization. He is the lead developer of Ugarit. Contact: tariq.yosef@uni-leipzig.de.

Chiara Palladino is Assistant Professor of Classics at Furman University. As a project partner in Ugarit, she uses the tool in teaching and research and has led multiple workshops and seminars on translation alignment. Her main interest lies in language learning processes with translation alignment. Contact: chiara.palladino@furman.edu.

Farnoosh Shamsian is a PhD candidate at Leipzig University. As a project partner in Ugarit, she uses the tool in teaching and research and has led multiple workshops and seminars on translation alignment. Her main interest lies in digital pedagogy and teaching Greek through digital annotations. Contact: shamsian@informatik.uni-leipzig.de.

Nadia Kanagawa is James B. Duke Assistant Professor of Asian Studies and History at Furman University. She is a Ugarit user who often works with and translates classical Japanese texts in her research on immigrants in the early Japanese state. Contact: nkanagawa@furman.edu.

Bibliography

[1] Bahdanau, D., Cho, K., and Bengio, Y. “Neural Machine Translation by Jointly Learning to Align and Translate.” (2016). ArXiv:1409.0473 [Cs, Stat], May. http://arxiv.org/abs/1409.0473.

[2] Brown, P. F. et al. “A Statistical Approach to Machine Translation.” Computational Linguistics 16.2 (1990): 79-85.

[3] Dagan, I., Church, K., and Gale, W. “Robust Bilingual Word Alignment for Machine Aided Translation.” In Natural Language Processing Using Very Large Corpora, edited by Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, and David Yarowsky, 209–24. Text, Speech and Language Technology. Dordrecht: Springer Netherlands (1999). https://doi.org/10.1007/978-94-017-2390-9_13.

[4] Graça, J. et al. “Building a Golden Collection of Parallel Multi-Language Word Alignment.” LREC (2008).

[5] Foradi, M. “Confronting Complexity of Babel in a Global and Digital Age. What can you produce and what can you learn when aligning a translation to a language that you have not studied?” DH2019: Digital Humanities Conference, University of Utrecht, July 9-12. Book of Abstracts. 2019. https://dev.clariah.nl/files/dh2019/boa/0611.html

[6] Kay, M. and Röscheisen, M. “Text-translation alignment.” Computational Linguistics 19.1 (1993): 121-142.

[7] Palladino, C. “Reading Texts in Digital Environments: Applications of Translation Alignment for Classical Language Learning.” Journal of Interactive Technology and Pedagogy, 18 (2020). https://jitp.commons.gc.cuny.edu/reading-texts-in-digital-environments-applications-of-translation-alignment-for-classical-language-learning/

[8] Palladino, C., Yousef, T., and Foradi, M. “Translation alignment for historical language learning: a case study.” Digital Humanities Quarterly 15.3 (2021). http://www.digitalhumanities.org/dhq/vol/15/3/000563/000563.html

[9] Véronis, J. “From the Rosetta Stone to the Information Society.” In Parallel Text Processing. Alignment and Use of Translation Corpora, edited by Jean Véronis, 1–24. Springer Science & Business Media, 1999.

[10] Véronis, J. Parallel Text Processing: Alignment and Use of Translation Corpora. Springer Netherlands, Dordrecht-Boston-London (2000).

[11] Yousef, T., and Jänicke, S. “A Survey of Text Alignment Visualization.” IEEE Transactions on Visualization and Computer Graphics PP (October 2020): 1–1. https://doi.org/10.1109/TVCG.2020.3028975.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO