Applying LERA for collating witnesses of The Tale of Kiều, a Vietnamese poem written in Nôm script

paper, specified "long paper"
  1. Thi Kim Hanh Luu

    Martin Luther University Halle-Wittenberg, Germany

  2. Marcus Pöckelmann

    Martin Luther University Halle-Wittenberg, Germany

  3. Jörg Ritter

    Martin Luther University Halle-Wittenberg, Germany

  4. Paul Molitor

    Martin Luther University Halle-Wittenberg, Germany


The collation tool LERA (Locate, Explore, Retrace and Apprehend complex text variants) is a working environment that combines the entire process of document management, tokenization/segmentation, normalization, alignment and visualization with interactive control options and exploratory tools. It is being applied successfully in several Digital Humanities projects across different languages, e.g., Arabic (Gründler and Pöckelmann 2018, Gründler 2019, Gründler et al. 2020), French (Bremer 2013), Hebrew (Necker et al. 2019, Molitor et al. 2020), Portuguese/Spanish/Latin (Bragagnolo 2017), English (Alder et al. 2020) and German (Hahn et al. 2020). A comprehensive description of LERA, including a comparison with other state-of-the-art collation approaches, can be found in Pöckelmann et al. (2022).

Here we describe a recent experiment in which we applied LERA to the poem Truyện Kiều by Nguyễn Du (1765-1820) in order to identify the adaptations the software needs for comparing Vietnamese text witnesses written in Nôm script, the historic writing system of Vietnam. Truyện Kiều ‒ or the Tale of Kiều ‒ is one of the most famous texts of Vietnamese literature. For our experiment, we used text witnesses from 1866, 1870, 1871, 1872, and 1902 made available by the Vietnamese Nôm Preservation Foundation (Balaban et al. 2001).

Specifics of Truyện Kiều
The text was written in Nôm script. It follows the traditional Vietnamese verse form Lục bát, in which lines of six Nôm characters (that is, six syllables) alternate with lines of eight. The text is organized in pages whose number of lines varies among the witnesses. The writing direction is top-to-bottom and right-to-left.

Fig. 1: Facsimile of the first page of Truyện Kiều (1866), taken from Balaban et al. (2001).

The data includes philological commentaries and a translation of the text in Quốc Ngữ, the modern writing system of Vietnam. There is no one-to-one mapping between the original Nôm script and its translation, although the verse structure is the same in both scripts. This is due to synonyms (e.g., 劄 → chép or ghi), the non-normalized placement of diacritics (e.g., 鎖 → khoá or khóa), regional preferences of translators in the choice of words (e.g., 改 → gửi or gởi), and occasional spelling errors (e.g., 色 → sắo instead of sắc).
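The diacritic-placement case (khoá vs. khóa) can be separated from genuine word variants (gửi vs. gởi) with a small normalization step. The following is a minimal sketch, not part of LERA: the function name `syllable_key` and the choice of comparing a syllable's letters and its tone mark separately are assumptions made for illustration.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones.
TONE_MARKS = {
    "\u0300",  # grave
    "\u0301",  # acute
    "\u0303",  # tilde
    "\u0309",  # hook above
    "\u0323",  # dot below
}


def syllable_key(syllable):
    """Return (letters without tone mark, tone mark) for one syllable.

    Vowel-quality marks (circumflex, breve, horn) stay on their letters;
    only the tone mark is detached, so its position no longer matters.
    """
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "".join(ch for ch in decomposed if ch in TONE_MARKS)
    base = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", base), tone


# Same letters and same tone, only the placement of the mark differs.
print(syllable_key("khoá") == syllable_key("khóa"))  # True
# Different vowels (ư vs. ơ), so this remains a real variant.
print(syllable_key("gửi") == syllable_key("gởi"))    # False
```

Keeping the tone in the key, rather than discarding it, reflects the fact that tone marks distinguish meanings in Quốc Ngữ.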

Data processing
For the processing in LERA, we converted the original plain text files into an XML format according to the Text Encoding Initiative guidelines. Each pair of Nôm script lines ‒ one of six, the other of eight characters ‒ was combined into one segment element. The original division was preserved with line-break tags. We chose this approach because two lines normally form a sense unit. Each segment is also extended by its translation. To simplify the tokenization of Nôm script, we added zero-width spaces (Unicode U+200B) between the individual characters so that the system does not mistake whole lines for single words. The philological commentaries were also added to the XML at their respective positions.
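The zero-width-space step can be sketched as follows. This is an illustration under stated assumptions rather than the conversion script used for the experiment: `separate_characters` and `tokenize` are hypothetical helpers, and the sample string is simply a run of characters quoted elsewhere in this abstract, not a verse line.

```python
ZWSP = "\u200b"  # zero-width space, U+200B


def separate_characters(line):
    """Insert a zero-width space between the characters of one line."""
    return ZWSP.join(line)


def tokenize(text):
    """Hypothetical tokenizer: split on zero-width and ordinary spaces."""
    return text.replace(ZWSP, " ").split()


sample = "劄鎖改色坦旦"  # illustrative characters only, not a line of the poem

# Without separators, whitespace tokenization sees the line as one word.
print(len(sample.split()))

# With separators, every character becomes its own token.
marked = separate_characters(sample)
print(tokenize(marked))
```

Because U+200B has no visible width, the marked-up text still renders as an unbroken line of characters.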

Fig. 2: Left: the original plain text format in Nôm script (hn) and Quốc Ngữ (qn) including the commentary (hw1 and nt1) for the first two lines of Truyện Kiều (1866). Right: excerpt of the data transformed into an XML format readable by LERA, where the two lines have been merged into one segment. The philological commentary was transformed into an element of its own.

Collation with LERA
In order to display Nôm script properly, an appropriate font (Nom Na Tong, also provided by the Vietnamese Nôm Preservation Foundation) was embedded into LERA. To make the text segments more legible, options for displaying the inserted line breaks and centering the text have been added (see Fig. 3).

Fig. 3: Representation of a text segment of Truyện Kiều in LERA's full-text visualization. Philological commentaries are indicated by consecutive numbers, with their text fading in on mouse-over.

LERA uses a two-step approach for collation (Pöckelmann et al. 2022), with both steps applied to the four-line segments: first an alignment of text segments according to their similarity, then a detailed comparison of the aligned segments.
For the first step, the manual assignment already encoded in the data via line numbers can be used for Truyện Kiều. However, the alignment algorithm integrated in LERA produces nearly the same alignment.
The detailed comparison reveals many textual variants spread over the entire length of the work for both scripts. For modern Vietnamese this remains true even when minor variants, such as differences in capitalization and punctuation, or even diacritics (which are very important for a word's meaning), are ignored with the aid of LERA's filter system. Overall, textual variants occur more frequently in the Nôm script versions than in the modern Vietnamese versions. Almost all of the reasons for this lie in the complex phono-semantic composition of the characters. Phono-semantic compound characters (such as 坦) have two components: one suggesting the word's meaning (土) and the other its approximate sound (旦). The characters 旦 and 坦 (used in the last witness only) have the same meaning, đắn. There are also positional rules that depend on function: semantic components tend to appear at the top or on the left side of characters, phonetic components at the bottom or on the right side, e.g., 撑 = ⿰扌掌. Variant characters, i.e., homophones and synonyms caused by allography, also occur in the work. For example, 句 and 勾, which were both translated to câu, vary in their phonetic component because the second uses 厶, another written representation of 口. So far, the Nôm script characters have been compared only as wholes, so those variants could not be differentiated further.
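For the first collation step, the use of the line numbers already encoded in the data can be pictured with the sketch below. It is not LERA's similarity-based alignment algorithm; the function `align_by_line_number`, the data layout and the placeholder segment texts are all assumptions made for illustration.

```python
def align_by_line_number(witnesses):
    """Align segments of several witnesses by their encoded line number.

    witnesses maps a witness id to {line number: segment text}.
    Returns one row per line number; a witness lacking that segment
    gets None, which corresponds to a gap in the synopsis.
    """
    all_numbers = sorted(set().union(*(segs.keys() for segs in witnesses.values())))
    return [
        (number, {wid: segs.get(number) for wid, segs in witnesses.items()})
        for number in all_numbers
    ]


# Placeholder data: witness ids and segment texts are invented.
example = {
    "witness-A": {1: "segment 1 (A)", 3: "segment 3 (A)"},
    "witness-B": {1: "segment 1 (B)", 5: "segment 5 (B)"},
}

for number, row in align_by_line_number(example):
    print(number, row)
```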

It should be noted that the fixed structure of the segments with the verse form makes the comparison somewhat easier than for other texts: with a few exceptions, there are only substitutions of characters and no insertions or deletions.
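When two aligned segments have the same number of characters, as the fixed verse form usually guarantees, the detailed comparison reduces to a position-by-position check. The sketch below illustrates that special case only; `substitutions` is a hypothetical helper, and the two sample strings are made up from characters quoted in this abstract rather than taken from the witnesses.

```python
def substitutions(base, other):
    """List the positions at which two equally long segments differ.

    Returns (index, character in base, character in other) triples.
    Segments of different length contain insertions or deletions and
    would need a general alignment, so they are rejected here.
    """
    if len(base) != len(other):
        raise ValueError("segments differ in length; a general alignment is needed")
    return [(i, a, b) for i, (a, b) in enumerate(zip(base, other)) if a != b]


# Invented sample segments, not lines of the poem.
segment_a = "劄鎖句色坦"
segment_b = "劄鎖勾色旦"

print(substitutions(segment_a, segment_b))
# [(2, '句', '勾'), (4, '坦', '旦')]
```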

Fig. 4: Screenshot of LERA showing an excerpt of the five collated versions of Truyện Kiều. The navigation bar shows a very similar structure for all witnesses (with the exception of 1866, recognizable by the gaps at the top), while the density of the blue highlighting indicates textual variance within all aligned segments (the darker the highlighting, the more the aligned segments differ).

Discussion and Future Work
Only minor adaptations of LERA were necessary to allow a basic comparison of the text witnesses of Truyện Kiều, such as the integration of the font or a slightly adapted display of text segments.
The work shows further potential for improving LERA for application to Asian languages. This includes an approach to process, normalize and compare the components of complex composite characters separately and to represent the results in an appropriate way. Furthermore, the demand for integrating the top-to-bottom writing direction became apparent (right-to-left support has already been added in the course of scholarly editions of Arabic and Hebrew texts).
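A component-wise comparison of composite characters could look roughly like the following sketch. It is not an existing LERA feature. The decomposition table is a tiny illustrative stand-in for a real character-decomposition resource: the split of 坦 and the equivalence of 厶 and 口 come from the discussion above, while the shared outer component 勹 of 句 and 勾 and the category names are assumptions.

```python
# Illustrative decomposition table (assumption, not a real database).
COMPONENTS = {
    "坦": ("土", "旦"),  # semantic component 土, phonetic component 旦
    "句": ("勹", "口"),
    "勾": ("勹", "厶"),
}

# Components treated as written variants of one another.
ALLOGRAPHS = {"厶": "口"}


def components(character):
    """Return the normalized components of a character.

    Characters missing from the table are treated as their own
    single component.
    """
    parts = COMPONENTS.get(character, (character,))
    return tuple(ALLOGRAPHS.get(part, part) for part in parts)


def classify(a, b):
    """Classify the relation between two characters by their components."""
    if a == b:
        return "identical"
    parts_a, parts_b = components(a), components(b)
    if parts_a == parts_b:
        return "allographic variant"
    if set(parts_a) & set(parts_b):
        return "shared component"
    return "different"


print(classify("句", "勾"))  # allographic variant
print(classify("旦", "坦"))  # shared component
print(classify("土", "口"))  # different
```

A grading of this kind would allow substitutions such as 句/勾 to be displayed differently from substitutions between unrelated characters.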


Alder, E., Cranfield, J., Dryden, L., Duncan, I., Ferguson, C., James, S.J., Kerr, D., Luckhurst, R., Machin, J., Schwan, A. and Wild, J. (2020).
Edinburgh Conan Doyle Project.

Balaban, J., Collins, L., Lesser, S., Phan, J., Schmid, D. N. and Việt, N. T. (2001).
Vietnamese Nôm Preservation Foundation.

Bragagnolo, M. (2017).
HyperAzpilcueta – Visualizing the instability of early modern normative knowledge.

Bremer, T. (2013).
Guillaume Thomas François Raynal: Histoire philosophique et politique des établissements et du commerce des européens dans les deux Indes.
Semi-automatische Differenzanalyse von komplexen Textvarianten.

Gründler, B. and Pöckelmann, M. (2018).
Adjusting LERA for the comparison of Arabic manuscripts of Kalīla wa-Dimna. In: Digital Humanities 2018 - Book of Abstracts, pp. 467–468. Mexico City, Mexico.

Gründler, B. (2019).
Kalīla and Dimna – AnonymClassic.

Gründler, B., van Ginkel, J.J., Redwan, R., Khalfallah, K., Toral, I., Stephan, J., Keegan, M.L., Beers, T.S., Kozae, M., Ahmed, M.M. (2020).
An interim report on the editorial and analytical work of the AnonymClassic project. Medieval Worlds (11), 241–279.

Hahn, B., Eusterschulte, A., Kieslich, I. and Pischel, C. (2020).
Hannah Arendt Digital ‒ Kritische Gesamtausgabe.

Molitor, P., Necker, G., Pöckelmann, M., Rebiger, B., Ritter, J. (2020).
Keter Shem Tov – Prozessualisierung eines Editionsprojekts mit 100 Textzeugen. Conference of the Digital Humanities Association in German-speaking Countries (DHd). Paderborn.

Necker, G., Molitor, P., Ritter, J., Rebiger, B., Pöckelmann, M. (2019).
Kabbalah Editions.

Pöckelmann, M., Medek, A., Ritter, J. and Molitor, P. (2022).
LERA - An interactive platform for synoptical representations of multiple text witnesses. In: Digital Scholarship in the Humanities (DSH). Oxford University Press.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO