Towards Creating A Best Practice Digital Processing Pipeline For Cuneiform Languages

Timo Homburg

Authorship

1. Timo Homburg

Fachhochschule Mainz (Mainz University of Applied Sciences)

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction
Ancient languages have recently become a research field gaining more and more attention from researchers in the DH community. This has led to various standards of digitization for different application cases which could be applied by other researchers to achieve easily accessible and often interlinked datasets.
Among those research areas gaining momentum in digitization, archaeology and cuneiform languages have been assessed in several past and ongoing projects.
For this DH2019 conference, the PaleoCodage encoding system for cuneiform languages has been accepted as a short paper presentation. This poster publication following (Brandes 2019) likes to introduce developments in this year in relation to PaleoCodage to create a cuneiform digital processing pipeline involving 3D-Scanning, paleography, transliteration, dictionary and signlist creation, automated font generation, linguistic annotations, semantic annotations and show how to publish said data in a sustainable form using a versioning system such as git.

Related Work
(Chiarcos 2018) created a proposal to annotate Sumerian cuneiform linguistically by extending the common CoNLL format (Chiarcos 2017) to support RDF (Miller 1998). While this concept works very well for linguistic annotations and natural language processing purposes we would like to present a solution which once setup can be used by nontechnical endusers and is at the same time usable by a variety of research communities. In addition to linguistic annotations our concept also contains paleographic sign variant descriptions of the respective cuneiform sign and conceptually the inclusion of 3D images.

Digital Processing Pipeline

The proposed pipeline is shown in Figure 2 and its stages are described in detail in the following sections. The pipeline is still a work in progress and feedback as well as suggestions are very welcome. The pipeline could be incorporated into a project workflow as shown in Figure 1 which is included into an currently pending project proposal.

Figure 1
(left)

:
Workflow of a research project utilizing the proposed cuneiform processing pipeline. The color green indicates the usage of a versioning system, the color yellow indicates exports which can be used by various communities.

Figure
2 (right): Digital Processing pipeline – schematic. A first a cuneiform tablet is 3D scanned, then transliterated and paleographically processed until it is further enriched and a semantic dictionary is created. Finally, several applications can produce domain-specific outputs which are useful for a variety of application fields.

Stage 1: Transliteration and Paleography
At first, cuneiform tablets are 3D scanned and character positions annotated by a professional. Next, a manual transliteration is created by the professional. In this step, the professional uses PaleoCodage (Homburg 2019) to create a machine-readable description of each cuneiform sign, thereby producing a sign list. The sign list refers to the unicode representations of the cuneiform signs if applicable and is modelled using Semantic Web standards. It needs to be noted that one unicode codepoint may refer to multiple cuneiform representations in PaleoCodage as sign variants are quite common. An example of this is given in Figure 3.

Figure 3: Cuneiform Sign Disambiguations of the same cuneiform sign E: Scholars consider these cuneiform signs to represent the same Cuneiform Unicode representation depending on the context of the cuneiform tablet
By describing sign variants using PaleoCodage, it is possible to automatically create a cuneiform OpenType

font from the sign list which is able to display the original cuneiform representations of the transliterations including sign variants digitally, which is a novelty in this community. The current State Of The Art merely includes drawing of cuneiform tablets as shown in Figure 4.

While such drawings can also capture the shape and broken parts of the cuneiform tablet they are currently not machine-readable. Using the PaleoCodage encoding, texts can be enabled to be searched by sign variant form which may be interesting for applications in philology. By using the OpenType font, the cuneiform text representation can be recreated in any application supporting OpenType fonts such as LibreOffice or MS Word ensuring the interoperability of the text.

Figure
4
: Cuneiform Sketch by the UniversityOf Pennsylvania

Ligatures
The OpenType font is enriched with sign variant descriptions as ligatures. For example, if we were to describe the cuneiform sign E in the signlist and discover that variants of E exist in the cuneiform texts, the variants can be described as
E\_v1...E\_vn and can be shown in the sign list. Using this given description, a ligature
E,\_,v,1 is created which is subsequently substituted using OpenTypes GSUB table to display the correct version of the character encoded in the automatically created OpenType font. The idea is the same as the concept of
in which latin characters are replaced using appropriate emojis. An example text

shows the font creation and the application of the font using an example text.

Stage 2: Annotation and Enrichment
At the start of this stage cuneiform texts have been manually transcribed into transliterations and have been saved in the Git repository in the ATF format

for cuneiform. We use an automized process to translate ATF documents to TEI XML (DeRose 1999) building up on specifications provided by the ETCSL project

. Having converted ATF into TEI XML, it can be annotated in CWRCWriter (Rockwell 2012) (Figure 5) which we intend to extend to support linguistic annotations.

The result of this enrichment process is a TEI XML which is a first publishable result on the project homepage. Using TEI Boilerplate (Walsh 2013), the texts can be visualized in a way that is suitable for scholars. An appropriate template is created during the work with CWRCWriter and can be seen in (Figure 5). In addition the annotation process results in RDF representations of important elements in the text. We hereby annotate:
Linguistic Elements as well as Semantic Elements (Persons, Places, Wares) with the goal to interlink as many facts about the artifacts and texts as possible in established Semantic Web vocabularies (e.g. Wikidata(Vrandečić 2014), DBPedia (Auer 2009), GeoSPARQL (Battle 2012), Pleiades (Simon 2016)).

Figure 5: CWRCWriter using a cuneiform TEI template for enriching contents

Stage 3: Dictionary Creation
Linguistic Elements become the basis of a semantic dictionary of the corpus which is worked on the outlines of which have been discussed in (Homburg 2017, 2018) and shown online

. This dictionary is created on-the-fly as the annotation progresses and is based on the Lexicon Model for Ontologies (Lemon) (McCrae 2010) (Figure 6) combining Semantics with linguistic annotations.

Figure 6: Lexicon Model for Ontologies: Bridging the gap between semantic web concepts and text (McCrae 2010)

Stage 4: Analysis, Evaluation and Deployment
In this stage, project specific analysis on texts can be conducted. One example could be the extraction and location of interesting at best spatiotemporal localizable information such as names of emperors, cities or typical goods that were traded during the time of cuneiform tablet creation. The created resources are also interesting for other research communities: The annotated 3D scans in combination with PaleoCodage can be used to improve cuneiform character recognition using machine learning (Mara 2010), annotated lingustic resources can be used as gold standard data to experiment with automated POSTagging or further text analysis. Also deployment options are implemented which include:

SPARQL Endpoint for the Semantic Web Community providing access to every resource (dictionary, annotated texts, images) in the project and provides interlinks
TEI XML and ATF Repository with Webservice access
Annotated 3D Image Data Storage using Git LFS
https://git-lfs.github.com

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO

Towards Creating A Best Practice Digital Processing Pipeline For Cuneiform Languages

1. Timo Homburg

ADHO - 2019

"Complexities"