Paleo Codage - A machine-readable way to describe cuneiform characters paleographically

paper, specified "short paper"
Authorship
  1. Timo Homburg

    Fachhochschule Mainz (Mainz University of Applied Sciences)



Introduction
Cuneiform characters have been described using various systems in the past, and the systems used in the literature and in daily work vary by language and discipline. Commonly, sign lists (Borger 1971, Borger 2004, Ruster 1989, Deimel 1947) are created and published as dictionaries in a non-machine-readable form. Similarly, for computers, the only way to distinguish cuneiform characters is currently to assign them different numbers in a list (e.g. Unicode (Unicode Staff, 1991)) and to draw distinctions at this level. We are therefore left with many systems and numbers describing the same cuneiform sign (Figure 4). In contrast to listing cuneiform signs, Gottstein (2012) took another approach by creating a searchable cuneiform character encoding based on wedge types, which was later implemented in applications such as CuneiPainter (Homburg, 2015). Character image recognition has also been performed in the past (Mara, 2010), but it never yielded a machine-readable representation of a cuneiform character's paleographic information, which could have been useful as a means of validating machine learning recognition results. This publication therefore introduces Paleo Codage, a paleographically distinct machine-readable description inspired by the Manuel de Codage encoding (Van den Berg, 1997) for Egyptian hieroglyphs.

Motivation
A machine-readable paleographic description, despite representing yet another encoding scheme, could link all systems of cuneiform character description, as it directly describes a character's shape and positioning parameters. Scholars could easily register newly found characters in a machine-readable way and thereby provide the basis for computational analysis of the paleographic shapes of cuneiform characters. Such paleographic information would ideally be integrated into currently emerging Semantic Dictionaries for cuneiform (Homburg, 2017, 2018) to enrich linguistic linked open data and thereby benefit the respective scholars. In addition, a machine-readable paleographic description provides the basis for capturing sign variants of characters currently described in Unicode. It is very common for one Unicode code point to have many sign variants carrying the same meaning over the centuries in which cuneiform was written. Those sign variants have never been assessed digitally (only as sketches in books) and could provide valuable insights for philologists.

Approach
Paleo Codage builds on the description of Gottstein (2012) by using simple character descriptions for certain wedge types and by extending it with a set of relational descriptions inspired by Manuel de Codage (Van den Berg, 1997).
Cuneiform wedges are distinguished as follows:

Vertical wedge 𒁹 (a)

Horizontal wedge 𒀸 (b)

Diagonal wedge 1-4 𒀹, 𒀺 (c, d and mirrored e, f)

Winkelhaken 𒌋 (w)

The system encodes relations between wedges as shown by the following most frequent examples:

Wedges that pass through other wedges situated to their right (-) (e.g. MIN 𒈫 → a-a , three AŠ 𒐁 → b-b-b )
Wedges that do not pass through other wedges situated to their right (_) (e.g. ŠU 𒋗 → b:b:b:b:b_a , GIŠ 𒄑 → b::b_a )
Wedges under another wedge, possibly passing through other wedges (:) (e.g. U2 𒌑 → B::B-a-a-a-a , AŠ2 𒀾 → b:b:b:b-a )
Wedges under the current wedge, not passing through other wedges (;) (e.g. BAR 𒁇 → ;b-a )
Wedges diagonally under another wedge (.) (e.g. GAM 𒃡 → c.c )
Wedge inversion (!) (e.g. IDIM 𒅂 → !b:b )

In addition, size variations of cuneiform wedges are common and can be encoded as follows:

Capital letters signify a bigger version (e.g. A instead of a), and wedges prefixed with a small s signify a smaller version (e.g. sa instead of a) (e.g. A x A 𒀁 → a-sa-sa:sa-a:a , ŠE 𒊺 → W:W-w:w-w:w-w:w-W:W )
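The size rules above can be illustrated with a minimal Python sketch. It is not taken from the Paleo Codage implementation: the names WEDGE_NAMES, Wedge and wedges are hypothetical, and the sketch assumes that an s prefix always applies to the single wedge letter that follows it.

```python
import re
from typing import NamedTuple

# Wedge letters as listed in the text: a (vertical), b (horizontal),
# c/d/e/f (diagonals), w (Winkelhaken). The dictionary name is hypothetical.
WEDGE_NAMES = {
    "a": "vertical",
    "b": "horizontal",
    "c": "diagonal",
    "d": "diagonal",
    "e": "diagonal",
    "f": "diagonal",
    "w": "Winkelhaken",
}


class Wedge(NamedTuple):
    letter: str  # lower-case wedge letter
    kind: str    # wedge type looked up in WEDGE_NAMES
    size: str    # "small", "normal" or "big"


# Assumption: an optional "s" prefix followed by exactly one wedge letter.
WEDGE_PATTERN = re.compile(r"(s?)([abcdefwABCDEFW])")


def wedges(encoding: str) -> list[Wedge]:
    """Extract the wedges of an encoding together with their size class."""
    result = []
    for small, letter in WEDGE_PATTERN.findall(encoding):
        if small:
            size = "small"
        elif letter.isupper():
            size = "big"
        else:
            size = "normal"
        key = letter.lower()
        result.append(Wedge(key, WEDGE_NAMES[key], size))
    return result


if __name__ == "__main__":
    # A x A from the example above: one normal, three small, two normal wedges.
    for wedge in wedges("a-sa-sa:sa-a:a"):
        print(wedge)
```

Relation operators are ignored here on purpose, so the function only answers which wedges occur and in which size.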

Lastly, the angles of diagonal cuneiform wedges may vary between characters, which required angle modifiers to be added to the encoding.

For example, the angle between the diagonal wedges in IR 𒅕 → c;d-a-a-a is bigger than the angle between the diagonal wedges in ARKAB 𒀶 → |d;|c_A . The angle can be halved by using the | operator.

While the order in which cuneiform wedges were drawn is not always agreed upon by the respective scholars (Devecchi, 2015), Paleo Codage's order is, independently of this dispute, from left to right and then from top to bottom, in order to avoid ambiguities concerning cuneiform sign definitions. In order to facilitate the representation of displaced wedge groups, Paleo Codage also includes the following positioning modifiers: / (half the size down), ~ (half the size to the left), # (half the size to the right), as well as < and > as rotation modifiers, which rotate the whole glyph. Further operators could be added if needed for glyphs that cannot currently be modeled.
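The symbols introduced in this section can be collected in a small tokenizer. The following Python sketch is an illustration under stated assumptions, not the tool's actual parser: the names RELATIONS, MODIFIERS and tokenize are hypothetical, the grouping into "relation" and "modifier" is an editorial reading of the lists above, and since the text does not state the direction of the < and > rotations, none is assumed.

```python
import re

# Relation operators between wedges, as described in the text.
RELATIONS = {
    "-": "right of, passing through",
    "_": "right of, not passing through",
    ":": "under, possibly passing through",
    ";": "under, not passing through",
    ".": "diagonally under",
}

# Modifiers applying to a wedge or glyph, as described in the text.
MODIFIERS = {
    "!": "inversion",
    "|": "halve the angle",
    "/": "half the size down",
    "~": "half the size to the left",
    "#": "half the size to the right",
    "<": "rotation",
    ">": "rotation",
}

# A wedge (optionally prefixed with "s"), a relation, or a modifier.
TOKEN = re.compile(r"s?[abcdefwABCDEFW]|[-_:;.]|[!|/~#<>]")


def tokenize(encoding: str) -> list[tuple[str, str]]:
    """Split an encoding into (kind, text) pairs, left to right."""
    tokens = []
    pos = 0
    while pos < len(encoding):
        if encoding[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(encoding, pos)
        if match is None:
            raise ValueError(f"unexpected symbol {encoding[pos]!r} at {pos}")
        text = match.group()
        if text in RELATIONS:
            kind = "relation"
        elif text in MODIFIERS:
            kind = "modifier"
        else:
            kind = "wedge"
        tokens.append((kind, text))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    # ARKAB from the angle example: two halved diagonals and a big vertical.
    print(tokenize("|d;|c_A"))
```

Rejecting unknown symbols with an error, rather than skipping them, is a design choice of this sketch so that typos in hand-written encodings surface early.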

Proof Of Concept
A proof of concept is provided on a representative subset of 200 cuneiform Unicode characters (https://en.wikipedia.org/wiki/Cuneiform_(Unicode_block)), which were analysed to infer the relations described in the section Approach. Table 1 includes further encoding examples.

Image  Unicode  Main Transliteration  Borger  Gottstein  Paleo Codage
𒁹      U+12079  DIŠ                   748     a1         a
𒀸      U+12038  AŠ                    001     b1         b
𒀹      U+12039  AŠ ZIDA tenû          575     C1         c
𒀺      U+1203A  AŠ KABA tenû          647?    c1         e
𒌋      U+1230B  U                     661     d1         w
𒈦      U+12226  MAŠ                   120     a1b1       :b-a
𒁇      U+12047  BAR                   121     a1b1       ;b-a
𒇲      U+121F2  LAL                   750     a1b1       a-b
𒈨      U+12228  ME                    753     a1b1       a-:b
𒃡      U+120F5  GAM                   576     c2         c.c
𒋻      U+122FB  TAR                   009     a1c2       c.ca
𒌀      U+12300  TIL                   114     b1c1       bc
𒉽      U+1227D  PAP                   092     b1c1       C:d
𒂢      U+120A2  EZEN x A              288     a7b6       :sa-:sb::sb-ab;b-:sa-:sa:sa-a-:sb::sb-:sa
𒅈      U+12148  IGI RI                726     a4b2d2     :w-a-:b_-:b-a-a-:::w-a

Table 1: Cuneiform Encoding Examples
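The Gottstein and Paleo Codage columns of Table 1 appear to be related by simple wedge counting. The following Python sketch makes that reading explicit; it is an assumption inferred from the table rows, not a rule stated by either system. The names TO_GOTTSTEIN and gottstein_code are hypothetical, and the mapping (a to a, b to b, all diagonals c/d/e/f to c, Winkelhaken w to d) is a guess that reproduces most rows but not PAP, which the table lists as b1c1 although its Paleo Codage C:d contains two diagonals.

```python
import re
from collections import Counter

# Assumed mapping from Paleo Codage wedge letters to Gottstein letters,
# inferred from the rows of Table 1 (hypothetical, not from the source).
TO_GOTTSTEIN = {
    "a": "a",  # vertical
    "b": "b",  # horizontal
    "c": "c",  # diagonals
    "d": "c",
    "e": "c",
    "f": "c",
    "w": "d",  # Winkelhaken
}


def gottstein_code(encoding: str) -> str:
    """Count wedges per type, ignoring size, position and relations."""
    counts = Counter()
    # The optional "s" size prefix is skipped; only the wedge letter counts.
    for letter in re.findall(r"s?([abcdefwABCDEFW])", encoding):
        counts[TO_GOTTSTEIN[letter.lower()]] += 1
    return "".join(f"{key}{counts[key]}" for key in sorted(counts))


if __name__ == "__main__":
    print(gottstein_code(":sa-:sb::sb-ab;b-:sa-:sa:sa-a-:sb::sb-:sa"))  # a7b6
    print(gottstein_code(":w-a-:b_-:b-a-a-:::w-a"))                     # a4b2d2
```

The derivation only works in one direction: MAŠ, BAR, LAL and ME all collapse to a1b1, while their Paleo Codage strings remain distinct.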
A similarity graph generated for verification purposes (Figure 2) shows that the new encoding can be used to identify subglyphs that are included in other glyphs, which in turn is useful information for inclusion in (semantic) dictionaries. Further similarity measures on the encoding (e.g. string similarity) could reveal additional connections between cuneiform character representations.

Figure 2: Cuneiform character relations as a graph (excerpt): Only by inspecting the encoding can the computer now recognize, for example, that the glyph IMIN3 (b:b:b_b:b:b_b) is contained in the glyph ilimmu3 (b:b:b_b:b:b_b:b:b). Using the Gottstein system, such a conclusion could not be drawn, as the glyphs would be classified as b7 and b9 respectively.
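The containment relation from Figure 2 and a simple string similarity can be sketched in a few lines of Python. This is an illustration only: how the published graph was computed is not stated in the text, the function names contains and similarity are hypothetical, and the ratio from the standard library's difflib.SequenceMatcher merely stands in for whichever string similarity measure is eventually chosen.

```python
from difflib import SequenceMatcher

# Encodings quoted in the caption of Figure 2.
IMIN3 = "b:b:b_b:b:b_b"
ILIMMU3 = "b:b:b_b:b:b_b:b:b"


def contains(glyph: str, candidate: str) -> bool:
    """Return True if the candidate encoding occurs inside the glyph encoding.

    Assumption: plain substring matching is good enough for a sketch. It
    works on characters, not on wedges, so "a" would also match inside "sa".
    """
    return candidate in glyph


def similarity(first: str, second: str) -> float:
    """Similarity in [0, 1] between two encodings (difflib ratio)."""
    return SequenceMatcher(None, first, second).ratio()


if __name__ == "__main__":
    print(contains(ILIMMU3, IMIN3))    # IMIN3 is part of ilimmu3
    print(contains(IMIN3, ILIMMU3))    # but not the other way round
    print(round(similarity(IMIN3, ILIMMU3), 2))
```

A production version would more plausibly compare sequences of wedges and operators rather than raw characters, which avoids the false match noted in the docstring.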

Application
Given paleographic information encoded in a standardized way, users can draw a rudimentary shape of a character in order to identify the character they see in front of them (e.g. in a picture or on a tablet). This functionality is currently being implemented in CuneiPainter, where it improves the accuracy of matching cuneiform characters, and it will be ready as a showcase for DH2019. A showcase in JavaScript (Figure 3) highlighting all currently encoded characters is already available for testing, allowing users to verify and create their own encodings easily. In addition, the testing tool allows created cuneiform characters to be exported as SVG and as OpenType fonts in-browser, creating the basis for easier automated font creation for cuneiform characters.

Figure 3: Paleo Codage Input (JavaScript Application)

Figure 4: Cuneiform Numbering Systems: Semantic Dictionary for Ancient Languages

Bibliography

Biber, D. (1988). Variation Across Speech and Writing. Cambridge: Cambridge University Press.

Borger, R. (1971). Akkadische Zeichenliste. Neukirchen-Vluyn, Germany.


Conference Info

In review

ADHO - 2019
"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019


Series: ADHO (14)

Organizers: ADHO