A Digital Platform for the “Latin Silk Road”: Issues and Perspectives in Building a Multilingual Corpus for Textual Analysis

paper, specified "short paper"
Authorship
  1. 1. Emmanuela Carbe'

    QuestIT - University of Siena

  2. 2. Nicola Giannelli

    University of Siena

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Introduction
In March 2018 the project Eurasian Latin Archive was started with the aim of building a digital platform for a large corpus of Latin texts and documents from medieval and early modern times. The corpus encompasses text from East Asia, including Latin-Chinese texts, on which a group of researchers began to work addressing specific multilingual issues. The corpus will be available within a digital library provided with tools for textual and thematic analysis. The goal of the project is the comparative linguistic analysis, both internal with other Latin texts from different eras and areas, and external with non-Latin texts on homogeneous subjects. The ultimate purpose of the project is to highlight relationships by extracting and investigating 1) historical-cultural data about religion, law, science, art, and customs; 2) linguistic data, to be compared with homologous values in European Latin literature, in order to determine the specificity of East Asian Latin and investigate overlappings and mismatches between Latin words and their local (mostly Chinese) equivalents. The start-up phase of the project (
DAS-MeMo), which will end by February 2020, is co-financed by Regione Toscana and QuestIT, an IT company specialized in NLP and Artificial Intelligence. The working group is interdisciplinary and brings together medievalists, digital humanists, engineers and IT specialists. The platform is inspired by the digital archive
ALIM (Archivio della Latinità Italiana del Medioevo), which acts as a starting point for the new archive. However, unlike ALIM, the project is geared towards providing textual analysis also with machine learning tools; ALIM focuses instead on digital representation of editions (including editiones principes) and takes into account texts of Italian Latinity exclusively (Ferrarini, 2017; Stella, 2015).

Preliminary complexities: defining and creating the corpus
To define the corpus, a first census of over three hundred texts has been made (also thanks to online resources, such as
Sinica 2.0 and
CCT-Database). The amount of documents is due to increase, but it is sufficiently large to collect some preliminary results in the main projects’ tasks. It seems useful to classify documents into four main categories: 1. Born-digital editions, freely available or to request for; 2) modern (critical and non critical) editions that have entered the public domain; 3. editions with possible issues with OCR; 4. manuscripts. With the exception of the born-digital editions, the categories can a) be already digitized and made available by other Institutions; b) still need to be digitized. At this stage, besides the born-digital editions already acquired, seventy more documents of the type 2a and 2b have already been processed. In the case of critical editions, we chose to provide the text without encoding the critical apparatus. Items are being encoded in TEI P5, with a particular attention for the essentials metadata, such as VIAF and Wikidata ID when available, OCLC references, reliability both for external and internal processes of the item (i.e. responsible for digitization and/or OCR, checking, date history of all changes, evaluation of the resource, and so on), possibly opening the way to integration with semantic models (Ciotti, 2018; Ciotti et al., 2016).

As a non-conclusion: a structure that is modular and can grow more complex with time
For such an ambitious project it is obviously necessary to proceed by ensuring a modular architecture from the very beginning, so that it can be implemented within this project progressively and can possibly merge, through its interoperability features, with open source projects from other Institutions. For this purpose, a requirement analysis has been carried out, and a list of specifications has been redacted in order to provide a basis of guidelines for the subsequent implementation.
The aim of this paper is to display the architecture of the project and some of the early results: a prototype based on the ElasticSearch search engine has been developed, which includes a search module (with multiple filter methods) and a browse module, with the possibility to investigate the corpus by authors, dates and periods, collections, sources, places mentioned within the document, languages used in the text, keywords. A particular attention is being paid to the recognition of entities. For the moment being, we are working on recognition of places, dates and people. For the particular case of places, we also need to take into account that location names often differ significantly from the ones used nowadays, to the point that many researchers actually disagree on the current location of cities mentioned in these texts. More tools will be added to the preliminary ones, in order to pursue a text analysis (Stella, 2018: 72-100; Eger et al., 2015). One of the most important issues remains however the encoding of multilingual documents, currently under examination by handling Prospero Intorcetta’s Sapientia Sinica, which contains chinese text, and transliterations in latin of the chinese terms.

Bibliography

Ciotti, F.
(2018). A Formal Ontology for the Text Encoding Initiative.
Umanistica Digitale
,
3
(2018). doi:

10.6092/issn.2532-8816/8174

.

Ciotti, F., Daquino, M. and Tomasi F.
(2016). Text Encoding Initiative Semantic Modeling. A Conceptual Workflow Proposal. In Calvanese, D., De Nart, D. and Tasso, C. (eds),
Digital Libraries on the Move
, vol. 612. Cham: Springer International Publishing, pp. 48-60. doi:

10.1007/978-3-319-41938-1_5

.

Eger, S., Brück, T. vor der, and Mehler, A.
(2015). Lexicon-assisted tagging and lemmatization in Latin: A comparison of six taggers and two lemmatization methods. In
Proceedings of the 9th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
. Beijing, China, pp. 105-13. doi:

10.18653/v1/W15-3716

.

Russo, L.
(2005). ALIM, Archivio della latinità italiana del Medioevo.
Reti Medievali Rivista,
6
(1): 149-151. doi:

10.6092/1593-2214/181

.

Stella, F.
(2015). Il problema della codifica nelle edizioni critiche digitali. In Del Corso, L., De Vivo, F., Stramaglia, A.,
Nel segno del testo. Edizioni, materiali e studi per Oronzo Pecere
. Firenze: Gonnelli, pp. 347-360.

Stella, F.
(2018).
Testi letterari e analisi digitali
. Roma: Carocci.

TEI Consortium
(2019).
TEI P5: Guidelines for Electronic Text Encoding and Interchange.
Version 3.5.0. Last updated on 29th January 2019.

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html

.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.