C-SALT APIs - Connecting and Exposing Heterogeneous Language Resources

paper, specified "short paper"
Authorship
  1. Francisco Mondaca, Universität zu Köln (University of Cologne)
  2. Felix Rau, Universität zu Köln (University of Cologne)
  3. Claes Neuefeind, Universität zu Köln (University of Cologne)
  4. Börge Kiss, Universität zu Köln (University of Cologne)
  5. Daniel Kölligan, Universität zu Köln (University of Cologne)
  6. Uta Reinöhl, Universität zu Köln (University of Cologne)
  7. Patrick Sahle, Universität zu Köln (University of Cologne)

Work text

Introduction
In this paper, we present a strategy for integrating existing heterogeneous language resources, such as texts and dictionaries, by connecting them and making them available to internal projects and third-party applications through (Web) APIs. We describe our approach in the context of the C-SALT initiative (Cologne South Asian Languages and Texts, http://c-salt.uni-koeln.de), which gathers projects and resources hosted at the University of Cologne covering South Asian languages. To illustrate the potential use of our approach, we first introduce VedaWeb, a web-based platform that provides access to ancient Indian texts composed in Vedic Sanskrit, the oldest form of ancient Indo-Aryan. We then describe the C-SALT APIs for dictionaries (https://api.c-salt.uni-koeln.de), which make several large Pāli and Sanskrit dictionaries available online. Building on that, we present the architecture behind these APIs, and we conclude by discussing the potential role of APIs in Digital Humanities (DH) projects.

About VedaWeb
The cornerstone of VedaWeb is a digital edition of the Rigveda, a text of approx. 160,000 words and one of the oldest and most important texts of the Indo-European language family. VedaWeb can be accessed either via a web application (https://vedaweb.uni-koeln.de/rigveda) or directly via an API (https://dh-cologne.github.io/vedaweb/#description-of-the-api). It provides several layers of linguistic and philological information alongside various editions of the text of the Rigveda. A search function with multiple linguistic parameters (including lemma, word form, morphological and metrical information) allows users to run queries across different levels of annotation by combining search criteria. Besides the annotated version of the text, further layers include translations (including Geldner, 2003; Grassmann, 1876; Griffith, 1896; Renou, 1956-1969) as well as commentaries on the Rigveda (Oldenberg, 1909/1912; Renou, 1956-1969). Like the morphological annotations, all of these additional information layers can be accessed via full-text search as well as a more structured search function.

The possibility to combine these multiple layers is crucial for enabling novel perspectives on the data, e.g. by quantifying feature combinations or by identifying context-dependent phenomena such as different types of constructions. VedaWeb is meant to advance research in all areas of Vedic studies, for example in syntax (e.g. referential null objects (Keydana & Luraghi, 2012), non-configurationality (Reinöhl, 2016)), morphology (e.g. the Vedic vr̥kī-type (Widmer, 2007), ya-presents (Kulikov, 2012)) or word formation (e.g. compounds (Scarlata & Widmer, 2015)).

A screenshot of the VedaWeb application with two layers selected. The Rigveda data and the dictionaries are provided via the C-SALT APIs.

An important feature of VedaWeb is the enrichment of the Rigveda text by linking each word with entries from the standard dictionary for the Rigveda by Hermann Grassmann (Grassmann, 1873). Instead of encapsulating the data in the application, our approach is to leave the resource ‘in place’ and obtain the data via the C-SALT APIs for Sanskrit Dictionaries (https://cceh.github.io/c-salt_sanskrit_api).
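The linking step can be pictured with a minimal Python sketch. It is not the VedaWeb implementation: the token fields (`form`, `lemma`), the `lookup` callable and the sample entry are assumptions made for illustration, and the stub lookup merely stands in for a remote call to a dictionary API. The point it shows is that the application stores only the text tokens and fetches dictionary entries on demand, caching repeated lemmas.

```python
from typing import Callable, Dict, List

# A lookup is any function that maps a lemma to a list of dictionary entries.
# In a real client it would wrap an HTTP call to a dictionary API; the exact
# endpoint and response shape are not specified here and are assumptions.
Lookup = Callable[[str], List[dict]]


def link_tokens(tokens: List[dict], lookup: Lookup) -> List[dict]:
    """Attach dictionary entries to each token by lemma, caching repeated lemmas."""
    cache: Dict[str, List[dict]] = {}
    linked = []
    for token in tokens:
        lemma = token["lemma"]
        if lemma not in cache:
            cache[lemma] = lookup(lemma)
        # Copy the token and add the entries; the original token is untouched.
        linked.append({**token, "entries": cache[lemma]})
    return linked


# Placeholder data standing in for the remote dictionary (hypothetical entry).
SAMPLE_ENTRIES = {"agni": [{"id": "placeholder-1", "headword": "agni"}]}


def stub_lookup(lemma: str) -> List[dict]:
    """Stand-in for an API request; returns an empty list for unknown lemmas."""
    return SAMPLE_ENTRIES.get(lemma, [])


tokens = [
    {"form": "agnim", "lemma": "agni"},
    {"form": "unknown-form", "lemma": "unknown-lemma"},
]
linked = link_tokens(tokens, stub_lookup)
```

Because `link_tokens` depends only on the `lookup` signature, the same code would work whether the entries come from a REST endpoint, a GraphQL query or a local test fixture.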

C-SALT APIs for Dictionaries
The C-SALT APIs for Dictionaries (https://api.c-salt.uni-koeln.de) have been developed to provide access to existing lexicographic resources in Pāli and Sanskrit without duplicating work or hosting efforts. The dictionaries available via these APIs are also accessible through traditional monolithic web applications, such as the Critical Pāli Dictionary Online (http://cpd.uni-koeln.de) and the Cologne Digital Sanskrit Dictionaries (http://www.sanskrit-lexicon.uni-koeln.de), which are a product of a major Sanskrit digitization project (Kapp & Malten, 1997).

C-SALT APIs Overview

API Architecture
The APIs and the VedaWeb application are based on versions of the texts and dictionaries encoded in TEI-XML (Text Encoding Initiative, https://tei-c.org/; Extensible Markup Language). We employ a TEI schema (https://github.com/cceh/c-salt_dicts_schema) developed initially for the three most complex Sanskrit dictionaries (Apte, 1920; Böhtlingk & Roth, 1855-1875; Monier-Williams, 1899). Using a single TEI schema gives us both data persistence and a consistent structure for all dictionaries. While software such as frontend applications or APIs changes over time, TEI offers the DH community the safest way to assure data persistence. For this reason, all the data accessed through the APIs is ultimately based on TEI files. The different C-SALT projects use different technologies as ‘middleware’ between TEI and endpoints, and also different Web API technologies: REST (Fielding, 2009) and GraphQL (https://graphql.org/). Independently of the technology employed, our APIs focus on performance and on providing well-documented access to curated linguistic data.
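As a rough illustration of the ‘middleware’ idea, the following Python sketch turns a TEI dictionary entry into a plain record that a REST or GraphQL endpoint could serve. It is a sketch under assumptions, not the C-SALT pipeline: the sample entry is invented, and although `entry`, `form`, `orth`, `sense` and `def` are standard TEI dictionary elements, the actual C-SALT schema may structure entries differently.

```python
import json
import xml.etree.ElementTree as ET

# The TEI namespace is standard; the sample entry below is invented.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE_ENTRY = """<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="placeholder-1">
  <form><orth>agni</orth></form>
  <sense><def>placeholder definition one</def></sense>
  <sense><def>placeholder definition two</def></sense>
</entry>"""


def entry_to_record(entry: ET.Element) -> dict:
    """Map one TEI <entry> element to a JSON-serializable record."""
    orth = entry.find("tei:form/tei:orth", TEI_NS)
    return {
        "id": entry.get(XML_ID),
        "headword": orth.text if orth is not None else None,
        "senses": [d.text for d in entry.findall("tei:sense/tei:def", TEI_NS)],
    }


record = entry_to_record(ET.fromstring(SAMPLE_ENTRY))
payload = json.dumps(record, ensure_ascii=False)
```

The TEI file remains the persistent source; records like `record` are derived data that can be regenerated whenever the API technology changes.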

Summary and Outlook
Developing APIs enforces a separation of concerns. For the APIs, this means well-curated data that can be accessed efficiently through a clearly defined structure. For web applications, it means focusing on a specific target audience and drawing on multiple APIs where required. We have described the potential use of APIs for lexicographic resources. There are several advantages to making the data accessible through APIs instead of encapsulating the data within the application. Instead of forcibly homogenizing diverse data sets into a general data model, it is more efficient to provide a common interface for accessing them. This also opens up opportunities to employ the different resources in the context of other applications. The main goal in developing C-SALT is to keep all resources as modular as possible, so that they can be used and reused in different research scenarios. In the case of VedaWeb, this currently applies to the dictionaries involved, but we see the potential to transfer the concept to the other information layers as well, in particular the Rigveda text and its translations. In general, we believe that an API-based approach to digital resources and data in the Digital Humanities provides efficient access to data and encourages the reuse of available resources. It thus facilitates novel uses by other researchers while avoiding repetition of work and unnecessary redundancy of resource instances. Applications are transient, but the knowledge represented by the data may stay and be reused.
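The contrast between a common interface and a homogenized data model can be sketched in a few lines of Python. Everything here is hypothetical: the class names, the two internal data layouts and the placeholder definitions are assumptions chosen to show the pattern, not a description of how the C-SALT resources are stored.

```python
from typing import Dict, List, Protocol


class DictionarySource(Protocol):
    """Common interface: every resource answers a headword lookup."""

    def lookup(self, headword: str) -> List[str]:
        ...


class FlatSource:
    """Resource whose native model maps a headword to a list of definitions."""

    def __init__(self, data: Dict[str, List[str]]) -> None:
        self._data = data

    def lookup(self, headword: str) -> List[str]:
        return list(self._data.get(headword, []))


class NestedSource:
    """Resource whose native model is a list of entries with nested senses."""

    def __init__(self, entries: List[dict]) -> None:
        self._entries = entries

    def lookup(self, headword: str) -> List[str]:
        return [
            sense["def"]
            for entry in self._entries
            if entry["headword"] == headword
            for sense in entry["senses"]
        ]


def lookup_all(sources: Dict[str, DictionarySource], headword: str) -> Dict[str, List[str]]:
    """Query every source through the shared interface, keyed by source name."""
    return {name: source.lookup(headword) for name, source in sources.items()}


sources: Dict[str, DictionarySource] = {
    "dict_a": FlatSource({"agni": ["placeholder definition A"]}),
    "dict_b": NestedSource(
        [{"headword": "agni", "senses": [{"def": "placeholder definition B"}]}]
    ),
}
results = lookup_all(sources, "agni")
```

Each resource keeps its own internal structure; only the `lookup` method is shared, so adding a further resource means writing one adapter rather than converting its data.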

Bibliography

Apte, V. S. (1920). The student's English-Sanskrit dictionary. 3rd ed. Pune.

Böhtlingk, O. and Roth R. (1855-1875). Sanskrit-Wörterbuch. St. Petersburg: Kaiserliche Akademie der Wissenschaften.

Fielding, R. T. (2009). Architectural Styles and the Design of Network-based Software Architectures. Dissertation. Irvine: University of California. URL: https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf

Geldner, K. F. (2003) [1951-57]. Der Rig-Veda. Aus dem Sanskrit ins Deutsche übersetzt und mit einem laufenden Kommentar versehen von Karl Friedrich Geldner. Cambridge (Mass.): Harvard University Press.

Grassmann, H. (1873). Wörterbuch zum Rig-veda. Wiesbaden: O. Harrassowitz.

Grassmann, H. (1876). Rig-veda. Übersetzt und mit kritischen und erläuternden Anmerkungen versehen von Hermann Grassmann. Leipzig: F.A. Brockhaus.

Griffith, R. T. H. (1896). The Hymns of the Rigveda. Benares: Lazarus.

Kapp, D. and Malten, T. (1997). Report on the Cologne Sanskrit Dictionary Project. URL: http://www.sanskrit-lexicon.uni-koeln.de/CDSL.pdf

Keydana, G. and Luraghi, S. (2012). Definite referential null objects in Vedic Sanskrit and Ancient Greek. Acta Linguistica Hafniensia 44(2):116–28. https://doi.org/10.1080/03740463.2013.776245

Kulikov, L. (2012). The Vedic -ya-Presents: Passives and Intransitivity in Old Indo-Aryan. Amsterdam, Netherlands: Rodopi.

Monier-Williams, M. (1899). A Sanskrit-English dictionary, new edition. Oxford: Clarendon.

Oldenberg, H. (1909/1912). R̥gveda. Textkritische und exegetische Noten. Berlin: Weidmann.

Reinöhl, U. (2016). Grammaticalization and the Rise of Configurationality in Indo-Aryan. Oxford: Oxford University Press.

Renou, L. (1956-1969). Études védiques et paninéennes. Paris: Boccard.

Scarlata, S. and Widmer, P. (2015). “Vedische exozentrische Komposita mit drei Relationen”, Indo-Iranian Journal 58(1):26-47.

Widmer, P. (2007). “Der altindische vrkī́-Typus und hethitisch nakki-: Der indogermanische Instrumental zwischen Syntax und Morphologie”, Zeitschrift für Sprachwissenschaft (1-2):190-208.
