An OWL 2 Formal Ontology for the Text Encoding Initiative

paper, specified "long paper"
  1. 1. Fabio Ciotti

    Università di Roma Tor Vergata

  2. 2. Silvio Peroni

    Università di Bologna (University of Bologna)

  3. 3. Francesca Tomasi

    Università di Bologna (University of Bologna)

  4. 4. Fabio Vitali

    Università di Bologna (University of Bologna)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

This paper presents the results of an effort that our research team has done in order to develop an OWL 2 ontology to formally define the semantics of the Text Encoding Initiative markup language. The preliminary steps of this research project have already been presented at the TEI Conference in 2014 and 2015 (Ciotti and Tomasi 2014, Ciotti et al., 2015). We believe that our work has reached a satisfactory level of development, both on the theoretical side and in the practical implementation.

Why an ontology for TEI
The reasons to have a formal and machine-readable semantics for TEI are manifold. In the first place we can set forth a list of pragmatic and technical benefits that have been already pointed out in many previous works dedicate to this topic, that dates back to the mid-90s (Di Iorio, Peroni and Vitali, 2009; Ciotti and Tomasi, 2014). Here is a brief summary of those arguments:

enabling parsers to perform both syntactic and semantic validation of document markup;
inferring facts from documents automatically by means of inference systems and reasoners;
simplifying the federation, conversion and translation of documents marked up with different markup vocabularies;
allowing users to query upon the structure of the document considering its semantics.

The advantages envisioned in this list are not specific to the TEI or aim to facilitate the relationships between different markup languages; but some of the issues have special relevance for TEI and for the usage of TEI inside its reference community.
Take for instance the query issue: we all know that there are many ways of expressing one and the same textual feature in TEI markup, so that it is very difficult to query heterogeneous TEI corpora and text archives. Having a set of ontological definitions of the conceptual level behind markup, that is, a set of shared formal definitions of the textual features to which any single encoding project could bind idiosyncratic markup usage, could help solve this problem. The same argument could be made for a far more adequate management of interoperability of TEI text collections between different repositories or applications.
But we believe there is also a deeper theoretical and foundational advantage in the idea of an ontological semantic model for TEI. It is a commonly acknowledged notion that the very core of digital methods application in humanities research is the notion of model/modeling. The pair of terms “model/modeling” is deplorably understood in many different ways in the community. We think that, as far as we are using Turing machine like device for computation, the only workable notion of modeling is a formal one: model we should be interested in are formal models. Where formalization is to be understood as a series of semiotic processes that generates an algorithmically computable representation of one (or more) phenomenon/object.
It is widely recognized that the TEI is not only a markup facility but first and foremost a conceptual model of textuality. In fact, the Guidelines (TEI Consortium, 2015, chap. 23) explicitly introduce the notion of a TEI Abstract Model. The fact is that the notion of an abstract model is used in many formal procedures but this very notion is not formally defined. This ends up in a lot of problems and circularities. We think that we need to have a formalized account of the quasi-formal notion of TEI abstract model, if it has to be of any use other than a sort of regulatory principle.
We do not advocate going back to a monist theory of textuality. Our suggestion to adopt contemporary Semantic Web formalisms to build this abstract conceptual model give us the possibility to have a “foundation” of TEI in a well-defined data model that is not dependent on the notion of a single hierarchical “ordered hierarchy of content object” (OHCO, DeRose et al., 1997), and that can accommodate, at least to some extent, the “pluralities” of textuality.

Structure of the ontology
TEI as a whole is very complex, and its usage is governed by pragmatics and contextual requirements. We acknowledge that it is impossible to reduce to a unique formal semantic definition this fuzzy cloud. Though, we can identify a subset of shared assumptions, a common ground of notions about the meaning of TEI markup and the nature of documents like object: we think that this subset can be the object of an ontological formalization. For various reasons we have adopted the TEI Simple customization (Cummings et al., 2014) as an acceptable approximation of this common ontology. This is not an opportunistic ad hoc choice, as it may seem. TEI Simple in fact has been defined by a group of domain expert that have analyzed the actual usage of TEI markup in some big textual repositories and have selected and organized a set of one hundred or so elements that can describe all the textual features represented by the markup in those documents. This fits perfectly in the definition of a formal ontology development process.
The main design requirements for building our ontology have been the following:

the ontology must express at the same time an abstract characterization of TEI Simple elements' semantics and an ontological definition of their structural role;
the ontology must define a precise semantics of the elements having a clear characterization in the official TEI documentation (e.g., the element ), while it should relax the semantic constraints if the elements in consideration can be used with different semantic connotations depending on the context (e.g., the element );
it must be possible to extend the ontology, reuse it and define alternative characterizations of elements semantics without compromising the consistency of the ontology itself;
where possible existing ontologies or meta-ontologies must be reused

In accordance with these overall principles we have decided to implement a complex architecture using some pre-existing meta-ontology frameworks to express the meaning of TEI element set by the way of the classes and properties they define. In particular we have adopted:
1) LA-Earmark (Di Iorio, Peroni, Poggi, Vitali, 2011; Peroni, Gangemi, Vitali, 2011), a markup metalanguage, that can express both the syntax and the semantics of markup as OWL assertions, and an ontology of markup that make explicit the implicit assumptions of markup languages. LA-EARMARK is an extension of EARMARK with the Linguistic Act module of the Linguistic Meta-Model that allows one to express and assess facts, constraints and rules about the markup structure as well as about the inherent semantics of the markup elements themselves.
2) Structural Pattern Ontology (Di Iorio, Peroni, Poggi, Vitali, 2014), whose goal is to identify a small number of patterns that are sufficient to express how the structure of digital documents can be segmented into atomic components.
The specification of markup semantics for the various TEI Simple elements is done by means of LA-EARMARK class and properties. The general Earmark class for any markup element is earmark:Element. The element is defined as follows:
Prefix earmark:
Prefix co:
Prefix tei:
Class: tei:abbr a
earmark:Element that
earmark:hasGeneralIdentifier "abbr" and
earmark:hasNamespace ""
LA-EARMARK allows us to link particular class of elements with the actual semantics they express. From our point of view there are at least two semantic levels that we explicitly define:

one concerning the structural behavior of markup that is described by means of the Pattern Ontology (PO);
the other regarding the intended semantics of an element (e.g., the fact that an element is a paragraph rather than a section, a personal name reference rather than a geographical reference), that is described by TEI Semantics Ontology or by a combination of already existing ontologies.

TEI Semantics Ontology is the core component that gives the actual semantics of TEI elements. Its definition is based on a categorization of the elements of the TEI Simple, based on a refactoring of the TEI model Classes.
The link between the class describing kinds of elements and their related semantic characterization is possible by means of the property “semiotics:expresses”. The associations of semantics to markup elements can be contextualized according to a particular agent's point of view, in order to provide provenance data pointing to the entity that was responsible for such specification. This is possible by means of the Linguistic Act Ontology included in LA-EARMARK that allows one to consider all these markup-to-semantics links as proper linguistic acts done by someone.

The work we have done so far is limited to the Simple subset of TEI. We envision some further development:

Refine the TEI Semantics Ontology component.
Extend to some other areas of TEI that are suitable for formalization.

We think that in the long term this ontological formalization could become the primary formalization of the TEI encoding schema, independently of any serialization format. Today XML is still the better strategy to encode digital texts in real word projects for many practical reasons. But there is no reason for the TEI to be strictly based on it, as it is
de facto now. Technical issues should not determine the choice of a formalization language. In the end, we believe that our effort can give a substantial contribution to the TEI to envision the shape of its own future.

Ciotti F. and Tomasi F. (2014). Formal ontologies, Linked Data and TEI semantics.
Journal of the Text Encoding Initiative, 9.

Ciotti F., et al. (2015). An ontology for the TEI: one step beyond. TEI Conference and Members Meeting 2015. Lyon.
DeRose, S. J., Durand, D. G., Mylonas, E. and Renear, A. H. (1997). What is Text, Really?. Journal of Computer Documentation, 21(3): 1–24.
Di Iorio, A., Peroni, S. and Vitali, F. (2009). Towards markup support for full GODDAGs and beyond: the EARMARK approach.
Proceedings of Balisage: The Markup Conference 2009, Balisage Series on Markup Technologies, Vol. 3, Montreal, Canada. doi:10.4242/BalisageVol3.Peroni01.

TEI Consortium. (2015).
TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.9.1. Last updated 15th November 2015.

Cummings, J. et al. (2015). TEI Simple: Power, economy, and a processing model for encoders and developers. Digital Humanities 2015.
Peroni S., Gangemi A. and Vitali F. (2011). Dealing with Markup Semantics.
I-SEMANTICS 2011 Proceedings, 111-118. DOI: 10.1145/2063518.2063533.

Di Iorio, A., Peroni, S., Poggi, F. and Vitali, F. (2011). Using semantic web technologies for analysis and validation of structural markup.
Int. J. of Web Engineering and Technology, 6(4): 375-98.

Di Iorio, A., Peroni, S., Poggi, F. and Vitali, F. (2014). Dealing with structural patterns of XML documents.
Journal of the American Society for Information Science and Technology, 65(9): 1884-900. DOI: 10.1002/asi.23088.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2016
"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

454 works by 1072 authors indexed

Conference website:

Series: ADHO (11)

Organizers: ADHO