A Model-to-model Approach for Developing Corpus Metadata. An “Odd” TEI Customization for Encoding Metadata

paper, specified "long paper"
  1. 1. Carolin Odebrecht

    Humboldt-Universität zu Berlin (Humboldt University)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Why do we need a metamodel for metadata?
The Text Encoding Initiative (TEI; TEI-Consortium 2019) environment provides a generic document model. TEI customizations typically follow an explicit or implicit model for text representation. The TEI document model provides modules for encoding text via mark up as well as modules for the metadata referring to the TEI document and the encoded text. The specialized (“odd”) customization presented here follows an explicit Metamodel for Corpus Metadata (MCM) representation and expands the range of applications of the TEI metadata modules to non-TEI corpora. Metadata are defined as “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.” (NISO 2004: 1). The application scenarios for metadata are diverse, which in turn generates requirements for metadata classifications (c.f. for example Zeng and Qin 2016). Corpus documentation is a typical example of the application of metadata. In order to (re-)use historical corpora, a researcher needs to be able to gain intellectual access to a heterogeneous set of properties of a corpus. Moreover, the reusability of (historical) corpora is an important indicator for their sustainability, following Simons and Bird (2008: 90) who stated that a “[...] resource will be used if it still exists, if it is usable, and if a user finds it relevant.” Historical corpora attest diverse methods and approaches towards corpus creation, exemplified by a variety of existing resources such as
Referenzkorpus Althochdeutsch (Donhauser, Gippert and Lühr, 2018, Version 1.1),
Deutsches Textarchiv (Berlin-Brandenburgische Akademie der Wissenschaften, 2019), and
Kasseler Junktionskorpus (Ágel and Hennig, 2008, Version 1.1). Thus, corpus documentation must cover a complex and heterogeneous field of corpus data. MCM provides a UML-based extensive and extensible metamodel for describing corpora, including their documents, annotations and preparation workflows, to address this complexity and enable intellectual access to corpus data.

In this paper, I will discuss how MCM can serve as a content model for the TEI document model, which in turn is used for a realization of MCM. This way, information is passed from model to model. Following this approach, I will also show how flexibly the TEI framework can be used and how two types of models can interact with each other.

The TEI document model
The TEI Guidelines provide an extensive and easily adaptable framework for text encoding of various text types and genres, and for the encoding of metadata (cf. Stührenberg 2012). Especially the digitization and encoding of historical documents are often achieved with the help of the TEI (cf. for example Haaf, Geyken, and Wiegand 2014; Schroeder and Zeldes 2016; Durrell, Ensslin, and Bennett 2007). Figure 1 illustrates the TEI document model, and particularly the relation of data and metadata.

TEI document model with a basic structure containing teiHeader and body. The metadata in the teiHeader refer to the body part as well as to the external source and the TEI document itself. The body part refers as well to source. The source itself can be more complex than indicated here.

A TEI-compliant document

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-TEI.html Accessed 2018-11-25.

contains a body, which ‘contains the whole body of a single text’

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-body.html Accessed 2018-11-25.

, and a teiHeader

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html Accessed 2018-11-25.

for the metadata records and for the documentation of a TEI document. The metadata in the teiHeader refer to the TEI document itself, the body and the external source (in our case historical texts). The data in the body of the TEI document refer to the source as being the digital surrogate of the source. In this way, metadata and data are integrated in a single document, are defined by the TEI content model, and refer to the same source. The annotation (tag set) of the document is typically not documented in the teiHeader. Instead, it is extracted from, and documented in, the TEI Guidelines, and modeled within the TEI environment.

A model-to-model-based approach for developing corpus metadata
The TEI Guidelines can be seen as a content standard, the TEI XML as a file exchange standard which is defined by the TEI modeling tool ‘One Document Does it all’ (ODD, Burnard and Rahtz 2004).

This is an advantage compared to other metadata schemes (e.g. ISO 2015) which do not have such a modeling environment. For a discussion of other approaches see Odebrecht (2018, chapter 5)
By modeling the TEI with the help of the TEI, the TEI is validated within TEI as well. This way, the data creation and documentation is exclusively dependent on the TEI environment. While using the TEI framework enables a high degree of interoperability across disciplines, other formats also mix TEI annotations with other annotations, or do not use the TEI at all. The complexity for corpus data documentation increases when dealing with digital surrogates of different encoding standards, which in turn are based on diverging text, data models, or formats. To handle this complexity, the MCM, by using the Unified Modeling Language (UML, cf. Rumpe 2011)

https://www.omg.org/spec/UML/ Accessed 2018-11-25.

, identifies common features across diverse corpus data and defines which corpus metadata need to be included in a corpus documentation to enable intellectual access and corpus data reuse (Odebrecht 2018). This kind of documentation needs a stronger focus on information about corpus preparation, because the preparation steps for a historical corpus are not always included in a single environment such as the TEI. Adopting the TEI approach and focusing on the metadata framework, the MCM provides an additional conceptual layer (cf. Figure 2).

The model-to-model approach: The MCM provides extended metadata for corpus documentation which are handed over to the TEI modeling environment ODD. Each TEI customization refers to a central corpus documentation part: source (text model), annotation (data model) and creation (workflow model).

The MCM serves as a blue print for metadata modeling and realization in TEI (Odebrecht 2017). It is a format-independent and extensive content model for the application of the TEI to corpus documentation. The realization of the model-to-model approach can explicitly add additional statements and information relationships which go beyond the original TEI model. From a TEI perspective, the document model with a focus on metadata is extended by including non-TEI corpora but looses the close relationship between the source and the TEI document. The customizations adapt the TEI for an originally unintended and novel use case which allows for the documentation of historical corpora of any format and any architecture. This is done via three customizations: One customization describes corpus-internal information such as the text representation which is close to the typical functionality of the TEI. The second customization contains explicit information about all kinds of annotations in the corpus. A third customization describes corpus-external information of the workflow. A concrete application scenario for historical corpora is as follows: TEI or non-TEI corpora which shall be archived in a repository (LAUDATIO

http://www.laudatio-repository.org/ Accessed 2019-04-24.

) can be documented comprehensively and deeply structured. The resulting TEI-XML metadata are then used for metadata search and display in this repository.

The TEI customizations represent an innovative perspective on the application of the TEI. The TEI already provides a full metadata structure which is integrated in a document-centric model. With the help of another model – the MCM – the application of the teiHeader can be extended. The MCM can be defined as a concrete and extensive content model for metadata information. The application of the MCM benefits from the TEI environment and its interoperability, and can make use of the modeling tool ODD and its validation mechanism. This approach enables use cases for metadata that were previously separated from the TEI universe and proves that the adaptability and flexibility of the TEI allows reuse scenarios, in this case TEI XML for corpus metadata, which have not been initially intended.


Burnard, L. and Sebastian R. (2004). RelaxNG with Son of ODD.
Extreme Markup Languages. http://www.tei-c.org/cms/Talks/extreme2004/paper.html (accessed 2018-11-26).

Donhauser, K., Gippert, J. and Lühr, R. (2018, Version 1.1
). Referenzkorpus Altdeutsch, Humboldt-Universität zu Berlin.

Durrell, M., Ensslin, A. and Bennett. P. (2007). The GerManC project.
Sprache und Datenverarbeitung (31): 71–80.

Berlin-Brandenburgischen Akademie der Wissenschaften (2019).
Deutsches Textarchiv. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache:
http://www.deutschestextarchiv.de/ (accessed 2019-04-26).

Haaf, S., Geyken A., and Wiegand, F. (2014). The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources.
Journal of the Text Encoding Initiative (8). doi:10.4000/jtei.1114.

Ágel, V. and Hennig, M. (2008, Version 1.1).
KAJUK. Justus-Liebig-Universität Gießen.

ISO 24622-1 (2015).
Language resource management. Component Metadata Infrastructure (CMDI). 01.140.20. https://www.iso.org/standard/37336.html (accessed 2018-11-26).

NISO (2004).
Understanding Metadata. NISO Press. ISBN: 1-880124-62-9.

Odebrecht, C. (2017).
Metadata for Historical Corpora. Realization of the Metamodel for Corpus Metadata with the help of TEI Customization [Data set]. Zenodo. doi:

Odebrecht, C. (2018).
MKM – ein Metamodell für Korpusmetadaten. Dokumentation und Wiederverwendung historischer Korpora. Phd Thesis. Humboldt-Universität zu Berlin, Sprach- und literaturwissenschaftliche Fakultät, Berlin. doi: 10.18452/19407.

Rumpe, B. (2011).
Modellierung mit UML. Berlin, Heidelberg: Springer Berlin Heidelberg.

Schroeder, C. T., and Zeldes, A. (2016). Raiders of the Lost Corpus.
Digital Humanities Quarterly 10 (2). http://www.digitalhumanities.org/dhq/vol/10/2/000247/000247.html (accessed 2018-11-26).

Simons, G. and Bird, S. (2008). Toward a global infrastructure for the sustainability of language resources. In
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, pp. 87–100.

Stührenberg, M. (2012). The TEI and Current Standards for Structuring Linguistic Data.
Journal of the Text Encoding Initiative (3). doi: 10.4000/jtei.523

TEI Consortium (2019).
TEI P5: Guidelines for Electronic Text Encoding and Interchange.

Version 3.5.0,
http://www.tei-c.org/Guidelines/P5 (accessed 2019-03-02)

Zeng, M. L., and Qin, J. (2016).
Metadata. Second Edition. London: facet publishing.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO