Coping With The Complexity Of The TXM Platform Annotation Services With A Unified TEI Encoding Framework

paper, specified "long paper"
Authorship
  1. 1. Serge Heiden

    Ecole Normale Supérieure de Lyon (ENS de Lyon)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Introduction
TXM is a software platform offering textual corpora analysis tools and services. It is delivered as a standard desktop application for Windows, Mac and Linux and as a web portal server application (Heiden, 2010), <
>.

TXM provides a consistent set of analysis tools combining qualitative (or close reading) tools such as word frequency lists, concordancing or text edition hypertextual navigation with synthetic quantitative (or distant reading) tools such as factorial analysis, clustering, keywords or co-occurrence statistical analysis.
To work on texts, the platform first imports the corpus sources to create a rich XML-TEI based internal pivot representation via the following general workflow:

first the “base text” of each text is established: this operation implements “digital philology” principles and consists of decoding information in the various formats of the source documents to decide primarily where are the text limits, possible internal textual structures boundaries and the words of the text.
To do this, TXM can analyze and represent three main types of corpora:

corpora of
written texts, possibly including paginated editions including display side-by-side of the transcription and the images of facsimiles;

record transcriptions corpora, possibly time synchronized at the word or at the speech turn level with the audio or video source to provide playback;

and parallel
multilingual corpora aligned at the level of textual structures such as sentences or paragraphs.

The result of this operation is represented in a pivot XML format especially designed for TXM called “XML-TEI TXM” extending the standard encoding recommendations of the Text Encoding Initiative consortium (TEI Consortium, 2017);

then, natural language processing (NLP) tools are optionally applied to the base text to automatically add linguistic information like sentence boundaries, grammatical categories (pos = part of speech) and lemma of words by eg TreeTagger (Schmid, 1994). As NLP tools generally don’t take XML format as input, the pivot representation is first converted to plain text for NLP processing and results are injected back into the XML-TEI TXM representation;
finally a specialized representation of the texts is built into TXM for efficient execution of its different tools (by indexing for search engines or by rendering in HTML 5+CSS+Javascript for text editions navigation).

From a methodological point of view:

the XML tags of the initial XML-TEI TXM representation in a) can be seen as
manual annotations added to the base text (or raw text), typically philologically edited with the help of specialized XML editors (like Oxygen XML Editor

https://www.oxygenxml.com
) outside of TXM when the source is XML native, or as
automatic annotations added by TXM when converting the sources from other digital formats (like TXT, DOC, etc.) into XML-TEI TXM.

NLP tools processing results in step b) can be seen as
automatic annotations added to the initial XML-TEI TXM representation of texts built in work step a);

All TXM tools can then be applied indiscriminately to all types of annotations through a unified textual corpus data model regardless of their origin (automatic or manual).

Thus, so far TXM has implemented a traditional digital philology workflow combining an initial “text source encoding and annotation” step to a following “application of analysis tools on annotated texts” step. The text analysis tools use text annotations (for example word pos or some internal textual structure) to offer their services and produce their results (for example the frequency index of all infinitive verbs found in a section). The workflow is unidirectional and the whole of it must be passed through again completely if any annotation needs to be corrected. To add or correct annotations, the user has to edit the sources or the annotations outside of TXM. For example word properties can be exported from the XML-TEI TXM representation in a file in tabulated format, edited in a spreadsheet and injected back into the texts before re-import into TXM
see for example this tutorial based on TXM macros:
.
.

This paper introduces new services developed in TXM to annotate directly texts from within the results view of specific tools for a better integration of philological and analytical work. Indeed, results views are great places to be aware of annotation errors or annotation needs, and to access what needs to be corrected or annotated.

Interactive annotation services in TXM
The three new annotation services concern both adding and correcting information, and all the annotations edited are meant for further exploitation by usual TXM tools.

Concordance based SyMoGIH entities annotation
The first service, developed in partnership with the LARHRA research laboratory in history

http://larhra.ish-lyon.cnrs.fr

, is based on the annotation of concordance pivots: any sequence of words composing the pivots can be annotated with any semantic category
pivotscan also optionally be annotated with simple keywords or with key-value pairs, managed by TXM in a local repository. of the SyMoGIH

historical knowledge base (Beretta, 2015). In this architecture, the SyMoGIH platform hosts the ontology of historic facts and knowledge, and TXM concordances provide the user interface to link identifiers of those data to text spans for further analysis.

As an illustration, see figure 1 the annotation of the “Faculté de droit d’Aix” entity (of id CoAc13562) in unverified OCRed texts of the “Bulletin administratif de l'Instruction publique" corpus
see the Bibliothèque historique de l'éducation (BHE) project:

.

Figure 1. TXM screenshot of a Concordance of a “Faculté de droit d’Aix” word sequence pattern to annotate (top) and of browsing SyMoGIH semantic categories to find the CoAc13562 identifier to use for the annotation (bottom).

TXM internal management of those annotations is equivalent to a re-import of the current pivot representation with the new annotations. After re-import (after saving annotations) the new annotations are available for all TXM tools to work on like any original “annotation” of the texts (with internal textual structures and their properties).

Concordance based word properties annotation
The second service is based on the annotation of words of concordance pivots: a word present in the pivots
in TXM, pivots of concordances can be composed of a sequence of words. of a concordance can be annotated with any property. The primary goal of that service is to annotate and correct pos and lemma properties of words, but it can help to annotate any property at the single word level.

As an illustration, see figure 2 the correction of the “pos” property of some “vers.” words used in biblical references in Hobbes works lemmatized by Morphadorner (Burns, 2013).

Figure 2. TXM screenshot of a Concordance to set the “pos” property to the “n-ab” value of two occurrences of the "vers." word, selected by their concordance line.

TXM internal management of those annotations is equivalent to a re-import of the current pivot representation with new annotations encoded in XML-TEI TXM at the word level.

Full text URS annotation in text edition
The third annotation service is based on manual annotation of sequence of words inside text editions with elements of a Unit-Relation-Schema (URS) annotation model (Widlöcher & Mathet, 2009). URS type annotations are designed to encode complex discourse entities like co-reference chains in texts (Schnedecker et al., 2017).
As an illustration, see figure 3 the annotation of the “ses loix” sequence of words with a URS unit of type MENTION, having its grammatical category to the value “GN.POS” and its referent to the value “les lois de la divinité”, in the first chapter of the 1755 edition of
De l'esprit des lois by Montesquieu.

Figure 3. TXM screenshot displaying the first page of an edition of
De l'esprit des lois highlighting in light yellow all URS units of type MENTION and in bold the unit currently selected (top window), and displaying the current unit properties value input form (bottom window): CATEGORY property at value “GN.POS”...

TXM import/export management services represent those annotations as XML-TEI stand-off annotations anchored to the word elements of the XML-TEI TXM representation of texts (Grobol et al., 2018).

Discussion
By using a common XML-TEI TXM pivot representation for internal management of corpora for all the annotation services, TXM unifies transcription, encoding and annotation activities in a single framework. In this framework, annotations can represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results used further for analysis and interpretation work. The reflexive nature of the resulting text analysis workflow is schematized in figure 4. Texts are first digitized by OCR, transcribed or converted from digital formats. They are then possibly philologically corrected and established through XML-TEI manual encoding. Then automatically processed by NLP tools while being imported into TXM to produce the TXM internal corpus model. Corpus analysis is then assisted by TXM tools applied to the corpus model. The pivot representation that gathers all annotations produced by annotation tools is figured as the node labeled « Pivot rep. » and the interpretation workflow itself is figured as a digital hermeneutic circle.

Figure 4. Digital hermeneutic circle integration into TXM.

Legend

blue box:
manual annotation activity
black box:
tool

red box:
automatic annotation activity
green box:
TXM corpus data model

purple disk:
data representation
black arrow:
activity

green arrow:
annotation equivalence

Conclusion
All the new annotation services integrated into TXM are building a comprehensive annotation-based digital text corpora analysis platform. From an epistemological point of view, the integration in TEI of the different annotation models and tools into the platform helps its users to better define and trace what comes from the source corpus they analyze and what comes from their own or from others interpretation work.
This work was funded by the ANR and the DFG under grant numbers
ANR-15-CE38-0008 (DEMOCRAT project) and
ANR-14-FRAL-0006 (PaLaFra project).

Bibliography

Beretta, F. (2015). Publishing and sharing historical data on the semantic web : the SyMoGIH project – symogih.org. Presented at the Workshop:
Semantic Web Applications in the Humanities. Retrieved from https://halshs.archives-ouvertes.fr/halshs-01136533.

Burns, P. R. (2013). MorphAdorner v2: A Java Library for the Morphological Adornment of English Language Texts. Evanston, IL. Northwestern University.

Grobol, L., Landragin, F.
and
Heiden, S. (2018). XML-TEI-URS: using a TEI format for annotated linguistic resources.
CLARIN Annual Conference 2018. Pisa, Italy https://hal.archives-ouvertes.fr/hal-01827563.

Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In Otoguro, R., Ishikawa, K., Umemoto, H., Yoshimoto, K. and Harada, Y. (eds),
24th Pacific Asia Conference on Language, Information and Computation – PACLIC24. Sendai: Institute for Digital Enhancement of Cognitive Development, pp. 389-98. <https://halshs.archives-ouvertes.fr/halshs-00549764/en> (accessed 15 April 2019).

Schmid, H. (1994). Probabilistic Part-Of-Speech Tagging Using Decision Trees. In
Proceedings of the International Conference on New Methods in Language Processing (Vol. 12).

Schnedecker, C., Glikman, J.,
and
Landragin, F. (2017). Les chaînes de référence : annotation, application et questions théoriques.
Langue française, (195), 5–16. https://doi.org/10.3917/lf.195.0005

TEI Consortium. (2017). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Retrieved from http://www.tei-c.org/Guidelines/P5

Widlöcher, A.,
and
Mathet, Y. (2009). La plate-forme Glozz: environnement d’annotation et d’exploration de corpus. In
Actes de la 16e Conférence Traitement Automatique des Langues Naturelles (TALN’09), session posters (p. 10). Senlis, France, France. Retrieved from https://hal.archives-ouvertes.fr/hal-01011969

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.