A Collaborative, Indeterministic and partly Automatized Approach to Text Annotation

Thomas Bögel; Evelyn Gius; Marco Petris; Jannik Strötgen

Authorship

1. Thomas Bögel, Ruprecht-Karls-Universität Heidelberg (University of Heidelberg)
2. Evelyn Gius, Universität Hamburg (University of Hamburg)
3. Marco Petris, Universität Hamburg (University of Hamburg)
4. Jannik Strötgen, Ruprecht-Karls-Universität Heidelberg (University of Heidelberg)

1. Description
The web-based system CATMA (Computer Aided Text Markup and Analysis) was designed to address the interest that essentially motivates human encounters with literature: hermeneutic, i.e., “meaning”-oriented, high-order interpretation. In the scholarly interpretation of literature we are not looking for the right answer, but for new, plausible and relevant answers. This requires a truly hermeneutic markup as defined by Piez (2010: paragraph 1):

By "hermeneutic" markup I mean markup that is deliberately interpretive. It is not limited to describing aspects or features of a text that can be formally defined and objectively verified. Instead, it is devoted to recording a scholar's or analyst's observations and conjectures in an open-ended way. As markup, it is capable of automated and semi-automated processing, so that it can be processed at scale and transformed into different representations. By means of a markup regimen perhaps peculiar to itself, a text will be exposed to further processing such as text analysis, visualization or rendition. Texts subjected to consistent interpretive methodologies, or different interpretive methodologies applied to the same text, can be compared. Rather than being devoted primarily to supporting data interchange and reuse – although these benefits would not be excluded – hermeneutic markup is focused on the presentation and explication of the interpretation it expresses.

CATMA has been developed to support McGann’s (2004) open-ended, discontinuous, and non-hierarchical model of text processing. Its non-deterministic approach to markup allows the user to express many different readings directly in markup. The system not only enables collaborative research but is also based on an approach to markup that transcends the limitations of low-level text description. CATMA supports high-level semantic annotation through TEI-compliant, non-deterministic stand-off markup and acknowledges the standard practice in literary studies, i.e., a constant revision of interpretation (including one’s own) that does not necessarily amount to falsification. Moreover, it enables users to switch ad hoc between text annotation and text analysis in either direction as well as recursively.
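
The idea of non-deterministic stand-off markup can be illustrated with a minimal Python sketch. It is not CATMA's data model or its TEI serialization; the class `Annotation`, the helper `annotations_at`, the tag names and the sample sentences are all assumptions made for illustration. The sketch only shows two properties: annotations live outside the text as character-offset ranges, and several readings, including competing or overlapping ones, can coexist on the same passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text and never alters it."""
    start: int       # character offset where the span begins (inclusive)
    end: int         # character offset where the span ends (exclusive)
    tag: str         # hypothetical tag name, chosen for this example only
    annotator: str   # who proposed this reading


def spans_overlap(a, b):
    """True if the two half-open ranges [start, end) share any character."""
    return a.start < b.end and b.start < a.end


def annotations_at(annotations, start, end):
    """Return every annotation that touches the range [start, end)."""
    probe = Annotation(start, end, "", "")
    return [a for a in annotations if spans_overlap(a, probe)]


TEXT = "She left early. He had arrived the day before."

ANNOTATIONS = [
    Annotation(0, 15, "chronological_order", "reader_a"),
    Annotation(16, 46, "flashback", "reader_a"),
    # A competing reading of the same span; neither reading is "the" answer.
    Annotation(16, 46, "chronological_order", "reader_b"),
    # A span that crosses the sentence boundary, so no hierarchy is assumed.
    Annotation(9, 22, "focus_shift", "reader_b"),
]

# List every reading that touches the second sentence.
for a in annotations_at(ANNOTATIONS, 16, 46):
    print(f"{a.annotator}: {a.tag} on {TEXT[a.start:a.end]!r}")
```

Because the annotations are kept apart from `TEXT`, adding, revising or contradicting a reading never requires changing the text or any earlier annotation.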

In 2013, in the joint project heureCLÉA, two research teams (one of computer scientists, the other of narratologists) started to focus on an exemplary hermeneutic "use case": the decoding of temporal information in narratives, namely the automatic detection of temporal phenomena in literary narratives.

For this purpose, we developed an approach based on both the manual annotation of narratological phenomena and the rule-based extraction and normalization of temporal expressions, which are used as a starting point for machine learning. The project is still ongoing, but the automated annotation of temporal expressions and of other linguistic features, such as part-of-speech (POS) tags, sentence boundaries, and tense (based on morphological analysis), has already been implemented in CATMA and can be used for combined automatic and manual annotation of texts.
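
As a rough illustration of what rule-based extraction and normalization of temporal expressions involves, the following Python sketch applies one hand-written rule to English date expressions and maps each match to an ISO 8601-style value. It is a toy under stated assumptions and does not reproduce HeidelTime's rules, resources or interface; the names `MONTHS`, `DATE_RULE` and `extract_temporal_expressions` and the sample sentence are hypothetical.

```python
import re

# Toy lexicon (an assumption for this example): month names mapped to 1-12.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# One hypothetical rule: an optional day, a month name and a four-digit year,
# e.g. "12 March 1890" or "May 1891".
DATE_RULE = re.compile(
    r"\b(?:(?P<day>\d{1,2}) )?(?P<month>" + "|".join(MONTHS) + r") (?P<year>\d{4})\b"
)


def extract_temporal_expressions(text):
    """Return (start, end, surface, normalized_value) for every rule match.

    Extraction finds the surface string and its offsets; normalization turns
    it into a comparable value. No calendar validation is performed.
    """
    results = []
    for match in DATE_RULE.finditer(text):
        value = f"{match['year']}-{MONTHS[match['month']]:02d}"
        if match["day"] is not None:
            value += f"-{int(match['day']):02d}"
        results.append((match.start(), match.end(), match.group(0), value))
    return results


SAMPLE = "He arrived on 12 March 1890 and left again in May 1891."
for start, end, surface, value in extract_temporal_expressions(SAMPLE):
    print(f"{start}-{end}: {surface!r} -> {value}")
```

The offsets returned with each match are what would allow such automatically produced values to be stored alongside manual annotations of the same text.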

In our tutorial, we will introduce the core annotation and analysis functionalities of CATMA and show how they can be combined with the annotations provided automatically by HeidelTime and other components. Participants will have the opportunity to test the tool in a hands-on session in which they can annotate their own texts or collaboratively annotate a text that we provide. We would also like to engage participants in a design critique of CATMA and its components and in a general discussion about requirements for text analysis tools in their fields of interest.

2. Tutorial Instructors
All tutorial instructors are members of the development team of the heureCLÉA project. We have presented and taught CATMA, HeidelTime and heureCLÉA on various national and international occasions in recent years. Two of us have also included crucial aspects of heureCLÉA in our PhD research projects.

Thomas Bögel, Institute of Computer Science, Heidelberg University

Thomas studied computational linguistics and is currently working as a researcher and pursuing his PhD at the Institute of Computer Science at Heidelberg University. His research focuses on event extraction and timeline generation, as well as the development of machine learning-based systems for temporal relation extraction from narrative texts.

Evelyn Gius, Department of Languages, Literature and Media, Faculty of the Humanities, University of Hamburg

Evelyn was trained as a computational linguist and now works in the field of literary computing as a researcher and lecturer. For her PhD project she has used CATMA to explore the benefits of applying narratological categories from literary studies to the analysis of narrations of real-life labor conflicts.

Marco Petris, Department of Languages, Literature and Media, Faculty of the Humanities, University of Hamburg

Marco is a computer scientist with a strong affinity for the humanities and has been engaged in the creation of CATMA from the very beginning. As a research developer he is involved in all aspects of the design and implementation of tools for the Digital Humanities.

Jannik Strötgen, Institute of Computer Science, Heidelberg University

Jannik studied computational linguistics and economics at Heidelberg University before he joined the Institute of Computer Science as a researcher and PhD student. His research focuses on temporal and geographic information extraction and retrieval, and he is the main developer of the widely used, multilingual, cross-domain temporal tagger HeidelTime, which achieved the best results for the task of temporal tagging at TempEval-2010 (English) and TempEval-2013

Contact address

Evelyn Gius
Universität Hamburg
Department of Languages, Literature and Media
Institut für Germanistik
Von-Melle-Park 6
20146 Hamburg
Tel +49 40 42838 6942
Fax +49 40 42838 3553
evelyn.gius@uni-hamburg.de

3. Target audiences and number of participants
The primary users of CATMA are literary scholars and graduate and undergraduate students of literary studies. Nevertheless, this tutorial is likely to be of interest also to:

humanities scholars of all fields concerned with text analysis (with or without experience in digital text analysis)
software developers in the humanities interested in non-deterministic text analysis and automated annotation

Expected number of participants: We can accommodate up to 25 participants.

4. Special requirements
Participants will be asked to bring their own laptops. We will need internet access for all participants and a screen projector.

5. Outline of the tutorial
The tutorial is designed as a 3.5-hour session, including a break of approx. 30 minutes. Provisional format:

introduction to CATMA (10 min)
the CATMA approach to markup: indeterministic and collaborative markup functionalities (20 min)
automated tagging of temporal expressions provided by HeidelTime and other linguistic annotations provided by the UIMA pipeline (30 min)
hands-on session: annotating texts (30 min)
break (30 min)
hands-on session: annotating texts (60 min)
the heureCLÉA approach to narratological phenomena of time (15 min)
wrap-up discussion (15 min)

References

www.catma.de (last seen 2014-02-17)

Piez, Wendell (2010), Towards Hermeneutic Markup: An architectural outline, King's College, DH 2010, London. Available from: http://dh2010.cch.kcl.ac.uk/academicprogramme/abstracts/papers/html/ab743.html (last seen 2014-02-17).

Notes

We define the distinction between description and interpretation as follows: description cannot tolerate ambiguity, whereas an interpretation is an interpretation if and only if at least one alternative to it exists. Note that alternative interpretations are not subject to the formal restrictions of binary logic: they can affirm, complement or contradict one another. In short, interpretations are of a probabilistic nature and highly context-dependent.

heureCLÉA is a BMBF-funded eHumanities project run jointly by the University of Hamburg and Heidelberg University since the beginning of 2013 (cf. www.heureclea.de, last seen 2014-02-17).

Cf. the accepted paper by Janina Jacke and Jan Christoph Meister: Pushing Back the Boundary of Interpretation: Concept, Practice and Relevance of a Digital Heuristic.

Conference Info

ADHO - 2014

"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO
