heureCLÉA (www.heureclea.de) is a BMBF-funded eHumanities project1 which combines the two conceptual perspectives on object annotation that set apart the humanities and the 'hard sciences': strictly rule-based explication of uncontroversial object features as exemplified in the measurement of values, such as length, height, density, etc., versus the hermeneutic response for which observation and emotive engagement from a subjective point of view must go hand in hand in order to facilitate interpretation. These are of course ideal types and the actual practice of annotation in the humanities is situated at their interface, which is heuristics—the methodologically controlled 'art of finding' that goes beyond pure measurement, but whose purpose it is to generate relevant questions rather than conclusive answers. Against this backdrop heureCLÉA aims to implement a digital heuristics module for the text annotation tool CATMA (www.catma.de) so that we may benefit from synergies between the computationally automated and the subjectively motivated, human generated annotation of texts.
1. The heureCLÉA Project
The practical backbone of heureCLÉA consists of two software developments. HeidelTime, which was developed at Heidelberg University, is a rule-based system for the extraction and normalization of temporal expressions.2 It needs to be significantly modified to cope with the complexity of literary narratives.3 The CATMA (Computer Aided Textual Markup & Analysis) markup tool was developed at Hamburg University. The current release of CATMA (version 4.0) is open source and provides a robust web-based annotation environment for collaborative markup.4 CATMA not only supports intuitive text annotation in a flexible, XML/TEI-compliant format, but also integrates markup with analytical and visualization functions (cf. Fig. 1). This enables users to switch ad hoc between text annotation and text analysis in either direction as well as recursively. CATMA thus supports what Burnard referred to as the 'continuous turning of the hermeneutic wheel'5, i.e. the back and forth between formal text analysis and the generation of interpretative hypotheses. In its most recent development phase (called CLÉA)6, CATMA progressed from a stand-alone desktop application to a web-based solution. This adds yet another conceptual dimension, that of collaborative markup:7 in CATMA researchers can now share, reuse, amend and dispute each other's markup.
Fig. 1: Exploration and Annotation in CATMA
CATMA's overall design is based on the premise that a DH tool should emulate the methodological and social practice of traditional philology as closely as possible. This practice integrates three methodological primitives: analytical/declarative operations, hermeneutic operations, and discursive critique of explications and theories.8 This conceptual high-level design goal also determined our choice between the two competing paradigms of embedded (in-line) markup and external standoff markup, which we regard as methodological rather than technological opposites. On this view, embedded markup represents the idea of objective taxonomic universality and the potential immanence of 'truth' in an object. External standoff markup, on the other hand, is based on the acknowledgment of the contingent nature and historical transience of object interpretation. In a contemporary philological perspective, embedded markup therefore constitutes something of a methodological anachronism: conceptually it resembles the pre-Enlightenment model of canonical text exegesis, which the modern humanities have long since replaced with a critical, self-reflexive hermeneutic approach. Yet this critique is of course of a purely philosophical nature when dealing with pre-interpretive analytical and declarative tasks, such as POS tagging.
However, once higher-level semantics are at stake, these considerations force us to adopt a truly 'hermeneutic' approach to markup.9 Interpretation varies depending on interpreter, context and interpretive theory. Accordingly, even elementary markup produced in order to support higher-level interpretation must still remain transparent as one possible account among many, and users must be able to produce and store ambiguous and indeed even contradictory markup for the same text in a standoff manner. Since rich interpretations are best generated in a discursive practice, it is also necessary to enable the easy sharing and combining of markup generated by different interpreters.
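The requirement that contradictory markup for the same text be storable and shareable can be illustrated with a minimal sketch. It is not CATMA's actual data model: the `Annotation` class, the tag names, the annotator names and the helper functions are all hypothetical and chosen only to show the principle that standoff annotations point into an unchanged text by character offsets and can therefore coexist even when they disagree.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One standoff annotation; the text itself is never modified."""
    start: int      # inclusive character offset into the text
    end: int        # exclusive character offset into the text
    tag: str        # hypothetical tag name, e.g. "analepsis"
    annotator: str  # who made this claim about the text


def merge(*collections):
    """Combine markup from several annotators without resolving conflicts."""
    merged = []
    for collection in collections:
        merged.extend(collection)
    return merged


def contested_spans(annotations):
    """Return the spans that carry more than one distinct tag."""
    tags_by_span = defaultdict(set)
    for a in annotations:
        tags_by_span[(a.start, a.end)].add(a.tag)
    return {span: tags for span, tags in tags_by_span.items() if len(tags) > 1}


text = "Years later she remembered that evening."

# Two invented annotators read the same phrase differently.
alice = [Annotation(0, 11, "prolepsis", "alice")]
bob = [Annotation(0, 11, "analepsis", "bob"),
       Annotation(16, 26, "event", "bob")]

shared = merge(alice, bob)
conflicts = contested_spans(shared)

for (start, end), tags in conflicts.items():
    print(repr(text[start:end]), "is tagged as", sorted(tags))
```

Because disagreement is kept rather than overwritten, the contested spans themselves become data that can be inspected or counted later.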
However, while these desiderata can all be considered emulative goals which informed the development of CATMA, their conceptual benefit for the digital humanities at large lies elsewhere. A truly non-deterministic and discursive approach to markup yields diverse annotation data—and that type of data can subsequently be analyzed in order to "push back" the boundary separating interpretation and declaration.
What this means is best illustrated by outlining the three components and phases of heureCLÉA (cf. Fig. 2):
1. Narratological Analysis of Temporal Phenomena: In heureCLÉA the identification of the temporal structure of narrative texts is approached concurrently through
(a) manual collaborative annotation with CATMA, and
(b) automated temporal tagging with HeidelTime.
In (a) we draw upon (but do not restrict our taggers to) the narratological taxonomy of Genette10, which is supplemented by a taxonomy suited to capture action and event segmentation. In (b) automated temporal annotations are generated via a UIMA pipeline11 that includes HeidelTime as a rule-based temporal tagger, the TreeTagger12 as a POS tagger, and Morphisto13 for morphological analysis.
2. Machine Learning Approach towards an Automation of Complex Time Annotation: The next step is the learning of new rules for automated annotation from the manually generated markup. Different methods for the derivation of rules—especially those for typical co-occurrence of temporal expressions—are used. Once integrated into the components of the heureCLÉA UIMA pipeline these rules enable the system to handle more complex annotations. This process is dynamic as growing quantities of markup facilitate more complex modeling strategies based on e.g. distributional approaches (such as Latent Semantic Analysis). Finally, patterns representing typical temporal sequences may be extracted (Sequence Mining).
3. Integration of the Heuristic heureCLÉA Module into CATMA: Once a functional threshold has been passed (i.e. reliable automated detection of temporal references of low complexity, plus adequate performance and robustness), the heureCLÉA UIMA pipeline will be integrated into CATMA as a service. It will then provide a 'digital heuristics' for the partially automated, partially interactive generation of temporal markup, and will be tested and evaluated to verify the adequacy of the automatically generated markup (partially through stochastic methods).
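The first two phases can be made more concrete with a deliberately small sketch. It does not use HeidelTime, UIMA, TreeTagger or Morphisto and does not reproduce their rules or interfaces: the three regular-expression rules, the type labels and the function names are invented for illustration. The sketch shows rule-based tagging of temporal expressions as standoff triples, followed by a count of which expression types follow one another, which is one very simple stand-in for the co-occurrence and sequence analysis described above.

```python
import re
from collections import Counter

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
NUMBERS = r"\d+|one|two|three|four|five|six|seven|eight|nine|ten"

# Hypothetical toy rules: (type label, pattern). Real rule sets are far richer.
RULES = [
    ("DATE", re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")),
    ("DURATION", re.compile(rf"\b(?:{NUMBERS}) (?:day|week|month|year)s?\b",
                            re.IGNORECASE)),
    ("TIME", re.compile(r"\b(?:morning|noon|evening|night|midnight)\b",
                        re.IGNORECASE)),
]


def tag_temporal(text):
    """Return (start, end, type) triples for all rule matches, in text order."""
    hits = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


def type_bigrams(hits):
    """Count adjacent pairs of expression types in their order of occurrence."""
    labels = [label for _, _, label in hits]
    return Counter(zip(labels, labels[1:]))


text = "On 3 May 1890 he left. Three years later, one evening, he returned."
hits = tag_temporal(text)
pairs = type_bigrams(hits)

for start, end, label in hits:
    print(label, repr(text[start:end]))
print(pairs.most_common())
```

On a large body of manually corrected markup, frequent type sequences of this kind could serve as candidates for new tagging rules; the counting step here only hints at that and is no substitute for the learning methods the project names.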
Fig. 2: heureCLÉA's interrelated areas of work
2. The Heuristics/Hermeneutics Divide As a Conceptual Boundary in the Digital Humanities
What is the methodological relevance of this work toward a digital heuristics? The realization that text markup is essentially interpretive per se is anything but new.14 Indeed, the argument about what markup is seems somewhat artificial; it might have sufficed to ask literary scholars what markup is there for: in their view the raison d'être of any object annotation and classification is always interpretation. But is the boundary between a declarative and an interpretive method rigidly defined for the digital humanities?
Some experiences gained in the heureCLÉA project may offer an answer. To begin with, the hermeneutic nature of building time constructs from narratives proves to be only partially owed to the 'fuzziness' of natural language. As a complex symbolic system, narrative is also characterized by the intricate coupling of a referential and an indexical semiotics. 'Time' illustrates this: on the surface it is referenced as the chronological structure of a particular 'story world'.15 Yet on a deeper level it is also communicated as an implicit processing instruction encoded in the form of temporal deixis and ellipsis, and of linguistic markers for the contraction or expansion of time (viz. a summary vs. a scenic description). This information is termed 'indexical' because it refers back to the instance of utterance (the narrator or narrating character). The reader needs to process that information in parallel with the referential information in order to reconstruct two intersecting chronologies: that of the 'story world', and that of the representational discourse which the reader is trying to reverse-engineer. However, the reach of the indexical extends beyond the text-reader system: as we read we also become subject to the indexical temporality of the 'how' of representation (discours) on an existential level.16 As Ricoeur argues, the reconstruction of temporality on the referential and discursive levels of narrative is indeed how we learn to experience temporality.17
Time is only one of many phenomena of narrative representation and understanding characterized by this triple-layered semiotics—and against this backdrop it becomes clear why hermeneutic text interpretation cannot be automated. The machine is (as yet) not an interpreting agent able to engage reflexively and speculatively with its object. It is confined to a fully explicated, operational definition of 'relevance' in terms of known tasks and objectives. While it can resolve an indexical reference in order to compute, say, chronological order and extension, the interpretation of the existential relevance of that double-encoded message as one which also addresses the interpreting mind will remain beyond its ability for as long as it is not equipped with a concept of 'mind'.
However, the foundational human interpretive operations taking place on the elementary declarative and inferential levels can in part be approximated statistically through recursive routines. This recursive approximation is the functional characteristic as well as the outer boundary of what we term a 'digital heuristic'. By this we mean a computational tool able to support the heuristic operations of analytical identification and categorization of phenomenological primitives that necessarily precede hermeneutic synthesis. These operations form the indispensable basis of any higher-order 'interpretation'. It is in this intersecting terrain of the hermeneutic and the heuristic that the digital humanities might help to "push back the border" and question the hegemony of interpretation.
1. See dbs.ifi.uni-heidelberg.de/index.php?id=129&L=1.
2. Mazur, P. and Dale, R. (2010). WikiWars: A New Corpus for Research on Temporal Expressions. "Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010)". Massachusetts, pp. 913-922.
3. Ricoeur, P. (1983ff). Time and Narrative (Temps et Récit), 3 vols. trans. Kathleen McLaughlin and David Pellauer. Chicago: University of Chicago Press, 1984, 1985, 1988 (1983, 1984, 1985).
4. Burnard, L. (2001). On the Hermeneutic Implications of Text Encoding. In Fiormonte, D. and Usher, J. (eds.), "New Media and the Humanities: Research and Applications". Oxford: Humanities Computing Unit, pp. 31-38. Available online in the 1998 version at users.ox.ac.uk/~lou/wip/herman.htm [accessed 20 September 2013].
5. "CLÉA" is short for "Collaborative Literature Éxploration & Annotation". The accent aigu was added to highlight the diacritical concerns of non-Anglo literary scholars. The CLÉA development phase of CATMA was generously supported by two Google Digital Humanities Awards (2010, 2011). For further details see www.catma.de/clea.
6. Meister, J. C. (2012). Crowd Sourcing 'True Meaning'. A Collaborative Markup Approach to Textual Interpretation. In McCarty, W. and Deegan, M. (eds.), Collaborative Research in the Digital Humanities. Festschrift for Harold Short. Ashgate Publishers: Farnham, Surrey/Burlington, pp. 105-122.
7. McCarty, W. (1996). Implicit Patterns in Ovid's Metamorphoses. "Centre for Computing in the Humanities (CHWP 1996)". projects.chass.utoronto.ca/chwp/mccarty/mcc_8.html (accessed 11 September 2013). First published 1991.
8. Piez, W. (2010). Towards Hermeneutic Markup: an Architectural Outline. "Digital Humanities 2010. Conference Abstracts". London: Office for Humanities Communication, Centre for Computing in the Humanities, King's College London, pp. 202-205.
9. Genette, G. (1972). Discours du récit. In id., "Figures III". Paris: Editions Du Seuil, pp. 67-282.
10. Strötgen, J. and Gertz, M. (2010). HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. "Proceedings of the 5th International Workshop on Semantic Evaluation (ACL 2010)". Uppsala, pp. 321-324.
11. Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. "Proceedings of International Conference on New Methods in Language Processing". Manchester.
12. Zielinski, A., Simon, C., and Wittl, T. (2009). Morphisto: Service-Oriented Open Source Morphology for German. "State of the Art in Computational Morphology: Workshop on Systems and Frameworks for Computational Morphology (SFCM 2009)". Zürich, pp. 64-75.
13. See among others Coombs, J. H., DeRose, S. J., and Renear, A. H. (1987). Markup systems and the future of scholarly text processing. "Communications of the ACM", 30 (11): 933-47; Buzzetti, D. (2002). Digital Representation and the Text Model. New Literary History, 33 (1): 61-88; Burnard (2001); Piez (2010).
14. Herman, D. (2002). Story Logic: Problems and Possibilities of Narrative. Lincoln: University of Nebraska Press.
15. Genette (1972).
16. Ricoeur, P. (1983ff). Time and Narrative (Temps et Récit), 3 vols. trans. Kathleen McLaughlin and David Pellauer. Chicago: University of Chicago Press, 1984, 1985, 1988 (1983, 1984, 1985).
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
July 7, 2014 - July 12, 2014
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)