Modelling Text-Genetic Relationships

paper, specified "long paper"
  1. 1. Dirk Van Hulle

    Universität Antwerpen (University of Antwerp)

  2. 2. Joshua Schäuble

    Universität Antwerpen (University of Antwerp)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The discipline of genetic criticism regards text as a dynamic rather than static object and tries to “put the text back into motion, opening it to the moving constellations that presided its genesis” (Contat et al. 1996, 2). Consequently, such a dynamic perspective implies an equally dynamic model of representation.
A traditional danger of manuscript research is that the researcher gets lost in the details of the archival material. Over the last decades, numerous editing projects explored the potential to capture text-genetic processes digitally. Most of these digital editions and archives, such as WoolfOnline, the Jane Austen Fiction Manuscripts, the Shelley-Godwin Archive, present the textual genesis in two ways. First by arranging the extant source documents in a stemmatologically established order, giving the reader an insight into the chronological document succession from a first note or draft to a first edition. Secondly, the individual documents are transcribed and critically annotated in rich detail, giving the user an insight into what each extant version looked like on paper. In addition to these features, some editions also integrate extra tools, such as a writer’s correspondence or an author’s own reading in the form of a digital or virtually reconstructed library.
What all these projects have in common is a hierarchical data model. Documents are stored in collections (folders and subfolders) which could well be visualized as directed rooted tree-graphs. In such a tree the edition as the overarching structure represents the root element, (sub-)collections represent its child elements and the individual documents are descendants. Below the document level, the tree continues in strictly hierarchical TEI encodings that lead us down to the smallest annotated unit of a document – a phrase, a word, sometimes a letter – nested in XML brackets. Up to this point, in terms of data structures, the edition can be visualized as a single coherent tree structure. There are three common practices to describe the textual genesis against the background of this hierarchical representation of the material.

Stemmatological metadata describe relationships that break this strict tree structure. The documents are arranged in a chronological sequence that might well vary from the documents’ physical order in the collections as represented by the tree. Yet, these genetic relationships always link elements on the same level of the tree’s hierarchy – the document level. They neither allow to
zoom in on deeper levels of the tree nor to derive information on how text units on these finer levels of granularity are genetically interconnected.

Annotating the textual genesis within individual documents, e.g. with text-genetic TEI encodings (TEI Consortium 2011 §11), allows us to link nodes of the tree (here XML elements) across the hierarchical tree structure. Additions, deletions and substitutions are assigned to groups (tei:change elements) which are put into sequences (ordered and unordered tei:listChange elements) in the metadata. Just like the stemmatological metadata, these structures represent genetic paths that break the hierarchy, yet in this case they do not allow us to
zoom out along the tree or the stemmatological document relations. We cannot draw conclusions about how the sequential making of an individual version is connected to the textual genesis across multiple versions.

Collation software such as CollateX (Dekker and Middell 2011) and the upcoming HyperCollate (Bleeker et al. 2018) detects the textual variance between different text versions and models these differences in so-called variantGraphs. Collation software allows us to capture paths that represent the textual variance between the documents on the granularity level of the token –
across the tree hierarchy of an XML encoding. Without stipulating any genetic interpretation, these graphs raise questions such as “how was a sentence/phrase altered syntactically (semantically) between draft A and draft B?”. The graph does not give explicit answers. Instead, it neutrally visualizes the variant and invariant text tokens between a selection of versions. Collation is limited to capturing connections between the documents, yet not on the hierarchical level of the document, but on the level of the token, which may well be smaller than any TEI annotation (on the level of XML text nodes). Again, this approach does not allow us to
zoom out. We cannot derive information from the stemmatological order (1), nor from the witness-specific genesis (2).

In all three cases, additional graph-structures are annotated across to the underlying hierarchical tree, which is itself a graph. Each one of these structures provides an alternative navigation for a particular level or subtree of the work and thus each of these structures represents a different aspect of the work’s textual genesis. Only very few projects, such as the Faust Edition and the Beckett Digital Manuscript Project, incorporate all three structures and even those projects have not managed to merge them in a way that allows the user to seamlessly navigate over all genetic information.
What is missing is a comprehensive model that allows to navigate seamlessly between the different types of genetic paths, to zoom in and out on writing processes (between the
macrogenetic and
microgenetic levels) and to connect external source texts to their use in the drafts (linking ‘
exogenesis’ and ‘
endogenesis’). Ideally, we should be able to implement this model in an easily accessible and extensible research environment, allowing the scholar to capture, organize, visualize and analyze genetic paths of all described types.

Building on the system of genetic paths as developed in HyperLearn (D’Iorio 2003, Barbera 2005) the proposed paper presents a digital way of modelling text-genetic relationships in an eXist-db based research environment for genetic criticism, henceforth referred to as a Manuscript Web (MW). Such an MW in the form of a customizable web application allows textual scholars to organize their document-collections consisting of facsimiles, TEI transcripts and bibliographical metadata, in four different module types: (1) virtual libraries, (2) collections of notes, (3) drafts and (4) published editions. A Manuscript Web also starts from a project tree, but unlike the projects described above it enables users not only to capture genetic paths across the given tree hierarchy, but to search the respective modules to which they belong and to store relationships (source-to-target vectors) between all identifiable elements/hierarchical levels of the project tree (that is modules, collection-folders, document-entities, XML elements and text nodes). The model thus enables users
to zoom in and out between
macro- and micro-genetic levels, as well as between
exo- and endogenesis.

For example, to link endo- with exogenesis, a scholar may connect an entire section of a TEI-encoded notebook to an identified source in the author’s virtual library to indicate that this notebook section contains reading notes from the related source. On a microgenetic level, individual notes from this section may be linked to paragraphs, phrases or interlinear additions in a manuscript draft. On a ‘higher’ level in the hierarchy level (macrogenesis), this draft may be linked to the following draft in the stemmatological sequence. Such one-to-one relations can be captured individually and regardless of the granularity level. Where the source- and target-references of independent relations overlap, they form paths and genetic graphs across the corpus. Since all these graphs refer to elements of the underlying project tree, this tree provides a navigational backbone that allows the user to zoom in and out on the genetic information. From any document within the environment the user can access all genetically related entities to answer questions such as “what does this particular paragraph look like in the next draft?” or “which literary sources inspired this paragraph?”.
The aim of the proposed model is to enable users to connect what is usually merely juxtaposed. Most digital archives and scholarly editions offer the traces of a work’s genesis as digitized items, side by side. What this paper proposes is a way to enable not only scholarly editors, but also users to discover and record the connections between these textual traces. The ability to record these connections facilitates a more comprehensive understanding of a work’s genesis.
“Put[ting] the text back into motion,” as Contat et al. described it (1996, 2), implies a dynamic model that allows users to turn the different genetic traces or “stills” – so to speak – into the “motion picture” of the genesis. With the proposed model, zooming in on the smallest level of textual change no longer entails the danger of getting lost in the labyrinth of the digital archive thanks to the possibility to zoom out again at every stage in the enquiry and see the bigger picture.
Finally, the paper shows how this dynamic model facilitates not only research into one single work’s genesis, but also comparative genetic criticism of several authors’ works. Up till now, comparative studies have been relatively rare in the field of genetic criticism, because every author’s writing method is characterized by idiosyncrasies. By modelling the text-genetic data in such a way that they become more comparable, the proposed model will contribute to the development of comparative genetic criticism.


Barbera, M. (2005). The HyperLearning project: Towards a distributed and semantically structured e-research and e-learning platform, Literary and Linguistic Computing, 21.1: 77-82.

Bleeker, E., Buitendijk, B., Dekker, R. H. and Kulsdom, A. (2018). TMI? Visualisation as Research Instrument for Computational Philology, DH Benelux Conference Abstract, (accessed 23 April 2019).

Contat, M., Hollier, D. and Neefs, J. (1996). Editors' Preface, Yale French Studies 89, 1-5.

Dekker, R. H., and Middell, G. (2011). Computer-supported collation with CollateX: managing textual variance in an environment with varying requirements, Supporting Digital Humanities, 2011: 17-18.

D’Iorio, P. (2003). Cognitive Models of HyperNietzsche: Dynamic Ontology and Hyper-Learning, Jahrbuch für Computerphilologie, 5: 179-184.

Hoenen, A., and Brüning, G. (2018). Zur Stemmatologie neuerer Überlieferungen. In

DARIAH-DE Working Papers

TEI Consortium, eds. (2011).

TEI P5: Guidelines for Electronic Text Encoding and Interchange.
P5v2.0, (accessed 23 April 2019).

Van Hulle, Dirk (2016). Modelling a Digital Scholarly Edition for Genetic Criticism: A Rapprochement, Variants, 12-13: 34-56.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.