Continuous Integration Systems for Critical Edition: The Chronicle of Matthew of Edessa

paper, specified "short paper"
  1. 1. Tara Lee Andrews

    Universität Wien (University of Vienna)

  2. 2. Anahit Safaryan

    Universität Wien (University of Vienna)

  3. 3. Tatevik Atayan

    Universität Wien (University of Vienna)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

We present here a project to prepare the digital critical edition of the
Chronicle of Matthew of Edessa, which is due to finish its first stage in April 2019. The
Chronicle is a 12
th century Armenian-language historical text covering events in the Near East during the 10
th-12th centuries. This takes the reader from the apogee of the medieval Armenian kingdoms to the fall of most of them during the eleventh century as their lands were annexed to Byzantium and ultimately lost to the Seljuk Turks. The Chronicle is also an important source for the history of the First Crusade, and particularly the Crusader County of Edessa. There are about 40 known manuscripts that contain the text of the Chronicle in whole or in part, copied between the end of the 16
th century up to the 19

In our workflow we have adopted the stages of digital critical edition suggested by Robinson
(2004) – transcription, collation, stemmatic analysis, edition and publication. We found, in the process, that these stages do not occur in a strict succession; it was quite regularly necessary to move back and forth between stages, refining earlier steps on the basis of later results. One of the central features of our project was to adopt a
continuous integration (CI) system

In this case, the software in question is Concourse (

in order to manage the work across these stages in a sensible manner. The primary challenge we then had to overcome was the need to ensure that the data was cleanly maintained from beginning to end, as the nature of CI design does not allow for modifications in the middle of the pipeline.

Beginning with the transcription, where we also followed general guidelines proposed by Robinson and Solopova
(1993), diplomatic transcriptions of all available manuscripts were made in T-PEN
(Ginther et al., 2009) according to guidelines maintained in the project’s GitHub repository, and converted from T-PEN’s native Shared Canvas JSON format into valid TEI-XML using a Python library developed for the purpose. A major advantage of the digital approach is the ease with which entire categories of transcription error can be identified and corrected automatically. The CI framework enabled parsing and validation errors to be spotted immediately. Rare cases could be corrected manually in T-PEN (that is, the source of our pipeline data), but widespread mistakes required us to revise our workflow tools to behave in a way more akin to functional programming, so that we could insert custom code to handle peculiarities of our specific texts without compromising the more general design of the tools themselves.

Given transcriptions that passed our litmus tests for validation and basic accuracy of content, we were then ready to collate them using the JSON input functionality of CollateX
(Dekker and Middell, 2011). Here again we relied on custom programming within the CI setup – while programmatically taking TEI files and tokenise the text contents into individual readings is a straightforward task in its basics, the specifics will vary massively from text to text. Our tokenizer software thus provides a number of code plug-in interfaces that can be used to generate a correct token, also in the specific XML context of that reading in the document. The CI setup also allowed us to make, and preserve, a number of custom modifications to how we used CollateX, in order to maximise its accuracy.

Since one of our transcription guidelines was to leave abbreviations unexpanded, we also developed a tool using a combination of base text collation, regular expression logic, and user interactivity to (semi-)automatically expand these abbreviations and store the results in a database, which could in turn be fed into the tokenisation step on later runs of the pipeline. Here too we retained the principles of diplomatic transcription: if the word was abbreviated the way that it could be expanded in a canonical orthography, we did so, otherwise, we tried to follow the intended (or perhaps mistaken) spelling of the scribe.
Given full collations of the chronological sections within the text, our editorial analysis could begin. We have used the Stemmaweb tool
(Andrews and Macé, 2013) both to normalise and to specify classes of relationship between individual readings throughout the text, which in turn eases the stemmatic analysis of the manuscripts. Recent versions of Stemmaweb also provide a means of indicating the ‘lemma’ reading among a set of variants, so that critical edition text can be produced directly. At this time a system for annotation of the textual content is under development, which will enable us to provide a digital commentary on the historical content of the
Chronicle, as well as the edition itself.

Although our project has not extended the CI model past the stage of collation – this would require a system to save and “re-play” editorial decisions concerning the collated text, which would have required more of an engineering effort than we had resources for within the framework of the project – we consider this to be an important direction for producing a critical edition that is truly reproducible from the textual evidence at its base.


Andrews, T. L., and Macé, C. (2013). Beyond the tree of texts: Building an empirical model of scribal variation through graph analysis of texts and stemmata.
Literary and Linguistic Computing,
28(4): 504–21. 10.1093/llc/fqt032.

Dekker, R. H., and Middell, G. (2011). Computer-supported collation with CollateX: Managing Textual Variance in an Environment with Varying Requirements. In
Supporting Digital Humanities. Copenhagen.

Ginther, J., Firey, A., O’Sullivan, T., Walker, A., Elliot, M., Gaffield, M., … Cuba, P. (2009, 2012). T-PEN: Transcription for Paleographical and Editorial Notation.

Howe, C. J., Connolly, R., and Windram, H. F. (2012). Responding to Criticisms of Phylogenetic Methods in Stemmatology.
Studies in English Literature 1500-1900,
52(1): 51–67. 10.1353/sel.2012.0008.

Robinson, P. (2004). Making electronic editions and the fascination of what is difficult.
Linguistica Computazionale,
21: 415–38.

Robinson, P., and Solopova, E. (1993). Guidelines for Transcription of the Manuscripts of the Wife of Bath’s Prologue. In
In The Canterbury Tales Project Occasional Papers 1

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO