Computer Supported Collation With CollateX

workshop / tutorial
  1. Ronald Haentjens Dekker

    Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)

  2. Tara L. Andrews

    Universität Bern (University of Bern)

  3. David J. Birnbaum

    Department of Slavic Languages - University of Pittsburgh

  4. Leif-Jöran Olsson

    Göteborg University (Gothenburg)

  5. Joris J. van Zundert

    Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)

Pre-Conference Workshop and Tutorial (Round 2)

text comparison

literary studies
scholarly editing
text analysis

Comparing witnesses of a text is an important part of scholarly editing. Collation is regarded as one of the scholarly primitives (Unsworth, 2000). Comparing texts by hand can be tedious and prone to error, and it can be made more efficient and reliable with the assistance of computers. This workshop will explain how to use the open-source CollateX collation tool (see note 1) to compare witnesses of a text automatically, in a way that can be used to produce critical textual editions and other types of comparative documents. Attendees will learn how to prepare source materials in any language (including those that use non-Latin scripts and directionality that is not left-to-right) for collation, how to perform automated collation using CollateX, and how to edit the results.
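
One preparation step that often matters for non-Latin scripts is Unicode normalization: the same visible character can be encoded in more than one way, and a collation tool comparing raw strings would treat the encodings as different readings. The following is a minimal sketch using only the Python standard library, not part of CollateX itself; the helper name `prepare` is hypothetical.

```python
import unicodedata

def prepare(text):
    """Return the text in Unicode Normalization Form C (composed form).

    Hypothetical helper for illustration; real projects may prefer
    another normalization form depending on their transcription policy.
    """
    return unicodedata.normalize("NFC", text)

# Cyrillic "short i" encoded two ways.
precomposed = "\u0439"        # one code point
decomposed = "\u0438\u0306"   # base letter followed by a combining breve

print(precomposed == decomposed)                    # False
print(prepare(precomposed) == prepare(decomposed))  # True
```

Normalizing every witness the same way before tokenization keeps encoding differences from being reported as textual variation.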

Full Contact Information

Ronald Haentjens Dekker
Software Architect and Consultant
Huygens ING
The Netherlands

Ronald Haentjens Dekker is a software architect and consultant at the Huygens Institute for the History of the Netherlands. He has been the lead developer of CollateX since 2007.

Tara L. Andrews
Assistant Professor of Digital Humanities
University of Bern

Tara L. Andrews is assistant professor of digital humanities at the University of Bern. Her research interests include Byzantine history of the middle period (in particular, the 10th to 12th centuries), Armenian history and historiography from the fifth to the 12th centuries, and the application of computational analysis and digital methods to the fields of medieval history and philology.

David J. Birnbaum
Professor and Chair, Slavic Languages and Literatures
University of Pittsburgh

David J. Birnbaum teaches digital humanities at the University of Pittsburgh and has been enhancing CollateX to collate medieval Slavic manuscript materials. Links to some of his digital philology and other digital humanities projects are available at

Leif-Jöran Olsson (leifjoran.
Language Technologist and System Developer
Department of Swedish
University of Gothenburg

Leif-Jöran Olsson is a language technologist and system developer at Språkbanken (the Swedish Language Bank). He is also a developer of the open-source eXist-db XML database system and has worked on a plug-in to integrate eXist-db and CollateX.

Joris J. van Zundert
Researcher and Developer in Computational and Digital Humanities
Huygens ING
The Netherlands

Joris J. van Zundert is a scientific researcher and developer in the field of digital and computational humanities at the Huygens Institute for the History of the Netherlands. A scholar of medieval Dutch literature by training, his main interest as a researcher and developer lies in the possibilities of computational algorithms for the analysis of literary and historical texts, and the nature and properties of information and data modeling in the humanities.

Target Audience

Scholars who are interested in using tools to facilitate humanities research, especially with respect to preparing digital critical editions. Participants who wish to work with their own materials will need to bring them (in plain text or TEI markup); the organizers will provide sample data that can be used by participants who do not have their own project materials. Participants are strongly encouraged to install Python 3 and CollateX in preparation for the workshop; the workshop organizers will provide installation instructions in advance. No prior Python programming experience is required. Based on prior workshop experience, we anticipate attracting between 15 and 30 participants.

Special Requirements for Technical Support

A computer projector (HDMI or VGA) will be required for the presentation. Participants will be required to bring their laptops, and the room will need to provide sufficient plug-in electrical connections and wireless Internet connectivity for all participants.

Intended Length and Format of the Workshop

Full day, two sessions.

Session 1. 9:00–12:00: The Basics of Automatic Collation

The first session will cover the theory of collation, the basics of using CollateX, and the collation of plain text data. No prior experience with collation tools is required.
• Introduction to the theory and uses of collation.
• The collation data model: witnesses, tokens, and tokenization.
• Installing, configuring, and testing CollateX.
• Collating plain text strings and files.
• Output options and postprocessing.
• Introduction to normalization.
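
The data model in this outline (witnesses split into tokens, tokens compared in normalized form, results shown as aligned rows) can be sketched in a few lines of Python. This is an illustration only: it aligns two witnesses with `difflib.SequenceMatcher` from the standard library as a stand-in, whereas CollateX uses its own alignment algorithms and handles more than two witnesses. The function names `tokenize`, `normalize`, and `align` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Split a witness into word tokens and separate punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(token):
    """Reduce a token to the form used for comparison (here: lowercase)."""
    return token.lower()

def align(witness_a, witness_b):
    """Align two witnesses on normalized tokens, reporting original tokens.

    difflib is a stand-in for illustration, not the CollateX algorithm.
    """
    tokens_a = tokenize(witness_a)
    tokens_b = tokenize(witness_b)
    matcher = SequenceMatcher(
        a=[normalize(t) for t in tokens_a],
        b=[normalize(t) for t in tokens_b],
        autojunk=False,
    )
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]

rows = align("The quick brown fox jumps.", "the quick red fox jumps.")
for tag, a, b in rows:
    print(f"{tag:8} {' '.join(a):20} {' '.join(b)}")
```

Because the comparison runs on normalized tokens, "The" and "the" align as equal, while the output still shows each witness's original spelling. Only "brown" against "red" is reported as a variant.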

Session 2. 13:00–16:00: Collating XML (including TEI) Data

The second session will cover more advanced topics, most notably the collation of transcriptions that contain XML (including TEI) markup.
• The collation data model with XML (especially TEI) input.
• Advanced normalization.
• Recognizing and tracking markup information during collation.
• Processing tokens differently according to markup information.
• Output options and postprocessing for XML (especially TEI) output.
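
A rough sketch of the ideas in this session, under stated assumptions: each word is turned into a token that remembers the markup it came from, and tokens are then treated differently according to that markup (here, deleted words are left out of the collation input). The element names `l`, `del`, and `add` are TEI-like but used without a namespace for brevity. The `markup` and `deleted` properties and the function names are hypothetical. The `t` and `n` keys and the `witnesses` wrapper follow our understanding of the pretokenized JSON input that CollateX accepts, which should be checked against its documentation.

```python
import json
import re
import xml.etree.ElementTree as ET

def make_token(word, path):
    """Build one token: original reading, normalized form, markup context."""
    return {
        "t": word,                            # original reading
        "n": word.lower().strip(".,;:!?"),    # normalized form for comparison
        "markup": "/".join(path),             # hypothetical extra property
        "deleted": "del" in path,             # hypothetical extra property
    }

def tokens_from_xml(xml_string):
    """Walk the XML tree and emit word tokens tagged with their element path."""
    tokens = []

    def walk(element, ancestors):
        path = ancestors + [element.tag]
        for word in re.findall(r"\S+", element.text or ""):
            tokens.append(make_token(word, path))
        for child in element:
            walk(child, path)
            # Text after a child element belongs to the parent's context.
            for word in re.findall(r"\S+", child.tail or ""):
                tokens.append(make_token(word, path))

    walk(ET.fromstring(xml_string), [])
    return tokens

sample = "<l>The <del>quick</del> <add>swift</add> fox.</l>"
tokens = tokens_from_xml(sample)

# Process tokens differently according to markup: skip deleted words.
witness = {"id": "A", "tokens": [t for t in tokens if not t["deleted"]]}
print(json.dumps({"witnesses": [witness]}, indent=2))
```

Keeping the markup context on each token, rather than stripping the tags before tokenizing, means the decision about deletions, additions, or abbreviations can be changed later without re-transcribing the source.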

Call for Participation

We asked applicants on relevant mailing lists (such as Humanist, TEI-L, and Digital Medievalist) to tell us about their interests, needs, and prior experience with respect to collation. The instructors listed above will serve as the workshop program committee. Up to 30 participants were to be accepted.
1. The main CollateX website is CollateX Python is freely available in the Python package repository: The source code is open and available at For a report about a recent application of CollateX, see Haentjens Dekker et al. (2014).











Haentjens Dekker, R., et al. (2014). Computer-Supported Collation of Modern Manuscripts: CollateX and the Beckett Digital Manuscript Project. Digital Scholarship in the Humanities (2014).

Unsworth, J. (2000). Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This? In Symposium on Humanities Computing: Formal Methods, Experimental Practice. London: King’s College.


Conference Info


ADHO - 2015
"Global Digital Humanities"

Hosted at Western Sydney University

Sydney, Australia

June 29, 2015 - July 3, 2015

Series: ADHO (10)

Organizers: ADHO