Process Data for Digital Scholarly Editions

paper, specified "long paper"
  1. Gunter Vasold

    Karl-Franzens Universität Graz (University of Graz)

Work text

Work on a scholarly edition can in general be seen as a chain of working steps. Each step leads to an intermediate result, which other steps may build upon. At the end of this chain stands the published edition as the final result. Traditionally these steps, and thus the editorial process, are described in the form of methodological guidelines and editorial principles. Digital scholarly editions not only add new aspects like the separation of data and presentation or the chance to include additional materials [1, 2, 3, 4]; they might also change the way we document editorial processes. This seems particularly important for new forms of digital scholarly editions like open source editions [5], work in progress editions [6], editions 2.0 [7] or social editions [8]. Such editions evolve over time, they are possibly incomplete (work in progress), they might have been created collaboratively by many editors, or they provide not only canonical texts but also accompanying materials or results from intermediate stages, such as graphemic transcriptions or pre-normalized texts, as raw data for further enrichment and research.

As a consequence it is questionable whether traditional ways of documenting how an edition was made are still appropriate. If we want to deal with the fact that digital scholarly editions are no longer static resources but collections of evolving results, we must put a stronger focus on process aspects. For evolving scholarly editions we should consider treating formal process documentation as an essential part of the edition.

First we have to define what we mean by editorial processes. Creating a scholarly edition includes a large number of activities like searching, collecting and evaluating sources, other textual witnesses, literature and images, reviewing existing and preliminary editorial work, and making transcriptions, collations, normalized texts, abstracts, indexes and glossaries. It also implies tasks like discrimen veri ac falsi and stylometric, paleographic, prosopographic, and sphragistic research. Some of these tasks have to be done only once, but the majority of editorial activities are recursive or have to be repeated over and over again, e.g. for each charter of a collection.

Documenting such processes seems easy, but on closer inspection documenting individual processes requires a clear and formalized concept of processing steps, based on the definition that a process is a directed activity which leads from one state to another. I assume that editorial activities can be separated from each other and can therefore be handled as addressable and formally describable units of action. This is a basic prerequisite for process documentation, because we need distinct entities to refer to. Each process has to be defined by a starting point, an ending point and the set of activities which lie in between. This means that we not only have to describe activities, but also have to create relations between a specific activity and the data sets which represent its starting and ending points.

Another issue is how we should describe editorial processes. This includes three basically inseparable sub-questions: (1) what should be documented and to what extent, (2) how this should be put into practice, and (3) which data formats are feasible. The extent of documentation depends on the objectives of an editorial project, but a minimal set should include timestamps, actors (persons or algorithms), a formalized description, possibly an additional free-text description of the action and, as mentioned before, pointers to the initial and resulting data.
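The minimal set named above could be modelled roughly as follows. This is only a sketch under stated assumptions: the class names `Actor` and `ProcessRecord`, the field names and the identifier scheme (`charter-0001/...`) are hypothetical and not taken from any existing edition or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Actor:
    """A person or an algorithm that carried out an editorial step."""
    name: str
    kind: str  # "person" or "algorithm"


@dataclass(frozen=True)
class ProcessRecord:
    """One editorial activity leading from an initial to a resulting state."""
    activity: str                  # formalized description, e.g. "transcription"
    actor: Actor
    started: datetime              # timestamps of the activity
    ended: datetime
    inputs: Tuple[str, ...]        # pointers to the initial data
    outputs: Tuple[str, ...]       # pointers to the resulting data
    note: Optional[str] = None     # optional free-text description

    def __post_init__(self) -> None:
        # A process is directed: it cannot end before it starts,
        # and it must lead to some resulting data.
        if self.ended < self.started:
            raise ValueError("a process cannot end before it starts")
        if not self.outputs:
            raise ValueError("a process must point to its resulting data")


# Hypothetical example record for a single transcription step.
record = ProcessRecord(
    activity="transcription",
    actor=Actor("editor-a", "person"),
    started=datetime(2014, 3, 1, 9, 0, tzinfo=timezone.utc),
    ended=datetime(2014, 3, 1, 11, 30, tzinfo=timezone.utc),
    inputs=("charter-0001/facsimile",),
    outputs=("charter-0001/transcription-v1",),
    note="Graphemic transcription of the recto.",
)
print(record.activity, record.ended - record.started)
```

Keeping the record immutable and validating it on creation reflects the idea that a documented step is a distinct, addressable unit; a real project would probably add stable identifiers and a controlled vocabulary for activities.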

The second question deals with the problem that documenting single editorial activities seems to add a massive overhead to editorial work. For the greater part, however, this data can be generated automatically if we use appropriate working environments. The question of feasible data formats is twofold: we have to distinguish between process data used within a project and data which becomes a formal part of the edition. The former needs little discussion, as it depends on technical environments. The latter is still an open problem which requires further discussion and research, although some solutions already exist. It seems that graph-based models are particularly suitable for the problem [9, 10], but XML-based standards for process descriptions like PREMIS must also be investigated as possibly adaptable models.

The third and most fundamental question is what we can do with this data or, more pragmatically: why should we start collecting process data at all? Formalizing editorial processes and making such data available has multiple benefits. From a project management point of view this data can be used to get an overview of the project status at any point in time. One can use this data to find out, for example, how long a task takes on average, which might become important for future project planning. Process data can also be used to trigger actions like sending a message to co-workers if a certain state has been reached, or even to move a document automatically along a predefined workflow-based chain of tasks. In quality control, process data can be utilized to identify systematic errors.
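The project management uses mentioned above can be illustrated with a small sketch. It assumes a hypothetical in-memory log of tuples (document id, activity, start, end); the function names and sample data are invented for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical log entries: (document id, activity, start, end).
LOG = [
    ("charter-0001", "transcription", datetime(2014, 3, 1, 9, 0), datetime(2014, 3, 1, 11, 0)),
    ("charter-0002", "transcription", datetime(2014, 3, 2, 9, 0), datetime(2014, 3, 2, 12, 0)),
    ("charter-0001", "collation", datetime(2014, 3, 3, 9, 0), datetime(2014, 3, 3, 10, 0)),
]


def average_duration(log):
    """Return the mean duration per activity as a timedelta."""
    seconds = defaultdict(list)
    for _doc, activity, start, end in log:
        seconds[activity].append((end - start).total_seconds())
    return {activity: timedelta(seconds=mean(values)) for activity, values in seconds.items()}


def project_status(log):
    """Return the most recently completed activity for each document."""
    latest = {}
    for doc, activity, _start, end in sorted(log, key=lambda entry: entry[3]):
        latest[doc] = activity  # later entries overwrite earlier ones
    return latest


def notify_on(log, activity, send):
    """Call `send` for every document whose latest completed step is `activity`."""
    for doc, state in project_status(log).items():
        if state == activity:
            send(f"{doc} has reached state '{activity}'")


messages = []
notify_on(LOG, "collation", messages.append)
print(average_duration(LOG)["transcription"], project_status(LOG), messages)
```

Here `send` is just a callable, so the same function could append to a list, write to a log or hand the message to a mail client in a real working environment.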

From a user's point of view, quality control will become more important with new forms of scholarly editions. If an edition has many contributors, or if an edition has no final result, it is partly up to the user to decide whether a particular element of an edition is trustworthy. Peter Robinson has claimed that every act of editing should be attributed to the person who did it, because this allows attribution and trust [11]. Process metadata describing when and by whom an element was added or changed can therefore be an important indicator of reliability.

Open-ended and collaborative editions consist of an ever increasing number of representation forms and versions, which constitute multidimensional knowledge spaces. Process data can be used to construct references, and therefore contexts, between the single components of such a multidimensional scholarly edition [12]. Hence, process data can give important directions when users want to leave the paths predetermined by editors.
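How process data can yield references between components may be sketched as a small derivation graph. The step list, component names and function names below are hypothetical; the sketch only assumes that every recorded step points to its input and output data.

```python
from collections import defaultdict, deque

# Hypothetical process records: (activity, input ids, output ids).
STEPS = [
    ("transcription", ["facsimile"], ["graphemic"]),
    ("normalization", ["graphemic"], ["normalized"]),
    ("translation", ["normalized"], ["translation"]),
    ("indexing", ["normalized"], ["index"]),
]


def build_graph(steps):
    """Map each component to the components derived from it, labelled by activity."""
    graph = defaultdict(list)
    for activity, inputs, outputs in steps:
        for source in inputs:
            for target in outputs:
                graph[source].append((activity, target))
    return graph


def derived_from(graph, start):
    """Return every component reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for _activity, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


graph = build_graph(STEPS)
print(derived_from(graph, "facsimile"))
```

A user who starts at the facsimile can thus discover every later representation without following the navigation an editor has laid out; reversing the edges would answer the opposite question of where a given text came from.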

The problem of process planning and process documentation is of course not unique to scholarly editions. One example is software development, where many contributors can work on the same project, continuously creating new and refactoring existing code. This requires the use of version control systems like Apache Subversion or Git for the creation and organization of process metadata which document different versions of files and development branches. In science there is a long tradition of handwritten log books, kept with the purpose of making laboratory work reproducible and transparent. Nowadays these log books are being replaced by electronic systems which generate and log a large part of the process metadata automatically [13].

Most advanced in this domain is business process management (BPM) as a discipline of business informatics [14]. It focuses on process optimization, which is not the main problem with scholarly editions, but BPM has developed a domain-specific terminology, conceptual frameworks, software tools and standards like WfMC, WSFL, XPDL, BPEL or BPMN, which turn out to be very useful for understanding our topic. Concerning the problems they try to solve, document-related technologies [15], especially (enterprise) content management [16, 17], seem to be more applicable to digital scholarly editions than BPM. Content management describes the life cycle of a document through a chain of states in an abstract way. Content management systems implement these formalized life cycles by providing states, workflows, users and roles. They are able to keep track of stages and versions, to route documents automatically through defined paths and to collect metadata for every stage and action. Enterprise content management must not be confused with web content management, as web content management systems are mostly simplistic, monolithic and proprietary applications, which are not suitable in terms of long-term availability.
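The life cycle idea of content management, with states, workflows, users and roles, can be sketched as a small state machine. The states, actions and roles below are hypothetical examples and do not describe any particular content management system.

```python
from datetime import datetime, timezone

# Hypothetical life cycle: (current state, action) -> (next state, required role).
TRANSITIONS = {
    ("draft", "submit"): ("in_review", "editor"),
    ("in_review", "approve"): ("published", "reviewer"),
    ("in_review", "reject"): ("draft", "reviewer"),
}


class Document:
    """A document that moves through the life cycle and records every step."""

    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.state = "draft"
        self.history = []  # process metadata collected for every action

    def apply(self, action, user, role):
        """Perform `action` if the workflow and the user's role allow it."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"'{action}' is not allowed in state '{self.state}'")
        next_state, required_role = TRANSITIONS[key]
        if role != required_role:
            raise PermissionError(f"'{action}' requires role '{required_role}'")
        self.history.append({
            "from": self.state,
            "to": next_state,
            "action": action,
            "user": user,
            "at": datetime.now(timezone.utc),
        })
        self.state = next_state


doc = Document("charter-0001")
doc.apply("submit", user="editor-a", role="editor")
doc.apply("approve", user="reviewer-b", role="reviewer")
print(doc.state, [(step["from"], step["to"]) for step in doc.history])
```

The point of the sketch is that the process metadata in `history` is a by-product of routing the document, so it does not have to be written by hand.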

New open forms of digital editions require a stronger focus on editorial processes. This affects the planning of work, calls for support for process-specific workflows and requires process-specific metadata, which has to be seen as an essential part of an edition. Other fields, particularly (enterprise) content management, have requirements similar to those of scholarly editions. Developers of upcoming software tools for digital editions should investigate content management systems with regard to process aspects and adopt useful features. Special focus has to be put on the development of an appropriate, interchangeable format for edition-related process metadata.

1. Sahle, P. (2013). Digitale Editionsformen, Vol. 2: Befunde, Theorie und Methodik. BoD: Norderstedt.

2. Robinson, P. (2013). Towards a Theory of Digital Editions. In: Variants – Journal of the European Society for Textual Scholarship, 10, 105-131.

3. Pierazzo, E. (2011). A rationale of digital documentary editions. In: Literary and Linguistic Computing, 26, 463-477.

4. Dahlström, M. (2004). How Reproductive is a Scholarly Edition? In: Literary and Linguistic Computing, 19, 17-33.

5. Bodard, G. and Garcés, J. (2009). Open Source Critical Editions: A Rationale. In: Deegan, M., Sutherland, K. (eds.) Text Editing, Print and the Digital World. Farnham: Ashgate, 83-98.

6. Kropač, I. H. (2008). Work in Progress: Vom Digitalisat zum edierten Text. In: Thumser, M., Tandecki, J. (eds.) Editionswissenschaftliche Kolloquien 2005/2007. Toruń: TNT, 167-183.

7. Boot, P. and van Zundert, J. (2011). The Digital Edition 2.0 and The Digital Library: Services, not Resources. In: Bibliothek und Wissenschaft, 44, 141-152.

8. Osborne, R. et al. (2013). eResearch Tools to Support the Collaborative Authoring and Management of Electronic Scholarly Editions. In: Digital Humanities 2013, Conference Abstracts, 334-337.

9. Osborne, R. et al. (2013). eResearch Tools to Support the Collaborative Authoring and Management of Electronic Scholarly Editions. In: Digital Humanities 2013, Conference Abstracts, 334-337.

10. DeFerrari, J. et al. (2002). Managing the Lifecycle of Information. Hamburg: AIIM.

11. Robinson, P. (2013). Five desiderata for scholarly editions in digital form. In: Digital Humanities 2013: Conference Abstracts, 355-356.

12. Vasold, G. (2014). Progressive Editionen als multidimensionale Informationsräume. In: Ambrosio, S. et al. (eds.) Digital Diplomatics. The Computer as a Tool for the Diplomatist? AfD Beiheft 14, Köln: Böhlau, 75-88.

13. Potthoff, J. et al. (2011). Elektronisches Laborbuch: Beweiswerterhaltung und Langzeitarchivierung in der Forschung. In: Schomburg, S. et al. (eds.) Digitale Wissenschaft. Köln: hbz, 149-156.

14. Vom Brocke, J. and Rosemann, M. (2010). Handbook on Business Process Management. Springer: Berlin.

15. Kampffmeyer, U. (2003). Dokumenten-Technologien. Hamburg: Project Consult.

16. DeFerrari, J. et al. (2002). Managing the Lifecycle of Information. Hamburg: AIIM.

17. Cameron, S. (2011). Enterprise Content Management. Swindon: BCS.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO