Integrated DH. Rationale of the HORAE Research Project

paper, specified "long paper"
Authorship
  1. Dominique Stutzmann

    Institut de recherche et d'histoire des textes (IRHT) - CNRS (Centre national de la recherche scientifique)

  2. Jacob Currie

    Institut de recherche et d'histoire des textes (IRHT) - CNRS (Centre national de la recherche scientifique)

  3. Béatrice Daille

    Laboratoire des Sciences du Numérique de Nantes : LS2N (Laboratory of Digital Sciences of Nantes) - École Centrale de Nantes

  4. Amir Hazem

    Laboratoire des Sciences du Numérique de Nantes : LS2N (Laboratory of Digital Sciences of Nantes) - École Centrale de Nantes

  5. Christopher Kermorvant

    Teklia

Work text


Abstract – HORAE (Hours, Recognition, Analysis, Editions) is a cross-disciplinary research project studying religious practices and experiences in the late Middle Ages and Renaissance as evidenced by the medieval bestseller, Books of Hours. The project develops tools in artificial intelligence, computer vision and image analysis to “read” a very large and diverse corpus of medieval manuscripts, and in Natural Language Processing (NLP) to “identify” textual units and structures, and combines them with expertise in book history and religious practices. HORAE also implements diverse tools developed in the DH community (network analysis, data visualization, text mining). This paper presents HORAE as a research Gesamtkunstwerk that tackles a common challenge of complexity, uncertainty and granularity with different tools from different fields. By broadening the perspective through genuinely cross-disciplinary research, we discuss the position of DH within the humanities.

Introduction
HORAE (Hours - Recognition, Analysis, Editions) is a cross-disciplinary research project studying religious practices and experiences in the late Middle Ages and Renaissance through Books of Hours, the foremost medieval bestseller. It involves three partners in humanities and computer science: the Institut de Recherche et d’Histoire des Textes (CNRS), the private company TEKLIA and the Laboratoire des Sciences du Numérique de Nantes (LS2N). More than 10,000 of these books survive. The production of such a large number of manuscripts is a major phenomenon and bears witness to profound changes in late medieval society on cultural, religious, and industrial levels: speculative book production rather than production on commission for specific clients, devotio moderna and the imitation of clerical practices by lay people, customization of devotional objects, etc. Books of Hours are at once deluxe items of social display and intimate objects of devotional intensity, used for one’s salvation (Wieck et al., 1988; Reinburg, 2012; Hindman and Marrow, 2013). Their textual content is immense (300+ folios per volume; only 1.7% miniatures), yet scarcely studied.
HORAE combines the research and expertise of all three partners: in artificial intelligence applied to computer vision and image analysis to automatically “read” a very large and diverse corpus of medieval manuscripts; in natural language processing (NLP) to parse and “identify” textual units and structures; and in book history and religious practices. The project aims to create an integrated chain from image processing to the production of new knowledge by placing the end user at the center of developments and focusing on ergonomics and data visualization.

Aims, corpus, relevance
The aims encompass: (1) re-using the many digitized manuscripts which are online but underused; (2) developing new open-source software for Handwritten Text Recognition (HTR); (3) building tools for segmentation and plagiarism detection, adapted to machine-produced transcriptions of medieval manuscripts, in order to identify the texts in Books of Hours; (4) identifying and editing unpublished texts; (5) visualizing manuscript clusters which share the same textual characteristics, either in the order of the different parts (Hours of the Virgin, votive offices, suffrages, prayers) or in the order of their smaller textual units; (6) studying the diffusion and circulation of devotional and liturgical texts at the end of the Middle Ages in order to better understand cultures and faith from the 13th to the 16th century.

Big, linked, open, usable data. With its aims and methods, HORAE addresses the largest corpus of manuscripts in medieval studies so far. It changes research methods in auxiliary sciences and tackles the challenges of big data. Books of Hours have been little studied until now because they are too numerous, too complex and too standardized. Now their very number, repetition and complexity allow this project to develop new and efficient technologies and methodologies, and to gain new knowledge about the Middle Ages. The suitability of liturgical data for fruitful historical analysis was recently illustrated by data from calendars (Heikkilä and Roos, 2018).

Libraries, infrastructures, and digital humanities. Several hundred thousand medieval manuscripts and millions of medieval archival documents are preserved worldwide. The textual wealth of these mostly unpublished resources is far from exhausted, which is why HTR and NLP are needed. Books of Hours have been massively digitized because they are often heavily illuminated and of interest to art historians; in spite of this, these books, as texts, are massively underused. By making new use of documents first digitized for other audiences and other purposes, HORAE demonstrates the opportunities created by open-access digitization using the IIIF (International Image Interoperability Framework) protocol (https://iiif.io/).

Community relevance. The late Middle Ages and Renaissance were times of extreme religious tension within communities. Direct comparison with modern times is of course impossible, and history does not solve current problems. However, studying how religious practices helped shape communities and served as common identifiers, and how texts circulated as a basis for specific religious tendencies, can help produce relevant comparisons. This can inform the sociology of religion and is of clear contemporary relevance.

Methods and realizations
Based on previous and preliminary work (Stutzmann, 2019; Plummer and Clark, 2016; Baroffio, 2015; Cavet, 2004), we have built a first reference corpus of Books of Hours worldwide, including more than 7,000 items, which will be enhanced during the course of the project (cf. fig. 1). In accordance with previous observations, they mostly originate in Italy, France, and the Low Countries (fig. 2), but their liturgical use shows the extent of the influence of Rome (fig. 3).
Fig. 1: Map of 7,000+ preserved Books of Hours (visualized in Dariah-DE GeoBrowser version 2.2.1 using PLATIN ©2018 DARIAH-DE. Open source code)

Fig. 2: Place and date of production of 2,400+ Books of Hours (visualized in Dariah-DE GeoBrowser)

Fig. 3: Liturgical destination of 2,300+ Books of Hours (visualized in Dariah-DE GeoBrowser)

In this corpus, we have selected 500 manuscripts that are available through the IIIF protocol (manifest URLs were collected manually and images accessed for analysis, but not stored in our system). We have successfully applied a “page classifier” to distinguish bindings, blank pages, calendars, full-page or half-page miniatures, historiated initials, and plain text pages, and published the results as IIIF manifests at https://github.com/oriflamms/HORAE (fig. 4). We then segment each page into lines of text. Next, we “read” or “recognize” the text with the same techniques that were applied successfully in the HIMANIS project and now allow full-text search across 199 volumes, the largest corpus of medieval manuscripts yet indexed (Stutzmann et al., 2018). These techniques rely on deep learning (based on the open-source Kaldi library) and a dedicated ground truth (fig. 5).
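
To illustrate how images can be accessed through IIIF without being stored locally, the following minimal sketch lists the pages of a manifest and builds an image request URL for each one. It assumes the IIIF Presentation 2.x manifest layout and the IIIF Image API URL pattern. The sample manifest, its example.org URLs and the function name `list_pages` are hypothetical and are not taken from the HORAE repository.

```python
import json

# Hypothetical, heavily trimmed IIIF Presentation 2.x manifest.
# All example.org URLs are placeholders, not real HORAE data.
SAMPLE_MANIFEST = json.loads("""
{
  "@id": "https://example.org/iiif/ms1/manifest",
  "label": "Book of Hours (sample)",
  "sequences": [{
    "canvases": [
      {"@id": "https://example.org/iiif/ms1/canvas/1", "label": "f. 1r",
       "images": [{"resource": {"service": {"@id": "https://example.org/iiif/img/ms1_0001"}}}]},
      {"@id": "https://example.org/iiif/ms1/canvas/2", "label": "f. 1v",
       "images": [{"resource": {"service": {"@id": "https://example.org/iiif/img/ms1_0002"}}}]}
    ]
  }]
}
""")


def list_pages(manifest, width=1000):
    """Return (label, image_url) pairs for every canvas in the manifest."""
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image["resource"]["service"]["@id"].rstrip("/")
                # Image API pattern: {service}/{region}/{size}/{rotation}/{quality}.{format}
                url = f"{service}/full/{width},/0/default.jpg"
                pages.append((canvas.get("label", ""), url))
    return pages


if __name__ == "__main__":
    for label, url in list_pages(SAMPLE_MANIFEST):
        print(label, url)
```

In practice the manifest would be fetched from the holding library's server, and each image URL would be requested only when a page is analyzed.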

NLP techniques are then applied to match the automatically generated text with editions, in order to identify the different texts and parts of texts that are present in the manuscripts. Our automated transcription is sufficiently accurate to support a “mid-level” approach at “text” level and to correctly identify texts despite the remaining errors. We develop new techniques, but can also rely on existing software such as TRACER, CollateX and ITEAL (Haentjens Dekker and Middell, 2010; Büchler, 2015; Jänicke and Wrisley, 2018) (fig. 6). In future steps, we will compare the contents of manuscripts in order to understand historical patterns and to identify and edit hitherto unknown texts.
Fig. 4: Page classifier: ¾ page miniatures and historiated initials from Books of Hours preserved at the Houghton Library, Harvard University (visualized in Mirador IIIF viewer)

Fig. 5: Automated reading of a medieval manuscript as IIIF annotations (visualized in Mirador IIIF viewer)

Fig. 6: Visualization of correspondence between the biblical text of the Psalms and the content of a given Book of Hours (elaborated and visualized in ITEAL, developed by David J. Wrisley and Stefan Jänicke)

Challenges: uncertainty, sequence, and granularity
Three methodological challenges deserve explanation.

Uncertainty. HTR produces an error-prone text. Text identification based on methods of text reuse detection should be able to ascertain the textual source and thus create correct data from fuzzy, incorrect and automatically-generated data.
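
The idea of recovering a correct identification from error-prone HTR output can be illustrated with a minimal sketch. It does not reproduce the project's method or TRACER. As a stand-in, it uses the character-level similarity ratio from Python's standard `difflib` module. The reference texts, the normalization rules (lowercasing, merging u/v and i/j), the threshold and the function names are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference "editions": short incipits keyed by an arbitrary label.
REFERENCES = {
    "domine_labia": "Domine labia mea aperies et os meum annuntiabit laudem tuam",
    "deus_in_adiutorium": "Deus in adiutorium meum intende Domine ad adiuvandum me festina",
    "ave_maria": "Ave Maria gratia plena Dominus tecum",
}


def normalize(text):
    """Lowercase, merge u/v and i/j, and drop everything except letters and spaces."""
    text = text.lower().replace("v", "u").replace("j", "i")
    text = re.sub(r"[^a-z ]+", " ", text)
    return " ".join(text.split())


def identify(htr_text, references, threshold=0.6):
    """Return (label, score) of the closest reference, or (None, score) below threshold."""
    query = normalize(htr_text)
    best_label, best_score = None, 0.0
    for label, reference in references.items():
        score = SequenceMatcher(None, query, normalize(reference)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    # Simulated HTR output in which several "m" were misread as "rn".
    noisy = "Dornine labia rnea aperies et os meurn annuntiabit laudern tuam"
    print(identify(noisy, REFERENCES))
```

The threshold separates "identified despite errors" from "unknown text". Texts that fall below it are candidates for the unpublished material the project aims to edit.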

Sequential text A+B ≠ B+A. Books of Hours present sections whose ordering may represent only a social practice (e.g. “mixed hours”) or may distinguish their “liturgical use”, i.e. their geographical validity and applicability. Most techniques of distant reading tend to erase the sequential order of sentences and words. We thus need accurate sequential tables of contents and to identify which variations are relevant for research purposes.
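
The point that A+B ≠ B+A can be made concrete with a short sketch. It compares two invented tables of contents, once as unordered bags of sections and once as ordered sequences. The section lists are hypothetical, and `difflib.SequenceMatcher` is only one possible order-aware measure, chosen here because it is part of the Python standard library.

```python
from collections import Counter
from difflib import SequenceMatcher

# Two hypothetical manuscripts with the same sections in a different order.
MS_A = ["calendar", "hours_of_the_virgin", "votive_offices", "suffrages", "prayers"]
MS_B = ["calendar", "votive_offices", "suffrages", "hours_of_the_virgin", "prayers"]


def same_contents(a, b):
    """Bag-of-sections comparison: ignores the order entirely."""
    return Counter(a) == Counter(b)


def order_similarity(a, b):
    """Order-aware comparison: 1.0 only if both sequences are identical."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print("same contents:", same_contents(MS_A, MS_B))
    print("order similarity:", order_similarity(MS_A, MS_B))
```

A bag-of-sections view treats the two manuscripts as identical. The sequence-based score drops below 1.0, and that difference is the signal that may distinguish a liturgical use or a social practice.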

Granularity. Books of Hours are built as a sequence of many sections with subsections in a highly hierarchical manner (fig. 7), with conflicting logical (text) and physical (page) structures. Any sequence of a coherent section of several texts can be subsumed in the first sentence of the first text, and words can be represented by their initial letters (fig. 8). This circumstance creates numerous ambiguities, and it is also extremely difficult to ascertain whether some parts of the text were omitted in writing but supposed to be said, or whether they are not meant to be there at all. Therefore, in order to establish a correct table of contents, we have to record what is there and what is not, but also to interpret what should be there.
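
One way to record both what is written and what should be there is a tree of textual units, each carrying a presence status. The sketch below is a hypothetical data model, not the project's schema. The class names, the three status values and the sample section are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Presence(Enum):
    FULL = "written in full"
    INCIPIT = "abbreviated to an incipit"
    IMPLIED = "not written, but expected"


@dataclass
class Unit:
    """A textual unit (section, office, single text) in the logical hierarchy."""
    label: str
    presence: Presence = Presence.FULL
    children: List["Unit"] = field(default_factory=list)


def table_of_contents(unit, include_implied=True, depth=0):
    """Flatten the hierarchy into indented lines, optionally hiding implied units."""
    lines = []
    if include_implied or unit.presence is not Presence.IMPLIED:
        lines.append("  " * depth + f"{unit.label} [{unit.presence.value}]")
        for child in unit.children:
            lines.extend(table_of_contents(child, include_implied, depth + 1))
    return lines


# Hypothetical sample: one office with a full, an abbreviated and an implied text.
SAMPLE = Unit("Hours of the Virgin", children=[
    Unit("Matins", children=[
        Unit("psalm"),
        Unit("antiphon", Presence.INCIPIT),
        Unit("doxology", Presence.IMPLIED),
    ]),
])

if __name__ == "__main__":
    print("\n".join(table_of_contents(SAMPLE)))
```

Calling `table_of_contents(SAMPLE, include_implied=False)` yields the descriptive view (what is on the page), while the default call yields the interpreted view (what is expected to be said).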

Fig. 7: Simplified structure of a Book of Hours

Fig. 8: Granularity of textual content

Integrating DH in the humanities Gesamtkunstwerk
There are many definitions of DH (Terras et al., 2016). We stress the parallel between DH and “auxiliary” sciences (also called “ancillary” or “fundamental”, in German “Hilfs-” or “Grundwissenschaften”). Both are diverse, have a strong focus on epistemology (source criticism and history of science and scholarly practices), and need specific training, curricula, and data infrastructures. Both are also dispersed and taught in several faculties. Equally, both strive for autonomy when their results and tools are used by other disciplines, sometimes without any interest in their construction, relations and inner coherence.
HORAE integrates several fields of research in “digital humanities” with a common goal (image analysis, linguistic computing, text mining, data modelling and visualization). It makes extensive use of DH-specific methods and tools. Having a strong focus on manuscript studies and text transmission, it combines digital research methods and historical study in a mutually enriching process. It is therefore connected to both traditional auxiliary/fundamental sciences and DH.

Beyond that, it implements technologies developed without connection to the humanities. The alignment of research questions from different fields makes it possible to have non-humanities and non-DH partners engaged in mutually beneficial research, in which no partner is only a service provider to the others. As a consequence, it also disintegrates DH as a specific field by extracting a set of questions, research, and tools, as the new normal in the humanities. With this perspective, we suggest that DH could benefit from analyzing experiences in fundamental sciences to reflect on their fluid position within the humanities.

Bibliography

Baroffio, G. (2015). Iter liturgicum italicum. Répertoire des manuscrits liturgiques italiens http://liturgicum.irht.cnrs.fr/fr/ (accessed 22 March 2017).

Büchler, M. (2015). TRACER. ETRAP (Electronic Text Reuse Acquisition Project) https://www.etrap.eu/research/tracer/ (accessed 27 November 2018).

Cavet, L. (2004). Les Heures de la Vierge : identification liturgique et origine du manuscrit Billet. Les Carnets de l’IRHT https://irht.hypotheses.org/643 (accessed 22 March 2017).

Haentjens Dekker, R. and Middell, G. (2010). CollateX – About the Project. CollateX – Software for Collating Textual Sources http://collatex.net/about/ (accessed 24 October 2016).

Heikkilä, T. and Roos, T. (2018). Quantitative methods for the analysis of medieval calendars. Digital Scholarship in the Humanities, 33(4): 766–87 doi:10.1093/llc/fqy007.

Hindman, S. and Marrow, J. H. (eds). (2013). Books of hours reconsidered. Turnhout: Brepols.

Jänicke, S. and Wrisley, D. J. (2018). Semi-automated Alignment of Text Versions with iteal. Digital Humanities 2018 Puentes-Bridges: Book of Abstracts Libro de Resúmenes (Mexico City, 26-29 June 2018). Mexico, pp. 700–03 https://dh2018.adho.org/en/semi-automated-alignment-of-text-versions-with-iteal/ (accessed 10 July 2018).

Plummer, J. H. and Clark, G. T. (2016). Beyond Use: A Digital Database of Variant Readings in Late Medieval Books of Hours http://arthur.sewanee.edu/BeyondUse/ (accessed 22 March 2017).

Reinburg, V. (2012). French books of hours: making an archive of prayer, c. 1400-1600. Cambridge: Cambridge Univ. Press.

Stutzmann, D. (2019). Les écritures des livres d’heures dans l’espace français (1290-1550). In Overgaauw, E. and Schubert, M. J. (eds), ‘Change’ in medieval and Renaissance scripts and manuscripts. Proceedings of the 19th Colloquium of the Comité international de paléographie latine (Berlin, 16-18 September, 2015). (Bibliologia 49). Turnhout: Brepols, pp. 101–20.

Stutzmann, D., Kermorvant, C., Vidal, E., Chanda, S., Hamel, S., Puigcerver, J., Schomaker, L. and Toselli, A. H. (2018). Handwritten Text Recognition, Keyword Indexing, and Plain Text Search in Medieval Manuscripts. Digital Humanities 2018 Puentes-Bridges: Book of Abstracts Libro de Resúmenes (Mexico City, 26-29 June 2018). Mexico, pp. 298–302 http://dh2018.adho.org/wp-content/uploads/2018/06/dh2018_abstracts.pdf.

Terras, M., Nyhan, J. and Vanhoutte, E. (eds). (2016). Defining Digital Humanities. Routledge.

Wieck, R. S., Poos, L. R., Reinburg, V., Plummer, J. H. and Walters Art Museum (1988). Time sanctified: the Book of Hours in medieval art and life. New York: G. Braziller.
