From Text and Image to Historical Resource: Text-Image Alignment for Digital Humanists

paper, specified "long paper"
  1. Dominique Stutzmann

    Institut de recherche et d'histoire des textes (IRHT) - CNRS (Centre national de la recherche scientifique)

  2. Théodore Bluche


  3. Alexei Lavrentiev

    ICAR Laboratory - Ecole Normale Supérieure de Lyon (ENS de Lyon)

  4. Yann Leydier

    LIRIS Laboratoire d’Informatique en Image et Systèmes d'information - INSA

  5. Christopher Kermorvant





Institut de Recherche et d'Histoire des Textes (CNRS), France


A2iA, France


ICAR, Interactions, Corpus, Apprentissages, Représentations (ENS de Lyon – UMR 5191), France


LIRIS Laboratoire d’Informatique en Image et Systèmes d'information (INSA de Lyon – UMR 5205)


Teklia, France


Paul Arthur, University of Western Sydney

Locked Bag 1797
Penrith NSW 2751



Long Paper

Digital Palaeography
Text-Image Alignment

corpora and corpus activities
image processing
digital humanities - nature and significance
medieval studies
data modeling and architecture including hypothesis-driven modeling
resource creation and discovery
scholarly editing
software design and development
text analysis
content analysis
interdisciplinary collaboration
cultural studies
linking and annotation
standards and interoperability
data mining / text mining

Written texts are both abstract and physical objects: ideas, signs, and shapes, whose meanings, graphical systems, and social connotations evolve through time. Beyond authorship and writer identification or palaeographical dating of textual witnesses, the materiality of text and the connexion between the ideas and their written instantiations are a matter of cultural history, historic semiology, and history of communication and representations. In the context of large, growing digital libraries of texts and digitized medieval manuscripts, the question of the cultural significance of script and the ‘dual nature’ of texts may at last be addressed.
Several research projects, interfaces, and software tools allow for a closer text-image association during the editing process (TILE, T-PEN) and for data visualisation (Mirador, DocExplore), with interoperable annotation schemas (SharedCanvas). But in most cases, the finest granularity is the line level, with the alignment done by hand on small amounts of text (a few pages).

In this paper, we present a new method, derived from Handwritten Text Recognition, to automatically align images of digitized manuscripts with texts from scholarly editions, at the levels of page, column, line, word, and character. During the Oriflamms project, funded by the French National Research Agency and Cap Digital, it was successfully applied to two datasets: ‘Graal’ (130 pages, 13th c. manuscript, with an online edition) and ‘Fontenay’ (104 pages, 12th–13th c. charters). Partial results of the alignment at word and character level are available online.

Our contention is that such an alignment method is the only way to gain access to new questions on a large-scale basis and to massively transfer the results of traditional humanities and textual scholarship into digital humanities. The automation brings not only a change of scale (larger corpora) but also a change in granularity (from page-by-page or line-by-line alignment to word and character alignment). It avoids the tedious task of drawing boxes by hand around characters and allows a systematic analysis of the data. It surpasses earlier attempts to associate both aspects of text more closely and also opens perspectives on automated transcription.
Text-image alignment from existing transcripts is a newly opened, specific challenge. It has relevance for the humanities, as scholars work with a large set of already transcribed texts from manuscripts (handwritten texts). In other fields, priority was given to pure (handwritten) text recognition. OCR-free methods have been proposed by our team (Leydier et al., 2014) and others (Hassner et al., 2013); they require a precise transcript at line level and either yield weaker results or report no evaluation metrics. In the present method, we use Handwritten Text Recognition (HTR), which aims to automatically transcribe the image of a handwritten text (‘text image’) into an electronic text, and implement it as an alignment method.

‘Forced Alignment’: From HTR (Handwritten Text Recognition) to Handwritten Text Segmentation

Most of the current HTR models are based on the Hidden Markov Model (HMM) associated with a sliding window approach to segment the input image. These models may produce a precise character segmentation of the ‘text image’ as a by-product. When the transcript of a line image is known, the HMM focuses on the retrieval of characters’ positions and forces the output to correspond to the actual sequence of characters (so-called forced alignment). If only the whole document transcript is available, and not the positions or transcript of the words or lines, we can use line detection methods and map the transcripts to the detected lines. Such methods also relax the annotation effort needed to produce character segmentation.
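
The principle of forced alignment can be illustrated with a minimal sketch. It is not the Kaldi-based system described in this paper: it assumes a hypothetical input, `frame_scores`, holding one dictionary of per-character log-likelihoods for each sliding-window frame, and it uses a single state per character. Given the known transcript, a Viterbi-style dynamic programme finds the monotonic segmentation of frames that maximises the total score.

```python
def forced_align(frame_scores, transcript):
    """Return (char, start_frame, end_frame) triples, end exclusive.

    frame_scores: list of dicts, one per sliding-window frame, mapping a
        character to its log-likelihood in that frame (hypothetical input).
    transcript: the known character sequence of the line.
    """
    n_frames, n_chars = len(frame_scores), len(transcript)
    if n_chars == 0 or n_frames < n_chars:
        raise ValueError("need at least one frame per character")
    NEG = float("-inf")   # marks unreachable states
    MISSING = -1e9        # floor for characters absent from a frame's dict

    # best[t][k]: best score of frames 0..t with frame t assigned to char k
    best = [[NEG] * n_chars for _ in range(n_frames)]
    advanced = [[False] * n_chars for _ in range(n_frames)]
    best[0][0] = frame_scores[0].get(transcript[0], MISSING)

    for t in range(1, n_frames):
        for k in range(min(t + 1, n_chars)):
            stay = best[t - 1][k]                      # same character
            move = best[t - 1][k - 1] if k > 0 else NEG  # next character
            emit = frame_scores[t].get(transcript[k], MISSING)
            if move > stay:
                best[t][k] = move + emit
                advanced[t][k] = True
            else:
                best[t][k] = stay + emit

    # Backtrack from the last frame and last character.
    starts, ends = [0] * n_chars, [0] * n_chars
    k = n_chars - 1
    ends[k] = n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][k]:
            starts[k] = t
            k -= 1
            ends[k] = t
    return list(zip(transcript, starts, ends))


# Toy example: three frames that look like "a", then two that look like "b".
frames = [{"a": -0.1, "b": -3.0}] * 3 + [{"a": -3.0, "b": -0.1}] * 2
print(forced_align(frames, "ab"))  # [('a', 0, 3), ('b', 3, 5)]
```

Frame indices can then be mapped back to pixel columns by multiplying by the window shift, which is how character boundaries would fall out as a by-product in this simplified setting.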
A method to automatically segment and annotate handwritten words from line images using forced alignments was proposed by Zimmermann and Bunke (2002). The problem then shifted to mapping the whole transcription of historical documents to segmented words or lines (Kornfield et al., 2004). When the word or line segmentation is not known, a global forced alignment of the full transcript is possible, as proposed by Fischer et al. (2011), or at different levels (word, line, sentence, page), as proposed by Al Azawi et al. (2013). Our model is similar to the last one, but is based both on A2iA proprietary image processing libraries and on Kaldi, an open-source toolkit for speech recognition, which has been adapted to the task of text-image alignment. This system adds yet another level of complexity since it also deals with abbreviations and with cursive handwriting with no blank space between the letters, for which the character segmentation cannot easily be found.

Proposed Method for Handwritten Text Character Segmentation

The character segmentation results from an incremental process (Figure 1): (1°) we convert the image to grayscale and remove black borders; (2°) we apply a text line segmentation algorithm, adapted from Zahour et al. (2007), to the full page; (3°) we keep the pages with a correct number of detected lines; (4°) we assign the line transcripts to the line images and use them to train a first recognizer, a Hidden Markov Model based on Gaussian Mixture Models (GMM-HMM), which is (5°) used to align the line transcriptions with the line images as described in Bluche et al. (2014); (6°) based on this result, we train a new recognizer. This process is repeated until all text lines are correctly transcribed. Afterwards, we train a final text recognizer based on deep neural network HMMs. This model is trained with a discriminative criterion and yields better transcription results and segmentation accuracy than the standard GMM-HMM (example of forced alignment at word and letter level in Figures 2a and 2b).
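
Steps (3°) and (4°) of the process above can be sketched as follows. This is an illustration only: the function name and the data layout (dictionaries keyed by a page identifier, line boxes as tuples in reading order) are assumptions made for the example, not the project's actual data model.

```python
def pair_lines_with_transcripts(detected_lines_by_page, transcript_by_page):
    """Keep only pages whose detected line count matches the edition.

    detected_lines_by_page: page id -> list of line bounding boxes in
        reading order (hypothetical layout: (x, y, width, height)).
    transcript_by_page: page id -> list of transcribed lines.
    Returns (pairs, rejected_pages), where pairs holds
    (page id, bounding box, line text) triples usable as training data.
    """
    pairs, rejected = [], []
    for page_id, transcript_lines in transcript_by_page.items():
        boxes = detected_lines_by_page.get(page_id, [])
        if len(boxes) != len(transcript_lines):
            # Line detection disagrees with the edition: set the page aside
            # for a later iteration instead of training on wrong pairs.
            rejected.append(page_id)
            continue
        pairs.extend(
            (page_id, box, text) for box, text in zip(boxes, transcript_lines)
        )
    return pairs, rejected


detected = {
    "p1": [(0, 0, 900, 40), (0, 45, 900, 40)],
    "p2": [(0, 0, 900, 40)],  # one line missed by the detector
}
edition = {
    "p1": ["first line", "second line"],
    "p2": ["first line", "second line"],
}
pairs, rejected = pair_lines_with_transcripts(detected, edition)
print(len(pairs), rejected)  # 2 ['p2']
```

Pages set aside in this way can be revisited once a better recognizer is available, which matches the incremental spirit of the process without claiming to reproduce it.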

Evaluation of Alignment Accuracy

Two methods were applied to evaluate the automatic alignment at word level. First, a tabular view is used to validate or reject each occurrence of a word, evaluate average accuracy, and spot problematic lines. Then a complete validation is performed by a palaeographer, line by line. The accuracy is computed from the distance in pixels between the automatic segmentation and the ground-truth word boundaries (distribution in Figures 3 and 4). In Graal, 95% and 89%, respectively, of the left and right boundaries are correct with a 10px (0.84mm) tolerance. In Fontenay, the results are 91.8% and 89.4%, respectively, with a 30px (3.58mm) tolerance and 85% and 74.4%, respectively, with a 15px (1.79mm) tolerance, less than half of the average character width (23px, i.e., 1.94mm in Graal and 45px, i.e., 5.37mm in Fontenay): this is a great achievement.
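
The tolerance-based measure described above can be expressed in a few lines. The sketch below assumes that predicted and ground-truth boundaries are given as parallel lists of x-coordinates in pixels; the function name and the sample values are invented for illustration and are not taken from the Graal or Fontenay data.

```python
def boundary_accuracy(predicted, reference, tolerance_px):
    """Share of boundaries whose absolute pixel error is within tolerance."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        1 for p, r in zip(predicted, reference) if abs(p - r) <= tolerance_px
    )
    return hits / len(reference)


# Invented left boundaries of four words (x-coordinates in pixels).
predicted_left = [100, 205, 330, 412]
reference_left = [102, 200, 310, 410]
print(boundary_accuracy(predicted_left, reference_left, 10))  # 0.75
```

Left and right boundaries would be scored separately with the same function, which gives the pairs of percentages reported for each tolerance.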

The alignment was also performed at a character level on Graal and Fontenay. The immense number of characters makes the validation difficult. Samples show a very high accuracy rate for complex graphic structures (e.g., 100% for st-ligature), but further tools are needed to measure the accuracy. The results of evaluation partly depend on the validator’s skills in reading ancient scripts.

Evaluation of Transcription Accuracy

The HTR-based system was also tested for recognition and evaluated according to the word and character error rate criteria (WER/CER) by splitting the corpus into a training set (101 pages) and a test set (29 pages). For the recognition, we used a lexicon of 7,005 unique words and a 4-gram statistical language model estimated on the training set. The standard GMM-HMM achieved 23.0% WER and 6.9% CER, whereas the hybrid deep neural network HMM achieved 19.0% WER and 6.4% CER; that is, more than 80% of words and 93% of characters are accurately recognised.
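
For readers unfamiliar with these criteria, WER and CER are commonly computed as the Levenshtein (edit) distance between the reference and the recognised output, divided by the length of the reference, counted in words or in characters. The following sketch implements that standard definition; it is not the evaluation script used in the project, and the sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the length of the reference."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


reference = "li rois artus"
hypothesis = "li roi artus"   # one word wrong, one character missing
print(round(wer(reference, hypothesis), 3))  # 0.333
print(round(cer(reference, hypothesis), 3))  # 0.077
```

Because insertions count as errors, both rates can exceed 100% in principle, so reading 19.0% WER as "81% of words correct" is an approximation.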

Granularity and Scalability

This method unlocks a new level of granularity and allows modelling of different letterforms (‘allographs’, e.g., s/ſ/S). In Graal, palaeographers provided the analysis of the graphical chain, and the system had to choose between identified possible solutions (Figure 5). In Fontenay, the system had to create several character models without any previous knowledge, resulting in allograph clustering.

Even at this fine level of granularity, this system is scalable to large corpora. Graal includes 130 pages, 10,700 lines, 114,268 words, and more than 400,300 characters; Fontenay includes 104 pages, 1,341 text lines, 22,276 words, and more than 99,900 characters. In comparison, the historical databases ‘Saint-Gall’ and ‘Parzival’ comprise, respectively, 60 pages, 1,410 text lines, 11,597 words; and 47 pages, 4,477 text lines, and 23,478 words. The four-year DigiPal project produced a database encompassing 61,372 manually annotated images of letters, without text transcriptions.

The system is furthermore robust (book hands and diplomatic scripts), and the data format is fully TEI compliant.

Human in the Loop: Evaluation, Verifiability, and Ergonomics

In interdisciplinary research, scholars in the humanities and in computer science must articulate their respective systems of proof and uncover underpinnings and presuppositions, in order to produce efficient systems that present data in a way that scholars on all sides can understand, evaluate, and trust (Stutzmann and Tarte, 2014). During this research, we repeatedly observed that the so-called ground truth was not 100% accurate or did not correspond to what the system was expected to produce (e.g., transposed words are edited in the order in which one should read them, while the alignment can only match the words in their order on the line), which led us to hesitate and to explore new paths. This is a challenge for future developments: large resources will all contain inaccuracies or information that cannot be processed automatically. Likewise, some inconsistencies in evaluation may appear. Increasing the interactivity of software tools is a solution, not only to overcome the shortcomings of strictly automatic approaches, but also to correct the ground truth and improve the tools and the data models. Therefore, this project also developed user-friendly software (Leydier et al., 2014). Agile development and interoperability concern not only software creation but also corpus enhancement, so as to make the most of humanist, computer scientist, and machine competences. The alignment was performed in three person-months, obviously far less than manually drawing boxes around words (or even letters) would require. Our tools and system open up a large-scale, standardized, interoperable approach to historical scripts. The human in the loop is part of an interdisciplinary work and process, avoiding tautological approaches and allowing better results, user-friendly tools, and a better understanding on all sides.


Figure 1. HTR-alignment.

Figure 2a. Forced alignment at word level.

Figure 2b. Forced alignment at character level.

Figure 3. Graal alignment accuracy (number of occurrences / correction on left or right boundaries, in pixels).

Figure 4. Fontenay alignment accuracy (number of occurrences / correction on left or right boundaries, in pixels).

Figure 5. Modelling allographs and graphical connexions.


Al Azawi, M., Liwicki, M. and Breuel, T. M. (2013). WFST-Based Ground Truth Alignment for Difficult Historical Documents with Text Modification and Layout Variations. Proceedings of the SPIE, Burlingame, CA, Document Recognition and Retrieval, vol. 20 (4 February), doi:10.1117/12.2003134.

Bluche, T., Moysset, B. and Kermorvant, C. (2014). Automatic Line Segmentation and Ground-Truth Alignment of Handwritten Documents. International Conference on Frontiers in Handwriting Recognition.

DigiPal: Digital Resource and Database of Manuscripts, Palaeography and Diplomatic. (2011–2014). London.

Fischer, A., Frinken, V., Fornés, A. and Bunke, H. (2011). Transcription Alignment of Latin Manuscripts Using Hidden Markov Models. Proceedings of the Workshop on Historical Document Imaging and Processing.

Hassner, T., Wolf, L. and Dershowitz, N. (2013). OCR-Free Transcript Alignment. International Conference on Document Analysis and Recognition.

Kornfield, E. M., Manmatha, R. and Allan, J. (2004). Text Alignment with Handwritten Documents. First International Workshop on Document Image Analysis for Libraries.

Leydier, Y., Eglin, V., Bres, S. and Stutzmann, D. (2014). Learning-Free Text-Image Alignment for Medieval Manuscripts. International Conference on Frontiers in Handwriting Recognition.

Stutzmann, D. and Tarte, S. (2014). Executive Summary. In Hassner, T., Sablatnig, R., Stutzmann, D. and Tarte, S. Digital Palaeography: New Machines and Old Texts (Dagstuhl Reports), 4(7): 112–34, 112–14, 132.

Zahour, A., Likforman-Sulem, L., Boussellaa, W. and Taconet, B. (2007). Text Line Segmentation of Historical Arabic Documents. International Conference on Document Analysis and Recognition.

Zimmermann, M. and Bunke, H. (2002). Automatic Segmentation of the IAM Off-Line Database for Handwritten English Text. 16th International Conference on Pattern Recognition.
