Book History and Software Tools: Examining Typefaces for OCR Training in eMOP

Todd Samuelson; Matthew J Christy; Katayoun Torabi; Bryan Tarpley; Elizabeth Grumbach

Authorship

1. Todd Samuelson

Texas A&M University
2. Matthew J Christy

Texas A&M University
3. Katayoun Torabi

Texas A&M University
4. Bryan Tarpley

Texas A&M University
5. Elizabeth Grumbach

Texas A&M University

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In 1936, the notable English bibliographer A. W. Pollard admitted in his Preface to Frank Isaac’s English Printers’ Types of the Sixteenth Century that “[he] had a very poor eye for distinguishing types and a very poor head for remembering them.”1 Pollard is hardly alone among experts in the history of printing in this shortcoming. Even among scholars with decades of experience in scrutinizing features of the printed book, the ability to distinguish and identify typefaces is a notorious challenge. The literature about early type designs and designers (known as punchcutters) is partial and contradictory; the variations in typefaces are subtle and, at times, inconclusive; and the ability to make differentiations has been considered less a matter of regimented principle than of elusive skill. As Harry Carter suggested, “it is evident that in considering the face of a fount of type we are in a world of art, . . . not a mechanical proceeding or anything susceptible of scientific treatment."2
However, it is precisely the consideration of “founts of type” that is currently engaging a majority of the Early Modern OCR Project (eMOP) team. eMOP, a 2-year Mellon Foundation-funded grant project underway at the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, aims to OCR the documents that comprise the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) collections. As a project that involves collecting and aggregating huge amounts of data, OCR’ing 45 million page images on a high-performance computing cluster, and the development of several software tools and services, eMOP is technology-laden. But at its heart eMOP is a Humanities project, conceived by Humanists, driven by the needs of Humanities scholars, and supported throughout by book history and an understanding of the development of print type in the 15th-18th Centuries.
For eMOP’s book historian, Dr. Todd Samuelson, one of the difficulties in conceptualizing the scope of eMOP has centered in a potential conflict between DH methodology (as encompassed by “big data”) and the traditional means of approaching type identification: as Carter noted, it is an art steeped in years of hard-won practice rather than a science with predictable and reproducible models. While DH is focused on humanities questions and methodologies, it does employ scientific principles as well, especially when dealing with a very large set of documents, and conflict can arise by trying to synthesize a skill set based on minutiae with an extremely large data set. By contrast, even when big data projects incorporate crowdsourcing and the oversight of human experts, they require the ability to find readily transferrable commonalities, rather than to establish proficiency in a small number of experts. In the course of the eMOP project, we have found that the development or adoption of specific software tools has helped to ameliorate this conflict and incorporate type history scholarship into the training of OCR engines.
One of the ideas driving eMOP work is that, by training OCR engines to recognize specific early modern fonts, we can increase the accuracy of those engines when used to OCR documents printed in those fonts. To accomplish this, the eMOP team has spent most of the last year investigating font history, creating a database of early modern printers and the fonts they used, and developing and testing tools and techniques to train Tesseract (an open-source OCR engine) to recognize these fonts. The ability to distinguish between different, but sometimes closely related, fonts, and to train Tesseract to recognize these distinctions has been a central focus. For example, the general classification of different families of typefaces has been attempted by book historians, including Adrian Weiss, who categorized unknown English typefaces of a certain period as either “S-face” or “Y-face”.3 So, though the source of the typeface may not be ascertainable, certain characteristics can be defined which allow scholars (and potentially OCR engines) to identify and group the typefaces more accurately.
As has already been noted, identifying examples of S- and Y-face characters and distinguishing between them, especially when both can be present in one document, is a difficult enough task for an expert. Trying to find all instances of the lower-case letter ‘w’ in a document, as an example, and then deciding which exemplars match some specified “ideal” is difficult and time consuming. Fortunately, eMOP has software tools that can drastically simplify this task, and even allow non-experts to do some of the work. Those tools were originally developed to create training for Tesseract to recognize early modern typefaces, but can also be applied to support research into the typefaces themselves.
To create specific font training for the Tesseract OCR engine, a team of undergraduate student workers, lead by IDHMC graduate student Katayoun Torabi, first process the available page images using Aletheia Desktop. (Aletheia was developed by the Pattern Recognition and Image Analysis (PRImA) Research Laboratory at the University of Salford. Apostolos Antonacopoulos, IMPACT Work Package leader for PRImA, University of Salford, has made Aletheia and other tools available at http://www.primaresearch.org/tools.php.) Aletheia Desktop includes several semi-automated tools that identify and define layout regions, lines, words, and individual characters (glyphs) within documents. Aletheia reads the text in the page image (using Tesseract) and assigns a Unicode value for each letter, number, and punctuation mark.

Fig. 1: Aletheia Desktop with identified glyphs and some of their associated Unicode values.
As output, Aletheia creates an XML file that contains a set of XY coordinates, along with the associated Unicode value, for each identified glyph. The data contained in this XML file is then ingested or imported into a tool created by IDHMC graduate student Bryan Tarpley called Franken+. Franken+ uses a MySQL database to associate each glyph image with its corresponding Unicode character. The user can then select any glyph from a drop down menu to see every instance of that character in a window (Fig. 2). With every instance of a particular glyph (for example all the ‘a’s) from a document available in one window, the user can quickly identify mislabeled glyphs and choose the best exemplar (or exemplars) for each glyph in that font set (Fig. 2). Once the user has isolated the best instance(s) of each character, Franken+ uses a standard text document to produce a set of synthetic TIFF images and XML files, producing a “Franken-text” with only these ideal characters. This Franken-text matches the characteristics of Tesseract’s expected training file and so can be used to train Tesseract to recognize the typeface being processed.

Fig. 2: Some images of the Frank+ user interface.
eMOP’s book history team immediately realized that the capabilities of Aletheia and Franken+ would tremendously benefit their research into the S-face vs Y-face font question. The ability of Franken+ to display all instances of a given letter from a set of page images in one window dramatically simplifies the task of identifying all examples of any letter in a set of pages. And, being able to examine all these examples alongside each other makes comparing similarities or differences much easier and faster (Fig. 3). After a quick installation of Franken+ and less than an hour of training, the book history team was able to commence work on their research question in earnest. Since Franken+ was introduced at the 2013 Doc Eng Conference,4 the eMOP team has been contacted by several international scholars interested in learning more about Franken+ for use in their research on typefaces.

Fig. 3: An image from Franken+ of a set of exemplars of the “a” glyph from one document.
The study of early modern fonts is a road less traveled in the landscape of Humanities research. Based as it is on minutiae and requiring incredible attention to detail, this work traditionally has been left to a handful of individual scholars. However, the development of Franken+ for eMOP, when used in conjunction with Aletheia, promises to open up this field of study to scholars who may have been interested in it, but found the challenges too daunting. This paper will describe aspects of the eMOP work being done in the field of early modern type research, and will introduce Franken+ as a valuable new tool in this research. The creation of tools like Franken+ have the potential to increase attention and alter research methodologies for this field.
References

1. Pollard, A. W. (1936) Preface. Isaac, Frank. English Printers’ Types of the Sixteenth Century. Oxford: Oxford UP. (v-vii).
2. Carter, Harry Graham (1969). A View of Early Typography Up to About 1600. Oxford: Clarendon.
3. Weiss, Adrian. (1990) Font Analysis as a Bibliographical Method: the Elizabethan Play-Quarto Printers and Compositors. Studies in Bibliography 43: 95-164.
4. Torabi, Katayoun, Jessica Durgan, and Bryan Tarpley (2013). Early Modern OCR Project (eMOP) at Texas A&M University: Using Aletheia to Train Tesseract. ACM Press. 23. doi:10.1145/2494266.2494304. Web. 31 Oct. 2013.

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2014

"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO

Book History and Software Tools: Examining Typefaces for OCR Training in eMOP

1. Todd Samuelson

2. Matthew J Christy

3. Katayoun Torabi

4. Bryan Tarpley

5. Elizabeth Grumbach

ADHO - 2014

"Digital Cultural Empowerment"