Distributed "Forms of Attention": eMOP and the Cobre Tool

paper, specified "long paper"
  1. 1. Anton Raymund duPlessis

    Texas A&M University

  2. 2. Laura Mandell

    Texas A&M University

  3. 3. James Creel

    Texas A&M University

  4. 4. Alexey Maslov

    Texas A&M University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

A recent article by Paul Gooding, Melissa Terras, and Claire Warwick argues that a gap in our knowledge about the impact upon scholars of “large-scale digitized collections” of textual data has spawned “myths” about mass-digitization—those surrounding the distant- vs. close-reading debate, as well as dystopian arguments about digitally disrupted attention spans (Gooding et. al.; Moretti; Trumpener; Guillory; Hayles). “Where understanding lags behind innovation,” Gooding, Terras, and Warwick argue persuasively, “the rhetoric of technological determinism can fill the void” (633). Central to the myth that digital media produce “crowds of quick and sloppy readers” (632) is ignorance: the last decade has witnessed the emergence of a spate of digital editing and annotating tools as well as the emergence of what might be called “network editing.”[2] As has been amply revealed by the Australian Newspaper Digitisation Program, as well as other projects involving the crowd in correcting textual transcriptions such as Transcribe Bentham, people are as much engaged by the task of modifying digital textual archives as they are in using them (Halley, Terras), and textual “modding”[3] by networks of users requires paying close attention to texts, as might networked monasteries of monks in the process of transcribing them. The need for distributed networks of people helping to solve problems endemic to creating large textual corpora, in other words, fosters close attention to text. This fact is demonstrated by the Early Modern OCR Project (eMOP)[4]: the attempt to produce searchable text for 45 million page-images of texts published between 1473 and 1800 would fail were it not for the project’s adaptation of Cobre, a tool originally designed for closely examining various 16th Century Iberoamerican imprints, and Cobre elicits careful scholarly attention from globally distributed experts and citizen scholars wishing to take part in improving the quality and kind of information available on the Internet.

The eMOP project focuses on how to digitize the archive of early modern texts, despite problems entailed. The printing process in the hand-press period (1473-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Adding to these factors, the quality of the digital images of these books is very poor: ProQuest’s Early English Books Online dataset (EEBO) and Gale’s Eighteenth-Century Collections Online (ECCO) contain page images that were digitized in the 1990s from microfilm created in the 1970s and 80s. Hand-press printing as well as skewed low-quality images with no gray-scale originals creates a problem for Optical Character Recognition (OCR) software. OCR engines are notoriously bad at translating into texts digital images of early modern texts even under the best of circumstances (Gooding, Terras, and Warwick). That is trying to translate the images of these pages into archiveable, mineable texts.

The Early English Books Online dataset (EEBO) consists of a collection of approximately 15 million digital page images of texts published between 1473 and 1700, and these page images are practically impenetrable to OCR engines. Moreover, metadata for such early texts is notoriously unreliable: according to David Foxon, title pages don't only lie, they sometimes joke, naming the printer typically used by a rival author as a way of implicating that author in the text's composition, or naming a bookseller in an area of London such as the theater district, for satirical purposes. Not only the binding of books, but the re-use of previously printed materials in "new" books makes it very difficult to know what is actually proffered by any title--what editions of other works might be included, unacknowledged in the metadata. A consortium of libraries called the Text Creation Partnership (TCP) has decided to key in, type by hand, one instance of each title in the collection, but obviously, in this context, “the same title” rarely indicates what “buried treasures” lie beneath its mark (Jackson). Moreover, it is even the case that individual witnesses of the same edition vary because of stop-press additions and corrections, changes made during the run of a single printing of one edition.

Gibbs muses that “even once we have more reliable OCR technology, it would be nice to have an infrastructure to allow the manuscripts to be viewed together and improved by user expertise” and expresses hope for a transcription editor with an unobtrusive, functional, and intuitive interface, that allows text to be easily (re)configured while displaying variations between versions. Cobre (COmparative Book REader), a suite of image viewers and tools developed to facilitate detailed interaction with the collection of 16th Century New World imprints in the Proyecto losPrimeros Libros de las Américas: Impresos mexicanos y peruanos del Siglo XVI en la bibliotecas del mundo meets those needs.[5] Cobre ingests content from an OAI/PMH enabled digital repository. To populate a Cobre instance with texts for eMOP triage, we first structure the page images and their associated OCR transcriptions in the DSpace Simple Archive Format for ingestion into a DSpace repository, from which they can be imported into Cobre. Intrinsic to Cobre’s functionality is a Detailed View that not only places page images in context, via a filmstrip metaphor, but provides multiple zoom levels and the ability to drag the page in the viewer pane (Liles et al). Cobre’s Comparison View likewise uses a filmstrip view of two or more books together (Liles et al). These filmstrips can be locked, keeping them aligned when any one filmstrip is moved and when a thumbnail in the filmstrip is clicked, a side-by-side view of all the pages appears (Liles et al) in a Quick Comparative View.

Fig. 1: Cobre's Detailed View with imported OCR and pane for text correction

Fig. 2: Cobre's Comparative View of multiples exemplars

Fig. 3: Cobre's Quick Comparative View of page images and OCR output side-by-side.

Though the Cobre tool was built for the purposes of transcription and thus is technically a crowd-sourced transcription tool like the Bentham wiki and the other tools listed by Melissa Terras,[6] it resembles Ben Brumefield’s FromThePage, also mentioned by Terras, in being half transcription tool and half social editor of the sort described by Ray Siemens et. al. Like Bentham, Siemens’s own Devonshire ms. uses Wikimedia so that editing can be discussed as well as implemented. Cobre too allows for transcription, annotation, and—a key attraction for experts—editing and adding information to page images and to the metadata.

We performed user studies on the tool, bringing in book history experts James Raven and Robert D. Hume to test the tool, and we videotaped their eye movements while recording their comments. There are many things that they did not find intuitive which we fixed on our last development sprint. We will be performing another set of user studies when we teach approximately 50 people to use Cobre at a pre-conference workshop at the American Society for Eighteenth-Century Studies, to be held in Williamsburg, VA, 19 March 2014.

Transcribe Bentham and the ANDP have been very successful at recruiting experts and citizen scholars (Causer, et. al., and Holley). While the ANDP required a very light form of attention, transcription, and Transcribe Bentham slightly more – users were asked to encode as well – we will be asking users of Cobre to transcribe and compare transcriptions to multiple pages of various editions. Can a network of “authorized” users who will be carefully comparing pages of editions be generated of sufficient strength? Can there be networked—and so massive—close-reading? We will report on whether and how it is possible to generate distributed forms of attention that are required for careful digitization en masse.

[1] Kermode’s phrase encapsulates the scholarly output over the last two centuries that have produced close readings of the meanings and forms of canonical works of art and literature

[2] Siemens et. al. describe “the social edition,” a procedure that was debated at a recent SSHRC-sponsored conference called “Social, Digital, Scholarly Editing,” hosted by Peter Robinson at the University of Saskatchewan (ocs.usask.ca/conf/index.php/sdse/sdse13). Gibbs calls for a “community transcription tool [that] will reduce significantly the barrier to entry and encourage mark-up of texts,” such as: the CWRC (www.dh2012.uni-hamburg.de/conference/programme/abstracts/cwrc-writer-an-in-browser-xml-editor), FromThePage (beta.fromthepage.com) and “Textual Communities” (www.textualcommunities.usask.ca)

[3] Craig Chappel’s view that game-modding is in decline may portend a trend against participation, but that view is arguable.

[4] emop.tamu.edu

[5] primeroslibros.org, libros.library.tamu.edu

[6] melissaterras.blogspot.com/2010/03/crowdsourcing-manuscript-material.html: Scratch, Remote Writer, and the tools hosted by the Australian National Digitisation Program (ANDP) and BYU Historica Journals.

Causer, T., J. Tonra, and V. Wallace (2012). Transcription Maximized; Expense Minimized?: Crowdsourcing and editing The Collected Works of Jeremy Bentham. Literary and Linguistic Computing 27.2, pp. 119-137.

Causer, T., and V. Wallace (2012). Building a Volunteer Community: Results and Findings from Transcribe Bentham. Digital Humanities Quarterly 6.1. www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html. Accessed 8 January 2014

Chapple, Craig (2013). An FPS Insurgency – Breaking into a Crowded Genre. 13 September 2013. www.develop-online.net/interview/an-fps-insurgency-breaking-into-a-crowded-genre/0117719 Accessed 31 October 2013.

Dahlström, Mats. ( 2009). The Compleat Edition. In Text Editing, Print and the Digital World. Eds. Marilyn and Kathryn Sutherland. Farnham, MA: Ashgate. 27-44.

Foxon, David, with James McLaverty. (1991). Pope and the Early Eighteenth-Century Book Trade. Oxford: Clarendon Press.

Gibbs, Frederick W (2011). New Textual Traditions from Community Transcription. Digital Medievalist 7. www.digitalmedievalist.org/journal/7/gibbs

Gooding, Paul, and Melissa Terras, Claire Warwick (2013). The Myth of the New: Mass Digitization, Distant Reading, and the Future of the Book. Literary and Linguistic Computing 28.4: 629-39.

Guillory, John (2010). Close Reading: Prologue and Epilogue, ADE Bulletin 149: 8-14.

Hayles, N. Katherine (2007). Hyper and Deep Attention: The Generational Divide in Cognitive Models, Profession 2007: 187-199.

Holley, Rose. (2009) How Good Can It Get: Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine 1.3/4.

---. Many Hands Make Light Work. March 2009. National Library of Australia. ISBN 978-0-642-27694-0

Jackson, Millie. (2008) Using Metadata to Discover the Buried Treasure in Google Book Search. Journal of Library Administration 47.1/2: 165-73.

Kermode, Frank (1985). Forms of Attention. Chicago: University of Chicago Press.

Liles, B., Creel, J., Maslov, A., Nuernberg, S., duPlessis, A., Mercer, H., McFarland, M. and Leggett, J. (2012). Cobre: A Comparative Book Reader for Los Primeros Libros. Proceedings of the 45th Hawaii International Conference on System Sciences (HICSS45), Maui, HI, USA. January 4-7, pp. 1707-16. dx.doi.org/10.1109%2fHICSS.2012.155

Moretti, Franco.Conjectures on World Literature, New Left Review 1: 54-68. (2000)

---. Graphs, Maps, and Trees: Abstract Models for a Literary History. New York: Verso, 2005.

---. More Conjectures, New Left Review 20 (2003): 73-81.

---. Style, Inc. Reflections ‘Relatively Blunt,’ 172-174.

Robinson, Peter.Textual Communities. www.textualcommunities.usask.ca

Siemens, Ray, and Meagan Timney, Cara Leitch, Corina Koolen, Alex Garnet.Toward Modeling the Social Edition: An Approach to Understanding the Electronic Scholarly Edition in the Context of New and Emerging Social Media. Literary and Linguistic Computing 27.4 (2012): 445-461.

Terras, Melissa. (2010) Crowdsourcing Manuscript Material. 2 March 2010. Adventures in Digital Humanities Blog. melissaterras.blogspot.com/2010/03/crowdsourcing-manuscript-material.html. Accessed 8 January 2014

---. Crowdsourcing or crowdsifting? Results and experiences from Transcribe Bentham. Paper given at Social, Digital, Scholarly Editing Conference, University of Saskatchewan, Saskatoon, Saskatchewan, Canada. 11 July 2013.

Trumpener, Katie. (2009). Critical Response I: Paratext and Genre System, Critical Inquiry 3.1: 159–71.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO