Texas A&M University
Texas A&M University
Texas A&M University
Introduction
Collation is an important step in textual criticism and it is most often an arduous task for most scholars involved in scholarly edition. Unsworth includes “collation” as one of the scholarly primitives which have been basic to scholarship across eras and media. Textual variation has been a pervasive problem affecting literary text since the invention of writing. It can arise in two forms - either due to repeated copying of a manuscript such as the variants in the First Folio of Shakespeare or those advertently inserted by the author/copyist such as the changes made in Mary Shelley’s Frankenstein. In the first case collation aids the scholar in generating a critical edition. In the latter case, collation can help the scholar understand the author’s purpose. Finding variations is important for research in bibliography and book history as well.
Most of the focus in digital humanities until now has been on making documents available digitally. Much less focus has been on actually supporting the process of scholarly research. The area of collation too awaits a lot more from technology. In the late 1940s Charlton Hinman invented the Hinman collator. Using optical means, it allowed manual comparison of separate copies of a text in order to detect differences. Descendants of the Hinman, collators like the Mcleod collator, the Lindstrand collator and the Hailey's Comet, are still used today. Mechanical collators take time to setup correctly, cannot be used on varying fonts, can damage the books, and are expensive and not very portable. Another approach is to perform collation using tools like Juxta or Collate X on text obtained by transcription or by OCR. This method is flawed by the limits of OCR technology and human error. The Sapheos project approaches the collation problem interestingly by attempting to unwarp the pages and registering them using SIFT key points, but this approach will fail if the text differs significantly.
Most of the tools used today are standalone which inhibits collaboration. Also scholars prefer original copies or facsimiles instead of OCR or transcription versions and most of these tools don’t support that.
This work proposes to address these problems. A prototype of the virtual Hinman (vHinman) collator was created and user-evaluation was conducted amongst scholars experienced with collation work. Image-matching algorithms along with context information are used to match words and the tool was integrated into the creativity support environment CritSpace. Moreover, CritSpace provides the functionality to easily extend the tool to support collating multiple (>2) copies.
Methodology
We developed and evaluated some approaches towards comparing page-images.
Some methods worked well only when the images are pre-registered and won’t be practical if the pages have different alignments and different font-sizes, so we used image processing techniques and image matching algorithms to perform comparison of images . We followed an approach similar to to compare word images amongst two scanned pages
We find the unique corner points on the individual words and find a feature vector for these points. Then through clustering we assign an id to each point. Then we are able to use sequence of ids for a word to compare with other words for similarity. Then we take into account the context of the words to aid in finding the exact match for the words.
Integration into CritSpace
Peter Robinson notes that the greatest effect of the digital revolution is that it is empowering a new model of collaboration among scholars, and between scholars and readers. In sync with this, the goal of this project is to integrate the collation tool into CritSpace to greatly increase its usefulness. CritSpace [Figure 1] is a web based creativity support environment that implements spatial information management strategies to assist scholars in their open-ended research tasks.
Fig. 1: Sample workspace with a text panel, image panel and facsimile viewer
The user-interface explained in figures below was planned to generate an effortless user-experience for digital scholars.
Fig. 2: Figure 2 Screenshot highlighting the differences with green boxes around the words. These are displayed when the user clicks on "Differences" button. Notable differences like missing hyphens are outlined as well as the end of the page.
Fig. 3: Figure 3 Screenshot demonstrating the tracking feature. When the user hovers the mouse over any block of word its corresponding match is highlighted in the other page in red. The ones that have already been checked are turned black
Fig. 4: Figure 4 Screenshot of the annotation feature. On enabling annotation mode, the user can select a word and a text box will appear. The text is displayed above the word every time annotation mode is set. A sample use-case has been outlined.
Fig. 5: Screenshot of collation output of two 17th century versions of The Late Tryal and conviction of Count Tariff
Fig. 6: Font variations in two versions of word "French"
Dataset
We tested the vHinman tool on various scanned texts available on the Internet Archive website and within TAMU collections. Some of them are digital copies of SherlockHolmes , copies of early printings of Donne’s Poems (1633 and 1635) and copies of The Late Tryal and conviction of Count Tariff. These books have many print and edition variants; for the pages of SherlockHolmes tested, the tool shows an average accuracy of 95% in tracking the matches.
User Evaluation
Five subjects were chosen to participate in this study which was a mix of semi-structured interview regarding the experience of scholars on collation, followed by a demo of the prototype and then by questions about the feedback of the tool and suggestions for its improvement.
ID Area of interest Career Stage
S1 Eighteenth century literature Senior
S2 Bibliography Senior
S3 Scholarly editing Senior
S4 Scholarly editing Senior
S5 Book history, Linguistics Senior
Most of the subjects had prior experience with collation either in their scholarly research or for some classroom activities. Some of the subjects had experience with mechanical collators or text based collators. Many of the subjects still prefer the paper-based manual collation method because they find the supporting tools either inaccurate or too cumbersome to use or both. The need of collation in the subjects’ research varied from the traditional scholarly editing process to bibliographic research and book history research.
S4 pointed out that he didn’t have the resources to do the transcription for each of the documents he works on and also said that they are prone to errors.
S1 pointed out the need to be able to find differences in font-styles, ligatures like the move from using the long “s” to the current “s”.
S2 liked the idea of integrating the vHinman into CritSpace which can foster collaborative work. She also liked the idea that the tool could have multiple panels (more than two). She pointed out that while supporting multiple images we can display the n-images in the form of medium sized thumbnails as is seen in “Google images”, where the scholar can select any two panels to collate at a time. She noted that the tool could bring forward new uses of collation and could get collation adopted by scholars who currently don’t focus much on it attributing the manual effort and inherent inaccuracies in the current method.
S5 suggested a novel use of the tool in verifying the authorship of a poem.
Some of the subjects felt the need to be able to point small differences like punctuation because this is important for a critical edition. Although our tool currently only supports identifying word differences, punctuation support can be added. S4 felt that the current implementation can quicken the collation process by addressing textual differences while punctuation can be addressed separately. The subjects in general liked the ability to use the original facsimile of the document via the tool rather than a transcription or a somewhat inaccurate OCR version of it.
Conclusion
This work has investigated the way humanities scholars perform collation work and what role collation plays in their research output. Collation is known to be a laborious and monotonous task unaided by technology so far. To address this problem, a prototype has been developed to perform collation in an automated manner. Image matching techniques are employed in building this prototype, so that the scholars can use the original facsimiles of the documents. The tool was integrated into CritSpace and will benefit from the collaborative environment. A user evaluation was conducted with experienced scholars. In summary, the tool looks very promising to the scholars and also has a high accuracy rate for the books tested so far. Thus this kind of tool can save a massive amount of time for scholars and set up a paradigm of digital collation encouraging even more scholars in finding new uses of collation in their work.
It extends the Hinman’s principles by allowing collating multiple editions of a book in addition to multiple copies of same edition having minor differences.
Since it is has application in creating a critical edition, bibliography and book history research, this tool has the capability of gaining widespread adoption.
Future Work
Beyond printed material, it will be interesting to evaluate the tool for handwritten documents and make it robust for such documents. Also it will be great to test the tool for non-English documents. We can try out different visualization formats and different ways the scholars can use the output in their work. A detailed usability study can be conducted where scholars can perform some real collation work on few pages and compare their traditional method and the vHinman. Also the accuracy could be tested for warped images as most of the unobtrusive scanning methods produce some warping on the images.
References
Unsworth J. (2000) "Scholarly Primitives: what methods do humanities researchers have in common, and how might our tools reflect this?” In Humanities Computing: formal methods, experimental practice (13 May 2000)
Schmidt, D., & Colomb, R. (2009). “A data structure for representing multi-version texts online.” Journal of Human–Computer Studies, 67(6), 497–514.
Smith, S.E. (2002) “’Armadillos of Invention’: A Census of Mechanical Collators, Studies in Bibliography” 55, pp. 133-170
Cream R.“Sapheos Project”, CDH University Of Southern Carolina sapheos.org
Raabe W. “Collation in Scholarly Editing: An Introduction”
wraabe.wordpress.com/2008/07/26/collation-in-scholarly-editing-an-introduction-draft
www.juxtasoftware.org
collatex.net
Lowe, D.G. (2004) “Distinctive image features from scale-invariant keypoints.” Int. J. Comput. Vision, 60:91–110, November 2004.
Audenaert, N. and Furuta, R. (2010) What Humanists Want: How Scholars use Primary Source Documents, Proceedings of the 10th Annual Joint Conference on Digital Libraries, pp. 283-292.
Yalniz, I. and Manmatha, R., "An efficient framework for searching text in noisy document images," Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS'12)
Robinson, Peter M. W. (1994). “Collation, textual criticism, publication, and the computer.” Text 7: 77-94. www.jstor.org/stable/pdfplus/30227694.pdf. 16 Dec. 2011
Audenaert, N., Lucchese, G. and Furuta, R.”CritSpace: a workspace for critical engagement within cultural heritage digital libraries” ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
Audenaert, N. “CritSpace: An interactive visual interface to digital collections of cultural heritage material”
Marshall, C.C., Shipman, F.M. “Spatial hypertext and the practice of information triage.” In: ACM Conference on Hypertext (Hypertext 1997), pp. 124–133 (1997)
Rosten E., and Drummond T. (2006) “Machine Learning for high-speed corner detection.” In European Conference on Computer Vision. Volume I, 430-443, May 2006.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Complete
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
Lausanne, Switzerland
July 7, 2014 - July 12, 2014
377 works by 898 authors indexed
XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)
Organizers: ADHO