Making Sense of Illustrated Handwritten Archives

Lambert Schomaker; Andreas Weber; Michiel Thijssen; Maarten Heerlien; Aske Plaat; Siegfried Nijssen; Fons Verbeek; Michael Lew; Eulalia Gasso Miracle; Katy Wolstencroft; Ernest Suyver; Bart Verheij; Marco Wiering; Rene Dekker; Joost Kok; Lissa Roberts; Jaap Van den Herik

Authorship

1. Lambert Schomaker

ALICE - Rijksuniversiteit Groningen (University of Groningen)
2. Andreas Weber

STePS - University of Twente
3. Michiel Thijssen

Brill Publishers
4. Maarten Heerlien

Naturalis Biodiversity Center
5. Aske Plaat

LIACS - Leiden University
6. Siegfried Nijssen

LIACS - Leiden University
7. Fons Verbeek

LIACS - Leiden University
8. Michael Lew

LIACS - Leiden University
9. Eulalia Gasso Miracle

Naturalis Biodiversity Center
10. Katy Wolstencroft

LIACS - Leiden University
11. Ernest Suyver

Brill Publishers
12. Bart Verheij

ALICE - Rijksuniversiteit Groningen (University of Groningen)
13. Marco Wiering

ALICE - Rijksuniversiteit Groningen (University of Groningen)
14. Rene Dekker

Naturalis Biodiversity Center
15. Joost Kok

LIACS - Leiden University
16. Lissa Roberts

STePS - University of Twente
17. Jaap Van den Herik

LCDS - Leiden University

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Figure 1. Page from a bundle of field notes, describing and depicting a mouse species. Source: Naturalis Biodiversity Center, Archief van de Natuurkundige Commissie voor Nederlands-Indië. Copyright: Public Domain Mark 1.0

This paper presents initial findings of the
research project
Making Sense of Illustrated Handwritten Archives and demonstrates the recognition capabilities of the MONK artificial intelligence system developed since the early 2000s at the institute of Artificial Intelligence and Cognitive Engineering (ALICE) at Groningen University (Van der Zant et al., 2008; Van Oosten and Schomaker, 2014a, Van Oosten and Schomaker, 2014b). In a period of four years (Q1 2016 – Q1 2020), our research project aims to produce an innovative and user-friendly online environment that combines both image and textual recognition, and allows an integrated study of fragmented historical heritage collections.
Next to a short demonstration of MONK, we use this paper to outline how handwriting and image recognition helps to increase the value of illustrated handwritten collections by enriching and linking information that is now inaccessible and disconnected. By doing so it advances the state of the art in automated extraction, classification, and networking of knowledge from heterogeneous manuscript collections.

The MONK system uses shape-based feature vector methods that have very few assumptions concerning the content or style of the material. It avoids the traditional OCR approach (optical character recognition) which assumes that individual characters are essentially legible. That assumption only holds for a tiny fraction of handwritten material and a limited number of scripts. The only assumptions MONK makes are that pictorial and textual segments are separated by white spaces; and that the layout, of underlining, etc. in a specific document, is consistent throughout the document. In MONK, classification methods are used that allow for a fast bootstrap from single example instances (nearest-neighbor search) (Gast et al., 2013). With larger numbers of labeled examples, models can be computed, varying from nearest-centroids to support-vector machines and neural networks in a continuous learning process (Krizhevsky et al., 2012; Liu et al, 2015; Guo, in press). A challenging topic from the technical point of view is the relation between existing semantic knowledge (ontologies) and the statistically inferred semantics using Google’s
word2vec and current deep-learning neural networks. Can the underlying structure and style in a collection of a common and realistic size be detected by such algorithms? Can the proposed enrichment system profit from generally available text corpora? The processing power required by the proposed architecture is substantial. For this project, algorithmic optimization of the image processing and recognition system is necessary in order to create the necessary speed and flexibility of the system for use by non-technical end users. In order to tackle this challenge the consortium will make use of the combined knowledge and expertise of ALICE in Groningen, and the Leiden Institute of Advanced Computer Science (LIACS), where multiple supercomputers and high performance computing experts are present.

Because of its visual approach, MONK can handle the diversity of material that we encounter in our use case and in historical collections in general: text, drawings, and images. MONK also does not require a language model nor fully transcribed samples to quickly assess the contents of an archive page. The human-in-the-loop approach of MONK is currently ‘label’ oriented, but will be enhanced by providing the user and the system with ontologies for bootstrapping the learning process. The system will understand handwritten corpora to such an extent that the visual and textual content on individual pages is categorized, determined and networked to other pages in the archive and external sources. To construct training examples for MONK, biologists and historians of science will manually label documents to the machine learning software by means of a human in the loop approach. In addition, a crowdsourcing approach will be used to further expand this corpus of examples. Our consortium will here build on the expertise of ALICE and Naturalis Biodiversity Center, the Leiden-based National Museum of Natural History. Eventually, the computer-assisted recognition of words and visual information on a page will thus allow users to search, filter and group arbitrary archive items and enables connections with external databases. Last but not least, MONK lays the groundwork for full transcription of any handwritten-illustrated archival collection.

Figure 2. Drawing of Burro multicolor created in Buitenzorg, Java in 1827 by Pieter van Oort. Source: Naturalis Biodiversity Center, Archief van de Natuurkundige Commissie voor Nederlands-Indië. Copyright: Public Domain Mark 1.0

The central use case of our research project is the collection of the
Natuurkundige Commissie voor Nederlandsch-Indië (hereinafter NC). It is one of the top-collections of Naturalis. From 1820 to 1850, the NC charted the natural and economic state of the Dutch East Indies and returned a wealth of scientific data and specimens which are now stored in archives in the Netherlands and Indonesia. The collection comprises thousands of handwritten notes and drawings and tens of thousands biological and geological specimens. While these archival items have all been digitized, the individual pages in the notebooks, diaries and reports are not catalogued nor labeled separately. Many of the field notes combine different textual and visual elements on one page. Our short paper presentation is based on the processing of an initial set of understudied handwritten field notes which we carried out in early 2016. By doing so, we will demonstrate the efficiency of the MONK system and our approach.

Owing to the different ‘hands’ and languages used in the documents, links across handwritten field records and notes, drawings and specimens cannot be made in an efficient way. Our corpus contains material from at least seventeen different writers and the used languages range from German
(Kurrentschrift) to Latin, French, and Dutch. The labels of related historic specimens only provide very general information on collection localities and collectors. Hence, the typical use case of a scholar wishing to retrieve information on a certain species, person, drawing, or collecting locality is limited. Owing to its sheer dimension and its weak structure, it is impractical to disclose and network this archive manually. Its current inaccessibility hampers research into Southeast Asian natural history and the history of (scientific) knowledge production. Knowledge extracted from the documents will be structured and served as Linked Open Data. This will allow interlinking of content and also enable interoperability with other cultural heritage resources, for example, the physical specimens obtained during expeditions, or other historically significant data collections from the same area.

The multi-layered character of the material makes it a perfect use case for developing a technologically advanced and usability engineered digital environment for interpreting and connecting illustrated-handwritten collections. In our consortium data scientists from the Universities in Leiden and Groningen work closely together historians of science from the University of Twente and taxonomy experts from Naturalis. Fueled by an investment from BRILL publishers in a national funding scheme, this project will not only result in the disclosure of the NC archive, but will also enable the integrated study of underexplored scientific manuscript collections and archives in general.

Bibliography

Zant, T. van der, Schomaker, L. and Haak, K. (2008). Handwritten-Word Spotting Using Biologically Inspired Features.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11): 1945–57.

Oosten, J.-P. van and Schomaker, L. (2014a). Separability versus prototypicality in handwritten word-image retrieval.
Pattern Recognition, 47(3): 1031–38. (Handwriting Recognition and Other PR Applications)

Oosten, J.-P. van and Schomaker, L. (2014b). A Reevaluation and Benchmark of Hidden Markov Models.
2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR). pp. 531–36.

Gast, E., Oerlemans, A. and Lew, M. S. (2013). Very large scale nearest neighbor search: ideas, strategies and challenges.
International Journal of Multimedia Information Retrieval, 2(4): 229–41.

Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks.
Advances in Neural Information Processing Systems. pp. 1097–105.

Liu, Y., Guo, Y., Wu, S. and Lew, M. S. (2015). DeepIndex for Accurate and Efficient Image Retrieval.
Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. (ICMR ’15). New York, NY, USA: ACM, pp. 43–50.

Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S. and Lew, M. S. (2015). Deep learning for visual understanding: A review.
Neurocomputing. (Available online 26 November 2015).

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2016

"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

454 works by 1072 authors indexed

Conference website: https://dh2016.adho.org/

Series: ADHO (11)

Organizers: ADHO