Handwritten Text Recognition and Palimpsest Analysis for Medieval Greek Manuscripts:

Takashi Okada; So Miyagawa; Kosei Kawazu; Tatsuya Ishii; Toshio Oka; Satoshi Fujimaki; Tomejiro Osawa; Yoko Shiki; Noriko Maehara; Keiichi Ishibashi; Kyosuke Sunada; Yasuhito Nakanishi

Authorship

1. Takashi Okada

Toppan Inc.
2. So Miyagawa

Kyoto University
3. Kosei Kawazu

Toppan Inc.
4. Tatsuya Ishii

Toppan Inc.
5. Toshio Oka

Toppan Inc.
6. Satoshi Fujimaki

Toppan Inc.
7. Tomejiro Osawa

Toppan Inc.
8. Yoko Shiki

Printing Museum, Tokyo
9. Noriko Maehara

Printing Museum, Tokyo
10. Keiichi Ishibashi

Printing Museum, Tokyo
11. Kyosuke Sunada

University of Tokyo
12. Yasuhito Nakanishi

Printing Museum, Tokyo

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Medieval Greek Manuscript Digital Database with High-Resolution Images
Medieval Greek manuscripts were written during the Byzantine Empire in the Middle Ages. Among various Greek writing styles, medieval Greek manuscripts are known to be difficult to read due to the frequent and diverse use of ligatures and various diacritics. The Biblioteca Apostolica Vaticana (BAV) preserves many medieval Greek manuscripts that have yet to be transcribed.
Since its establishment in 2000, the Printing Museum in Tokyo (PMT) has collaborated closely with the BAV. The Cicero Project, a joint research project that commenced in 2005 and will end in 2021, aims to create an open digital database of hundreds of the BAV’s medieval Greek manuscripts with high-resolution images of each folio, including palimpsests. After the project’s launch, Toppan Inc., PMT’s parent company, developed scanners and analysis systems.

Palimpsest Analysis
The scanner team studied page acquisition, data storage methods, and alternative methods to digitize the target pages. Then, following the decoding team’s research policy, each designated palimpsest was digitized. The palimpsest analysis software superimposed the white light image and the ultraviolet image of the palimpsest digitized by the scanner. The visible text was subsequently extracted, making it easier to decipher.

Result of the palimpsest analysis in the Cicero Project (Vat.Gr.1837)

As a result of the palimpsest analysis, a fragment of a work by 10th century Byzantine historian Leo the Deacon was found in the palimpsest manuscript Vat.Gr.1307 in the BAV collection. The fragment contained a description of Byzantine history and the origin of the Slavic peoples, which has significant differences with another already known manuscript of the same work (Janz, 2006).

Handwritten Text Recognition
It is physically impossible for the small number of researchers at the BAV to decipher approximately 30,000 images of Greek manuscripts that have been digitized. Therefore, in 2017, the new project succeeding the Cicero Project started developing a new deep-learning HTR (handwritten text recognition) system based on the
Fumi no Ha OCR technology (Toppan Inc, 2021), originally developed for recognizing handwritten cursive early modern Japanese texts.
Fumi no Ha includes our cursive script data set, AI cursive script recognition program, and the viewer that Toppan Printing Co., Ltd., currently known as Toppan Inc., already developed commercially. This system offers the following advantages:

Unlike existing line-based HTR using CRNN (Shi et al., 2015; e.g., bidirectional LSTM),
Fumi no Ha enables the identification of character coordinates for each character. As such, even when there are difficult-to-read characters, it is possible to reprint each character while referring to the image;

Recognition results can be generated together with confidence information. Users can discard characters with low confidence levels and perform more accurate single-character recognition for them instead;
Toppan Inc.’s web-browser-based image viewing software makes it easy to compare the original manuscript and the transcription. No special application is required for browsing; and
The HTML format output can be displayed anywhere as long as a web browser is available. There is no requirement for dedicated systems or maintenance costs.

Training data in HTR for Medieval Greek manuscripts

Training data were prepared by character on images of handwritten manuscripts provided by the BAV (Figure 2). In this phase, two experts on medieval Greek philology directed the training data: So Miyagawa and Kyosuke Sunada. Miyagawa’s experience of training Coptic OCR (Miyagawa et al., 2019) contributed to this training phase.

Fumi no Ha uses the hybrid system of character-based and line-based recognition systems. This hybrid system enables the recognition of highly complicated character layouts found in Japanese cursive manuscripts by employing the character-based system. At the same time, by using the line-based strategy, it ensures ease in the preparation of ground truth data. Medieval Greek manuscripts have various ligatures and an eccentric layout of letters. For this kind of complex character layout, the hybrid system is more appropriate than only line-based systems, such as Transkribus (Kahle et al., 2017) and OCR4all (Reul et al., 2019).

Conclusions
By taking full advantage of this set of the manuscript scanning, palimpsest analysis, and HTR programs, it is possible to build a useful database for the scholarly community, which provides high-resolution images and texts of medieval Greek handwritten manuscripts.

Bibliography

Janz, T. (2006).
Un nouveau témoin fragmentaire de Léon le Diacre le Vat. Sofia: Centre Dujčev.

Kahle, P., Colutto, S., Hackl, G., and Mühlberger, G. (2017). Transkribus: A Service Platform for Transcription, Recognition and Retrieval of Historical Documents.
Proc. 14th IAPR International Conference on Document Analysis and Recognition, pp. 19–24. DOI: 10.1109/ICDAR.2017.307.

Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., Büttner, A., and Puppe, F. (2019). OCR4all: An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings.
Applied Sciences
9(22). DOI: 10.3390/app9224853.

Shi, B., Bai, X., and Yao, C. (2015). An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). DOI: 10.1109/TPAMI.2016.264637.

Miyagawa, S., Bulert, B., Büchler, M., and Behlmer, H. (2019). Optical Character Recognition of Typeset Coptic Text with Neural Networks.
Digital Scholarship in the Humanities,
34(Suppl. 1): i135–i141. DOI: 10.1093/llc/fqz023.

Toppan Inc. (2021). Fumi no Ha. https://www.toppan.co.jp/biz/fuminoha/ (accessed 21 April 2022).

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022

"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO