Kraken - an Universal Text Recognizer for the Humanities

paper, specified "short paper"
Authorship
  1. 1. Benjamin Kiessling

    Leipzig University of Applied Sciences, Université Paris Sciences & Lettres (PSL)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Introduction
Retrodigitization of both printed and handwritten material is a common prerequisite for a diverse range of research questions in the humanities. While optical character recognition on printed texts is widely considered to be fundamentally solved in academia, with the most commonly used paradigm (Graves et al., 2006) dating back to 2006, this hasn't translated into increased availability of adaptable, libre-licensed OCR engines to the technically inclined humanities scholar.
The nature of the material of interest commands a platform that can be altered with minimum effort to achieve optimal recognition accuracy; uncommon scripts, historical languages, complex or archaic page layout, and non-paper writing surfaces are rarily satisfactorily addressed by off-the-shelf commercial solutions. In addition, an open system ameliorates the severe resource constraints of humanities research by enabling sharing of artifacts, such as training data and recognition models, inaccessible with proprietary OCR technology.

Kraken
The Kraken text recognition engine is an extensively rewritten fork of the OCRopus system. It can be used both for handwriting and printed text recognition, is easily (re-)trainable, and great care has been taken to eliminate implicit assumptions on content and layout that complicate the processing of non-Latin and non-modern works.
Thus Kraken has been extended with features and interfaces enabling the processing of most scripts, among them full Unicode right-to-left, bidirectional, and vertical writing support, script detection, and multiscript recognition. Processing of scripts not included in Unicode is also possible through a simple JSON interface to the codec mapping numerical model outputs to characters. The same interface provides facilities for efficient recognition of large logographic scripts.
Output includes fine-grained bounding boxes down to the character level that may be used to quickly acquire a large number of samples from a corpus to assist in paleographic research. Kraken implements a flexible output serialization scheme utilizing a simple templating language. Templates are available for the most commonly used formats ALTO, hOCR, TEI, and abbyyXML.
While including implementations of all the subprocesses needed in a text recognition pipeline, most functional blocks can be accessed separately on the command line, allowing flexible substitution of specially optimized methods. A stable programming interface allows total customization and integration into other software packages.

Recognition

Network architecture (H: sequence height, W: sequence length, C: alphabet size)
The recognition engine operates as a segmentation-less sequence classifier using an artificial neural network to map an image of a single line of text, the input sequence, into a sequence of characters, the output sequence. The artificial neural network employed is a hybrid convolutional and recurrent neural network trained with the CTC loss function (Graves et al., 2006) that reduces training data requirements to line-level transcriptions (Figure 3). Regularization is mainly provided by dropout (Hinton et al., 2012) after both convolutional and recurrent layers. User intervention in determining training duration and model selection is largely eliminated through early stopping.

Specialized networks, e.g. for particularly complex scripts, can be assembled from building blocks with a simple network specification language although the default architecture shown in Figure 1 is suitable for the vast majority of applications.
Processing of dictionaries and library catalogues with extensive semantic markup such as italic, underlining, and bolding, is also possible through specially prepared training data.

Layout Analysis and Script Detection

Sample output of the trainable segmentation method.
Kraken's layout analysis extracts text lines from an input image for later processing by the recognition engine. Apart from a basic segmenter taken from OCRopus a trainable line extractor is in the process of being implemented. Full trainability of layout analysis is of utmost importance to a truly universal OCR system, as text layout and its semantics varies widely across time and space, e.g. hand-crafted methods for printed Latin text are unlikely to work reliably on Arabic text or manuscripts with extensive interlinear annotation.

The trainable layout analysis module consists of a two-step instance segmentation method: an initial seed-labelling network operates on the whole page labelling the area between baseline and mean of each line. As the output of the network is a probability of each pixel belonging to a baseline it is binarized using hysteresis thresholding after smoothing with a gaussian filter. The binarized image is then skeletonized and end point are extracted with a discrete convolution. Finally, the vectorized baseline between the endpoints is rectified and a variable environment calculated based on the distance of connected components from the labelled area is extracted.
The seed-labelling network is a modified U-net (Ronneberger et al., 2015) on the basis of a 34-layer residual network (He et al., 2016) pretrained on ImageNet.
Preliminary results on a page from a publicly available dataset of Arabic and Persian manuscripts (Kiessling et al., 2019) can be seen in Figure 2.
Script detection, the basis for multi-script support in the recognizer, is implemented as a segmentation-less sequence classification problem, similar to text recognition. Instead of assigning a unique label to each code point or grapheme cluster we assign all code points of a particular script the same label. The network is then trained to output the correct sequence of script labels (Figure 3). The output sequence is then used to split the line into single-script runs that can be classified with monolingual recognition models (Figure 4).

Original and modified ground truth (top: original line, middle: transcription, bottom: assigned script classes)

Results

Mean character accuracy
Standard deviation
Maximum accuracy

Prints

Arabic (Kiessling et al., 2017)
99.5%
0.05
99.6%

Persian
Mid-20th century printing

98.3%
0.33
98.7%

Syriac
Late-19th century printing in Serṭā form

98.7%
0.38
99.2%

Polytonic Greek
Late-19th century printing

99.2%
0.26
99.6%

Latin (Springmann et al., 2018)
98.8%
0.09
99.3%

Latin incunabula (Springmann et al., 2018)
99.0%
0.11
99.2%

Fraktur (Springmann et al., 2018)
99.0%
0.31
99.3%

Cyrillic
99.3%
0.15
99.6%

Manuscripts

Hebrew
Midrash Tanhuma, BNF Héb 150

96.9%
-
-

Medieval Latin
Josephus Latinus, Bamberg 78 with augmentation

98.2%
-
-

Sample output of the script detection on a bilingual French/Arabic page. Note that Eastern Arabic are always classified as Latin text
Kraken has been used on a wide variety of writing systems, achieving uniformly high character accuracy (CER). Sample accuracies for a diverse set of scripts spanning across multiple centuries of printing are shown in Table 1.

As a special use case we evaluated recognition of text and emphasis in a mixed English and romanized Arabic library catalog on a training set of 350 lines (50 lines in the validation set) resulting in an averaged CER of 99.3% (σ=0.16) over 10 runs with 95.38% CER on cursive and text with increased spacing (σ=1.46). When using only emphasized text accuracy as the stopping criterium mean accuracy rises to 99.03% (σ=0.28).

Bibliography

Graves, A., Fernández, S., Gomez, F. and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.
Proceedings of the 23rd International Conference on Machine Learning. ACM, pp. 369–376.

He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors.
ArXiv Preprint ArXiv:1207.0580.

Kiessling, B., Miller, M. T., Maxim, G., Savant, S. B. and others (2017). Important New Developments in Arabographic Optical Character Recognition (OCR).
Al-ʿUṣūr Al-Wusṭā,
25: 1–13.

Kiessling, B., Stoekl Ben Ezra, Daniel and Miller, Matthew Thomas (2019). BADAM: A Public Dataset for Baseline Detection in Arabic-script Manuscripts.

Ronneberger, O., Fischer, P. and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation.
International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 234–241.

Springmann, U., Reul, C., Dipper, S. and Baiter, J. (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
ArXiv Preprint ArXiv:1809.05501.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.