Building an OCR Pipeline for a Republican Chinese Entertainment Newspaper

paper, specified "long paper"
Authorship
  1. Konstantin Henke

    Heidelberg Centre for Transcultural Studies, Heidelberg University

  2. Matthias Arnold

    Heidelberg Centre for Transcultural Studies, Heidelberg University

Work text


In recent years, the digitisation of newspapers has progressed considerably, and large national and international initiatives like Trove (https://trove.nla.gov.au/newspaper/), Chronicling America (https://chroniclingamerica.loc.gov/newspapers/), Europeana Newspapers (http://www.europeana-newspapers.eu/), Impresso (https://impresso-project.ch/), NewsEye (https://www.newseye.eu/), Oceanic Exchanges (https://oceanicexchanges.org/), OCR-D (https://ocr-d.de/en/), Deutsches Zeitungsportal (https://www.deutsche-digitale-bibliothek.de/newspaper), and Living with Machines (https://livingwithmachines.ac.uk/) have emerged that build on and go beyond sheer digitisation, venturing into various areas of content analysis (Oberbichler et al., 2021). Moreover, the outcomes of these initiatives are usually provided online with open access, and publications increasingly follow the FAIR principles (Wilkinson et al., 2016). However, most of the textual content covered is printed in Latin-script languages, and to a large degree the analytical systems rely on linguistic features like word boundaries, digital lexica, or tagged corpora.

Responding to this from an Asian perspective, i.e. looking at materials from regions where non-Latin scripts prevail, we find a different situation. In our case we are working with newspapers from Republican China. Although there are some projects working on historical Chinese newspapers (Stewart et al., 2020), their results have so far rarely been published. Other initiatives provide their final results as commercial products. In general, a certain reluctance can be observed when it comes to publishing research methodologies, not to mention the open-access sharing of ground truth, test corpora, or trained models (Arnold et al., forthcoming).

In our project we collected periodicals from the Republican era as image scans (Sung et al., 2014) and started OCR experiments to transform them into machine-readable full text.

Fig. 1: One of the 9385 fold scans of Jing bao

Looking closer at Jing bao 晶報 (The Crystal) (cf. Fig. 1), an entertainment newspaper that ran from 1919 to 1940, we soon learned that the key issue in OCR’ing the material actually lies in the page-level segmentation. We therefore started creating ground truth (GT) for geometrical data featuring semantically grouped bounding boxes with labels (article, image, advertisement, marginalia). We then used the resulting dataset to train dhSegment and have the network detect content areas on the folds (Arnold, forthcoming) (Fig. 2).

Fig. 2: Automatic page segmentation results. Blocks with text content are shown in yellow.
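
To illustrate the geometrical GT behind this step: each annotated region is essentially a labelled bounding box. The following minimal Python sketch is our own illustration; the field names are assumptions, not the project's actual annotation schema:

from dataclasses import dataclass

@dataclass
class Region:
    # One ground-truth region on a fold scan, coordinates in pixels.
    label: str  # "article", "image", "advertisement", or "marginalia"
    x1: int     # left edge
    y1: int     # top edge
    x2: int     # right edge
    y2: int     # bottom edge

# e.g. an article block annotated on a scanned fold
region = Region(label="article", x1=120, y1=40, x2=480, y2=900)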

Additionally, we created a text GT that not only covers all text in a machine-readable local XML format, but also contains information about the reading sequence and running direction of the text. Based on this GT we were able to process a first set of manual crops, introducing a character segmentation method for grid-based printing layouts which produced over 90,000 labeled images of single characters (Henke, 2021). In this work, a GoogLeNet is trained as an OCR classifier on said character images after extensive pre-training on synthetic character image data created from font files. Additional error correction using language models yields an accuracy of 97.44%.
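
Such synthetic pre-training data can be produced by rendering characters from font files; below is a minimal sketch with Pillow (font path, glyph placement, and output size are placeholder assumptions, not details from the thesis):

from PIL import Image, ImageDraw, ImageFont

def render_char(ch: str, font_path: str, size: int = 25) -> Image.Image:
    # Render one character as a grayscale image, roughly matching the
    # ~25x25 px character resolution of the Jing bao scans.
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (size, size), color=255)  # white background
    ImageDraw.Draw(img).text((0, 0), ch, fill=0, font=font)  # black glyph
    return img

# e.g. render_char("晶", "a_ming_typeface.ttf")  # hypothetical font file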

In our presentation we introduce our work on developing a document image processing pipeline, currently focusing on Republican Chinese newspapers with complex layouts such as the Jing bao. We will present the following concrete contributions:

1. A page-level segmentation approach (as seen in Fig. 2) yielding single text blocks.
2. An OCR pipeline taking single text blocks as input.

While Arnold (forthcoming) presented first promising experiments regarding (1), in this presentation we will concentrate on (2). Our evaluation metric for OCR output is the character error rate (CER) with regard to the ground truth annotation of every text block crop, which, based on the Levenshtein distance, is computed by:

CER = (S + D + I) / L

(S, D, I = number of substitutions, deletions, insertions; L = length of the reference sequence, i.e. the corresponding GT text.)
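
For reference, the CER can be computed with a standard dynamic-programming Levenshtein implementation; the following minimal Python sketch is our own illustration, not code from the project:

def levenshtein(ref: str, hyp: str) -> int:
    # Edit distance: minimal number of substitutions, deletions, insertions
    # needed to turn the hypothesis into the reference.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # CER = (S + D + I) / L, with L the length of the GT reference text.
    return levenshtein(ref, hyp) / len(ref)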

The character segmentation approach presented in Henke (2021) can, however, only process text blocks where characters are printed in a grid-like layout, which accounts for a very small portion of the Jing bao. Hence, there is a particular need for efficient character detection in less stable layout situations within text blocks before passing single character images on to the actual OCR engine. As a baseline, we leverage the publicly available state-of-the-art OCR tool Tesseract (Smith, 2007), which provides out-of-the-box segmentation and recognition models even for vertically printed traditional Chinese. Tesseract, however, seems to struggle with the low input image resolution (~25x25 px per character) and the overall inconsistent scan quality, leading to a very high CER of 47.85% on the test set from Henke (2021).
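
Such a baseline can be reproduced, for instance, via the pytesseract wrapper and Tesseract's stock model for vertical traditional Chinese (the input file name below is a placeholder):

import pytesseract
from PIL import Image

crop = Image.open("text_block_crop.png")  # hypothetical text block crop
# lang="chi_tra_vert": stock model for vertically printed traditional Chinese;
# --psm 5 makes Tesseract assume a single uniform block of vertical text.
text = pytesseract.image_to_string(crop, lang="chi_tra_vert", config="--psm 5")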

To solve this issue, we use the pre-trained HRCenterNet from Tang et al. (2020) for character detection and crop the detected bounding boxes to feed them into the GoogLeNet trained in Henke (2021). However, while our crops have a great variety of aspect ratios, HRCenterNet expects at least nearly square inputs. Hence, we cut the original images into 250x250 px tiles with a 50 px overlap (both horizontally and vertically, Fig. 3c). Bounding boxes (Fig. 3d) found in the overlapping sections are filtered during the non-maximum suppression (NMS) operation already included in the HRCenterNet pipeline (Fig. 3e).
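
The tiling-and-merging logic can be sketched as follows (a simplified NumPy illustration under our own assumptions; the detector call is omitted, and the NMS shown stands in for the one built into the HRCenterNet pipeline):

import numpy as np

TILE, OVERLAP = 250, 50
STRIDE = TILE - OVERLAP  # 200 px step, so neighbouring tiles share 50 px

def tile_image(img: np.ndarray):
    # Yield (x0, y0, tile) for overlapping 250x250 px tiles. Boxes detected
    # in a tile must later be shifted by (x0, y0) to map them back into the
    # coordinate system of the original crop.
    h, w = img.shape[:2]
    for y0 in range(0, max(h - OVERLAP, 1), STRIDE):
        for x0 in range(0, max(w - OVERLAP, 1), STRIDE):
            yield x0, y0, img[y0:y0 + TILE, x0:x0 + TILE]

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list:
    # Greedy non-maximum suppression over (x1, y1, x2, y2) boxes: keep the
    # highest-scoring box and drop the others that overlap it above iou_thr.
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]
    return keep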

Fig. 3: a) original image, b) image after contrast enhancement, c) tiling with overlap, d) bounding boxes found by HRCenterNet before NMS, e) final result after reconnection of tiles and NMS

In addition, Fig. 4 shows how HRCenterNet profits considerably from contrast enhancement (Fig. 3b) during image pre-processing, especially for low-contrast input images.

Fig. 4: Effects of contrast enhancement on character detection using HRCenterNet
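
The abstract leaves the concrete enhancement method open; one plausible choice for unevenly lit historical scans is contrast-limited adaptive histogram equalisation (CLAHE), sketched here with OpenCV (file name and parameters are assumptions):

import cv2

gray = cv2.imread("text_block_crop.png", cv2.IMREAD_GRAYSCALE)  # placeholder
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)  # boosts local contrast before character detection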

Using the above method, the CER on the test set of Henke (2021) is reduced to 5.64%.

In the presentation we will show how these results can be confirmed on a non-grid-based section of the corpus, for which we are currently creating GT annotations. We are confident that the additional pre-processing of crops and individual character images will help to further reduce the CER and, in combination with (1), yield a powerful document-level OCR pipeline for the Jing bao and similar Republican newspapers. This will not only open the door to further processing with the tools of Digital Humanities, but also further contribute to FAIR-based work in the diverse Asian sphere.

Bibliography

Arnold, M. (2021). Multilingual Research Projects: Challenges for Making Use of Standards, Authority Files, and Character Recognition. Digital Studies/Le Champ Numérique, forthcoming. DOI: 10.11588/heidok.00030918 (preprint).

Arnold, M., Paterson, D. and Xie, J. (forthcoming). Procedural Challenges: Machine Learning Tasks for OCR of Historical CJK Newspapers. International Journal of Digital Humanities, Special issue on Digital Humanities and East Asian Studies. (manuscript accepted by special issue editors, currently under review by journal).

Henke, K. (2021). Building and Improving an OCR Classifier for Republican Chinese Newspaper Text. B.A. thesis, Heidelberg University. DOI: 10.11588/heidok.00030845

Oberbichler, S., Boros, E., Doucet, A., Marjanen, J., Pfanzelter, E., Rautiainen, J., Toivonen, H. and Tolonen, M. (2021). Integrated Interdisciplinary Workflows for Research on Historical Newspapers: Perspectives from Humanities Scholars, Computer Scientists, and Librarians. Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24565

Smith, R. (2007). An Overview of the Tesseract OCR Engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), pp. 629-633. DOI: 10.1109/ICDAR.2007.4376991

Stewart, S., 朱吟清 Zhu, Y., 吴佩臻 Wu, P., 赵薇 Zhao, W., Gladstone, C., Long, H., Detwyler, A. and So, R. J. (2020). 比较文学研究与数字基础设施建设:以“民国时期期刊语料库(1918-1949),基于PhiloLogic4”为例的探索 (Comparative Literature Research and Digital Infrastructure: Taking the ‘Republican China Periodical Corpus (1918-1949), Based on PhiloLogic 4’ as an Example). 数字人文 Digital Humanities, no. 1: 175–82.

Sung, D., Sun, L. and Arnold, M. (2014). The Birth of a Database of Historical Periodicals: Chinese Women’s Magazines in the Late Qing and Early Republican Period. TSWL 33, no. 2: 227–37. URL: https://www.jstor.org/stable/43653333

Tang, C., Liu, C. and Chiu, P. (2020). HRCenterNet: An Anchorless Approach to Chinese Character Segmentation in Historical Documents. IEEE International Conference on Big Data (Big Data). DOI: 10.1109/BigData50022.2020.9378051

Wilkinson, M. D., Dumontier, M., Aalbersberg, IJ. J., Appleton, G., Axton, M., Baak, A., Blomberg, N. et al. (2016). The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data 3, no. 1: 160018. DOI: 10.1038/sdata.2016.18


Conference Info


ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO