Tikkoun Sofrim – Combining HTR and Crowdsourcing for Automated Transcription of Hebrew Medieval Manuscripts

Tsvi Kuflik; Moshe Lavee; Daniel Stökl Ben Ezra; Avigail Ohali; Vered Raziel-Kretzmer; Uri Schor; Alan Wecker; Elena Lolli; Pauline Signoret

Authorship

1. Tsvi Kuflik

University of Haifa
2. Moshe Lavee

University of Haifa
3. Daniel Stökl Ben Ezra

EPHE - École Nationale des Chartes - Université PSL, Orient et Méditerranée UMR, Cognitions Humaine et Artificielle (CHart) - Université Paris 8 Vincennes - Saint-Denis
4. Avigail Ohali

EPHE - École Nationale des Chartes - Université PSL, Cognitions Humaine et Artificielle (CHart) - Université Paris 8 Vincennes - Saint-Denis
5. Vered Raziel-Kretzmer

University of Haifa
6. Uri Schor

University of Haifa
7. Alan Wecker

University of Haifa
8. Elena Lolli

EPHE - École Nationale des Chartes - Université PSL, Cognitions Humaine et Artificielle (CHart) - Université Paris 8 Vincennes - Saint-Denis
9. Pauline Signoret

EPHE - École Nationale des Chartes - Université PSL, Cognitions Humaine et Artificielle (CHart) - Université Paris 8 Vincennes - Saint-Denis

Work text

The French-Israeli project Tikkoun Sofrim is part of a network of projects (Sofer Mahir, eRabbinica, Scribes of the Cairo Genizah, Scripta-PSL) aimed at developing a future framework for comprehensive and systematic textual availability, editions and deep annotation of (Hebrew) manuscripts via a pipeline that combines HTR with crowdsourcing the corrections and validation by scholars. End uses are scholarly editions (with TEI-Publisher), library service (integration via Mirador), automatic annotation, distant-reading and big-data studies.

The sample
The project focuses on Tanhuma-Yelamdenu Midrashim, late rabbinic exegetical works that lack a full scholarly critical edition, even though vulgate editions in two recensions [1,2] and translations [3] exist. Unlike other classical works, they were not subject to formal canonization but went through a long period of fluid transmission [4]. Manuscripts differ significantly from each other on the micro- and macro-levels. The texts reflect different possible hierarchies, following the biblical text, liturgical rites or the divisions of early print editions. Hence, the envisaged data structure will need to go beyond a simple canonical structure to support flexible access to the texts from both library and critical-edition perspectives.
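One way to picture a structure that goes beyond a single canonical hierarchy is to store each text segment once and let several independent hierarchies point at it. The following Python sketch is purely illustrative: the class names, the hierarchy labels (biblical, liturgical, print_edition) and the sample references are assumptions made for the example, not the project's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A stretch of transcribed text located in one manuscript."""
    seg_id: str
    manuscript: str
    text: str


@dataclass
class Corpus:
    """Stores segments once and indexes them under several hierarchies."""
    segments: dict = field(default_factory=dict)
    # hierarchy name -> reference in that hierarchy -> list of segment ids
    index: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(list))
    )

    def add(self, segment, **refs):
        """Register a segment under any number of (hypothetical) hierarchies."""
        self.segments[segment.seg_id] = segment
        for hierarchy, ref in refs.items():
            self.index[hierarchy][ref].append(segment.seg_id)

    def lookup(self, hierarchy, ref):
        """Return all segments filed under `ref` in the given hierarchy."""
        return [self.segments[i] for i in self.index[hierarchy][ref]]


corpus = Corpus()
# All references below are placeholders chosen for illustration only.
corpus.add(Segment("s1", "Geneva Com. Lat. 146", "..."),
           biblical="Deut 1:1", print_edition="section-1")
corpus.add(Segment("s2", "BNF Cod. Héb 150", "..."),
           biblical="Deut 1:1", liturgical="rite-A")

same_verse = corpus.lookup("biblical", "Deut 1:1")
```

Here the same two segments can be reached through the biblical hierarchy together, while only one of them is reachable through the liturgical one, which is the kind of flexibility a single tree cannot express.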

HTR
The current layout analysis is based on the z-profile method for column detection [5] and the heartbeat-seam-carve algorithm for line detection [6], both well adapted to literary manuscripts with regular layout. Textual ground truth was created by manual transcription from scratch or by adapting transcriptions sent by scholars who had worked on parts of the manuscripts. Using the open-source kraken engine [7], we trained models for four large manuscripts (Geneva Com. Lat. 146, BNF Cod. Héb 150, Bibl. Vaticana ebr. 44 vol. 1 and 2, Parma, Palatina, Cod. Parm. 3122) [8]. The character error rates (CERs) range from 2.8% (BNF) and 2.9% (Parma) to 6.9% (Vatican) and 8.9% (Geneva). The likely reasons for the weaker results for Vatican and Geneva are the use of low-resolution images (Vatican) as well as binarization issues, curved lines and superposed words (Geneva). These issues will be addressed in an upcoming, much improved layout analysis based on state-of-the-art CNNs, part of kraken@eScriptorium at PSL [9,10,11].
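For readers unfamiliar with the metric, a character error rate is commonly defined as the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The Python sketch below implements this common definition; it is an illustration only, and the evaluation actually used for the figures above may differ in details such as Unicode normalization or whitespace handling.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a ten-character line gives a CER of 10%.
example = cer("manuscript", "manuscrlpt")
```

Under this definition, a CER of 2.8% means roughly three character edits per hundred ground-truth characters.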

Crowdsourcing
Our crowdsourcing platform opened on Feb. 13th, 2019 [12]. To maximise our reach, we offer dedicated UIs for mobile and desktop. The technology stack was selected to simplify future code contribution and collaboration with other platforms: Vue.js / Google Firebase for the operational site and MySQL for analytical purposes.

Since classical rabbinic literature is still thriving in contemporary ritual and religious study cycles, we based community recruitment and engagement on links to websites for daily study that follow annual and other study cycles. The launch was timed to the beginning of Deuteronomy in 929, a community project for the daily study of one Hebrew Bible chapter, and we assigned fitting pages to each biblical chapter.

While the link with 929 proved very fruitful for a first broad exposure to the public, ongoing traffic came mainly from intensive Facebook activity in relevant groups and pages.

Attracted at first by hilarious mistakes in the automatic transcription, users moved on to long-term engagement. Other crowd-engagement methods involved presentations in schools, synagogues, universities and homes for senior citizens in Israel, France and the US. Two articles in the press, one of which included the URL of our website, greatly increased the number of users in week 6 [13,14].

Results
Until April 26, we received on average 5005 lines per week.

Mega-users (>1000 lines) and super-users (>100 lines), representing 4.6% and 19.3% of all users respectively, provided almost 80% of the total amount of transcriptions.
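The tier statistics above can be derived from a simple count of submitted lines per user. The sketch below assumes the tiers are disjoint (super-users being those with more than 100 and at most 1000 lines), which the text does not state explicitly; the function name and the toy counts are hypothetical and are not the project's data.

```python
def contribution_shares(lines_per_user):
    """Return, per tier, (share of users, share of transcribed lines)."""
    if not lines_per_user:
        raise ValueError("no users given")
    tiers = {"mega": [], "super": [], "regular": []}
    for count in lines_per_user.values():
        if count > 1000:
            tiers["mega"].append(count)
        elif count > 100:  # assumption: super-users exclude mega-users
            tiers["super"].append(count)
        else:
            tiers["regular"].append(count)
    total_users = len(lines_per_user)
    total_lines = sum(lines_per_user.values())
    return {
        name: (len(counts) / total_users, sum(counts) / total_lines)
        for name, counts in tiers.items()
    }


# Toy data, invented for the example.
shares = contribution_shares(
    {"u1": 1500, "u2": 300, "u3": 100, "u4": 60, "u5": 40}
)
```

In the toy data a single mega-user, one fifth of the users, accounts for three quarters of all lines, mirroring the kind of concentration reported above.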

Based on the submitted transcriptions, we produced a word-level alignment with collatex [15,16]. With a majority vote, we established a preliminary complete text for the 590 pages of Geneva 146. According to a manual check, the automatic reconstruction contains about two mistakes per page (ca. 1700 glyphs/page), most of which concern markup for additions or deletions; this corresponds to an accuracy of ca. 99.8%.

A transcription based on collating contributions of 5 users/line.
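The majority vote over aligned contributions can be sketched as follows. The example assumes that word-level alignment has already been done (for instance by a collation tool) and that every contributor's line is padded with a gap marker to the same length; the function name, the gap marker and the sample tokens are assumptions made for illustration, not the project's code.

```python
from collections import Counter

GAP = None  # marks a position where a contributor has no token


def majority_vote(aligned_rows):
    """Pick the most frequent token in each aligned column.

    `aligned_rows` holds one list per contributor; all lists have the same
    length because alignment is assumed to have happened beforehand.
    Ties go to the token seen first in the column.
    """
    consensus = []
    for column in zip(*aligned_rows):
        token, _count = Counter(column).most_common(1)[0]
        if token is not GAP:  # a winning gap means "no word here"
            consensus.append(token)
    return consensus


# Five hypothetical contributions for one line.
rows = [
    ["in", "the", "beginning"],
    ["in", "the", "begining"],
    ["in", "teh", "beginning"],
    ["in", "the", "beginning"],
    ["in", GAP, "beginning"],
]
line = majority_vote(rows)
```

With five contributors per line, an isolated typo or omission by one user is outvoted, which is consistent with the low residual error reported above.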

Conclusions and outlook
We now have the tools and know-how for high-quality automatic transcription of Hebrew manuscripts. Crowdsourcing is a promising mechanism for correcting remaining errors. Future versions of the platform should include embedded community engagement and feedback elements to ensure continuity, and should optimize social media engagement activity.
The project [17] provides a proof of concept (PoC) for future large-scale production of manuscript texts, based on combined deep learning and crowdsourcing mechanisms. This will rest on optimizing the performance of HTR engines and crowd task assignment, using crowd-based transfer learning and weak supervision.

Bibliography
[1] N.N. (1960). Midrash Tanhuma, Jerusalem. [standard ‘print-edition’]
[2] Buber, S. (1885). Midrash Tanhuma, Jerusalem 1964=Vilna 1885. [‘Buber-edition’]
[3] Townsend, J. (1989-2003). Midrash Tanhuma. Translated into English with Introduction, Indices and Brief Notes, Hoboken, NJ, New York.
[4] Bregman, M. (2003). The Tanhuma-Yelammedenu Literature: Studies in the Evolution of the Versions. Piscataway, NJ (Hebrew).
[5] Stökl Ben Ezra, D., Lapin, H. (2019). Z-profile: Holistic Preprocessing Applied to Hebrew Manuscripts for HTR with Ocropy and Kraken. Submitted. Code available at: https://github.com/ephenum/z-profile_column_segmentation
[6] Seuret, M., Stökl Ben Ezra, D., Liwicki, M. (2017). Robust Heartbeat-based Line Segmentation Methods for Regular Texts and Paratextual Elements. HIP@ICDAR: 71-76. Code available at: https://github.com/ephenum/heartbeat_seamcarve_line_segmentation
[7] Kiessling, B. (2019). Kraken - an Universal Text Recognizer for the Humanities, DH2019, in preparation. Code available on the Kraken GitHub repository.
[8] Geneva Com. Lat. 146, BNF Cod. Héb 150, Bibl. Vaticana ebr. 44 vol. 1 and 2, Parma, Palatina, Cod. Parm. 3123
[9] Kiessling, B., Stökl Ben Ezra, D., Miller, M. T. (2019). BADAM: A Public Dataset for Baseline Detection in Arabic-script Manuscripts. Submitted to ICDAR 2019.
[10] Stokes, P., Stökl Ben Ezra, D., Kiessling, B., Tissot, R. (2019). A New Digital Platform for the Study of Historical Texts and Writing. Digital Humanities 2019 Book of Abstracts. Utrecht.
[11] eScriptorium gitlab: https://gitlab.inria.fr/scripta/escriptorium
[12] Wecker, A., Stökl Ben Ezra, D., Raziel-Kretzmer, V., Schor, U., Kuflik, T., Ohali, A., Elovits, D., Lavee, M., Stevenson, P. (2019). Tikkoun Sofrim: A WebApp for Personalization and Adaptation of Crowdsourcing Transcriptions. Poster at UMAP’19 Adjunct, June 9-12, 2019, Larnaca. New York: ACM Press. https://doi.org/10.1145/3314183.3324972. Tikkoun Sofrim website: https://tikkoun-sofrim.firebaseapp.com/. Tikkoun Sofrim github: https://github.com/drore/tikkoun
[13] Dvir, N. (2019). חוכמת ההמונים תעזור לפענח טקסטים עבריים עתיקים [The wisdom of the crowds will help decipher ancient Hebrew texts]. Israel Hayom, 21.3.2019.
[14] Harari, O. (2019). בואו לעזור: שומרים על האוצר היהודי [Come and help: safeguarding the Jewish treasure]. Arutz 7, 21.3.2019.
[15] Dekker, R. H., van Hulle, D., Middell, G., Neyt, V. and van Zundert, J. (2015). Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript project. Literary and Linguistic Computing, 30: 452-470.
[16] Dekker, R. H. and Middell, G. (2011). Computer-Supported Collation with CollateX: Managing Textual Variance in an Environment with Varying Requirements. Supporting Digital Humanities 2011. University of Copenhagen, Denmark. 17-18 November 2011.
[17] PHC Maimonide 41146YC (2019-2020)

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

In review

ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO