University of Tokyo
University of Tokyo
University of Tokyo
Toppan Inc.
Toppan Inc.
Toppan Inc.
Toppan Inc.
This poster outlines the objectives of the research project titled “Development of a
Devanāgarī Optical Character Recognition (OCR) System.”
Devanāgarī is an abugida script, which has been adopted as the writing system of several languages such as Hindī, Marāṭhī, Nepālī, and Sanskrit. Recently, digitizing Sanskrit texts written in
Devanāgarī has been one of the most pressing and important tasks in the field of Sanskrit philology. Prosopographical Database for Indic Texts (PANDiT), Sanskrit Knowledge-System Project, and Göttingen Register of Electronic Text in Indian Languages (GRETIL) are some of the leading research projects.
However, owing to the costs associated with time and human labor, building a database based on manually input text data is challenging. We have seen this in some preceding projects led by scholars in Germany, India, and Japan. We expect that the existing
Devanāgarī OCR systems, developed often based on the contemporary Indic languages such as Hindī ([1][2]), may not accurately recognize the more complicated Sanskrit consonant clusters.
In light of this situation, a team comprising Sanskrit language experts from the University of Tokyo and AI-OCR developers from Toppan Inc. have undertaken a cooperative research project. This project aims to develop a
Devanāgarī OCR system and establish a Sanskrit text database automatically digitized by the OCR.
In this poster presentation, we will
Review the writing system of
Devanāgārī and describe how we correlate each combining letter with the Unicode encoding scheme. We took each letter as a composite of several elements. In this case, our experience of the Chinese character—a character consisting of multiple and irregularly ordered elements—served effectively. In this regard, we set a unit of letter called the “character shape.”
Introduce and evaluate preceding and on-going OCR software such as the “Sanskrit OCR” run by ind.senz and “Google Document OCR.” According to our detailed analysis of these software programs, there are some specific cases where these OCRs frequently fail to recognize the letters. For example: some combined letters with the vowel sign
“i (ि),” where the sequence of letter elements (right to left, e.g.
k+i) goes against the stroke order (left to right, e. g.
ि+क,
i+k); some irregularly typeset dots, which indicate the nasal sound
(anusvāra), and some lengthy consonant clusters such as
(र्त्स्न्य,
rtsny-a). Focusing on these inadequacies that were insufficiently handled by the preceding studies, we show our design of an AI-OCR model, highlighting the uniqueness of this project.
Expound the process of designing the “training data” through which an AI-OCR is generated. We obtained the training data from books included in Ānanda Āśrama Sanskrit Series, most of which were printed in metal typesetting. The strategy on how we define the “character shape” in the typesetting shall be explained in detail. An AI-OCR was generated through machine learning using the datasets prepared through the above process. Following is a brief overview of the outcomes obtained from the generated AI-OCR model.
Outcomes of Single Character Recognition (as of February 15, 2022)
Out of the 2,434 sample letters:
a. 2,340 letters exactly recognized* (Accuracy rate 96.14 %)
b. 2,397 letters correctly listed** (Accuracy rate 98.48 %)
* Only when each letter is listed as the first choice.
** Including cases where the correct letter is listed as a candidate.
Based on the comparison, "Google Document OCR" and "Sanskrit OCR" showed an accuracy rate of 95.00 % and 89.92 %, respectively, on average for the common samples in this presentation. Once the factors affecting the accuracy are understood, we will outline the adjustments needed to improve the accuracy rate of the AI-OCR.
Bibliography
Bansal, V and Sinha, M. (2001). A Complete OCR for Printed Hindi Text in Devanagari Script.
Proceedings of Sixth International Conference on Document Analysis and Recognition, pp. 800-804.
Suryaprakash Kompalli, Sankalp Nayak, Srirangaraj Setlur and Venu Govindaraju (2005). Challenges in OCR of Devanagari documents.
Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 327-331.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
In review
Tokyo, Japan
July 25, 2022 - July 29, 2022
361 works by 945 authors indexed
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Contributors: Scott B. Weingart, James Cummings
Series: ADHO (16)
Organizers: ADHO