National Institute for Japanese Language and Linguistics (NINJAL)
National Institute for Japanese Language and Linguistics (NINJAL)
Nagoya Women's University
1. Introduction: “Christian Materials”
We constructed a corpus of “Christian Materials,” which are invaluable for the study of the Japanese language of the sixteenth and seventeenth centuries CE (Muromachi period). These materials comprise documents written by Catholic missionaries who came to Japan to proselytize.
We chose “Feiqe no monogatari” and “Esopo no Fabulas” (Figure 1, hereinafter called Feiqe and Esopo) from these materials and constructed a corpus. Feiqe is a digest text of the Japanese epic, the Tale of the Heike. Esopo is a Japanese translation of Aesop’s Fables. These have special characteristics among the Christian materials: they were written in the Japanese colloquial language of the time using the Roman alphabet with Portuguese spellings. They were written as readers for missionaries learning Japanese. To propagate Christianity, the missionaries needed to converse naturally with the Japanese people, so they had to study the colloquial language then in use. Writing the materials in the Roman alphabet made it possible for missionaries to read these books even if they had not learned Japanese characters.
Many of the existing Muromachi-period Japanese documents are written in the literary language, so the rare materials written in the colloquial language are very valuable. The Roman alphabet spelling also reveals information about the phonology of the Japanese language that is not apparent from Japanese characters alone. For example, the distinctions between voiceless and voiced consonants and between open and closed o-vowels (ŏ and ô) are not clear from the Japanese characters.
No other materials combine these two characteristics. It should be noted that only a single copy of Feiqe and Esopo still exists, housed in the British Library in London. As there is no doubt that these materials are extremely valuable for the study of colloquial Japanese in the Muromachi period, they need to be more widely and conveniently available.
Figure 1. Images of “Feiqe” and “Esopo”
2. Construction of the corpus
One of the features of our corpus is that it has two texts, the Roman alphabet text and the Japanese character text.
We referred to the original prints and faithfully converted the original Roman alphabet text to electronic text. Letters such as “à,” “ã,” and “ſ” are reproduced using Unicode.
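As a small illustration of how such letters can be handled, the following Python sketch prints the Unicode code points of the three letters mentioned above and shows one normalization choice. Using NFC is an assumption made here for illustration; the text does not state which normalization form, if any, was applied to the transcription.

```python
import unicodedata

# Letters mentioned in the text, written as escapes so the code points are unambiguous.
SPECIAL_LETTERS = ["\u00e0", "\u00e3", "\u017f"]  # a with grave, a with tilde, long s


def describe(ch):
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def normalize_transcription(text):
    """Apply NFC (an assumption): composes letter + combining mark, keeps the long s."""
    return unicodedata.normalize("NFC", text)


for letter in SPECIAL_LETTERS:
    print(letter, describe(letter))
```

Note that compatibility normalization (NFKC) would fold the long s “ſ” into a plain “s,” so a transcription that aims to stay faithful to the original print would avoid it.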
We also prepared the Japanese character text and encoded it in XML files. We studied tags and document type definitions with reference to TEI P5. Based on that, we selected and added required tags for the structure of Feiqe and Esopo. We also referred to Kawase et al. (2014) to design the tag set. We carried out morphological analysis on these XML files. We used the UniDic and MeCab morphological analysis tools to divide the entire text into linguistic units and add morphological information such as lemma, readings, and parts of speech. MeCab is a morphological analyzer based on the conditional random field (CRF) analytical method that achieves state-of-the-art performance in contemporary Japanese morphological analysis. UniDic is a dictionary for the morphological analysis of Japanese that can lemmatize variations of orthography and word forms.
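To make the idea of morphologically annotated XML concrete, here is a minimal Python sketch that reads word-level annotations from a tiny XML fragment. The element name `w` and the attributes `lemma`, `pos`, and `pron` are hypothetical choices for this example and are not the corpus's actual tag set; the sample sentence fragment is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one <w> element per linguistic unit, with
# illustrative attribute names (not the tag set used in the corpus).
SAMPLE = (
    '<s>'
    '<w lemma="涙" pos="名詞" pron="ナミダ">涙</w>'
    '<w lemma="を" pos="助詞" pron="ヲ">を</w>'
    '</s>'
)


def read_tokens(xml_text):
    """Parse the fragment and return one dict of annotations per <w> element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "surface": w.text,
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
            "pron": w.get("pron"),
        }
        for w in root.iter("w")
    ]


for token in read_tokens(SAMPLE):
    print(token)
```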
The grammar and vocabulary of Feiqe and Esopo reflect the transition from Early Middle Japanese to Modern Japanese. We therefore used UniDic for Late Middle Japanese (Ogiso et al., 2015) for accurate morphological analysis. This dictionary was developed using the same method as UniDic for Early Middle Japanese (Ogiso et al., 2012). We added vocabulary specific to Feiqe and Esopo to the dictionary and performed machine learning using corpora of the relevant era as training data. With this dictionary, the accuracy of morphological analysis at the level of allomorph distinction (which requires that word segmentation, part-of-speech tagging, and lemmatization all be correct) is 0.932 (F measure). The corpus size is approximately 140,000 words. By using an appropriate dictionary, we were able to keep the manual effort required for correcting errors to a minimum.
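For readers unfamiliar with the F measure, the sketch below computes it from token counts. It assumes the balanced F1 (the harmonic mean of precision and recall), which the text does not specify, and the counts in the usage line are invented purely for illustration; they are not figures from this study.

```python
def f_measure(n_correct, n_predicted, n_gold):
    """Balanced F1 from counts of units.

    n_correct:   units output by the analyzer that fully match the gold standard
    n_predicted: all units output by the analyzer
    n_gold:      all units in the manually corrected gold standard
    """
    if n_correct == 0 or n_predicted == 0 or n_gold == 0:
        return 0.0
    precision = n_correct / n_predicted
    recall = n_correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only.
print(round(f_measure(80, 100, 90), 3))
```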
We then aligned the Roman alphabet text with its Japanese character counterpart to form parallel texts. The morphological analysis using UniDic adds the pronunciation of each word in Katakana. For example, the pronunciation of “涙” (tears) is “ナミダ.” We romanized each Katakana character using the modern Hepburn system: “ナ” becomes “na,” “ミ” becomes “mi,” and “ダ” becomes “da.” In this way, we generated a modern Roman alphabet text and compared it to the original Roman alphabet text (Figure 2). Because the two Roman alphabet texts are quite similar, we were able to align them efficiently. The accuracy of automatic alignment is 0.982 (F measure). By carrying out morphological analysis on the Japanese character text and aligning the two texts, we succeeded in creating a corpus in which morphological information can be accessed from both the Japanese character text and the original Roman alphabet text.
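The comparison step can be sketched in Python as follows. This is a simplified illustration, not the alignment procedure actually used: the Katakana table covers only the characters needed for the examples, string similarity is measured with the standard library's `difflib.SequenceMatcher` as a stand-in, the word lists are assumed to be already in the same order, and the threshold of 0.5 is an arbitrary assumption.

```python
from difflib import SequenceMatcher

# Partial Katakana-to-Hepburn table, covering only the examples below.
HEPBURN = {
    "ナ": "na", "ミ": "mi", "ダ": "da",
    "ヘ": "he", "イ": "i", "ケ": "ke",
    "モ": "mo", "ノ": "no", "ガ": "ga", "タ": "ta", "リ": "ri",
}


def romanize(katakana):
    """Convert a Katakana pronunciation into a modern Hepburn-style spelling."""
    return "".join(HEPBURN[ch] for ch in katakana)


def similarity(modern, original):
    """Similarity ratio in [0, 1] between two romanized spellings."""
    return SequenceMatcher(None, modern, original).ratio()


def pair_in_order(modern_words, original_words, threshold=0.5):
    """Pair words position by position and flag pairs that look alike."""
    return [
        (m, o, similarity(m, o) >= threshold)
        for m, o in zip(modern_words, original_words)
    ]


modern = [romanize("ヘイケ"), romanize("ノ"), romanize("モノガタリ")]
original = ["feiqe", "no", "monogatari"]
for pair in pair_in_order(modern, original):
    print(pair)
```

Here the modern spelling “heike” and the original spelling “feiqe” differ in two letters yet remain similar enough to be paired, which is the property that makes this kind of alignment practical.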
Figure 2. Comparing the modern Roman alphabet and the original Roman alphabet text
3. Publication of the corpus and link to the image of the original print
The corpus has been made publicly accessible through an online search application called “Chūnagon.” With “Chūnagon,” it is possible to perform searches that specify complex combinations of different morphological information. The Roman alphabet text is displayed in parallel with the Japanese character text (Figure 3).
Another important feature of our corpus is that it includes a direct link to a clear photographic image of the original print in the British Library. Based on a memorandum with the British Library, we received permission to make the photographic image of the original print accessible in the public domain. There is one image file per page of the original, and each file is one megabyte or less. The photographic images are now available on the NINJAL website, which is open access. “Chūnagon” provides a link to the page image that contains the corresponding word.
Figure 3. Search results for “Chūnagon”
Acknowledgement: The work reported in this presentation was supported by the NINJAL collaborative research project “Construction of Diachronic Corpora and New Developments in Research on the History of Japanese.”
Bibliography
Kawase, A., Ichimura, T., and Ogiso, T. (2014). Problems in Encoding Documents of Early Modern Japanese. Proceedings of the conference on Digital Humanities 2014. http://dharchive.org/paper/DH2014/Paper-934.xml (accessed 17 April 2019).
Kudo, T. (2006). MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://taku910.github.io/mecab/ (accessed 17 April 2019).
National Institute for Japanese Language and Linguistics (2016). UniDic for Late Middle Japanese. https://unidic.ninjal.ac.jp/download_all#unidic_kyogen (accessed 17 April 2019).
National Institute for Japanese Language and Linguistics (2018). Corpus of Historical Japanese, Muromachi Period Series, Volume II: Christian Materials. https://pj.ninjal.ac.jp/corpus_center/chj/muromachi-en.html (accessed 17 April 2019).
National Institute for Japanese Language and Linguistics (2019). Images of the Amakusa edition of Heike monogatari, Isoho monogatari and Kinkushū in the British Library. https://dglb01.ninjal.ac.jp/BL_amakusa/en.php (accessed 17 April 2019).
Ogiso, T., Komachi, M., Den, Y., and Matsumoto, Y. (2012). UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC2012), pp. 911-915. http://www.lrec-conf.org/proceedings/lrec2012/pdf/906_Paper.pdf (accessed 17 April 2019).
Ogiso, T., Kono, T., and Ichimura, T. (2015). Morphological Analysis of Japanese Kyōgen Text. Proceedings of the conference on Digital Humanities 2015. http://dh2015.org/abstracts/xml/OGISO_Toshinobu_Morphological_Analysis_of_Japanes/OGISO_Toshinobu_Morphological_Analysis_of_Japanese_Ky_g.html (accessed 17 April 2019).
Hosted at Utrecht University
Utrecht, Netherlands
July 9, 2019 - July 12, 2019
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html
Series: ADHO (14)
Organizers: ADHO