National Institute for Japanese Language and Linguistics (NINJAL)
National Institute for Japanese Language and Linguistics (NINJAL)
Meiji University
National Institute for Japanese Language and Linguistics (NINJAL)
In this talk, we wish to discuss the construction
of the corpus “Meiji-Taisho Series I - Magazines,” (hereinafter called CHJ-Magazines) which was released as a part of the Corpus of Historical Japanese (CHJ). This corpus contains magazines published in Japan in the Meiji and Taisho periods (1868-1911) representative of Modern Japanese language, with the total number of words used in the text reaching a size of some 14,000,000 items. This corpus is the first large-scale corpus with morphological information of Modern Japanese and will contribute to research into that period of the Japanese language.
At the National Institute for Japanese Language and Linguistics, a joint research project entitled “Construction of a Diachronic Corpus and New Developments in Research of the Japanese Language” was begun in 2016. CHJ Magazines is one part of this project, using a number of representative magazines for each of several periods, with the selected published material spread out over set time intervals. The result is a large-scale corpus that makes it possible to examine the state of the written Japanese language in the Meiji and Taisho periods, as well as to examine how the language underwent changes in those periods
1870's
1880's
1890's
1900's
1910’s
1920’s
Total
"MstoteiZassbi"
180
180
"Kflfeiraja no Jarofi"
1,000
1,000
"Taiyo"
2,280
4,300
2,050
2,300
10,930
"fcgstaZaW’
680
1,890
"teæü&kSsW
590
"BùtaSiratoi"
620
Total
180
1,000
2,960
4,890
2,050
2,920
14,000
Table 1. Corpus Size for Each Year of Publication (Units:
Thousand Words)
One characteristic of this corpus is that morphological (word) information is annotated to each text. In order to annotate highly accurate morphological information to the corpus, it was necessary to develop and utilize a dictionary that could enable automatic morphological analysis for the colloquial and literary styles that was mixed in the
written Japanese of the aforementioned periods. We
customized UniDic, a dictionary for morphological analysis of Japanese, which can lemmatize variations of orthography and word forms. We also used MeCab, a state-of-the-art Japanese morphological analyzer with this customized UniDic.
While the accuracy of automatic analysis is high at approximately 93%, it is extremely difficult to append morphological information uniformly without any mistakes in such a large amount of text. To address this difficulty, it was necessary to establish a “core” data set, for which we could guarantee the attachment of highly accurate morphological information (with accuracy higher than 99%) through the utilization of automatic computer analysis together with manual (human) correction. For core data, 500,000 words were selected, taking into consideration the balance of publication year, style and genre of each of the articles sampled. Core data is suitable for Japanese research which requires highly accurate morphological information. This data was also used as training data for the dictionary for morphological analysis described above. The automatic analysis results of the “non-core” data set, meanwhile, received a certain degree of manual correction which, while not being exhaustive, strikes a balance between quality and volume.
This corpus is made publicly accessible by way of an online search application called "Chünagon.” With "Chünagon,” it is possible to carry out searches that specify complex combinations of different morphological information (lemma, part-of-speech,
conjugation type, lexical strata, etc.). The search results also display information on the source (magazine title, article title, year of publishing), author (name, sex, year of birth), data type (core, non-core), text type (conversation, quotation, other), text style (literary, colloquial, mixture, verse), etc. The author information includes a link to catalogues in the National Diet Library. By making use of this convenient service, it is possible to check how to read the name of the author of the article in question, to check when he or she was born and died, and, at the
same time, to see if he or she made use of multiple
other names. The “Chunagon” search results also provide links that allow the viewing of images of corresponding pages in the original magazines.
Bibliography
Center, National Institute for Japanese Language and
Linguistics (2014 - 2016) Corpus of Historical Japanese. http://pj.ninjal.ac.jp/corpus center/chj/overview-
en.html
Ogiso, T. (2014), “Dictionary and Morphological Analysis in
the Historical Corpus.” Nihongogaku, 432, pp. 83-95. Meiji Shoin. (In Japanese)
Mabuchi, Y., and Asuko, K. (2016), “Explanatory notes of symbols in the text, Display categories for Chunagon: Corpus of Historical Japanese, Meiji - Taisho Series I -Magazines (Short Unit Ver. 0.9)”.
http://pj.ninjal.ac.jp/corpus center/chj/doc/abstract-
meiji-taisho-2016.pdf (In Japanese)
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Complete
Hosted at McGill University, Université de Montréal
Montréal, Canada
Aug. 8, 2017 - Aug. 11, 2017
438 works by 962 authors indexed
Conference website: https://dh2017.adho.org/
References: http://web.archive.org/web/20170802132745/https://www.conftool.pro/dh2017/sessions.php
Series: ADHO (12)
Organizers: ADHO