Chinese Dunhuang Mural Vocabulary Construction Based on Human-Machine Cooperation

Xiaoguang Wang; Hanghang Cheng; Huinan Li; Xu Tan; Qingyu Duan

Authorship

1. Xiaoguang Wang

Wuhan University
2. Hanghang Cheng

Wuhan University
3. Huinan Li

Wuhan University
4. Xu Tan

Wuhan University
5. Qingyu Duan

Wuhan University

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

1.Introduction

Dunhuang grottoes is an internationally recognized cultural heritage which locates on the ancient Silk Road, in Gansu province, China. The well-known Mogao caves there, which meets all 6 evaluation criteria for the World Cultural Heritage, has 492 caves and the total size of the murals reaches more than 45,000 square meters(Whitfield and Ōtsuka, 1995). It is a gallery with 1,600 years of art history and is regarded as a grand “Library” on the wall. The numerous huge number of Dunhuang murals contain exceedingly rich content, making this site a significant academic treasure with an abundance of vivid materials depicting various aspects of medieval politics, economics, cultures, arts, religions, ethnic relations, and daily dress in Western China.

However, the absence of Dunhuang mural vocabulary not only hinders the reasonable and efficient organization of relevant Dunhuang mural resources, but also limits the Dunhuang mural studies and its value exploration. We applied machine learning to the quick construction of Chinese vocabulary of Dunhuang murals with the hope to advance the retrieval efficiency and accuracy of Dunhuang mural information(Zeng, 2008), and facilitate the process of mural annotation standardization. Besides, Dunhuang mural vocabulary could also further the development of digital humanities applications such as deep semantic annotation(Wang et al., 2018), knowledge extraction, and multi-modal data modeling.

Figure 1. Human-machine Cooperation Mechanism of Vocabulary Construction

2.Top-down Process: Structure Design

(1) Primary word bank construction. The construction of vocabulary began with the entry collection from Dunhuang-mural-related Chinese dictionaries, from which 7,121 entries were obtained for the primary word bank.

(2) Structure development. The resulting entries were categorized referring to existing vocabularies in humanities, among which AAT(Baca and Gill, 2015) presented a rather satisfying results in regards to its word categorizing accuracy. Therefore, the primary vocabulary structure was developed drawing on the category structure of AAT. Dunhuang mural tends to be very inclusive as the outgrowth of Eastern-Western cultural and art communications. The way we referred to AAT’s idea of facet plays a positive role in the realization of applying our vocabulary to tangible and intangible cultures that vary in time and location, which in turn would benefit the versatility and inclusiveness of the vocabulary.

(3) Structure adjustment and verification. Modifications had been carried out after the discussion with multiple experts from Dunhuang Academy China who study on Dunhuang murals, such as historians, artists, religious researchers, engineers for mural preservation etc., and then the primary vocabulary structure was verified. Current vocabulary consists of 5 first categories, 25 second categories and 103 facets, and presents a 8-level hierarchical structure.

Figure 2. Dunhuang Mural Vocabulary Top-level Structure

3.Bottom-up Process: Vocabulary Optimization

(1) Literature collection. Through web crawling, we collected the fundamental dictionary Dunhuang Studies Dictionary and over 700 papers from two Chinese top-level Dunhuang journals.

(2) The corpus building. This step was carried out through the combination of OCR and manual checking to build the corpus for machine learning.

(3) Candidate words finding. Viterbi algorithm, a HMM-model-based algorithm commonly used in Chinese segmentation, was adopted with new word detection mechanism to perform words segmentation. The segmentation of Dunhuang Studies Dictionary resulted in 30,854 new words, which were classified into 3 categories, the candidate words, stop words and error words. According to our classification, the segmentation of Dunhuang Studies Dictionary proved an accuracy of 72.13% (candidate words 58.50% and stop words 13.63%). The error rate was a bit high in that we classified numerals as error words, and also the literature contained ancient Chinese that was significantly different in grammar, semantics and pragmatics from modern Chinese. To improve this situation, we would continue the optimization of algorithm for better accuracy. The resulting candidate words and stop words had all been stored into the word bank to assist the new word detection and accuracy improvement. The segmentation of papers resulted in 122,991 new words which are still in the process of classification.

(4) Structure adjustment and content expansion. New candidate words are currently being added into the vocabulary by indexers, and the structure would be redeveloped if it isn’t reasonable enough to fit new words. New words and structures would be audited by experts on a regular basis.

The optimization could be iterative, as more resources would be processed to enrich the machine learning corpus, to further detect Dunhuang mural related words, and finally to enhance our vocabulary. The vocabulary that has been examined and confirmed by experts is being transferred into vocabulary management system which supports image format annotation, API call and vocabulary publishing in multiple formats such as SKOS, XML and JSON.

Table 1. Statistical Results

4.About Future

As for future work, it would be interesting to: (1) process more literature to expand the vocabulary, and continue the improvement of word detection algorithm to increase the accuracy of segmentation, so that it could be better applied to the ancient Chinese and art expressions; (2) explore the method that could realize the automatic creation of new entries in the vocabulary; and (3) work on the vocabulary translation through both machine translation and manual checking, and publish multi-language versions on LOD after the semantic processing.

Bibliography

Baca, M. and Gill, M. (2015). Encoding Multilingual Knowledge Systems in the Digital Age: the Getty Vocabularies.
KNOWLEDGE ORGANIZATION,
42(4): 232–43 doi:
10.5771/0943-7444-2015-4-232.

Wang, X., Song, N., Zhang, L. and Jiang, Y. (2018). Understanding subjects contained in Dunhuang mural images for deep semantic annotation.
Journal of Documentation,
74(2): 333–53 doi:
10.1108/JD-03-2017-0033.

Whitfield, R. and Ōtsuka, S. (1995).
Dunhuang: Caves of the Singing Sands: Buddhist Art from the Silk Road. London: Textile & Art Publications.

Zeng, M. L. (2008). Knowledge Organization Systems (KOS).
KNOWLEDGE ORGANIZATION,
35(2–3): 160–82 doi:
10.5771/0943-7444-2008-2-3-160.

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review