Flexible Computing Services for Comparisons and Analyses of Classical Chinese Poetry

paper, specified "short paper"
  1. 1. Chao-Lin Liu

    National Chengchi University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

As for many civilizations, poetry is an essential part of Chinese literature. Poetry has influenced the development of the literature and language of both classical and vernacular Chinese. Certain of the words that we use today can be tracked all the way back to the Shijing

(tt&®/shi1 jingl/ -- We show the pronunciations of Chinese characters in Hanyu Pinyin followed by their tones. Here, /shil/ is for "if" and /jinl/ is for c. 1046BC.

Research on Chinese poetry is thus instrumental for understanding Chinese culture, and a lot of invaluable results have been accumulated over the past thousands of years from the study and analysis of Chinese poetry.

The availability of digital tools and resources enable researchers to compare and analyze the poetry from certain perspectives that were hard to achieve in the past. In many cases, we can verify the claims of previous researches with solid data, and, in others, we may enrich our understanding of the poetry.

The accessibility of increasingly larger datasets strengthens our research potential. In earlier stages of digital humanities, pioneers focused their work on Tang and Song poetry (it is beyond our capacity to list all previous research in this proposal, and we provide just two samples: Hu and Yu, 200l; and Lo et al, l997) . Now, we can access digitized texts of poems that were published in the periods from l046BC to modern days.

Software tools allow us to study the data from a wide variety of perspectives in an efficient way. Search engines and information retrieval techniques (Manning et al, 2008) help us extract relevant texts from a large dataset. Then, researchers can employ domain knowledge for advanced studies with the use of additional tools.

In this paper, we showcase research results that we achieved by handling the available data with existing tools in flexible ways. We collected nine representative corpora of Chinese poetry, one each for a major dynasty in Chinese history between 1046BC and 1644AD. We list the corpora in Table 1, where we assign an acronym to each corpus for ease of reference (See Notes, 1). We also show their Chinese names (Collection) and the periods of publication (Time). A collection for the Qing dynasty is unavailable yet because an editorial committee is still working toward the completion of this very challenging goal (Zhu, 1994). Excluding the punctuation marks that were added by the data providers, we have more than 16.5 million characters (see Notes, 2) in the corpora.
















Before 589AD













Table 1. The corpora of poetry of 1046BC-1644AD used in this study

By flexibly integrating and migrating tools to offer new functions, we can provide researchers with opportunities to investigate Chinese poetry from new perspectives. In the first example, we show a new way to compare the ways that poets use words in their poems. In the second, with our own tools, we can find shared collocations and patterns of poems in different corpora, and this capability allows us to study and compare the styles of individual poets and their dynasties.
















13; 5;18

2; 3; 5


4; 5; 9





20; 7;27

0; 0; 0


0; 0; 0

0; 4; 4


0; 3; 3


8; 6;14


2; 2; 4

2; 1; 3

3; 1; 4

13; 5;18


6; 0; 6


9; l;10

2; 0; 2

2; 6; 8

0; 4; 4

2; 5; 7


2; 2; 4



0; 2; 2

0; 2; 2

3; 3; 6

4; 4; 8


1; 2; 3


6; 2; 8

Table 2. The frequencies of selected words used in poems of LSY, LB, DM, and MF

A Multi-Faceted Comparison
Jiang (2003) compared the usage of "wind" (M /fengl/) and "moon" (^ /yue4/) in the poems of two of the most famous poets, Li Bai (^0 /li3 bai2/) and Du Fu ($±W/du4 fu4/), of the Tang Dynasty (which existed between 6l8 and 907AD) by comparing the contents of selected poems. Liu et al. (20l5) listed the frequencies of frequent words that used "wind" and "moon" in Li's and Du's poems. The numerical comparison shows the differences of the poets in a vivid way.

The software tools can be designed so that we can inspect not just the original poems or the raw statistics about the original poems, but also more complex comparisons.

Table 2 lists the frequencies of frequent bigrams

(again, see Notes, 2) that appeared in the poems of

four poets, i.e., LSY (for M^H/li3 shangl yin3/), LB (for Li Bai), DM (for ^ife/du4 mu4/), and DF (for Du Fu) (Note: MMffi/li3 shangl yin3/ and|±^/du4 mu4/

are two very famous poets of the Tang Dynasty)

These bigrams are special in that they are formed by concatenating either "W”/chun1/ or "W/qiu1/ with another character, and they represent something related to "spring” and "autumn”, respectively (when used individually, "W”/chun1/ or "S”/qiu1/ typically represent "spring” and "autumn”, respectively- see Notes, 1). The numbers "14;2;16” in the row of "WM; ^M” and in the column for "LSY” indicate that we have 14 and 2 of LSY’s poems in which "WM” and "^ M” were used, respectively. "16” is the sum of 14 and 2.

The statistics in Table 2 shed light on the differences in word preferences among the poets. Note that the samples in Table 2 are limited, and that a close reading is necessary to reach any further interpretations. Despite these limitations, we still can explore comparisons from various perspectives. "WM” and " ^M” are the most common choices among all of the rows(They appeared 192 times, i.e., 16+98+29+49). In contrast, "W^ ” and "^W ” were not as popular (They appeared only 42 times, i.e., 40+2), and none of the poets used "W^ ”. In terms of personal preference, "WM ” appeared in LB’s poems three times often than "^M ”. The results of LSY are similar to those fore LB, but DF seems to prefer "^M” instead (The ratio for "WM”: "SM” is 72:26 for Li Bai, 14:2 for Li Shang Yin, and 19:30 for Du Fu).

The entries that have "0”s can be linked to strong personal preferences. For instance, LB did not use "W S” and "WM”, while he did use "^S” and "^M”. DM is special in that he did not use "WM” or "^M”.

We can provide different ways to compare the styles of poets, e.g., converting the frequencies in Table 2 to probabilities of seeing the same word in the poems. By building a vector space representation (Manning et al, 2008) for each poet, we can calculate a score of similarity for style as in many other researches. Networking Names and Words

In addition to comparing the words of the famous poets, we may also attempt to compare the words and themes of the poems that were produced by friends. We can look up whether two poets were friends in professional databases like the China Biographical Database (CBDB) (Fuller, 2015). A database like the CBDB can also provide alternative names of poets so that we may algorithmically find friendships among poets by checking their writings (Liu et al, 2015). After identifying a group of poets who were friends, we can investigate whether the words, styles, and themes of their poems are related. A procedure such as which we used to produce statistics like those in Table 2 may be useful.

A poet may be influenced by another poet even though they were not personally acquainted. It is believed that poets of the same school of poetry (we use "school of poetry” to translate "WM” /shi1 pai4/) share similar styles or words. Hence, information about the membership of a school of poetry provides a starting point for an investigation.

We may also search for poets who shared the same words and collocations in their poems as a clue for an indirect friendship. Given the millions of characters in our corpora, we need to have an efficient mechanism to identify poems that shared collocations and patterns (Liu and Luo, 2016), and using our own tools, we can precisely identify words that were shared by poems of different poets and of different dynasties (see Notes, 2).

The ability to identify the shared words between individual poets also automatically allows us to compare the patterns that are frequently shared between any two corpora. In Figure 1, two words are connected if they frequently co-occurred in poems. Part (a) shows the shared collocations in poems in the YSX and CTP, and (b) is for the shared collocations in poems of the LCSJ and CTP. The differences between (a) and (b) indicate that the highly shared collocations changed from dynasty to dynasty, i.e., from Tang to Yuan and from Tang to Ming. A collocation with a different word may suggest that the word contributes a different sense in the poems, e.g., "WM-ttS” and "WM_MM” in (b), and this can be verified by reading the poems that used these collocations. Sometimes, the links suggest replaceable words, for instance, both "MS” and " MS” can go with "MS” in both (a) and (b). It should be noted that the collocations often carry information about the imagery of the poems.

Concluding Remarks

Figure 1. Frequently shared collocations between poems of two corpora (a) YSX and CTP (b) LCSJ and CtP

We briefly discussed how we studied two new research problems by flexible applications of our tools. The new tools provide new forms of data as in Table 2 and Figure 1 for interesting and useful research. We are working toward an in-depth understanding of Chinese words by studying when, who (Liu and Luo, 2016), and how the words (see Figure 2) and their collocations and patterns were used in Chinese poetry, and our tools will help domain experts study challenging and interesting problems about it (Liu, 2016). We also hope that the information and visualization that we have found and established for words can contribute to an interactive version of the complete Chinese

lexicon (Cheng et al, 2016).

The poet ^^/chen2 fangl/ of the Yuan Dynasty (1271-1368AD) produced the following poem (title


»ft. IS

Efisms? "

The words in red appeared in a poem of Du Mu of the Tang Dynasty (618-907AD), and the words in blue words appeared in a poem of IS» /Iu2 Iun2/ of the Tang Dynasty.

Figure 2

This research was supported in part by the grant

104-2221-E-004-005-MY3 from the Ministry of Science and Technology of Taiwan, the grant USA-HAR-105-V02 from the Top University Strategic Alliance of Taiwan, and the Senior Fulbright Research Grant 2016-2017.

1. /shil jingl/, i-A /chu3 ci2/, ^M(AS

) /han4 fu4 (wen2 xuan3)/,

^W/xianl qin2 han4 wei4 jin4 nan2 bei3 chao2 shil/, ^WW/quan2 tang2 shil/, AA W/quan2 song4 shil/, AAn^/quan2 song4 ci2/, ftW^/yuan2 shil xuan3/,

/lie4 chao2 shil ji2/

2. It is important to briefly mention the differences between Chinese characters and words for readers who are not familiar with the Chinese written language. Characters are the basic units for Chinese words. A Chinese word

can be formed by one or more characters. For

instance, "A”/shui3/ and "^”/guo3/ are two characters. They can be used individually to represent "water” and "results”, respectively. A word consisting of n Chinese characters can be called an n-gram in linguistics, e.g., "A^” is a bigram that represents "fruit”. While the majority of the words in vernacular Chinese are bigrams and trigrams, the proportion of unigrams in classical Chinese is very large.

Cheng, W.-H., Liu, C.-L., Chiu, W.-Y., and Hsu, C.-T. (2016).

Phenomenology of emotion politics of color: Digital humanities research on the lyrical genealogy of ‘White' in

the poetry of the middle Tang dynasty

), Digital Humanities: Between Past, Present, and Future,

J. Hsiang (Ed.), 57-101, National Taiwan University Press.

Fuller, M. (2015). The China Biographical Database User's

Guide, Harvard University. <http://projects.iq.har-vard.edu/cbdb/home>

Hu, J. (^É^) and Yu, S. (WAS). (2001). The computer aided research work of Chinese ancient poems, ACTA

Scientiarum Naturalium Universitatis Pekinensis,


Jiang, S.-Y., (W^S). 2003. "Moon” and "Wind” in Li Bai’s and Du Fu’s poems - Using computers for studying classical poems, Proc. of the 1st Int'l Conf. on Literature and Information Technologies. (in Chinese)

Liu, C.-L. (2016). Quantitative analyses of Chinese poetry of Tang and Song dynasties: Using changing colors and innovative terms as examples, Proc. of the 2016 International Conference on Digital Humanities, 260-262.

Liu, C.-L., and Luo, K.-F. (2016). Tracking words in Chinese poetry of Tang and Song dynasties with the China Biographical Database, Proc. of the Workshop for Language

Technology Resources and Tools for Digital Humanities,

The 26th International Conference on Computational Linguistics, 172-180.

Liu, C.-L., Wang, H., Hsu, C.-T., Cheng, W.-H., and Chiu, W.-Y. (2015). Color aesthetics and social networks in complete Tang poems: Explorations and discoveries, Proc. of the 29th Pacific Asia Conference on Language, Information and Computation, 132-141.

Lo, F., (SSÄ), Li, Y., and Cao, W. (1997). A realization of computer aided support environment for studying classical Chinese poetry, J. of Chinese Information Processing, 1:27-36. (in Chinese).

Luo, Z. (WH), ed., (1986). Comprehensive Chinese Word Dictionary Shanghai Cishu Publisher (i

(in Chinese)


Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Owen, S. (2015) The Poetry of Du Fu. De Gruyter. <https://www.degruyter.com/view/product/246946 >).

Zhu, Z.-J (OÜ*). (1994). Establishing the editorial board for the Complete Qing Poems

Studies in Qing History 0(3):96.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2017

Hosted at McGill University, Université de Montréal

Montréal, Canada

Aug. 8, 2017 - Aug. 11, 2017

438 works by 962 authors indexed

Series: ADHO (12)

Organizers: ADHO