Stylometric Analysis of Chinese Buddhist texts: Do different Chinese translations of the ‘Gandhavyūha’ reflect stylistic features that are typical for their age?

  1. Marcus Bingenheimer

    Temple University

  2. Jen-Jou Hung

    Dharma Drum Buddhist College

  3. Cheng-en Hsieh

    Dharma Drum Buddhist College


Buddhist Hybrid Chinese is a form of Classical Chinese that was used to translate Buddhist scriptures from Indian languages into Chinese between the 2nd and the 11th century CE. It differs from the standard Classical Chinese of the period in vocabulary (esp. the use of compounds and transcriptions of Indian terms), register (esp. the inclusion of vernacular elements), genre (esp. the use of prosimetry), and occasionally even syntax (at times imitating the syntax of the Indian original). Texts in Buddhist Hybrid Chinese are central to all traditions of East Asian Buddhism, as practiced in China, Korea, Japan and Vietnam.
No comprehensive linguistic description of Buddhist Hybrid Chinese has been attempted so far, and perhaps none ever will be, because the translation idioms are so diverse: at times they use different Chinese terms for a single Indian term, and in other cases a single Chinese term for different Indian terms. Insofar as Buddhist Hybrid Chinese has been described, research has generally concentrated on grammatical particles (e.g. Yu 1993), single texts (e.g. Karashima 1994), single terms (e.g. Pelliot 1933) or even single characters (e.g. Pulleyblank 1965). The stylometric study of Buddhist Hybrid Chinese, like that of Classical Chinese in general, has only just begun. Only since 2002, when the Chinese Buddhist Electronic Text Association (CBETA) distributed the texts in XML, have the canonical texts been available in a reliable digital edition (see note 1).
The Chinese Buddhist canon was first printed in the 10th century, and for texts that predate that printing its contents have remained relatively stable ever since. The currently most widely referenced edition (the Taishō edition, published 1924-34) is based on a Korean edition from the 13th century. It contains ca. 2200 texts from India and China. Because the bibliographic information for texts translated before the 7th century is insufficient and unreliable, the attributions to individual translators, where they exist at all, are often questionable. This in turn affects the dating of the early texts, which are usually dated via their translator(s). Most stylometric methods, including those for authorship attribution, were developed for European languages and often rely on easily parsable word boundaries, which do not exist in Buddhist Hybrid Chinese. Our wider aim is therefore to develop methods that identify stylistic clues for particular eras in Chinese translations of Indian texts. Can we, based on stylometric features, find a way to date Chinese Buddhist texts, or at least to meaningfully corroborate or contradict traditional attributions?
In this study we compare three translations of the same text, namely the Gandhavyūha section (Ch. Ru fajie pin 入法界品) of the Avataṃsakasūtra (Ch. Huayan jing 華嚴經). The Gandhavyūha, a long narrative of the young man Sudhana’s quest to visit spiritual teachers, was translated into Chinese three times:
T. 278 by Buddhabhadra 佛陀拔陀羅 et al. (Chang’an 418-420 CE)
T. 279 by Śikṣānanda 實叉難陀 et al. (Chang’an 695-699 CE)
T. 293 by Prajña 般若 et al. (Chang’an 796-798 CE)
Our task in this particular case was to develop an algorithm that can demonstrate that T.278 was translated some three to four hundred years earlier than T.279 and T.293, and to show which of its features identify a translation idiom that is earlier than, or at least different from, that of T.279 and T.293. Can it be shown that the two Tang dynasty translations (T.279 and T.293) are truly more closely related to each other than to the translation from the Eastern Jin (T.278)?
Our approach combines a general statistical weighting of n-grams with a focus on grammatical particles (xuci 虛詞). A ranking of their importance for our corpus must factor in frequency of occurrence as well as variance. The algorithm must also allow for the fact that characters that function as particles can also occur in nominal or verbal compounds. These instances must be filtered out by applying a list of compounds taken from a large dictionary of Buddhist terms (Soothill & Hodous 1937). The algorithm for this is developed in the first section.
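The compound filter described above can be sketched as follows. This is only a minimal illustration, not the algorithm developed in the paper: the function names `mask_compounds` and `count_particles`, the placeholder character, and the toy particle and compound lists are all assumptions made for the example, and the one-entry compound list merely stands in for the dictionary of Buddhist terms.

```python
def mask_compounds(text, compounds, placeholder="\u25a1"):
    """Replace every listed compound in `text` with placeholder characters.

    Longer compounds are masked first so that a short entry cannot split a
    longer one. The length of the text is preserved.
    """
    # Sort by descending length, then alphabetically, for a stable order.
    for term in sorted(set(compounds), key=lambda t: (-len(t), t)):
        if len(term) > 1:  # single characters are not compounds
            text = text.replace(term, placeholder * len(term))
    return text


def count_particles(text, particles, compounds):
    """Count each particle character, ignoring occurrences inside compounds."""
    masked = mask_compounds(text, compounds)
    return {p: masked.count(p) for p in particles}


# Toy data (hypothetical): 如 is a particle, but inside 如來 it is part of a noun.
particles = ["如", "所", "之"]
compounds = ["如來"]
counts = count_particles("如來所說之法如虛空", particles, compounds)
```

In this toy sentence 如 occurs twice, but only the second occurrence is counted, because the first belongs to the compound 如來. Masking rather than deleting the compound keeps the text length unchanged, so relative frequencies per sample remain comparable.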
The following sections describe the sampling procedure and the preparation of the corpus. Although all three are ostensibly versions of the same Indian text, the translations differ greatly in length, mainly because the Indian Gandhavyūha grew in volume between the 5th and the 8th centuries. To counter this problem and to produce enough samples for our analysis, each translation is divided into sub-divisions of equal length. The frequencies of grammatical particles in these divisions are then calculated and used to define the stylometric profile of each of the three translations. We thus deal with clusters of text samples to which we can apply Principal Component Analysis (PCA), which we have used in a previous study (Hung, Bingenheimer, Wiles 2010). Applying PCA to the extracted profiles and plotting the values of the first and second components in 2-D charts, we are able to discern clearly that T.279 and T.293 are closer to each other and more distant from T.278. The two Tang dynasty translations do indeed seem to differ from the Jin dynasty translation in their use of particles, and the first and second components of the PCA show which particles create the distinction.
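The sampling and projection steps can be sketched roughly as below. This is an illustration under stated assumptions rather than the implementation used in the study: the names `split_equal`, `profile` and `pca_2d` are hypothetical, the sample size is left to the caller, compound masking is omitted for brevity, and a plain power iteration with deflation stands in for whatever PCA routine was actually used.

```python
import math
import random


def split_equal(text, size):
    """Cut `text` into consecutive samples of exactly `size` characters.

    A trailing remainder shorter than `size` is dropped.
    """
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]


def profile(sample, particles):
    """Relative frequency of each particle in one sample."""
    return [sample.count(p) / len(sample) for p in particles]


def _top_component(cov, iters=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    n = len(cov)
    rng = random.Random(0)  # fixed seed keeps the result reproducible
    v = [rng.random() + 0.1 for _ in range(n)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:  # no variance left in this matrix
            break
        v = [x / norm for x in w]
    eigval = sum(v[i] * cov[i][j] * v[j] for i in range(n) for j in range(n))
    return v, eigval


def pca_2d(rows):
    """Return (scores, components) for the first two principal components.

    `rows` holds one particle profile per sample; `components` holds the
    loadings, i.e. how strongly each particle contributes to each axis.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        v, lam = _top_component(cov)
        components.append(v)
        # Deflate: remove the variance explained by v before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(c[j] * v[j] for j in range(d)) for v in components]
              for c in centered]
    return scores, components
```

With this sketch, one would build `rows` as `[profile(s, particles) for s in split_equal(text, size)]` for each translation, pass all rows to `pca_2d` together, and plot the two score columns. The sign of each component is arbitrary, so only the relative position of the samples is meaningful.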
Stylometric analysis can thus give us a better understanding of the translation styles of Buddhabhadra, Śikṣānanda and Prajña. Each translator has several other translations attributed to him. Comparing each Gandhavyūha translation with the rest of its translator’s corpus, and then the corpora with one another, could in future help us improve our algorithms, which ideally would be able to describe and demarcate the work of different translators. The general aim is to get a first handle on the quantitative analysis of the corpus written in Buddhist Hybrid Chinese and to extract significant features that can then be used for a more accurate linguistic description of the idiom.
What the analysis does not account for is changes in the Indian text. The Eastern Jin translation was made from a somewhat different version of the Indian text than the two Tang translations 300-400 years later. This does not, however, undermine our analysis. It remains possible to distinguish how grammatical particles were used by different translators, because these reflect different styles of Buddhist Hybrid Chinese, which is what we are looking to describe. Even granting that the Sanskrit text of the Gandhavyūha evolved between the 5th and the 8th century, its grammar cannot have changed to the same degree as the translation idiom did.
Hung, J.-J., M. Bingenheimer, and S. Wiles (2010). Quantitative Evidence for a Hypothesis regarding the Attribution of early Buddhist Translations. Literary and Linguistic Computing 25(1): 119-134.
Karashima, S. 辛嶋靜志 (1994). Chōagonkyō no gengo no kenkyū – onshago bunseki o chūshin toshite 長阿含經の原語の研究 – 音写語分析を中心として [Study of the language of the Chinese Dīrghāgama]. Tokyo: Hirakawa shuppansha 平河出版社.
Pelliot, P. (1933). Pāpīyān > 波旬 Po-siun. T’oung Pao (Sec. Series), 30 (1-2): 85-99.
Pulleyblank, E. G. (1965). The Transcription of Sanskrit K and Kh in Chinese. Asia Major 11 (2): 199-210.
Soothill, W. E. and L. Hodous (1937). A Dictionary of Chinese Buddhist Terms. London: Kegan. [Reprint Delhi: Motilal, 1994]. Digital as XML/TEI file at
Yu, L. 俞理明 (1993). Fojing wenxian yuyan 佛経文献語言 [The Language of the Buddhist Scriptures]. Chengdu: Bashu shushe 巴蜀書社.
1. The CBETA edition is an openly available digital edition of the Chinese Buddhist Canon (the texts can be downloaded in various formats at


Conference Info


ADHO - 2012
"Digital Diversity: Cultures, languages and methods"

Hosted at Universität Hamburg (University of Hamburg)

Hamburg, Germany

July 16, 2012 - July 22, 2012

Series: ADHO (7)

Organizers: ADHO
