SMART Project: Methods for Computer-based Research of Premodern Chinese Texts

Christian Wittern

Chung-Hwa Institute of Buddhist Studies

This presentation will start with a look at some of the problems encountered so far in a number of projects that have tried to apply TEI [TEIP3] markup to premodern Chinese Buddhist texts. I have been working with the TEI Guidelines for more than seven years and published the first text, rather heavily marked up in TEI fashion, in 1995 (note 1). Since then I have become involved with several other projects digitizing Chinese Buddhist texts, most prominently the work of the Chinese Buddhist Electronic Text Association (CBETA) (note 2). We now have about 200 MB of texts with basic markup (note 3) according to the Guidelines.

All of these projects worked from printed editions published 80-100 years ago. One of the most obvious problems we encountered is the large number of non-standard characters found in these texts. TEI, and SGML in general, can handle this quite elegantly; nevertheless, there are some important details that should be noted (note 4). Some of the more subtle problems involve structural elements specific to texts from the sphere of Chinese cultural influence. One example is the notion of a scroll, which is carried over from the time when the documents were actually written on scrolls but still marks divisions in the printed editions. Being based on the physical medium, scroll divisions fall into a similar category as the LB, PB and MILESTONE elements in TEI, but they are usually associated with heading-like text, colophons and the like. While this could be handled with the FW element in some way, we decided on our own solution, which was to introduce a new element, JUAN (Chinese for scroll), and to encode the information therein. Other structural elements that presented difficulties include sound glosses in the text, and colophons or other backmatter-like text that stands at the end of a scroll but in the middle of a DIV element that continues on the next scroll.
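As a rough illustration of how a scroll boundary carrying its own text might be encoded and then read back, the following minimal sketch parses a small fragment with Python's standard library. The attribute names `n` and `fun`, the sample content, and both helper functions are assumptions made for this example; they are not taken from the project's actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the attributes "n" and "fun" and all element content
# are illustrative assumptions, not the project's actual markup scheme.
SAMPLE = """
<DIV type="chapter">
  <P>text before the scroll boundary</P>
  <JUAN n="1" fun="close">end-of-scroll title and colophon</JUAN>
  <JUAN n="2" fun="open">opening title of the next scroll</JUAN>
  <P>text after the scroll boundary</P>
</DIV>
""".strip()


def scroll_boundaries(xml_text):
    """Return (scroll number, function, enclosed text) for each JUAN element."""
    root = ET.fromstring(xml_text)
    return [(j.get("n"), j.get("fun"), (j.text or "").strip())
            for j in root.iter("JUAN")]


def text_of_scroll(xml_text, number):
    """Collect paragraph text that follows the opening JUAN of a given scroll."""
    root = ET.fromstring(xml_text)
    current, collected = None, []
    for elem in root.iter():  # iter() walks the tree in document order
        if elem.tag == "JUAN" and elem.get("fun") == "open":
            current = elem.get("n")
        elif elem.tag == "P" and current == number:
            collected.append((elem.text or "").strip())
    return collected
```

Unlike an empty milestone, the element here holds the heading or colophon text itself, so the DIV can continue across the scroll boundary without being closed.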

A second part of this presentation will give an overview of recent developments in the SMART (System for Markup and Retrieval of Texts) project (note 5). This project aims to provide a working environment for research and markup on East Asian texts by utilizing the TEI Guidelines (see also [SpMcQ91]) and other international, open standards. The environment is designed to enable network-based collaboration and layered, private markup added to a central repository of texts, but it is also intended to be usable on stand-alone machines without a live connection to the Internet. So far, the basic framework has been outlined and some of the utilities built. Originally, the plan was to develop this into a collection of open modules that can interact through an open protocol, in the spirit of presentations at ACH/ALLC 1999 by Michael Sperberg-McQueen, John Bradley and others. However, since such a protocol specification is far from being finalized, I found that I would rather have a concrete implementation to experiment with and to iron out problems. I therefore recently decided to build the tools I need on top of the Zope (note 6) Web application platform. Zope is an open-source project built mainly with Python, implementing an object-oriented database and a complete framework for developing dynamic Web applications. It has strong support for XML and related standards and thus seems especially suited to the purpose at hand. All the methods are exposed through a URL-based interface and are also callable through XML-RPC.
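To make the XML-RPC route concrete, the following sketch marshals and unmarshals a call using Python's standard `xmlrpc.client` module, without contacting any server. The method name `search` and its two parameters are hypothetical; the abstract does not name SMART's actual methods.

```python
import xmlrpc.client


def build_search_call(query, max_hits=10):
    """Marshal a call to a hypothetical 'search' method as an XML-RPC request.

    The method name and parameters are assumptions made for illustration.
    """
    return xmlrpc.client.dumps((query, max_hits), methodname="search")


def parse_call(request_xml):
    """Unmarshal an XML-RPC request into (method name, parameter tuple)."""
    params, method = xmlrpc.client.loads(request_xml)
    return method, params


request = build_search_call("菩提")
```

Against a running server, the same call could be issued through `xmlrpc.client.ServerProxy`, given the server's URL.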

The presentation in the context of the ALLC/ACH conference aims to contribute to a discussion of how such an open framework can be implemented, while at the same time showing some of the problems that arise when dealing with East Asian languages (see [ApWi96] and [CCAG80-85]). East Asian languages do not normally mark word boundaries, and even the definition of a word is highly disputed among linguists. In this situation, a list of all occurring words in the manner of a word-wheel cannot be used. Additionally, the texts used here contain markup of textual variants, which complicates the creation of an index. Furthermore, different representations of the same character in machine-readable encodings have to be accounted for. An indexing method that takes these problems into account and also provides an abstraction from the indexing of actual low-level locations in the text has been developed (note 7).
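The indexing method itself is described in [Wit99] and is not reproduced here. As a stand-in, the sketch below shows one common way to index text without word boundaries: overlapping character n-grams, with variant character forms folded to a canonical form before indexing. The variant table, the function names, and the use of (text id, character offset) pairs as locations are all assumptions for illustration. Markup of textual variants is not handled.

```python
from collections import defaultdict

# Hypothetical variant table: maps a variant form to a chosen canonical form.
VARIANTS = {"爲": "為", "眞": "真"}


def normalize(text):
    """Fold variant characters to their canonical form, one-to-one."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)


def build_index(texts, n=2):
    """Index overlapping character n-grams; no word segmentation is needed.

    `texts` maps a text id to its string. Postings store (text id, character
    offset) pairs rather than byte positions in a file.
    """
    index = defaultdict(list)
    for text_id, raw in texts.items():
        chars = normalize(raw)
        for i in range(len(chars) - n + 1):
            index[chars[i:i + n]].append((text_id, i))
    return index


def search(index, query, n=2):
    """Return sorted (text id, offset) pairs where the normalized query occurs."""
    q = normalize(query)
    if len(q) < n:
        raise ValueError("query is shorter than the n-gram size")
    # Start from the postings of the first n-gram, then require every later
    # n-gram of the query at the correspondingly shifted offset.
    candidates = set(index.get(q[:n], []))
    for k in range(1, len(q) - n + 1):
        shifted = {(t, i - k) for (t, i) in index.get(q[k:k + n], [])}
        candidates &= shifted
    return sorted(candidates)
```

Because query and text pass through the same normalization, a query typed with one variant form also finds occurrences written with another.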

The SMART project will be utilized in two different contexts:

1. As a retrieval and interface engine for the Buddhist text database produced by the Chinese Buddhist Electronic Text Association. SMART will allow retrieval with enhanced queries and the addition of markup based on these queries, thus providing a powerful way to enrich the markup gradually.
2. As the central research platform for a research project on texts of the Chan school of Chinese Buddhism. Here a smaller corpus of texts is used to build not only richly marked-up texts, but also supporting databases of proper names, sites and historical dates, to allow knowledge-base centered retrieval of the texts.
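The idea of adding markup from query results can be sketched as follows: every occurrence of a set of terms, for instance proper names taken from a supporting database, is wrapped in an element. The function name `tag_hits` and the default element name `name` are assumptions for this example, and the sketch handles plain character data only, not text that already contains markup.

```python
import re
from xml.sax.saxutils import escape


def tag_hits(text, terms, element="name"):
    """Wrap each occurrence of the given terms in an XML element.

    Longer terms are tried first, so a term contained in another term does
    not split the longer match. All character data is XML-escaped.
    """
    terms = [t for t in terms if t]  # ignore empty terms
    if not terms:
        return escape(text)
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    out, last = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[last:m.start()]))
        out.append(f"<{element}>{escape(m.group())}</{element}>")
        last = m.end()
    out.append(escape(text[last:]))
    return "".join(out)
```

Repeating such passes with different term lists and element names is one way the markup of a corpus could be enriched step by step.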

A demonstration of both applications will be given in this presentation.


1. The Chan-Buddhist genealogical history Wudeng Huiyuan (first printed in 1253) on the ZenBase1 CD-ROM, see [App et al 95].
2. The CBETA project website (mostly in Chinese) is at <>
3. This basic markup follows the general ideas outlined in [Wit96].
4. I will not go into detail for this audience, but some references to these problems can be found in the work by the Chinese Characters Analysis Group. More recently, we based our efforts on the work done by the Mojikyo Font Institute in Japan <>.
5. The project website is at <>.
6. For more information on Zope see <>.
7. More information can be found in [Wit99].


