SMART Project: Methods for Computer-based Research of Premodern Chinese Texts

paper
Authorship
  1. Christian Wittern

    Chung-Hwa Institute of Buddhist Studies

Work text
This presentation will start with a look at some of the problems encountered so far in a number of projects that have tried to apply TEI [TEIP3] markup to premodern Chinese Buddhist texts. I have been working with the TEI Guidelines for more than seven years and published the first text, rather heavily marked up in TEI fashion, in 1995 [1]. Since then I have become involved with several other projects digitizing Chinese Buddhist texts, most prominently the work of the Chinese Buddhist Electronic Text Association (CBETA) [2]. We now have about 200 MB of texts with basic markup [3] according to the Guidelines.

All of these projects worked from printed editions published 80-100 years ago. One of the most obvious problems we encountered is the large number of non-standard characters found in these texts. TEI, and SGML in general, can handle this elegantly; nevertheless there are some important details that should be noted [4]. Some of the more subtle problems involve structural elements specific to texts from the sphere of Chinese cultural influence. One example is the notion of a scroll, which is carried over from the time when the documents were actually written on scrolls but still marks divisions in the printed editions. Being based on the physical medium, scrolls fall into a similar category to the LB, PB and MILESTONE elements in TEI, but they are usually associated with other heading-like text, colophons and the like. While this could be handled in some way with the FW element, we decided on our own solution, which was to introduce a new element, JUAN (Chinese for scroll), and encode the information therein. Other structural elements that presented difficulties include sound glosses in the text, and colophons or other backmatter-like text that appears at the end of a scroll but in the middle of a DIV element that continues on the next scroll.
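
The JUAN idea can be illustrated with a minimal sketch in Python. Only the element name JUAN comes from the description above; the attributes `n` and `fun`, the sample content and the helper `juan_boundaries` are hypothetical and serve only to show how a scroll boundary might carry its heading-like text while sitting inside a DIV that continues across scrolls.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one DIV that spans two scrolls. The attributes
# "n" (scroll number) and "fun" (open/close) are assumptions made for
# this sketch, not taken from the project's actual DTD.
SAMPLE = """
<DIV type="chapter">
  <JUAN n="1" fun="open">Heading of scroll 1</JUAN>
  <P>First part of the chapter.</P>
  <JUAN n="1" fun="close">End of scroll 1</JUAN>
  <JUAN n="2" fun="open">Heading of scroll 2</JUAN>
  <P>The chapter continues on the next scroll.</P>
</DIV>
"""

def juan_boundaries(xml_text):
    """Return (scroll number, function, text) for every JUAN element."""
    root = ET.fromstring(xml_text.strip())
    return [(j.get("n"), j.get("fun"), (j.text or "").strip())
            for j in root.iter("JUAN")]

boundaries = juan_boundaries(SAMPLE)
for n, fun, text in boundaries:
    print(n, fun, text)
```

Because the scroll boundaries are recorded as elements inside the DIV rather than as containers around it, the logical division stays intact even though the physical division cuts through it.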

A second part of this presentation will give an overview of recent developments in the SMART (System for Markup and Retrieval of Texts) project [5]. This project aims to provide a working environment for research and markup on East Asian texts by utilizing the TEI Guidelines (see also [SpMcQ91]) and other international, open standards. The environment is meant to enable network-based collaboration and layered, private markup added to a central repository of texts, while remaining usable on stand-alone machines without a live connection to the Internet. So far, the basic framework has been outlined and some of the utilities built. Originally, the plan was to develop this into a collection of open modules that can interact through an open protocol, in the spirit of presentations at ACH/ALLC 1999 by Michael Sperberg-McQueen, Jon Bradley and others. However, since such a protocol specification is far from being finalized, I found that I would rather have a concrete implementation to play with and to iron out problems. I therefore recently decided to build the tools I would need on top of the Zope [6] Web-Application platform. This is an OpenSource™ project built mainly with Python, implementing an object-oriented database and a complete framework for developing dynamic Web-Applications. It has strong support for XML and related standards and thus seems especially suited to the purpose at hand. All the methods are exposed through a URL-based interface and are also callable through XML-RPC.
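
To give an impression of what calling a method through XML-RPC involves, the following sketch uses the `xmlrpc.client` module of current Python (the Python of the Zope era offered the same functionality as `xmlrpclib`). The method name `search` and its two arguments are hypothetical, since the abstract does not name the methods SMART exposes. The code only builds and decodes the request body; nothing is sent over the network.

```python
import xmlrpc.client

# Build the XML-RPC request body for a hypothetical "search" method
# that takes a query string and a maximum number of hits.
request_xml = xmlrpc.client.dumps(("菩提", 10), methodname="search")
print(request_xml)

# Decode the request again, as a server would on receipt.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

Against a live server, the same call would typically be made through `xmlrpc.client.ServerProxy`, given the server's URL, which takes care of the encoding and the HTTP transport.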

The presentation in the context of the ALLC/ACH conference aims at contributing to a discussion of how such an open framework can be implemented, while at the same time showing some of the problems that arise when dealing with East Asian languages (see [ApWi96] and [CCAG80-85]). East Asian languages do not normally mark word boundaries, and even the definition of a word is highly disputed among linguists. In this situation, a list of all occurring words in the manner of a word-wheel cannot be used. Additionally, the texts used here contain markup of textual variants, which complicates the creation of an index. Furthermore, different representations of the same character in machine-readable encodings have to be accounted for. An indexing method that takes these problems into account and also provides an abstraction from indexing of actual low-level locations in the text has been developed [7].
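
The following Python sketch shows one generic way to address these constraints. It is not the SMART index format, which is described in [Wit99]; it swaps in a plain overlapping character n-gram index as a stand-in. The variant table `CANONICAL`, the line IDs and the sample lines are all hypothetical. Characters are normalized before indexing, no word segmentation is needed, and postings point to logical line IDs rather than to byte offsets. Markup of textual variants is not handled here.

```python
from collections import defaultdict

# Hypothetical variant table: maps alternative encodings of a character
# to one canonical form before indexing.
CANONICAL = {"爲": "為", "眞": "真"}

def normalize(text):
    return "".join(CANONICAL.get(ch, ch) for ch in text)

def build_index(lines, n=2):
    """Map each overlapping character n-gram to the set of logical
    line IDs in which it occurs (not to byte offsets)."""
    index = defaultdict(set)
    for line_id, text in lines.items():
        norm = normalize(text)
        for i in range(len(norm) - n + 1):
            index[norm[i:i + n]].add(line_id)
    return index

def search(index, lines, query, n=2):
    """Return the sorted IDs of lines containing the normalized query."""
    q = normalize(query)
    if len(q) < n:
        # Too short for the n-gram index: fall back to a linear scan.
        return sorted(lid for lid, t in lines.items() if q in normalize(t))
    grams = [q[i:i + n] for i in range(len(q) - n + 1)]
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Confirm the full string, since shared n-grams alone do not
    # guarantee that they are adjacent in the line.
    return sorted(lid for lid in candidates if q in normalize(lines[lid]))

# Hypothetical line IDs and sample lines.
lines = {"a01": "無爲眞人", "a02": "無位真人", "a03": "為人說法"}
index = build_index(lines)

print(search(index, lines, "無為"))   # query and text use different variants
print(search(index, lines, "眞人"))
```

Keeping logical IDs in the postings means the index does not have to be rebuilt merely because added markup shifts the offsets within a line.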

The SMART project will be utilized in two different contexts:

1. As a retrieval and interface engine for the Buddhist text database produced by the Chinese Buddhist Electronic Text Association. SMART will allow for retrieval with enhanced queries, and add markup based on these queries, thus providing a powerful way to gradually enrich the markup.
2. As the central research platform for a research project on texts of the Chan school of Chinese Buddhism. Here a smaller corpus is used to build not only texts with rich markup, but also supporting databases of proper names, sites and historical dates that allow for knowledge-base centered retrieval of the texts.

A demonstration of both applications will be given in this presentation.

Notes

1. The Chan-Buddhist genealogical history Wudeng Huiyuan (first printed in 1253) on the ZenBase1 CD-ROM, see [App et al 95].
2. The CBETA project website (mostly in Chinese) is at <http://ccbs.ntu.edu.tw/cbeta>.
3. This basic markup follows the general ideas outlined in [Wit96].
4. I will not go into detail for this audience, but some references to these problems can be found in the work by the Chinese Character Analysis Group. More recently, we based our efforts on the work done by the Mojikyo Font Institute in Japan <http://www.mojikyo.gr.jp>.
5. The project website is at <http://www.chibs.edu.tw/~chris/smart/>.
6. For more information on Zope see <http://www.zope.org>.
7. More information can be found in [Wit99].

References

RHComN: Research in Humanities Computing, Oxford: Clarendon, 1991ff. N is the sequential number of the volume.
[ApWi96] App, Urs and Wittern, Christian (1996). "A New Strategy for Dealing with Missing Chinese Characters", Humanities and Information Processing, No. 10, February 1996, p52-59.
[App et al 95] App, Urs, Fujimoto, Kumiko and Wittern, Christian (1995). ZenBase CD1. International Institute for Zen Buddhism, Kyoto.
[CCAG80-85] Chinese Character Analysis Group (Ed.) (1980-85). Chinese Character Code for Information Interchange, Vol. I-III, Taipei 1980, 1982, 1985.
[CaZa91] Calzolari, Nicola and Zampolli, Antonio "Lexical Databases and Textual Corpora: A Trend of Convergence between Computational Linguistics and Literary and Linguistic Computing", in: [RHCom1], p273-307.
[Lanca91] Lancashire, Ian (Ed.) (1991). The Humanities Computing Yearbook 1989-90: A Comprehensive Guide to Software and other Resources. Clarendon Press, Oxford.
[Latz92] Latz, Hans-Walter (1992). Entwurf eines Modells der Verarbeitung von SGML-Dokumenten in versionsorientierten Hypertext-Systemen: Das HyperSGML Konzept, Diss. Berlin 1992.
[Neum96] Neuman, Michael (1996). "You Can't Always Get What You Want: Deep Encoding of Manuscripts and the Limits of Retrieval", [RHCom5], p209-219.
[Rob94] Robinson, Peter M.W. (1994). "Collate: A program for Interactive Collation of Large Textual Traditions", [RHCom3], p32-45.
[SpMcQ91] Sperberg-McQueen, Michael C. (1991). "Text Encoding and Enrichment", [Lanca91], p503f.
[TEIP3] Sperberg-McQueen, Michael C. and Burnard, Lou (Eds.) (1994). Guidelines for Electronic Text Encoding and Interchange, Chicago and Oxford.
[Wit93] Wittern, Christian (1993). "Chinese Character Encoding", The Electronic Bodhidharma, Nr. 3, July 1993, p44-47.
[Wit94] Wittern, Christian (1994). "Code und Struktur: Einige vorläufige Überlegungen zum Aufbau chinesischer Volltextdatenbanken", Chinesisch und Computer, Nr. 9, April 1994, p15-21.
[Wit95a] Wittern, Christian (1995). "The IRIZ KanjiBase", The Electronic Bodhidharma, Nr. 4, June 1995, p58-62.
[Wit95b] Wittern, Christian (1995). "Chinese character codes: an update", The Electronic Bodhidharma, Nr. 4, June 1995, p63-65.
[Wit96] Wittern, Christian (1996). "Minimal Markup and More - Some Requirements for Public Texts", Conference presentation at the 3rd EBTI meeting on April 7th, 1996 in Taipei, Taiwan.
[Wit99] Wittern, Christian (1999). "SMART: Format of the Index Files", Technical note published on the Internet at <http://www.chibs.edu.tw/~chris/smart/smindex.htm>. (First published July 20th, 1999, last revised January 10th, 2000)
[Yas96] Yasuoka, Koichi and Yasuoka, Yasuko (1996). Kanjibukuro, Kyoto. <http://m-media.kudpc.kyoto-u.ac.jp/~yasuoka/kanjibukuro/>

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2000

Hosted at University of Glasgow

Glasgow, Scotland, United Kingdom

July 21, 2000 - July 25, 2000

Conference website: https://web.archive.org/web/20190421230852/https://www.arts.gla.ac.uk/allcach2k/

Series: ALLC/EADH (27), ACH/ICCH (20), ACH/ALLC (12)

Organizers: ACH, ALLC