The Adaption and Breakthrough of Chinese Documents Encoding – A Case Study of CBETA Digital Tripitaka and TEI

  1. Aming Tu

    Chung-Hwa Institute of Buddhist Studies

Work text
  This presentation reports on the creation of a Chinese text database using TEI/XML markup and on the encoding procedures applied by the Chinese Buddhist Electronic Text Association (CBETA).

CBETA was formally established on Feb. 15, 1998, with support from the Yinshun Foundation of North America and the Chung-Hwa Institute of Buddhist Studies, to create, maintain and distribute free of charge an electronic version of the Chinese Buddhist Tripitaka. The CBETA Chinese Electronic Tripitaka is based on Vols. 1-55 and 85 of the Taisho Tripitaka 「大正新脩大藏經」 (© Daizo Shuppansha, Inc.; the right to input and distribute this database has been officially granted by the copyright holder, Daizo Shuppansha). Each volume contains around 1,000 pages of Buddhist text (fig. 1), about one hundred million Chinese characters in total.

(fig. 1)

The electronic text files of the Taisho Tripitaka are created from one set of TEI/XML source files. The distribution includes the following formats:
Normalized version

In this version, no footnotes are included. The text follows the Taisho format; each line begins with information about volume, text number, page, section and line (e.g. T08n0221_p0001a09).
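The line identifier can be taken apart mechanically. The following is a minimal sketch in Python, assuming the layout suggested by the single example above: `T` plus a two-digit volume, `n` plus a four-digit text number, `_p` plus a four-digit page, a one-letter section and a two-digit line. The names `LineRef` and `parse_line_id` are illustrative and are not part of CBETA's tooling.

```python
import re
from typing import NamedTuple


class LineRef(NamedTuple):
    volume: int    # Taisho volume, e.g. 8
    text_no: str   # text number, kept as a string to preserve leading zeros
    page: int      # page within the volume
    section: str   # page section, e.g. "a"
    line: int      # line within the section


# Assumed pattern, inferred only from the example "T08n0221_p0001a09".
LINE_ID = re.compile(r"^T(\d{2})n(\d{4})_p(\d{4})([a-z])(\d{2})$")


def parse_line_id(line_id: str) -> LineRef:
    """Split a line identifier into its five components."""
    m = LINE_ID.match(line_id)
    if m is None:
        raise ValueError(f"unrecognised line identifier: {line_id!r}")
    volume, text_no, page, section, line = m.groups()
    return LineRef(int(volume), text_no, int(page), section, int(line))


ref = parse_line_id("T08n0221_p0001a09")
print(ref.volume, ref.text_no, ref.page, ref.section, ref.line)
```

Keeping the text number as a string is a design choice of this sketch, so that the identifier can be rebuilt without losing leading zeros.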
'App' version for online search

This version has the same basic features as the normalized version, but where a line does not end with a full stop, characters have been moved from the end of that line to the beginning of the following line(s). The number of characters moved is added to the string at the beginning of the line (e.g. T08n0221_p0001a09(02)). This method has not been employed for verse lines.
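One possible reading of this rule is sketched below in Python. It assumes that the characters after the last full stop (。) of a line are carried over to the next line, and that the count in parentheses is attached to the line that receives them. Both points are interpretations of the description above, not confirmed details of CBETA's converter, and the function name `reflow` is hypothetical. Verse lines, which the text says are exempt, are not handled.

```python
FULL_STOP = "。"


def reflow(lines):
    """lines: list of (line_id, text) pairs in reading order.

    Returns a list of (line_id + "(NN)", text) pairs, where NN is the number
    of characters that were moved onto that line from the previous one.
    """
    out = []
    carry = ""  # characters waiting to be prepended to the next line
    for i, (line_id, text) in enumerate(lines):
        moved = len(carry)
        text = carry + text
        carry = ""
        is_last = i == len(lines) - 1
        if not is_last and not text.endswith(FULL_STOP):
            # Cut after the last full stop; cut is 0 if the line has none,
            # in which case the whole line is carried forward.
            cut = text.rfind(FULL_STOP) + 1
            text, carry = text[:cut], text[cut:]
        out.append((f"{line_id}({moved:02d})", text))
    return out


sample = [
    ("T08n0251_p0848c07", "觀自在菩薩。行深般若波羅蜜多時。照見五"),
    ("T08n0251_p0848c08", "蘊皆空。度一切苦厄。舍利子。色不異空。空不"),
    ("T08n0251_p0848c09", "異色。色即是空。空即是色。受想行識亦復如"),
]
for line_id, text in reflow(sample):
    print(line_id, text)
```

After reflowing, a phrase such as 照見五蘊皆空 sits on a single line, which is what makes a simple line-based search find it.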
HTMLHelp version

This version has been created for the Microsoft HTMLHelp browser, which is included with Windows 98 (and available as an update for Windows 95). The text has been normalized in the same way as the above versions, but the corrections to the Taisho text are now displayed in red. Every section of the Taisho page (e.g. p0001, section a) displays on an HTML page by itself.
WORD version

This is an experimental dump of the XML source files into MS-WORD format. Since characters are converted to Unicode where possible, it requires Word versions greater than Word97 (although Word 98 for the Macintosh has not been tested). Characters not displayable in Unicode are replaced by variant system characters where possible, otherwise the fonts and numbers of the Mojikyo Font Institute are used. There is a function to recover the original XML format.
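The fallback order described for characters in the Word version can be sketched as a simple chain in Python. The `RareChar` record, its field names, the bracketed output format and the sample Mojikyo number are all hypothetical and serve only to illustrate the order of preference: Unicode first, then a variant system character, then a Mojikyo reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RareChar:
    unicode: Optional[str] = None  # the character itself, if Unicode encodes it
    variant: Optional[str] = None  # a system character accepted as a variant
    mojikyo: Optional[str] = None  # Mojikyo font number, as a string


def display_form(ch: RareChar) -> str:
    """Pick the best available representation, in the order the text gives."""
    if ch.unicode:
        return ch.unicode
    if ch.variant:
        return ch.variant
    if ch.mojikyo:
        return f"[M{ch.mojikyo}]"  # placeholder notation, assumed for this sketch
    return "[?]"                   # nothing usable is known about the character


print(display_form(RareChar(unicode="罣")))
print(display_form(RareChar(mojikyo="012345")))  # hypothetical number
```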

All publication forms produced by CBETA, i.e. the standard normalized version, the 'App' version for online search, and the HTML, HTMLHelp and Word versions, are derived from one single set of master files.
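As an illustration of deriving one output format from the master files, the sketch below turns a small XML fragment into lines in the style of the normalized version. It uses Python's standard `xml.etree.ElementTree` and assumes that each `<lb n="..."/>` carries page, section and line in its `n` attribute. The function name, the space used as separator and the restriction to text that directly follows an `<lb/>` are simplifications of this sketch, not CBETA's actual conversion.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    '<p><lb n="0848c07"/>觀自在菩薩。行深般若波羅蜜多時。照見五'
    '<lb n="0848c08"/>蘊皆空。度一切苦厄。舍利子。色不異空。空不</p>'
)


def normalized_lines(xml_text, volume, text_no):
    """Emit one 'identifier text' string per <lb/> milestone."""
    root = ET.fromstring(xml_text)
    lines = []
    for lb in root.iter("lb"):
        n = lb.get("n")                 # e.g. "0848c07": page, section, line
        text = (lb.tail or "").strip()  # text directly after the milestone
        lines.append(f"T{volume:02d}n{text_no:04d}_p{n} {text}")
    return lines


for line in normalized_lines(FRAGMENT, volume=8, text_no=251):
    print(line)
```

A fuller converter would also have to walk into nested elements such as `<app>`, which this sketch skips.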

CBETA Tripitaka TEI/XML Source Files

Before introducing the CBETA Tripitaka TEI/XML source files, the presentation will describe how CBETA has adapted TEI in the ongoing creation of the Electronic Chinese Buddhist Tripitaka. (The Text Encoding Initiative is an international project to develop guidelines for the encoding of textual material in electronic form for research purposes.)

The CBETA TEI workflow proceeds in four steps:
1.) Basic Markup
2.) Structure Markup
3.) Content Markup
4.) Rare Characters Markup

The problems encountered and encoding applied in the Electronic CBETA Taisho Tripitaka:

Structural Markup:
<DIV1>, <P>, <LG>…

Markup of textual variants:
<APP>, <LEM>, <RDG>

Some elements added for CBETA:
A specialized milestone element <JUAN><PIN>
A group of elements for sound glosses <FAN><ZI><YIN>
Other Markup <SKGLOSS>

[Example 1]

<juan fun="open" n="02">

*<div2 type="pin"><head>興光住品第三</head>

[Example 2]

<juan fun="close"></juan>

[Example 3]

<SKGLOSS n="066204">


In the presentation the author will introduce the workflow of the Chinese Buddhist Electronic Text Association (CBETA) and the tag sets applied, or created where the TEI Guidelines are not sufficient. The author will also share the results developed from this work, such as a 'scholarly' retrieval that utilizes the text-critical notes of the Taisho edition to create a view of different historical editions of the Chinese Tripitaka, and the comparative reading of different translated versions.
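How text-critical notes can yield a view of a particular edition may be sketched as follows in Python. The fragment is modelled on the dharani line of the Heart Sutra sample; the `<lem>` element and its content are assumptions added for illustration, and matching a witness by substring of the `wit` attribute is a simplification. The functions `render` and `choose` are hypothetical and do not describe CBETA's retrieval system.

```python
import xml.etree.ElementTree as ET

# <lem> is assumed here; only the <rdg> is taken from the sample text.
SAMPLE = (
    '<term>揭帝揭<app n="084805"><lem>帝</lem>'
    '<rdg wit="【三】*">諦</rdg></app></term>'
)


def choose(app, witness):
    """Return the reading of `witness` if the <app> has one, else the lemma."""
    if witness is not None:
        for rdg in app.findall("rdg"):
            if witness in rdg.get("wit", ""):
                return render(rdg, witness)
    lem = app.find("lem")
    return render(lem, witness) if lem is not None else ""


def render(elem, witness=None):
    """Flatten an element to text, resolving each <app> for the witness."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "app":
            parts.append(choose(child, witness))
        else:
            parts.append(render(child, witness))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
print(render(root))          # base text (lemma readings)
print(render(root, "【三】"))  # text as read by the witness group 【三】
```

Calling `render` once per witness gives one view of the same passage per edition, all generated from a single encoded source.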


TEI: /


Dr. Christian Wittern:


An example: the Heart Sutra

<?xml version="1.0" encoding="big5" ?>
<!DOCTYPE tei.2 SYSTEM "../dtd/cbetaxml.dtd"
[<!ENTITY % ENTY SYSTEM "T08n0251.ent" >
Taisho Tripitaka, Electronic version, No. 251 般若波羅蜜多心經
<resp>Electronic Version by</resp>
<edition>Version 1.0 (Big5)</edition>
<name>中華電子佛典協會 (CBETA)</name>
Available for non-commercial use when distributed with this header intact.
<date>Dec 1998</date>
Taisho Tripitaka Vol. 08, Nr. 251
<p lang="zh" type="ly">
This is a very preliminary format-conversion.
<item>converted to XML with CBXML.BAT (99/6/30)</item>
<pb ed="T" id="T08.0251.0848c" n="0848c"/>
<lb n="0848c01"/>
<lb n="0848a05"/>
<lb n="0848c03"/>
<lb n="0848c04"/>
<div1 type="jing">
<skgloss n="084801">
Praj&ntilde;&amacron;p&amacron;ramit&amacron; h&rdotblw;daya(A.小).
<lb n="0848c05"/>
<lb n="0848c06"/>
<byline type="Translator">
<app n="084802">
<rdg wit="【宋】">&lac;</rdg>
<app n="084803">
<rdg wit="【三】">奉詔</rdg>

<p><lb n="0848c07"/>觀自在菩薩。行深般若波羅蜜多時。照見五
<lb n="0848c08"/>蘊皆空。度一切苦厄。舍利子。色不異空。空不
<lb n="0848c09"/>異色。色即是空。空即是色。受想行識亦復如
<lb n="0848c10"/>是。舍利子。是諸法空相。不生不滅。不垢不淨
<lb n="0848c11"/>不增不減。是故空中。無色。無受想行識。無眼
<lb n="0848c12"/>耳鼻舌身意。無色聲香味觸法。無眼界。乃至
<lb n="0848c13"/>無意識界。無無明。亦無無明盡。乃至無老死。
<lb n="0848c14"/>亦無老死盡。無苦集滅道。無智亦無得。以無
<lb n="0848c15"/>所得故。菩提薩埵。依般若波羅蜜多故。心無
<lb n="0848c16"/>罣礙。無罣礙故。無有恐怖。遠離顛倒夢想。究
<lb n="0848c17"/>竟涅槃。三世諸佛。依般若波羅蜜多故。得阿
<lb n="0848c18"/>耨多羅三藐三菩提。故知般若波羅蜜多。是
<lb n="0848c19"/>大神咒。是大明咒是無上咒。是無等等咒。能
<lb n="0848c20"/>除一切苦。真實不虛故。說般若波羅蜜多咒
<lb n="0848c21"/>即說咒曰</p>
<lb n="0848c22"/><p type="dharani">
<skgloss n="084804a">
<gloss>Gate gate</gloss>
<term>揭帝揭<app n="084805">
<rdg wit="【三】*">諦</rdg>

<skgloss n="084804b">
<term><app n="084806">
<rdg wit="【三】*">波</rdg>
</app>羅揭<app n="a334801">
<rdg wit="【三】">諦</rdg>
<term><app n="a334802">
<rdg wit="【三】">波</rdg>
</app>羅僧揭<app n="a334803">
<rdg wit="【三】">諦</rdg>
<lb n="0848c23"/><skgloss n="084804c">
<gloss>bodhi Sv&amacron;h&amacron;</gloss>
<term>菩提<app n="084807">
<rdg wit="【三】">薩婆</rdg>
<lb n="0848c24"/>

