Department of Computer Science - Boston University
The TEI and monolithic DTDs: problems and human factors
There are several problems with the current TEI DTD that are inherent to the TEI's goals and community and to the idea of a single DTD. DTDs are useful because they are essentially a contract that encoders make as to what their content will look like. Because the contract is expressed in a machine-readable way, a computer can check compliance with the terms of that contract. This can enhance the consistency of documents created to a particular house style, and it eases the implementation of software to process those documents. For a project like the TEI, however, these advantages are harder to secure.
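The idea of a DTD as a machine-checkable contract can be illustrated with a minimal sketch. It is not a DTD validator: it assumes XML rather than SGML syntax so that Python's standard library can parse the input, and the table of allowed children is a hypothetical house style, not an excerpt from the TEI DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical house-style "contract": for each element, the child
# elements it may contain. A real DTD content model also constrains
# order and repetition; this sketch checks membership only.
ALLOWED_CHILDREN = {
    "text": {"body"},
    "body": {"div"},
    "div": {"head", "p", "div"},
    "head": set(),
    "p": set(),
}

def violations(xml_source):
    """Return a list of human-readable contract violations."""
    problems = []
    root = ET.fromstring(xml_source)
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            problems.append(f"undeclared element <{parent.tag}>")
            continue
        for child in parent:
            if child.tag not in allowed:
                problems.append(
                    f"<{child.tag}> not allowed inside <{parent.tag}>"
                )
    return problems

GOOD = "<text><body><div><head>One</head><p>Hello.</p></div></body></text>"
BAD = "<text><body><p>Stray paragraph.</p></body></text>"

print(violations(GOOD))
print(violations(BAD))
```

The tighter the table, the more encoding mistakes such a check can catch, which is why a contract permissive enough for every project catches comparatively few.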
The TEI must meet the needs of many different scholarly communities, all studying widely different types of primary and secondary source material. This has several effects: (1) The list of textual features becomes very large--far larger than the number of things that would be marked in any particular project. (2) The DTD must impose minimal restrictions on content models and tagging structure, being more permissive than any individual project needs in order that all projects can be accommodated. (3) The DTD must include modularity and optionality mechanisms, since whole sets of elements will be inessential to significant numbers of users of the TEI DTD. (4) Arbitrary extensions must be possible, because even at more than 1200 pages, the TEI guidelines are not sufficient to meet all the needs of humanities scholars.
TEI P3 meets many of these needs with a complex system of modules, element classes, and tag renaming rules. While the concepts are sensible and well understood, and the design works, the complexity of modifying the DTD is significant. This is partly due to the clumsiness of SGML's parameter entity mechanisms, and partly due to the sheer scope of the DTD, which makes it difficult to see where best to make modifications and especially what their implications will be. Because parameter entities are an indirect way of modifying the SGML declarations, one must not only understand enough about SGML to conceptualize encoding modifications (and their effect on the structure of the DTD), but also understand enough about the TEI customization mechanisms to execute changes to the DTD.
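The indirection involved can be sketched in a few lines. The sketch below imitates two properties of SGML parameter entities: the first declaration of an entity is the one that counts, and references of the form %name; are expanded textually. The entity names and content models are made up for illustration and are not the actual TEI P3 declarations.

```python
import re

def declare(entities, name, value):
    """Record a parameter entity; as in SGML, the first declaration wins."""
    entities.setdefault(name, value)

def expand(text, entities):
    """Replace %name; references repeatedly until none remain."""
    pattern = re.compile(r"%([A-Za-z][\w.]*);")
    while pattern.search(text):
        text = pattern.sub(lambda m: entities[m.group(1)], text)
    return text

entities = {}
# A project's local declarations are read first ...
declare(entities, "x.phrase", "myTerm |")
# ... so the empty default declared later in the main DTD is ignored.
declare(entities, "x.phrase", "")
declare(entities, "m.phrase", "%x.phrase; emph | term")
declare(entities, "paraContent", "(#PCDATA | %m.phrase;)*")

content_model = expand("%paraContent;", entities)
print(content_model)
```

The project never edits the paragraph content model itself. It redefines an entity two levels away, and has to know the chain of references to predict the result.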
The work by Simons on extracting sub-DTDs from the main TEI DTD (see the first paper of this session) points the way to a different approach, based on recognizing that SGML knowledge and expertise are now much more available than they once were, and that direct modification of a DTD is in fact within technical reach of most projects. Furthermore, a DTD that is controlled by a particular project can reap certain benefits from tailoring to the specific documents being encoded. A project-specific DTD can be more constrained in helpful ways.
Sketch of a different approach
So what should the TEI do? Architectural forms now provide a good technique for declaring semantics and syntactic restrictions without requiring a particular DTD. Perhaps the best way to solve the problems of the monolithic DTD approach is simply to abandon it.
Instead of the current DTD, the next generation of the TEI guidelines should be structured as groups of architectural forms, organized by application areas. Since creation of a complete DTD is a difficult task, users of the guidelines should be provided with "starter sets," complete DTDs for specific applications like a critical edition of a verse drama, or a linguistic analysis of a short story, or a collection of letters. These should be carefully chosen so that each of the current base types is represented at least once, with some of the more complex optional modules included in additional examples.
These DTDs would be exemplary, and could serve as starting points for the DTDs needed by new TEI-using projects. Since they would be freed of the need to be normative for all documents in their genre, they could be kept simple to understand. As examples of the correct application of the architectural forms constituting the "new TEI" they would serve as documentation of the intended use of the tags. As smaller, self-contained DTDs, they could be read and comprehended in their entirety in a relatively short period of time (one or two days), something that is not currently possible with TEI P3.
Technical advantages
A number of technical advantages are available with such an approach: (1) Creating new elements that are simply variations of existing TEI elements is much easier in an architectural approach. (2) Whereas the existing TEI DTD extension mechanisms can result in document instances that generic TEI software could not deal with, the architectural approach would facilitate development of a new generation of generic processing software based on the TEI architectural forms. (3) A simplification would result from merging multiple TEI elements with similar formal properties into a single architectural form; for instance, tags like <appendix>, <chapter>, and the like could live on as recommended standard instances of a generic architectural form (e.g. <div>).
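Points (2) and (3) can be made concrete with a rough sketch of attribute-controlled processing. It is not an implementation of the HyTime architectural form machinery: it assumes XML syntax, a hypothetical attribute named teiform that names the form an element instantiates, and hypothetical project elements <book>, <chapter>, <appendix>, and <title>.

```python
import xml.etree.ElementTree as ET

FORM_ATTRIBUTE = "teiform"  # hypothetical attribute naming the form

def architectural_view(element):
    """Return a copy of the tree with each element renamed to its form.

    Elements without the form attribute keep their own name here; a real
    architectural engine has richer rules.
    """
    form = element.get(FORM_ATTRIBUTE, element.tag)
    attrs = {k: v for k, v in element.attrib.items() if k != FORM_ATTRIBUTE}
    # Keep the project-specific name so that no information is lost.
    if form != element.tag:
        attrs["type"] = element.tag
    view = ET.Element(form, attrs)
    view.text, view.tail = element.text, element.tail
    view.extend([architectural_view(child) for child in element])
    return view

PROJECT_DOC = (
    '<book teiform="div">'
    '<chapter teiform="div"><title teiform="head">Start</title></chapter>'
    '<appendix teiform="div"><title teiform="head">Notes</title></appendix>'
    '</book>'
)

def headings_by_division(xml_source):
    """Generic software: it knows only the forms div and head."""
    root = architectural_view(ET.fromstring(xml_source))
    return [
        (div.get("type"), div.findtext("head"))
        for div in root.iter("div")
        if div.find("head") is not None
    ]

print(headings_by_division(PROJECT_DOC))
```

The generic function never mentions <chapter> or <appendix>. A project could add further kinds of division without breaking it, provided each new element declares the form it instantiates.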
In conclusion, the TEI approach has been proven to work, but also to have some drawbacks. Now that the idea of architectural forms has come of age--it has been applied in several areas, software is available, and the issues are better understood--the TEI should make use of it to simplify the structure of the standard. I would note that despite the standardization of HyTime architectural forms, the notion is more general, and it may be worthwhile to adapt it to the specific needs of the TEI. Attribute-controlled processing (the notion underlying architectural forms) seems to be a great fit for the TEI's needs, but it remains to be seen whether the HyTime approach will be the best way to apply it to the TEI.
Bibliography
[Cla98] Clark, J. (1998) SP: An SGML System Conforming to International Standard ISO 8879--Standard Generalized Markup Language, version 1.3. <http://jclark.com/sp/>. See especially "Architectural form processing," <http://jclark.com/sp/archform.htm>.
[Cov98] Cover, R. (1998) "Architectural Forms and SGML/XML Architectures," in The SGML/XML Web Page. <http://www.oasis-open.org/cover/topics.html#archForms>.
[DD94] DeRose, S. and Durand, D. (1994) Making Hypermedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. See especially pages 79-90.
[Don87] Donner, W. (1987) Sikaiana Vocabulary: Na male ma na talatala o Sikaiana. Honiara, Solomon Islands: published by the author through a grant from the South Pacific Cultures Fund of the Australian government. 267 pp.
[ISO92] International Organization for Standardization. (1992) ISO/IEC 10744. Hypermedia/Time-based Structuring Language: HyTime.
[ISO97] International Organization for Standardization. (1997) "Architectural Form Definition Requirements (AFDR)," Annex A.3 of ISO/IEC N1920, Information Processing--Hypermedia/Time-based Structuring Language (HyTime), Second edition 1997-08-01. <http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html>.
[Kim97] Kimber, W. E. (1997) "A tutorial introduction to SGML architectures," an ISOGEN International Corporation workpaper. <http://www.isogen.com/papers/archintro.html>.
[Sim97] Simons, G. (1997) "Using architectural forms to map TEI data into an object-oriented database," TEI Tenth Anniversary User Conference, November 14-16, 1997, Brown University. To appear in Computers and the Humanities. A fuller workpaper is available at <http://www.sil.org/cellar/import/>.
[SMB94] Sperberg-McQueen, C. M. and L. Burnard (eds.). (1994) Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative. <http://www-tei.uic.edu/orgs/tei/p3/elect.html>. See especially chapter 12, "Print dictionaries," chapter 28, "Conformance," and chapter 29, "Modifying the TEI DTD."
Hosted at University of Virginia
Charlottesville, Virginia, United States
June 9, 1999 - June 13, 1999
Conference website: http://www2.iath.virginia.edu/ach-allc.99/schedule.html