Using architectural processing to derive small, problem-specific XML DTDs from the TEI DTD

paper
Authorship
  1. Gary F. Simons

    Graduate Institute of Applied Linguistics

Work text

The work described in this paper began as an effort to perform a particular markup task. In 1983, while doing fieldwork in the Solomon Islands, I helped anthropologist William Donner produce a bilingual dictionary of the Sikaiana language [Don87]. We devised a one-of-a-kind markup system for it. Now, sixteen years later, we want to put this data in a form that can be shared on the Web, so conversion into a standardized form of markup is needed. The leading standard for the markup of dictionaries is the SGML-based TEI (Text Encoding Initiative) DTD [SMB94].

Using this DTD presents three main problems for this project, because what we really want is to: (1) deliver the results on the Web as an XML application, (2) customize the markup in some ways, and (3) use and interchange a streamlined DTD that contains only what is needed for our application. This is a general problem. The large SGML DTDs in widespread use (e.g. HTML, DocBook, ISO 12083, CALS, EAD, TEI) offer the advantage of standardization, but for a particular project they often carry the disadvantage of being too large or too general. This paper demonstrates how architectural processing can be used to develop a problem-specific XML DTD for a project without losing the advantage of conforming to a widely used SGML DTD.

SGML architectures
An SGML architecture is an SGML document type that serves as a basis for deriving new document types. Each element in an architectural DTD is called an architectural form. An architectural form attribute on an element of the user document names the architectural form on which that element is based. For instance, if one were using HTML as an architecture and html as the architectural form attribute, the tag <para html="P"> in a user document declares that this <para> element is derived from (or inherits the semantics of) HTML's <P> element. An architectural processor is a tool that reads the architectural form attributes to translate the user document into the equivalent architectural document.
An architecture is defined by a DTD. We can exploit this fact in solving the problem at hand by using the existing TEI DTD as an architecture. We then write a problem-specific XML DTD to embody the constraints of the project and use an architectural form attribute to map the elements of our XML DTD onto the elements of the TEI architecture.
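The translation step that an architectural processor performs can be sketched in a few lines of Python. This is only an illustration of the renaming idea, not a description of how any real architecture engine works: the attribute name tei and the project element names lexeme, headword, and gloss are hypothetical, the sample text is placeholder data, and no validation against the TEI DTD is attempted. Elements without the attribute simply keep their own name, which is a simplification.

```python
import xml.etree.ElementTree as ET

ARCH_ATTR = "tei"  # hypothetical name for the architectural form attribute


def to_architectural(elem, arch_attr=ARCH_ATTR):
    """Return a copy of elem in which every element is renamed to the
    architectural form named by its arch_attr attribute (if present)."""
    attrs = dict(elem.attrib)
    # The architectural form attribute is consumed by the mapping itself;
    # an element without it keeps its own name (a simplification).
    form = attrs.pop(arch_attr, elem.tag)
    out = ET.Element(form, attrs)
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_architectural(child, arch_attr))
    return out


# A made-up project document with placeholder content.
user_doc = ET.fromstring(
    '<lexeme tei="entry">'
    '<headword tei="form">headword text</headword>'
    '<gloss tei="def">definition text</gloss>'
    '</lexeme>'
)

arch_doc = to_architectural(user_doc)
print(ET.tostring(arch_doc, encoding="unicode"))
# <entry><form>headword text</form><def>definition text</def></entry>
```

The project document keeps its own vocabulary, while the derived document uses only names from the architecture, so it is the derived document that would be checked against the architectural DTD.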

Addressing the three problems
Delivering the project results as an XML application poses two kinds of problems. First, there are problems of XML well-formedness. A data file that is valid with respect to the TEI DTD and the TEI's SGML declaration will not be well-formed XML. This problem is solved by altering the TEI's SGML declaration so that it accepts XML syntax in the document instance. Second, there are problems of XML validity. Many features of SGML DTDs are not allowed in XML DTDs. This problem is solved by constructing the project DTD as an XML DTD. With both strategies in place, an XML parser can be used to validate a project document against the project's XML DTD, or an SGML parser with an architecture engine can be used to simultaneously validate a project document against both the customized XML DTD and the full TEI DTD.
In a particular markup project, it may be desirable, or even necessary, to customize the DTD. It could be that different names make more sense for certain elements or attributes, that new elements or attributes need to be added, or that certain combinations of elements with fixed attribute values should be encoded as new element types. This problem is addressed by modifying the project's XML DTD as needed. When elements are renamed or added, an architectural form attribute is used to explicitly map the new elements onto the corresponding TEI elements.

A project may find that the TEI DTD is huge in comparison to the subset of elements and attributes that are actually used. Having a DTD that is limited to just the elements and attributes that are used simplifies many tasks, such as building project-specific software, specifying stylesheets, shipping the DTD with the data, and documenting markup practice. Even more significant for our project was the matter of reducing the permissiveness of the content models for the elements that were used. The TEI's model for dictionary markup is a descriptive one; it aims to provide the user with a means of tagging anything that could be encountered in published dictionaries. But in tagging the Sikaiana dictionary, our purpose was prescriptive; we wanted to specify constraints on the structure of entries and then ensure that all entries consistently followed that structure. This problem is addressed by creating a DTD for the project that omits declarations for all the elements and attributes of the architecture that are not used and that tightens the content models to embody additional constraints the project wants to enforce.
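The difference between a descriptive and a prescriptive content model can be illustrated with a small Python sketch that treats each model as a regular expression over the sequence of child element names. Both models below are hypothetical: the permissive one is only loosely in the spirit of a descriptive dictionary DTD and does not reproduce the actual TEI content model for entries, and the strict one (form, gramGrp?, sense+) is an invented stand-in for a project constraint.

```python
import re

# A child sequence is encoded as space-terminated names, e.g. "form sense ".

# Hypothetical descriptive model: any mix of these children, in any order.
PERMISSIVE = re.compile(r"(?:(?:form|gramGrp|sense|etym|xr|note) )+")

# Hypothetical prescriptive project model: form, gramGrp?, sense+
STRICT = re.compile(r"form (?:gramGrp )?(?:sense )+")


def matches(model, children):
    """True if the list of child element names satisfies the model."""
    encoded = "".join(name + " " for name in children)
    return model.fullmatch(encoded) is not None


entries = {
    "consistent": ["form", "gramGrp", "sense", "sense"],
    "sense before form": ["sense", "form"],
    "missing sense": ["form", "gramGrp"],
}

for label, children in entries.items():
    print(label, matches(STRICT, children), matches(PERMISSIVE, children))
# consistent True True
# sense before form False True
# missing sense False True
```

In this sketch every sequence the strict model accepts is also accepted by the permissive one, which is the property that lets a tightened project DTD reject inconsistent entries while its valid documents still conform to the broader architecture.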

Conclusion
To employ this technique, one must use an SGML parser that incorporates a full architectural processing engine. The SP parser by James Clark [Cla98] is one such parser. The paper concludes by demonstrating how the SP parser can be used to simultaneously validate a project document against the project-specific XML DTD and the TEI DTD. By using this technique, a DTD developer can enjoy the benefits of a customized XML DTD without losing the benefits of the intellectual effort that went into developing the TEI DTD. By the same token, a project can have the advantages of delivering a customized XML application without losing the advantages of conforming to a widely used SGML application.


Conference Info


ACH/ALLC / ACH/ICCH / ALLC/EADH - 1999

Hosted at University of Virginia

Charlottesville, Virginia, United States

June 9, 1999 - June 13, 1999


Series: ACH/ICCH (19), ALLC/EADH (26), ACH/ALLC (11)

Organizers: ACH, ALLC
