TEI: "xml:lang sucks, let's use it anyway"

Sydney D (Syd) Bauman

Authorship

1. Sydney D (Syd) Bauman

Women Writers Project - Brown University

Original URL

http://web.archive.org/web/20040903094458/http://www.hum.gu.se/allcach2004/AP/html/prop142.html

Work text


This paper is based in large part on the thoughts, work, and efforts of the members of the TEI Character Encoding work-group who met in Nancy, France in November 2003:

Michael Beddow
David Birnbaum
Patrick Durusau
Christian Wittern (chair)

Members of the work-group who were not present at this meeting have contributed greatly to the group's efforts overall, but less substantially on this particular issue:

Deb Anderson
Lou Burnard
Espen Ore

Introduction

The TEI Character Encoding work-group recently considered the issue of language identification in TEI XML documents. This paper is a presentation of our deliberations and conclusions, along with extensions thereto from the author's imagination. Thus the work-group members can only be held responsible for the conclusions described here; the author is responsible for any logical fallacies or other errors presented.

Background

In July of 1990 the Text Encoding Initiative released the first public draft of their proposed Guidelines For the Encoding and Interchange of Machine-Readable Texts, commonly known as TEI P1. These guidelines, and the SGML DTD that accompanied them as an appendix, included a provision for an attribute, lang, which was available for use on every element in the entire TEI scheme.

This attribute has survived, in pretty much the same form, through the first non-draft release of the Guidelines (TEI P3) to the most recently released version (P4:2002-07), although in that version there is an explicit admission that the mechanism employed by the TEI lang attribute is likely to undergo significant revision at the next release.

In November of 1996 the W3C released the first public draft of their proposed Extensible Markup Language, commonly known as XML. This working draft had no provision for language identification. By the time the specification became an official W3C recommendation (February of 1998), however, it included a provision for an attribute, xml:lang, which (if declared appropriately for valid documents) could be used on any element in an XML document.

These two attributes serve the same major purpose: to indicate the (natural) language an element is written in. However, they have different mechanisms for achieving this purpose, and slightly different semantics. Thus, users of TEI P4 XML have a potentially confusing choice as to which method to use.

It is perfectly acceptable to use both simultaneously, of course. However, this has the obvious disadvantages of extra work, verbosity, and redundancy. That redundancy serves not as a fallback if there is a problem, but rather as a source of trouble if the two are in disagreement.
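The risk of disagreement between the two attributes can be made concrete with a small check. What follows is only a sketch in Python, using the standard library's xml.etree.ElementTree. It assumes the encoder has chosen the same tokens for lang as for xml:lang, so that a plain string comparison is meaningful. The sample fragment and the function name find_disagreements are invented for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: the second <p> carries contradictory values.
SAMPLE = """<text lang="en" xml:lang="en">
  <p>An English paragraph.</p>
  <p lang="fr" xml:lang="de">Ein Absatz.</p>
</text>"""


def find_disagreements(root):
    """Return (tag, lang, xml:lang) for every element that carries both
    attributes with different values."""
    conflicts = []
    for elem in root.iter():
        tei_value = elem.get("lang")
        xml_value = elem.get(XML_LANG)
        if tei_value is not None and xml_value is not None and tei_value != xml_value:
            conflicts.append((elem.tag, tei_value, xml_value))
    return conflicts


for tag, tei_value, xml_value in find_disagreements(ET.fromstring(SAMPLE)):
    print(f"<{tag}>: lang={tei_value!r} but xml:lang={xml_value!r}")
```

For the sample above, only the second p element is reported. A check of this kind catches the conflict, but it cannot say which of the two values the encoder intended.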

TEI P4 lang

[1]

XML xml:lang

[2]

Pros and Cons

Advantages of TEI lang
name freedom: Because the value of lang is a reference to an arbitrary identifier, the user is free to choose any token (subject to the limitations of an XML name[3]) she wishes. Note that nothing prevents the user from using the very same names she would have used if she had the same restrictions as those on xml:lang.
descriptiveness: The user can describe the language being discussed as thoroughly as she likes in prose. From within her description she is welcome to refer to ISO 639, IANA, SIL Ethnologue, or any other standardized list of languages.
extensibility: The system permits an easy but formal mapping between languages in the document and any language, even those that do not appear in the standard authority lists. This means that the indicated language can be described very precisely. For instance, one might well have different lang identifiers for each of the various dialects of southern American English in a Mark Twain novel.
well-defined: The semantics of lang are reasonably well defined.
familiarity: TEI users have used lang for years.
brevity: it's short.
This does not mean that there are no disadvantages to the TEI scheme, of course. The most obvious, more because it is an intractable problem than because of any TEI shortsightedness, is the inability to formally describe a language. But there are others. Suggestions have been put forth, e.g., to permit the specification of the correspondence to multiple authority lists; to permit a hierarchical indication of sublanguages; and to permit the value of lang to point to an external langUsage.

However, it is reasonable to assume that if TEI were to continue using lang, at least some of these concerns would be addressed in P5.
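The pointer mechanism behind the "name freedom" and "descriptiveness" points can be sketched as follows. This is a minimal illustration in Python, assuming the P4 arrangement in which the value of lang refers to the id of a language element inside langUsage in the teiHeader. The sample document, the identifiers "en" and "en-dial-a", and the function name describe_languages are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample: two user-chosen identifiers, each described in prose.
SAMPLE = """<TEI.2>
  <teiHeader>
    <profileDesc>
      <langUsage>
        <language id="en">Standard written English</language>
        <language id="en-dial-a">A regional dialect, described here as fully as the encoder likes</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text lang="en">
    <body>
      <p>Narration.</p>
      <p><q lang="en-dial-a">Dialogue.</q></p>
    </body>
  </text>
</TEI.2>"""


def describe_languages(root):
    """Map each lang value used in <text> to the prose description of the
    <language> element it points to (None for a dangling reference)."""
    declared = {
        language.get("id"): (language.text or "").strip()
        for language in root.iter("language")
    }
    used = {}
    text = root.find("text")
    if text is not None:
        for elem in text.iter():
            ref = elem.get("lang")
            if ref is not None:
                used[ref] = declared.get(ref)
    return used


for ident, description in describe_languages(ET.fromstring(SAMPLE)).items():
    print(ident, "->", description)
```

Because the identifier is only a pointer, the description it leads to can be as detailed as the encoder wishes. That is the functionality lost if the attribute value is restricted to a standard language tag with nothing to point at.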

Disadvantages of TEI lang
Little software support, and not much prospect for improved software support in the future.
Advantages of XML xml:lang
Somewhat better software support, and very good prospects for improved support in the future.
Disadvantages of XML xml:lang
scope: This is the major complaint against xml:lang: that it applies to attribute values as well as element content. This concern will be discussed in more detail in the full paper, taking each attribute type into consideration.
no pointing: There is no formal semantic for what the value of xml:lang is, except that it is a language identifier. It is not possible to point directly into the authority list being used, or to point at any further classifications or descriptions of the language being identified.
poor extensibility: While it is entirely possible for a user to develop an identifier for a particular sublanguage or dialect for which no ISO or IANA identifier exists, there is no mechanism to associate any information about the language so identified with the identifier.
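
The scope complaint can be illustrated with a short sketch. It assumes the usual reading of xml:lang, in which a value applies to the element that bears it and to everything inside it until overridden, and it reports the other attributes that fall under each inherited value. The sample fragment and the function name effective_languages are invented for the example, which is written in Python with the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: German text with an English phrase and an English note.
SAMPLE = """<div xml:lang="de">
  <p n="erste">Ein Satz mit <foreign xml:lang="en">an English phrase</foreign>.</p>
  <note xml:lang="en" type="editorial">An editor's remark.</note>
</div>"""


def effective_languages(elem, inherited=None):
    """Yield (tag, language in scope, other attributes) for elem and all of
    its descendants, inheriting xml:lang from the nearest ancestor."""
    lang = elem.get(XML_LANG, inherited)
    other_attrs = {k: v for k, v in elem.attrib.items() if k != XML_LANG}
    yield elem.tag, lang, other_attrs
    for child in elem:
        yield from effective_languages(child, lang)


for tag, lang, other_attrs in effective_languages(ET.fromstring(SAMPLE)):
    print(tag, lang, other_attrs)
```

In this sample the note's type value "editorial" is nominally declared to be English along with the note's content, although it is a classification keyword rather than running text. This is the kind of attribute for which a language assignment carries little meaning.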
Results and Reasoning

Despite the obvious disadvantages of xml:lang, the TEI Character Encoding work-group chose to recommend that in P5 the TEI drop the lang attribute and declare xml:lang instead. Why use the inferior approach? First, it is important to realize that for TEI P5 the scoping problem (which is the most significant disadvantage of xml:lang) is drastically minimized compared to P4, as it is expected that most string-type attributes will become elements instead. Thus, said elements can bear an xml:lang attribute of their own, and the scoping problem is avoided. Given that the differences would then be minor, the work-group applied the principle that unless your method is demonstrably significantly better than the standard, you should be using the standard[4], and thus chose the more widely standardized xml:lang. The facts that it is far better to have only one attribute for this purpose, and that the TEI could not likely do much to remove the W3C one, also influenced our decision.

However, in order to make up for the lost functionality of a formal association between the identified language and a description thereof, at least for non-standard languages, a mechanism needed to be developed to link the new xml:lang to a description (language in langUsage in the teiHeader). One such mechanism was proposed in Nancy; however, I anticipate some active discussion on the issues, and perhaps some changes in the mechanism, over the next two months.

This paper will explore the advantages and disadvantages in more detail, with particular attention to the scoping problem mentioned above. I will argue that in fact the attribution of language to most attributes is not only useless but meaningless, and will present the new mechanism for linking the language identified to its description.

References

1. P1
2. P4
3. XML 1.0 WD
4. XML 1.1
Notes

1. A brief explanation of how lang= works goes here.
2. A brief explanation of how xml:lang= works goes here.
3. http://www.w3.org/TR/REC-xml#NT-Name
4. Affectionately called Syd's rule.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
