Text Theory and Coding Practice: Assessing the TEI

paper
Authorship
  1. Mark Olsen

    ARTFL Project - University of Chicago

Work text

Let us begin with the assumption that TEI/SGML is a data transfer specification. It performs this function in the only way possible: by creating an abstract representation of text that can include typographic coding but goes far beyond typography to represent textual structure. Typographic conventions may be "added back", or represented, when the SGML is loaded into a package that prints or otherwise manipulates the text, by inferring presentational characteristics from the structure. As an unintended consequence, SGML has proven to be a good way to define text structures for loading into more complex textual database systems; this was not, as I understand it, the original goal of SGML. The TEI, in fact, decided quite reasonably that SGML was a very good way to encode text structure.
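To make that separation of structure from typography concrete, here is a minimal sketch, in Python, of how a rendering package might derive presentation from purely structural element names. The element names are borrowed loosely from TEI usage; the styling rules themselves are invented for illustration.

```
# Structural markup says only what an element is; a rendering package
# decides how each structural category should look on the page.
STYLE = {
    "head": lambda s: s.upper() + "\n",    # render headings in capitals
    "p":    lambda s: "    " + s,          # indent paragraphs
    "q":    lambda s: '"' + s + '"',       # wrap quotations in quotation marks
}

def render(element, text):
    """Derive typography from structure for one element."""
    return STYLE.get(element, lambda s: s)(text)

print(render("head", "Chapter One"))
print(render("p", render("q", "It was a dark and stormy night.")))
```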

From this assumption there arises a second: that the SGML/TEI representation will be used directly as a format internal to any particular search engine or text manipulation system. It certainly can be used that way. PAT, for example, is a very good system that understands native SGML. But there are many systems providing superior functionality that do not recognize any individual DTD, or SGML at all. Here the MARC record offers a model for the data transmission of bibliographic records. Few libraries actually use MARC internally in their systems, but most library systems can read and write MARC from their internal data representations. MARC is a clumsy mechanism, but it has the advantage of being fairly consistently designed: one can write a relatively simple MARC parser and data extractor.
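As a rough illustration of why that consistency matters, a minimal reader for the fixed leader-and-directory layout of a MARC 21 record might look like the following sketch. It ignores subfield delimiters and repeated tags, and is meant only to show how far a rigidly specified format can be parsed with very little code.

```
def parse_marc_record(record: bytes) -> dict:
    """Extract tag -> field data from a single MARC 21 record.

    MARC's layout is fixed: a 24-byte leader, a directory of 12-byte
    entries (3-byte tag, 4-byte length, 5-byte offset), then the data
    fields themselves. That rigidity is what keeps the parser short.
    """
    FIELD_TERMINATOR = b"\x1e"

    leader = record[:24]
    base_address = int(leader[12:17])          # where the data fields begin
    directory = record[24:base_address - 1]    # directory ends with a field terminator

    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if len(entry) < 12:
            break
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        offset = int(entry[7:12])
        data = record[base_address + offset : base_address + offset + length]
        fields[tag] = data.rstrip(FIELD_TERMINATOR).decode("utf-8", errors="replace")
    return fields
```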

Unfortunately, TEI does not offer such rigidity. To really work with TEI-encoded texts, one must invest considerable time and money in building complex parsers that can read DTDs and then parse TEI instantiations. As a result, TEI serves neither the individual scholar, who rarely has the technical capability to decode TEI texts, nor the large data provider, who needs to parse large numbers of texts accurately and cheaply.

Consider the ARTFL context. We receive many hundreds of texts from several sources that are extremely difficult to work with because of the huge variations that may legally occur, even within a single database coming from a single source or vendor. The ARTFL search engine has considerable power to define subcorpora using a two-stage search. The first stage is essentially a bibliographic search, using regular-expression searches on authors, titles, genres, and other classifications, along with mathematically computed date-range restrictions (e.g., search all the texts published between 1683 and 1727). Something this computationally easy is very difficult to do in a straight SGML text under PAT, for a number of reasons. Extracting that information automatically from TEI documents is extremely difficult because, in practice, there is an almost unlimited number of ways to represent this data in TEI. Things as simple as page numbers, chapter breaks, and almost every other structure can be encoded in a host of different ways within the same database (of more than one text) using the same DTD.
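The first, bibliographic stage is trivial when the metadata lives in a simple, consistent table. The following sketch uses invented field names and records, not ARTFL's actual schema, to show a regular-expression filter over bibliographic fields combined with a numeric date-range restriction.

```
import re

# Hypothetical bibliographic records; a real catalogue would be far larger.
CATALOGUE = [
    {"author": "Racine, Jean", "title": "Phèdre", "genre": "tragedy", "date": 1677},
    {"author": "Lesage, Alain-René", "title": "Gil Blas", "genre": "novel", "date": 1715},
    {"author": "Voltaire", "title": "Zaïre", "genre": "tragedy", "date": 1732},
]

def select_subcorpus(records, date_from=None, date_to=None, **patterns):
    """Stage one of a two-stage search: keep the documents whose
    bibliographic fields match the given regular expressions and whose
    publication date falls inside the requested range."""
    selected = []
    for rec in records:
        if date_from is not None and rec["date"] < date_from:
            continue
        if date_to is not None and rec["date"] > date_to:
            continue
        if all(re.search(pattern, rec.get(field, ""), re.IGNORECASE)
               for field, pattern in patterns.items()):
            selected.append(rec)
    return selected

# All texts published between 1683 and 1727 (the example above): Gil Blas only.
subcorpus = select_subcorpus(CATALOGUE, date_from=1683, date_to=1727)
```

The full texts behind the selected records would then be handed to the second, word-level search stage.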

The real problem is that the TEI was designed for maximum flexibility, to represent almost any conceivable variant of textual structure. Here the editor's involvement with medieval documents proved a serious disadvantage, because he is concerned with capturing many textual elements that are very rare and that should be captured with digital imaging rather than text encoding. The resulting proliferation of different encoding schemes, all legal within the TEI DTD, makes it essentially a non-standard: too heavy to be usable by individual researchers and too variable to be used cost-effectively by big data providers like myself.

So my objection is rather the opposite of the suggestion that the TEI is too rigid. I see TEI encouraging a proliferation of texts encoded in almost entirely incompatible schemes, depending on the requirements and ideas of individual scholars. The TEI design principles presume an individual scholar representing the features of a single text. I think that is not going to be the future, nor has it been the past, of humanities text processing. Rather, data collection and tagging have been the role of a relatively small number of projects, for example ARTFL, TLG, and PHI, and now of commercial vendors such as Chadwyck-Healey and other publishers. I am not convinced that individual scholars will have the technical capability or the interest to load many texts on their own machines. The impact that consistent textbases such as the TLG have had on their respective subject areas far outweighs the impact of collections of disarticulated materials such as those found at, for instance, the Oxford Text Archive.

This leads to a far larger theoretical discussion. I have argued strongly against what I have typified as "beating a single text to death with a computer". I am not convinced that much is gained by "tagging the hell" out of a single text, or a small collection of texts, and then trying to analyze it. TEI is predicated on this model. I find the critical and theoretical shortcomings of this model to be deadly, and it is becoming ever more dated in the age of real interoperable networking and access to ever-expanding textbases. The human cost of such intensive, hand-coded tagging is extremely high and of very limited value. And since much of this kind of tagging is so idiosyncratic, few people would choose to use the extensive tag sets "encouraged" (but not required, a vital distinction, because many critics of TEI mistake giving space to such things for requiring them) by TEI.

The TEI came about within the ACH. The theoretical models employed by members of the ACH are under extremely serious criticism, by myself and others, because they tend to ignore modern critical theory and encourage very ineffective use of computer technology. This approach further ignores the use of smart systems to tag many features "on the fly". For example, we have systems in place to tag the entire ARTFL database for part of speech dynamically. There is little sense in encouraging humans to undertake such tagging by hand when systems can be built and re-run every time a better scheme is developed. Thus the TEI has, at its core, a theoretical model subject to great question, a static view of the nature of textual databases, and a failure to properly grasp the revolutionary nature of interoperable smart systems running across the network.
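To make the "on the fly" point concrete, here is a deliberately tiny sketch of the idea: part-of-speech tags are computed by a program each time the text is processed, so the whole collection can be re-tagged simply by re-running an improved tagger. The lexicon, tagset, and rules are invented for illustration and bear no relation to the actual ARTFL system.

```
import re

# Toy lexicon and fallback heuristics; a real tagger would be trained or rule-rich.
LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "sat": "VERB",
           "mat": "NOUN", "on": "PREP"}

def pos_tag(text):
    """Assign part-of-speech tags at processing time rather than storing
    hand-coded tags in the text itself."""
    tags = []
    for token in re.findall(r"\w+", text.lower()):
        if token in LEXICON:
            tags.append((token, LEXICON[token]))
        elif token.endswith("ly"):
            tags.append((token, "ADV"))   # crude suffix heuristic
        else:
            tags.append((token, "NOUN"))  # default guess
    return tags

print(pos_tag("The cat sat quietly on the mat"))
# [('the', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('quietly', 'ADV'), ...]
```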

So, what to do? I am working with Michael Sperberg-McQueen on another, much stricter DTD, modified from the TEI-Lite DTD, which will be more constraining. That will help the large data provider. But there is little to rescue the TEI from my theoretical objections, which are, in my opinion, the most serious. Sperberg-McQueen is absolutely correct in suggesting that text encoding reflects theoretical concerns. Sadly, the theoretical models underlying the TEI have never been debated.


Conference Info


ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

147 works by 190 authors indexed

Scott Weingart has a print abstract book that needs to be scanned; certain abstracts are also available on the dh-abstracts GitHub page (https://github.com/ADHO/dh-abstracts/tree/master/data).

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None