The TextServer Standard and Initial Implementations

Ross Scaife; Ryan Gabbard; David Neel Smith; Hugh Cayless; Christopher William Blackwell

Authorship

1. Ross Scaife

Classics - University of Kentucky

2. Ryan Gabbard

University of Kentucky

3. David Neel Smith

Classics - College of the Holy Cross

4. Hugh Cayless

University of North Carolina

5. Christopher William Blackwell

Furman University

Original URL

http://web.archive.org/web/20040903094119/http://www.hum.gu.se/allcach2004/AP/html/prop15.html

Work text


The discipline of Classics needs an open and modular architecture for electronic publications to reflect and serve the substantial community of scholars now capable of contributing discipline-specific technical features and materials to such an infrastructure. Key requirements include scalability, a rigorous and well-documented separation of protocol and implementation, and complete transparency: all components must carry a GPL or CC license and must be available in a public CVS. The new TextServer standard with its system of Registries is intended to meet those criteria, and is the main subject of this panel.

We propose four speakers, ordered as follows:

Neel Smith will introduce the TextServer system, and review some initial implementations.
Hugh Cayless will examine the details of the system of unified Registries that permits an interoperable whole made up of many distributed parts.
Christopher Blackwell will discuss the Registry system as well as his efforts to leverage the availability of TextServer materials for the XML publication Demos: Classical Athenian Democracy.
Ryan Gabbard and Ross Scaife will present their work on a TextServer-aware Computer Assisted Editing application.

The TextServer standard (shot.holycross.edu/projects/TextServer/) defines a light-weight DTD for describing an inventory of texts. Texts are notional entities that may additionally be described in terms of specific editions or translations, or specific physical exemplars of editions or translations. When inventoried texts are available on-line in TEI-conformant XML, the standard describes a protocol for: 1) identifying a canonical citation scheme for a given text; 2) retrieving valid values for citations; and 3) retrieving XML fragments by canonical reference.

Applications using only these required parts of the standard include:

-- a table-of-contents and paged-text browser
-- a facing-page reader (allowing, e.g., a "Loeb library" format with translation and edition side by side)
-- a difference viewer highlighting differences between two versions of a text

Additional, optional parts of the standard define a protocol for indexing both simple string values and XML structures to strings within a chunk of text defined by canonical reference. These registry services are discussed by the following speakers.
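The three required operations (identifying a citation scheme, listing valid citation values, and retrieving a fragment by canonical reference) can be pictured with a small in-memory sketch. It is not the TextServer protocol itself: the sample document, the dotted reference syntax such as "1.2", and the function names citation_scheme, valid_values, and fragment are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like sample; element and attribute names are illustrative only.
SAMPLE = """<text><body>
<div type="book" n="1">
<div type="chapter" n="1"><p>First chapter of book one.</p></div>
<div type="chapter" n="2"><p>Second chapter of book one.</p></div>
</div>
<div type="book" n="2">
<div type="chapter" n="1"><p>First chapter of book two.</p></div>
</div>
</body></text>"""


def citation_scheme(root):
    """Return the citation levels from outermost to innermost."""
    levels = []
    node = root.find("body")
    while node is not None:
        child = node.find("div")
        if child is None:
            break
        levels.append(child.get("type"))
        node = child
    return levels


def _resolve(root, ref):
    """Walk down the nested divs following a dotted reference (assumed syntax)."""
    node = root.find("body")
    for part in filter(None, ref.split(".")):
        matches = [d for d in node.findall("div") if d.get("n") == part]
        if not matches:
            raise KeyError(ref)
        node = matches[0]
    return node


def valid_values(root, prefix=""):
    """List the citation values available one level below the given prefix."""
    return [d.get("n") for d in _resolve(root, prefix).findall("div")]


def fragment(root, ref):
    """Serialize the XML fragment addressed by a canonical reference."""
    return ET.tostring(_resolve(root, ref), encoding="unicode")


if __name__ == "__main__":
    doc = ET.fromstring(SAMPLE)
    print(citation_scheme(doc))
    print(valid_values(doc, "1"))
    print(fragment(doc, "1.2"))
```

A table-of-contents browser needs only the first two functions, while a facing-page reader could call fragment with the same reference against an edition and a translation.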

In order for an interoperating system of TextServer implementations to be sustainable and scalable, there must be a mechanism in place for locating and resolving TextServers, their registries, and any new services created in this framework at a later date. To meet these needs and allow new distributed services to be made available to end users without the need for additional effort on their part, Hugh Cayless has defined a protocol stack consisting of a directory layer, a registry layer, and a service layer (please see accompanying diagram). The directory layer is responsible for identifying registries and forwarding requests to the appropriate place. The registry layer consists of RegistryServers which know about services such as TextServers and are able to retrieve the requested information from them. Thus applications following this protocol stack can use a simple identifier for a named entity or chunk of text to retrieve information, or to discover related information, without that identifier having to point at a specific system.
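The layering described above can be sketched as three small classes, one per layer. This is a minimal illustration under stated assumptions, not the protocol Hugh Cayless defined: the class names Directory, RegistryServer, and TextService, the "kind" keys, and the sample identifiers are all hypothetical.

```python
class TextService:
    """Service layer: a stand-in for a TextServer holding some content."""

    def __init__(self, name, holdings):
        self.name = name
        self.holdings = holdings  # maps identifier -> content

    def lookup(self, identifier):
        return self.holdings.get(identifier)


class RegistryServer:
    """Registry layer: knows about services and gathers answers from them."""

    def __init__(self, services):
        self.services = list(services)

    def resolve(self, identifier):
        found = {}
        for service in self.services:
            content = service.lookup(identifier)
            if content is not None:
                found[service.name] = content
        return found


class Directory:
    """Directory layer: identifies registries and forwards requests to them."""

    def __init__(self):
        self.registries = {}

    def register(self, kind, registry):
        self.registries[kind] = registry

    def request(self, kind, identifier):
        if kind not in self.registries:
            raise LookupError(f"no registry for {kind!r}")
        return self.registries[kind].resolve(identifier)


if __name__ == "__main__":
    # Hypothetical services and identifiers, for illustration only.
    server_a = TextService("server-a", {"work-1": "<edition/>"})
    server_b = TextService("server-b", {"work-1": "<translation/>"})
    directory = Directory()
    directory.register("texts", RegistryServer([server_a, server_b]))
    print(directory.request("texts", "work-1"))
```

The caller passes only an identifier and never names a server, which is the property the protocol stack is meant to guarantee.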

As the editor of Demos: Classical Athenian Democracy (www.stoa.org/demos/), Chris Blackwell has a two-fold interest in Registry Services. First, as an editor of TEI-conformant XML texts, he needs to mark selected categories of information with unique identifiers that can be automatically resolved in the unified system of registries described by Hugh Cayless. Second, Demos applications should be able to convert citations of ancient works into TextServer requests when those texts are available from a TextServer. An automated process, driven by a Cocoon pipeline, traverses the Registry of Ancient Works and the Registry of TextServers, collating information from both into a master TextCatalog, which is stored locally. Two other Cocoon-based applications then read the TextCatalog and allow different kinds of interaction with it. One generates a web interface so that users -- probably editors of XML documents -- may search and browse the registry, looking for the unique identifiers for various works. The second aims to serve applications rather than human readers. It accepts a single parameter (a citation given in the form of unique identifiers for a TextGroup and Work) and determines whether the work is available online from any known TextServer. If it is available, the application returns a well-formed XML fragment that gives the server-root URI for any TextServers offering an edition or translation of that text online.
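The application-facing lookup can be sketched in a few lines. The actual service is Cocoon-based; the Python below only illustrates the request and reply shape. The "textgroup:work" citation syntax, the identifiers, the example.org URIs, the element names, and the choice to return None for an unavailable work are all assumptions, since the abstract does not specify them.

```python
import xml.etree.ElementTree as ET

# Hypothetical local TextCatalog: (TextGroup id, Work id) -> server-root URIs.
TEXT_CATALOG = {
    ("tg001", "w001"): [
        "http://textserver-a.example.org/",
        "http://textserver-b.example.org/",
    ],
}


def availability(citation, catalog=TEXT_CATALOG):
    """Return an XML fragment listing servers for the cited work, or None."""
    # Assumed citation syntax: TextGroup and Work identifiers joined by a colon.
    textgroup, work = citation.split(":", 1)
    roots = catalog.get((textgroup, work))
    if not roots:
        return None  # assumed behaviour when no known TextServer has the work
    reply = ET.Element("available", {"textgroup": textgroup, "work": work})
    for uri in roots:
        ET.SubElement(reply, "textserver", {"root": uri})
    return ET.tostring(reply, encoding="unicode")


if __name__ == "__main__":
    print(availability("tg001:w001"))
    print(availability("tg001:w999"))
```

A client application could parse the reply, pick one server root, and then issue its TextServer request against it.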

The TextServer system supports (though it does not require) implicit markup of elements such as personal names, ethnics, and place names in its TEI-XML documents. To accumulate a digital library of any size therefore involves a large job of named entity recognition and classification. The TLG edition of Herodotus alone, for example, includes about 23,300 capitalized words. Ryan Gabbard and Ross Scaife are implementing a Computer Assisted Editing (CAE) system designed to improve the efficiency and accuracy of this task by computational methods. CAE makes extensive use of machine-learning-based techniques from statistical natural language processing (both supervised techniques such as adaptive boosting and unsupervised techniques such as co-training) to reduce the human editor's task to one of verification, dramatically reducing the time needed to label a work without sacrificing scholarly judgement. This server-based system allows an authenticated editor to select an available TEI-XML text, apply a series of pre-processing routines (the aforementioned statistical methods along with comparison of candidates with gazetteers and conformance with unique identifiers discovered in TextServer Registries), disambiguate as needed, and confirm the results. The CAE system also allows the editor to add results for a given text to the appropriate RegistryServers.
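One pre-processing step named above, comparing candidates with gazetteers, can be sketched very simply. This does not implement the statistical methods the CAE system relies on; it swaps in a plain dictionary lookup to show how candidates might be proposed for an editor to verify. The gazetteer entries, the function name propose_labels, and the ASCII-only pattern for capitalized words are assumptions (Greek text would need Unicode-aware matching).

```python
import re

# Hypothetical gazetteer: surface form -> category.
GAZETTEER = {"Croesus": "person", "Lydia": "place", "Sardis": "place"}


def propose_labels(text, gazetteer=GAZETTEER):
    """Pair each capitalized word with a gazetteer category, or None if unknown."""
    proposals = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        word = match.group(0)
        proposals.append((word, gazetteer.get(word)))
    return proposals


if __name__ == "__main__":
    sample = "Croesus ruled Lydia from Sardis. The river Halys marked a border."
    for word, category in propose_labels(sample):
        status = category if category else "needs review"
        print(f"{word}: {status}")
```

Sentence-initial words such as "The" surface as candidates too, which is one reason the editor's role remains verification rather than acceptance of every proposal.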

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
