teiPublisher: a repository management system for TEI documents

paper
Authorship
  1. Amit Kumar

    Maryland Institute for Technology in the Humanities (MITH) - University of Maryland, College Park

  2. Alejandro G. Bia-Platas

    Libraries - University of Alicante

  3. Martin Holmes

    Humanities Computing & Media Centre - University of Victoria

  4. Susan Schreibman

    Maryland Institute for Technology in the Humanities (MITH) - University of Maryland, College Park

  5. Ray Siemens

    Dept of English - Malaspina University

  6. John A. Walsh

    Digital Library Program - University Information Technology Services

Editors of TEI repositories working in SGML were limited to a few databases, such as Dynaweb, to deliver their documents over the World Wide Web. With the release of the XML standard in 1998, industry experts predicted a proliferation of XML-aware software, since programmers would find it easier to write applications that deliver XML over the Web. This has indeed come to pass. Over the past few years, a number of open source XML or native XML databases that utilize the XML:DB API <www.xmldb.org> have been developed, such as eXist and Xindice, to name but two.

Programmers in the humanities computing community have begun using these databases for projects with great success. However, for projects that cannot afford programming support, the bar is still extremely high. Thus a group of programmers and content developers teamed up to create an extensible, modular, and configurable XML-based repository, teiPublisher, that can house, search, and display documents encoded in TEILite. This open source initiative is being made available to the humanities computing community so that projects with limited programming support can mount their TEILite-encoded texts in a web-deliverable database. We will be launching a beta version of teiPublisher at ACH/ALLC 2004.

Functionality

teiPublisher utilizes the native XML database eXist. It generates a public interface for browsing and searching. Equally important, it provides an administrative interface that allows editors to:

· upload and delete documents;

· analyze XML documents to determine elements for searching;

· refine ontology development;

· decide on inter- and intra-document links;

· partition the repository into collections;

· create backups of the entire repository;

· generate search/browse and display pages for users of the site;

· change the look of the interface;

· associate XSL stylesheets.

Some of the tasks mentioned above, particularly ontology development, cannot be accomplished by the software alone. Rather, teiPublisher provides a helper application that allows content creators to view the content of elements and attributes used in controlled vocabularies, and that highlights semantic inconsistencies. It also assists in selecting the elements and attributes that will ultimately be searched on.

The rest of this paper will describe various components of teiPublisher in detail.

Analysis of Document Instances
The majority of elements typically searched on via a search page are contained in the teiHeader. The teiHeader, which provides declarative and descriptive information about the text, is composed of four distinct parts: the fileDesc, encodingDesc, profileDesc, and revisionDesc. These elements, along with their child elements, can be selected as areas of interest in an XML document and checked for uniformity across the entire repository. This check can be as broad as confirming the existence of a particular element or set of elements, or as specific as confirming that a predefined set of values developed for an ontology has been adhered to.

When an editor first loads documents into the repository, teiPublisher's XML analyzer will take her through a series of steps that highlight which teiHeader elements are present across the document set. It will then point out elements missing from particular instances and act as a visualizer, so that the editor can decide whether missing elements need to be added before the instance is added to the database.

Once the original set of documents has been homogenized, teiPublisher will process each new XML document as it is added, confirming that particular elements from the teiHeader are present and that elements containing controlled vocabulary information are not only present but also conform to a pre-existing scheme.
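
The two checks described above (presence of teiHeader elements, and conformance of controlled-vocabulary values) can be sketched as follows. This is a minimal illustration in Python, not teiPublisher's own code: the function name `check_header`, the list of required paths, the vocabulary, and the sample documents are all hypothetical, and the samples assume documents without XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical project rules; paths are relative to the document root.
TERM_PATH = "teiHeader/profileDesc/textClass/keywords/term"
REQUIRED_PATHS = [
    "teiHeader/fileDesc/titleStmt/title",
    "teiHeader/fileDesc/titleStmt/author",
    TERM_PATH,
]
# Hypothetical controlled vocabulary for the values of <term>.
CONTROLLED = {TERM_PATH: {"poetry", "prose", "drama"}}


def check_header(xml_text):
    """Return a list of human-readable problems found in one document."""
    root = ET.fromstring(xml_text)
    problems = []
    for path in REQUIRED_PATHS:
        found = root.findall(path)
        if not found:
            problems.append("missing: " + path)
            continue
        allowed = CONTROLLED.get(path)
        if allowed is None:
            continue  # no vocabulary defined for this element
        for element in found:
            value = (element.text or "").strip()
            if value not in allowed:
                problems.append("not in vocabulary: " + path + " = " + repr(value))
    return problems


# Invented sample instances, reduced to the parts the check looks at.
GOOD = (
    "<TEI.2><teiHeader><fileDesc><titleStmt>"
    "<title>Poems</title><author>Jane Smith</author>"
    "</titleStmt></fileDesc><profileDesc><textClass><keywords>"
    "<term>poetry</term></keywords></textClass></profileDesc>"
    "</teiHeader></TEI.2>"
)
BAD = (
    "<TEI.2><teiHeader><fileDesc><titleStmt>"
    "<title>Poems</title>"
    "</titleStmt></fileDesc><profileDesc><textClass><keywords>"
    "<term>Poem</term></keywords></textClass></profileDesc>"
    "</teiHeader></TEI.2>"
)

if __name__ == "__main__":
    for name, doc in (("GOOD", GOOD), ("BAD", BAD)):
        print(name, check_header(doc))
```

Returning a list of problems rather than raising an error matches the workflow described above, in which the editor reviews what is missing and then decides whether to fix the instance before it is added.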

For the first release of the software, the customization tools are predicated on a project using TEILite. By developing teiPublisher for the TEILite DTD, we can make certain assumptions which can be built into the logic of the application. These rules can be overridden or customized by an editor to match a particular repository's requirements.

Repository Management
Another feature of teiPublisher is that it allows editors to construct search pages based on TEI elements or attributes. The selection of a node creates an XPath expression that is used for search purposes. This mechanism also allows scholars with knowledge of XPath <www.w3.org/TR/xpath> to further refine searches. For example, the application may automatically generate an XPath expression to search for the <author> element, but to search for both <author> and <editor>, an editor will need to modify the XPath. An example of this feature can be seen in the image below:
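
Apart from the screenshot, the author/editor case can also be sketched in code. The Python fragment below is an illustration under stated assumptions, not the application's actual output: the helper names `xpath_for`, `union`, and `search` are hypothetical, and so is the `//name` form of the generated expression. The union operator `|` is standard XPath 1.0. Python's ElementTree supports only a subset of XPath and has no union operator, so the sketch queries each element name in turn.

```python
import xml.etree.ElementTree as ET


def xpath_for(element_name):
    """XPath that selecting a node might generate (hypothetical form)."""
    return "//" + element_name


def union(*expressions):
    """Combine expressions with the XPath 1.0 union operator."""
    return " | ".join(expressions)


def search(root, element_names, term):
    """Return (name, text) pairs for named elements whose text contains term.

    Matching is case-insensitive. Each element name is queried separately
    because ElementTree cannot evaluate a union expression directly.
    """
    hits = []
    for name in element_names:
        for element in root.iter(name):
            text = "".join(element.itertext()).strip()
            if term.lower() in text.lower():
                hits.append((name, text))
    return hits


# Invented sample header fragment.
SAMPLE = (
    "<TEI.2><teiHeader><fileDesc><titleStmt>"
    "<title>Poems</title><author>Jane Smith</author>"
    "<editor>John Smith</editor>"
    "</titleStmt></fileDesc></teiHeader></TEI.2>"
)
ROOT = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(xpath_for("author"))
    print(union(xpath_for("author"), xpath_for("editor")))
    print(search(ROOT, ["author", "editor"], "smith"))
```

An XPath 1.0 processor could evaluate the combined expression `//author | //editor` in a single query.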

The design of the web pages is customizable via the wiki concept <http://wiki.org/wiki.cgi?WhatIsWiki>: an editor controls the look of a web page through a dynamic window that reflects changes immediately. An example of this feature can be seen in the image below, in which the HTML modifications made in the right-hand pane are reflected in the generated page to the left:

In addition, editors are provided with an interface to customize the first result page so that only the values of selected elements (such as author, date, and title) are displayed, as in the example below:
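
Independently of the screenshot, the idea of reducing each hit to a few selected elements can be sketched in Python. The field selection, the paths, the function name `summarize`, and the sample document are hypothetical; the paths assume a TEILite header without namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical selection made by an editor: label -> path from the document root.
SUMMARY_FIELDS = {
    "author": "teiHeader/fileDesc/titleStmt/author",
    "title": "teiHeader/fileDesc/titleStmt/title",
    "date": "teiHeader/fileDesc/publicationStmt/date",
}


def summarize(xml_text, fields=SUMMARY_FIELDS):
    """Return only the selected values for one document; missing ones are ''."""
    root = ET.fromstring(xml_text)
    summary = {}
    for label, path in fields.items():
        summary[label] = (root.findtext(path) or "").strip()
    return summary


# Invented sample instance.
DOC = (
    "<TEI.2><teiHeader><fileDesc><titleStmt>"
    "<title>Poems</title><author>Jane Smith</author></titleStmt>"
    "<publicationStmt><date>1904</date></publicationStmt>"
    "</fileDesc></teiHeader><text><body><p>Text.</p></body></text></TEI.2>"
)

if __name__ == "__main__":
    print(summarize(DOC))
```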

By the same token, the browse page will display the contents of the repository by category: by collection, such as primary and secondary texts, or by another sorting mechanism, such as alphabetical or date order. The repository will also allow editors to control access to the collection by allowing only certain IP addresses, or ranges of IP addresses, to access content.
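
The IP-based access rule can be sketched with Python's standard `ipaddress` module. This is an illustration only: the allow list and the function name `is_allowed` are hypothetical, the addresses come from blocks reserved for documentation, and ranges are assumed to be written in CIDR notation.

```python
import ipaddress

# Hypothetical allow list: one single address and one range in CIDR notation.
ALLOWED = ["192.0.2.17", "198.51.100.0/24"]


def is_allowed(client_ip, allowed=ALLOWED):
    """Return True if client_ip equals an allowed address or falls in a range."""
    address = ipaddress.ip_address(client_ip)
    # ip_network() treats a bare address as a network containing one host.
    return any(address in ipaddress.ip_network(entry) for entry in allowed)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.200", "203.0.113.5"):
        print(ip, is_allowed(ip))
```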

Publishing Documents

teiPublisher will allow a broad range of customizations. Since we will be employing XSLT stylesheets to display documents, these stylesheets can be customized according to project needs. As the application includes an Apache Tomcat server which will publish the repository on the web, very little technical knowledge is assumed of the editors. The administrative interface will guide editors through a series of steps that will set up and publish the repository.

Architecture

Customization data generated by the user, such as XPath expressions, HTML code, CSS stylesheets, search terms, and access control settings, is stored in several XML files. This information is used to generate the public interfaces of the repository.
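
As a rough illustration of how stored customization data could drive a generated page, the Python sketch below reads search fields from a configuration document and renders a search form. The paper does not specify the configuration format, so every element and attribute name in `CONFIG` is invented for this example, as are the function names.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical configuration file; all names here are invented for this sketch.
CONFIG = """<repository name="demo">
  <search>
    <field label="Author" xpath="//titleStmt/author"/>
    <field label="Title" xpath="//titleStmt/title"/>
  </search>
</repository>"""


def load_search_fields(config_text):
    """Return (label, xpath) pairs in the order they appear in the file."""
    root = ET.fromstring(config_text)
    return [(f.get("label"), f.get("xpath")) for f in root.findall("search/field")]


def render_search_form(fields):
    """Render a plain HTML form with one text input per configured field."""
    lines = ['<form method="get" action="search">']
    for index, (label, _xpath) in enumerate(fields):
        lines.append(
            '  <label>%s <input name="f%d"/></label>' % (html.escape(label), index)
        )
    lines.append('  <input type="submit" value="Search"/>')
    lines.append("</form>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_search_form(load_search_fields(CONFIG)))
```

Keeping the XPath expressions in the configuration rather than in the page template means the public interface can be regenerated whenever an editor changes the selection.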

The project is based on eXist <http://exist.sourceforge.net>. Since it utilizes the XML:DB API, other XML databases that support the API, such as Xindice and Tamino, can be plugged in.

The figure below shows the interaction between the administrative client, which generates the XML configuration files for each project repository, and the teiPublisher web application, which reads the configuration files to generate the repository portal:

Newer technologies based on the Java platform, such as JavaServer Faces, make it easier to build user interfaces as reusable components. These user interfaces can be represented as stateful objects on the server, separating rendering from event management. The use of a Model-View-Controller architecture along with XSLT helps to decouple data, logic, and view. The customization data for logic is largely based upon XPath expressions which are generated by editors' interaction with the XML analyzer. Data customization, such as the development of a controlled vocabulary, is made possible through XPath expressions which, in turn, create the search and browse facility.

Conclusion
The process of making a customizable repository is much more challenging than developing a repository for specific content. The problems or unsatisfactory results created by a one-size-fits-all solution can be mitigated by customization modules that provide editors with choices, from how to store the data, to which elements will be searched on, to display modes.

References

1. eXist: an Open Source Native XML Database <http://eXist.sourceforge.net>

2. The XML:DB Initiative <http://www.xmldb.org>

3. The Text Encoding Initiative <www.tei-c.org>

4. XML Path Language (XPath) <http://www.w3.org/TR/xpath>

5. Greenstone Digital Library Software <http://www.greenstone.org>

6. Magnolia Content Management <http://www.obinary.com/en/magnolia.html>

7. Witten, Ian H. and David Bainbridge. How to Build a Digital Library. (New York: Morgan Kaufmann, 2003)

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

Organizers: ACH, ALLC