New York University
With the release of XML 1.0 in 1998, the academic world, along with the business and scientific worlds, was introduced to a new type of data storage that would overcome the problems of database management systems (interoperability), HTML (lack of description), and SGML (complexity). For our community, the result was the ability for scholars to encode their documents in a way that was platform and application independent and allowed for very rich, descriptive markup. Following suit, the TEI released P4 of the TEI guidelines, which implemented "proper XML support".
XML and TEI P4 have proved remarkably useful in encoding documents and, as was evident at the last two TEI meetings (Chicago and Nancy), interest in XML and the TEI is growing.
The one stumbling block that many users of XML and the TEI face, however, is the lack of tools to query, process, and display their XML-encoded documents on the Web. Quite simply, users wish to query their XML documents with the ease of a DBMS and display them on the Web with the ease of HTML. It is not impossible to do these two things now, but the learning curve to do either is quite steep and the tools are nowhere near as robust as they need to be for large projects.
On the display side, things are getting easier. XSLT has become established as the language to convert XML to HTML for display. XSLT processors are still a bit difficult to use, but once they are mastered the conversion process is quite easy.
Querying is a different story. The W3C is working on the XQuery language, and there are a few promising native XML databases, such as eXist, but XQuery is far from becoming a useful language, and eXist, though useful, has proved very slow when querying a large number of files.
In this climate, those in the community interested in querying and displaying XML documents have cobbled together their own systems. Two promising ones are Peter Robinson's Anastasia and Mark Olsen's PhiloLogic.
I've put together a very simple publication package, too, that I feel addresses the needs of many in the community with respect to price, performance, and development time. I would not suggest that this package is THE solution for querying and publishing XML on the web, and in the future I hope to use XQuery or a native XML database when the technology is robust, but, for the time being, this is a very affordable, powerful, and easy-to-learn solution.
The system I have developed uses Apache, PHP, MySQL, and XSLT to query and publish XML documents on the web. It was developed for the web publication of The Public Writings of Margaret Sanger in the department of history at New York University. This online edition is part of a much larger endeavor, The Margaret Sanger Papers Project, which includes an already completed microfilm edition of over 9,000 of Sanger's documents, plus a four-volume book edition of Sanger's papers, the first volume of which has been published with the title The Woman Rebel, 1900-1928. The Sanger documents are TEI encoded and use the Model Editions Partnership DTD.
The backend of this system is a MySQL database. The database fields are populated by running a PHP script that parses the XML files using the Expat parser. The script captures the content of certain elements and inserts it into the corresponding fields in the database. For the Sanger project, it captures title, publication date, document type, category, and body. The ability to parse XML is built into standard compilations of PHP.
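As a rough illustration of this loading step, the following minimal sketch uses Python rather than the project's PHP script: Python's standard library exposes the same Expat parser, and sqlite3 stands in for MySQL so the example runs without a server. The element names (title, date, docType, category, body), the table and column names, and the sample document are all hypothetical; the element names actually used in the Sanger files are not given here.

```python
import sqlite3
import xml.parsers.expat

# Hypothetical mapping from XML element names to database columns.
FIELD_ELEMENTS = {
    "title": "title",
    "date": "pub_date",
    "docType": "doc_type",
    "category": "category",
    "body": "body",
}


def extract_fields(xml_text):
    """Collect the character data inside each mapped element via Expat callbacks."""
    fields = {column: [] for column in FIELD_ELEMENTS.values()}
    open_columns = []  # columns whose elements are currently open

    def start(name, attrs):
        if name in FIELD_ELEMENTS:
            open_columns.append(FIELD_ELEMENTS[name])

    def end(name):
        if name in FIELD_ELEMENTS:
            open_columns.pop()

    def chars(data):
        # Text is credited to every open mapped element, so text nested
        # inside <body> markup still ends up in the body column.
        for column in open_columns:
            fields[column].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    # Join the text chunks and collapse runs of whitespace.
    return {col: " ".join("".join(parts).split()) for col, parts in fields.items()}


def load_document(conn, filename, xml_text):
    """Parse one XML document and store its captured fields as one row."""
    row = extract_fields(xml_text)
    conn.execute(
        "INSERT INTO documents (filename, title, pub_date, doc_type, category, body) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (filename, row["title"], row["pub_date"], row["doc_type"],
         row["category"], row["body"]),
    )
    conn.commit()


# Invented sample document, for illustration only.
SAMPLE = """<doc>
  <header>
    <title>Sample Title</title>
    <date>1900-01-01</date>
    <docType>article</docType>
    <category>sample</category>
  </header>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</doc>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (filename TEXT PRIMARY KEY, title TEXT, pub_date TEXT, "
    "doc_type TEXT, category TEXT, body TEXT)"
)
load_document(conn, "sample.xml", SAMPLE)
```

Because the parser only reads the file, the XML source is left untouched; the database holds a searchable copy of the selected fields, keyed here by file name so that a search result can point back to the XML file.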
The end user queries this MySQL database through a web front end written in HTML to find appropriate documents in the Sanger collection. PHP is used as the middleware to talk to the MySQL database. For the Sanger project, users can search by full text, title, date range, document type, and category.
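The search step can be sketched in the same hedged way. The example below is in Python with sqlite3 standing in for PHP and MySQL, and it assumes a hypothetical documents table with invented rows. It builds one parameterized query from whichever criteria the user supplied. Full-text search is approximated with LIKE, which is an assumption made for this sketch; how the project's full-text search is implemented is not stated here.

```python
import sqlite3


def search_documents(conn, text=None, title=None, date_from=None, date_to=None,
                     doc_type=None, category=None):
    """Return (filename, title, pub_date) rows matching every criterion supplied."""
    clauses, params = [], []
    if text:
        clauses.append("body LIKE ?")
        params.append(f"%{text}%")
    if title:
        clauses.append("title LIKE ?")
        params.append(f"%{title}%")
    if date_from:
        clauses.append("pub_date >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("pub_date <= ?")
        params.append(date_to)
    if doc_type:
        clauses.append("doc_type = ?")
        params.append(doc_type)
    if category:
        clauses.append("category = ?")
        params.append(category)
    # User input travels only through bound parameters, never through the SQL text.
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = f"SELECT filename, title, pub_date FROM documents WHERE {where} ORDER BY pub_date"
    return conn.execute(sql, params).fetchall()


# Hypothetical table and invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (filename TEXT PRIMARY KEY, title TEXT, pub_date TEXT, "
    "doc_type TEXT, category TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a.xml", "First Sample", "1900-01-01", "article", "sample", "alpha beta"),
        ("b.xml", "Second Sample", "1910-06-15", "speech", "sample", "beta gamma"),
        ("c.xml", "Third Sample", "1920-12-31", "article", "other", "gamma delta"),
    ],
)

results = search_documents(conn, text="beta", date_from="1905-01-01")
```

Storing dates in ISO form (YYYY-MM-DD) lets a date-range search use plain comparisons, and returning the file name with each hit gives the front end what it needs to link to the corresponding XML document.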
After performing a search, a user is presented with a list of documents in the web browser that match the criteria of the search. To view a document, he or she clicks on the appropriate link.
The document is presented to the user as HTML that is generated from the XML files (not the database) on-the-fly using PHP, XSLT, and Sablotron, an XSLT processor.
This system meets the three criteria mentioned earlier: price, performance, and development time.
Price: All of the tools used in this system are free and open source (Apache, MySQL, PHP, XSLT, Sablotron). Most are installed with any standard Unix/Linux distribution and all can be installed on Windows.
Performance: This system is very efficient. The PHP script that parses the XML and inputs data into the database runs very quickly. It can be rerun on selected XML files if changes are made and can update the data in the database when needed. This data is immediately available for searching in the MySQL database. The XML files are unchanged when parsed by the PHP script and are the same XML files that are processed by Sablotron to output HTML for the end users.
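One way to make the loader safe to rerun on a corrected file is to key each row on the file name and replace the row when it already exists. The sketch below shows that idea in Python with sqlite3 as a stand-in for PHP and MySQL; the table layout and the sample record are hypothetical. SQLite's INSERT OR REPLACE is used here, and MySQL offers REPLACE INTO and INSERT ... ON DUPLICATE KEY UPDATE for the same purpose. Whether the project's script takes this approach is not stated.

```python
import sqlite3


def upsert_document(conn, record):
    """Insert a record, or replace the existing row that has the same filename."""
    conn.execute(
        "INSERT OR REPLACE INTO documents "
        "(filename, title, pub_date, doc_type, category, body) "
        "VALUES (:filename, :title, :pub_date, :doc_type, :category, :body)",
        record,
    )
    conn.commit()


# Hypothetical table and invented record, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (filename TEXT PRIMARY KEY, title TEXT, pub_date TEXT, "
    "doc_type TEXT, category TEXT, body TEXT)"
)
record = {
    "filename": "sample.xml",
    "title": "Sample Title",
    "pub_date": "1900-01-01",
    "doc_type": "article",
    "category": "sample",
    "body": "Original text.",
}
upsert_document(conn, record)

# Rerunning the loader after the XML file was corrected updates the row in place.
upsert_document(conn, dict(record, body="Corrected text."))
rows = conn.execute("SELECT filename, body FROM documents").fetchall()
```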
The queries of the MySQL database are extremely fast and the on-the-fly XSLT transformations range from a few seconds to instantaneous.
Development time: This system requires knowledge of XSLT, PHP, and SQL. Of the three, XSLT is the most difficult to learn but must be learned for any system publishing XML files. SQL and PHP are much easier to learn.
Since MySQL, PHP, and Apache are proven technologies that are preinstalled on most Linux and Unix distributions, there is little configuration or installation to be done to make this system work. The Sablotron XSLT processor needs to be installed and PHP needs to be reconfigured with the XSLT processor extension. But when that is finished, the user has all the tools needed to query and publish XML.
This paper is not theoretical, but it is intended to be more than just a software demonstration. How to process and deliver XML-encoded documents over the web is an important issue in our community and one that warrants serious attention. Though there is ample support for encoding issues through the ALLC, ACH, and TEI, many in our community are less confident when it comes to processing their encoded documents. Many of the tools available today, such as Cocoon and eXist, are still out of reach of humanists because of their instability and complexity, and many commercial products are out of reach because of their price. I hope this paper will give an overview of the issues involved in querying and displaying XML-encoded documents that we hope to resolve in the coming years, and that it will enable those with small budgets and tight schedules to produce a robust XML publishing system with standard open source tools available on most Linux/Unix distributions.
1. Development version of Margaret Sanger project: http://wilde.acs.its.nyu.edu/sanger/documents/search.php
2. How-to for installing Sablotron and recompiling PHP with XSLT extensions: http://www.nyu.edu/its/humanities/docs/php_xslt.html
Hosted at Göteborg University (Gothenburg)
June 11, 2004 - June 16, 2004
Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/