XSL - Characteristics, Status, and Potentials for Text Processing Applications in the Humanities

Wendell Piez

Mulberry Technologies, Inc.

What XSL Is

The Extensible Stylesheet Language (XSL) is a specification currently being finalized (May 2000) by the World Wide Web Consortium (W3C), the vendor consortium that issues recommendations for web standards including HTML, CSS and now XML and its related technologies. XSL's immediate purpose is to support various kinds of presentation of arbitrarily marked-up documents in XML format. In an XSL system, any well-formed XML document could be formatted for print, displayed in hypertext (including on the web), or presented in other media, more easily and more effectively than is currently the case, and in a standards-based way. In a networked environment, processing documents for display on screen could happen on either the server or the client.

To support its task of presenting XML (that is, applying to an arbitrary tag set a formatting description for a user interface such as screen or printer), XSL has to provide granular access to markup structures, so as to be able, for example, to derive tables of contents, text for running heads, indexes, and other common (presentational) expressions of underlying document architecture. In the course of working out XSL it became increasingly clear that (as is often the case with data processing problems) this problem was more easily, and more powerfully, addressed if it was treated as a special case of a more general capability, namely the "transformation" of one markup structure into another.

Accordingly, XSL is formally divided into two parts:

"XSL Transformations" (XSLT): on a standalone basis, provides a language to describe many of the kinds of rearrangement and filtering of markup structures that a reasonably powerful XML presentation language requires.

"XSL Formatting objects" (XSLFO): provides a vocabulary for describing, in a standard and abstract way, formatting of text for visual display in print or on screen (and possibly for alternative media presentation).
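To make the division of labor concrete, here is a minimal sketch of a stylesheet in which XSLT selects the content and formatting objects describe the page. Since the formatting-object vocabulary was not final at the time of writing, the element and property names below are an assumption: they follow the vocabulary as it was later published in the XSL 1.0 Recommendation. The source is assumed to be a TEI-like document, without namespaces, whose paragraphs are p elements inside body.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- XSLT part: pick out the content. XSLFO part: describe the page. -->
  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <!-- A single A4 page layout; the name "page" is arbitrary -->
        <fo:simple-page-master master-name="page"
            page-height="29.7cm" page-width="21cm"
            margin-top="2.5cm" margin-bottom="2.5cm"
            margin-left="2.5cm" margin-right="2.5cm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="page">
        <fo:flow flow-name="xsl-region-body">
          <!-- Assumes the source has at least one paragraph -->
          <xsl:apply-templates select="//body//p"/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

  <!-- Each source paragraph becomes one block of formatted text -->
  <xsl:template match="p">
    <fo:block space-after="12pt"><xsl:apply-templates/></fo:block>
  </xsl:template>

</xsl:stylesheet>
```

The result is itself an XML document, which a separate formatter would then have to render as pages.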

XSLT was accepted as an "official" W3C recommendation in November 1999. XSLFO is expected to be completed in mid-2000. The relative maturity of the specifications is reflected in the tools available: already by the end of 1999 a number of tools supporting XSLT were available on the Net for free. As of this writing, tools supporting XSLFO (which could, for example, convert XML via XSL into PDF format) are less mature.

XSLT, on the other hand, was already in use by mid-1999, primarily (though not exclusively) as a way of converting XML into HTML. Because XML becomes useful as soon as HTML can be reliably created out of it, this has in effect jump-started the XML presentation industry, at the price of keeping on-line published versions of XML source documents limited to the capabilities of HTML, the current state of the art in browsing on the web. As a result, even before the ink is dry, we are beginning to get a sense of XSLT's capabilities for processing, while we are still unclear as to what XSL's own "design language" (its formatting objects) will look like.

XSLT's Capabilities

-Presentational XSLT
XSLT is already used to convert XML into HTML. In this, it is a ready alternative to a scripting approach (Perl, Omnimark etc.) or to the ISO standard DSSSL - and easier to learn than either. It also compares favorably in price: tools for XSLT conversions are free.
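A conversion of this kind can be sketched in a few template rules. The stylesheet below is an illustration only: it assumes a TEI-like source, without namespaces, in which a body contains div elements, each with a head and some p elements. It derives a table of contents from the division headings and then renders the text itself.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- Root: build the HTML shell, a table of contents, then the text -->
  <xsl:template match="/">
    <html>
      <head><title>Sample text</title></head>
      <body>
        <ul>
          <!-- One entry per division, linked by a generated identifier -->
          <xsl:for-each select="//div">
            <li>
              <a href="#{generate-id()}"><xsl:value-of select="head"/></a>
            </li>
          </xsl:for-each>
        </ul>
        <!-- Here "body" is the element in the source, not the HTML one -->
        <xsl:apply-templates select="//body"/>
      </body>
    </html>
  </xsl:template>

  <!-- A division heading becomes the anchor its contents entry points to -->
  <xsl:template match="div/head">
    <h2><a name="{generate-id(..)}"><xsl:apply-templates/></a></h2>
  </xsl:template>

  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

The same headings are used twice, once in the contents list and once in place, which is the kind of granular access to markup structure that presentation requires.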

-Analytic XSLT
XSL processing depends on markup in the source text for navigation, as opposed to (say) character offsets or line numbers. While very good at presenting information encoded in markup, it is not good at recognizing or construing implicit information such as character patterns. It does no tokenizing, hence cannot recognize "word" boundaries. By default, string processing and matching in XSLT are case-sensitive, and cannot readily be configured otherwise.

Somewhat surprisingly, however, XSL is nevertheless useful for certain kinds of analytical functions, including certain kinds of XML validation (cf. Rick Jelliffe's "Schematron"). For example, one could write an XSL stylesheet that would check the conformance of an instance of a TEI Header to a certain model, that went beyond the DTD to specify element dependencies - for example, reporting a warning (or providing defaults) if the publication statement were not filled out according to house standards. This could be done with a stylesheet and would not require altering the DTD.
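A check of this kind might look like the following sketch. The house rules it tests are invented for illustration (a non-empty publisher is required, and a publisher must be accompanied by a date), and the source is assumed to be a TEI header without namespaces.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The result is a plain-text report, not a rendering of the document -->
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:if test="not(//teiHeader/fileDesc/publicationStmt)">
      <xsl:text>WARNING: no publicationStmt found&#10;</xsl:text>
    </xsl:if>
    <xsl:apply-templates select="//teiHeader/fileDesc/publicationStmt"/>
  </xsl:template>

  <xsl:template match="publicationStmt">
    <!-- Hypothetical house rule 1: a non-empty publisher is required -->
    <xsl:if test="not(normalize-space(publisher))">
      <xsl:text>WARNING: publisher is missing or empty&#10;</xsl:text>
    </xsl:if>
    <!-- Hypothetical house rule 2: a publisher must come with a date -->
    <xsl:if test="publisher and not(date)">
      <xsl:text>WARNING: publisher is given without a date&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

A header that satisfies both rules produces an empty report; the DTD itself is left untouched.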

And because it can perform testing on strings, XSLT can also be used for generating rudimentary concordances. A concordancer in the form of an XSL style sheet will be demonstrated as a part of this presentation.
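As a rough indication of what such a stylesheet involves (this sketch is not the concordancer to be demonstrated), the following lists every verse line containing a given string, with the string set off from its left and right context. It assumes a source, without namespaces, whose lines are marked up as l elements; the parameter name term and its default value are placeholders.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- The string to look up; the default value is only a placeholder -->
  <xsl:param name="term" select="'love'"/>

  <xsl:template match="/">
    <html>
      <head><title>Concordance</title></head>
      <body>
        <h1>Lines containing "<xsl:value-of select="$term"/>"</h1>
        <table>
          <!-- Select only the lines in which the string occurs -->
          <xsl:apply-templates select="//l[contains(., $term)]"/>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="l">
    <tr>
      <!-- Position of this line among all l elements in the document -->
      <td><xsl:number level="any"/></td>
      <!-- Context before and after the first occurrence in the line -->
      <td align="right"><xsl:value-of select="substring-before(., $term)"/></td>
      <td><b><xsl:value-of select="$term"/></b></td>
      <td><xsl:value-of select="substring-after(., $term)"/></td>
    </tr>
  </xsl:template>

</xsl:stylesheet>
```

The limits described above show up directly: the match is a case-sensitive substring test, so a search for "love" also finds "glove" and misses "Love", and only the first occurrence in each line is set off.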

The thing to keep in mind about any computer-facilitated analytic work is that, without being supported by information from an external source (such as a thesaurus of terms or a morphological dictionary), no algorithm is able to reveal something about a text that is not implicit in the text already. That is, while a computer can rearrange information in a text, and therefore perform such operations as counting incidences or providing indexes, it cannot actually add any "knowledge". What it does is present a text, and information derived about the text, in such a way that a careful reader can come to conclusions about it that would otherwise be very difficult to demonstrate. This is merely to point out that, for example, a concordance is not an analysis, and by itself makes no argument, although it may facilitate the development of one.

XSLT-based analytic work is no different, and since XSLT is not designed specifically with analytic work in mind, it is in some respects an unexpected benefit if it can support this work at all. Even given its fairly rudimentary capabilities, however, XSLT has certain incidental advantages:

1. It leverages investments made in markup:
Many repositories have XML texts, or texts readily convertible into XML. These are all ready for XSL processing, and can be enhanced to support more sophisticated processing.

2. It produces "publishable" results as a natural work product:
Since the end result of an XSL transformation can be HTML or an XML format ready for further processing, it is easy to generate results in a form that can be displayed as is.

3. An investment in XSL is worth making for other reasons:
Since XSLT processors are so inexpensive (free), the real investment is the time needed to learn the language. And XSLT is so portable and versatile that it repays this investment in expertise fairly quickly.

4. It can be combined with other methods:
An XSLT stylesheet can also be used to prepare XML texts for other kinds of work. An XSLT stylesheet can generate COCOA encoding from XML, that can be used to support TACT or another tool that takes advantage of COCOA markup of events in a text stream (such as chapter breaks or shifts in narrative voice). [An XSL stylesheet that creates COCOA markup from an XML TEI source can be demonstrated.]
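A minimal sketch of the idea follows; it is an illustration, not the stylesheet to be demonstrated. It assumes a TEI-like source, without namespaces, in which chapters are div elements with type="chapter" and a number in the n attribute. The choice of C as the COCOA category for chapters is likewise an assumption, since such categories are defined by the user.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- COCOA is a plain-text format, so no markup is written as such -->
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Skip the header; only the text itself is passed on -->
  <xsl:template match="teiHeader"/>

  <!-- Each chapter opens with a COCOA reference tag such as <C 3> -->
  <xsl:template match="div[@type='chapter']">
    <xsl:text>&lt;C </xsl:text>
    <xsl:value-of select="@n"/>
    <xsl:text>&gt;&#10;</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Headings and paragraphs become single lines of running text -->
  <xsl:template match="head | p">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Other events in the text stream, such as shifts in narrative voice, could be emitted the same way wherever the source markup records them.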

Or, an XSLT stylesheet can be used to derive SVG (Scalable Vector Graphics) files from descriptive XML source. SVG is a graphics format which is expected, by some, to revolutionize distribution of graphics for certain kinds of applications on the web. Graphic representations of phenomena accessible to XSL transformations can already be displayed in prototype SVG viewers. [SVG frequency distribution graphs of strings in an XML file can be demonstrated.]
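The following sketch suggests how a simple frequency graph could be derived; it is not the demonstration stylesheet. It assumes a source, without namespaces, with chapters marked up as div elements of type "chapter" containing p elements, and it uses the SVG namespace as later fixed in the SVG 1.0 Recommendation. The parameter term, the scale of 10 units per hit and the layout figures are all arbitrary choices.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2000/svg">

  <xsl:output method="xml" indent="yes"/>

  <!-- The string to count; the default value is only a placeholder -->
  <xsl:param name="term" select="'love'"/>

  <xsl:template match="/">
    <svg width="400" height="220">
      <xsl:for-each select="//div[@type='chapter']">
        <!-- Number of paragraphs in this chapter containing the string -->
        <xsl:variable name="hits"
            select="count(.//p[contains(., $term)])"/>
        <!-- One bar per chapter, 10 units tall per matching paragraph,
             drawn upward from a baseline at y = 200 -->
        <rect x="{(position() - 1) * 30 + 5}" y="{200 - $hits * 10}"
              width="20" height="{$hits * 10}"/>
        <!-- Label each bar with the chapter number -->
        <text x="{(position() - 1) * 30 + 5}" y="215">
          <xsl:value-of select="@n"/>
        </text>
      </xsl:for-each>
    </svg>
  </xsl:template>

</xsl:stylesheet>
```

Note that this counts paragraphs in which the string occurs, not individual occurrences, and that a chapter with more than 20 hits would overflow the top of the drawing; a fuller version would scale the bars to the largest count.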

These are only two examples of ways XSLT can be applied to help prepare XML texts for a variety of further uses. The basic principle being applied is a layered architecture: the source data is maintained in a stable format, such as TEI XML, useful over the long term. Applied "on top" of this repository layer, a separate process can expose a "view" or presentation of the source data (some readers may be familiar with the "model-view-controller" model of computer application design), ready for the special format requirements of an arbitrary application.

Role Of XSL/XSLT In The Future

- Possibilities for XSL extension:
The XSL specification also provides allowance for its extension. Extension functions, in Java or an alternative scripting language, could be made available to an XSL processor. Tokenizing functions, sophisticated string processing and matching, database-integration services (for retrieving data such as morphological variants or checking values against an authority list) could all be addressable, given a good API, from within XSL stylesheets.
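From the stylesheet's side, calling an extension might look like the sketch below. The namespace URI and the function ext:lemma are hypothetical, invented here for illustration; how such a function is implemented and bound to the processor is specific to each processor and is not shown. What XSLT 1.0 itself provides is the function-available() test, which lets a stylesheet fall back gracefully where the extension is missing.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ext="http://example.org/hypothetical-extensions">

  <xsl:output method="text"/>

  <xsl:template match="p">
    <xsl:choose>
      <!-- Use the extension only if this processor actually provides it -->
      <xsl:when test="function-available('ext:lemma')">
        <!-- ext:lemma is a hypothetical function, assumed to return the
             paragraph with each word reduced to its dictionary form -->
        <xsl:value-of select="ext:lemma(string(.))"/>
      </xsl:when>
      <!-- Otherwise fall back on what standard XSLT can do -->
      <xsl:otherwise>
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Because the call is guarded, the same stylesheet still runs, with reduced results, on a processor that knows nothing of the extension.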

It is unlikely, however, that such extensions (at least, those especially suited for the types of analysis academic humanists are interested in) would be developed in the private sector - not that they would be without profitable application there. But academic researchers, with clear focus on their own functional requirements, have to lead the way.

-An XSL browser as "analytical engine":
XSL's potentials in these respects suggest that it could play a role in the markup-aware "analytical engine" that many of us keep envisioning (cf. the ELTA initiative). An XML browser that supported XSL stylesheets could be integrated with an editing environment allowing on-the-fly emendation of the stylesheets, and/or the extension functions they call. Stylesheets and function libraries could be pulled "off the shelf," or written especially to address local problems and questions. Specialized functions would have the capability of integrating XSL's presentation/analytical capabilities with other tools such as databases or network applications.

Not only would such a system be very versatile; also, in it, research results could take the form of ready-made publishable material, in HTML or any other markup-based form. Since it would basically be an XML web browser, it could also be readily networked, especially as concerns the XML source text (the text under analysis), which could be located anywhere on the Internet. Analytical stylesheets in XSL would be portable and applicable to any text that conformed to the same (sufficiently constrained) document model.

Present Advantages [as of the end of 1999]

-XSL tools are freely available:
As of this writing, free XSLT processors are available in Java, and are not difficult to set up and run. Learning the stylesheet language itself is the biggest barrier to entry, and there are free and inexpensive resources for this as well.

-XSL is easy to get going with:
By design, XSL is a declarative language, abstracted at a fairly high level. As a result, it is not difficult to learn, at least for most ordinary operations, and is very portable (making it easier to learn from others' work).

Present Disadvantages [as of the end of 1999]

-XSL is somewhat arcane:
Although the rudiments of XSL are not difficult, some users take to it less easily than others. It is a "functional" and "declarative" language unlike most scripting languages, so expertise in other computer languages is not readily applicable to it. Naïve users seem to have less trouble learning it than experts. The model of the text on which it operates, the "document tree," although it leverages document markup in a very simple and powerful way, is not a self-evident approach to developers used to looking at text as a stream of characters.

-XSL processing is XML-based; requires well-formed XML to start:
Obviously, XSL requires an XML text to operate on. Either this is a problem, or it isn't.

-Tools are rudimentary (although improving):
Strong support for internationalization, for example, is envisioned by the specification but not yet widely implemented in interfaces or tools.

As mentioned above, it is unlikely that the private sector would, on its own initiative, develop function libraries that would provide for all the kinds of functions wanted by scholars in the Humanities. (Some, like support for sorting texts in major European and Asian languages, can be hoped for, although not necessarily for free.)


-What XSL will be good for:
Presentation, filtering/rearrangement, markup-based processing such as indexing supported by markup. Some kinds of validation. Especially when extended or used in combination with other methods, XSL will also be capable of supporting sophisticated analytical functions on text marked up in XML.

-What the emergence of XSL tells us about our markup projects:

The up-front investment in the text (editorial work) remains the most difficult, interesting and important phase of work. Much or most further processing "downstream," and the types of processing possible, are directly dependent on the features of the text represented through its markup.

Investments in valid SGML/XML formats are demonstrating their resilience through their readiness for new applications.

Web Site
A web presentation of this paper will be made available at <http://www.mulberrytech.com/papers/achallc2000>

