TAPoR: Tools, Architectures, and Techniques I

Paper
Authorship
  1. Stephen Ramsay

    University of Georgia, University of Nebraska–Lincoln, University of Virginia

Work text

Fifteen years ago, the chief complaint among those engaged in computer-assisted text analysis was that there weren't enough texts. At about the same time, a number of institutional efforts arose that sought to address the need for increased access to scholarly resources and additional means of preservation. The early nineties saw the coming of age of a number of older digital archives such as the Oxford Text Archive (1976), the ARTFL Project (University of Chicago, 1981), and The Perseus Digital Library (Tufts University, 1987) as well as the creation of newer archives like the Wittgenstein Archives (University of Bergen, 1990), the Electronic Text Center (University of Virginia, 1992), and the Women Writers Project (Brown University, 1993). All of these projects (and there are many more we might mention) have been, by any measure, enormously successful. They all provide an unprecedented level of access to digital materials, and all are guided by sound editorial and technical principles aimed at increasing and sustaining their longevity.

For all their success, most of these initiatives were built with an acute awareness that the real power of the collections lay in the future. They all made an attempt to adhere to open standards that would ensure not only that the content of the text would endure, but that what we might call its "procedural tractability" would also endure. Despite that--perhaps because of that--nearly all of the large full-text archives are broad but shallow. You can see much, but in the end, you can't do much other than search the collection. This is perhaps as it should be, and from a practical standpoint, it is perhaps as it must be. The technical demands of encoded full-text archives are substantial and do not leave much room for additional efforts at creating computational methods of text analysis and visualization. Text archive maintainers have their hands full, and must continue to maintain the unprecedented level of access and efficient retrieval for which the more prominent efforts are justly famous.

While much of the community was focused on the creation of full-text archives, a somewhat smaller sub-community of scholars continued to create text analysis software for humanistic inquiry. The achievements of this group are similarly prodigious and include the creation of TACT, OCP, Collate, TUSTEP, and dozens of sophisticated (though usually less general) tools for undertaking stylometric analysis, authorship attribution study, text visualization, and document analysis. It was this group, of course, that lobbied most vocally for the expansion of online full-text archives in the early nineties, since it was clear that the success of their research efforts depended upon the ubiquity of well-edited, computationally tractable texts.

At the 2002 ALLC/ACH meeting in Tübingen, one could discern the beginnings of a realization on both sides of the humanities computing community. Text analysis practitioners were announcing that "not enough texts" was no longer a valid excuse for whatever failures there had been in the attempt to bring text analysis into the mainstream of scholarly activity. Creators and maintainers of full-text archives were likewise beginning to voice their belief that the time had come to use these archives for something more than mere access and dissemination. Text archives have come of age, and many humanities computing practitioners have begun to turn their attention toward developing novel ways to exploit the enormous potential of such archives. The age of archives has not ended by any means, but the age of tools has clearly begun.

The prospect of marrying tool development to archive development is an exciting one, but it brings with it a number of technical (not to say social) challenges. While most of the full-text archives have created their content in either SGML- or XML-compliant TEI, tools development is not similarly standardized. Digital humanities software development has not settled upon a single language, software engineering methodology, or platform. Such standardization, indeed, would seem problematic from a research standpoint since "using the right tool for the job" remains as essential an element of innovation in humanities computing as in software development more generally. Where archive creators saw the need for standardization and consistency, software developers in our area of research saw the need for early adoption of a variety of languages, platforms, and design paradigms. Even in the context of the archives themselves one sees a wide variety of technical infrastructures.

Online delivery clearly represents the best way to make texts accessible. It is a good way to make analytical tools accessible, but the model that naturally emerges--a site where one can upload a text into a server-side analytical tool and retrieve the results--fails to leverage the power of the archives. The idea of popping one text after another from the Etext Center's archive or the Women Writers Project into an online form seems less than optimal. It also presumes that the tool is able to handle the idiosyncrasies of the document, which is a bold assumption even given the prevalence of XML-encoded data.

A second method might be to develop tools that can be dropped into the software frameworks of existing collections. This is an attractive option from the standpoint of the archive maintainers, but from a software development standpoint, it represents a rather optimistic view of the local environment. Integration of, say, Java tools in a Perl environment, while technically feasible, creates serious problems for both the developer and the integrator. Besides this, not all archives have the technical staff to undertake this sort of task.

Web services attempt to bridge the gap between these two approaches to system design by making remote programming logic available to the local environment as if it were a set of local functions. Using this sort of architecture, one can imagine tool archives, by analogy with existing text archives, offering mechanisms for data mining, visualization, and text analysis to existing text bases. In this model, archive maintainers would implement simple client programs that pass texts (or references to texts) over the wire to tools that conform to a common interface specification. While it is true that network latency might preclude certain operations, a great many (including word frequency analysis, data visualization, tag-set analysis, and part-of-speech tagging) could be implemented in this manner.
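To make the shape of such a client concrete, the following minimal sketch (in Python) shows an archive passing a text to a hypothetical remote word-frequency resource over plain HTTP. The tool URL and the text/xml convention are assumptions for illustration, not part of any published interface.

# Hypothetical sketch of the thin client imagined above: an archive
# passes a text over the wire to a remote tool and reads back an XML
# result. The tool URL and the text/xml convention are assumptions.
import urllib.request

TOOL_URL = "http://tools.example.edu/WordFrequency"  # hypothetical resource

def analyze(document_xml: bytes) -> bytes:
    """POST an encoded text to a remote analysis tool; return its XML reply."""
    req = urllib.request.Request(
        TOOL_URL,
        data=document_xml,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as response:
        return response.read()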

This paper presents just such an architecture, but it eschews the most popular approach to the implementation: namely, the Simple Object Access Protocol (SOAP). SOAP makes local functions available in a way that conforms loosely to the object-oriented paradigm while at the same time facilitating automatic creation of the code skeletons necessary for gluing various client-server processes together. In the process, however, SOAP puts forth what amounts to a protocol separate from (and philosophically opposed to) the standard protocols of the web.

Tamarind is a web service framework for XML-based text analysis that takes advantage of a recently developed architectural paradigm known as Representational State Transfer (REST). In the words of its principal advocate:

"Representational State Transfer is intended to evoke an image of how a well-designed Web application behaves: a network of web pages (a virtual state-machine), where the user progresses through an application by selecting links (state transitions), resulting in the next page (representing the next state of the application) being transferred to the user and rendered for their use" (Fielding 109).

Put more concretely, REST proposes that, instead of creating an entirely new set of protocols for web services (UDDI, WSDL, SOAP, etc.), we combine XML with existing protocols (URIs for unique identifiers and HTTP for the wire protocol) to create web services that can operate within the existing operational paradigm of the World Wide Web. The scalability and robustness of such a system have already been demonstrated, since the current World Wide Web works precisely this way.

Tamarind takes REST to a certain philosophical extreme by defining all functions at a level of granularity comparable to that of the function libraries of common programming languages. Each one is presented as a simple resource capable of performing individual actions on one of a small number of discrete types, and each one is uniquely identified through an ordinary URI. In general, Tamarind resources have the following features (a sketch of such a resource follows the list):

1. Every Tamarind resource can receive a valid type and return a valid type without being passed a wrapper of any kind.

2. Some Tamarind resources can act as brokers for requests among a number of other Tamarind resources.

3. Every Tamarind resource can report on its capabilities, valid types, and valid next states through a simple GET interface.
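A minimal sketch of such a resource, written here in Python against the standard http.server module, might look as follows. The shape of the capability document and the namespace URI are assumptions; the paper specifies only the three features listed above.

# Minimal sketch of a Tamarind-style resource. GET returns a
# capability report (feature 3); POST receives a bare document,
# with no wrapper of any kind (feature 1). The capability format
# and namespace URI are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer
from xml.sax.saxutils import quoteattr

CAPABILITIES = b"""<tmrnd:capabilities xmlns:tmrnd="http://tamarind.example.edu/ns">
  <tmrnd:accepts>text/xml</tmrnd:accepts>
  <tmrnd:returns>text/xml</tmrnd:returns>
</tmrnd:capabilities>"""

class WordCountResource(BaseHTTPRequestHandler):
    def do_GET(self):
        # Feature 3: report capabilities, valid types, and next states.
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.end_headers()
        self.wfile.write(CAPABILITIES)

    def do_POST(self):
        # Feature 1: receive a valid type directly, with no wrapper.
        length = int(self.headers["Content-Length"])
        text = self.rfile.read(length).decode("utf-8")
        counts = {}
        for word in text.split():  # naive whitespace tokenization
            counts[word] = counts.get(word, 0) + 1
        body = "".join(
            f'<word form={quoteattr(w)} count="{n}"/>'
            for w, n in sorted(counts.items())
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.end_headers()
        self.wfile.write(f"<words>{body}</words>".encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), WordCountResource).serve_forever()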

We may imagine the following use-case scenario using the Tamarind architecture:

A client (perhaps a web browser, but perhaps also a bit of client code integrated with an existing full-text archive) makes a request to a resource called WordCount indicating that it wants to send a document and retrieve a sorted hashtable of the word frequencies of all the words within "p" tags. The client indicates the steps in the process by passing a simple request specification document (shown below, followed by a sketch of how a client might submit it):

<tmrnd:request>
  <tmrnd:broker>http://some.where.edu/WordCount</tmrnd:broker>
  <tmrnd:resource uri="http://some.where.else.edu/SubTree">
    <tmrnd:parameter>//p</tmrnd:parameter>
  </tmrnd:resource>
  <tmrnd:resource uri="http://some.where.edu/WordCount"/>
  <tmrnd:resource uri="http://yet.another.edu/AscendingHashSortByKey"/>
</tmrnd:request>
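A client might submit this specification as an ordinary HTTP POST to the broker. In the minimal sketch below, the namespace declaration and the tmrnd:document reference element are hypothetical conventions added for illustration; the paper says only that texts (or references to texts) are passed over the wire.

# Hypothetical client submission of the request specification above.
# The namespace URI and the tmrnd:document reference element are
# illustrative assumptions, not part of Tamarind's published interface.
import urllib.request

REQUEST_SPEC = """\
<tmrnd:request xmlns:tmrnd="http://tamarind.example.edu/ns">
  <tmrnd:broker>http://some.where.edu/WordCount</tmrnd:broker>
  <tmrnd:document uri="http://archive.example.edu/texts/novel.xml"/>
  <tmrnd:resource uri="http://some.where.else.edu/SubTree">
    <tmrnd:parameter>//p</tmrnd:parameter>
  </tmrnd:resource>
  <tmrnd:resource uri="http://some.where.edu/WordCount"/>
  <tmrnd:resource uri="http://yet.another.edu/AscendingHashSortByKey"/>
</tmrnd:request>"""

req = urllib.request.Request(
    "http://some.where.edu/WordCount",  # the broker named in the spec
    data=REQUEST_SPEC.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
    method="POST",
)
with urllib.request.urlopen(req) as response:
    result = response.read()  # the sorted word frequencies, as XML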

Here are the steps for resolving this request (a sketch of the broker's dispatch loop follows the list):

1. WordCount receives a POST containing the initial type (in this case the document) and the Tamarind request specification.

2. WordCount parses the request and POSTs the document and the required parameter to SubTree.

3. SubTree hasn't received a request specification, so it just processes the document and sends back a response to the caller (WordCount).

4. WordCount receives the response and parses the next resource in the request specification. Seeing that it is itself the next resource, it simply processes the result of the last operation internally.

5. WordCount reads the last resource in the specification, POSTs the result of the previous operation to AscendingHashSortByKey, and receives the result.

6. WordCount sees that there are no more resources to invoke, and so it sends the result of the last operation back to the original client.
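The brokering logic in these six steps reduces to a short dispatch loop. In the sketch below, only the walk order and the self-recognition short-circuit come from the paper; the namespace URI and helper names are assumptions, and parameter forwarding (the //p argument to SubTree) is elided because its encoding is not specified.

# Hedged sketch of the broker loop in steps 1-6. The helper names and
# namespace URI are assumptions; parameter forwarding is elided.
import urllib.request
import xml.etree.ElementTree as ET

TMRND = "{http://tamarind.example.edu/ns}"  # hypothetical namespace
SELF_URI = "http://some.where.edu/WordCount"  # this broker's own URI

def post(uri: str, payload: bytes) -> bytes:
    """Forward an intermediate result to a downstream resource (steps 2, 5)."""
    req = urllib.request.Request(
        uri, data=payload, headers={"Content-Type": "text/xml"}, method="POST"
    )
    with urllib.request.urlopen(req) as response:
        return response.read()

def process_locally(payload: bytes) -> bytes:
    """This resource's own operation (word counting); body omitted here."""
    raise NotImplementedError  # hypothetical placeholder

def broker(spec_xml: bytes, document: bytes) -> bytes:
    """Resolve a request specification against an initial document payload."""
    result = document
    for resource in ET.fromstring(spec_xml).findall(TMRND + "resource"):
        uri = resource.get("uri")
        if uri == SELF_URI:
            # Step 4: the broker is itself the next resource, so it
            # processes the intermediate result in-process.
            result = process_locally(result)
        else:
            # Steps 2 and 5: forward the intermediate result downstream.
            result = post(uri, result)
    return result  # Step 6: sent back to the original client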

This system has all the benefits of a web service architecture, including complete platform and language independence, but it achieves them without the additional overhead (and complexity) of SOAP.

My presentation will briefly lay out the terms of SOAP- and REST-based web service architectures before proceeding to an architectural overview (and working example) of Tamarind being used to perform text-analytical procedures on a remote text archive. I will conclude with some ideas concerning the future of Tamarind, its role in the TAPoR project, and its possible integration with Stefan Sinclair's Text Analysis Markup Language (TAML).

References:

Fielding, Roy Thomas. Architectural Styles and the Design of Network-based Software Architectures. Diss. University of California, Irvine, 2000. Ann Arbor: UMI, 2000.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None