TAPoR: Tools, Architectures and Techniques II

paper
Authorship
  1. Geoffrey Rockwell

    McMaster University

Work text

Introduction

The TAPoR project is developing tools for aggregation into a portal for text analysis. This paper presents one collection of these tools, the TAPoRware Tools, and will do the following:

1. Discuss the need for new tools and alternative configurations of tools.

2. Discuss the design goals of TAPoRware and their implementation.

3. Demonstrate selected tools as used through different interfaces.

4. Discuss three issues related to the tools and tool development in the humanities.

TAPoRware is only one of a number of tool development projects supported by the TAPoR project. The TAPoR project is tool-agnostic: its aim is to aggregate and make accessible any tool that is appropriate. That said, in developing the portal we need representative tools that can be used to test the portal model, specifically the calling of tools as web services and the passing of results to other services. Thus the preliminary goal of the TAPoRware project is to develop flexible tools that can be adapted by a portal and return results to the portal for further processing by other tools. It is also important to us to have tools that can be adapted for usability research into text analysis interfaces.

Tool Needs

The TAPoR project is conducting a survey on text analysis led by Elaine Toms of the Faculty of Information Studies at the University of Toronto (see http://www.fis.utoronto.ca/tapor/tapor-consent.asp). While this needs survey will be reported on more fully in another paper, a few preliminary results that are influencing the further design of the TAPoRware tools are:

- Humanists report that they typically don't collaborate on projects. They talk about their projects at conferences, but often pursue them alone. Thus we need tools that humanists can use on their own without significant training or technical support. Larger projects can use existing server-based tools, but such tools are difficult for individuals to use.

- Few of the respondents knew about the tools listed in the survey, such as TACT. They don't know where to look for information about tools. Therefore it is important to properly document and advertise tools if we want colleagues to use them.

- Many respondents use productivity tools like Word and Excel for simple analysis. They appear to use what is at hand and familiar. We can leverage this by providing output that can be easily used in productivity tools.

In short, we need simple tools that are easy to use and that are aggregated with supporting information that makes their use interesting to humanists new to text analysis.

Design Goals

The design goals of the TAPoRware project are:

1. To develop simple tools that can be run from a remote server on texts that are exposed on the Web so that users can use the tools with minimal work.

2. To develop tools that can be run on XML, HTML and plain text files so that users can use the tools on anything at hand.

3. To develop tools that can output results in formats users can use, including the TAML format, so that those results can be processed by other tools.

4. To make the tools available as "Open Source" tools for redevelopment and redeployment by other projects.

5. To experiment with alternative interfaces to these tools in order to lower the difficulty of employing these tools.

Currently we have a working suite of tools that operate on XML, HTML and plain texts. In this presentation we will report on the testing process we followed to bring the tools to the point where they can be distributed. These tools were written in Ruby and are available for download from the TAPoRware site (see strange.mcmaster.ca/~taporware). These tools do basic operations like listing elements, extracting subsets of a text, finding patterns, listing words, finding co-occurrences, and listing collocates. We have added distribution graphing to the suite and are adding aggregating and comparison tools. One of the challenges facing this project is finding ways of describing the tools for non-specialists who don't know the jargon of computer-assisted text analysis. An alternative interface called AnalyzeThis, which describes the tools in terms of what they can do for the non-specialist, will be shown.
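As a rough illustration of two of the operations named above, listing words and listing collocates, here is a minimal Python sketch. It is not the TAPoRware code (which is written in Ruby). The function names, the tokenizing rule, and the window size are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a word is a run of ASCII letters or apostrophes.
    A real tool would need a more careful rule.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def collocates(text, target, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            # Words before and after the target, excluding the target itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts.most_common()


sample = "The cat sat on the mat. The dog sat on the log."
print(word_list(sample)[:3])
print(collocates(sample, "sat", window=1))
```

The window-based definition of a collocate used here is only one common choice; other definitions (for example, co-occurrence within a sentence or an XML element) would need a different grouping step.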

Demonstration of TAPoRware

In the presentation we will demonstrate the tools in order to show how they work and to show different interfaces to the tools.

TAPoRware Issues

The demonstration of TAPoRware will illustrate the following issues.

Output. The tools are designed to output results in three forms. First, there is a "pretty" HTML output format intended for introductory users. Second, for the tools that operate on XML, there is a view that presents the original text's XML tags within HTML so that users can cut and paste the original XML. This format and the first HTML format are designed to let users cut and paste results into a word processor or other productivity tool. Finally, there is an XML results output. This output is of interest to the humanities computing community because it is what allows tools to be concatenated and results to be handled by other tools. We will present this format, developed in consultation with the TAML project led by Stéfan Sinclair, to the community for comment. We believe we have identified some basic results formats typical of text analysis operations that can be used by other tools. Here is an example of our current output for a word list:

<?xml version="1.0"?>
<tapor:results xmlns:tapor="tapor.mcmaster.ca">
  <tapor:summary>Summary: There are 752 unique words. and there are 1902 words in total. 526 words occurred once and 96 words occurred twice.</tapor:summary>
  <tapor:item>
    <tapor:foundword>The</tapor:foundword><tapor:count>102</tapor:count>
  </tapor:item>
  <tapor:item>
    <tapor:foundword>Of</tapor:foundword><tapor:count>91</tapor:count>
  </tapor:item>
  <tapor:item>
    <tapor:foundword>And</tapor:foundword><tapor:count>58</tapor:count>
  </tapor:item>
  <tapor:item>
    <tapor:foundword>A</tapor:foundword><tapor:count>48</tapor:count>
  </tapor:item>
  ...
</tapor:results>
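To suggest how another tool might consume the XML results format shown above, here is a minimal Python sketch that reads a word list and rewrites it as CSV for use in a spreadsheet. The element names and namespace are taken from the example above. The function names, the shortened sample document, and the CSV column headings are assumptions made for illustration and are not part of TAPoRware.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace as declared in the example output above.
NS = {"tapor": "tapor.mcmaster.ca"}

# A shortened copy of the example (two items, no summary), used only for this sketch.
RESULTS = """<?xml version="1.0"?>
<tapor:results xmlns:tapor="tapor.mcmaster.ca">
<tapor:item>
<tapor:foundword>The</tapor:foundword><tapor:count>102</tapor:count>
</tapor:item>
<tapor:item>
<tapor:foundword>Of</tapor:foundword><tapor:count>91</tapor:count>
</tapor:item>
</tapor:results>"""


def read_word_list(xml_text):
    """Return a list of (word, count) pairs from a results document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.findall("tapor:item", NS):
        word = item.findtext("tapor:foundword", namespaces=NS)
        count = int(item.findtext("tapor:count", namespaces=NS))
        pairs.append((word, count))
    return pairs


def to_csv(pairs):
    """Serialize (word, count) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])  # hypothetical column headings
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = read_word_list(RESULTS)
print(to_csv(pairs))
```

Because the reader depends only on the item, foundword and count elements, the same approach could in principle feed any downstream tool, not just a spreadsheet export.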

Interface. The tools currently use one interface model, which assumes that users will go to the portal (or TAPoRware) site and from there identify the text they are operating on, either by entering a URL or by uploading the text. We are now developing an alternative interface that allows users to click an AnalyzeThis button on their "Favorites Bar" (if they are using Internet Explorer) to call up a text analysis panel that can operate on whatever Web page they are viewing. This alternative interface is designed to allow simple analytical tools to run on whatever is before the user, thus saving them the step of going to a tools site first.
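One way to picture the hand-off from the page being viewed to a remote tool is as a request URL that carries the page's address as a parameter. The following Python sketch shows only that idea. The endpoint, the parameter names (`tool`, `source`, `output`), and the tool name `listwords` are hypothetical and do not describe the actual TAPoRware or AnalyzeThis interface.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, used only for illustration.
TOOL_ENDPOINT = "http://tools.example.org/analyze"


def analysis_url(page_url, tool="listwords", output="html"):
    """Build a request URL asking a remote tool to analyze the page at `page_url`.

    The parameter names are assumptions, not a documented API.
    """
    query = urlencode({"tool": tool, "source": page_url, "output": output})
    return f"{TOOL_ENDPOINT}?{query}"


url = analysis_url("http://example.org/texts/hamlet.html")
print(url)

# Decoding the query again recovers the original page address.
params = parse_qs(urlsplit(url).query)
print(params["source"][0])
```

Percent-encoding the page address, as `urlencode` does here, matters because the address itself may contain characters such as `?` and `&` that would otherwise be read as part of the tool request.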

Exposure. Finally, we will make a plea to participants who are publishing electronic texts to Expose Your Texts. We will make the case that, for scholarly text analysis, it is important that e-text projects not just make their texts available within a search engine, but that they expose their texts in a way that allows users to run other tools, like the TAPoRware Tools, on them. We will conclude with guidelines on how to do this and show an example of what can be done.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
