Mots15 - An interactive concordance system (built from mostly off-the- shelf parts)

poster / demo / art installation
Authorship
  1. 1. Paul Meurer

    University of Bergen

  2. 2. Michael Sperberg-McQueen

    World Wide Web Consortium (W3.org)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Mots15 - An interactive concordance system (built from
mostly off-the- shelf parts)

Paul
Meurer

University of Bergen
paul.meurer@hit.uib.no

Michael
Sperberg-McQueen

World Wide Web Consortium
cmsmcq@acm.org

2002

University of Tübingen

Tübingen

ALLC/ACH 2002

editor

Harald
Fuchs

encoder

Sara
A.
Schmidt

Mots 15 is an interactive Web-based concordance or full-text retrieval system
built mostly out of off-the-shelf software.
The goals of the Mots-15 project are:
to build a reasonably capable full-text retrieval system, with
functionality generally similar to Tact, ARRAS, and the like, but
with better markup awareness
to keep minimum investment low for both implementors and users
to allow experimentation with interesting parts of the query
system

From these design goals follow several design principles:
simplicity of implementation
use of off-the-shelf components wherever possible
modularity, loose coupling among modules using predefined
interfaces wherever possible

1. Basic interfaces in a query system

1.1. Monoliths
At a very simple level, an interactive query system simply accepts
queries from a user, which return responses from the data. In systems
like Arras and Tact, the single monolithic software package controls
everything in the diagram.

Image 1: A monolithic query system

1.2. Web interface
With the advent of graphical browsers for the World Wide Web, however, it
is possible to provide a fairly attractive interface at a much lower
cost than would otherwise be possible. It may still make sense to devise
special-purpose user interface software for specific purposes, but we
can go a long way without it, just relying on the users to have chosen a
Web browser they like reasonably well. The Web, that is, exposes an
interface between the user interface and the data in the back end.

Image 2: A Web-based query system

This interface sets certain limits to our freedom — we must now use HTML
to describe what the user sees (we can use arbitrary XML if we are
willing to require that the user have an XML-capable browser) and the
user's interactions with the server are limited to what can be done
using HTML forms (and possibly a browser scripting language like
Javascript) — but within those limits we can develop better user
interfaces at a lower cost than if we were building from scratch.
Even more important, we can now swap front- and back-ends in and out. We
can experiment with different user interfaces by writing different
front-end forms and HTML style sheets. In theory, we can also experiment
with different back ends by substituting one for the other and using the
same front end; in practice, the existing systems built on this model
don't easily allow for swapping different back ends in and out, because
the interface between the front end and the back end varies with the
specific product used as the back end. Because different commercial
products rarely support identical interfaces, this means it's rarely
possible to swap a new back end in with minimal effort.

1.3. Mots 15
The Mots 15 system differs from the generic Web-based system primarily by
exposing a generic query interface in front of the back-end-specific
query interface, in order to buffer the front end and back end from each
other.

Image 3: Basic plan of MOTS query system

Ideally, this generic query interface should follow some open
specification; ideally, it should provide all the functionality we want
(to keep life simple), and no more (so that it is easy to build back
ends if we want to do it ourselves); the exact choice depends on the
tradeoff between these incompatible goals.
Assuming that we have some suitable query language, and a way to
translate from it into the query language of the back end, then any XML
query engine may be used as back end.
A SQL dbms may be the most flexible back end. The design made by MSM for
this would involve a few light-weight scripts which run on top of’ the
SQL database system. The SQL system itself would in this design produce
not elements but element pointers, which would be used to extract the
elements from a saved copy of the XML file.
The task of translating from the open query language to the proprietary
back end query language is, of course, simplified if the back end
accepts the open query language itself.
Another alternative would be using the corpus management system Corpus
WorkBench (IMS, Stuttgart) as the MOTS back end, possibly combined with
an XML query facility. This would allow for more advanced linguistic
queries.

2. Pieces of Mots 15
Mots 15 is designed to make it relatively simple to specify and implement
each piece of the system. The better we succeed in this goal, the easier it
will be for us to experiment with different parts of the system, and the
easier it will be for eventual users to customize it for their own purposes.
Eventually, the designers hope that Mots 15 will grow into a library of
reusable and customizable pieces, which individuals and small projects can
modify to make useful special-purpose systems.
The Mots 15 design requires the following pieces of software:
browser: an off-the-shelf Web browser; this handles the actual
display of results on the user's screen and interaction with the
user
forms: one or more HTML forms which allow the user to specify
searches; these produce an HTML-forms data stream which the parser
hands to an appropriate CGI script
form-to-query translator: a program to translate the forms data
into a query, expressed in the open query language
query-to-query translator: a program to translate the query from
the open query language into the query language supported by the
back end
back end: a program, which accepts queries in some (possibly
proprietary) query language and returns as results e.g. some set of
XML elements
wrapper: a program which takes the results and places them in
two-level wrapper: (a) an outermost mots:result XML element and (b)
an element depending on the hit type wrapped around each hit, each
with attributes providing useful information about the query and its
results
XML-to-HTML translator: a program which takes the wrapped results
and translates them into HTML suitable for display in the user's
off-the-shelf browser
transaction manager: a CGI script to manage the query/response
transaction, by calling (or incorporating) the various other
programs in this list; it may also be responsible for session
management

3. Open problems and opportunities
The existing implementation of Mots15 (as of November 2001) is a minimal
system with
a choice of several simple Web interfaces
support for straightforward XML documents only
XSLT stylesheets for XML-HTML translation
a limited (XPath based) query language, extended with word
frequency queries

There are several obvious challenges for the future development of Mots
15:
serious Web interface (room for experiment)
‘XML++’ support
display of parallel versions, textual variation
external and user supplied annotation
proximity searching
exploiting grammatical annotation of text
supporting documents with overlap (i.e. TexMECS)
allowing users to search as if the text were marked up more simply
than it is (e.g. with a uniform chapter/section/paragraph/sentence
hierarchy)
supporting more powerful back ends (either by means of wormholes
in the open query language, or by means of a second interface)
managing selection of texts from a corpus or collection; federated
searches

References

John
Price-Wilkin
Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries

Public-Access Computer Systems Review

5
3
5-21
1994

.

John
Price-Wilkin
A Gateway between the World Wide Web and PAT: Exploring SGML Through the Web

Public-Acces Computer Systems Review

5
7
527
1994

John
Price-Wilkin
The Feasibility of Wide-area Textual Analysis Systems in Libraries: A Practical Analysis

Presented at Literary Texts in an Electronic Age: Scholarly Implications and Library Services, the 31st Annual Clinic on Library Applications of Data Processing (University of Illinois at Urbana- Champaign). April 10-12, 1994

1994

. Published in the Proceedings of the Clinic. Gateway between the World Wide Web and PAT: Exploring SGML Through the Web.

John
Price-Wilkin
Just-in-time Conversion, Just-in-case Collections: Effectively leveraging rich document formats for the WWW

D-Lib Magazine

May 1997

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2002
"New Directions in Humanities Computing"

Hosted at Universität Tübingen (University of Tubingen / Tuebingen)

Tübingen, Germany

July 23, 2002 - July 28, 2008

72 works by 136 authors indexed

Affiliations need to be double-checked.

Conference website: http://web.archive.org/web/20041117094331/http://www.uni-tuebingen.de/allcach2002/

Series: ALLC/EADH (29), ACH/ICCH (22), ACH/ALLC (14)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None