Markup in Textgrid

paper
Authorship
  1. Fotis Jannidis

    Technische Universität Darmstadt (Technical University of Darmstadt)

  2. Thorsten Vitt

    Technische Universität Darmstadt (Technical University of Darmstadt)

Work text

The paper will discuss the markup-related decisions
which have been made in Textgrid. The first part of the paper
will describe the functionality and principal architecture of
Textgrid; the second part will discuss Textgrid’s baseline
encoding. Textgrid is a modular platform for collaborative
textual editing and a first building block for a community grid
for the humanities. Textgrid consists of a toolkit for creating
and working with digital editions and a repository offering
storage, archiving and retrieval.
Textgrid’s architecture follows a layered design, built for
openness on all levels. At its base there is a middleware layer
providing generic utilities to encapsulate and provide access
to the data grid’s storage facilities as well as external archives.
Additionally, indexing and retrieval facilities and generic services
like authorisation and authentication are provided here.
A service layer built on the middleware provides automated
text processing facilities and access to semantic resources.
Here, Textgrid offers domain-specific services like a
configurable streaming editor or a lemmatizer which uses the
dictionaries stored in Textgrid. All services can be orchestrated
in workflows, which may also include external services.
Every service uses standard web service technologies.
Moreover, tools in the service layer can work both with data
managed by the middleware and with data streamed in and out of
these services by the caller, so they can be integrated with
environments outside of Textgrid.
The full tool suite of Textgrid is accessible via TextGridLab, a
user interface based on Eclipse which, besides user interfaces
to the services and search and management facilities for
Textgrid’s content, also includes some primarily interactive
tools. The user interface provides integrated access to the
various tools, for example an XML editor, a tool to mark
up parts of an image and link them to the text, and a dictionary
service. From the perspective of the user, all these tools are
part of one application.
This software framework is completely based on plug-ins and
thus reflects the other layers’ extensibility: it can easily be
extended by plug-ins provided by third parties, and although
there is a standalone executable tailored to philologist
users, TextGridLab’s plug-ins can also be integrated with existing
Eclipse installations.
Additionally, the general public may read and search published
material by means of a web interface, without installing any
specialized software.
In designing this infrastructure, it would have been possible
to define a single data format for use in all services,
including search, retrieval and publishing. Instead, the
designers chose a different approach: each service or software
component defines its own minimal level of format restriction.
The XML editor, which is part of the front end, is designed to
process all files which conform to XML; the streaming editor
service can handle any kind of file, etc. The main reasons for this
decision were the experience of the people involved and the
model of the TEI Guidelines, which allow users as much individual
freedom as possible in choosing and using their markup, even if
the success of TEI Lite and the many project-specific TEI subsets
seems to point to the need for defining strict standards.
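The idea that each component states its own minimal format requirement can be illustrated with a short Python sketch. It is not Textgrid code: the component names and the registry are hypothetical, and the sketch assumes that the XML editor's requirement amounts to well-formedness, checked here with the standard library parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def accepts_anything(data: bytes) -> bool:
    """A component without format restrictions accepts every file."""
    return True


# Hypothetical registry: each component declares its own minimal requirement.
FORMAT_CHECKS = {
    "xml_editor": is_well_formed_xml,
    "streaming_editor": accepts_anything,
}


def usable_components(data: bytes) -> list:
    """List the components whose minimal format requirement the data meets."""
    return sorted(name for name, check in FORMAT_CHECKS.items() if check(data))
```

Under these assumptions, a well-formed XML file is usable by both components, while a plain text file or a file with mismatched tags is usable only by the unrestricted one.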
At some points of the project, however, more restrictive format
decisions had to be made. One of them resulted from the
project’s ambition to make all texts searchable in a way which
is more useful than a mere full-text search. On the other hand,
it is not realistic to propose a full format which will allow all
future editors, lexicographers and corpus designers to encode
all features they are interested in. So Textgrid allows all projects
to use whatever XML markup seems necessary but burdens
each project with designing its own interface to these complex
data structures. In this form, however, the project data are an island,
and no common retrieval is possible. To allow retrieval
across all data in Textgrid which goes beyond the possibilities
of a full-text search, the Textgrid designers discussed several
possibilities and finally settled on a concept which relies
heavily on text types like drama, prose, verse, letter etc.
We differentiate between basic text types like verse and
container text types like corpora or critical editions.
Interproject search is enabled by transforming all texts into
a rudimentary format which contains the most important
information of the specific text type. This baseline encoding is
not meant to be a core encoding which covers all important
information of a text type; it is strictly functional. We
defined three requirements which should be met by the baseline
encoding, which is meant to be a subset of the TEI:
1) Intelligent search. By including frequently used aspects of text
types in the search, we try to make text retrieval more
effective. A typical example would be the ‘knowledge’ that a
word is the lemma of a dictionary entry, so a search for this
word would rank this subtree as a better hit than another
where it is just part of a paragraph.
2) Representation of search results. The results of an interproject
search have to be displayed in some manner which
preserves some important aspects of the source texts.
3) Automatic reuse and further processing of text. A typical
example of this would be the integration of a dictionary
into a network of dictionaries. This aspect is notoriously
underrepresented in most design decisions of modern
online editions, which usually see publication as the
natural goal of their project, a publication which usually
allows only reading and searching as the typical forms of text
usage.
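The lemma example from the first requirement can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Textgrid search implementation: the weights are hypothetical, the sample uses TEI-style element names (`entry`, `form`, `orth`, `p`) without the TEI namespace, and a plain `id` attribute stands in for real identifiers.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights: a match in a dictionary lemma counts for more
# than a match somewhere in running text.
LEMMA_WEIGHT = 10
TEXT_WEIGHT = 1

# Hypothetical sample in a simplified, namespace-free TEI-like encoding.
SAMPLE = """
<body>
  <entry id="e1">
    <form type="lemma"><orth>Haus</orth></form>
    <def>building for human habitation</def>
  </entry>
  <p id="p1">Das Haus steht am See.</p>
  <p id="p2">Der See ist tief.</p>
</body>
"""


def score(unit, word):
    """Score one subtree: lemma match first, plain text match second."""
    lemmas = [orth.text for orth in unit.findall("form[@type='lemma']/orth")]
    if word in lemmas:
        return LEMMA_WEIGHT
    # Crude tokenisation, sufficient for the sample: drop full stops, split on whitespace.
    tokens = "".join(unit.itertext()).replace(".", " ").split()
    return TEXT_WEIGHT if word in tokens else 0


def search(xml_text, word):
    """Return (id, score) pairs for all matching subtrees, best hit first."""
    root = ET.fromstring(xml_text)
    hits = [(unit.get("id"), score(unit, word)) for unit in root]
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With this sample, a search for "Haus" ranks the dictionary entry `e1` above the paragraph `p1`, whereas a search for "See" finds only the two paragraphs with equal scores.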
Our paper will describe the baseline encoding format for
some of the text types currently supported by Textgrid,
including the metadata format, and discuss in what ways they
meet the three requirements.
One of the aims of our paper is to put our arguments and
design decisions up for discussion in order to test their
validity. Another aim is to reflect on the consequences of this
approach for others like the TEI, especially the idea of defining
important text types for the humanities and providing specific
markup for them.


Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None