The MATE Workbench - a Tool for XML Corpus Annotation

poster / demo / art installation
Authorship
  1. Amy Isard

    University of Edinburgh

  2. David McKelvie

    University of Edinburgh

  3. Andreas Mengel

    Universität Stuttgart

  4. Morten Baun Moeller

    Odense University

Work text
The MATE workbench is an annotation tool which allows transcription, annotation, display and querying of speech and text corpora. It is not tied to any particular annotation scheme, and allows users to define interfaces which suit their particular needs.

The main difference between the MATE workbench and other annotation tools is its flexibility. Any annotation scheme which can be expressed in XML [7] can be used with the workbench (for a discussion of overlapping hierarchies see below), and the display and annotation interfaces are defined using a language based on the stylesheet language XSLT [8]. The workbench is written entirely in Java and is therefore platform independent.

Annotation can be a very tedious task for humans, and many tools have been developed to make it easier. Before beginning development of the workbench we reviewed a number of existing annotation tools [3]; many of them have a fixed user interface or are restricted to a particular coding scheme. One tool which has some similarities to MATE is GATE [1], which can also be used with any annotation scheme. However, GATE has a different internal architecture, based on the US Tipster architecture rather than XML, and its main aim is to make it possible to integrate different automatic annotation components within one system. MATE instead aims to provide a framework in which stylesheets define the annotation and display interfaces. Because the stylesheet language is quite high level, it is easier to write a stylesheet to provide a given interface than to write an entire coding tool from scratch.

The MATE project has developed annotation schemes for five sets of linguistic phenomena, and examples of markup using these schemes will be distributed with the workbench, along with stylesheets for their annotation and display. Users of the workbench are by no means limited to these schemes, however.

To display annotated data in the workbench, a user must have a MATE project file, which specifies one or more XML annotation or transcription files, sound files if appropriate, and a stylesheet to be applied to these files. Several example project files will be provided with the workbench. When the workbench is started, a corpus folder window appears, displaying all the available project files. After selecting a project file, the user clicks on the "run" button, the specified files are processed, and one or more display or annotation windows appear, depending on what was specified in the stylesheet. A different stylesheet can be used with the same files to produce different behaviour.
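The contents of a project file can be pictured as a small record. The sketch below is only an illustration: the abstract does not give the actual MATE project file format, so the class `ProjectFile`, its field names and the file names are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class ProjectFile:
    """Hypothetical in-memory picture of a project file (not the MATE format)."""
    name: str
    annotation_files: List[str]          # one or more XML annotation/transcription files
    stylesheet: str                      # stylesheet applied to those files
    sound_files: List[str] = field(default_factory=list)  # optional

    def __post_init__(self):
        # The text requires at least one XML file and a stylesheet.
        if not self.annotation_files:
            raise ValueError("a project needs at least one XML file")
        if not self.stylesheet:
            raise ValueError("a project needs a stylesheet")


# Hypothetical file names, for illustration only.
display = ProjectFile(
    name="maptask-display",
    annotation_files=["words.xml", "moves.xml"],
    stylesheet="display-moves.xsl",
    sound_files=["dialogue1.wav"],
)

# Same files, different stylesheet: a different interface over the same data.
annotate = replace(display, name="maptask-annotate", stylesheet="annotate-moves.xsl")

print(display.stylesheet, "->", annotate.stylesheet)
```

Swapping only the stylesheet while the data files stay fixed mirrors the last sentence of the paragraph above.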

The other major use of the workbench is in performing queries over a corpus. A query language [4] was developed within the project which is tailored to our internal representation of the data, including the treatment of multiple hierarchies as defined below. To perform a query, the user first loads in one or more documents, then opens a Query Window, which provides support for building complex queries. The results of the queries are saved in XML format within the workbench, and are then also displayed using a stylesheet, allowing a flexible representation of the results.
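The idea of saving query results as XML can be sketched in a few lines. This is not the MATE query language, whose syntax is not given here; the simple attribute match, the element and attribute names, and the `#id(...)` link form are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation document loaded into the "workbench".
CORPUS = ET.fromstring(
    '<moves>'
    '<move id="m1" type="instruct" speaker="giver"/>'
    '<move id="m2" type="acknowledge" speaker="follower"/>'
    '<move id="m3" type="instruct" speaker="giver"/>'
    '</moves>'
)


def run_query(root, tag, **conditions):
    """Return elements named `tag` whose attributes equal every condition."""
    return [
        el for el in root.iter(tag)
        if all(el.get(key) == value for key, value in conditions.items())
    ]


def results_as_xml(matches, source="moves.xml"):
    """Store the hits as an XML document of pointers back into the source."""
    results = ET.Element("results")
    for el in matches:
        ET.SubElement(results, "hit", href=f"{source}#id({el.get('id')})")
    return results


hits = run_query(CORPUS, "move", type="instruct", speaker="giver")
result_doc = results_as_xml(hits)
print(ET.tostring(result_doc, encoding="unicode"))
```

Because the result is itself an XML document, it can be handed to a stylesheet for display in the same way as any other annotation file.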

One question which often comes up when XML annotation schemes are being developed is how to handle multiple overlapping hierarchies of markup. The TEI [5] describes several possible ways to represent overlapping hierarchies in SGML. One of these, CONCUR, is not available in XML, which was designed as a simpler and easier-to-use version of SGML and therefore omits some features. We have taken a different approach from any of these, one which has proved successful in at least one large-scale corpus annotation project [2]. Our solution is to keep each level of annotation, and each data stream (in the case of multi-speaker conversations, for instance), separate, and to link each level to a common base level. This base level would normally be the smallest unit on which all the other annotations depend. It will often be the word level, but could also be phonemes in the case of speech, higher-level units such as sentences or paragraphs, or anything else as appropriate. The MATE workbench will therefore deal appropriately with any data marked up in this way using hyperlinks, as defined in the XLink proposal [6].
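A minimal sketch of this arrangement follows, assuming a word-level base and two separate annotation levels that point into it. The element names, the sample words and the `file#id(a)..id(b)` range syntax are assumptions for illustration, not a specification of the MATE file formats.

```python
import re
import xml.etree.ElementTree as ET

# Base level: the smallest units, each with a unique id.
WORDS_XML = """<words>
  <w id="w1">okay</w><w id="w2">go</w><w id="w3">left</w>
  <w id="w4">then</w><w id="w5">stop</w>
</words>"""

# Two annotation levels kept in separate documents. Each element points at
# a span of the base level instead of containing the words itself.
MOVES_XML = """<moves>
  <move id="m1" href="words.xml#id(w1)..id(w3)"/>
  <move id="m2" href="words.xml#id(w4)..id(w5)"/>
</moves>"""

PHRASES_XML = """<phrases>
  <phrase id="p1" href="words.xml#id(w1)"/>
  <phrase id="p2" href="words.xml#id(w2)..id(w4)"/>
  <phrase id="p3" href="words.xml#id(w5)"/>
</phrases>"""

# Assumed link form: "file#id(first)" or "file#id(first)..id(last)".
LINK = re.compile(r"[^#]*#id\((\w+)\)(?:\.\.id\((\w+)\))?")


def load_base(xml_text):
    """Return the base level as an ordered list of (id, text) pairs."""
    return [(w.get("id"), w.text) for w in ET.fromstring(xml_text).iter("w")]


def resolve(href, base):
    """Follow a link and return the texts of the base units it covers."""
    match = LINK.fullmatch(href)
    if match is None:
        raise ValueError(f"unsupported link: {href!r}")
    first, last = match.group(1), match.group(2) or match.group(1)
    ids = [unit_id for unit_id, _ in base]
    return [text for _, text in base[ids.index(first):ids.index(last) + 1]]


def load_level(xml_text, base):
    """Map each annotation element's id to the base units it spans."""
    return {el.get("id"): resolve(el.get("href"), base)
            for el in ET.fromstring(xml_text)}


base = load_base(WORDS_XML)
moves = load_level(MOVES_XML, base)
phrases = load_level(PHRASES_XML, base)
print(moves)
print(phrases)
```

Here phrase `p2` starts inside move `m1` and ends inside move `m2`, so the two levels could not be nested in a single XML tree; keeping them in separate documents linked to the same base avoids the problem.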

Another advantage of the generality of the MATE workbench is that it makes it easier to combine views of annotations made with different schemes on the same corpus. These annotations may be made at different sites without any contact between them, but if both use hyperlinks to the same base level, it is possible to create stylesheets which display both annotations at the same time. It is also possible to write a stylesheet which will display one level and allow annotation of another level at the same time.
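The following sketch shows why a shared base level makes such combined views straightforward. It assumes two independently produced annotation documents whose elements carry a `label` attribute and point at word ids with a `file#id(a)..id(b)` link; these names and the link form are hypothetical, and the merged table stands in for what a stylesheet would render.

```python
import re
import xml.etree.ElementTree as ET

# Shared base level.
WORDS = ET.fromstring(
    '<words><w id="w1">okay</w><w id="w2">turn</w><w id="w3">left</w></words>'
)

# Annotation produced at one site (hypothetical syntactic labels).
SITE_A = ET.fromstring(
    '<syntax>'
    '<phrase label="INTJ" href="words.xml#id(w1)"/>'
    '<phrase label="VP" href="words.xml#id(w2)..id(w3)"/>'
    '</syntax>'
)

# Annotation produced at another site (hypothetical dialogue acts).
SITE_B = ET.fromstring(
    '<dialogue>'
    '<act label="instruct" href="words.xml#id(w1)..id(w3)"/>'
    '</dialogue>'
)

# Assumed link form: "file#id(first)" or "file#id(first)..id(last)".
LINK = re.compile(r"[^#]*#id\((\w+)\)(?:\.\.id\((\w+)\))?")


def labels_by_word(layer, word_ids):
    """Map each base-level id to the label of the element covering it."""
    labels = {}
    for el in layer:
        match = LINK.fullmatch(el.get("href"))
        if match is None:
            raise ValueError(f"unsupported link: {el.get('href')!r}")
        first, last = match.group(1), match.group(2) or match.group(1)
        for wid in word_ids[word_ids.index(first):word_ids.index(last) + 1]:
            labels[wid] = el.get("label")
    return labels


def combined_view(words, *layers):
    """One row per base unit: its text plus one label per annotation layer."""
    word_ids = [w.get("id") for w in words.iter("w")]
    per_layer = [labels_by_word(layer, word_ids) for layer in layers]
    return [
        (w.text, *(labels.get(w.get("id"), "-") for labels in per_layer))
        for w in words.iter("w")
    ]


rows = combined_view(WORDS, SITE_A, SITE_B)
for row in rows:
    print("\t".join(row))
```

Neither annotation document needs to know about the other; the join happens entirely through the ids of the common base level.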

The MATE workbench has just been completed, and testing and evaluation are about to begin. We will be able to provide a section on this evaluation for the final paper. We will be asking testers to use the workbench for a variety of different annotation tasks, and provide feedback on ease of use and power, and also evaluating whether the stylesheet language allows testers to define new annotation and display interfaces easily. We will also be submitting a proposal to ALLC/ACH 2000 for a demo of the workbench.

References

[1] GATE <http://www.dcs.shef.ac.uk/research/groups/nlp%2Fgate/>
[2] Isard, McKelvie and Thompson (1998). Towards a Minimal Standard for Dialogue Transcripts: A New SGML Architecture for the HCRC Map Task Corpus. Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP98, Sydney.
[3] MATE Deliverable D3.1: Specification of Coding Workbench <http://www.cogsci.ed.ac.uk/~amyi/mate/report.html>
[4] Mengel, A. and Heid, U. (1999). Enhancing Reusability of Speech Corpora by Hyperlinked Query Output, Eurospeech 99, Budapest, September 1999.
[5] Sperberg-McQueen, C. M. and Burnard, Lou (eds). TEI Guidelines for Electronic Text Encoding <http://etext.lib.virginia.edu/TEI.html>
[6] XLINK <http://www.w3.org/TR/xlink>
[7] XML <http://www.w3.org/TR/REC-xml>
[8] XSLT <http://www.w3.org/TR/xslt>

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2000

Hosted at University of Glasgow

Glasgow, Scotland, United Kingdom

July 21, 2000 - July 25, 2000

Conference website: https://web.archive.org/web/20190421230852/https://www.arts.gla.ac.uk/allcach2k/

Series: ALLC/EADH (27), ACH/ICCH (20), ACH/ALLC (12)

Organizers: ACH, ALLC
