XML for Overlapping Structures (XfOS) using a non XML Data Model

paper
Authorship
  1. Alexander Czmiel

    Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) (Berlin-Brandenburg Academy of Sciences and Humanities), History - Universität zu Köln (University of Cologne)

Work text

Introduction

Since hierarchical markup languages such as SGML and XML were first used for marking up texts, the phenomenon of overlapping structures has been a major concern, because the rather constricted tree structure of these languages does not permit the representation of overlapping structures. Contrary to the OHCO thesis1, it is generally accepted that texts are not an „ordered hierarchy of content objects", but can instead have multiple hierarchies or lack any transparent hierarchy. The problem of overlapping structures can therefore be divided into two cases: multiple concurrent hierarchies that overlap each other, and overlapping structures without any hierarchy.

In addition to the traditional attempts to solve this problem, two more interesting options were introduced at the „Extreme Markup Languages 2002" conference2. The first is „Just In Time Trees" (JITTs)3, developed by Patrick Durusau and Matthew O'Donnell. The second is a markup language explicitly developed for marking up overlapping structures in one document, the „Layered Markup and Annotation Language" (LMNL)4, invented by Jeni Tennison, Wendell Piez and Gavin Thomas Nicol. The second approach is especially interesting for documents containing overlapping structures without any hierarchy, but it can also be used to mark up concurrent hierarchies in one document.

This paper gives a short summary and comparison of the known approaches and outlines a solution that combines XML with a non-XML data model, namely the LMNL data model, which is based on „Core Range Algebra" (CRA)5 and „Attributed Range Algebra" (ARA)6.

At this time the proposals for possible solutions can be divided into three categories:

1. SGML/XML-based solutions, e.g. „milestones", fragmentation and references.
2. Alternative markup languages with different data models and different syntax, e.g. MECS7, TexMECS8 or LMNL.
3. Relocating the assertion of a tree for each hierarchy from the markup to the processing of a document, as the JITTs technology suggests.

Traditional Solutions

There are several ways to address this problem with traditional hierarchical markup languages. Solutions such as CONCUR (a feature of SGML), „milestones" from the TEI Guidelines, fragmentation, references such as XPointer, or separate annotations9 are all useful, but none provides a definitive solution. Their main disadvantage is the lack of an appropriate data model for describing text with overlapping structures. For this purpose, special systems with special data models, such as the „Multi-Element Code System" (MECS) or LMNL, have been developed. MECS has a very complex syntax compared to LMNL, which is much easier to learn. LMNL, on the other hand, suffers from a lack of software tools for creating, editing, validating or even displaying documents, and from missing features such as practical query languages.

There is apparently a demand for a solution that combines the advantages of both approaches: on the one hand a data model explicitly developed for overlapping structures, on the other hand the wide range of existing tools for XML, including technologies such as XSLT and XPath. For this purpose we need to distinguish between the data model and the syntax, which this paper tries to show by combining the LMNL data model with XML syntax.

Data Model

A suitable data model, especially for overlapping structures without any hierarchy, is the LMNL data model, which is based on the „Attributed Range Algebra" (ARA). In ARA a document is treated as a sequence of characters spanned by a number of Ranges. Each Range has a name, a start index and a length, and is represented in the document through markup. Ranges can have Annotations, which are not allowed to overlap but can themselves be annotated and structured. An LMNL document is organized in Layers that overlay each other. Every Layer except the text layer, which is the lowest Layer of every document, has a base Layer and can have other Layers as overlays.
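The description of Ranges above can be made concrete with a small sketch. The following Python code is only an illustration under stated assumptions, not an implementation of ARA or LMNL: the class name `Range`, the `annotations` field (simplified here to a flat list of strings) and the helper `overlaps` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    """A named span over the character sequence of a document (illustrative only)."""
    name: str
    start: int    # index of the first character covered by the Range
    length: int   # number of characters covered
    annotations: list = field(default_factory=list)  # simplified: flat list of strings

    @property
    def end(self):
        # Index just past the last character covered by the Range.
        return self.start + self.length


def overlaps(r1, r2):
    """True if the Ranges share characters but neither contains the other."""
    shares_text = r1.start < r2.end and r2.start < r1.end
    nested = (
        (r1.start <= r2.start and r2.end <= r1.end)
        or (r2.start <= r1.start and r1.end <= r2.end)
    )
    return shares_text and not nested


# The text layer is a plain character sequence; the Ranges lie over it.
text = "text with overlapping parts"
a = Range("a", 0, len("text with overlapping"))
b = Range("b", text.index("overlapping"), len("overlapping parts"))
```

Because each Range is just a name plus two integers, two Ranges can cover partly the same characters without either having to be the parent of the other, which is exactly what a tree cannot express.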

Such a model apparently cannot be represented as a tree, so at first glance it seems unsuitable for combination with XML. However, the following syntax, which attempts to implement the „flat subset"10 of the LMNL data model, might be a practical solution.

Syntax

According to the LMNL developers, LMNL is a data model that can be combined with different syntaxes. It therefore seems natural to combine the advantages of LMNL with the large number of existing, valuable tools for processing XML.

The intended proximity of LMNL to XML makes a transformation of the data model to XML syntax much easier than, for example, a transformation from MECS to XML. It must be said, however, that some typical features of LMNL cannot be converted to XML.11

Below is a classic example of overlapping structures, which is not well-formed XML:

<a>text with <b>overlapping</a> parts</b>
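That an XML parser rejects this fragment can be checked directly. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the wrapper element `root` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a single root element."""
    try:
        # The wrapper element is needed because a fragment may start with plain text.
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


overlapping = "<a>text with <b>overlapping</a> parts</b>"
nested = "<a>text with <b>overlapping</b></a> parts"
```

Here `is_well_formed(overlapping)` is false because `</a>` closes while `<b>` is still open, whereas the properly nested variant parses without error.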

In LMNL syntax the example would look as follows:

[a}text with [b}overlapping{a] parts{b]

In XfOS syntax the example would look as follows:

<a type="start"/>text with <b type="start"/>overlapping<a type="end"/> parts<b type="end"/>
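For unannotated Ranges, the mapping from LMNL tags to XfOS milestone elements is mechanical. The Python sketch below illustrates it with a regular expression. The function name `lmnl_to_xfos` is hypothetical, and the code handles only simple named start and end tags, not Annotations or any other LMNL feature.

```python
import re
from xml.sax.saxutils import escape

# Matches an LMNL start tag such as "[a}" or an end tag such as "{a]".
TAG = re.compile(r"\[(\w+)\}|\{(\w+)\]")


def lmnl_to_xfos(lmnl):
    """Rewrite unannotated LMNL tags as empty XML elements with a type attribute."""
    out = []
    pos = 0
    for match in TAG.finditer(lmnl):
        # Text between tags is copied, escaping characters that are special in XML.
        out.append(escape(lmnl[pos:match.start()]))
        if match.group(1):
            out.append(f'<{match.group(1)} type="start"/>')
        else:
            out.append(f'<{match.group(2)} type="end"/>')
        pos = match.end()
    out.append(escape(lmnl[pos:]))
    return "".join(out)


xfos = lmnl_to_xfos("[a}text with [b}overlapping{a] parts{b]")
```

Because every LMNL tag becomes a self-contained empty element, the result is well-formed regardless of how the Ranges overlap.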

At first sight the XfOS form looks like the milestone solution from the TEI Guidelines. But if the Ranges are annotated, which in LMNL can be done in either the start tag or the end tag, the markup becomes more complex.

LMNL:

[a [c}annotation in start tag{]}text with [b}overlapping{a] parts{b [c}annotation in end tag{]]

XfOS:

<a type="start"> <c>annotation in start tag</c></a>text with <b type="start"/>overlapping<a type="end"/> parts<b type="end"> <c>annotation in end tag</c></b>

The resulting document looks like XML, and it is indeed well-formed, but in fact it is LMNL. XML for overlapping structures is in this case a mixture of milestones and fragmentation. Every start tag and end tag of a Range is represented by its own XML element, which carries a "type" attribute indicating whether it is the Range's start tag or end tag. If a Range is not annotated, its XML elements are empty. For an annotated Range, the element representing the start tag or the end tag contains the Annotation as its content. This is possible because Annotations can be structured but are not allowed to overlap each other.
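How an existing XML parser could be used to recover the Ranges from such a document can be sketched as follows. This is a hedged illustration in Python using the standard `xml.etree.ElementTree` module, not the author's implementation. The function name `read_xfos`, the wrapper element `doc` and the dictionary layout of the result are assumptions. Same-named Ranges are paired on a last-opened, first-closed basis, which the paper does not specify.

```python
import xml.etree.ElementTree as ET


def read_xfos(fragment):
    """Return the text layer and a list of Ranges found in an XfOS fragment."""
    root = ET.fromstring(f"<doc>{fragment}</doc>")
    text = root.text or ""
    open_ranges = {}  # Range name -> stack of (start offset, annotations)
    ranges = []
    for el in root:
        # Child elements of a milestone are its Annotations (name, content).
        annotations = [(child.tag, child.text or "") for child in el]
        if el.get("type") == "start":
            open_ranges.setdefault(el.tag, []).append((len(text), annotations))
        elif el.get("type") == "end":
            start, start_annotations = open_ranges[el.tag].pop()
            ranges.append({
                "name": el.tag,
                "start": start,
                "length": len(text) - start,
                "annotations": start_annotations + annotations,
            })
        # Text following a milestone element belongs to the text layer.
        text += el.tail or ""
    return text, ranges


sample = (
    '<a type="start"> <c>annotation in start tag</c></a>text with '
    '<b type="start"/>overlapping<a type="end"/> parts'
    '<b type="end"> <c>annotation in end tag</c></b>'
)
text, ranges = read_xfos(sample)
```

The parser only ever sees a flat sequence of elements and text; the overlap is reconstructed afterwards from the offsets, which is where the LMNL data model takes over from the XML syntax.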

Perspective

At this stage of development the approach outlined above seems to be a makeshift solution as well, but if we accept the distinction between data model and syntax, we can use the benefits of both. The possibility of using a popular syntax that most people are familiar with, in combination with a large number of existing tools, is an especially important advantage. In this scenario we only have to learn one more data model, and there is no need to develop new editors or parsers. The possibility of using an existing XML parser makes it relatively easy to adapt an application to the LMNL data model and to develop programming APIs that build on those parsers.

References

1. Renear, Allen / Mylonas, Elli / Durand, David, Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies. In: International Association for Literary and Linguistic Computing: Selected papers from ALLC, ACH Conference. Christ Church, Oxford, April 1992. Oxford 1996.
2. http://www.extrememarkup.com/extreme/2002/
3. JITTs relocate the assertion of trees from marking up to the processing of a document. This technology seems very suitable for documents containing multiple concurrent hierarchies. For documents with overlapping structures without transparent hierarchies JITTs provide no processing model.
4. www.lmnl.org. Unfortunately the website is no longer available, for unknown reasons.
5. Nicol, Gavin Thomas, Core Range Algebra. Toward a formal model of markup. 2002. http://www.mind-to-mind.com/library/papers/ara/core-range-algebra-03-2002.html
6. Nicol, Gavin Thomas, Attributed Range Algebra. Extending Core Range Algebra to Arbitrary Structures. 2002. http://www.mind-to-mind.com/library/papers/ara/attributed-range-algebra-07-2002.html
7. Huitfeldt, Claus, MECS -- A Multi-Element Code System. In: Working Papers from the Wittgenstein Archives, University of Bergen 1998. http://helmer.aksis.uib.no/claus/mecs/mecs.htm
8. Huitfeldt, Claus / Sperberg-McQueen, C. M., TexMECS. An experimental markup meta language for complex documents. 2001. http://helmer.aksis.uib.no/claus/mlcd/papers/texmecs.html
9. This method for marking up texts is mainly recommended by Andreas Witt in: Witt, Andreas, Multiple Informationsstrukturierung mit Auszeichnungssprachen. XML-basierte Methoden und deren Nutzen für die Sprachtechnologie. Dissertation, Universität Bielefeld, 2002.
10. The flat subset consists of exactly two Layers: the text layer and one overlay Layer that contains all Ranges.
11. I.e. the concepts of "anonymous" ranges, markup reduction and the declaration of entities everywhere inside a document.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004


Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
