King's College London
In transcribing and encoding texts in XML, it is often (if not
always) the case that structures and features do not nest within
each other but overlap. Paraphrasing a notorious quotation
from Paul Maas on the most intractable problem in editing, one
may say that “there is no remedy for multiple hierarchies” [1].
Every year a considerable number of new papers and posters
about how to deal with multiple hierarchies and overlapping
structures are presented at conferences, and the new version
of the TEI includes a chapter on similar matters [2]. And yet, as
the liveliness of the debate shows, a convenient, standard and
shareable solution is still to be found.
Any kind of text potentially (and actually) contains multiple
hierarchies, such as verses and syntax, or speeches and verses.
Perhaps the most extreme form of this problem arises when
transcribing a manuscript, assuming that the transcriber wants
to describe both the content and also the physical structure
and characteristics at the same time. Pages, columns and line breaks
conflict with paragraphs and other structural and non-structural
divisions such as quotations and reported speech,
as well as deletions, additions, and changes in scribal hand.
At present three different approaches to this problem have
been proposed:
1. non-XML notation (for instance LMNL in Cowan,
Tennison and Piez 2006) to be processed by specific tools
which must be developed in-house;
2. pseudo-XML, such as colored XML or concurrent XML [3];
3. full XML approaches such as milestones, stand-off
markup [4], or fragmentation.
All of these approaches depend on post-processing tools.
Some of these tools have been developed using scripting or
other languages such as Perl, Python, or Java, and others use
XSLT-based approaches.
The milestone approach is the one chosen by the TEI, and,
consequently, by us. Nevertheless, milestones introduce
greater levels of complexity when building (X)HTML output
and for this reason they might not be used in practice.
As XSLT is the most common technology used to output XML
encoded texts, at CCH we experimented with different ways
to deal with milestones and have developed a handful of XSLT
techniques (as opposed to tools) that can be easily adapted to
different circumstances.
Clashing of two hierarchies
Let us consider, for example, the TEI P5 empty element
<handShift/>, which marks a change of scribal hand in a
manuscript. Because the element is empty, it cannot overlap with
anything in the encoding itself; problems arise, however, if one
needs to visualize both the paragraphs and the different hands
with different formatting.
In the case of an XHTML visualization, one would almost certainly
want to transform <handShift/> from an empty element
into a container, but this container could well then overlap with
existing block or inline elements such as <p>.
The easiest way to deal with this is to use the “disable-output-escaping”
technique, outputting the HTML tags as text; for
instance:
  <xsl:template match="tei:handShift">
    <xsl:text disable-output-escaping="yes">&lt;span style="color: red;"&gt;</xsl:text>
  </xsl:template>
  <xsl:template match="tei:anchor[@type='end-handShift']">
    <xsl:text disable-output-escaping="yes">&lt;/span&gt;</xsl:text>
  </xsl:template>
However, this solution presents the obvious disadvantage that
the output will not be well structured (X)HTML, and although
browsers are often forgiving and therefore may cope with this
in practice, this forgiveness cannot be relied on and so this
process cannot be recommended.
A better option is to transform the area delimited by two
<handShift/>s into a container, fragmenting it wherever it
would otherwise overlap with another element.
One possible XSLT algorithm to expand and fragment an
empty element could be:
• Loop on the <handShift/>s
• Determine the ancestor <p>
  <xsl:variable name="cur-p"
    select="generate-id(ancestor::p[1])"/>
• Determine the next <handShift/>
  <xsl:variable name="next-hs"
    select="generate-id(following::handShift[1])"/>
• Create a new non-empty element <handShift>
• Loop on all nodes after <handShift/>, up to but not
including either the next <handShift/> or the end of the
current <p>. This can be achieved using an XPath expression
similar to the following:
  following::*[ancestor::p[generate-id()=$cur-p]]
    [not(generate-id()=$next-hs)]
    [not(preceding::handShift[generate-id()=$next-hs])]
  |
  following::text()[ancestor::p[generate-id()=$cur-p]]
    [not(preceding::handShift[generate-id()=$next-hs])]
That will return:
<p> … <handShift> … </handShift></p>
<p><handShift> … </handShift> … </p>
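A minimal sketch of the same fragmentation follows, offered as an illustration rather than as the stylesheet used at CCH. Instead of the following-axis expression above, it relies on the XSLT 2.0 instruction xsl:for-each-group with group-starting-with. It assumes, for simplicity, that every <handShift/> is a direct child of a <p>, that paragraphs do not nest, and that elements are in no namespace (as in the unprefixed examples above).

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <!-- The hand still in force when this paragraph begins, if any. -->
    <xsl:variable name="inherited" select="preceding::handShift[1]"/>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Each group runs from one <handShift/> up to the next one,
           or up to the end of the paragraph. -->
      <xsl:for-each-group select="node()" group-starting-with="handShift">
        <xsl:variable name="hs"
          select="if (self::handShift) then . else $inherited"/>
        <xsl:choose>
          <xsl:when test="$hs">
            <!-- One container fragment per group: it never crosses </p>. -->
            <handShift>
              <xsl:copy-of select="$hs/@new"/>
              <xsl:apply-templates
                select="current-group() except self::handShift"/>
            </handShift>
          </xsl:when>
          <xsl:otherwise>
            <!-- Text that precedes any hand shift is copied unwrapped. -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Under these assumptions, an input such as <p>a<handShift new="#h2"/>b</p><p>c</p> should come out as <p>a<handShift new="#h2">b</handShift></p><p><handShift new="#h2">c</handShift></p>: the hand that starts in the first paragraph is closed at its end and reopened as a separate fragment in the second.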
The resulting intermediate XML could then be used to
produce XHTML that would fulfil the visualization required.
However, this would involve another XSL transformation, and
the intermediate file would not be valid against the schema
unless an ad hoc customized schema were generated for that
purpose.
Nevertheless, XSLT 2.0 makes it possible to do all this in a
single process: the result of a first transformation is stored in a
variable, and further templates are then applied to the variable itself
using the mode attribute, thus dividing the process into steps and
avoiding both external non-valid files and modifications
to the schema.
This is an example using the mode attribute.
• Declaration of variables
  <xsl:variable name="step1">
    <xsl:call-template name="one"/>
  </xsl:variable>
  <xsl:variable name="step2">
    <xsl:apply-templates select="$step1" mode="step2"/>
  </xsl:variable>
• XML to XML transformation (Step 1)
Copying the whole XML text:
  <xsl:template match="*" mode="step1">
    <xsl:copy>...</xsl:copy>
  </xsl:template>
Other templates (such as the ones described above) to transform
<handShift/>:
  <xsl:template match="handShift" mode="step1">
    [...]
  </xsl:template>
Saving the processed file into the declared variable:
  <!-- A named template without a match attribute cannot carry a mode;
       the mode is set on the apply-templates call instead. -->
  <xsl:template name="one">
    <xsl:apply-templates select="TEI" mode="step1"/>
  </xsl:template>
• XHTML transformation (Step 2)
  <xsl:template match="/" mode="step2">
    <html>...</html>
  </xsl:template>
Other templates to transform <handShift> and <p> into XHTML
elements:
  <xsl:template match="handShift" mode="step2">
    <span class="hand">...</span>
  </xsl:template>
• Output
  <xsl:template match="/">
    <xsl:copy-of select="$step2"/>
  </xsl:template>
The poster will include a comparison of the performance
of the XSLT 2.0 algorithm with that of a sequence of XSLT 1.0
transformations.
Clashing of more than two hierarchies
It is not improbable that in complex texts such as manuscripts
more than two hierarchies clash. Consequently, the difficulties
of visualization in XHTML can become more complex.
During the analysis for a CCH project devoted to the digital
edition of Jane Austen’s manuscripts of fictional texts, the
need emerged to mark up lines as block elements in order to
manage them via a CSS stylesheet.
In TEI P5 lines are marked by the <lb/> empty element, and so
it was necessary to transform these into containers. Therefore
at least three hierarchies were clashing: <handShift/>, <lb/>
and <p>.
A good way to handle the conflict could be looping on text
nodes between milestones. In the following example, all the
text nodes between <handShift/>s are expanded into container
elements and then transformed into <span> elements carrying
a class attribute. Moreover, all the lines are transformed into
further <span>s using the algorithm mentioned before in
order to manage them as block elements.
The following templates show a possible implementation of
this method.
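Step 1 XML to XML:
  <xsl:template match="text()[not(ancestor::teiHeader)]"
      mode="step1">
    <handShift>
      <!-- carry over the hand declared by the nearest preceding milestone -->
      <xsl:copy-of select="preceding::handShift[1]/@new"/>
      <xsl:value-of select="."/>
    </handShift>
  </xsl:template>
Step 2 XML to XHTML:
  <xsl:template match="handShift" mode="step2">
    <!-- {@new} is an attribute value template: the value of @new, not the literal string -->
    <span class="{@new}">
      <xsl:apply-templates mode="step2"/>
    </span>
  </xsl:template>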
Such a solution is also applicable with more than two clashing
hierarchies.
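To make the combination of the two steps concrete, here is a minimal, self-contained sketch in XSLT 2.0. It is an illustration under stated assumptions, not the project's stylesheet: elements are in no namespace (as in the templates above), paragraphs do not nest, every <lb/> is a direct child of a <p>, and the class name "line" is hypothetical. Lines are built with xsl:for-each-group, which stands in here for the fragmentation algorithm described earlier.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Step 1 is kept in a temporary tree, as in the mode example. -->
  <xsl:variable name="step1">
    <xsl:apply-templates select="/" mode="step1"/>
  </xsl:variable>

  <!-- Step 1: copy the document unchanged... -->
  <xsl:template match="@*|node()" mode="step1">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="step1"/>
    </xsl:copy>
  </xsl:template>

  <!-- ...except that each text node of a paragraph written after a
       hand shift is wrapped in a container carrying that hand. -->
  <xsl:template match="p//text()[preceding::handShift]" mode="step1">
    <handShift>
      <xsl:copy-of select="preceding::handShift[1]/@new"/>
      <xsl:value-of select="."/>
    </handShift>
  </xsl:template>

  <!-- Step 2: build the XHTML from the temporary tree. -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="$step1//p" mode="step2"/>
      </body>
    </html>
  </xsl:template>

  <!-- Each run of nodes from one <lb/> to the next becomes one line. -->
  <xsl:template match="p" mode="step2">
    <p>
      <xsl:for-each-group select="node()" group-starting-with="lb">
        <span class="line">
          <xsl:apply-templates select="current-group()" mode="step2"/>
        </span>
      </xsl:for-each-group>
    </p>
  </xsl:template>

  <!-- Containers created in step 1 become spans classed by hand. -->
  <xsl:template match="handShift[node()]" mode="step2">
    <span class="{@new}">
      <xsl:value-of select="."/>
    </span>
  </xsl:template>

  <!-- The original empty milestones produce no output of their own. -->
  <xsl:template match="lb | handShift[not(node())]" mode="step2"/>

</xsl:stylesheet>
```

A CSS rule such as span.line { display: block; } would then present each line as a block. Note that any content before the first <lb/> of a paragraph, whitespace included, forms a line of its own in this sketch, and that an <lb/> nested inside a phrase-level element is not handled.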
Even though this approach can be applied generically, a deep
analysis of the needs for representation and visualization is
required in order to develop more customized features. For
instance, the need to show lines as block elements has caused
other hierarchical clashes that have been resolved using
customized applications of the algorithms explained above.
According to project requirements, in fact, if the apparently
innocuous element <lb/> is used to produce non-empty
elements in the output, any TEI element at a phrasal level is
potentially overlapping and requires a special treatment.
The poster may be seen as providing an XSLT Cookbook for
multiple hierarchies (the XSLT code will be available as just
such a cookbook from the TEI wiki). In our opinion, simple
recipes are better for encoding multiple hierarchies than a tool
is, even a customizable one. The flexibility and the extensibility
of the TEI encoding schema allow for an almost infinite
combination of elements, attributes and values according to
the different needs of each text. Since the first release of the
TEI Guidelines, the Digital Humanities community has learnt
to enjoy the flexibility of SGML/XML based text encoding, but
such advantages come at a price, such as the difficulty of
creating generic tools able to accommodate the specific needs
of every single project.
Furthermore, even assuming that a finite combination of
elements, attributes and values could be predicted at input
(considerably limiting the possibilities offered by the TEI
schema), the potential outputs are still infinite. This is why the
most successful technology for processing text encoded in
XML is either an equally flexible language, XSLT, or tools
that are based on such a language but that still require a high
degree of customization.
Therefore, sharing methodologies and approaches within
the community, though disappointing for those looking for
out-of-the-box solutions, is perhaps the most fruitful line of
development in the field of multiple hierarchies.
Notes
1 “Gegen Kontamination ist kein Kraut gewachsen”, in Maas 1927.
2 “Non-hierarchical structures” in Burnard and Bauman 2007 at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html.
Something of the kind for SGML was also in TEI P3.
3 See Sperberg McQueen 2007 for an overview.
4 Burnard and Bauman 2008, at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASO
References
Lou Burnard and Syd Bauman (2007), TEI P5: Guidelines for
Electronic Text Encoding and Interchange, available at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html
(11/12/07).
Paul Maas (1927), Textkritik, Leipzig.
Alexander Czmiel (2004), XML for Overlapping Structures
(XfOS) using a non XML Data Model, available at
http://www.hum.gu.se/allcach2004/AP/html/prop104.html#en
(11/16/07).
John Cowan, Jeni Tennison and Wendell Piez (2006), LMNL
Update. In “Extreme Markup 2006” (slides available at
http://www.idealliance.org/papers/extreme/proceedings/html/2006/Cowan01/EML2006Cowan01.html
or at http://lmnl.net) (11/16/07).
Patrick Durusau and Matthew Brook O’Donnell (2004),
Tabling the Overlap Discussion, available at
http://www.idealliance.org/papers/extreme/proceedings/html/2004/Durusau01/EML2004Durusau01.html
(11/16/07).
Michael Sperberg McQueen (2007), Representation of
Overlapping Structures. In Proceedings of Extreme Markup
2007 (available at
http://www.idealliance.org/papers/extreme/proceedings/xslfo-pdf/2007/SperbergMcQueen01/EML2007SperbergMcQueen01.pdf)
(15/03/2008).
Hosted at University of Oulu
Oulu, Finland
June 25, 2008 - June 29, 2008
Conference website: http://www.ekl.oulu.fi/dh2008/
Series: ADHO (3)
Organizers: ADHO