Cultural Studies - Technische Universität Dortmund
In many contexts, e.g. in interdisciplinary research, in science journalism and in specialised lexicography, readers search for information in a scientific domain in which they have prior, but not expert, knowledge. Their time is constrained, and they have to solve a very specific type of problem. In scenarios like these, users often read excursively and perceive only parts of longer documents. When these documents are sequentially organised, i.e. designed to be read from beginning to end, this selective reading may result in coherence problems. For example, a reader jumping right into the middle of a sequential document may not understand (or may misunderstand) a paragraph because they lack the prerequisite knowledge given in the preceding text.
The approach presented here generates hypertext views on sequential documents with the goal of avoiding such coherence problems and making selective reading and browsing more efficient and more convenient than would be possible with print media.
II. Strategies for the Generation of Hypertext Views
In contrast to other approaches to text-to-hypertext conversion, we generate hypertext views as additional layers while preserving the original sequence and content of the sequential documents. Thus, readers still have the option to perceive the documents in their original sequential form, provided they have the time to do so. The hypertext views are an additional offer for those readers who only have time for selective reading.
Our approach processes information coming from two levels:
1) On the document level, the documents are annotated with respect to three annotation layers:
• On the “document structure layer” we annotate structural units (such as chapters, paragraphs, footnotes, enumerated and unordered lists) using an annotation scheme derived from DocBook.
• On the “terms and definitions layer” we annotate
occurrences of technical terms as well as text
segments in which these terms are explicitly defined.
• On the “cohesion layer” we annotate text-grammatical information of various types, e.g. co-reference, connectives and text-deictic expressions (cf. Holler et al. 2004).
While the annotation was performed manually in the first phase of the project, we are currently investigating methods for automatic annotation (cf. Storrer & Wellinghoff 2006). A schematic sketch of such layered annotation is given after this list.
2) On the domain knowledge level, we represent the semantics of the technical terms occurring in the documents in a WordNet-style representation that we call “TermNet”.
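To give a flavour of this layered annotation, the following schematic sketch shows how the same paragraph might be marked up on the three layers. The element names, attribute names and identifiers are purely illustrative assumptions and do not reproduce the project's actual DocBook-derived scheme or DTDs; the primary text is only hinted at by fragments (cf. figure 1 for the anaphor).

    <!-- document structure layer (DocBook-derived; names illustrative) -->
    <para id="p17">... der Ebenen ... Module ...</para>

    <!-- terms and definitions layer: occurrences of technical terms and the
         segments defining them (names illustrative) -->
    <para id="p17">... der Ebenen ... <term concept="modul">Module</term> ...</para>

    <!-- cohesion layer: anaphors, connectives, text-deictic expressions; the
         antecedent ids would point to noun phrases marked in preceding paragraphs -->
    <para id="p17">... <anaphor antecedents="np04 np07 np09">der Ebenen</anaphor> ... Module ...</para>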
Using the annotations of the document structure layer, we generate our hypertext views with the segmentation principle “one-paragraph-is-one-node” as a starting point (which we refine in various ways). The rationale behind this is the expectation that paragraphs should be self-contained and can thus be used as basic “building blocks” for a hypertext (Hammwöhner 1997). To support the selective reading of the resulting hypertext nodes, we enrich these nodes using two types of strategies:
1) Reconstructing cohesive closedness: Paragraphs in sequential documents often contain cohesion markers that point to information located outside the node – either in the preceding or in the subsequent text. Examples of such cohesive features are connective particles (“furthermore”) and anaphoric expressions (“the aforementioned specification”).
In our hypertext views we try to reconstruct cohesive closedness by liberating these cohesive markers from their linkage to a specific reading path. For this purpose, we implemented four basic operations (two of which are illustrated in the sketch following strategy 2 below):
• Anaphora resolution: In the case of anaphoric expressions, their antecedents are displayed in a popup (cf. figure 1).
• Linking: Text-deictic expressions such as “the
aforementioned specification” are transformed into
links connected to the respective text segments.
• Deletion: Connectives that primarily serve to create a fluent text (e.g. “yet”) are deleted.
• Node expansion: When a connective is directly bound to the preceding or subsequent text (e.g. “furthermore”), we provide the option to expand the current node and display the preceding or subsequent paragraph. With this option, users may accumulate as much context as they need to properly understand the node’s content.
2) Linking according to knowledge prerequisites: With these strategies we offer additional information that may be helpful for selective text comprehension. In our approach we concentrate on information related to the meaning of technical terms, because for our user scenario – the rapid search for information in a scientific domain – technical terms play a central role. In our hypertext views we offer two options to help selective readers better understand the terms and their underlying concepts:
• On the basis of the “terms and definitions layer”, we link the terms occurring in the documents to their definitions within the same document (cf. figure 2).
• On the basis of our TermNet, we generate glossary views, which show how a given term is linked to other terms and concepts of the domain (cf. figure 3). These glossary views also contain hyperlinks to the text segments in which the respective terms are explicitly defined. The glossary views are connected to all term occurrences in the documents, but the glossary can also be used as a stand-alone component.
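As a minimal sketch of how some of these strategies might be expressed in the XSLT transformation described in section III, the “Linking” and “Deletion” operations of strategy 1 and the term-to-definition links of strategy 2 could look roughly as follows. The element and attribute names are illustrative assumptions, not the project's actual markup, and the handling of ordered definition lists is omitted.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Linking: a text-deictic expression becomes a hyperlink to the
           text segment it refers to -->
      <xsl:template match="deictic">
        <a href="#{@target}"><xsl:apply-templates/></a>
      </xsl:template>

      <!-- Deletion: connectives that merely serve text fluency are dropped -->
      <xsl:template match="connective[@function='fluency']"/>

      <!-- Term-to-definition links: each term occurrence is linked to a
           definition of its concept within the same document -->
      <xsl:key name="def-by-concept" match="definition" use="@concept"/>

      <xsl:template match="term">
        <a href="#{generate-id(key('def-by-concept', @concept)[1])}">
          <xsl:apply-templates/>
        </a>
      </xsl:template>

      <xsl:template match="definition">
        <div id="{generate-id()}" class="definition">
          <xsl:apply-templates/>
        </div>
      </xsl:template>

    </xsl:stylesheet>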
Figure 1: The NP anaphor “der Ebenen” (English: “the levels”) is coreferent with each of the text segments “Die Speicherebene” (storage level), “der konzeptionellen Ebene” (conceptual level) and “der Präsentations- und Interaktionsebene” (presentation and interaction level), which occur in the preceding text. All three antecedents are displayed in a popup window.
Figure 2: Popup showing a definition of a term.
Figure 3: Glossary view of the term “modul”, including a link to its definition (top left).

We implemented our strategies using a German corpus of 20 non-fictional documents (103,805 words) belonging to two specialised research domains, namely text technology and hypertext research.
III. Implementation Issues
All data – the TermNet as well as the annotated
documents in our corpus – are represented and processed using XML technology.
The TermNet is represented using the XML Topic Maps (XTM) standard (Pepper & Moore 2001). To provide easy access to the data, we have developed an XSLT key library for XML Topic Maps (a key being comparable to an associative array in other programming languages). We use this library to perform simple consistency checks and to automatically infer additional TermNet relations which may be of interest to the hypertext user browsing the glossary. In a subsequent step, the TermNet, together with the newly inferred relations, is transformed into the hypertext glossary, again using the key library. This way, the TermNet itself can be stored and maintained free of redundant (and therefore error-prone) information.
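The key library itself is not reproduced here. As a minimal sketch of the underlying mechanism, assuming a plain XTM 1.0 topic map, an xsl:key can index topics by their id so that topic references of the form href="#id" are resolved efficiently:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xtm="http://www.topicmaps.org/xtm/1.0/"
        xmlns:xlink="http://www.w3.org/1999/xlink">

      <!-- index every topic of the map by its id -->
      <xsl:key name="topic-by-id" match="xtm:topic" use="@id"/>

      <!-- resolve a topicRef (href="#id") to the referenced topic's base name -->
      <xsl:template match="xtm:topicRef">
        <xsl:value-of select="key('topic-by-id',
            substring-after(@xlink:href, '#'))/xtm:baseName/xtm:baseNameString"/>
      </xsl:template>

    </xsl:stylesheet>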
The documents of our corpus are annotated with respect to the three annotation layers described above: (1) the document structure layer, (2) the terms and definitions layer, (3) the cohesion layer. Following the approach developed by Witt et al. (2005), we store the three annotation layers in separate files. Thus, each layer can be annotated and maintained separately and can be validated against its corresponding document grammar (DTD or schema file). In a subsequent unification step, the annotation layers of a corpus document are merged. The resulting unified representation is the basis for another XSLT transformation, which automatically generates the hypertext views according to our linking and segmentation strategies:
• Information from the “document structure layer” is used to generate an overall layout, to extract a table of contents and to perform text segmentation.
• On the basis of the “terms and definitions layer”, we generate hyperlinks from each technical term to an ordered list of definitions of that term (the order being based on the definition type and on the position of the term occurrence relative to the definition).
• The “cohesion layer” is used for the reconstruction of cohesive closedness by means of one-to-one links, one-to-many links, anaphora resolution, deletion and node expansion (an illustrative sketch of node expansion follows this list).
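As an illustrative sketch of node expansion (not the project's actual code; the element names and the generated markup are assumptions), an XSLT template could embed the preceding paragraph in the generated node, hidden by default and revealed on demand by a small inline JavaScript handler:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- invoked by the node-building part of the transformation (not shown) -->
      <xsl:template match="para" mode="node">
        <!-- the preceding paragraph is included but initially hidden -->
        <div id="ctx-{generate-id()}" class="context" style="display:none">
          <xsl:apply-templates select="preceding-sibling::para[1]"/>
        </div>
        <!-- clicking the link reveals the hidden context paragraph -->
        <a href="#"
           onclick="document.getElementById('ctx-{generate-id()}').style.display='block'; return false;">
          [show preceding paragraph]
        </a>
        <div class="node">
          <xsl:apply-templates/>
        </div>
      </xsl:template>

    </xsl:stylesheet>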
All hypertext features are implemented in HTML and JavaScript, the target languages of the XSLT transformation. This XSLT transformation is straightforward, but it has one disadvantage: each time a linking or segmentation strategy is modified, or the document grammar of one of the three annotation layers is changed, it is necessary to adjust the (rather complex) programming code. In our project, where hypertextualisation strategies are to be tried out and tested, this turned out to be tedious (and sometimes error-prone) work.
In order to address this problem and facilitate the flexible testing and modification of our strategies, we designed the Hypertext Transformation Language (HTTL), a declarative language for expressing hypertextualisation rules (cf. Lenz in preparation). The language was designed with a hypertext expert in mind who writes a set of rules for linking and segmentation strategies; it offers general rules operating on abstract hypertext notions such as “hypertext nodes”, “one-to-many links” or “node expansions”. The expert may formulate segmentation and linking rules in HTTL, which is easier to learn than XSLT and allows for flexible experimentation with, and refinement of, hypertext conversion rules. HTTL also allows the user to apply different rule sets in different situations; for example, a more coarse-grained segmentation strategy can be chosen for long text types.
Once stated, these rules are compiled into an executable XSLT program that performs the actual transformation of the annotated sequential documents into hypertext.
Although some of the linking rules in HTTL may become
rather complex (allowing for the generation of, e.g., one-to-many links with ordered link ends or node expansions),
hypertextualisation rules in HTTL are much more concise – and thus more maintainable – than the generated XSLT code.
References
Hammwöhner, R. (1997). Offene Hypertextsysteme. Konstanz: Universitätsverlag Konstanz.
Holler, A., Maas, J. F. and Storrer, A. (2004). Exploiting coreference annotations for text-to-hypertext conversion. In: Proceedings of LREC 2004, May 2004, Lisbon, pp. 651-654.
Lenz, E. A. (in preparation). Hypertext Transformation Language (HTTL). In: Metzing, D. and Witt, A. (eds.): Linguistic Modeling of Information and Markup Languages. Dordrecht: Springer.
Pepper, S. and Moore, G. (2001). XML Topic Maps (XTM) 1.0. Topic-Maps.Org specification, March 2001. URL http://www.topicmaps.org/xtm/1.0/.
Storrer, A. and Wellinghoff, S. (2006). Automated
detection and annotation of term definitions in
German text corpora. Accepted for LREC 2006.
Witt, A., Goecke, D., Sasaki, F. and Lüngen, H. (2005). Unification of XML Documents with Concurrent Markup. Literary and Linguistic Computing, 20(1): 103–116.