Attributes: A Problem

Claus Huitfeldt

1. Text Objects and Their Properties
Letters of the alphabet, numbers, punctuation
marks and a few other conventional signs are basic
constituents of any text written in an alphabetical
writing system. Any computing system which is
capable of representing strings of alphanumeric
characters is therefore also capable of representing
at least the most basic linguistic contents of texts
written in such writing systems.
However, looking back on a long tradition of
manuscripts and printed books, we are definitely
not prepared to admit that there is nothing more to
texts than that. Not only can written texts contain
graphical elements such as drawings etc., but the
page layout, typography and graphical design may
also play a crucial role in identifying, emphasising
and increasing readability of parts of a text and
conveying structural relationships between them.
Since a computer text file is in a certain sense
simply a long string of characters, it is by marking
them up with reserved character combinations that
text processing systems let us represent such properties and structures. Text encoding systems such
as SGML are an attempt to systematize, generalize
and standardize such markup.
The marks or tags serve to identify specific parts
or elements of the text and to ascribe specific
properties to these elements. In a text encoded in
accordance with the TEI Guidelines for SGML
encoding, for example, we will frequently find
structures such as:
... <emph> ... </emph> ...
The start tag and the end tag (i.e., the reserved
character combinations ‘<...>’ and ‘</...>’)
indicate the start and end of an element and ascribe
to this element a property indicated by the generic
identifier (the GI, in this case ‘emph’). The TEI
Guidelines tell us that this particular GI indicates
the property ‘emphasized’, which ‘marks words or
phrases which are stressed or emphasized for
linguistic or rhetorical effect’ (TEI P3, p. 955).
Broadly speaking, SGML-encoded documents
consist entirely of such elements, which may be
nested within other elements (cf. the OHCO model
of texts in DeRose et al.). We could therefore
also say that an SGML document is an ordered
sequence of characters and markup, the markup
ascribing certain properties to parts of the sequence and indicating certain relationships between
these parts.
This seems to invite a rather clear-cut conception
of texts as collections of objects with properties.
The characters are, so to speak, the basic building
blocks or elementary particles which cannot be
decomposed any further: they are the smallest
possible objects, out of which higher-level objects,
the elements, are built. An object is either a
character string or an element, and properties can
be ascribed to such objects by GIs.
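As a minimal sketch of this conception, assuming
nothing beyond what is said above, a text can be
modelled as character strings nested inside elements
that carry a GI. The class and function names and
the sample paragraph (with the GI ‘p’) are
hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """A text element: a generic identifier (GI) plus ordered content."""
    gi: str
    content: List[Union[str, "Element"]] = field(default_factory=list)


def characters(node: Union[str, Element]) -> str:
    """Return the character string underlying a node, markup removed."""
    if isinstance(node, str):
        return node
    return "".join(characters(child) for child in node.content)


def properties(node: Union[str, Element]) -> List[str]:
    """List the GIs (properties) ascribed within a node, in document order."""
    if isinstance(node, str):
        return []
    found = [node.gi]
    for child in node.content:
        found.extend(properties(child))
    return found


# Hypothetical sample: a paragraph containing one emphasized word.
sample = Element("p", ["This is ", Element("emph", ["very"]), " important."])
```

Here characters(sample) recovers the bare character
string, while properties(sample) lists the GIs that
ascribe properties to its parts.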
2. Attributes
It may be that one and the same object has more
than one property, or that we want to classify or
qualify an ascribed property further. SGML allows us to express such features by means of
attributes:
... <foreign lang=fr> ... </foreign> ...
In this case, the GI ‘foreign’ ascribes the property
of ‘belonging to some language other than that of
the surrounding text’ (TEI P3, p. 981) to the
element, while the attribute ‘lang’ identifies this
language more specifically as being French
(indicated by the attribute value ‘fr’). In other
cases, the attribute may supply further information
about the same element:
... <foreign lang=fr rend=italics> ... </foreign> ...
The value ‘italics’ of the attribute ‘rend’ (for
‘rendition’) does not provide a further
classification or qualification of the language in
question, but indicates that the element was or
should be printed in italics.
Attributes can be useful since they allow us to
express complex structures in a regular way, allowing for various sorts of processing depending on
the purpose at hand:
<foreign lang=fr rend=italics> ... </foreign>
<emph rend=italics> ... </emph>
<name type=person reg='Smith, John'
rend=bold> ... </name>
SGML also lets us enforce rules, e.g. to the effect
that the attribute ‘reg’ is required on the GI ‘name’
but not allowed on the GIs ‘emph’ and ‘foreign’,
that the attribute ‘rend’ is allowed on all GIs and
required on ‘emph’, etc. The SGML attribute
mechanism thus gives us a powerful tool for
describing textual structures.
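In SGML itself such rules are stated in the
attribute declarations of the DTD. Purely as an
illustration of what is being enforced, the rules
just mentioned can be sketched as a small check. The
table layout and the function name are assumptions
made for this sketch, as is the decision to allow
‘lang’ and ‘type’ only where the examples above use
them.

```python
# Hypothetical rule table mirroring the example in the text:
# 'reg' is required on 'name' and not allowed on 'emph' or 'foreign';
# 'rend' is allowed on every GI and required on 'emph'.
RULES = {
    "name": {"required": {"reg"}, "allowed": {"reg", "rend", "type"}},
    "emph": {"required": {"rend"}, "allowed": {"rend"}},
    "foreign": {"required": set(), "allowed": {"rend", "lang"}},
}


def violations(gi, attributes):
    """Return a sorted list of rule violations for one start tag."""
    rule = RULES[gi]
    present = set(attributes)
    problems = [f"missing required attribute '{a}' on <{gi}>"
                for a in rule["required"] - present]
    problems += [f"attribute '{a}' not allowed on <{gi}>"
                 for a in present - rule["allowed"]]
    return sorted(problems)
```

A start tag such as <emph> without ‘rend’, or
<foreign> carrying ‘reg’, would be reported, whereas
the three tags shown above pass.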
In practice one will usually design an SGML
encoding system so that what are perceived as
primary properties are represented by GIs, whereas
attributes either qualify or classify these primary
properties or add secondary properties to those
ascribed by the GI.
What is considered primary and secondary will
vary from context to context. What for certain
purposes may be encoded like this:
<foreign rend=italics> ... </foreign>
<emph rend=italics> ... </emph>
<name rend=bold> ... </name>
may for other purposes more suitably be encoded
like this:
<italics type=foreign> ... </italics>
<italics type=emph> ... </italics>
<bold type=name> ... </bold>
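That the two encodings carry the same information
can be sketched as a mechanical conversion between
them. The function below is only an illustration
under the assumption that a tag is given as a GI
plus a dictionary of attributes; its name and
parameters are hypothetical.

```python
def swap_gi_and_attribute(gi, attributes, promote, demote):
    """Re-encode a tag so that the value of attribute `promote` becomes
    the GI and the old GI becomes the value of attribute `demote`."""
    remaining = {k: v for k, v in attributes.items() if k != promote}
    remaining[demote] = gi
    return attributes[promote], remaining


# <foreign rend=italics> becomes <italics type=foreign>, and back again.
as_italics = swap_gi_and_attribute("foreign", {"rend": "italics"}, "rend", "type")
as_foreign = swap_gi_and_attribute(as_italics[0], as_italics[1], "type", "rend")
```

Applying the conversion twice returns the original
tag, which is why the choice can look like a matter
of taste as long as each element has only one
property of each kind.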
It is sometimes said that the choice of whether to
represent a certain property as a GI or as an attribute is a matter of taste and style. But while an
element can have several attributes it can only
have one GI. This may lead to problems in cases
where one and the same element has properties
which are both indicated by GIs.
Assume e.g. that one has chosen to represent
emphasized phrases and names as GIs with attributes
as illustrated above, and that one encounters an
emphasized name printed in bold italics. Either one
must add a new attribute to the system, indicating
one of the properties which would normally have been
represented as a GI, e.g. like this:
<name type=person reg='Smith, John'
rend='bold italics' mode=emph>
... </name>
(The sole purpose of the attribute ‘mode’ is to
carry the value of what would otherwise have been
a GI.) Or one must nest two elements with the
relevant GIs inside each other, and
either duplicate common attributes or decide
which of the two elements should carry them, e.g.
like this:
<name type=person reg='Smith, John'
rend=bold>
<emph rend=italics>
... </emph></name>
Both cases seem to leave room for some slack or
even inconsistency in encoding practice, and they
mean that the same phenomena will be encoded
by different mechanisms or in different ways from
case to case.
The latter case also raises the question of which
should be the outer and which the inner of the two
elements, leaving additional room for slack and
inconsistency.
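The cost of this slack can be sketched as follows:
any application that wants to treat the variant
encodings alike has to know all of the conventions
involved. The function below is a hypothetical
normalisation, assuming that tags are given as
(GI, attributes) pairs, outermost first, and that
the attribute ‘mode’ is read as a stand-in GI.

```python
def property_set(tags):
    """Collect the properties ascribed by a stack of nested tags.

    `tags` is a list of (gi, attributes) pairs, outermost first.
    The 'mode' attribute is treated as carrying a GI, and 'rend'
    values are split on whitespace.
    """
    kinds, renditions = set(), set()
    for gi, attributes in tags:
        kinds.add(gi)
        if "mode" in attributes:
            kinds.add(attributes["mode"])
        renditions.update(attributes.get("rend", "").split())
    return kinds, renditions


# The same emphasized name in bold italics, encoded in three ways.
with_mode = [("name", {"rend": "bold italics", "mode": "emph"})]
nested = [("name", {"rend": "bold"}), ("emph", {"rend": "italics"})]
reversed_nesting = list(reversed(nested))
```

All three encodings reduce to the same set of
properties, but only because the sketch has each
convention built into it.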
3. Encoding Without Attributes
Among the aims of the Wittgenstein Archives at
the University of Bergen is to transcribe the (mostly unpublished) 20,000-page manuscript Nachlass
of the Austrian philosopher Ludwig Wittgenstein.
The encoding system used in this project is based
on MECS (cf. Huitfeldt 1993 and 1995), which in
all respects relevant for the present discussion is
identical to SGML.
In this project, we decided not to make use of
attributes at all.
Instead, a separate GI was introduced for every
possible combination of properties, i.e. for what
would otherwise have been represented as a combination of GIs and attributes. One of the reasons
for this decision was that it was rather difficult to
decide which were to count as primary and which
as secondary properties of the texts.
For example, Wittgenstein frequently marks parts of
his texts with underlining. There are several
different types of underlining, such as straight,
wavy, dotted and broken lines, and underlinings with
one, two or several lines. We know that Wittgenstein
had his personal conventions for such markings in
the manuscripts, and that the different kinds of
underlining have different meanings. We know
e.g. that a straight line means emphasis and that
wavy lines in general indicate dissatisfaction with
content or formulation, but we do not know the
exact meaning of all these conventions. And although we do know that Wittgenstein indicated
emphasis and dissatisfaction also by other means,
a lot of uncertainty of interpretation usually pertains to these other occurrences.
Therefore, we limit our interpretation of the text
to identifying the convention used; we do not
take the further step of interpreting what the
convention stands for in each individual case. That
is, we indicate the underlinings, not their meaning.
In SGML, we might have encoded all these properties with one GI and two attributes, e.g.
<u shape=s number=1> for 1 straight line,
<u shape=s number=2> for 2 straight lines,
<u shape=w number=1> for 1 wavy line,
etc.
Instead, we encode like this:
<us1> for 1 straight line,
<us2> for 2 straight lines,
<uw1> for 1 wavy line,
etc.
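The correspondence between the two schemes can be
sketched as a simple enumeration: one GI per
combination of attribute values. The codes ‘s’ and
‘w’ are taken from the examples above; the codes for
dotted and broken lines and the upper limit of three
lines are assumptions made only for this sketch.

```python
from itertools import product

# Assumed value sets; only 's', 'w' and the numbers 1 and 2 occur in
# the examples above, the rest is hypothetical.
SHAPES = {"s": "straight", "w": "wavy", "d": "dotted", "b": "broken"}
COUNTS = (1, 2, 3)


def underline_gis():
    """Map each combined GI (e.g. 'us1') to its (shape, number) pair."""
    return {f"u{code}{n}": (SHAPES[code], n)
            for code, n in product(SHAPES, COUNTS)}
```

With these assumed value sets the single GI with two
attributes is replaced by twelve GIs, one for each
combination of shape and number.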
The number of possible combinations of such
properties is considerable but limited. The number
of GIs we have to handle becomes larger than
the number we would have had to deal with had
we used a system with attributes, but it is
manageable. The Wittgenstein Archives encoded texts
for several years in this manner, and everything
seemed to function well.
4. New Problems
However, after a while we proceeded to a part of
the Wittgenstein Nachlass which did cause us
problems. Some texts have been edited by several
different individuals or by Wittgenstein himself at
different times, i.e. they are written in different
“hands”. Text originally written in one hand has
sometimes been subject to cancellation, modification or addition by a later hand, which may in turn
have been subject to alteration by a yet later hand,
etc.
E.g. a word originally written in one hand (by
Wittgenstein himself) may be underlined by a later
hand (e.g. his colleague Russell), the underlining
may have been cancelled by a third hand (e.g.
Ramsey) and the cancellation cancelled by a
fourth hand (e.g. Wittgenstein himself, thus
dismissing Ramsey and agreeing with Russell). The
third hand, which cancelled the underlining, may
also have deleted (i.e. cancelled) the word itself,
and again this cancellation may have been lifted
by a later hand, and so on.
The introduction of new GIs in order to cover all
combinations of these parameters throughout
20,000 pages became a rather hopeless business in
the long run.
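A rough count suggests why. The sketch below makes
deliberately simplified assumptions: every level of
cancellation may be made by any one of a fixed
number of hands, the hand of the underlying text is
ignored, and the figure of four hands is
hypothetical.

```python
def gis_needed(hands, max_depth):
    """Count the distinct GIs one base element needs when each level of
    cancellation, up to `max_depth` levels, may be made by any of
    `hands` hands (depth 0 is the uncancelled element)."""
    return sum(hands ** depth for depth in range(max_depth + 1))


# Growth for a hypothetical four hands and up to three levels.
growth = [gis_needed(4, depth) for depth in range(4)]
```

Under these assumptions the count grows from 1 to 5,
21 and 85 GIs for a single base element, and it has
to be multiplied by the number of base elements that
can be cancelled at all.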
It is worth noticing that the complexity which
threatened to break our system down was not the
number of properties involved (in fact, only two
properties, ‘hand’ and ‘cancelled’, were involved). Neither was it the number of values that these
properties could have (‘cancelled’ has only two
values, and the number of hands in Wittgenstein’s
manuscripts is not very large). Nor was the number of different GIs which could have these properties overwhelming.
What was special about these properties was that
they could not only be properties of a text element,
but also properties of properties, properties of
properties of properties, and so on: A word can be
cancelled (deleted), the cancellation can be cancelled, the cancellation of the cancellation can be
cancelled – and on each of these levels the cancellation in question can have the property of being
made in a different hand.
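The recursive shape of such properties can be
sketched as a data structure in which a cancellation
is itself something that can be cancelled. This is
only an illustration, not the encoding used at the
Wittgenstein Archives; the class and function names
are hypothetical, and the hands are taken from the
example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cancellation:
    """A cancellation made by some hand, itself open to cancellation."""
    hand: str
    cancelled_by: Optional["Cancellation"] = None


def in_force(mark: Optional[Cancellation]) -> bool:
    """True if the cancellation still stands after all later ones."""
    if mark is None:
        return False
    return not in_force(mark.cancelled_by)


def hands(mark: Optional[Cancellation]) -> List[str]:
    """List the hands involved, starting with the first cancellation."""
    result = []
    while mark is not None:
        result.append(mark.hand)
        mark = mark.cancelled_by
    return result


# Ramsey cancels the underlining; Wittgenstein cancels that cancellation.
on_underlining = Cancellation("Ramsey", Cancellation("Wittgenstein"))
```

Here in_force(on_underlining) is false, so the
underlining stands. The structure has no fixed
depth, and that is the feature that a fixed
inventory of GIs cannot reproduce.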
5. Multilevel Attribution
Although the question as to whether properties
should be represented in the form of GIs or attributes is mostly a practical one, we have now found
a criterion to identify certain properties that cannot
be represented with GIs only.
What is characteristic about these properties is that
they can occur at any level, that they can be used
recursively, and that there is in principle no limit
to the number of levels at which they can apply.
Since MECS has no attributes whereas SGML does,
one would think that for the Wittgenstein Archives
the above was a strong argument in favour of
SGML. However, it turns out to be just as difficult
to take the above factors into account in SGML as
in MECS. (The TEI’s encoding of certainty and
responsibility (TEI P3, pp. 521-528) and its use of
feature structure mechanisms (TEI P3, pp. 475-519)
are examples of solutions to similar problems in
SGML.)
As mentioned earlier, an SGML attribute qualifies
an element or its GI, not the other attributes of the
same element. We can of course design special
attributes for each of the levels involved on an ad
hoc basis, but their correct interpretation will
depend entirely on some specific application.
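What such an ad hoc design involves can be sketched
as follows, assuming hypothetical attributes ‘can1’,
‘can2’, etc., each naming the hand responsible for
one level of cancellation. Nothing in the markup
says that these attributes form an ordered chain;
that knowledge resides entirely in the application
code below.

```python
import re


def decode_levels(attributes, prefix="can"):
    """Rebuild the ordered chain of hands from ad hoc attributes such
    as can1=..., can2=... (hypothetical names, one per level)."""
    levels = {}
    for name, value in attributes.items():
        match = re.fullmatch(rf"{re.escape(prefix)}(\d+)", name)
        if match:
            levels[int(match.group(1))] = value
    return [levels[k] for k in sorted(levels)]


# A hypothetical underlined word whose underlining was cancelled twice.
tag_attributes = {"hand": "Russell", "can1": "Ramsey", "can2": "Wittgenstein"}
chain = decode_levels(tag_attributes)
```

The attribute declarations must also fix in advance
how many such levels exist, whereas the phenomenon
itself has no such limit.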
6. Conclusion
Multi-level properties, i.e. properties which can be
properties of properties, and which can occur at
any level of attribution, cannot be represented by
GIs and must be represented by some kind of
attribute mechanism.
However, such an attribute mechanism must have
capabilities we do not find in SGML, i.e. certain
attributes must be attributable to elements as well
as to other attributes, and this attribution must be
able to occur without restriction at any level of
recursion.
References
ISO: “Information Processing – Text and Office
Systems Standard Generalized Markup
Language (SGML)”, International Organization for Standardization, ISO 8879-1986, Geneva 1986.
C.M. Sperberg-McQueen and Lou Burnard (eds.):
“Guidelines for the Encoding and Interchange
of Machine-Readable Texts (TEI P3)”, Chicago and Oxford April 1994.
Claus Huitfeldt: “MECS – A Multi-Element Code
System”, forthcoming in Working Papers
from the Wittgenstein Archives at the University of Bergen, 1995.
Claus Huitfeldt: “MECS-WIT – A Registration
Standard for the Wittgenstein Archives at the
University of Bergen”, forthcoming in Working Papers from the Wittgenstein Archives at
the University of Bergen, 1995.
Steven J. DeRose, David G. Durand, Elli Mylonas
and Allen H. Renear: “What is Text, Really?”,
in Journal of Computing in Higher Education,
Winter 1990, Vol. I (2), pp. 3-26.

Full text license: This text is republished here with permission from the original rights holder.

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996