Text as a Data Type

Manfred Thaller

Authorship

1. Manfred Thaller

Max-Planck-Institut für Geschichte

Parent session

Digital Manuscripts: Editions v. Archives, Manfred Thaller, Elli Mylonas

Original URL

https://web.archive.org/web/20000919103936/http://www.hit.uib.no/allc/thaller.pdf

Work text

The administration of nonlinearly encoded texts
on computer systems has traditionally been seen
as a relatively high-level problem in systems design. That is, the relevant systems usually take a rather
traditional approach to low-level programming
and add the functionality required for
the presentation of nonlinear texts, and/or for linkages
between digitally represented and transcribed
texts, by specifying appropriate functionality within the environment of the particular application
system.
This creates problems when a nonlinear property
transcends a specific component of an application.
Assume, e.g., a text which is marked up according
to two overlapping hierarchies. While these two
hierarchies are represented in the markup as equally
important, “loading” the text into almost any target system usually means that a browser converts
the text into a data object which represents exactly
one hierarchy and simply ignores the other. That
is very convenient from the point of view of the
target system: when it is being implemented, the
whole question of co-existing hierarchies can be
ignored. It is hard to see the point, however, of designing a markup scheme
which allows for overlapping hierarchies, only to
lose this property when the text is actually
browsed into an application.
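As a rough illustration of what is lost, the following C sketch keeps two overlapping hierarchies side by side as lists of character-offset spans over the same text. All names in it (Span, Hierarchy, span_at) are assumptions made for this illustration and are not taken from the software discussed in the paper.

```c
#include <assert.h>
#include <stddef.h>

/* A labelled span [start, end) of character offsets within one text. */
typedef struct {
    size_t start, end;
    const char *label;
} Span;

/* One hierarchy: a list of spans that do not overlap each other,
   although they may overlap the spans of another hierarchy. */
typedef struct {
    const char *name;
    const Span *spans;
    size_t count;
} Hierarchy;

/* Returns the span of hierarchy h that covers offset pos, or NULL. */
const Span *span_at(const Hierarchy *h, size_t pos)
{
    assert(h != NULL);
    for (size_t i = 0; i < h->count; i++)
        if (pos >= h->spans[i].start && pos < h->spans[i].end)
            return &h->spans[i];
    return NULL;
}
```

For the text "John Smith, baker", a page hierarchy with spans [0,7) and [7,17) and a field hierarchy with a "name" span [0,10) overlap: offset 8 lies on the second page and inside the name. A target system that keeps only one of the two lists can no longer answer the other question.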
As a solution we propose to implement a data type
“extended string”, which replaces the traditional
concept of a “string” in programming application
systems. This means that any application program
accepts “external information”, which is browsed
into the internal extended string representation,
processed in that form and re-converted into some
kind of external representation before being displayed on an appropriate medium. This is far from
new, of course: a good example might be a system
like X-Windows where this general logic is used
to allow an application programmer to manipulate strings with completely traditional tools, while the
internal string representation takes care of all
aspects of processing necessary to handle font
properties in display.
We assume, however, that this logic can be carried
considerably further. Let us assume that a given
text is marked up by two overlapping hierarchies,
one representing the division of the text into reference units, like pages, and the other some semantic division, like the names of the fields of a database
system to which a specific substring belongs.
Even if the text is marked up in a way which
preserves both types of division, once it is browsed
and loaded into the underlying database structure
we will normally no longer be able
to access the reference units. More explicitly: if
such a text is browsed into a database system
which has been implemented in C, the function call
strcmp(name1,name2)
will yield the same value, irrespective of whether
name1 and name2 are contained on the same page
or not.
To change this, we propose the implementation of
a data type “extended string”, which has a comparison function
estrcmp(environment,name1,name2)
which by default should act just as strcmp() above.
If, however, it is preceded within an application program
by a call to an environment-changing function
estrsetsensitive(environment,PageSensitivity,On)
any following call to
estrcmp(environment,name1,name2)
should result in different return values, reflecting
whether name1 and name2 are on the same page
or on different ones.
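The behaviour described above can be sketched in C as follows. Only the names estrcmp, estrsetsensitive, PageSensitivity and On come from the text; the struct layouts, the Off value, the pointer-based signatures and the choice to order equal strings by page number are assumptions made for this sketch, not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed flag and switch values; the text names only PageSensitivity and On. */
typedef enum { PageSensitivity = 1 } EstrSensitivity;
typedef enum { Off = 0, On = 1 } EstrSwitch;

/* Environment: records which properties a comparison takes into account. */
typedef struct {
    unsigned sensitive; /* bit set of EstrSensitivity flags */
} EstrEnv;

/* Extended string: the characters plus the reference unit they sit in. */
typedef struct {
    const char *chars;
    int page; /* assumed representation of the page of this substring */
} EStr;

/* Switches one sensitivity flag of the environment on or off. */
void estrsetsensitive(EstrEnv *env, EstrSensitivity what, EstrSwitch state)
{
    assert(env != NULL);
    if (state == On)
        env->sensitive |= (unsigned)what;
    else
        env->sensitive &= ~(unsigned)what;
}

/* Acts like strcmp() by default. With page sensitivity on, strings with
   identical characters on different pages compare as unequal. */
int estrcmp(const EstrEnv *env, const EStr *a, const EStr *b)
{
    assert(env != NULL && a != NULL && b != NULL);
    int r = strcmp(a->chars, b->chars);
    if (r == 0 && (env->sensitive & PageSensitivity))
        r = (a->page > b->page) - (a->page < b->page);
    return r;
}
```

With an environment initialised to zero, two occurrences of the same name on pages 3 and 7 compare as equal; after a call to estrsetsensitive(&env, PageSensitivity, On) the same comparison returns a non-zero value.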
Taking examples from a series of ongoing projects
which use experimental software based on the concept of a data type “extended string” as introduced
above, the proposed paper first discusses some
practical problems of its implementation and the interrelationship of such an implementation with existing programming tools, taking as an example the
embedding of the data type into an X-compatible
widget.
It should be emphasized again that this is just an
introductory example: the number of string properties handled in this way is rather large and goes
considerably beyond the scope of overlapping
hierarchies. A complete description of the concept
of an extended string can be found in M. Thaller,
“The Processing of Manuscripts”, in: M. Thaller
(Ed.), Images and Manuscripts in Historical Computing (= Halbgraue Reihe zur Historischen Fachinformatik, vol. A 14), St. Katharinen, 1992, 97-121. All the properties in question can be divided
into three groups: (a) those which are necessary
to implement nonlinearity (from which our initial
example has been taken), (b) those which are
necessary to connect transcribed parts of a text to
bitmaps of the image it describes or the manuscript
it transcribes, and (c) those which deal with “graphic” properties of portions of a text.
In all three cases the questions raised relate to two
different fields. On the one hand they are connected to the practical dimension of programming;
this aspect is meant to be covered by the
example quoted above. On the other hand, however, the actual policies to be implemented by such
a purely technical solution heavily reflect the
conceptual decisions about what a specific property of a text actually means within the context of a
given discipline.
This shall be illustrated with regard to the question
of how much information is actually related to the
third of the three problem areas given above, the
graphic properties of a text, within historical research. Speaking at the most general level, we
consider a text to be “historical” when it describes
a situation where we know for sure neither
what the situation was “in reality” nor according to which rules it was converted into a
written report about reality. On an intuitive level
this is exemplified by cases where two people
with the same graphic representation of their names are mentioned in a set of documents: these
could be two recorded actions of the same “real”
individual, but they
could also be homographic symbols for two completely different biological entities. At a more
subtle level, a change in the color of the ink a
given person uses in official correspondence of
the 19th century could be an indication of the
original supply of ink having dried up, or of a
considerable rise of the author within the bureaucratic ranks. Let us just emphasize for non-historians that the second example is anything but artificial:
indeed, in the 19th century the different colors of comments on drafts
for diplomatic documents are
quite often the only mark identifying which
diplomatic agent added which opinion.
What these introductory examples should demonstrate is that the text – the computer-interpretable
representation of a written document – forms in
historical research an intermediate layer between
two other layers of information. At one extreme we have abstract factual knowledge about the
various entities described in a text, which allows
its interpretation; at the other there are purely
graphical characteristics of the written document,
which may carry meaning, but need not do so.
That the second problem is a genuine markup
problem is probably obvious: if we use a computer
to prepare diplomatic drafts of the 19th century for
printing, we obviously need a way to describe a
portion of the document as being “written with
blue pencil”. At the time of the first transcription this is exactly what it says, a literal description of a graphic property, though during the process of research it may well acquire a more
abstract connotation, like “author=M. Simpson”.
This could of course be interpreted as such properties being eminently suited to abstract rules for
markup, because at the time of producing the
markup we do not yet have the faintest idea what the
final representation in print, if any, of the specific
graphic property is to be. The problem, however, is
that part of the research which is supposed to be
supported – at least within an archival environment – is dedicated precisely to finding out what
the observable graphical properties mean. If a
computer system is therefore to support
historical research, as opposed to conveniently administering
the results of historical research, it
has to be capable of administering graphical properties as what they are, being able to
switch to a more abstract interpretation in time, but
always being able to fall back on what can actually
be observed.
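One way to picture this requirement is a record that stores the observed graphic property permanently and the interpretation only as an optional, revisable addition. The C sketch below is purely illustrative: the type GraphicProperty and the functions interpret, current_reading and observed_reading are hypothetical names chosen for this example and do not describe the actual extended string implementation.

```c
#include <assert.h>
#include <stddef.h>

/* A graphic property of a portion of text. The observation is fixed at
   transcription time; the interpretation may be added or withdrawn later. */
typedef struct {
    const char *observed;       /* literal description of what is seen */
    const char *interpretation; /* NULL until research assigns a meaning */
} GraphicProperty;

/* Attaches, replaces or (with NULL) withdraws the interpretation.
   The observation itself is never modified. */
void interpret(GraphicProperty *p, const char *meaning)
{
    assert(p != NULL);
    p->interpretation = meaning;
}

/* The more abstract reading if one exists, otherwise the observation. */
const char *current_reading(const GraphicProperty *p)
{
    assert(p != NULL);
    return p->interpretation != NULL ? p->interpretation : p->observed;
}

/* Always falls back to what can actually be observed. */
const char *observed_reading(const GraphicProperty *p)
{
    assert(p != NULL);
    return p->observed;
}
```

A property initialised with the observation "written with blue pencil" reads as that literal description until interpret() assigns "author=M. Simpson"; the observation remains available through observed_reading() and becomes the current reading again if the interpretation is withdrawn.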
To put it pointedly: almost all the examples
given in the discussions on standardization during
the last few years dealt with how to tag a structure
which is clearly understood and where the graphic
representation is accidental. Historical work deals
with structures in a text which we want to discover, where the graphics we see may be all the clues
we ever might get.
In conclusion, the paper shows how these considerations fit in with those that led to the first
example given, and how they can be turned into an organic
extension of an implementation such as the X-compatible “extended string widget” introduced above.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

147 works by 190 authors indexed


Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC
