Quality Assurance In Between Tags

paper
Authorship
  1. Maria Sollohub

    Wittgenstein Archives - University of Bergen

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction
In discussions on text encoding and projects employing text encoding techniques, much emphasis
tends to be placed on code structures and syntax.
But what about the text in between the tags? How
can we check the quality of the encoded text? This
paper will attempt to identify some of the problems raised by this question, describing solutions
and suggesting extensions to these solutions.
The solutions presented here are implemented in
MECS (Multi-Element Code System), the encoding scheme developed at and employed in the
work of The Wittgenstein Archives at the University of Bergen. The ISO standard SGML (Standard Generalized Markup Language) is compatible
with MECS. It is important to be aware, however, that the software solutions
illustrated here do not have to be MECS-specific:
by following the same principles, similar solutions could be, and in part have been, developed in other
encoding systems.
Quality problems in between the tags
The Wittgenstein Archives at the University of
Bergen was established in 1990 and thus has over
five years of text encoding experience. This experience has taught us the importance not only of
adopting an appropriate encoding scheme, but also
of being able to control the
quality of the text between the code tags.
The material being transcribed by The Wittgenstein Archives consists of 20,000 pages of manuscripts and typescripts. Wittgenstein’s habit of
continuously revising and rearranging his manuscripts means that they are characterised by different types of deletions, insertions, remarks in the
margins, and cross-references, often creating several alternative formulations of a single expression. Nor is it always clear which of these
formulations he finally decided upon. In terms
of transcription, The Wittgenstein Archives’ basic
minimal requirement is to be able to produce both
a strictly diplomatic and a normalised/simplified
reading version of each and every individual manuscript. We therefore need methods of checking that
this requirement is being fulfilled.
Of course, all text encoding projects deal with
different source material and will be faced with
different problems and needs. Our point here is to
emphasise that one should not overlook the importance of being able to control the quality of the
material in between the tags, in addition to being
able to check the syntax and structure of the encoding.
The following are typical problems encountered
by text encoding projects:
– the need to spell-check encoded documents
while remaining in primary format (thereby
retaining references in the encoded source
transcription)
– the need to be able to check that particular
elements satisfy particular conditions (e.g.
language requirements, dating formats, use
of specific terminology etc.)
Examples and Solutions
Consider the following example:
<s>Ich war mit <name>Heinz</name> im Theater</s>
<s><name><k>H</k>einz</name> hat gesagt: <q><c>I</c>ch war es nicht, ich <inc>hab</inc>...</q></s>
<s><q>Was hast Du?</q></s>¹
Running this piece of text through a normal spell-checker will not give sensible results. At The
Wittgenstein Archives we use a profile which can
filter an encoded transcription to produce a complete list of graphwords together with line and
column references (LRef. and CRef.). The above
example will thus produce the following list:
LRef. CRef. ich
LRef. CRef. war
LRef. CRef. mit
LRef. CRef. Heinz
LRef. CRef. im
LRef. CRef. Theater
LRef. CRef. Heinz
LRef. CRef. hat
LRef. CRef. gesagt
LRef. CRef. ich
LRef. CRef. war
LRef. CRef. es
LRef. CRef. nicht
LRef. CRef. ich
LRef. CRef. was
LRef. CRef. hast
LRef. CRef. Du
The profile works by enforcing a specific behaviour for the contents of each and every code in the
code scheme. It will be apparent, for example, that
“hab” from <inc>hab</inc> does not feature in the
list of graphwords. This is because the contents of
an <inc>..</inc> code are defined as incomplete
and are therefore not counted as a graphword.
It is also interesting to note what happens to the
contents of the <c>..</c> and <k>..</k> codes in
this filtering process. To understand this fully, one
needs to know that the filtering process automatically
changes the case of the first character of every
sentence, in order to counter capitalisation of
words whose standard form is in lower case.²
Thus, for each occurrence of <c>..</c> and
<k>..</k>, the case of the letter concerned is either
changed (<c>) or kept (<k>), so that an
appropriately normalised form of the resulting
word is sent to the word list.
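To make the principle concrete, the following is a minimal sketch in Python of such a filtering profile. It is not the Archives’ actual software: the tag names, the behaviour table and the simple start-/end-tag parsing are assumptions based on the hypothetical codes above, and the automatic case change at the start of each sentence is omitted, so <k> needs no special handling here.

```python
import re

# Hypothetical per-code behaviours, mirroring the example codes above.
DISCARD = {"inc"}   # contents are incomplete words and are not graphwords
LOWER   = {"c"}     # "change case": a lower-cased form goes to the word list
                    # ("k" = keep case needs no action in this simplified sketch)

TOKEN_RE = re.compile(r"<(/?)([a-zA-Z]+)>|([^<]+)")
WORD_RE  = re.compile(r"[^\W\d_]+")   # runs of letters count as graphwords

def graphwords(encoded_lines):
    """Yield (LRef, CRef, graphword) from encoded text.

    Column references point into the encoded source line, so every word in
    the resulting list can be traced back to the transcription."""
    open_codes = []                                  # codes may span lines
    for lref, line in enumerate(encoded_lines, start=1):
        kept = []                                    # (source column, character)
        for m in TOKEN_RE.finditer(line):
            closing, code, text = m.group(1), m.group(2), m.group(3)
            if code:                                 # a start- or end-tag
                code = code.lower()
                if closing:
                    if code in open_codes:
                        open_codes.remove(code)
                else:
                    open_codes.append(code)
            elif text:
                if DISCARD.intersection(open_codes):
                    continue                         # e.g. <inc>hab</inc>
                if LOWER.intersection(open_codes):
                    text = text.lower()              # e.g. <c>I</c>ch -> ich
                kept.extend((m.start() + i + 1, ch) for i, ch in enumerate(text))
        filtered = "".join(ch for _, ch in kept)
        for w in WORD_RE.finditer(filtered):
            yield lref, kept[w.start()][0], w.group()
```

Run over the three example lines, this sketch reproduces the list above, except that, because the sentence-initial case change is omitted, “Was” in the last sentence stays capitalised.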
We may, however, have a more specialised interest in the contents of a particular code and its
occurrences in the course of a transcription. By
altering the filtering profile very slightly, we can
define a filter which will produce a word list of the
contents of all occurrences of a single code. In the
above example, we could extract the contents of
the <name> code in this way, with the following
result:
LRef. CRef. Heinz
LRef. CRef. Heinz
Lists such as the two above can be checked for
errors in a wide variety of ways – manually, automatically or both. A commercial spell-checker
which ignored numerals (i.e. would not be confused by the line and column references) could be
used in the case of the first list.
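The second kind of list can be produced by a small variant of the same sketch, which extracts the contents of a single code together with references. For brevity it assumes, hypothetically, that an occurrence of the code opens and closes on one line:

```python
import re

def code_contents(encoded_lines, code="name"):
    """Yield (LRef, CRef, contents) for every occurrence of one code.

    Tags nested inside the occurrence (such as <k>..</k>) are stripped so
    that only the textual contents reach the list."""
    occurrence = re.compile(rf"<{code}>(.*?)</{code}>")
    inner_tags = re.compile(r"<[^>]+>")
    for lref, line in enumerate(encoded_lines, start=1):
        for m in occurrence.finditer(line):
            yield lref, m.start() + 1, inner_tags.sub("", m.group(1))
```

Applied to the example with code="name", this yields exactly the two “Heinz” occurrences listed above.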
The triviality of the above example should not
mask the potential of these tools, e.g. if we imagine
more complicated examples – more complex encoding, a larger volume of source material, several
transcribers, transcription work continuing over a
long period of time (years). Our work at The
Wittgenstein Archives has shown us again and
again just how vital it is to be able to
check the individual graphwords of a transcription. Many graphwords are “corrupted” by inserted code tags, and it is surprising how effectively
such tags can conceal misspellings. There is also
a need to check that encoding conventions are applied consistently with respect to specific codes.
Here too, different transcribers (especially over a
long period of time) can easily deviate from project-specific encoding conventions.
Below are some specific examples of “quality”
control processes in use at The Wittgenstein Archives in Bergen:
[These examples will be supplemented with extracts from transcription work at the Archives in
the form of visual aids during a presentation of
this paper.]
I. Vocabulary control and spell checking
Wittgenstein uses the languages German, English,
French and Latin, and has a wide repertoire of
Wittgenstein-specific vocabulary. The checking
process involves the production of separate lists
for each of these languages. Rather than rely on a
commercial spell-checker, we have built up master lists in each of the four languages against
which we check our language lists from each
individual transcription, adding to them any new
acceptable graphwords that result along the way.
The checking process (implemented as a single piece
of software) functions as follows; a sketch of its core comparison step follows the list:
• produce list of German graphwords
• compare list against master list of German
words
• produce list of “new” words
• ask the transcriber to check new words (references provided), adjusting transcription for
genuine mistakes
• produce new list of “new” words
• ask transcriber to repeat checking until all
“new” words are acceptable
• repeat process for English words
• repeat process for French words
• repeat process for Latin words
• send all lists of “new” words to “word list
manager” who does a final acceptability control and adds new words to master lists.
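The heart of this loop is a comparison of a transcription’s word list against the relevant master list. A minimal sketch of that step, assuming one accepted word per line in the master-list file (the function name and file format are our assumptions, not the Archives’ software):

```python
def new_words(graphword_list, master_list_path):
    """Return the graphwords, with their references, that are not yet in the
    master list; an empty result means the transcription has passed."""
    with open(master_list_path, encoding="utf-8") as f:
        known = {w.strip() for w in f if w.strip()}
    return [(lref, cref, word)
            for lref, cref, word in graphword_list
            if word not in known]
```

The transcriber checks each returned reference, corrects genuine mistakes, and the comparison is repeated until the list is empty; the words that remain then go to the word-list manager for final approval.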
II. Extraction of contents of individual codes such as “person” and “dating” codes
This is particularly relevant for a large volume of
material transcribed over a long period of time,
where it is necessary to check the consistency of
transcription conventions. For example, The Wittgenstein Archives’ encoding system expects every
occurrence of a person name to be encoded in two
parts, where the first part contains the name as it
appears in the source material, and the second part
contains a standardised form of the same name.
But who decides what the standardised form of the
name is? “Skolem, Thoralf Albert” or “Skolem,
Thoralf” or ...... ? One of the points of having such
a code is to facilitate indexing at a later date, and
it is therefore essential that only one of the many
alternatives is adopted as the standard. Extracting
a complete list of person names from the transcriptions provides an effective means of checking that
the convention is applied consistently.
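As an illustration of such an extraction-based check, the following sketch flags surnames that occur under more than one standardised form. The two-part <person> code shown here (<src> for the source spelling, <std> for the standardised form) is hypothetical; the Archives’ actual code names are not given in this paper.

```python
import re
from collections import defaultdict

# Hypothetical two-part person code, e.g.
#   <person><src>Skolem</src><std>Skolem, Thoralf</std></person>
PERSON_RE = re.compile(r"<person><src>(.*?)</src><std>(.*?)</std></person>")

def standard_form_variants(encoded_lines):
    """Group standardised name forms by surname and report any surname that
    occurs under more than one form, a likely breach of the convention that
    each person has exactly one standardised form."""
    by_surname = defaultdict(set)
    for line in encoded_lines:
        for m in PERSON_RE.finditer(line):
            std = m.group(2).strip()
            by_surname[std.split(",")[0].strip()].add(std)
    return {name: forms for name, forms in by_surname.items() if len(forms) > 1}
```

Since different people may of course share a surname, the result is a list of candidates for a human to review rather than an automatic verdict.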
The dating code presents a similar problem and is
easy to transcribe incorrectly. Whereas wrongly
coded names may also be checked against master
lists in the vocabulary control process, dates are
not so easily checked. The extraction process is a
quick and effective method of checking consistency within specific code tags in a large volume of
source material, giving reference indications directly linked to the encoded document.
The language division into German, English,
French and Latin is particularly useful for vocabulary control at The Wittgenstein Archives, but
other divisions are just as plausible – and for other
projects perhaps more useful. A need that has been
brought to our attention by The Norwegian Term
Bank in their work on terminology is that of being
able to check the use of specific vocabulary within
a particular SGML code. For example, given a code
<definition>...</definition> used for every
entry in an extensive reference work (perhaps
with more than one person responsible for the encoding),
how can we control the use of terminology within
that code? The processes described above could
provide a solution to this problem.
One step further – some possible
extensions
The Wittgenstein Archives employs a number of
tools and methods in order to achieve quality “in
between tags”, but there are still areas where we
would like even more control. Could we find a
way of defining acceptable formats for particular
codes, such that a transcriber trying to use a different format will receive an error message? This
would be useful, for example, for defining codes
that should contain solely numerical data, no numerical data, or dates in only one acceptable format.
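A minimal sketch of what such format checking might look like follows; the code names and the accepted formats are invented for the illustration.

```python
import re

# Hypothetical format rules: each code maps to a pattern that its entire
# contents must match.
FORMAT_RULES = {
    "pagenumber": re.compile(r"\d+"),               # numerical data only
    "date":       re.compile(r"\d{4}-\d{2}-\d{2}"), # one accepted date format
}

def format_errors(encoded_lines):
    """Yield (LRef, code, contents) for every occurrence whose contents do
    not match the declared format, so that the transcriber receives an
    immediate, referenced error message."""
    for lref, line in enumerate(encoded_lines, start=1):
        for code, rule in FORMAT_RULES.items():
            for m in re.finditer(rf"<{code}>(.*?)</{code}>", line):
                if not rule.fullmatch(m.group(1)):
                    yield lref, code, m.group(1)
```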
[More specific examples of what we would like to
achieve in this direction will be presented at the
conference.]
Conclusion
In the essential document analysis stage prior to a
new text encoding project, it is important not to
think merely in terms of which encoding scheme
and what kind of code tags are required, but to
spend some time considering what type of control
mechanisms will be necessary in order to check
the quality of the encoded documents. What problems will the material pose in terms of quality
control? Are there tools to deal with them? Can the
necessary tools be purchased or developed? At
The Wittgenstein Archives we have learned from
experience. In order to ensure quality transcriptions of 20,000 pages, we use an array of control
tools, checking both syntax and code contents
(vocabulary and encoding conventions). The
MECS tools described here can also be used on
SGML encoded documents, but there is nothing to
prevent similar “SGML” tools being developed
along the same principles. The main concern
should be that such tools exist and continue to
develop in order to cater for the needs of a growing
number of text encoding projects.
Notes
1 The codes used in this example are hypothetical, but are based on those used at The Wittgenstein Archives: <s> = sentence, <name> =
name, <k> = keep case, <q> = quote, <c> =
change case, <inc> = incomplete.
2 In German, all nouns (not just proper nouns)
are capitalised in their standard form. It is also
acceptable to capitalise the personal pronoun
“Du” (you). Thus, some care has to be taken
in the use and application of the “change case”
and “keep case” rules.
References
ISO: “Information Processing – Text and Office Systems – Standard Generalized Markup Language (SGML)”, International Organization for Standardization, ISO 8879:1986, Geneva, 1986.

Claus Huitfeldt: “MECS – A Multi-Element Code System”, forthcoming in Working Papers from The Wittgenstein Archives at the University of Bergen, 1995.

Claus Huitfeldt and Ole Letnes: “Encoding Wittgenstein”. Paper read at the joint ACH-ALLC Conference in Washington D.C., June 1993. Printed in Conference Abstracts, The Center for Text & Technology of the Academic Computer Center, Georgetown University, June 1993, pp. 83-85.

Claus Huitfeldt: “Manuscript Encoding: Alphatexts and Betatexts”. Paper read at the joint ACH-ALLC Conference in Washington D.C., June 1993. Printed in Conference Abstracts, The Center for Text & Technology of the Academic Computer Center, Georgetown University, June 1993, pp. 85-88.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

147 works by 190 authors indexed

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC
