Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)
Adam Mickiewicz University
CLiPS Computational Linguistics Group, University of Antwerp; Institute for the Study of Literature in the Low Countries (ISLN), University of Antwerp
Before the age of printing, texts could only be
copied by hand. This was done by scribes:
persons who made a copy of a text for their own
use or for the use of others. Often, the original
is no longer available; all that remains are copies
of the text, or copies of copies of copies. We
know that scribes made mistakes, and that they
changed spellings and wording according to
what they thought fit for their audience. And we
know that they sometimes reworked the text or
parts of it.
Until now, these uncertainties may have made
medieval texts less attractive for digital
humanists to work on. However, the complex world
of medieval textual copying is a very challenging
topic in its own right. Recently, some scholars
have tried to develop and apply digital methods
and techniques to gain insight into manual text
transmission. In this session, they will explain
which specific research questions led to their
approach, and why traditional methods did
not suffice. Then they will describe the digital
approach they developed, how they gathered
their data, and present the first results. They
will sketch the next steps for their research
and reflect on which larger questions may come
closer to an answer, and which other areas
of digital humanities will benefit from this
research.
The first paper (by Jacob Thaisen) will focus
on how the variability of spelling characteristic
of Middle English makes probabilistic models
a powerful tool for distinguishing scribes
and exemplars. The second paper (by Karina
van Dalen-Oskam) examines vocabulary and
part-of-speech frequencies as a means of
gaining insight into the influence scribes exerted
on the texts they copied. The third paper (by
Mike Kestemont) aims at erasing or minimizing
textual differences in order to assess stability
and the persistence of authorial features of
manually copied medieval texts.
Probabilistic Modeling
of Middle English Scribal
Orthographies
Jacob Thaisen
thaisen@ifa.amu.edu.pl
Adam Mickiewicz University, Poznań, Poland
With the Norman Conquest of 1066,
written English ceased to be employed for
administrative and other official purposes, and
the normative spelling conventions established
for the West Saxon variety of Old English
fell into disuse. When the language eventually
regained these crucial domains around three
centuries later, a norm for how to spell English
no longer existed. The only models available to
scribes were the practices of other languages
known to them or, increasingly as English
strengthened its position, the conventions
adopted in the exemplars from which they
copied. As a result of the interaction of all
these factors, Middle English—the English of the
period from the Battle of Hastings to Caxton's
introduction of printing from movable type in
1476—is characterized by considerable variation
in spelling, even within the output of a single
individual. There is nothing at all unusual
about one and the same scribe of this period
representing one and the same word in more
than one way, including very frequent words
such as the definite article and conjunctions.
Moreover, scribes could use the variability to
their advantage in carrying out the copying task,
for example, to adjust the length of lines or speed
up the copying process.
The variability of Middle English orthography
means it would be misguided to assume
that two texts penned by a single scribe
necessarily follow, or should follow, identical
spelling conventions. They are much more
likely to exhibit variation within bounds.
Any stylometric attribution of Middle English
texts to a single scribe or of portions of
a text in a single scribal hand to different
exemplars on the basis of spelling must take
this nature of the evidence into account. The
probabilistic methods known from statistically-
based machine translation, spell-checking,
optical character recognition, and other natural
language processing applications are specifically
designed to recognize patterns in "messy" data
and generalize on the basis of them. It is
the purpose of this paper to demonstrate that
this property of these methods makes them
adequate stylometric discriminators of unique
orthographies.
The methodologies developed in connection
with the preparation of A Linguistic Atlas of
Late Medieval English (McIntosh, Samuels,
et al. 1986) separate unique orthographies by
manual and predominantly qualitative means;
if quantitative data are collected at all,
they are subjected only to simple statistical
tests. Since texts differ lexically, they are
not readily comparable in all respects. The
Atlas solution is to generate comparability by
restricting the investigation to the subset of
the respective lexicons of the various texts
they may reasonably be expected to share,
such as function words. Spelling forms for
these words are collected from samples of
the texts by selective questionnaire and any
pattern present in their distribution detected by
visual inspection. The forms are further often
analyzed by reference to known dialect markers.
In practice, this means the researcher relates
the forms to phonological and morphological
variables, although the dialectological
literature recognizes that other levels of
language may also carry geographic
significance.
However, it is now practically feasible to
estimate the full orthography of which a given
text is a sample by building probabilistic models.
The reason is that recent years have witnessed
an increase in the amount of diplomatically
transcribed manuscript materials available in
digital form, which makes it possible to abandon
the qualitative focus. Scholars are already
subjecting the lexical variation present in
similar materials to sophisticated computer-
assisted quantitative analysis (Robinson 1997,
van Dalen-Oskam and van Zundert 2007). Their
studies point the way forward.
The building blocks of Middle English
orthographies are not individual letters but
sequences of letters of varying length which,
further, combine with one another in specific
ways, with phonograms, morphograms, and
logograms existing side by side. Every Middle
English orthography has a slightly different set
of building blocks, making n-gram models a
good type of probabilistic model for capturing
the distinct properties of each. Such a model
is simply an exhaustive listing of grams (letters
and letter sequences), each with its own
probability and weight.
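A minimal sketch of such a listing (an illustration added here, not taken from the paper; the sample string, the trigram order, and the maximum-likelihood estimates are invented assumptions):

```python
from collections import Counter

def ngram_model(text, n=3):
    """Exhaustive listing of character n-grams with maximum-likelihood probabilities."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}

# Hypothetical sample: one scribe spelling the same word in two ways.
sample = "thurgh the forest thorgh the wode"
for gram, p in sorted(ngram_model(sample).items(), key=lambda kv: -kv[1])[:5]:
    print(f"{gram!r}: {p:.3f}")
```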
"Perplexity" expresses how well a given model
is able to account for the grams found in a text
other than the one from which the model itself
is derived. That is, a model – itself a list of
grams – is compared with a list of the grams
found in another text and the measure simply
expresses the level of agreement between the
two lists. However, to find out whether the two
texts are instances of the same orthography, a
better model is one not of the text from which
it is derived but of the orthography of which
that text is a sample. This is because the lexis
of the particular text skews the probabilities of
the grams away from those they have in the
orthography as a whole. This skew can be
reduced by generalizing the model.
"Smoothing" refers to the act of (automatically)
introducing weights to achieve the best possible
generalization.
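A hedged sketch of how such a perplexity score might be computed (the add-one smoothing used here is only a placeholder for the techniques discussed next, and the two sample strings are invented): perplexity is the exponentiated average negative log-probability that the model assigns to the grams of the other text, exp(-(1/N) * sum(log P(gram))).

```python
import math
from collections import Counter

def perplexity(model_text, test_text, n=3, alpha=1.0):
    """Perplexity of an n-gram model of model_text, measured on test_text.
    Add-one (Laplace) smoothing stands in for more refined smoothing."""
    grams = Counter(model_text[i:i + n] for i in range(len(model_text) - n + 1))
    test_grams = [test_text[i:i + n] for i in range(len(test_text) - n + 1)]
    vocab = len(set(grams) | set(test_grams))
    total = sum(grams.values())
    log_p = sum(math.log((grams[g] + alpha) / (total + alpha * vocab))
                for g in test_grams)
    return math.exp(-log_p / len(test_grams))

# Invented sample strings; lower perplexity = better agreement between the two gram listings.
print(perplexity("thurgh the wode", "thorgh the forest"))
```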
Chen and Goodman (1998) carry out a
systematic investigation of how a range of
smoothing techniques perform relative to one
another on a variety of corpus sizes in terms of
the ability to account for test data. Their data
come from present-day English and their basic
unit is the word rather than, as here, the letter.
They find the technique developed by Witten
and Bell (1991) consistently to generalize the
least effectively, and that developed by Kneser
and Ney (1995), and later modified by Chen
and Goodman (1998), consistently to do so the
most effectively.
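One of these techniques, Witten-Bell interpolation, can be sketched for a character-bigram model (an illustrative sketch, not Chen and Goodman's word-level setup; the sample string is invented). The weight given to the backed-off unigram estimate grows with the number of distinct continuations observed after a history:

```python
from collections import Counter, defaultdict

def witten_bell_bigram(text):
    """Interpolated Witten-Bell character-bigram model (illustrative sketch)."""
    bigrams = Counter(zip(text, text[1:]))
    history = Counter(text[:-1])          # how often each character opens a bigram
    unigrams = Counter(text)
    followers = defaultdict(set)
    for h, w in bigrams:
        followers[h].add(w)               # distinct continuations seen after h
    n = len(text)

    def p(w, h):
        p_uni = unigrams[w] / n           # lower-order (unigram) estimate
        c_h, t_h = history[h], len(followers[h])
        if c_h == 0:
            return p_uni                  # unseen history: back off entirely
        lam = c_h / (c_h + t_h)           # Witten-Bell interpolation weight
        return lam * bigrams[(h, w)] / c_h + (1 - lam) * p_uni

    return p

# Invented sample string; compares a seen vs. an unseen continuation of 't'.
p = witten_bell_bigram("thurgh the wode and thorgh the feld")
print(p("h", "t"), p("x", "t"))
```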