Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)
Introduction
Van Dalen-Oskam and van Zundert (2007) applied
the lexical richness measure Yule’s K and the stylometric distance measure Burrows’s
Delta in an authorship attribution study on the
Middle Dutch Romance of Walewein, which we know
was written by two authors. The researchers used a
walking window to find where the second author took
over from the first author. The change of authors could
be seen in the graphs for Delta for three strata of high-frequency
words (1-50, 51-100, 101-150); it showed the
best in frequency stratum 101-150. In the highest frequency
stratum, however, the break between the two
scribes responsible for the manuscript of the text (this
break occurred much earlier in the text than the author
break) was even clearer.
Following up on this research, the researchers looked
into the distribution of parts of speech in the above-mentioned
frequency strata (van Dalen-Oskam & van
Zundert 2008). When they examined the strata, starting
with the highest frequency words (1-50) and going on
to frequency stratum 101-150, they found an increasing
percentage for the categories ‘noun’ and ‘verb’, and a
decreasing percentage for ‘pronoun’ and ‘preposition’.
This may imply that in the area in which this text showed
the clearest difference between scribes (i.e. frequencies
1-50), part of the uniqueness of these scribes lies in their
idiosyncratic use (as to frequencies) of pronouns and
prepositions. In the area in which Walewein showed the
clearest difference between authors (frequencies
101-150), part of the uniqueness of the authors lies in their
idiosyncratic frequencies of nouns and verbs. In other
words: the scribes seem to differ especially in the frequencies
of function words, whereas the authors seem
to differ especially in the frequencies of content words.
Research questions
Authorship attribution for medieval texts before the age
of printing is hampered by several practical problems.
First, we often do not know a text’s author. However,
even when the author is known, almost always his or her original manuscript has not survived; what remain
are copies of the text, or copies of copies of copies,
which are usually difficult to position in a trustworthy
family tree (stemma) of the manuscripts. Thus, we do
not have the text of the original author, but only texts
resulting from manual copying by scribes – persons who
made a copy of a text for their own use or for the use of
others (who may or may not have paid for these copies
and perhaps had special wishes for their copies). We
know that scribes made mistakes, and that they changed
spellings and wording according to what they thought
fit for their audience. And we know that they sometimes
reworked the text or parts thereof. In most cases, however,
the original text is clearly recognizable. But when a
text shows very many changes, scholars start describing
these copies as adaptations, and consider them a new text
rather than a copy, and see the scribe not as a copyist but
as an author in his or her own right. Thus, the question
is: how many changes are needed to call a text not a copy
but a new text? All in all, the extent of the influence a
scribe had on the text he or she copied has been insufficiently
studied.
The Walewein results led to new research questions.
The first was whether we could confirm the findings in a
larger set of texts. If scribes do indeed differ as regards
how frequently they use certain parts of speech, from the
attribution point of view we could ask whether it may be
possible to distinguish scribes from each other. From a
literary and philological point of view, we may expect a
clearer answer to the question in what ways scribes differed
from authors (if they did differ, that is). And from
a methodological standpoint, we would want to know
which measures yield the best results.
Corpus and method
Not many Middle Dutch texts are available in a substantial
number of copies. We chose a work by the Flemish
author Jacob van Maerlant: the Rijmbijbel (‘Rhyming
Bible’), which is a translation/adaptation of the
Medieval Latin Historia scholastica written by Petrus
Comestor. Van Maerlant finished this work in 1271, and
many fragments and fifteen manuscripts (though not all
containing all parts of the text) have been handed down to us,
dating from ca. 1285 to the end of the fifteenth century.
One of these manuscripts is available in a good edition;
it is also digitally available lemmatized and tagged for
parts of speech. Transcriptions of the other manuscripts
had to be made for this research. Because of the length
of the texts (almost 35,000 lines), we had to work with
samples. We chose 5 samples of 200-240 lines from different
parts of the text, and transcribed the parallel texts
(if available) from all 15 manuscripts, lemmatized the
samples and tagged them for parts of speech. The manuscripts
are indicated by the letters A, B, C, D, E, F, G,
H, I, J, K, L, M, N and O. We approached the samples
as ‘bags of words’ and decided to compare these bags of
words as regards vocabulary (number of unique types
and tokens) and frequencies of parts of speech. We differentiated
between ten parts of speech:
00 noun (content word)
02 proper name (content word)
10 adjective
20 main verb (content word)
21 copula / auxiliary verb
30 numeral
4* pronouns
  40 personal pronoun
  41 demonstrative pronoun
  42 relative pronoun
  43 interrogative pronoun
  44 indefinite pronoun
  45 possessive pronoun
50 adverb
70 preposition
8* conjunction
  80 coordinating conjunction
  81 subordinating conjunction
  82 comparative conjunction
For each part of speech, in each sample we measured the
absolute frequency, the relative frequency, the average
over the fifteen manuscripts, the standard deviation, the z-score
and the ranking of the manuscript in comparison with the
other fourteen manuscripts.
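The measurements listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented, the function name zscores_and_ranks is hypothetical, and the choice of the population standard deviation is a guess, since the abstract does not say which variant was used.

```python
from statistics import mean, pstdev

def zscores_and_ranks(rel_freq):
    """rel_freq maps a manuscript siglum to the relative frequency
    of one part of speech in one sample."""
    mu = mean(rel_freq.values())
    # Assumption: population standard deviation; the sample
    # standard deviation (statistics.stdev) is equally plausible.
    sigma = pstdev(rel_freq.values())
    z = {ms: (f - mu) / sigma if sigma else 0.0
         for ms, f in rel_freq.items()}
    # Rank 1 = highest relative frequency among the manuscripts.
    ordered = sorted(rel_freq, key=rel_freq.get, reverse=True)
    rank = {ms: i + 1 for i, ms in enumerate(ordered)}
    return z, rank

# Invented counts for three manuscripts: (noun tokens, all tokens).
counts = {"A": (30, 120), "B": (33, 120), "C": (27, 120)}
rel = {ms: n / total for ms, (n, total) in counts.items()}
z, rank = zscores_and_ranks(rel)

for ms in sorted(rel):
    print(ms, round(rel[ms], 3), round(z[ms], 2), rank[ms])
```

With real data, the dictionary would hold one entry per manuscript that contains the sample, and the computation would be repeated for every part of speech.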
Example
To give an impression of the material, the second line
of the ‘Judith’ episode in all fifteen manuscripts is presented
below. Judith was ‘van liue wijf van herten man’
(‘as to her body (a) woman, as to her heart (a) man’):
A Van liue wijf van herten man
B van liue wijf van herten man
C Van liue wijf van herten man.
D Van liue wijf van herte man
E van liue wijf van herten man
F Van liue wijf van herten man
G Van liue wijf. van herten man...
H Van lyue een wijf van herte een man
I Wijf van liue van herten man
J Van liue wijf van herten man
K Van liue wijf van herten man
L Van liue wijf van herten man
M Van liue wijf van herte man
N Van liue wijf van herten man
O van liue wijf van herten man
Manuscript H has two indefinite pronouns, something
that none of the other manuscripts has. Manuscript I has
a different order in the first noun phrase. The differences
in manuscript H will have an influence on word counts
and the frequencies of parts of speech, while the difference
in manuscript I will not show up in these counts.
The differences among the manuscripts for this line are
small. Many other lines show more, and more complex
differences.
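The effect of treating a line as a bag of words can be illustrated with a short Python sketch. It assumes lowercased surface forms with full stops stripped, whereas the study itself worked with lemmatized and tagged text; the helper name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(line):
    # Assumption: lowercase the tokens and strip the full stops
    # that some transcriptions contain; no lemmatization is done.
    tokens = [t.strip(".").lower() for t in line.split()]
    return Counter(t for t in tokens if t)

lines = {
    "A": "Van liue wijf van herten man",
    "H": "Van lyue een wijf van herte een man",
    "I": "Wijf van liue van herten man",
}
bags = {ms: bag_of_words(text) for ms, text in lines.items()}

for ms, bag in bags.items():
    print(ms, "tokens:", sum(bag.values()), "types:", len(bag))

# Word order is invisible to a bag of words, so I equals A ...
print(bags["A"] == bags["I"])
# ... while the extra words and spellings in H change the counts.
print(bags["A"] == bags["H"])
```

Because no lemmatization is applied here, the spelling variants lyue/liue and herte/herten also count as differences; on lemmatized text only the two additional words in H would remain.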
Fig. 1
Results
Reading through all fifteen Judith episodes, we find that
two of the manuscripts show remarkable differences
compared to the other thirteen. These manuscripts are
also the odd ones out in the statistical results. Figure 1
presents the results for the nouns (excluding names). The
horizontal axis is the mean of the frequency of nouns in
all fifteen Judith episodes. The vertical axis gives the z-scores:
the higher a manuscript is located on the graph,
the more significant the deviation from the mean. Everything
above z-score 1 may be statistically significant.
Manuscript I clearly deviates from all the others as to the
frequency of nouns. Other parts of speech yield different
results (cf. figure 2).
Fig. 2
In the mean of all codes in figure 3, only E and I are
above z-score 1. All other manuscripts are below this
line. They do, however, show some variation, and from
the point of view of the scholar, this variation is interesting.
An evaluation of all results for the different parts of
speech shows that adjectives differ the most and nouns
differ the least. Only manuscript I shows a significant
deviation in the frequency of the nouns. Other parts of
speech that often differ are pronouns, adverbs and – to
a lesser degree – auxiliary verbs/copula. Names, main
verbs and conjunctions show more differentiation than
the nouns excluding names, but not much. These results
seem to confirm that in copying a text, scribes could
indeed ‘manoeuvre’ in the area of function words (adverbs,
adjectives, etc.), but may have been kept in check
as to content words (nouns, names, main verbs). The
only manuscript that deviates wildly in the frequency of nouns is manuscript I, and it is immediately clear to the researcher that, at least in the
episode about Judith, it is a very free adaptation of the
text. In this episode, the scribe clearly has to be seen as
an author.
Fig. 3
Our paper will present a comparison with the other samples
from the same text. This comparison will show that
it is unlikely that we will be able to distinguish scribes
with this method – although, as stated, we can clearly
spot an ‘author’ among the scribes. We hope to apply
some other measures to the corpus, for example type-token
ratio, Yule’s K and Burrows’s Delta.
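Two of these measures are simple enough to sketch. The Python below is a minimal illustration, not the implementation used in the earlier studies: it computes the type-token ratio and Yule's K in its usual form, K = 10^4 (Σ i² V_i − N) / N², where N is the number of tokens and V_i the number of types that occur exactly i times. The function names and the sample token list are assumptions for the sake of the example.

```python
from collections import Counter

def type_token_ratio(tokens):
    # Number of distinct types divided by number of tokens.
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    n = len(tokens)
    freq = Counter(tokens)             # type -> frequency
    spectrum = Counter(freq.values())  # frequency i -> number of types V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)

# Illustrative input: one lowercased line, far too short for real use.
tokens = "van liue wijf van herten man".split()
print(type_token_ratio(tokens))
print(yules_k(tokens))
```

Both measures depend on sample length to different degrees, so samples of comparable size, such as the 200-240 line samples used here, would be the natural unit of comparison.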
Conclusions
The analysis and comparison of the vocabulary of copies
of the same text seems to be a promising way to gain
more insight into the range of freedoms that scribes were
allowed. Of course, both more tests and a much larger
corpus are needed. A comparison with other languages
would also be interesting. The results of this empirical
approach will promote, for example, the development of
new tools or the fine-tuning of existing tools for scribal
measurements (e.g. Spencer & Howe 2001, 2002).
References
Karina van Dalen-Oskam and Joris van Zundert (2007).
‘Delta for Middle Dutch: Author and copyist distinction
in “Walewein”’. In: Literary and Linguistic Computing
22, pp. 345-362.
Karina van Dalen-Oskam and Joris van Zundert (2008).
‘The Quest for Uniqueness: Author and Copyist Distinction
in Middle Dutch Arthurian Romances based
on Computer-assisted Lexicon Analysis’. In: Marijke
Mooijaart and Marijke van der Wal (eds.) Yesterday’s
words: contemporary, current and future lexicography.
[Proceedings of the Third International Conference on
Historical Lexicography and Lexicology (ICHLL),
21-23 June 2006, Leiden]. Cambridge: Cambridge Scholars
Publishing, 2008, pp. 292-304.
Matthew Spencer and Christopher J. Howe (2001). ‘Estimating
distances between manuscripts based on copying
errors’. In: Literary and Linguistic Computing 16,
pp. 467-484.
Matthew Spencer and Christopher J. Howe (2002).
‘How accurate were scribes? A mathematical model’. In:
Literary and Linguistic Computing 17, pp. 311-322.
Hosted at University of Maryland, College Park
College Park, Maryland, United States
June 20, 2009 - June 25, 2009
Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/
Series: ADHO (4)
Organizers: ADHO