Pedagogical University of Krakow
In 2007, John Burrows identified three regions
in the word frequency lists of corpora used in
authorship attribution and stylometry. The first of these
regions consists of the most frequent words,
for which his Delta has become the best-known
method of study. This is evidenced by a varied
body of research with interesting modifications
of the method (e.g. Argamon 2008; Hoover
2004, 2004a). At the other end of the frequency
list, Iota deals with the lowest-frequency words,
while "the large area between the extremes of
ubiquity and rarity" (Burrows, 2007) is now
the target of many studies employing Zeta (e.g.
Craig, Kinney, 2009; Hoover, 2007).
Given the popularity of the three methods,
it was only a matter of time before Delta
(and, to a lesser extent, Zeta and Iota) was
applied to texts in languages other than Modern
English: Middle Dutch (Dalen-Oskam, Zundert,
2007), Old English (García, Martín 2007) and
Polish (Eder, Rybicki 2009). Delta has also been
used in translation-oriented papers, including
Burrows's own work on Juvenal (Burrows,
2002) and Rybicki's attempts at translator
attribution (2009).
It has been generally, and mainly empirically,
assumed that methods relying on the most
frequent words in a corpus should work just as
well in other languages as they do in English;
this question has been approached in any detail
only very recently (Juola, 2009). We could not
fail to observe that Delta's success rates in
Polish, although still high, fell somewhat short
of its success rates in English (Rybicki
2009a). Also, the already-quoted study by
Rybicki (2009) seemed to suggest that, in a
corpus of translated literary texts, Delta was
much better at recognising the author of the
original than the translator. This justified a
more in-depth look at the workings of Burrows's
method both in its "original" English and in a
variety of other languages.
1. Methods
In this study, a single major modification has
been applied to the usual Delta process. Each
analysis was first made for the top 50-5000 most
frequent words in the corpus; then the 50 most
frequent words were omitted and the next 50-5000
words taken for analysis; then the first 100 most
frequent words were omitted, and so on. This was
done with a single R script written by Eder. The
script produced word frequency tables, calculated
Delta and produced "heatmap" graphs of Delta's
success rate for each of the frequency list
intervals. These graphs show the best combinations
of initial word position in the wordlist and size
of window, including variations of the pronoun
deletion and culling parameters. In the resulting
heatmaps, the horizontal axis presents the size
of each wordlist used for one set of Delta
calculations; the vertical axis shows how many of
the most frequent words were omitted. Each run of
the script produced, on average, ca. 3000 Delta
iterations.
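The sliding-window procedure described above can be
sketched in a few lines of code. The sketch below is
not Eder's R script: it is a minimal Python
illustration, and every function name in it
(rel_freqs, ranked_wordlist, window, delta,
success_grid) is hypothetical. It assumes the classic
form of Burrows's Delta (mean absolute difference of
z-scores), estimates the z-score parameters from the
reference samples only, and leaves out pronoun
deletion and culling.

```python
from collections import Counter
from statistics import mean, pstdev


def rel_freqs(tokens):
    """Relative frequency of every word in one tokenised text."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


def ranked_wordlist(texts):
    """Corpus-wide word list, most frequent first (ties broken alphabetically)."""
    counts = Counter()
    for tokens in texts:
        counts.update(tokens)
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def window(wordlist, skip, size):
    """Omit the `skip` most frequent words, then take the next `size` words."""
    return wordlist[skip:skip + size]


def delta(freqs_a, freqs_b, words, mu, sigma):
    """Classic Burrows's Delta: mean absolute difference of z-scores."""
    diffs = [
        abs((freqs_a.get(w, 0.0) - mu[w]) / sigma[w]
            - (freqs_b.get(w, 0.0) - mu[w]) / sigma[w])
        for w in words
        if sigma[w] > 0  # words with no variation carry no signal
    ]
    return mean(diffs) if diffs else float("inf")


def success_grid(train, test, skips, sizes):
    """Share of correct attributions for every (skip, size) combination.

    train: dict mapping author -> list of tokens (reference samples)
    test:  list of (true_author, list of tokens) pairs to be attributed
    """
    wordlist = ranked_wordlist(list(train.values()) + [t for _, t in test])
    train_f = {a: rel_freqs(t) for a, t in train.items()}
    test_f = [(a, rel_freqs(t)) for a, t in test]
    # Assumption: z-score parameters come from the reference samples only.
    mu = {w: mean([f.get(w, 0.0) for f in train_f.values()]) for w in wordlist}
    sigma = {w: pstdev([f.get(w, 0.0) for f in train_f.values()]) for w in wordlist}
    grid = {}
    for skip in skips:          # vertical axis of the heatmap
        for size in sizes:      # horizontal axis of the heatmap
            words = window(wordlist, skip, size)
            hits = 0
            for true_author, freqs in test_f:
                # Attribute the text to the candidate with the lowest Delta.
                guess = min(
                    train_f,
                    key=lambda a: delta(freqs, train_f[a], words, mu, sigma),
                )
                hits += guess == true_author
            grid[(skip, size)] = hits / len(test_f)
    return grid
```

To mirror the setup described above, one might call
success_grid with skips=range(0, 5000, 50) and
sizes=range(50, 5001, 50). The step of 50 between
window sizes is an assumption, since the text states
only the 50-5000 range. The returned dictionary holds
one success rate per heatmap cell.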
2. Material
The project included the following corpora (used
separately); each contained a similar number of
texts to be attributed.
Hosted at King's College London
London, England, United Kingdom
July 7, 2010 - July 10, 2010
Conference website: http://dh2010.cch.kcl.ac.uk/
Series: ADHO (5)
Organizers: ADHO