Does Size Matter? A Reexamination of a Time-proven Method

  1. Jan Rybicki

    Pedagogical University of Krakow


Previous stylometric studies (Rybicki 2006, 2006a, 2007) of
patterns in multivariate diagrams of correlation matrices
derived from relative frequencies of most frequent words in
character idiolects (Burrows 1987) in a number of originals
and translations (respectively, Sienkiewicz’s trilogy of historical
romances and its two English translations; Rybicki’s Polish
translations of novels by John le Carré; and the three English
versions of Hamlet, Q1, Q2, F, and its translations into nine
different languages) have all yielded interesting similarities in
the layout of data points in the above-mentioned diagrams for
corresponding originals and translations. The repetitiveness of
the observed phenomenon in many such pairs immediately
raised some questions as to its causes – the more so as the
most-frequent-word lists for any original and its translations,
consisting primarily of function words, modals, pronouns,
prepositions and articles (if any in a given language), contain
few direct counterparts in any two languages.
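
The pipeline described above (relative frequencies of the most frequent words in each character's part, then a correlation matrix) can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the cited studies: the tokenizer, the function names and the choice to correlate character profiles with one another are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (a crude assumption)."""
    words = (w.strip(".,;:!?\"'()[]") for w in text.lower().split())
    return [w for w in words if w]


def most_frequent_words(samples, n):
    """Return the n most frequent words in the pooled text of all characters."""
    pooled = Counter()
    for tokens in samples.values():
        pooled.update(tokens)
    return [word for word, _ in pooled.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency = count of the word / size of the sample."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(samples, n_words):
    """Correlate every character's word-frequency profile with every other one."""
    vocabulary = most_frequent_words(samples, n_words)
    profiles = {name: relative_frequencies(tokens, vocabulary)
                for name, tokens in samples.items()}
    names = list(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix
```

Here `samples` is a hypothetical mapping from character name to that character's token list. The resulting matrix would then be passed to a multivariate technique to produce the diagrams discussed in this abstract; note that the word list is drawn from the pooled text, so the longest parts contribute most to it.
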
The primary suspect for a possible bias was the difference
in size between the parts of individual characters, since that
particular feature invariably remains proportionally similar
between original and translation (Rybicki 2000). The import
of this element remained unnoticed in many other studies
employing Burrows’s time-proven method (later developed,
evaluated and applied by a number of scholars, including Hoover
2002), either because they were conducted on equal samples or,
more importantly, because those that did address the problem of
translation never dealt with so many translations at a
time as did, e.g., the Hamlet study (Rybicki 2007).
The emerging suspicion was that since the sizes of the
characters studied do not vary significantly between the twelve
versions of Hamlet, this proportion might heavily influence
the multivariate graphs produced – the more so as, in most
cases, the most talkative characters occupied central positions
in the graphs, while the lesser parts were usually limited to
the peripheries. This (together with similar effects observed
in earlier studies) had to lead to a reassessment of the results.
Also, since most studies were conducted on traditional male-dominated
writing, in which female characters speak
little in proportion even to their importance in the plot,
these peripheries usually included separate groups of women;
while this was often seen as evidence of the authors’ ability to stylistically
differentiate “male” and “female” idiom, the size bias could
distort even this quite possible effect.
A number of tests were then performed on character
idiolects and narrative fragments of various sizes and various
configurations of characters from selected English novels
and plays – ranging from Ruth and David Copperfield; from
Conan Doyle to Poe; narration in Jane Austen, or plays by Ben
Jonson and Samuel Beckett – to investigate how the impact
of size distorts the image of stylistic differences presented in
multivariate diagrams.
One possible source of the bias is the way relative frequencies
of individual words are calculated. Being a simple ratio of
word frequency to size of sample, relative frequency may be as
unreliable as another very similar formula, the type-to-token
ratio, which (like measures as complex as Yule’s K) has been
found to be unreliable as a measure of vocabulary richness in samples
of different size (Tweedie & Baayen 1997). Also, due to the
nature of multivariate analyses used (both Factor Analysis and
Multidimensional Scaling), it is not surprising that the less stable
statistics of the less talkative characters would have their data
points pushed to the peripheries of the graph. Finally, since the
list of the most frequent words used in the analysis is most
heavily influenced by the longest parts in any text, this might
also be the reason for the “centralising” bias visible in data
points for such characters.
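
The size dependence discussed above can be made concrete with a small simulation. The sketch below is illustrative only: it draws "words" from a toy 1/rank distribution (an assumption, not data from any of the texts studied) and compares a short and a long sample on two counts, namely the type-to-token ratio and the spread of one word's relative frequency across repeated samples. All names and parameter values are hypothetical.

```python
import random
from statistics import pstdev


def zipf_weights(vocab_size):
    """Weights proportional to 1/rank: a crude stand-in for real word frequencies."""
    return [1.0 / rank for rank in range(1, vocab_size + 1)]


def draw_sample(rng, weights, size):
    """Draw `size` word ids (0 is the most frequent word) from the toy distribution."""
    return rng.choices(range(len(weights)), weights=weights, k=size)


def type_token_ratio(tokens):
    """Number of distinct words divided by the size of the sample."""
    return len(set(tokens)) / len(tokens)


def relative_frequency(tokens, word):
    """Count of `word` divided by the size of the sample."""
    return tokens.count(word) / len(tokens)


def spread_of_relative_frequency(rng, weights, size, repeats=50):
    """Standard deviation of word 0's relative frequency over repeated samples."""
    values = [relative_frequency(draw_sample(rng, weights, size), 0)
              for _ in range(repeats)]
    return pstdev(values)


def size_effect(sizes=(200, 5000), vocab_size=500, seed=0):
    """For each sample size, report one type-to-token ratio and the frequency spread."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size)
    report = {}
    for size in sizes:
        report[size] = {
            "ttr": type_token_ratio(draw_sample(rng, weights, size)),
            "spread": spread_of_relative_frequency(rng, weights, size),
        }
    return report
```

Under these toy assumptions, `size_effect()` gives the shorter sample both a higher type-to-token ratio and a wider spread of relative frequency than the longer one. The second effect is one plausible reading of the "less stable statistics" of the less talkative characters; the simulation does not reproduce the multivariate diagrams themselves.
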
It should be stressed that the above reservations do not,
in any way, invalidate the entire method; indeed, results for
samples of comparable size remain reliable and unbiased. Also,
it should not be forgotten that size of a character’s part is in
itself an individuating feature of that character.
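
One way to obtain samples of comparable size, offered here only as an illustration and not as a step prescribed by this abstract, is to cut every character's part down to the length of the shortest one. The sketch below takes a random contiguous window from each part; the function name, the `samples` mapping and the windowing strategy are all assumptions.

```python
import random


def equalize_samples(samples, seed=0):
    """Cut each token list to the length of the shortest one.

    A random contiguous window is taken from each part, so word order
    within the window is preserved.
    """
    rng = random.Random(seed)
    target = min(len(tokens) for tokens in samples.values())
    equalized = {}
    for name, tokens in samples.items():
        start = rng.randrange(len(tokens) - target + 1)
        equalized[name] = tokens[start:start + target]
    return equalized
```

The trade-off is that equalizing discards most of the text of the longest parts and, as noted above, removes a feature (the size of the part) that is itself characteristic of a character.
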
Burrows, J.F. (1987). Computation into Criticism: A Study of
Jane Austen’s Novels and an Experiment in Method. Oxford:
Clarendon Press.
Hoover, D.L. (2002). New Directions in Statistical Stylistics and
Authorship Attribution. Proc. ALLC/ACH.
Rybicki, J. (2000). A Computer-Assisted Comparative Analysis of
Henryk Sienkiewicz’s Trilogy and its Two English Translations. PhD
thesis, Kraków: Akademia Pedagogiczna.
Rybicki, J. (2006). Burrowing into Translation: Character Idiolects in
Henryk Sienkiewicz’s Trilogy and its Two English Translations. LLC
21.1: 91-103.
Rybicki, J. (2006a). Can I Write like John le Carré? Paris: Digital
Humanities 2006.
Rybicki, J. (2007). Twelve Hamlets: A Stylometric Analysis of
Major Characters’ Idiolects in Three English Versions and Nine
Translations. Urbana-Champaign: Digital Humanities 2007.
Tweedie, F., Baayen, R. (1997). Lexical ‘constants’ in stylometry
and authorship studies.


Conference Info


ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

Series: ADHO (3)

Organizers: ADHO
