The Hypothetical and Theoretical Underpinnings of
Non-traditional Authorship Attribution Studies: Assumptions, Presumptions, and
Verifiable Constructs
Joseph Rudman
Department of English, Carnegie Mellon University
Rudman@cmphys.phys.cmu.edu
ACH/ALLC 1999, University of Virginia, Charlottesville, VA
Editor, encoder: Sara A. Schmidt
I Introduction
Some words, such as "Phrenology" or "Stylometry", insinuate their own
assumptions. In fact, nobody has ever proved that minds can be measured
by bumps or style by numbers.
Sams [1994] p. 469
In our view the protagonists of stylistic analysis in forensic
applications have not only failed to demonstrate such a link [between
style and authorship] but have not even attempted to do so.
Totty et al. [1987] p. 18
The hypothesis behind non-traditional authorship attribution studies -- those
using the computer, statistics, and stylistics -- is that every author has a
verifiably unique style. This paper points out and discusses the fact that
this hypothesis has never been empirically tested, let alone proven. The
lack of a proven theory after more than thirty years and well over 600
studies is one of the main reasons that non-traditional authorship studies
are not accepted, in the main, by either the literary or the scientific
community.
This paper then goes on to discuss some other assumptions behind the main one
and finishes by outlining an empirical study to help move the hypothesis to
proof. The movement of this hypothesis through theory to proof is needed to
give validity to all authorship attribution studies.
II A Short History of the Hypothesis
...try to balance in your own mind the question whether the latter
[text] does not deal in longer words than the former [text]. It has
always run in my head that a little expenditure of money would settle
questions of authorship this way.... Some of these days spurious
writings will be detected by this test. Mind, I told you so.
de Morgan [1851] pp. 215-216
May there not be "fingerprints" in writing, of which the author, and
most of his critics, are quite unconscious, but which could be
discovered by some new approach, to the benefit of the search for
truth?
Williams [1970] p. 2
This section outlines the history of the hypothesis that every author has a
verifiably unique style. Some of the reasons why the hypothesis was never
tested are listed and briefly discussed, e.g.:
1. Computers
2. Machine readable text
3. Degree of difficulty
4. The panoply of peripheral disciplines
III What is Behind the Hypothesis: Other Sub-assumptions
Wordprinting is still in its infancy and cannot yet boast an
explanatory theory or even an agreed-upon name. Nor do its practitioners
agree on an optimal statistical model. This degree of openness...has not
prevented the convincing success of a number of important studies, which
in turn gives added intuitive plausibility to its basic
assumptions.
Reynolds [1995] p. 157
This section lists and discusses some of the sub-assumptions of the main hypothesis:
a. Style is quantifiable. That style is quantifiable is
now a given -- a fact already established. This quantifiability is
what sets the working definition of style for not only this paper,
but for most non-traditional authorship attribution studies. A short
explanation with examples of empirical studies that prove this point
is provided.
b. Style changes over time. The problems with this
assumption are listed and discussed. Key studies on style change
over time are explicated.
c. Style is different for different genres. The problems
with this assumption are listed and discussed. Key studies on style
change over genre are explicated.
d. Style is as differentiating as (i) Fingerprints, (ii) DNA, or
(iii) Iris Scans. These assumptions differ as to the
attainable degree of certainty in any findings. This section goes on
to discuss what has been reported in the literature about the degree
of certainty and what can and should be expected.
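As a minimal illustration of what "quantifiable" means in sub-assumption (a), the following sketch reduces a passage to a few numeric style markers: mean word length (the measure de Morgan suggests in the epigraph to Section II), a type-token ratio, and the relative frequencies of a handful of function words. The function name `style_markers`, the crude tokenizer, and the short function-word list are assumptions made for illustration only; they are not taken from any of the studies discussed here.

```python
import re
from collections import Counter

# Hypothetical, deliberately short function-word list (an illustrative assumption).
FUNCTION_WORDS = ("the", "of", "and", "to", "a")


def style_markers(text):
    """Return a dict of simple numeric style markers for one passage."""
    # Crude tokenizer (an assumption): lower-cased runs of letters,
    # optionally followed by an apostrophe and more letters ("don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("passage contains no words")

    counts = Counter(words)
    total = len(words)

    markers = {
        # Average number of characters per word token.
        "mean_word_length": sum(len(w) for w in words) / total,
        # Distinct word forms divided by total word tokens.
        "type_token_ratio": len(counts) / total,
    }
    # Relative frequency of each function word in the passage.
    for fw in FUNCTION_WORDS:
        markers["rate_" + fw] = counts[fw] / total
    return markers
```

Calling `style_markers(passage)` on each passage yields one numeric vector per passage. Whether such vectors separate authors from one another is exactly what the hypothesis leaves to be tested.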
The general problems of non-traditional authorship attribution as reported by
Rudman (1998) are discussed only insofar as they bear directly on each
sub-assumption, e.g.:
a. Which style markers to use. Is the number of style
markers infinite? Is style an open ended system? (This is a
follow-up on a discussion at the Kingstown conference.)
b. Which statistical tests to use. Does each of these
statistical tests need its own theoretical underpinnings? Michael
Farringdon's discussion of the criticism that, "QSUM has no
theoretical basis," is explicated.
IV An Empirical Proof
There are two strategies to making progress toward finding the
correct underlying theory, (1) the so-called "top-down" approach where
one postulates a complete theory of everything... (2) the empirically
based "bottom up" approach where one uses experimental data to make
smaller, incremental steps.
Rothstein [1998] p. 4
This section discusses the "top-down" and "bottom-up" experimental strategies
for moving the hypothesis to a correct theory and thence to studies that can
prove or disprove the theory. I have not found a "top-down" approach in the
literature -- and, understandably so, if for no other reason than
logistics.
One experimental approach to test the hypothesis, a hybrid of the "top-down"
and "bottom-up" strategies, is given here and discussed:
1. Within a time period (~ +/- 5 years), language (native), and
genre, randomly select (n1)% of all possible writers. These
constraints eliminate the need to account for changes in a writer's
style over time, genre, or language.
2. Randomly select (n2) passages of (n3) running words from each
selected author.
The question, "How can we be sure that (n2) is truly
representative," is discussed.
The question, "How do we know (n3) is large enough," is
discussed.
3. Subject each author's text to stylistic analysis. The
statement that, "This should be done using as many style markers as
possible," is explicated. A short discussion of the statistics
behind the adjudication of each style marker is presented.
4. Controls:
a. (n4) other writers from the same pool as (1)
b. (n5) other selections from the writers selected in
(1).
The determination of each variable "n*" is
discussed.
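The sampling steps of this design can be sketched in code. The sketch below is only one possible reading of the outline above, under stated assumptions: the corpus is a hypothetical mapping from each writer in the constrained pool to that writer's running words, passages are contiguous and may overlap, and the names `random_passages` and `draw_study` are invented for illustration. The stylistic analysis itself and the choice of values for n1 through n5 are left open, as they are in the text.

```python
import random


def random_passages(words, count, length, rng):
    """Draw `count` contiguous passages of `length` running words.

    Passages are drawn independently, so they may overlap (an assumption).
    """
    if len(words) < length:
        raise ValueError("text is shorter than the requested passage length")
    starts = [rng.randrange(len(words) - length + 1) for _ in range(count)]
    return [words[s:s + length] for s in starts]


def draw_study(corpus, n1_percent, n2, n3, n4, n5, seed=0):
    """Sample writers and passages for the test set and both controls.

    corpus: dict mapping writer -> list of running words, already restricted
            to one time period, language, and genre (hypothetical input).
    """
    rng = random.Random(seed)  # seeded so the draw can be repeated
    pool = sorted(corpus)

    # Step 1: randomly select n1% of all writers in the pool (at least one).
    k = max(1, round(len(pool) * n1_percent / 100))
    selected = rng.sample(pool, k)

    # Control (a): n4 other writers from the same pool.
    remaining = [w for w in pool if w not in selected]
    control_writers = rng.sample(remaining, min(n4, len(remaining)))

    return {
        # Step 2: n2 passages of n3 running words from each selected writer.
        "test": {w: random_passages(corpus[w], n2, n3, rng) for w in selected},
        "control_writers": {
            w: random_passages(corpus[w], n2, n3, rng) for w in control_writers
        },
        # Control (b): n5 other selections from the selected writers.
        "control_selections": {
            w: random_passages(corpus[w], n5, n3, rng) for w in selected
        },
    }
```

Fixing the seed makes a given draw reproducible, so the same test and control passages can be re-analyzed with different sets of style markers.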
This type of study should be done for every non-traditional authorship
attribution study as part of the control. It is important to realize that if
this type of control is carried out for every authorship study and if it is
consistently shown that every author has a unique style, then the
hypothesis is proven, q.e.d.
A survey and critique of some important "bottom-up" studies is presented. The
importance of pursuing both strategies simultaneously is discussed.
V Conclusion
One salient point made in the conclusion is that assertion is not
demonstration. Another point is that the hypothesis has already made
important steps towards theory and proof.
Bibliography
De Morgan, Sophia. Memoir of Augustus de Morgan (By his wife Sophia de Morgan, with selections from his letters). London, 1882.
Farringdon, Michael. "The Critics Answered." In Jill M. Farringdon (with contributions by A. Q. Morton, M. G. Farringdon and M. D. Baker), Analysing for Authorship. Cardiff: University of Wales Press, 1996. 239-261.
Reynolds, Noel B. "Statistical Wordprinting." In Thomas Hobbes, Three Discourses. Ed. Noel B. Reynolds and Arlene W. Saxonhouse. Chicago: University of Chicago Press, 1995. 157-162.
Rothstein, Ira Z. "The Search for a Theory of Everything." Interactions. Department of Physics, Carnegie Mellon, 1998. 4.
Rudman, Joseph. "The State of Authorship Attribution Studies: Some Problems and Solutions." Computers and the Humanities 31 (1998): 351-365.
Sams, Eric. "Edmund Ironside and Stylometry." Notes and Queries (Dec. 1994): 469-472.
Totty, R. N., et al. "Forensic Linguistics: The Determination of Authorship from Habits of Style." Journal of the Forensic Science Society 27 (1987): 13-28.
Williams, C. B. Style and Vocabulary: Numerical Studies. London: Charles Griffin & Co., Ltd., 1970.
Hosted at University of Virginia
Charlottesville, Virginia, United States
June 9, 1999 - June 13, 1999
Conference website: http://www2.iath.virginia.edu/ach-allc.99/schedule.html