Department of English - Carnegie Mellon University
'DNA' and Non-traditional Authorship Attribution: An
Inclusive Model
Joseph Rudman
Carnegie Mellon University
jr20@andrew.cmu.edu
2002
University of Tübingen
Tübingen
ALLC/ACH 2002
Editor: Harald Fuchs
Encoder: Sara A. Schmidt
Anything a person writes contains the code of his intellectual
DNA, or whatever you want to call it.
Webb 1994
The greater the number of features and the more the features
belong to different categories (e.g., syntactic structures, type of
grammatical subject, inflexions, vocabulary, spelling, and so on)
the stronger the case for shared authorship.
Eagleson 1989
INTRODUCTION:
For many years it has been obvious from the literature that most
non-traditional authorship attribution studies using one style marker, or
only a small number of them, do not carry the weight of scientific validity
with the majority of other authorship attribution practitioners, the
specialists in the field of the study, or the general public. (In addition
to Eagleson, see Banks and Rudman; also Rudman 1998.)
During a talk on the "Style-Marker Mapping Project" at the ALLC-ACH 2000
conference in Glasgow, I mentioned, in passing, an attribution model based
on a "DNA" concept. (Rudman 2000) The remark was illustrative and not "on topic."
However, the audience picked up on it, and some of the ensuing questioning
and discussion kept moving away from the Style-Marker Mapping
Project.
This paper presents a non-traditional authorship attribution model based on a
"DNA" analogy. It is only an analogy, a framework to explain the techniques of
the "Inclusive Model"; there are obvious fundamental differences between DNA
and style.
Because some of the terms in this paper could be unfamiliar to the expected
audience, a clear and concise definition is given the first time each such
term is used.
I. BACKGROUND AND DEFINITIONS
If we look at style as a living organism, style-markers are its genetic
material -- making the Style-Marker Mapping Project (Rudman, 2000) analogous
to the human genome project. I would like to extend this biological analogy:
the Inclusive Authorship Attribution Model is analogous to DNA
analysis.
The earliest reference to DNA and style that I have seen is Bailey's
comparison of the tools used to decode the underlying makeup of the two --
X-ray diffraction for DNA, the computer for style. Bailey does not move
towards a DNA model for stylistics. (Bailey)
The lead quote by Webb is also quoted in Forsyth's dissertation, yet Forsyth
does not use the intent of the quote to move into a DNA model. (Forsyth)
I have been leaning towards a more inclusive attribution model that would
utilize a large number of style-markers since the mid-1980s. Other
researchers have also recognized the need to expand the number of
style-markers in attribution studies. As the structure of DNA was decoded and
the methods for comparing DNA samples were refined, DNA analysis became the
analogy of choice. I first mentioned the model at the ALLC-ACH Oxford
conference in 1992. (Banks and Rudman) The thrust of that presentation was
towards a statistical method of combining the results of different
statistical tests on various style-markers. This section briefly traces the
evolution of the DNA model through various publications and presentations.
Clear and concise definitions of the DNA autoradiogram are given. (Kirby) A
brief explanation of why this model is necessary closes this section.
(Willing)
II. THE MODEL
Outline a method of analysis which will allow organization of these
features [the entire range of linguistic features] so as to facilitate
comparison of any one use of language with any other
(Carter, Crystal and Davy, and Darbyshire). McMenamin 1993
A) How the Inclusive Model differs from other models (e.g.
multivariate models and Burrows' Delta Project). (Holmes, Burrows)
B) The DNA Analogy is Explicated. It is shown how each locus of
the autoradiogram is equivalent to a different style-marker. The
determination of each style-marker locus is discussed. Forsyth's
suggestion at the Glasgow conference that a list of "proven"
style-markers should be provided and used is discussed.
C) Visual Representation. A method of visual representation of the
results of the model is shown.
D) The following two statistical methods of combining each
style-marker locus into a final answer are presented and discussed:
(1) If the style-markers that are used can be shown to be
independent of one another (e.g. word length distribution,
percentage of nouns starting sentences, type/token ratio) a
procedure based on Fisher's method for combining significance
probabilities from independent statistical tests can be used.
(Fisher)
(2) If the style-markers that are used are not independent of
each other (e.g. word length distribution, word length
correlation, percentage of latinate words) the statistical
method employed by DNA researchers can be used.
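As an illustration of case (1) only, the following is a minimal sketch of Fisher's method for combining significance probabilities from independent tests. It is not the implementation used in the Inclusive Model: the function name `fisher_combined_p` and the three p-values in the usage example are hypothetical, and the sketch assumes each style-marker test has already produced a p-value and that the tests are independent. The statistic is X = -2 * sum(ln p_i), which under the null hypothesis follows a chi-square distribution with 2k degrees of freedom for k tests.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Assumes the tests are independent;
    the name and interface are illustrative, not taken from the paper.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # Fisher's statistic: chi-square with 2k degrees of freedom under H0.
    statistic = -2.0 * sum(math.log(p) for p in p_values)

    # For even degrees of freedom 2k the chi-square upper tail has a
    # closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    combined_p = math.exp(-half) * total
    return statistic, combined_p


if __name__ == "__main__":
    # Hypothetical p-values, one per independent style-marker test
    # (e.g. word length distribution, sentence-initial nouns, type/token ratio).
    marker_p_values = [0.04, 0.20, 0.11]
    stat, p = fisher_combined_p(marker_p_values)
    print(f"Fisher statistic = {stat:.3f}, combined p = {p:.4f}")
```

The closed-form tail keeps the sketch free of third-party dependencies. It does not address case (2), where the style-markers are not independent.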
CONCLUSION
The methods of determining the DNA loci and the style-marker loci are different. A
single technique is employed to determine all of the DNA loci. Each
style-marker locus is determined, for the most part, by a different
experimental technique. Moreover, some of the style-marker loci are actually the
result of multivariate statistical analysis.
The Inclusive Authorship Attribution Model promises a degree of acceptability
not seen in most non-traditional attribution studies -- especially in types
of studies such as McMenamin's "'Population Model' where there are no
obvious authorship candidates, and texts from an entire population of
possible authors are considered against texts by one suspected author."
(McMenamin)
Preliminary Bibliography
Bailey, Richard W. "The Future of Computational Stylistics." ALLC Bulletin 7 (1979): 4-11. [First presented at the Association for Literary and Linguistic Computing Fifth International Meeting, Friday, December 15, 1978, King's College, University of London. Also in LITERARY COMPUTING AND LITERARY CRITICISM. Ed. Rosanne G. Potter. Philadelphia: University of Pennsylvania Press, 1989, 3-12.]
Balding, David J., and Peter Donnelly. "Inference in Forensic Identification." JOURNAL OF THE ROYAL STATISTICAL SOCIETY A 158 [Part 1.] (1995): 21-53.
Banks, David L., and Joseph Rudman. "Questionable Attribution in the Canon of Daniel Defoe: A Study of Techniques." ALLC-ACH'92 Conference. Oxford University, April 7, 1992.
Burrows, John. "Questions of Authorship: Attribution and Beyond. A Lecture Delivered on the Occasion of the Roberto Busa Award." ACH-ALLC01 Conference. New York University, New York, June 14, 2001.
Eagleson, Robert D. "Linguist for the Prosecution." WORDS AND WORDSMITHS. Ed. Geraldine Barnes et al. Sydney: The University of Sydney Press, 1989. 22-31.
Fisher, R. A. STATISTICAL METHODS FOR RESEARCH WORKERS. London: Hafner, 1969.
Forsyth, Richard S. "Stylistic Structures: A Computational Approach to Text Classification." Dissertation. University of Nottingham, 1995.
Holmes, David I. "Authorship Attribution and the Book of Mormon: A Case Study in Stylometric Techniques." Ph.D Thesis. University of London, Kings College, May 1990.
Holmes, David I. "Vocabulary Richness and the Prophetic Voice." (A supplement to the main thesis.) Ph.D Thesis. University of London, Kings College, November 1990.
Kirby, Lorne T. DNA FINGERPRINTING: AN INTRODUCTION. New York: W. H. Freeman, 1992.
McMenamin, Gerald R. FORENSIC STYLISTICS. Amsterdam: Elsevier, 1993.
Rudman, Joseph. "The Style-Marker Mapping Project: A Rationale and Progress Report." ALLC/ACH 2000 Conference, University of Glasgow, Scotland, July 25, 2000.
Rudman, Joseph. "The State of Authorship Attribution Studies: Some Problems and Solutions." COMPUTERS AND THE HUMANITIES 31.4 (1997): 351-365.
Webb, Charles. Interview in THE INDEPENDENT MAGAZINE, 35, 5 February 1994. [Quoted by Forsyth, 8.]
Willing, Richard. "Mismatch Calls DNA Tests Into Question." USA TODAY, 3A, 8 February 2000.
Hosted at Universität Tübingen (University of Tubingen / Tuebingen)
Tübingen, Germany
July 23, 2002 - July 28, 2002
Conference website: http://web.archive.org/web/20041117094331/http://www.uni-tuebingen.de/allcach2002/