Department of English - Carnegie Mellon University
On Determining a Valid Text for Non-Traditional
Authorship Attribution Studies: Editing, Unediting, and De-Editing
Joseph Rudman
Carnegie Mellon
jr20@andrew.cmu.edu
2003
University of Georgia
Athens, Georgia
ACH/ALLC 2003
Editors: Eric Rochester; William A. Kretzschmar, Jr.
Encoder: Sara A. Schmidt
INTRODUCTION:
The work’s material history since its
inception, the vast and largely uncharted alterations imposed by
the history and by the mediation of generation upon generation
of printers, editors, publishers—this is a relativism we are
prone to ignore, but ignore at our peril.
(Marcus 1996)
The literary texts often are not homogenous
since they may comprise dialogues, narrative parts, etc. An
integrated approach, therefore, would require the development of
text sampling tools for selecting the parts of the text that
best illustrate an author’s style.
(Stamatatos et al. 2001)
Most non-traditional authorship attribution studies place too much emphasis
on statistics, stylistics, and the computer, and too little on the integrity
and validity of the primary data: the text itself.
It is intuitively obvious, and easily shown empirically, that a study of the
patterns of an author’s stylistic usage (e.g. Daniel Defoe’s) is
systematically degraded by each interpolation of non-Defoe text, and even by
each interpolation of Defoe text of a different genre or a significantly
different time period.
The crux of this paper is one important element in the empirical
methodology of a valid non-traditional authorship attribution study, the
preparation of the text for stylistic and statistical analysis: unediting,
de-editing, and editing.
The general emphasis of this presentation is on prose analysis with some
peripheral treatment of drama and poetry.
I. BACKGROUND AND DEFINITIONS
A. Why a valid text is necessary should not even be asked. No
valid experiment can be done if the input data is flawed: garbage
in, garbage out!
Too many practitioners simply grab a text from any available source
without any thought to its pedigree (e.g. Khmelev and Tweedie’s
“Using Markov Chains for Identification of Writers”).
Are undertakings such as Project Gutenberg or the Oxford Text Archive,
with their easily available machine-readable texts, a boon or a bane to
non-traditional authorship studies? This question is explored in
some detail.
B. Selecting a starting text
The validity of using texts from the oral tradition and the scribal
tradition is discussed.
Before any manipulation and analysis of a text is carried out, a valid
starting text must be acquired that fulfills many necessary
requirements. This selection is primarily bibliographically driven. If
a practitioner is not savvy in the bibliographical arts, a collaborator
who is should be recruited.
Examples of bad starting texts causing problems are given (e.g. Peng
and Hengartner’s “Quantitative Analysis of Literary Styles”).
If you cannot obtain a valid text, do not do the study.
C. Unediting: getting back to the state of “not yet edited”
De-editing: removing selected text
Editing: changing (preparing) a text for statistical analysis
II. EXPLICATION
The statement, “each age, each author, each study demands a different
mixture of the following particulars,” is discussed.
A. Unediting
As a rule, the text closest to the holograph should be found and used.
1. Editorial interpolation
a. Filled in lacunae
b. Marginal notation
c. ‘Changes’ in the text
d. Critical editions
2. Printer interpolation
For the Printer is a beast, and
understands nothing I can say to him of correcting
the press.
Dryden (Ward p. 97)
a. Catchwords (the first word of the next leaf
or gathering)
b. Signatures (combinations of letters and
numerals used something like catchwords)
c. Removing obvious typesetting mistakes (a
slippery slope)
i. ‘f’ for the long ‘s’
ii. Double words (e.g. ‘the the’, ‘was was’)
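The two typesetting items above can be sketched in Python. This is only an illustration under stated assumptions, not a procedure taken from the paper: it assumes the transcription preserves the long s as the Unicode character U+017F, and it only flags doubled words for a human decision rather than deleting them, since silent correction is exactly the slippery slope noted above. The function names are hypothetical.

```python
import re

LONG_S = "\u017f"  # the long s character, if the transcription preserved it

# A word immediately followed by the same word, ignoring case.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_doubled_words(text):
    """Return (offset, word) for each doubled word; nothing is deleted."""
    return [(m.start(), m.group(1)) for m in DOUBLED.finditer(text)]

def normalize_long_s(text):
    """Replace the long s with a round s (assumes U+017F was transcribed)."""
    return text.replace(LONG_S, "s")

if __name__ == "__main__":
    for offset, word in find_doubled_words("It was was the the end"):
        print(offset, word)
    print(normalize_long_s("Congre\u017fs"))
```

A transcription that renders the long s as ‘f’ cannot be repaired this way, because a genuine ‘f’ and a mistranscribed long s are indistinguishable without a word list and editorial judgment.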
B. De-editing
1. Quotes
a. Factual, unattributed
b. Factual, attributed
c. Self quotes from earlier writings
2. Plagiarism
a. Direct copy
b. Paraphrasing
c. Imitation
3. Collaboration
a. Sectional
b. Phrasal
c. Word level
d. Ghostwriting
4. Genre
a. Poetry, prose, drama, letters, etc.
b. Mixture (e.g. verse drama)
5. Graphs and Numbers
a. Tables
b. Lists
c. Arabic and Roman numerals
6. Guide words
a. Titles—chapter headings—the end word
‘Finis’
b. Marginal annotation
7. Foreign Languages
a. Sentence level and greater
b. Phrase or word level
8. Translations
a. Verbatim
b. Concepts
9. Examples of items de-edited (or not de-edited)
incorrectly by practitioners
a. Biblical quotes
b. Titles in direct apposition
c. Numbers that are spelled out
d. Words with an initial capital
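De-editing, as outlined above, removes selected text, and the removed material may still be wanted for a parallel study. The following Python sketch assumes a practitioner has already marked the spans by hand with simple inline tags; the tag names `quote` and `foreign` and the function `de_edit` are illustrative assumptions, and the regular expression handles only flat, un-nested tags without attributes, so it is no substitute for a real markup parser.

```python
import re

def de_edit(text, tags=("quote", "foreign")):
    """Remove hand-marked spans and return (remaining_text, removed_spans).

    Assumes flat tags such as <quote>...</quote> with no nesting and no
    attributes. The removed spans are kept so they can be disclosed or
    analyzed separately.
    """
    removed = {tag: [] for tag in tags}
    for tag in tags:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        removed[tag].extend(pattern.findall(text))
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind; note this also flattens line breaks.
    text = re.sub(r"\s+", " ", text).strip()
    return text, removed

if __name__ == "__main__":
    sample = ("He wrote <quote>In the beginning</quote> twice and added "
              "<foreign>sine die</foreign> once.")
    remaining, removed = de_edit(sample)
    print(remaining)
    print(removed)
```

Returning the removed spans alongside the remaining text keeps a record of what was taken out of each text in the study.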
C. Editing
1. Encoding the text
a. Why (e.g. homographic forms)
b. TEI
2. Regularizing
a. Spelling
b. Contracted forms (simple, compound)
c. Hyphenation
d. Masked words (e.g. ‘D_ _ _ e’ for ‘Defoe’)
3. Lemmatizing
a. Pro
b. Con
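Regularizing spelling and contracted forms can be sketched as a table lookup that records every substitution. The table entries below are invented examples, not forms drawn from any study discussed here, and the tokenizer is a deliberate simplification that recognizes only ASCII letters and the straight apostrophe.

```python
import re

# Hypothetical regularization table; a real one would be built per author and age.
REGULARIZE = {"'tis": "it is", "tho'": "though", "publick": "public"}

TOKEN = re.compile(r"[A-Za-z']+")

def regularize(text, table=REGULARIZE):
    """Return (regularized_text, changes), where changes lists every
    (offset, original, replacement) so the editing can be disclosed."""
    changes = []

    def swap(match):
        token = match.group(0)
        replacement = table.get(token.lower())
        if replacement is None:
            return token
        changes.append((match.start(), token, replacement))
        return replacement

    return TOKEN.sub(swap, text), changes

if __name__ == "__main__":
    new_text, changes = regularize("'tis a publick matter, tho' small")
    print(new_text)
    for change in changes:
        print(change)
```

The lookup is case-insensitive but the replacement is not, so a sentence-initial capital would be lost; a fuller version would need to restore it.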
D. Special Problems in Drama and Poetry
1. Stage directions
2. The ‘age’ dependency of transmission and technique.
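For drama, a minimal Python sketch of separating stage directions from dialogue follows. It assumes, purely for illustration, that the edition in hand sets stage directions in square brackets; many editions do not, and the convention must be checked against the actual text. The function name is hypothetical.

```python
import re

# Assumption: stage directions appear in square brackets in this edition.
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")

def strip_stage_directions(play_text):
    """Return (dialogue_only, directions) so the removed text stays on record."""
    directions = STAGE_DIRECTION.findall(play_text)
    without = STAGE_DIRECTION.sub("", play_text)
    lines = []
    for line in without.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # drop lines that held only a stage direction
            lines.append(line)
    return "\n".join(lines), directions

if __name__ == "__main__":
    sample = ("[Enter Ghost]\nWho's there?\n"
              "Nay, answer me. [Aside] Stand and unfold yourself.")
    dialogue, directions = strip_stage_directions(sample)
    print(dialogue)
    print(directions)
```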
III. SOME EXAMPLES
Studies that are compromised by mistakes of commission and/or omission
in editing, unediting, or de-editing.
A. Historia Augusta
1. Twelve individual studies
B. Shakespeare
1. Elliott and Valenza
2. Foster
3. Horton
C. Defoe
1. Hargevik
2. Rothman
IV. CONCLUSION
1. Some items that are de-edited are valid style markers in
their own right (e.g. Latin phrases, a different genre) and should
be treated as such in a parallel study.
2. No matter which text is selected, the practitioner must
disclose which text was used and everything that was done to
it.
3. The same care must be taken with every text in the
study—the anonymous text, the suspected author’s text, and all
of the control texts.
4. If valid texts cannot be located and correctly edited,
unedited, and de-edited, do not do the study.
5. A valid text does not guarantee a valid study. However, a
non-valid text guarantees a non-valid study.
REFERENCES
Altick, Richard D., and John J. Fenstermaker. The Art of Literary Research (Fourth Edition). New York: W.W. Norton & Company, 1993.
Burrows, John. “Questions of Authorship: Attribution and Beyond.” A Lecture Delivered on the Occasion of the Roberto Busa Award. ACH-ALLC01 Conference, New York University, New York, June 14, 2001.
Elliott, Ward E. Y., and Robert J. Valenza. “So Many Hardballs, So Few Over the Plate: Conclusions From Our ‘Debate’ With Donald Foster.” Computers and the Humanities 36 (2002): 450–460.
Foster, Don. Author Unknown: On the Trail of Anonymous. New York: Henry Holt and Company, 2000.
Goldgar, Bertrand A. “Imitation and Plagiarism: The Lauder Affair and Its Critical Aftermath.” Studies in Literary Imagination 34.1 (2001): 1–16.
Greetham, D. C. Textual Scholarship: An Introduction. New York: Garland, 1992.
Grefenstette, Gregory, and Pasi Tapanainen. “What is a Word, What is a Sentence? Problems of Tokenization.” Proceedings of the 3rd International Conference on Computational Lexicography. Budapest: Research Institute for Linguistics, Hungarian Academy of Sciences, 1994.
Hargevik, Stieg. The Disputed Assignment of “Memoirs of an English Officer” to Daniel Defoe (Part I and Part II). Stockholm: Almqvist and Wiksell, 1974.
Holmes, David I., et al. “A Widow and Her Soldier: Stylometry and the American Civil War.” Literary and Linguistic Computing 16.4 (2001): 403–420.
Horton, Thomas B. The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher. Thesis, University of Edinburgh, 1987.
Khmelev, Dmitri V., and Fiona J. Tweedie. “Using Markov Chains for Identification of Writers.” Literary and Linguistic Computing 16.3 (2001): 299–307.
Lindey, Alexander. Plagiarism and Originality. New York: Harper and Brothers, 1952.
Marcus, Leah S. “Afterword: Confessions of a Reformed Uneditor.” In Andrew Murphy (ed.), The Renaissance Text: Theory, Editing, Textuality. Manchester: Manchester University Press, 2000, 211–216.
Marcus, Leah S. Unediting the Renaissance: Shakespeare, Marlowe, Milton. London: Routledge, 1996.
Novak, Maximillian E. “The Defoe Canon: Attribution and De-attribution.” Huntington Library Quarterly 59.1 (1997): 83–104.
Peng, Roger D., and Nicolas W. Hengartner. “Quantitative Analysis of Literary Styles.” The American Statistician 56.3 (2002): 175–185.
Project Gutenberg. URL:
Rogers, Pat. The Text of Great Britain: Theme and Design in Defoe’s ‘Tour’. Cranbury, NJ, 1998.
Rothman, Irving N. “Defoe De-Attributions Scrutinized Under Hargevik Criteria: Applying Stylometrics to the Canon.” Papers of the Bibliographical Society of America 94.3 (2000): 375–398.
Rudman, Joseph. “The State of Authorship Attribution Studies: Some Problems and Solutions.” Computers and the Humanities 31 (1998): 351–365.
Rudman, Joseph. “Non-Traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats.” Literary and Linguistic Computing 13.3 (1998): 151–157.
Slater, Eliot. The Problem of “The Reign of King Edward III”: A Statistical Approach. Cambridge: Cambridge University Press, 1988.
Stamatatos, E., et al. “Computer-Based Authorship Attribution Without Lexical Measures.” Computers and the Humanities 35 (2001): 193–214.
Text Encoding Initiative.
Thorpe, James. Watching the Ps & Qs: Editorial Treatment of Accidentals. Lawrence, Kansas: University of Kansas Printing Service, 1971.
Ward, Charles E. The Letters of John Dryden: With Letters Addressed to Him. Durham, NC: Duke University Press, 1942.
Williams, David S. Stylometric Authorship Studies in Flavius Josephus and Related Literature. Lewiston, New York: The Edwin Mellen Press, 1992.
Hosted at University of Georgia
Athens, Georgia, United States
May 29, 2003 - June 2, 2003
Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/