The Determination and Use of Controls

Joseph Rudman
Department of English - Carnegie Mellon University

"In scientific research, control means several things. For the present, let it mean that the scientist tries to systematically rule out variables that are possible 'causes' of the effects under study other than the variables hypothesized to be the 'causes.'" Kerlinger and Lee, p. 5
Introduction
Non-traditional authorship attribution studies are those studies that make use of the computer, statistics, and stylistics. The hypothesis behind these studies is that each author has a unique and verifiable style.

However, most non-traditional attribution studies place too much emphasis on statistics, stylistics, and the computer, and not enough on the overall design of the study as a scientific experiment. One element of this design is the selection and use of controls (aka "comparison groups").

Most practitioners pay "lip service" to the concept and use of controls, but they never explicate their selection, use, or value, and never use all of the necessary controls.

The importance of the correct application of controls in authorship studies was made apparent by Donald Foster's recent recantation of his non-traditional attribution of "A Funerall Elegie" to Shakespeare. Foster, after traditional scholarship "proved" him wrong, now agrees that his much ballyhooed attribution study used an "inadequate" base of control texts. [Niederkorn]

Each type of scientific experiment demands its own set of controls -- this is one aspect of the scientific method. Controls are an integral part of any experimental design that adequately tests a given hypothesis. Controls are used to check the validity of the experimental design and to ensure the validity of the results.

Attribution Types

"Differences in usage of literary elements found in a text are not necessarily an indication of differences in authorship, unless inter-authorship parameters have been compared to intra-authorship parameters to establish critical levels of variation." Larry La Mar Adams, p. 86.
There are four main types of authorship attribution [Rudman 2000]:

1. Anonymous work -- no idea of potential author
2. Anonymous work -- two, three, or some other workable number of potential authors
3. Anonymous work -- a collaboration
4. Anonymous work -- did author 'A' write the questioned work
This paper concentrates on type (4). The determination of controls for the other types will be in "Appendix A" and will be available at the conference as a handout.
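
The Adams quotation at the head of this section turns on comparing variation within one author's work against variation between authors. A minimal Python sketch of that comparison follows, assuming a single style marker (the rate of one word per 1,000 tokens) and plain whitespace tokenization. The function names, the choice of marker, and the use of the population standard deviation are illustrative assumptions, not the method of any study cited here.

```python
from statistics import mean, pstdev


def rate_per_1000(text, word):
    """Rate of `word` per 1,000 whitespace-separated tokens (hypothetical marker)."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text contains no tokens")
    return 1000 * tokens.count(word) / len(tokens)


def variation_summary(texts_by_author, word):
    """Return (intra, inter) spreads for one marker.

    intra: per-author standard deviation of the marker across that author's texts.
    inter: standard deviation of the per-author mean rates.
    """
    rates = {
        author: [rate_per_1000(text, word) for text in texts]
        for author, texts in texts_by_author.items()
    }
    intra = {author: pstdev(values) for author, values in rates.items()}
    inter = pstdev([mean(values) for values in rates.values()])
    return intra, inter
```

On this reading, a marker whose intra-author spread is as large as its inter-author spread says little about authorship for the sample at hand.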
Control Types

"Linguistic Stylistics is the Scientific analysis of individual style-markers as observed and described in the idiolect of a single writer, as well as class style-markers as identified in the language or dialect of groups of writers." McMenamin, p. 115.
This paper defines the word control as it pertains to authorship studies and discusses four types of controls:

The "author" touchstone sample
If there is any doubt, even the slightest, that a work is by the "author," that work must be excluded from this control.
A subset of the "author" sample
This sample is selected and put aside before any analysis begins.
"Non-author" writings
This requires the practitioner to identify every author and work that fits the selection criteria. The number of authors and texts is discussed below.
Named "non-author" suspects
This would include every 'good faith' suspect author of the anonymous text.
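
The second control type, a subset of the "author" sample put aside before any analysis begins, can be sketched as follows. This is only an illustration: the function name, the 20% fraction, and the fixed random seed are assumptions introduced here, not values given in the paper.

```python
import random


def set_aside_subset(author_texts, fraction=0.2, seed=0):
    """Split the "author" sample before any analysis is run.

    author_texts: dict mapping a text identifier to its content.
    Returns (touchstone, held_out); the two never share a text.
    The fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)      # fixed seed so the split can be reported and repeated
    ids = sorted(author_texts)     # stable order, independent of dict insertion order
    k = max(1, round(len(ids) * fraction))
    held = set(rng.sample(ids, k))
    touchstone = {i: author_texts[i] for i in ids if i not in held}
    held_out = {i: author_texts[i] for i in ids if i in held}
    return touchstone, held_out
```

Making the split first, and recording the seed, keeps the held-out texts from influencing the choice of style markers.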
Control Selection

The paper goes on to discuss how to select each type of control:

Genre
Somers and Tweedie claim that they can mix genres and overcome the problem "by a combination of a statistical technique...and choosing measures based on low level linguistic features...." [Somers and Tweedie, p. 412] Other practitioners, such as John Burrows and David Hoover, find it important to separate genres before analysis [Burrows 1992a, pp. 175-182] -- even sub-genres (e.g. first person narratives from third person narratives). [Hoover] Burrows also has a telling graph that "...shows a complex pattern in which genre transcends authorship." [Burrows 1992b, pp. 101, 102]
Gender
This is a current "hot topic." [Woods] The recent work by Argamon, Koppel, Fine, and Shimoni (in fact one of their papers is "in press" and another is "to appear") forces the practitioner to use gender as a criterion in control text selection. [Argamon et al.] [Koppel et al.]
Time period
This paper will look at some of the chronology and style studies to show the importance of this item to the selection of controls. The (+/-) time factor is discussed -- from Holmes' (+3/-2 years) to the (+/- centuries) of the Historia Augusta. [Holmes] [Rudman 1998]
Randomness
Block Sampling, Spread Sampling, Stratified Block Sampling, and Haphazard Sampling are discussed. [Neumann] Johnstone's "On the Necessity for Random Sampling" is explicated. [Johnstone]
Representative
There are various ways to select a sample of texts. However, a practitioner cannot simply declare a sample to be "representative" -- e.g. "...forty-three plays of known authorship, which form the comparison samples, HELD TO BE REPRESENTATIVE [emphasis mine] of the usages of six early Modern dramatists." [Hope, p.15]
Size (number of authors in the "non-author" writings)
The ideal would be to include all of the authors and texts that conform to the other selection criteria. And, as Foster found out, at some point the sample is too small to be valid.
Size (number of tokens in each control)
"...large samples aren't necessarily good samples...the representativeness of a sample is actually more important than sample size." [Best]
Statistical reasoning
A short review of the reasoning behind controls is given.
Same native language
The importance of restricting controls to texts written by members of the same native-language group is discussed.
Anonymous texts, collaborative texts, pastiche, ghostwriting
Why none of these are acceptable in a valid control is discussed.
Other criteria such as education, geographical regions within a native language sphere, political convictions, and religious convictions are possible determiners -- but studies to evaluate their importance are yet to be finalized.
Not one of the ten listed items is without controversy. Most of them are in dispute. These disputes are analyzed in this paper and a reasoned determination made.
The availability of electronic texts (on the internet in particular) can be a boon or a bane in the selection and acquisition of controls. However, too many practitioners let what is on the internet and what is easily available from other sources determine the selection and number of authors in their controls. One example of this is Michael Farringdon's work on Henry Fielding. [Farringdon]
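
Several of the selection criteria listed above (genre, time period, native language, and the exclusion of anonymous, collaborative, pastiche, and ghostwritten texts) can be applied as an explicit filter, so that the criteria rather than availability decide what enters a control. The sketch below is illustrative only: the record fields and function name are hypothetical, and the asymmetric time window simply mirrors the (+3/-2 years) form cited from Holmes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateText:
    """Hypothetical catalogue record for one candidate control text."""
    title: str
    author: str
    genre: str
    year: int
    native_language: str
    # False for anonymous, collaborative, pastiche, or ghostwritten texts
    sole_known_author: bool


def eligible_controls(candidates, genre, target_year,
                      years_after, years_before, native_language):
    """Keep only the candidates that satisfy every stated criterion."""
    earliest = target_year - years_before
    latest = target_year + years_after
    return [
        c for c in candidates
        if c.genre == genre
        and earliest <= c.year <= latest
        and c.native_language == native_language
        and c.sole_known_author
    ]
```

Called with `years_after=3` and `years_before=2`, the filter admits only texts dated from two years before to three years after the questioned work. Criteria that are still in dispute, such as gender, could be added as further fields in the same way.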

Use of Controls
"And although controls are purportedly used, the methods are never tested entirely apart from the problem for which they were designed." Zimmer, p. 33
This section discusses how the various controls should be used, and how they are used and misused, in a representative sample of experiments.
Examples
Examples of problem studies are explicated:
Historia Augusta
"...the twelve attribution studies of the Historia Augusta.... All provide classic examples of invalid controls." Rudman 2003, p. 29.
Shakespeare
"But it must be viewed as a major setback when one highly visible study can be cited by skeptics as proof that the whole quantitative enterprise is fruitless, or a playground for fringe theories having no historical or computational validity." Foster 1996, p. 255
Defoe
"Most investigators of similar stylo-statistical problems do not divulge how their samples were built up, or how sample size was estimated. It is, certainly, very sensible to leave out such compromising matter, for any attempt to lay down principles in these cases is liable to attract criticism." Hargevik, Part I, p. 28.
Conclusion

"Nothing can do more to lend an air of credibility to a claim than the suggestion that it has been proven in scientific studies or backed by scientific evidence. Sadly, however, many claims made in the name of science are founded on misapplications of some aspect of scientific method." Carey, p. 6.
The use of controls cannot be left to whimsy. Controls of convenience must be avoided if the practitioner expects the nihil obstat of the gatekeepers -- if the practitioner expects the results to be accepted by the community of scholars.

Valid controls are not a guarantee of a valid study. However, invalid controls guarantee a suspect study that at best must be taken with a grain of salt.

References
Adams, Larry La Mar. "Literary Statistics." (Under 'Correspondence') ALLC BULLETIN, 2.2 (1974): 85-87.
Argamon, Shlomo, et al. "Gender, Genre, and Writing Style in Formal Written Texts." TEXT [To appear] Pre-print courtesy of the authors.
Best, Joel. DAMNED LIES AND STATISTICS. Berkeley: University of California Press, 2001.
Burrows, John F. "Computers and the Study of Literature." In Butler, C. S. (Ed.) COMPUTERS AND WRITTEN TEXTS. Oxford: Blackwell, 1992, pp. 107-204.
Burrows, John F. "Not Unless You Ask Nicely: The Interpretative Nexus Between Analysis and Information." LITERARY AND LINGUISTIC COMPUTING, 7.2 (1992):91-109.
Carey, Stephen S. A BEGINNER'S GUIDE TO SCIENTIFIC METHOD. (2nd Edition) Belmont, CA: Wadsworth Publishing Co., 1998.
Farringdon, Michael G. "A Stylometric Analysis." (Appendix C) In NEW ESSAYS BY HENRY FIELDING: HIS CONTRIBUTIONS TO THE CRAFTSMAN (1734-1739) AND OTHER EARLY JOURNALISM WITH A STYLOMETRIC ANALYSIS BY MICHAEL G. FARRINGDON. Martin C. Battestin. Charlottesville: The University Press of Virginia, 1989. 549-591.
Foster, Donald W. "Response to Elliot [sic] and Valenza, 'And Then There Were None.'" COMPUTERS AND THE HUMANITIES, 30 (1996): 247-255.
Hargevik, Steig. THE DISPUTED ASSIGNMENT OF "MEMOIRS OF AN ENGLISH OFFICER" TO DANIEL DEFOE (Part I and Part II) Stockholm: Almqvist and Wiksell, 1974.
Holmes, David. "Authorship Attribution and the Book of Mormon: A Case Study in Stylometric Techniques." Ph.D. Dissertation. University of London, 1990.
Hoover, David L. "Statistical Stylistics and Authorship Attribution: An Empirical Investigation." LITERARY AND LINGUISTIC COMPUTING, 16 (2001): 421-444.
Hope, Jonathan. THE AUTHORSHIP OF SHAKESPEARE'S PLAYS: A SOCIO-LINGUISTIC STUDY. Cambridge: Cambridge University Press, 1994.
Johnstone, D. J. "On the Necessity for Random Sampling." BRITISH JOURNAL FOR THE PHILOSOPHY OF SCIENCE 40 (1989): 443-457.
Kerlinger, Fred N., and Howard B. Lee. FOUNDATIONS OF BEHAVIORAL RESEARCH. (Fourth Edition) London: Wadsworth, Thomson Learning, 2000.
Koppel, Moshe, et al. "Automatically Categorizing Written Texts by Author Gender." LITERARY AND LINGUISTIC COMPUTING (17.4) In press.
Love, Harold. ATTRIBUTING AUTHORSHIP: AN INTRODUCTION. Cambridge: The Cambridge University Press, 2002.
McMenamin, Gerald R. FORENSIC STYLISTICS. Boca Raton: CRC Press, 2002.
Neumann, Kenneth J. THE AUTHENTICITY OF THE PAULINE EPISTLES IN THE LIGHT OF STYLOSTATISTICAL ANALYSIS. Atlanta, Georgia: Scholars Press, 1990.
Niederkorn, William S. THE NEW YORK TIMES, 20 June 2002, B1, B5.
Rudman, Joseph. "Cherry Picking in Nontraditional Authorship Attribution Studies." CHANCE 16.2 (2003): 26-32.
Rudman, Joseph. "Non-Traditional Authorship Attribution Studies: Ignis Fatuus or Rosetta Stone?" BSANZ BULLETIN 24.3 (2000): 163-176.
Rudman, Joseph. "Non-Traditional Authorship Attribution Studies in the HISTORIA AUGUSTA: Some Caveats." LITERARY AND LINGUISTIC COMPUTING 13.3 (1998): 151-157.
Somers, Harold, and Fiona Tweedie. "Authorship Attribution and Pastiche." COMPUTERS AND THE HUMANITIES 37.4 (2003): 407-429.
Woods, Michael. "Men, Women Not Only Speak But Also Write Uniquely." PITTSBURGH POST-GAZETTE, Sunday, September 7, 2003.
Zimmer, George Willis. "The Attribution of Authorship: A Computerized Method Evaluated and Compared with Other Methods Past and Future." Ph.D. Dissertation, Michigan State University, 1969.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

Organizers: ACH, ALLC
