Duquesne University, Juola & Associates
1. Introduction
One of the marks of a "mature science" [1] is the development of "standards" of analytic practice, based on shared "key theories, instruments, values, and metaphysical assumptions" [2] that scholars work with. This concept has been incorporated into US law as a mark of reliable evidence [3]. One of the weaknesses of authorship attribution is the absence of such standards of practice. For example, fifteen years ago Rudman estimated [4] that more than 1000 different feature sets had been proposed for this task. This of course creates controversy about the appropriateness of methods, and even raises the possibility of cherry-picking feature sets for a specific task to get a desired answer.
The solution demanded by Daubert is the use of a specific analytic technique, with standards controlling its operation and an established error rate. We offer a relatively simple protocol for such analysis in the hopes that it may provide a base for the eventual development of such a standard. We illustrate the application of our protocol with three case studies from the recent literature.
2. Methodological Overview
These cases involve the early writings of Edgar Allan Poe [5], the anonymized case of an asylum seeker (cited as "Bilbo Baggins") [6], and, more famously, the pseudonymous author of The Cuckoo's Calling [7], revealed to be J.K. Rowling of Harry Potter fame. All three cases share several characteristics which may therefore be regarded as "typical". Unlike many literary studies of authorship, these are "verification" problems in which there is really only one candidate author of interest, and therefore only one author from whom samples are available. No information is readily available to exclude anyone plausible from authorship (unlike, for example, the Federalist Papers, where scholars readily accepted that authorship was confined to the small group of Hamilton, Madison, and Jay). In each case, the candidate author was an established writer and a baseline of writings by that candidate could be easily obtained and validated.
Previous work has shown [8, 9] that authorship attribution can be performed with relatively high accuracy using a variety of methods. Typical performance on small, closed-class problems is around 80% accuracy [10, 11]. Using ensemble methods such as "mixture-of-experts" can boost performance above the baseline of any individual method. Our proposed protocol, then, is to solve this verification problem by running a number of independent studies as elimination tests against an ad-hoc distractor set, to see whether any feature set can definitively eliminate the author of interest. Using multiple independent tests provides strong protection against both false acceptance and false rejection errors.
3. Protocol Details
3.1 Ad-hoc distractor set
Most stylometric methods formally choose the most likely author from among a fixed and finite set of candidates based on similarity of writing. While this set is normally chosen based on authors who may actually have had the opportunity to write the disputed document, this is not a formal requirement. From the point of view of stylistic similarity, any two authors or documents can be usefully compared. Koppel et al. [12] noted that randomly chosen authors from the same general field and genre would work as well given repeated measures: "The known text of a snippet’s actual author is likely to be the text most similar to the snippet even as we vary the feature set that we use to represent the texts. Another author’s text might happen to be the most similar for one or a few specific feature sets, but it is highly unlikely to be consistently so over many different feature sets." Juola [6] applied the same technique, using newspaper articles scraped from the Web as a baseline against which to compare Baggins' writing.
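To make "most similar author" concrete, the following is a minimal sketch of ranking candidate authors against a questioned text on a single feature set. It assumes character 4-gram counts compared by cosine similarity; that distance measure, and the function names char_ngrams, cosine_similarity, and rank_authors, are illustrative assumptions and not the procedure of JGAAP or of the cited studies.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams (empty if the text is shorter than n)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count profiles; 0.0 if either is empty."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_authors(questioned: str, known_texts: Dict[str, str], n: int = 4) -> List[str]:
    """Return author names ordered from most to least similar to the questioned text."""
    profile = char_ngrams(questioned, n)
    scores = {
        author: cosine_similarity(profile, char_ngrams(text, n))
        for author, text in known_texts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy strings for illustration only; real use needs substantial writing samples.
toy_known = {
    "A": "the cat sat on the hat and purred",
    "B": "quantum flux zeppelin",
}
print(rank_authors("the cat sat on the mat", toy_known))
```

Any other feature set (word lengths, word pairs, frequent words) could be substituted for char_ngrams without changing the ranking step.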
3.2 Multiple independent elimination tests
The key insight here is that, quoting Koppel, any given wrong author "is highly unlikely to be consistently [similar] over many different feature sets." This insight can be formalized mathematically as follows:
If a technique is accurate with probability X, the chance of it being wrong is (1-X). (For example, an 80% chance of being right yields a 20% chance of being wrong.)
If two independent techniques are each accurate with probability X, the chance of them both being wrong is (1-X)^2.
If K independent techniques are each accurate with probability X, the chance of them all being wrong is (1-X)^K, which becomes arbitrarily small as K increases.
Thus using multiple independent analyses will reduce the chance of false acceptance error to as small a value as desired.
Similarly, false rejection errors can be handled by using a relaxed acceptance criterion, and essentially treating the top few candidates as "successful." This again can be demonstrated rigorously. If our technique is 80% accurate among a set of distractor authors, there is a 20% chance that the most similar author will not be the correct one. But in this case (and with suitable independence assumptions), there will also be an 80% chance that the most similar author among all other authors studied will be the correct one (by assumption), and hence only a 4% chance that the correct author will not be among the top two in the original set. (This chance drops to 0.8% for the top 3.) Thus we can say with high probability that any author not among the top few most similar has been eliminated as a plausible candidate author.
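The 4% and 0.8% figures can be checked numerically. The sketch below compares the closed form (1-X)^n with a small simulation in which each successive "most similar" pick is correct with probability X, independently of the others. This mirrors the independence assumption stated above rather than validating it, and the function names are hypothetical.

```python
import random


def miss_rate_closed_form(accuracy: float, top_n: int) -> float:
    """Chance that the true author is absent from the top_n most similar candidates."""
    return (1.0 - accuracy) ** top_n


def miss_rate_simulated(accuracy: float, top_n: int,
                        trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the same chance by simulating independent picks."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # A pick fails when the draw is >= accuracy; the true author is
        # missed only if every one of the top_n picks fails.
        if all(rng.random() >= accuracy for _ in range(top_n)):
            misses += 1
    return misses / trials


for n in (1, 2, 3):
    print(n, miss_rate_closed_form(0.80, n), miss_rate_simulated(0.80, n))
```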
3.3 The proposed protocol formalized
We can thus formalize the proposed authorship analysis protocol as follows: Gather an ad-hoc collection of three to five authors other than the author of interest. Run a number of independent tests on different feature sets to determine which author is most similar to the questioned document on each feature set. (JGAAP [13, 14] provides a large number of feature sets from which to choose, and is designed to be extensible so that people can add further sets of interest.) Any author not among the two or three most likely candidate authors on a given test is eliminated as a potential author. If, after enough experiments have been run, the only author not eliminated is the author of interest, his or her authorship of the questioned document is deemed confirmed.
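The elimination step can be sketched in a few lines. The code below assumes each test has already produced a list of authors ordered from most to least similar. The function names, the keep_top parameter, and the rankings shown are hypothetical placeholders, not results from any of the cases discussed here.

```python
from typing import Dict, List, Optional, Set


def surviving_authors(rankings: Dict[str, List[str]], keep_top: int = 2) -> Set[str]:
    """Authors who appear among the keep_top most similar on every feature set."""
    survivors: Optional[Set[str]] = None
    for ranked in rankings.values():
        top = set(ranked[:keep_top])
        # An author outside the top few on any single test is eliminated.
        survivors = top if survivors is None else survivors & top
    return survivors if survivors is not None else set()


def authorship_confirmed(rankings: Dict[str, List[str]],
                         author_of_interest: str, keep_top: int = 2) -> bool:
    """True only if the author of interest is the sole author never eliminated."""
    return surviving_authors(rankings, keep_top) == {author_of_interest}


# Invented rankings for illustration only.
example_rankings = {
    "word lengths": ["Candidate", "Distractor A", "Distractor B", "Distractor C"],
    "character 4-grams": ["Distractor B", "Candidate", "Distractor A", "Distractor C"],
    "word pairs": ["Candidate", "Distractor C", "Distractor B", "Distractor A"],
    "100 most frequent words": ["Candidate", "Distractor B", "Distractor A", "Distractor C"],
}
print(surviving_authors(example_rankings))
print(authorship_confirmed(example_rankings, "Candidate"))
```

Taking the intersection across tests means a single poor ranking eliminates an author, which is why the relaxed top-two or top-three criterion matters for limiting false rejections.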
3.4 An example (Rowling)
The Galbraith/Rowling case is instructive. In this case, I was provided a distractor set of three authors, all contemporary female British crime writers, so their writings would be comparable to "Galbraith's." Tests were run on four separate feature sets: word lengths, character 4-grams, word pairs, and the 100 most frequent words. Of the four authors, only Rowling was not eliminated by at least one feature set.
We can determine the likelihood of error as follows: Assuming that Rowling was not the author, the probability of her appearing in the top half (top 2 of 4) in any list of candidate authors would be 50%; thus, across four independent tests, she would have one chance in 16 (approximately 6%) of not being eliminated through this procedure.
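The arithmetic generalizes to other set sizes and test counts. The sketch below assumes that, for a non-author, every rank is equally likely on each test and that the tests are independent; chance_of_surviving is a name introduced only for this example.

```python
from fractions import Fraction


def chance_of_surviving(keep_top: int, n_candidates: int, n_tests: int) -> Fraction:
    """Chance that a non-author lands in the top keep_top on every one of n_tests tests."""
    if not 0 < keep_top <= n_candidates:
        raise ValueError("keep_top must be between 1 and n_candidates")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    per_test = Fraction(keep_top, n_candidates)  # e.g. 2/4 for top 2 of 4
    return per_test ** n_tests


p = chance_of_surviving(keep_top=2, n_candidates=4, n_tests=4)
print(p, float(p))  # 1/16 0.0625
```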
4. Discussion and Conclusions
Perhaps obviously, there are some caveats to the proposed protocol. The most important is, of course, the implicit assumption of independence. Is it reasonable to believe that an author's distribution of word lengths is independent of his or her use of common function words? More importantly, can this belief be validated empirically and justified theoretically? Similarly, there are some numbers in the protocol that may need tightening: is three to five distractor authors enough? Are five better than three? Can these numbers be justified? We will discuss this further but invite commentary on this point.
It should also be clear that this paper does not ipso facto establish a mandatory standard for authorship studies. We invite discussion and even competing proposals, in addition to further studies to establish not only what other protocols might be more accurate, but also which ones are easier to apply, or even more likely to generate useful information (beyond simple authorship). One key aspect of this proposal is that it relies primarily on rank-order statistics and does not take into account the degree of variation; a more sophisticated protocol might use parametric statistics for greater power, at the possible cost of increased complexity.
From a practical standpoint, however, this protocol may represent a substantial maturation of the field. Not only have we used it ourselves, but it has also been used by third parties [5]. The results have been validated by reference to independent ground truths (Rowling acknowledged authorship on July 12, 2013 [15]). The results have even been accepted in courts of law. We are thus confident that the proposed protocol will provide a relatively clear-cut way to reduce controversy regarding stylometric authorship attribution and increase its uptake and credibility.
References
1. Kuhn, Thomas S. (1996). The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press.
2. "Thomas Kuhn" (2013). Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/thomas-kuhn/ (Accessed 31 October 2013).
3. Daubert v. Merrell Dow Pharmaceuticals (1993), 509 U.S. 579.
4. Rudman, Joseph (1998). The state of authorship attribution studies: Some problems and solutions, Computers and the Humanities, vol. 31, pp. 351–365.
5. Collins, Paul (2013). Poe's Debut, Hidden in Plain Sight? New Yorker Blog, 7 October. www.newyorker.com/online/blogs/books/2013/10/edgar-allan-poe-earliest-stories-language-software-investigation.html
6. Juola, Patrick (2013). Stylometry and Immigration: A Case Study. Journal of Law & Policy, vol. 21, pp. 287-298.
7. Juola, Patrick (2013). Rowling and 'Galbraith': An Authorial Analysis. Language Log posting, 16 July 2013. languagelog.ldc.upenn.edu/nll/?p=5315
8. Juola, Patrick (2008), Authorship Attribution. Delft: NOW Publishing.
9. Juola, Patrick (2012). Large-Scale Experiments in Authorship Attribution. English Studies 93: 275-283.
10. Juola, Patrick (2012). An Overview of the Traditional Authorship Attribution Subtask. Proceedings of PAN 2012, Rome, Italy.
11. Juola, Patrick and Efstathios Stamatatos (2013). Overview of the Author Identification Task. PAN/CLEF 2013, Valencia, Spain.
12. Moshe Koppel, Jonathan Schler, Shlomo Argamon & Yaron Winter (2012). The “Fundamental Problem” of Authorship Attribution, English Studies, 93:3, 284-291.
13. Juola, Patrick (2009), JGAAP: A System for Comparative Evaluation of Authorship Attribution. Proceedings of the Chicago Colloquium on Digital Humanities and Computer Science 1.
14. Juola, Patrick. (2012). JGAAP 5.3.0: A System for General Purpose Text Classification Experiments. EACL 2012 Workshop on Computational Approaches to Deception Detection, Avignon, France.
15. Richard Brooks (2013). Whodunnit? JK Rowling’s secret life as wizard crime writer revealed. Sunday Times article of July 14.
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
Lausanne, Switzerland
July 7, 2014 - July 12, 2014
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)
Organizers: ADHO