New York University
1. The need for theory
Harold Somers and Fiona Tweedie pose the following question: If a vocabulary-based authorial attribution technique fails to attribute an original text and a pastiche to different authors, “is this because the pastiche is good, or because the technique is faulty” (Somers and Tweedie, 2003)? The question places computational techniques and literary concerns into a direct relationship, inviting a formulation that might “leap from [word] frequencies to meanings” (Craig, 1999), and so seems an ideal opportunity to explore and interrogate our assumptions surrounding literary interpretation, literary style, and genre. Moreover, a failure or refusal to attend to the question would seem to be a catastrophe for attribution studies: without a generalized set of criteria to critique how stylistic concerns influence an algorithm’s effectiveness at identifying an author’s statistical “fingerprints,” the general validity of authorial attribution techniques will remain contested despite persuasive examples such as Hoover (2003) and Garcia and Martin (2007). While scholars such as Jockers (2013) are interested in exploring how attributes such as lexical variability, word frequencies, word choice, and other statistical measures can be used as indicators of authorship, style, genre, gender, and even nationality, there remains a paucity of theory to explain why these and other indicators happen to be more or less effective with respect to the finite sets of authors, books, and/or genres they are applied to. Somers and Tweedie approach their question pragmatically; that is, they subject Alice in Wonderland, Gilbert Adair’s pastiche entitled Alice through the Needle’s Eye, and several “control” texts to a battery of authorship attribution techniques and report the results of their tests. They do not, however, provide a theoretical framework to understand why one technique would be more or less effective than another. Nor are their results generalizable to other cases, because no larger theoretical framework exists to explain how Somers and Tweedie’s experiments may relate to a different set of originals and pastiches we might examine. The success or lack of success of the statistical techniques used to distinguish authorship has much to do with the idiosyncrasies of the individual texts and authors being considered and is frequently aided by our historical knowledge of existing texts for which authorship is already known. But far too little effort has been devoted to developing a theoretical model that might provide us with a compendium of the possible ways our statistical methods might fail.
2. The interdependence of authorship and style
Somers and Tweedie’s question highlights the contextual dependencies of our terms and the basic differences in assumption between nontraditional authorship attribution and computational stylistics. Effectiveness, for example, appears sensible in the context of authorship attribution techniques, but less so in the context of stylistics. Nontraditional authorship attribution techniques exist in a system for which the value of the question hinges on its falsifiability. Is a text of unknown authorship written by author X or Y given a set of existing texts written by both authors? There can only be one correct historical answer (usually a single author), and this correct answer is mutually exclusive with any other answer. Such “facts” are independent of the method we use to discover them. Alternatively, the question of pastiche quality, of whether the imitation is well or poorly done, is a question for which stylistics should provide an answer; nontraditional authorship attribution may also influence judgments of quality. When performed under the banner of literary studies, computational stylistics is concerned with interesting answers that point us to new interpretive insights about a particular text we are studying. These facts are less stable in that they depend on the methods which allow for their discovery and are not necessarily mutually exclusive.
Yet the epistemological distinctions above are countermanded by cases in which the concerns of authorship interpenetrate the concerns of style in ways that are difficult to generalize. When Erasmus declares a certain letter to be incorrectly attributed to St. Jerome based on the belief that “Jerome has a special quality about him, a kind of mental savour and temperament, a quality which may be felt rather than explained,” and, earlier, when we see this “never-failing quality, his lively humour...which the learned admire in Cicero,” the stylistic concern of quality is being used to determine authorship (Erasmus 1992, 80; see also Love 2002, 18-22). Yet are issues of authorship and style always interrelated? The answer, I contend, is yes; in limiting cases where we appear to have isolated these concerns, it is because we have already (intentionally or unintentionally) picked our texts in such a way that separation becomes possible.
When we point to a question that does appear to belong exclusively to the domain of stylistics or authorship attribution, is it not always the case that a careful a priori selection of the texts was conducted at an earlier stage of analysis in which authorship and style did impinge on each other? The decision, for example, to undertake a nontraditional authorship attribution test necessarily entails never losing sight of the relationship between authorship and style (i.e., genre) since the signal from the latter sometimes “overpowers” the signal of the former (as one sees in Hoover 2013). And when we do find a statistical result that countermands our (literary) expectations, is it not a useful first step to examine the interdependence between authorship and style to account for the surprise?
3. Pastiche quality and authorship
Somers and Tweedie’s original question can be separated into two: the first relates to the fundamental validity of computational authorship attribution techniques and the second relates to a functional definition of what a pastiche is. Explicating these two questions in detail is useful in developing a theoretical perspective to critique and explore Somers and Tweedie’s paper as well as for developing a better theoretical foundation for the acceptance or rejection of certain assumptions inherent to the contemporary practice of computational stylistics.
As Somers and Tweedie note, for authorship attribution techniques to be most effective one tracks linguistic habits “which may be the least susceptible to variation” (412). That is, we look at features that an author produces unconsciously, since the features an author has no control over are those expected to be least affected by the idiosyncrasies of genre, historical moment, and so forth. Yet if we are seeking to examine the “quality” of a pastiche, as Somers and Tweedie ask, then we are seeking features an author is consciously employing in pursuit of imitating another author. The similarities that are relevant to the literary quality of a pastiche would necessarily be those features that a human reader is able to readily identify, and are likely to be those that a reader has had the most practice at identifying. To be sure, the category of pastiche has perhaps a more overt connection to both past literature and contemporary culture: its existence is defined directly by what has already been written and depends upon the reader’s recognition of this. These relationships between tradition and culture are perhaps no more or less important to other literary forms, but the category of pastiche specifically asks the reader to reflect upon such relationships directly and overtly.
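As a purely illustrative sketch of what tracking such unconscious habits can look like in practice, the following Python compares texts by the relative frequencies of a few function words. The word list, the function names, and the distance (a plain mean absolute difference) are assumptions made for this example; they are not the battery of techniques that Somers and Tweedie report.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for illustration).
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "that", "it")


def function_word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in words]


def profile_distance(p, q):
    """Mean absolute difference between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def nearest_candidate(disputed, candidates):
    """Return the label of the candidate text whose profile is closest.

    `candidates` maps a label (e.g. an author's name) to a sample text.
    """
    target = function_word_profile(disputed)
    return min(
        candidates,
        key=lambda label: profile_distance(
            target, function_word_profile(candidates[label])
        ),
    )
```

A measure of this kind registers only surface frequencies. If it places a pastiche nearest to the imitated author, the number alone cannot say whether the imitation succeeded or the feature set was too coarse, which is the ambiguity the question above turns on.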
Reframing Somers and Tweedie’s question as two questions allows us to adopt a scheme from R. G. Collingwood’s “On the So-Called Idea of Causation” so as to parse the original ambiguity in Somers and Tweedie’s question into several logically distinct classes. This parsing will allow us to see that attributing the success or failure of authorship attribution algorithms to only the two possibilities of algorithm effectiveness or pastiche quality effectively equates two incommensurable ontological systems as if they were logically consistent. To avoid this difficulty, we need only clarify the original question so that we are acting in a logically consistent manner. However, this particular scheme comes with a high theoretical cost. To resolve the ambiguity inherent to Somers and Tweedie’s question, it may be necessary to resort to literary descriptive categories for which the identification by a finite series of computable steps may be theoretically forbidden.
References
Collingwood, R. G. (1938). “On the So-Called Idea of Causation.” Proceedings of the Aristotelian Society, New Series 38: 85-112.
Craig, H. (1999). “Authorial Attribution and Computational Stylistics: If You Can Tell Authors Apart, Have You Learned Anything About Them?” Literary and Linguistic Computing 14: 103-13.
Erasmus, D. (1992). Collected Works of Erasmus. Ed. and trans. James Brady and John Olin. Toronto: University of Toronto Press.
Garcia, A. and Martin, J. (2007). “Function Words in Authorship Attribution Studies.” Literary and Linguistic Computing 22(1): 49-66.
Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History. Urbana-Champaign: University of Illinois Press.
Hoover, D. and Hess, S. (2009). “An exercise in non-ideal authorship attribution: the mysterious Maria Ward.” Literary and Linguistic Computing 24(4): 467-89.
Hoover, D. (2003). “Multivariate analysis and the study of style variation.” Literary and Linguistic Computing 18(4): 341-60.
Hoover, D. (2013). “The Full-Spectrum Text-Analysis Spreadsheet.” DH 2013 Conference Abstracts. University of Nebraska-Lincoln.
Love, H. (2002). Attributing Authorship: An Introduction. Cambridge: Cambridge University Press.
Ramsay, S. (2003). “Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18(2): 167-74.
Somers, H. and Tweedie, F. (2003). “Authorship Attribution and Pastiche.” Computers and the Humanities 37: 407-29.
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
July 7, 2014 - July 12, 2014
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)