Authorial Attribution and Computational Stylistics: if you can tell authors apart, have you learned anything about them?

  1. Hugh Craig

    Centre for Literary and Linguistic Computing - University of Newcastle


Attribution studies, alone among the various literary applications of statistics, have achieved mainstream status - finding an audience in the most-read specialist journals, such as PMLA, and (when it is a matter of a new Shakespeare ascription) even in the mass media. On the other hand, descriptive uses of quantitative measures in literary study have made little impact beyond a small circle of practitioners, and articles arising from them are to be found for the most part only in journals dedicated to the overlap of computer applications and the humanities. In the area of descriptive monographs in English literature from major publishing houses, once one has listed Milic on Swift, Burrows on Austen and Corns on Milton, then the field is more or less covered. This paper will attempt an in-principle account of the curious relationship between the twin disciplines of computer-assisted classification and literary description based on empirical measures, with an eye to the incongruent assumptions of rule-based classification and of the interpretation of texts as it is routinely practised in the discipline.
The issues will be illustrated through a worked example, a study of the plays of the English Renaissance playwright Thomas Middleton in the context of a large sample of plays from the same period by other authors. The paper will explore the implications of using this same dataset (frequencies of very common words in ninety-plus plays) both for a specific problem of authorial attribution and for a more open-ended interpretation of the stylistic characteristics which set the Middleton group apart from the others.
It is worth noting at the outset that it is odd, on the face of it, that computational stylistics should concentrate its efforts on the classification of texts and text samples into "given" categories such as author and, to a lesser extent, date, gender, nationality, and so on, at a time when, in the dominant discourse of the discipline, such categories have long been the object of a radical scepticism. In this sense computational stylistics and conventional literary study are seriously divergent. Moreover, little effort has been made on either side to move beyond mutual suspicion or ignorance and towards a serious, informed engagement. At present, there is a gulf between the positivism which is the foundational assumption of computational stylistics, indeed its raison d'être, and the constructivism which is at the basis of the most influential recent acts of literary interpretation, and often part of their declared project. The upshot (given the dominance of the latter in the discipline) is the neglect of the former, except (as already mentioned) in the specialised sub-discipline of authorial attribution.
The question arises as to what the fundamental difference is between classification studies and descriptive work in a literary-statistical context. The two manners of proceeding can on one level be distinguished as a focus on independent and on dependent variables. In classification, dependent variables (quantities of a given linguistic feature, be it word-type, syntactical class, or higher-level construct such as image or metaphor) are of interest as means to the end of reliably grouping text samples according to independent variables, typically with a small number of values such as Author A, B or C, Author A or not-A, or Date Range T1 or T2. In description, on the other hand, the behaviour of the dependent variables is of interest in itself, and is interpreted not in relation to previously established, "given" variables but as exhibiting structures which invite understanding in fluid, incremental ways, dimensions (in the metaphorical rather than mathematical sense) which may be constructed by the interpreter, informed by the governing interests and expectations of the discipline. In terms of method, the distinction might be encapsulated by the difference between two families of technique. On one side are Discriminant analyses or neural networks directed by a "training set" of data based on a categorical variable. On the other are Principal Component analyses or biaxial plots of frequencies of linguistic features, where an interpretation will take the form of an argument that the patterns exhibited should be understood in terms of pressures and traces of higher-order factors which are alleged to be operating in the text samples, but which are not themselves patient of quantitative analysis, and are derived from the tradition of literary study.
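The methodological contrast drawn above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the marker-word list is hypothetical (the abstract does not name its word set), and a simple nearest-centroid rule is swapped in for Discriminant analysis to keep the example self-contained. The function `classify` is directed by labelled training texts, whereas `biaxial_points` simply returns unlabelled coordinates for a two-word plot and leaves any grouping to the interpreter.

```python
import re
from collections import Counter
from math import dist

# Hypothetical marker words; the abstract does not list the words it counts.
MARKERS = ["the", "and", "of", "upon"]


def rel_freqs(text, words=MARKERS):
    """Relative frequency of each word in `words` within `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in words]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(sample, training):
    """Classification: assign `sample` to the label whose training centroid
    is nearest. `training` maps a label (e.g. an author) to a list of texts.
    Nearest-centroid is a stand-in here, not Discriminant analysis proper."""
    centroids = {
        label: centroid([rel_freqs(t) for t in texts])
        for label, texts in training.items()
    }
    vector = rel_freqs(sample)
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


def biaxial_points(texts, word_x, word_y):
    """Description: one (x, y) point per text for a two-word plot.
    No labels are supplied; any pattern is left to interpretation."""
    return [tuple(rel_freqs(t, [word_x, word_y])) for t in texts]
```

In the first function the categorical variable (the label) directs the analysis; in the second it plays no part, and whatever clusters appear in the points must be argued for on other grounds.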
Common sense suggests that if quantitative measures are reliable in telling authors apart, and in making some other classifications according to "given" categories like date, and offer access to internal evidence genuinely independent of impressionistic criticism, then some among them ought also to be of use in the main business of literary study, the interpretation of texts. The reverse is certainly true: if quantitative measures are to be taken seriously as characterising an author's style, they should be subject to the test that they can serve to separate multiple shorter text segments known to be by the author from a set of text segments of similar genre or mixture of genre, similar period, and the same nationality, by other writers.
It is also true that some measures, useful for attribution studies, have only limited usefulness for literary criticism or even for stylistics. Cyrus Hoy's discovery that a high proportion of "ye" forms among second-person pronouns was a reliable marker for the authorship of John Fletcher as against his various collaborators, elaborated in a series of articles in the 1950s and early 1960s, is an example. A propensity of this kind would seem to reflect nothing more than a linguistic idiosyncrasy, an unmotivated choice among semantically equivalent alternatives. Certainly Hoy makes no mention of any significance for this marker other than the pragmatic identification of Fletcher's shares in collaborative work. Indeed, an "unconscious" aspect in a marker feature is often a declared desideratum. Such apparently indifferent selections ("while" for "whilst", "upon" for "on") are regarded as less subject to deliberate variation within an oeuvre, and thus give classifiers confidence that they are dealing with a pattern at such a base level that imitation is ruled out. To this extent there is a tension between classification and description. As soon as the density or scarcity of a feature takes on a stylistic aspect, and comes to be seen as a purveyor of meaning even at a very general level (contributing to an impression of archaism or modernity, simplicity or sophistication, an intra- or intersubjective focus, and so on), the problem of conscious control over such markers creates a difficulty, arising from the possibility that such density or scarcity could be varied as a matter of genre, or imitated by another.
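A marker of the kind Hoy describes reduces to a simple ratio, which the following Python sketch makes concrete. It is an illustration only, not a reconstruction of Hoy's procedure: the set of forms counted as second-person pronouns is an assumption made for the example, and the function name is hypothetical.

```python
import re

# Assumption: this small set stands in for "second-person pronouns";
# a real study would settle the inventory of forms (and spellings) itself.
SECOND_PERSON = {"ye", "you", "thou", "thee"}


def ye_proportion(text):
    """Share of 'ye' among the second-person pronoun tokens in `text`.
    Returns 0.0 when the text contains none of the listed forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pronouns = [t for t in tokens if t in SECOND_PERSON]
    if not pronouns:
        return 0.0
    return pronouns.count("ye") / len(pronouns)
```

Computed over each scene or act of a collaborative play, such a ratio yields a number that can separate shares without saying anything about what the preference means, which is the limitation at issue here.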
Here, at a more general level, the "van Peer dilemma" comes into play. The most reliable linguistic variables (those based on linguistic features that are simple enough to be readily identified), the theory goes, are impossible to interpret because of their very simplicity; on the other hand, high-order language features that are susceptible of literary interpretation are too complex to admit of rule-bound categorisation and so cannot be usefully counted.
The paper will argue that there is an escape from this dilemma through middle-order countable features such as the common words. It will also suggest that classification studies and descriptive ones can be mutually reinforcing rather than competitive: a success in classification can give confidence in the reliability of measures which can then be subjected to a "literary" interpretation, and an understanding of the behaviour of the dependent variables can offer an understanding at a deeper level of what it is that has brought about a discrimination according to author.


Conference Info


"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC