Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution

  Harald Baayen

    Max Planck Institute for Psycholinguistics - University of Nijmegen

  Roald Skarsten

    University of Bergen

This session describes an experiment in authorship attribution in which statistical measures and methods that have been widely applied to words and their frequencies of use are applied to rewrite rules as they appear in a syntactically annotated corpus. The outcome of this experiment suggests that the frequencies with which syntactic rewrite rules are put to use provide at least as good a cue to authorship as word usage. Moreover, one method, which focuses on the use of the lowest-frequency syntactic rules, has a higher resolution than traditional word-based analyses, and promises to be a useful new technique for authorship attribution.

A number of recent contributions to authorship attribution are based on words and their frequencies of occurrence (see, e.g., Burrows 1992, 1993; Holmes 1994; Holmes and Forsyth 1995). This comes as no surprise, as the statistical analysis of word frequencies requires minimal textual preprocessing. Nevertheless, precisely those words that proved to have a high discriminatory resolution in the seminal work by Burrows (1992, 1993), the so-called function words (a, the, that, and, but, etc.), appear to tap into the use of syntax. This suggests that it might be profitable to study the use of syntax directly, by analyzing the use of rewrite rules in texts.
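To make the notion of rewrite-rule frequencies concrete, the following minimal Python sketch counts the rules used in bracketed parse trees. It is an illustration only: the bracketing format, the category labels (S, NP, VP, ...), and the function names are assumptions made for this example and do not reproduce the annotation scheme or software used in the experiment.

```python
from collections import Counter


def tokenize(bracketed):
    """Split a bracketed tree string into parentheses, labels, and words."""
    return bracketed.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse the subtree starting at tokens[pos]; return (tree, next_pos).

    A tree is a (label, children) pair; a child is a tree or a word string.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a word at the leaf level
            pos += 1
    return (label, children), pos + 1


def rewrite_rules(tree):
    """Yield one 'LHS -> RHS' string per node that dominates other nodes.

    Nodes that dominate only words are skipped, so the counts reflect
    syntactic structure rather than vocabulary.
    """
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        rhs = " ".join(child_label for child_label, _ in subtrees)
        yield label + " -> " + rhs
        for child in subtrees:
            yield from rewrite_rules(child)


def rule_frequencies(bracketed_sentences):
    """Count rewrite-rule occurrences over a list of bracketed sentences."""
    counts = Counter()
    for sentence in bracketed_sentences:
        tree, _ = parse(tokenize(sentence))
        counts.update(rewrite_rules(tree))
    return counts


# Hypothetical toy sample, not taken from the corpus used in the experiment.
sample = [
    "(S (NP (DET the) (N dog)) (VP (V barks)))",
    "(S (NP (DET a) (N cat)) (VP (V sees) (NP (DET the) (N dog))))",
]
frequencies = rule_frequencies(sample)
```

The resulting frequency table can be treated in the same way as a word frequency list, so that measures developed for words can in principle be applied to rules without modification.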

We have designed a statistical experiment using syntactically annotated corpus material to investigate the discriminatory potential of syntactic rewrite rules for authorship attribution. The corpus, its syntactic annotation, and the details of the design of our statistical experiment are discussed in section 1 by van Halteren. In section 2, Tweedie discusses the accuracy of methods based on measures of vocabulary richness and of methods based on the highest-frequency elements, applied both to words and to rewrite rules. In section 3, Baayen investigates the discriminatory potential of the way in which authors make use of the lowest-frequency rewrite rules.

Before going into further detail, we need to make explicit three crucial details of our methodology. First, traditionally, as in the study by Mosteller and Wallace (1964), a text of unknown authorship is compared with texts whose authorship is beyond doubt. In our experiment, the authorship of all texts is known (albeit only to the experiment leader, van Halteren, and not to Tweedie and Baayen, who carried out the analyses). This allows us to evaluate the accuracy of the methods we have used in a straightforward way. Second, a pilot study showed that texts written by one author in different genres can differ more than texts written by different authors in the same genre. We have therefore selected our texts from one particular text type, crime fiction. Third, to ensure that the accuracy of assignment is independent of our particular split into labeled and unlabeled text fragments, we also required that a successful method group all text fragments by author into clearly distinguishable clusters.

