Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution

multipaper session
  1. Harald Baayen

    Max Planck Institute for Psycholinguistics - University of Nijmegen

  2. Roald Skarsten

    University of Bergen

Work text

This session describes an experiment in authorship attribution in which statistical measures and methods that have been widely applied to words and their frequencies of use are applied to rewrite rules as they appear in a syntactically annotated corpus. The outcome of this experiment suggests that the frequencies with which syntactic rewrite rules are put to use provide at least as good a cue to authorship as word usage. Moreover, one method, which focuses on the use of the lowest-frequency syntactic rules, has a higher resolution than traditional word-based analyses, and promises to be a useful new technique for authorship attribution.

A number of recent contributions to authorship attribution are based on words and their frequencies of occurrence (see, e.g., Burrows 1992, 1993; Holmes 1994; Holmes and Forsyth 1995). This comes as no surprise, as the statistical analysis of word frequencies requires minimal textual preprocessing. Nevertheless, precisely those words which have proved to have a high discriminatory resolution in the seminal work by Burrows (1992, 1993), the so-called function words (a, the, that, and, but, etc.), appear to tap into the use of syntax. This suggests that it might be profitable to study the use of syntax directly by analyzing the use of rewrite rules in texts.
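To make the notion of counting rewrite rules concrete, here is a minimal sketch in Python. It assumes the annotation is available as bracketed trees with Penn-Treebank-style labels; the abstract does not specify the annotation format of the corpus, and the sample trees and the function names (`parse_bracketed`, `rewrite_rules`, `rule_frequencies`) are invented for illustration.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def rewrite_rules(node):
    """Yield one 'LHS -> RHS' string for every node that has subtrees."""
    label, children = node
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield label + " -> " + " ".join(c[0] for c in subtrees)
        for child in subtrees:
            yield from rewrite_rules(child)


def rule_frequencies(trees):
    """Count how often each rewrite rule is used across a list of trees."""
    counts = Counter()
    for tree in trees:
        counts.update(rewrite_rules(parse_bracketed(tree)))
    return counts


# Invented sample trees; the labels are an assumption, not the corpus's tag set.
SAMPLE = [
    "(S (NP (DT the) (NN detective)) (VP (VBD opened) (NP (DT the) (NN door))))",
    "(S (NP (PRP she)) (VP (VBD waited)))",
]
counts = rule_frequencies(SAMPLE)
for rule, freq in counts.most_common():
    print(freq, rule)
```

The resulting frequency table has the same shape as a word frequency list, so any measure defined on word counts can in principle be applied to it unchanged.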

We have designed a statistical experiment using syntactically annotated corpus material to investigate the discriminatory potential of syntactic rewrite rules for authorship attribution. The corpus, its syntactic annotation, and the details of the design of our statistical experiment are discussed in section 1 by van Halteren. In section 2, Tweedie discusses the accuracy of methods based on measures of vocabulary richness and of methods based on the highest-frequency elements, applied both to words and to rewrite rules. In section 3, Baayen investigates the discriminatory potential of the way in which authors make use of the lowest-frequency rewrite rules.
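The three kinds of measure mentioned above can be illustrated with a short Python sketch that works on any list of tokens, whether words or rewrite rules. The abstract does not say which richness measures were used; the type-token ratio, the proportion of hapax legomena, and Yule's K are chosen here only as well-known examples, and the function names are hypothetical.

```python
from collections import Counter


def frequency_spectrum(items):
    """Map each frequency m to V(m), the number of types occurring m times."""
    type_counts = Counter(items)
    return Counter(type_counts.values())


def richness(items):
    """Return a few common richness measures for a list of tokens."""
    n_tokens = len(items)
    spectrum = frequency_spectrum(items)
    n_types = sum(spectrum.values())
    hapaxes = spectrum.get(1, 0)  # types that occur exactly once
    # Yule's K = 10^4 * (sum_m m^2 V(m) - N) / N^2
    sum_sq = sum(m * m * v for m, v in spectrum.items())
    yule_k = 1e4 * (sum_sq - n_tokens) / (n_tokens ** 2)
    return {
        "type_token_ratio": n_types / n_tokens,
        "hapax_proportion": hapaxes / n_types,
        "yule_k": yule_k,
    }


def top_relative_frequencies(items, k):
    """Relative frequencies of the k most frequent types."""
    n_tokens = len(items)
    return [(t, c / n_tokens) for t, c in Counter(items).most_common(k)]


# Toy input: the items could equally be words or 'LHS -> RHS' rule strings.
toy = ["a", "a", "b", "c"]
measures = richness(toy)
top = top_relative_frequencies(toy, 1)
```

The hapax proportion looks only at the low-frequency end of the spectrum, while `top_relative_frequencies` looks only at the high-frequency end; this mirrors the division of labour between sections 2 and 3, though not necessarily the exact statistics used there.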

Before going into further detail, we need to make explicit three crucial details of our methodology. First, traditionally, as in the study by Mosteller and Wallace (1964), a text of unknown authorship is compared with texts whose authorship is beyond doubt. In our experiment, the authorship of all texts is known (albeit only to the experiment leader, van Halteren, and not to Tweedie and Baayen, who carried out the analyses). This allows us to evaluate the accuracy of the methods we have used in a straightforward way. Second, a pilot study showed that texts written by one author in different genres can differ more than texts written by different authors in the same genre. We have therefore selected our texts from one particular text type, crime fiction. Third, to ensure that the accuracy of assignment is independent of our particular split into labeled and unlabeled text fragments, we also required that a successful method should group the text fragments of different authors into clearly distinguishable clusters.
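Because the true author of every fragment is known to the experiment leader, accuracy can be scored as the proportion of unlabeled fragments assigned to the right author. The sketch below shows one simple way this could be done, assuming each fragment is summarized as a dictionary of relative rule frequencies and assigned to the nearest labeled profile by Euclidean distance. Both the assignment rule and the numbers are invented for illustration; the abstract does not describe the classification procedure actually used.

```python
import math


def distance(p, q):
    """Euclidean distance between two sparse relative-frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def assign(fragment, labeled):
    """Return the author whose labeled profile is closest to the fragment."""
    return min(labeled, key=lambda author: distance(fragment, labeled[author]))


def accuracy(unlabeled, labeled):
    """Proportion of (true_author, profile) pairs assigned correctly."""
    correct = sum(
        assign(profile, labeled) == true_author
        for true_author, profile in unlabeled
    )
    return correct / len(unlabeled)


# Invented profiles: relative frequencies of three rewrite rules per author.
labeled = {
    "author_A": {"S -> NP VP": 0.50, "NP -> DT NN": 0.30, "NP -> PRP": 0.20},
    "author_B": {"S -> NP VP": 0.40, "NP -> DT NN": 0.10, "NP -> PRP": 0.50},
}
# Invented test fragments with their true authors (hidden from the analyst).
unlabeled = [
    ("author_A", {"S -> NP VP": 0.48, "NP -> DT NN": 0.32, "NP -> PRP": 0.20}),
    ("author_B", {"S -> NP VP": 0.42, "NP -> DT NN": 0.12, "NP -> PRP": 0.46}),
    ("author_B", {"S -> NP VP": 0.50, "NP -> DT NN": 0.28, "NP -> PRP": 0.22}),
]
score = accuracy(unlabeled, labeled)
```

In this toy data the third fragment lies closer to the profile of `author_A` than to that of its true author, so two of the three fragments are assigned correctly.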

Burrows, J. F. (1992). Computers and the Study of Literature. In: C. S. Butler (Ed.), Computers and Written Texts. Oxford: Blackwell, pp. 167-204.

Burrows, J. F. (1993). Tiptoeing into the Infinite: Testing for Evidence of National Differences in the Language of English Narrative. In: S. Hockey and N. Ide (Eds.), Research in Humanities Computing '92. London: Oxford University Press.

Holmes, D. I. (1994). Authorship Attribution. Computers and the Humanities 28(2): 87-106.

Holmes, D. I. and Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing 10(2): 111-127.

Mosteller, F. and Wallace, D. L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. New York: Springer.

Conference Info


Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC