Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution

Harald Baayen; Roald Skarsten

Authorship

1. Harald Baayen, Max Planck Institute for Psycholinguistics - University of Nijmegen

2. Roald Skarsten, University of Bergen

Child sessions

Comparison of word-based and syntax-based methods: Vocabulary richness measures and the highest frequency elements, Fiona Tweedie
Experimental Design: Syntactic Annotation as Words, Hans van Halteren
The discriminatory potential of the lowest frequency rewrite rules, Harald Baayen

Original URL

https://web.archive.org/web/19981207053458/http://gonzo.hit.uib.no/allc-ach96/Panels/Baayen/BAAYENNY.html

Work text

Abstract
This session describes an experiment in authorship attribution in which statistical measures and methods that have been widely applied to words and their frequencies of use are applied to rewrite rules as they appear in a syntactically annotated corpus. The outcome of this experiment suggests that the frequencies with which syntactic rewrite rules are put to use provide at least as good a cue to authorship as word usage. Moreover, one method, which focuses on the use of the lowest-frequency syntactic rules, has a higher resolution than traditional word-based analyses, and promises to be a useful new technique for authorship attribution.

Introduction
A number of recent contributions to authorship attribution are based on words and their frequencies of occurrence (see, e.g., Burrows 1992, 1993; Holmes 1994; Holmes and Forsyth 1995). This comes as no surprise, as the statistical analysis of word frequencies requires minimal textual preprocessing. Nevertheless, precisely those words that proved to have a high discriminatory resolution in the seminal work by Burrows (1992, 1993), the so-called function words (a, the, that, and, but, etc.), appear to tap into the use of syntax. This suggests that it might be profitable to study the use of syntax directly by analyzing the use of rewrite rules in texts.
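To make the notion of rewrite-rule frequencies concrete, the following is a minimal sketch of how such counts could be collected from parsed sentences. It assumes trees in a simple bracketed notation with illustrative category labels (S, NP, VP, and so on); the annotation scheme actually used in the experiment is not described here, and the function names `parse_bracketed`, `rewrite_rules`, and `rule_frequencies` are hypothetical.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed tree, e.g. "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
    into nested (label, children) tuples; words (leaves) stay plain strings.
    The bracketed format itself is an assumption made for this sketch."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def rewrite_rules(tree):
    """Yield one "LHS -> RHS" string for every node that has category daughters.
    Lexical rules (tag -> word) are skipped, so only syntactic rules are counted."""
    label, children = tree
    daughters = [c for c in children if isinstance(c, tuple)]
    if daughters:
        yield f"{label} -> {' '.join(d[0] for d in daughters)}"
        for d in daughters:
            yield from rewrite_rules(d)


def rule_frequencies(trees):
    """Count how often each rewrite rule is used across a list of bracketed trees."""
    counts = Counter()
    for t in trees:
        counts.update(rewrite_rules(parse_bracketed(t)))
    return counts
```

The resulting counter plays the same role for rewrite rules that a word-frequency list plays for words, so the same frequency-based measures can in principle be applied to either.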

We have designed a statistical experiment using syntactically annotated corpus material to investigate the discriminatory potential of syntactic rewrite rules for authorship attribution. The corpus, its syntactic annotation, and the design of our statistical experiment are discussed in section 1 by van Halteren. In section 2, Tweedie discusses the accuracy of methods based on measures of vocabulary richness and of methods based on the highest-frequency elements, applied both to words and to rewrite rules. In section 3, Baayen investigates the discriminatory potential of the way in which authors make use of the lowest-frequency rewrite rules.

Before going into further detail, we need to make explicit three crucial aspects of our methodology. First, traditionally, as in the study by Mosteller and Wallace (1964), a text of unknown authorship is compared with texts whose authorship is beyond doubt. In our experiment, the authorship of all texts is known (albeit only to the experiment leader, van Halteren, and not to Tweedie and Baayen, who carried out the analyses). This allows us to evaluate the accuracy of the methods we have used in a straightforward way. Second, a pilot study showed that texts written by one author in different genres can differ more than texts written by different authors in the same genre. We have therefore selected our texts from one particular text type, crime fiction. Third, to ensure that the accuracy of assignment is independent of our particular split into labeled and unlabeled text fragments, we also required that a successful method group the text fragments by author into clearly distinguishable clusters.
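Because the true author of every fragment is known to the experiment leader, accuracy can be computed directly. The sketch below illustrates one way to do this under stated assumptions: each fragment is represented by feature counts (words or rewrite rules), and an unlabeled fragment is assigned to the author whose average relative-frequency profile is nearest. The nearest-centroid rule is a stand-in chosen for brevity, not the method used in the experiment, and all function names are hypothetical.

```python
import math


def profile(counts, features):
    """Relative frequency of each selected feature in one text fragment."""
    total = sum(counts.values()) or 1
    return [counts.get(f, 0) / total for f in features]


def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_author(vec, centroids):
    """Return the author whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda author: math.dist(vec, centroids[author]))


def accuracy(labeled, held_out, features):
    """Share of held-out fragments assigned to their true author.

    `labeled` and `held_out` are lists of (author, counts) pairs; the true
    author of a held-out fragment is used only for scoring, never for assignment.
    """
    by_author = {}
    for author, counts in labeled:
        by_author.setdefault(author, []).append(profile(counts, features))
    centroids = {a: centroid(vs) for a, vs in by_author.items()}
    hits = sum(
        nearest_author(profile(counts, features), centroids) == author
        for author, counts in held_out
    )
    return hits / len(held_out)
```

A single accuracy figure depends on which fragments happen to be labeled, which is why the additional clustering requirement matters: well-separated author clusters indicate that the result would survive a different split.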

References
Burrows, J. F. (1992). Computers and the Study of Literature. In: C. S. Butler (Ed.), Computers and Written Texts (pp. 167-204). Oxford: Blackwell.

Burrows, J. F. (1993). Tiptoeing into the Infinite: Testing for Evidence of National Differences in the Language of English Narrative. In: S. Hockey and N. Ide (Eds.), Research in Humanities Computing '92. London: Oxford University Press.

Holmes, D. I. (1994). Authorship Attribution. Computers and the Humanities 28(2):87-106.

Holmes, D. I. and Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing 10(2):111-127.

Mosteller, F. and Wallace, D. L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. New York: Springer.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC
