The discriminatory potential of the lowest frequency rewrite rules

  1. Harald Baayen

    Max Planck Institute for Psycholinguistics - University of Nijmegen


We have also pursued the hypothesis that further reliable and robust clues to authorship identity might be found among the hapax legomena, the rewrite rules with the lowest possible frequency of use. This hypothesis is grounded in two considerations.

First, units in the highest frequency ranges often have properties that are atypical for the population as a whole (see Baayen and Sproat, 1996). Second, since the likelihood of storage in memory increases with frequency of use, and since awareness builds on memory, it is in the highest frequency ranges that conscious and deliberate wording and syntactic phrasing may be expected. Taken jointly, these considerations suggest that the lowest frequency ranges might provide a clue to authorship that is less contaminated by the conscious rhetorical manipulation and thematic structuring that we think may affect the higher-frequency units of analysis.

Among the low-frequency units, the hapax legomena, the units which occur once only, are of special interest. Good (1953) has shown that the likelihood of observing an unseen type is estimated by the ratio of the number of hapax legomena, V(1,N), to the total number of tokens, N. In other words, P(N) = V(1,N)/N estimates the rate at which new units appear, that is, the rate at which the vocabulary of units increases. With respect to distributions of syntactic rewrite rules, this growth rate P(N) estimates the probability that an author will produce a rewrite rule that she or he has not used before. P(N) thus taps into an author's syntactic creativity, and can be used to gauge how well an author has mastered the possibilities offered by the grammar.
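The estimate P(N) = V(1,N)/N can be computed directly from a list of rule tokens. The following is a minimal sketch, not the procedure used in the study: the function name `growth_rate` and the toy rule strings are invented for illustration.

```python
from collections import Counter

def growth_rate(tokens):
    """Return (N, V, V1, P) for a sequence of rewrite-rule tokens.

    N  = number of tokens
    V  = number of distinct types, V(N)
    V1 = number of hapax legomena, V(1,N)
    P  = V1 / N, the growth rate P(N)
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    p = n_hapax / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, n_hapax, p

# Toy input: invented rule strings, not data from the study.
rules = [
    "S -> NP VP", "NP -> DET N", "NP -> DET N",
    "VP -> V NP", "NP -> N", "S -> NP VP",
]
N, V, V1, P = growth_rate(rules)
print(N, V, V1, round(P, 3))
```

Here two of the four rule types occur once only, so the sketch reports two hapax legomena among six tokens.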

Does P(N) have good discriminatory resolution for authorship attribution in our experiment? Text A appears to make a more productive use of syntax than text B, as both V(N), the total number of different construction types, and P(N) are significantly higher for A (2114, 0.090) than for B (1883, 0.074) (in both cases, p << .001, proportions test).
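The text does not say which variant of the proportions test was used. As an illustration only, the sketch below implements one common form, the two-sided pooled z-test for two proportions, which is equivalent to a chi-squared test on the 2 x 2 table without continuity correction. The token totals for A and B are not given in this abstract, so the counts in the example are hypothetical and do not reproduce the reported result.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: p1 == p2.

    x1, x2 are success counts (e.g. hapax legomena),
    n1, n2 are trial counts (e.g. tokens). Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, chosen only to exercise the function.
z, p_value = two_proportion_z_test(120, 1000, 80, 1000)
print(round(z, 3), p_value)
```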

Not surprisingly, this difference in construction richness carries over to the seven labeled samples each of A and B. After correcting for the differences in size of the twenty text samples, a classification tree analysis (Breiman, Friedman, Olshen, and Stone, 1984) on the basis of P(N) correctly assigns all unlabeled text samples. This positive result is counterbalanced by a rather imperfect classification of the labeled fragments: the same classification tree yields a misclassification rate of 2/14 for the labeled samples. (Interestingly, using V(N) instead of P(N), again corrected for differences in sample size, a misclassification rate for the labeled samples of 1/14 is obtained, and again all unlabeled samples are assigned to the correct authors.) Although P(N) and V(N) clearly capture important differences between our two authors, they are by themselves unable to satisfy the criterion we have set ourselves, namely, to obtain a classification with a misclassification rate of 0/20.
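With a single predictor such as P(N), the simplest classification tree is one split on a threshold. The sketch below illustrates that idea under stated assumptions: it picks the threshold by raw misclassification count rather than by an impurity measure such as the Gini index, it omits the correction for sample size (which the text does not specify), and the feature values are invented.

```python
def best_split(values, labels):
    """Find the threshold on one numeric feature that minimizes errors.

    Returns (threshold, label_below, label_at_or_above, n_errors),
    or None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = None
    for k in range(1, len(pairs)):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        # Each side predicts its majority class.
        left_label = max(classes, key=left.count)
        right_label = max(classes, key=right.count)
        errors = (len(left) - left.count(left_label)) + (
            len(right) - right.count(right_label)
        )
        if best is None or errors < best[3]:
            best = (threshold, left_label, right_label, errors)
    return best

# Invented feature values for eight labeled samples, not data from the study.
values = [0.070, 0.072, 0.075, 0.081, 0.079, 0.088, 0.091, 0.093]
labels = ["B", "B", "B", "B", "A", "A", "A", "A"]
threshold, below, above, errors = best_split(values, labels)
print(threshold, below, above, errors)
```

In this toy data one B sample falls among the A samples, so no single threshold separates the authors perfectly, which mirrors the kind of residual misclassification described above.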

To increase our sensitivity to author-specific differences in the use of the lowest-frequency rewrite rules, a subclassification of the hapax legomena is required. To obtain one, we sorted all rewrite rules, irrespective of their frequency, according to their left-hand side, the information appearing to the left of the arrow in the rewrite rule. Some left-hand sides appear in a great many different rewrite rules, others in just a few. We selected the left-hand sides with more than 10 different right-hand sides for further analysis. There were 49 such left-hand sides in the pooled twenty text fragments. Let L_i (i = 1, 2, ..., 49) denote the set of rewrite rules with the i-th left-hand side, and let h_{i,j} (j = 1, 2, ..., 20) denote the number of rewrite rules in text sample j belonging to set L_i that occur once only in sample j and that do not occur in any of the other text samples (the unique hapax legomena of sample j). Furthermore, let

\[ rh_{i,j} = \frac{h_{i,j}}{\sum_{k=1}^{49} h_{k,j}} \]

be the relative frequency of unique hapax legomena in text j falling in category L_i with respect to the total number of unique hapax legomena in j summed over all 49 categories. The relative frequency rh_{i,j} measures the extent to which the syntactic creativity unique to a particular author (or text sample) manifests itself in the i-th set of rewrite rules.
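The counting behind these relative frequencies can be sketched as follows. This is an illustrative reading of the definitions above, not the original implementation: the representation of a rule as an (lhs, rhs) tuple, the function name `unique_hapax_profile`, and the toy samples are assumptions, and the threshold is lowered from 10 to 1 so that the toy data qualifies.

```python
from collections import Counter

def unique_hapax_profile(samples, min_rhs=10):
    """samples: one list of (lhs, rhs) rule tokens per text sample.

    Returns (lhs_list, rh), where rh[j][i] is the share of sample j's
    unique hapax legomena whose left-hand side is lhs_list[i].
    """
    # Left-hand sides with more than min_rhs distinct right-hand sides,
    # counted over the pooled samples.
    pooled_rules = {rule for sample in samples for rule in sample}
    rhs_per_lhs = Counter(lhs for lhs, _ in pooled_rules)
    lhs_list = sorted(l for l, n in rhs_per_lhs.items() if n > min_rhs)
    index = {l: i for i, l in enumerate(lhs_list)}

    counts = [Counter(sample) for sample in samples]
    # Number of samples in which each rule occurs at least once.
    sample_freq = Counter(rule for c in counts for rule in c)

    rh = []
    for c in counts:
        h = [0] * len(lhs_list)
        for rule, freq in c.items():
            lhs = rule[0]
            # Unique hapax: once in this sample, absent from all others.
            if freq == 1 and sample_freq[rule] == 1 and lhs in index:
                h[index[lhs]] += 1
        total = sum(h)
        rh.append([x / total if total else 0.0 for x in h])
    return lhs_list, rh

# Toy samples with invented rules, not data from the study.
s0 = [("NP", "DET N"), ("NP", "N"), ("NP", "ADJ N"), ("VP", "V"), ("VP", "V NP")]
s1 = [("NP", "DET N"), ("NP", "PRON"), ("VP", "V"), ("VP", "V PP"), ("VP", "V PP")]
lhs_list, rh = unique_hapax_profile([s0, s1], min_rhs=1)
print(lhs_list, rh)
```

Rules shared by both samples, and rules used more than once within a sample, contribute nothing; each row of `rh` sums to 1 whenever the sample has at least one unique hapax legomenon.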

A Principal Components Analysis on the 20 × 49 matrix of relative frequencies rh_{i,j} revealed the pattern shown in Figure 2.


The first principal component is highly correlated with the left-hand side UTT:S ('Utterance:Sentence', r = 0.96), the second principal component with the left-hand sides CJ:CL ('Conjoin:Clause', r = -0.66), RPDU:S ('Reported Utterance:Sentence', r = 0.63), RPGT:S ('Reporting Tail:Sentence', r = -0.63) and V:VP ('Verb:Verb Phrase', r = 0.63). All unlabeled samples are correctly classified, and the samples by A and those by B also appear well-separated in Figure 2, a visual impression that is supported by a Discriminant Analysis.
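How a principal component comes to be "correlated with" a left-hand side can be illustrated with a small sketch. Several choices here are assumptions rather than details from the study: the components are taken from the covariance matrix (the text does not say whether covariances or correlations were used), power iteration stands in for a full eigendecomposition and yields only the first component, and the 4 x 3 matrix is invented.

```python
import math

def first_principal_component(matrix, n_iter=500):
    """Return (loadings, scores) for the leading principal component
    of a samples x variables matrix, via power iteration."""
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[k] for row in matrix) / n for k in range(p)]
    centered = [[row[k] - means[k] for k in range(p)] for row in matrix]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Non-uniform start: rows of relative frequencies sum to 1, so a
    # constant vector lies in the null space of the covariance matrix.
    v = [float(k + 1) for k in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(r[k] * v[k] for k in range(p)) for r in centered]
    return v, scores

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented relative frequencies: 4 samples x 3 left-hand sides.
rh = [
    [0.50, 0.30, 0.20],
    [0.45, 0.33, 0.22],
    [0.20, 0.45, 0.35],
    [0.25, 0.40, 0.35],
]
loadings, scores = first_principal_component(rh)
r = pearson(scores, [row[0] for row in rh])
print(loadings, r)
```

The sign of a principal component is arbitrary, so only the magnitude of `r` is meaningful when judging how strongly a component tracks a given left-hand side.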

This analysis again shows that syntactic annotation provides excellent clues for authorship attribution. In addition, a detailed comparison of word usage on the one hand with the use of syntax on the other (not reported here for lack of space) reveals that there is less variability in the use of syntax than in word usage. This suggests that syntax-based analyses are less likely to be foiled by idiosyncrasies of individual samples.

Interestingly, the differences between the authors A and B reveal a consistent pattern. A has a greater vocabulary size than B, both with respect to words and with respect to syntactic constructions. Moreover, analyses that for lack of space have not been discussed here reveal that A makes more use of morphologically complex words than B (the adverbial suffix -ly also appears to be a reasonable classifier), and that narrative development in A is more complex than in B. Across the board, A reveals a more creative use of the possibilities of English. Since A, Innes, is a literary critic as well as a writer of crime fiction, this difference comes as no surprise.

The lowest-frequency rewrite rules provide a window on this difference in creativity. An analysis of their use has enabled us to tease apart the text samples written by A, Innes, from those written by B, Allingham.

Baayen, R. H. and Sproat, R. (1996). Estimating lexical priors for low-frequency morphologically ambiguous forms. To appear in Computational Linguistics.
Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. J. (1984). Classification and Regression Trees. Belmont, California: Wadsworth International Group.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40, 237-264.


Conference Info



Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996


Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC

  • Keywords: None
  • Language: English
  • Topics: None