Comparison of word-based and syntax-based methods: Vocabulary richness measures and the highest frequency elements

Fiona Tweedie

Authorship

1. Fiona Tweedie

Department of Mathematical Sciences - University of the West of England

Parent session

Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution, Harald Baayen, Roald Skarsten

Original URL

https://web.archive.org/web/19981202180145/http://gonzo.hit.uib.no/allc-ach96/Panels/baayen/tweedie3.html

Work text

To evaluate the possible advantages of using rewrite rules instead of words for authorship attribution, we carried out two kinds of comparisons.

First, we compared the accuracy of methods based on vocabulary richness statistics when applied to word counts on the one hand and to frequency counts of rewrite rules on the other. Second, we compared the accuracy of methods based on counts of the highest-frequency elements: the 50 highest-frequency function words for the word-based analyses, and the 50 highest-frequency rewrite rules for the syntax-based approach.

Various measures of vocabulary richness have been proposed throughout the history of stylometry. We have used a selection of them in our multivariate analyses, following Holmes and Forsyth (1995). Our first measure, K, was proposed by Yule (1944). It is defined as:

K = 10^4 [ -1/N + \sum_{i=1}^{v} V(i,N) (i/N)^2 ]

with N the number of tokens, V(i,N) the number of types which occur i times in a sample of N tokens, and v the highest frequency of occurrence. A related measure, D, was proposed by Simpson (1949), who focused on the probability that two words randomly selected from the text are the same. His measure is defined as:

D = \sum_{i=1}^{v} V(i,N) (i/N) ((i-1)/(N-1))

The values of both D and K are primarily determined by the high end of the frequency distribution structure. They quantify the repeat rate of the samples.
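As an illustration of how the two repeat-rate measures can be computed, here is a minimal Python sketch. It assumes the standard forms K = 10^4 [ -1/N + sum_i V(i,N) (i/N)^2 ] and D = sum_i V(i,N) (i/N) ((i-1)/(N-1)). The function names and the toy sample are hypothetical and are not taken from the experiment. The same code applies to rewrite rules if each token is a rule label rather than a word.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to V(i,N), the number of types occurring i times."""
    type_counts = Counter(tokens)         # type -> frequency of that type
    return Counter(type_counts.values())  # frequency i -> V(i,N)


def yule_k(tokens):
    """Yule's K, assuming K = 10^4 * (-1/N + sum_i V(i,N) * (i/N)^2)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("K is undefined for an empty sample")
    spectrum = frequency_spectrum(tokens)
    return 1e4 * (-1.0 / n + sum(v * (i / n) ** 2 for i, v in spectrum.items()))


def simpson_d(tokens):
    """Simpson's D: chance that two tokens drawn without replacement match."""
    n = len(tokens)
    if n < 2:
        raise ValueError("D needs at least two tokens")
    spectrum = frequency_spectrum(tokens)
    return sum(v * (i / n) * ((i - 1) / (n - 1)) for i, v in spectrum.items())


# Toy sample (hypothetical): N = 10 tokens, one type occurs 3 times,
# one type occurs twice, and five types occur once.
sample = "the cat sat on the mat and the dog sat".split()
print(yule_k(sample), simpson_d(sample))
```

Hapax legomena contribute nothing to D, and little to K, which is one way to see why both measures are driven by the high-frequency end of the distribution.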

To capture the low-frequency end of the distribution, we also included measures proposed by Honoré (1979) and Sichel (1975). Honoré's measure R,

R = 100 log N / (1 - V(1,N)/V(N))

where V(N) denotes the number of different rewrite types, was used initially to examine the vocabulary of Latin judicial authors and has subsequently been used by others, including Holmes and Forsyth (1995). R takes into account the probability that the author will re-use a given type in the text rather than use a new one. Its dependence on V(1,N), the number of hapax legomena, may add useful information. Another measure that is sensitive to the low end of the frequency distribution was proposed by Sichel (1975):

S = V(2,N)/V(N).

By means of this measure we take into account the number of dis legomena, the types that occur exactly twice in the text.

Finally, we included a measure that has been used with success to quantify vocabulary richness in various fields. Proposed by Brunet (1978), W is defined as:

W = N^{V(N)^{-a}}

where a is a parameter, usually fixed at 0.17, such that W is approximately constant and independent of N.
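The remaining three measures can be sketched in the same way. This is a minimal illustration, assuming the standard forms R = 100 log N / (1 - V(1,N)/V(N)) with a natural logarithm, S = V(2,N)/V(N), and W = N^(V(N)^(-a)). The function name `low_end_measures` and the toy sample are hypothetical.

```python
import math
from collections import Counter


def low_end_measures(tokens, a=0.17):
    """Return Honoré's R, Sichel's S and Brunet's W for one sample."""
    n = len(tokens)                           # N: number of tokens
    type_counts = Counter(tokens)
    v_n = len(type_counts)                    # V(N): number of types
    spectrum = Counter(type_counts.values())  # i -> V(i,N)
    v1 = spectrum.get(1, 0)                   # hapax legomena
    v2 = spectrum.get(2, 0)                   # dis legomena
    if v_n == 0 or v1 == v_n:
        # R divides by 1 - V(1,N)/V(N), which is zero if every type is a hapax.
        raise ValueError("R is undefined for this sample")
    r = 100 * math.log(n) / (1 - v1 / v_n)    # natural log is an assumption
    s = v2 / v_n
    w = n ** (v_n ** -a)
    return {"R": r, "S": s, "W": w}


# Toy sample (hypothetical): N = 10, V(N) = 7, V(1,N) = 5, V(2,N) = 1.
sample = "the cat sat on the mat and the dog sat".split()
measures = low_end_measures(sample)
print(measures)
```

Very short samples in which every type occurs once make R undefined, so the sketch raises an error in that case rather than returning a value.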

Values for R, D, S, K and W were calculated for each of the twenty samples of our experiment, for both words and rewrites. In this way we obtained two (20,5) data matrices. A Principal Components Analysis of the word matrix yielded a misclassification rate of 2/6 for the unlabeled samples and a misclassification rate of 1/14 for the labeled samples. Considerably improved results were obtained on the basis of the rewrite matrix. All unlabeled samples were correctly assigned to their authors, and except for one labeled sample, the samples by A were clearly distinguishable from those by B. We conclude that methods based on measures of vocabulary richness are more accurate when applied to rewrites than when applied to words.
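A Principal Components Analysis of a (20,5) matrix of this kind can be sketched with NumPy as follows. The abstract does not say how the analysis was configured, so standardizing each column (the five measures live on very different scales) and keeping two components are assumptions made for illustration. The input here is random placeholder data, not the experimental values.

```python
import numpy as np


def pca_scores(matrix, n_components=2):
    """Project samples onto the leading principal components.

    Columns are standardized first (an assumption), then the components
    are obtained from a singular value decomposition.
    """
    x = np.asarray(matrix, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    u, s, _vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    explained = s ** 2 / np.sum(s ** 2)               # variance share per PC
    return scores, explained[:n_components]


# Placeholder data only: 20 samples by 5 measures (R, D, S, K, W).
rng = np.random.default_rng(0)
placeholder = rng.normal(size=(20, 5))
scores, explained = pca_scores(placeholder)
print(scores.shape, explained)
```

Plotting the two score columns against each other gives the kind of scatter in which clusters of samples by A and B can be inspected. How a sample is then assigned to an author is a separate decision that this sketch does not cover.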

Following Burrows (1992), we also investigated the discriminatory potential of the 50 highest-frequency function words, and compared the result with an analysis based on the 50 most frequent rewrites. The two (20,50) data matrices were subjected to Principal Components Analysis. For the words, the labeled samples of A and B were well separated into two distinct clusters. Five of the six unlabeled samples appeared in the correct clusters. One unlabeled sample, however, appeared exactly halfway between the two clusters, so that the analysis could not decide between A and B. The rewrite-based analysis, by contrast, correctly separated all samples of A and B. The unlabeled sample that could not be assigned with confidence to A or B in the word-based analysis now clearly sided with the cluster of samples by B. Again we find that a rewrite-based analysis leads to an improved classification.
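One way to build a samples-by-elements matrix of this kind is sketched below. The function name `top_k_matrix`, the optional `allowed` filter (for example a function-word list) and the use of relative frequencies are assumptions for illustration; the abstract does not state how the counts were normalized. For the syntax-based variant, each token would be a rewrite rule label instead of a word.

```python
from collections import Counter


def top_k_matrix(samples, k=50, allowed=None):
    """Return the k most frequent elements and a relative-frequency matrix.

    samples: list of token lists (words or rewrite-rule labels).
    allowed: optional set restricting candidates, e.g. function words.
    """
    pooled = Counter()
    for tokens in samples:
        pooled.update(t for t in tokens if allowed is None or t in allowed)
    features = [item for item, _count in pooled.most_common(k)]

    rows = []
    for tokens in samples:
        counts = Counter(tokens)
        n = len(tokens)
        # One row per sample: frequency of each selected element per token.
        rows.append([counts[f] / n for f in features])
    return features, rows


# Toy input (hypothetical): two tiny samples, top 3 elements.
toy = [
    "the cat and the dog and the bird".split(),
    "the fish and a cat".split(),
]
features, rows = top_k_matrix(toy, k=3)
print(features, rows)
```

With twenty samples and k=50 this yields a (20,50) matrix that can be passed to a Principal Components Analysis.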

Finally, note that methods based on the 50 most frequent elements appear to have a higher discriminatory potential than methods based on statistics of vocabulary richness. In the word-based analyses, changing from the data on vocabulary richness to the data on the 50 most frequent function words reduced the misclassification rate from 3/20 (two unlabeled samples plus one labeled sample) to 1/20 (the one unlabeled sample that could not be assigned). Similarly, the misclassification rate dropped from 1/20 to 0/20 in the rewrite-based analyses. We conclude that optimal results may be expected for analyses based on the highest-frequency rewrite rules.

References

Brunet, E. (1978). Vocabulaire de Jean Giraudoux: Structure et Évolution. Slatkine.
Burrows, J. F. (1992). Computers and the Study of Literature. In: C. S. Butler (Ed.), Computers and Written Texts. Oxford: Blackwell, pp. 167-204.
Holmes, D. I. and Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10(2):111-127.
Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 172-177.
Juillard, M. (1990). Proper nouns as proper style markers of poetry and prose. Literary and Linguistic Computing, 5(1):1-8.
Sichel, H. S. (1975). On a Distribution Law for Word Frequencies. Journal of the American Statistical Association, 70:542-547.
Simpson, E. H. (1949). Measurement of Diversity. Nature, 163:168.
Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC
