Vocabulary Richness and Authorship Reconsidered

  David L. Hoover

    New York University


There has been considerable activity recently in applying statistical techniques to literary texts (e.g., Baayen, 1993, 1996; Burrows, 1987, 1992; Burrows and Craig, 1994; Craig, 1999a, 1999b; Holmes, 1994; Holmes and Forsyth, 1995; Tweedie and Baayen, 1998; Tweedie, Holmes, and Corns, 1998). While my interest is primarily stylistic analysis, I will focus here on the usefulness of vocabulary richness for authorship attribution, because vocabulary richness measures can usefully characterize authorial style only if they can reliably distinguish authors and texts. I will argue here that, for both theoretical and practical reasons, vocabulary richness measures are much less reliable than has been thought, and of only marginal value in stylistic and authorship studies.

Although a single vocabulary richness measure that can characterize an author or text is an attractive idea (see Yule, 1944), readers' perceptions of vocabulary richness are not necessarily accurate. Consider the first 50,000 words of Light in August; The Ambassadors; Sons and Lovers; The Picture of Dorian Gray; The Awakening; The Sea Wolf; The Red Badge of Courage; and Main Street. Few readers will recognize that the texts are listed in order of increasing vocabulary size, from 4400 to 8300 different words. Furthermore, peculiar effects arise from the way some measures are calculated. Thoiron (1986) examines Diversity and entropy, showing that they react surprisingly to textual alteration or doubling, and further tests show similar oddities for other measures.

Tweedie and Baayen (1998) present a thorough examination of 17 proposed constants, using 16 texts by 8 authors (Baum, The Wonderful Wizard of Oz and The Marvelous Wizard of Oz; Brontë, Wuthering Heights; Carroll, Alice's Adventures in Wonderland and Through the Looking-Glass; Doyle, The Sign of Four, The Hound of the Baskervilles, and The Valley of Fear; James, Confidence and The Europeans; St. Luke, The Gospel According to St. Luke (KJV) and The Acts of the Apostles (KJV); London, The Sea Wolf and The Call of the Wild; Wells, The War of the Worlds and The Invisible Man). They show that some constants are not even theoretically constant and that others are not empirically constant (323-34). They also note that discourse structure violates the randomness assumption underlying many discussions of vocabulary richness (333-34), and they perform sophisticated randomization experiments to uncover the behavior of constants throughout texts. They conclude that the constants capture some "aspects of authorial structure," but do not correctly group all of the texts by each author nor correctly separate all texts by different authors (345-48), and show that principal component analysis of the 100 most frequent function words produces more accurate results.

To extend the study of vocabulary richness measures in authorship studies, I begin with 24,000-word sections of the Tweedie and Baayen texts and attempt to duplicate their results with more accessible techniques. Even using shorter sections, my cluster analysis based on types (or Herdan's K) and the frequency of the most frequent word (FMFW) is as accurate as theirs. A cluster analysis of the 16 complete texts based on Herdan's K and FMFW is also as accurate as their results for the full trajectories of all seventeen constants, or final values of Z and K, and only slightly less accurate than those for the full trajectories for Z and K (348). These results suggest that vocabulary richness may be useful in studies of authorship and style. Further analysis, however, shows that the texts by Doyle, James, and St. Luke cluster even on the basis of the frequencies of initial letters of words, and an analysis based on word tokens and the repeat rate of the most frequent word correctly groups all texts except those by London. This analysis uses the lengths of the texts as a variable, even though the main purpose of the "constants" was to eliminate the effect of text-size. Yet it clusters the texts more accurately than the best vocabulary richness tests of Tweedie and Baayen, and produces results as good as and very similar to their results using principal component analysis of the 100 most frequent words (347).
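As an illustration only, the basic counts behind measures of this kind can be sketched in Python. The tokenizer, the function name richness_profile, and the dictionary keys are assumptions made for this sketch, not the procedure used in the study. The K computed here is Yule's characteristic in its usual textbook form; whether it matches the K reported above in every detail is also an assumption.

```python
import re
from collections import Counter


def richness_profile(text):
    """Return basic vocabulary figures for one text (illustrative sketch)."""
    # Crude tokenizer (an assumption): lowercase letter runs, with an
    # optional internal apostrophe so that "don't" stays one token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")

    freqs = Counter(tokens)
    # Frequency spectrum: spectrum[i] = number of types occurring exactly i times.
    spectrum = Counter(freqs.values())
    most_frequent_word, fmfw = freqs.most_common(1)[0]

    # Yule's K in its usual form: 10^4 * (sum(i^2 * V_i) - N) / N^2.
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)

    return {
        "tokens": n,                      # text length N
        "types": len(freqs),              # vocabulary size V
        "hapax_legomena": spectrum[1],    # types occurring exactly once
        "most_frequent_word": most_frequent_word,
        "fmfw": fmfw,                     # frequency of the most frequent word
        "fmfw_relative": fmfw / n,        # the same count as a share of N
        "yule_k": yule_k,
    }
```

Applied to equal-length sections, the types and fmfw entries correspond to the two kinds of variable discussed above. Counts such as types are only comparable between texts of the same length.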

To investigate these strange results, I extend the analysis by adding further texts. Vocabulary richness measures are not very effective in grouping 47 texts by 35 authors or 55 texts by 36 authors: fewer than half of the texts by authors with multiple texts cluster correctly. One final expansion of the number of texts under analysis points toward an explanation. I divide my 55 texts into as many 24,000-word sections as possible for analysis, on the assumption that sections of one text should cluster better than different texts by the same author. Yet the most accurate analysis I have been able to produce (using W, H, K, Skewness, word length, hapax legomena, and FMFW) still fails in quite a spectacular way, correctly clustering all texts of only two of the 16 authors with multiple texts (Carroll and St. Luke). It also correctly clusters single texts of Anderson and Crane, all sections of some single texts by authors with multiple texts (The Inheritors, The War of the Worlds, The Jungle Book, and To the Lighthouse), and correctly identifies two single-section texts as clusters (The Marvelous Wizard of Oz, The Wonderful Wizard of Oz). Two or more sections of text(s) by the same author also sometimes cluster together without any "foreign" texts. Vocabulary richness measures clearly capture some aspects of authorial style, but they are inadequate for the correct identification of large numbers of texts by many authors.
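The sectioning step can be sketched as follows. This is a minimal illustration, assuming that a text is already tokenized and that any leftover tail shorter than one full section is simply discarded. The function names and the treatment of the tail are assumptions, not details given above.

```python
def fixed_length_sections(tokens, size=24000):
    """Split a token list into as many full sections of `size` tokens as possible."""
    if size <= 0:
        raise ValueError("size must be positive")
    full_sections = len(tokens) // size
    # Assumption: the final partial section (fewer than `size` tokens) is dropped.
    return [tokens[i * size:(i + 1) * size] for i in range(full_sections)]


def types_per_section(tokens, size=24000):
    """Count distinct word forms (types) in each full section."""
    return [len(set(section)) for section in fixed_length_sections(tokens, size)]
```

Comparing the values returned by types_per_section for one text gives a direct view of how much vocabulary size varies within that text.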

Considering the tremendous variation in vocabulary richness among texts and authors, it seems almost inevitable that vocabulary richness measures would fail to distinguish those texts and authors. For example, sections of A Portrait of the Artist rank from 49th to 173rd in vocabulary richness among the 188 sections, and Kipling's The Jungle Book ranks as low as 12th and his Kim as high as 161st. More concretely, the vocabularies of sections of Kipling's novels range from 2935 to 4450 words, while the vocabularies of eleven different texts by eleven different authors (occupying ranks 99-109) range only from 3876 to 3945 words. Clearly, as more texts are compared, a point is reached where no further distinctive values for the vocabulary richness measures are possible. On a practical level, texts like those analyzed above show that it would be unwise to place too much confidence in a set of texts by a single authorship claimant that display consistent figures for vocabulary richness: a disputed text with very different vocabulary richness cannot reliably be assumed to belong to a different author.

Two final cluster analyses dramatically illustrate the dangers of using vocabulary richness measures to group and distinguish texts: Figure 1 shows fourteen texts by seven authors that cluster perfectly, while Figure 2 shows sixteen texts by eight authors that show no correct clusters at all. The chief determinant of the accuracy of clustering is obviously the particular texts being analyzed. The texts Tweedie and Baayen use cluster well because they happen to be more similar to each other in vocabulary richness than they are to texts by other authors. An analysis of the texts in Figure 2 would have led to radically different results. In retrospect, this is hardly surprising: the variety of their texts (from Early Modern English religious texts to children's literature to detective fiction to science fiction) is so great that a perceptive reader of the texts should be able to correctly identify any 50-word passage from any of the texts simply by reading it.
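What it means for an author's texts to "cluster correctly" can be made concrete with a small sketch. It assumes each text is described by two numeric variables, standardizes them, builds an average-linkage agglomerative clustering, and then asks whether some node of the resulting tree contains exactly one author's texts and nothing else. The linkage method, the standardization, and all function names are assumptions for illustration; the clustering settings behind Figures 1 and 2 are not specified here.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each variable so that neither dominates the distances."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    spreads = [pstdev(c) or 1.0 for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, spreads)) for row in rows]


def dendrogram_leaf_sets(points):
    """Average-linkage agglomerative clustering; return the leaf set of every node."""
    clusters = [frozenset([i]) for i in range(len(points))]
    nodes = list(clusters)

    def average_distance(a, b):
        return mean(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge it.
        a, b = min(
            ((x, y) for k, x in enumerate(clusters) for y in clusters[k + 1:]),
            key=lambda pair: average_distance(*pair),
        )
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        nodes.append(merged)
    return nodes


def authors_clustered_correctly(points, authors):
    """Map each author to True if their texts alone form one node of the tree."""
    nodes = set(dendrogram_leaf_sets(zscore_columns(points)))
    result = {}
    for author in set(authors):
        members = frozenset(i for i, a in enumerate(authors) if a == author)
        # Note: an author with a single text trivially counts as correct.
        result[author] = members in nodes
    return result
```

With invented values, two authors whose texts lie close together in both variables are each reported as correctly clustered, while two authors whose texts interleave are not. The outcome depends entirely on which texts are supplied, which is the point made above.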

Despite the attractiveness of vocabulary richness measures, and despite the fact that they are sometimes effective in clustering and discriminating texts, such measures cannot provide a consistent, reliable, or satisfactory means of characterizing an author or style. There is so much intratextual and intertextual variation among texts and authors that such measures should be used with great caution, and treated merely as preliminary indications of authorship, as rough suggestions about the style of a text or author, as characterizations of texts at the extremes, or as indicators of texts or sections of texts that may be fruitfully analyzed by more robust methods (see Hoover, 1999: 79-113).

Figure 1. Cluster Analysis of Fourteen Texts by Seven Authors Based on Herdan's Characteristic and the Repeat Rate of the Most Frequent Word: Best Case Scenario

Text Key: Baum, 1-2; Doyle, 3-4 (Hound, Return); Conrad, 5-6; Forster, 7-8; James, 9-10 (Confidence, Europeans); Kipling, 11-12; St. Luke, 13-14

Figure 2. Cluster Analysis of Sixteen Texts by Eight Authors Based on Herdan's Characteristic and the Repeat Rate of the Most Frequent Word: Worst Case Scenario

Text Key: Carroll, 1-2; Cather, 3-4; Doyle, 5-6 (Sign, Valley); Golding, 7-8; Hardy, 9-10; London, 11-12; Wells, 13-14; Woolf, 15-16


Conference Info

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001


Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC