Tuning the Word Frequency List

paper, specified "long paper"
  1. 1. David L. Hoover

    New York University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

One recent trend in computational stylistics and authorship attribution has been to “tune” the word list for better results. T-tests can identify words with statistically significant differences in frequency between two authors; they remain an excellent method for one-on-one problems (Burrows 1992; McKenna and Antonia 1996; Hoover 2010). Eder and Rybicki (2011) use elegant methods to identify “sweet spots” in the word frequency spectrum by testing a range of numbers of the most frequent words (MFW). They also progressively remove words from the top of the spectrum, eliminating the function words favored by so much previous work. Personal pronouns have been removed to minimize differences in point of view and the numbers of male and female characters (Hoover 2002). The word list has been “culled” by removing words that are frequent because of high frequencies in one text (Hoover 2003, 2004; Burrows 2005). Rybicki and Eder (2011), Jockers, Witten, and Criddle (2008), and Rybicki and Heydel (2013) have culled words absent from any or many texts. Burrows’s Zeta and Iota select words consistently used by one group and avoided by another, ignoring frequency. These tuning methods vary in effectiveness with the number, size, date, genre, and language of the texts, and with the number of authors involved.
I propose here a transparently motivated form of tuning with a strong a priori plausibility: selecting unevenly distributed words, using the coefficient of variation (CoV). This dispersion measure is defined as Stdev/Ave. Freq.*100, expressed as a percentage of the average frequency. Consider the following four words, with statistics from 21 novels and short fiction by Willa Cather and Edith Wharton:
Word Ave. Stdev CoV # Texts Rank
magnificent 0.0034 0.0064 192% 5 3374
longing 0.0034 0.0107 316% 5 2427
generally 0.0036 0.0059 163% 7 2994
father 0.0368 0.0600 163% 14 332
Magnificent and longing both appear in 5 texts with similar average frequencies but very different Stdev, CoV, and rank: longing varies much more in frequency than does magnificent. Generally and father show that the Stdev is too closely tied to the average frequency to measure dispersion here: low-frequency words tend to have small Stdev’s. For the 21 texts above, among the 17,000 word types, only 1 of the 100 largest Stdev’s comes from a word ranking 1,000 or higher. The frequencies and Stdev’s of father are about 10 times those of generally, so both have the same CoV, which is thus a fairer measure of variability than the Stdev.
Unfortunately, simply analyzing words with the highest CoV is unworkable. The Stdev tends to be lower for less frequent words, but rare words and those occurring in a small number of texts have very large CoV’s. For the 21 texts above, 9,000 of the 17,000 word types appear in only one text and share the largest CoV (447%), regardless of their frequency. Some character names in long texts rank as low as 63rd, but 7,200 are hapax legomena. Furthermore, only 300 of the words with the 12,000 largest CoV’s appear in more than two texts. The need to identify words used fairly frequently and in many texts but with widely varying frequencies suggests combining Rybicki-style culling with the CoV in a method I call CoV Tuning. Now, some experiments. First, consider the Cather/Wharton set above, containing 3 Wharton novellas and 7 stories and 11 Cather stories. Standard cluster analysis (Ward linkage, squared Euclidean distance, standardized variables) correctly groups all texts in analyses based on 990 and 900MFW and fail at 400MFW only for Wharton’s “The Hermit and the Wild Woman.” (The seemingly peculiar choice of 990MFW arises from a limitation in my statistics program, Minitab.) All other analyses based on the 100-800MFW show at least 3 errors, including that same story.
I tuned the word list by sorting the 1,500MFW on the CoV. The words with largest CoV’s, mainly character names, appear in only one text, so I re-sorted the words on the number of texts they appear in, retained only words appearing in 7 or more texts, then re-sorted on the CoV and retained the 1,000 most variable words (MVW). Cluster analysis using CoV Tuning is very effective: all 7 analyses based on the 400-990MVW correctly group all the texts. Retaining words appearing in at least 5, 6, or 8 texts is slightly less effective. T-testing the 1,500MFW and retaining only the 352 words with p < .05 gives perfect groupings using all 352 words and decreasing numbers of them, down to the two words with the highest T value, the classic authorship pair, till and until. Thus, for two-author problems, the t-test is clearly superior. Figures 1-2 show standard and CoV Tuned analyses.

Fig. 1: Standard, 800MFW

Fig. 2: CoV Tuned, 800MVW
Now consider some difficult multi-author problems for which t-test tuning is unavailable. First I tested a set of 43 late 19th and early 20th century novels by 15 American authors, 2-3 novels each. Errors for at least 3 authors occur in all analyses using the standard method. Using CoV Tuning retaining only words found in 34 or more texts, analyses of the 900 and 700MVW have just 1 error, and analyses of the 500, 600, 800, and 990MVW have 2. Minimums of 22 and 38 texts give weaker results. Figures 3-4 show standard and CoV Tuned analyses, respectively.

Fig. 3: Standard, 700MFW

Fig. 4: CoV Tuned, 700MVW
Turning to a different genre and shorter texts, I tested 25 pieces of literary criticism by 14 authors, 9 authors with 2 texts each, 1 with 3 texts, and 4 with 1 text (see Hoover 2001 for details). I made this problem more difficult by dividing the texts into 33 sections of about 4,000 words. The standard method works well here, correctly grouping all sections by 12 of the 14 authors in analyses of the 600, 700, 800, and 990MFW and 13 authors at 900MFW. I used CoV Tuning on this word list using whole texts, retained words found in 7 or more texts, and then tested the 4,000-word sections. The improvement over the standard method is less dramatic here: the 600MVW succeeded for just 10 authors, the 700 for 12 authors, and the 800, 900, and 990 for 13. Minimums of 5, 6, or 8 texts are less effective.
Finally, I assembled a set of 40 late 19th and early 20th century novels by 20 authors, 2 novels each, intentionally selecting authors and texts known to be difficult to attribute. Standard cluster analysis correctly groups the texts by only 12 of the 20 authors, from 990 to 600MFW. I used CoV tuning on the word list as above, retaining only words found in at least 28 texts. Here, CoV Tuning matches the results for the standard method for the 700 and 800MVW, but correctly groups 13 authors for the 600MVW and 14 for the 900 and 990MVW. A minimum of 33 texts is less effective. Figures 5 and 6 show standard and CoV Tuned analyses.

Fig. 5: Standard, 700MFW

Fig. 6: CoV Tuned, 700MVW
More testing will be required to determine whether CoV Tuning gives consistently superior results on most sets of texts, and an automatic method for selecting the optimum limit to set on the minimum number of texts in each analysis is needed. However, CoV Tuning already seems to be a potentially valuable method of combining the information about word frequency that has long been the focus of authorship attribution and computational stylistics with the information about consistency of use on which recent methods like Zeta, Iota, and Full Spectrum analysis (Hoover 2013) are based.

Burrows, J. F. (1992). “Not Unless You Ask Nicely: The Interpretative Nexus Between Analysis and Information." LLC 7 (2): 91-109.Burrows, J. F. (2005). “Who Wrote Shamela? Verifying the Authorship of a Parodic Text.” LLC 20 (4): 437-450.
Hoover, D. L. (2003). ”Statistical Stylistics and Authorship Attribution: An Empirical Investigation.” Literary and Linguistic Computing 16(4) 2001: 421-444.Hoover, D. L. (2003). “Multivariate Analysis and the Study of Style Variation.” LLC 18(4), 2003: 341-60.
Hoover, D. L. (2004). ”Testing Burrows’s Delta.” LLC 19 (4): 453-475.
Hoover, D. L. (2010). “Authorial Style,” in Dan McIntyre and Beatrix Busse (eds), Language and Style: Essays in Honour of Mick Short, New York: Palgrave: 250-71.
Hoover, D. L. (2013). “The Full-Spectrum Text-Analysis Spreadsheet,” in Digital Humanities 2013, Lincoln, NE: Center for Digital Research in the Humanities, University of Nebraska: 226-29.
Jockers, M. L., D. M. Witten, and C. S. Criddle. (2008). “Reassessing Authorship of the Book of Mormon Using Delta and Nearest Shrunken Centroid Classification,” LLC 23 (4): 465-491.
McKenna, C. W. F., and A. Antonia. (1996). “‘A few simple words’ of Interior Monologue in Ulysses: Reconfiguring the Evidence.” LLC 11 (2): 55-66.
Rybicki, J., and M. Eder. (2011). “Deeper Delta Across Genres and Languages: Do We Really Need the Most Frequent Words?” LLC 26: 315-21.
Rybicki, J., and M. Heydel. (2013). “The Stylistics and Stylometry of Collaborative Translation: Woolf’s Night and Day in Polish,” LLC, first published online May 27, 2013.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO