Testing Burrows's 'Delta'

paper
Authorship
  1. David Hoover

    New York University

In his Busa Award presentation and two recent articles, John F. Burrows has presented a new measure of stylistic difference that seems quite promising in authorship attribution, and possibly also in stylistic studies (2001, 2002, 2003). This unitary measure, which he calls 'Delta,' is based, like many other techniques, on differences in the frequencies of the most frequent words in a group of texts, in this case "verse by twenty-five poets of the English Restoration period" (2003: 10). Burrows uses the frequencies of the 150 most frequent words of the entire set of texts in his exposition of the method, comparing the mean frequency of each word in the whole set with its frequency in the texts of each of the authors sequentially. His technique uses z-scores, which are calculated by subtracting the mean frequency of each word in the entire set from its frequency in the test text, and dividing this difference by the standard deviation of the word in the entire set. This method allows all of the 150 words in his set to have equal weight, in spite of their very different frequencies. Burrows then eliminates the sign of the difference (which indicates whether the word is less frequent or more frequent than the mean for the whole set), adds the z-scores together, and calculates the mean. This mean is Delta, "the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text" (2002: 271).
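
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not Burrows's own implementation: the function name `delta`, the dictionary layout (word to relative frequency), the toy figures, and the use of the sample standard deviation are all assumptions made for the example. It follows the quoted definition, taking the mean of the absolute differences between the z-scores of a test text and those of one candidate author.

```python
from statistics import mean, stdev

def delta(test, candidate, main_set, words):
    """Mean absolute difference between the z-scores of two frequency profiles.

    test, candidate: dicts mapping word -> relative frequency (hypothetical layout).
    main_set: list of such dicts, one per text in the reference set.
    words: the word list over which Delta is calculated.
    """
    total, used = 0.0, 0
    for w in words:
        values = [text.get(w, 0.0) for text in main_set]
        mu = mean(values)
        sigma = stdev(values)  # assumption: sample standard deviation
        if sigma == 0:
            continue  # a word with no variation cannot be standardized
        z_test = (test.get(w, 0.0) - mu) / sigma
        z_cand = (candidate.get(w, 0.0) - mu) / sigma
        total += abs(z_test - z_cand)
        used += 1
    return total / used if used else float("nan")

# Toy frequencies (percent of all tokens); invented for illustration only.
words = ["the", "and", "of"]
main_set = [
    {"the": 6.0, "and": 3.0, "of": 2.0},
    {"the": 5.0, "and": 4.0, "of": 3.0},
    {"the": 4.0, "and": 2.0, "of": 4.0},
]
test = {"the": 5.9, "and": 3.1, "of": 2.1}

for i, candidate in enumerate(main_set):
    print(i, round(delta(test, candidate, main_set, words), 3))
```

With these toy figures the test text produces a much lower Delta against the first text than against the third, so the first would be the suggested author.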

Delta proves remarkably effective in identifying authors in a difficult "open" test. When texts of sixteen authors who are included in the original set and sixteen authors who are not are tested with Delta to determine likely authorship, "Of thirty-two long poems, . . . fifteen are correctly identified and another fifteen yield scores that correctly place them outside the main set" (Burrows, 2003: 15). Although this result is not completely accurate, it is very encouraging, and it suggests that Delta may be a very useful tool in the early stages of an authorship study in which there are large numbers of possible authors.

Because of its great potential, Delta deserves further investigation, and I have begun a series of tests on novels published about 1900, to see whether Burrows's success with poetry can be duplicated with prose. The choice of 1900 as a central date ensures a large selection of texts for testing without scanning: many such novels, now out of copyright, are already available as e-texts. After cleaning up the e-texts to correct for problems of hyphenation, apostrophes, and single quotation marks, I have extracted samples of approximately 25,000 words of pure authorial narration from more than 100 novels. Using a combination of text-analysis tools and custom programming, I have collected the most frequent words from each of the samples for comparison.
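
As a rough illustration of the word-collection step, the following Python sketch counts the most frequent words of one cleaned sample. It is not the tool chain actually used here: the tokenizing rule (runs of letters, with apostrophes kept only when they fall between letters), the function name `relative_frequencies`, and the sample sentence are assumptions made for the example. Under this rule a single quotation mark around a word is dropped while the apostrophe in a contraction is kept, and a hyphenated compound is split into its parts.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by internal apostrophes.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def relative_frequencies(text, top_n=800):
    """Return the top_n words of a sample as percentages of all its tokens."""
    # Normalize curly apostrophes so "don’t" and "don't" are counted together.
    text = text.lower().replace("\u2019", "'")
    tokens = TOKEN.findall(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 100.0 * c / total for w, c in counts.most_common(top_n)}

sample = "Don't go. 'Stay,' she said; don\u2019t go!"
print(relative_frequencies(sample, top_n=3))
```

A rule this simple loses the apostrophe in forms such as 'tis, so a real study would need a more careful treatment of the problems named above.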

Burrows's analysis (2002) shows that the accuracy of Delta decreases as he reduces the frequency list from the 150 to the 40 most frequent words, and Hoover (2002, 2003) has shown that cluster analyses based on as many as the 800 most frequent words are often more accurate than those based on the traditionally used smaller lists. Therefore, I have collected the 800 most frequent words of each text, and have set up a complex Excel spreadsheet that accepts as input sets of columns of the most frequent words from up to 80 primary and 80 secondary texts for analysis. A macro calculates Delta for each of the primary texts and then for each of the secondary texts in turn, beginning with the 800 most frequent words, and continuing with the 700, 600, 500, 400, 300, 200, 150, 100, 70, 50, 30, and 20 most frequent. Another macro then moves through the results, calculating the rank of the actual author for texts by authors in the main set and extracting the information in a format suited to easy graphing. This automation assures the accuracy of the analysis, saves countless hours of tedious and painstaking drudgery, and allows for the testing of many differently selected sets of frequent words. For example, I have tested the following eight different kinds of sets:
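
The loop that the macros perform might look roughly like the Python sketch below. It is not a translation of the spreadsheet: the function names, the data layout (author to word-frequency dictionary), the toy figures, and the decision to rank words by their mean frequency across the primary set are all assumptions. For each list size it calculates Delta between a secondary text and every primary author, then reports the rank of the actual author (1 means a correct attribution).

```python
from statistics import mean, stdev

SIZES = [800, 700, 600, 500, 400, 300, 200, 150, 100, 70, 50, 30, 20]

def delta(test, candidate, texts, words):
    """Mean absolute z-score difference over the given words."""
    diffs = []
    for w in words:
        sigma = stdev([t.get(w, 0.0) for t in texts])  # assumption: sample SD
        if sigma > 0:
            # The set mean cancels out: |z_test - z_cand| = |f_test - f_cand| / sigma
            diffs.append(abs(test.get(w, 0.0) - candidate.get(w, 0.0)) / sigma)
    return mean(diffs) if diffs else float("nan")

def rank_of_author(test, true_author, primary, words):
    """Rank (1 = lowest Delta) of the actual author among the primary authors."""
    texts = list(primary.values())
    scores = {a: delta(test, f, texts, words) for a, f in primary.items()}
    return sorted(scores, key=scores.get).index(true_author) + 1

def ranks_by_size(test, true_author, primary, sizes=SIZES):
    """Repeat the ranking for each word-list size, largest first."""
    vocab = set().union(*primary.values())
    def mean_freq(w):
        return mean([f.get(w, 0.0) for f in primary.values()])
    # Most frequent words first; ties broken alphabetically for repeatability.
    ranked = sorted(vocab, key=lambda w: (-mean_freq(w), w))
    return {n: rank_of_author(test, true_author, primary, ranked[:n]) for n in sizes}

# Toy data, invented for illustration: three primary authors, one secondary text.
primary = {
    "A": {"the": 6.0, "and": 3.0, "of": 2.0},
    "B": {"the": 5.0, "and": 4.0, "of": 3.0},
    "C": {"the": 4.0, "and": 2.0, "of": 4.0},
}
secondary_text = {"the": 5.9, "and": 3.1, "of": 2.1}
print(ranks_by_size(secondary_text, "A", primary, sizes=[3, 2]))
```

Collecting the ranks in one dictionary per text keeps the results in a shape that is easy to tabulate or graph across list sizes.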

1. The most frequent words.
2. Contractions removed.
3. Personal pronouns removed.
4. Contractions and personal pronouns removed.
5. Words for which a single text supplies more than 70% of the occurrences removed.
6. Contractions and words for which a single text supplies more than 70% of the occurrences removed.
7. Personal pronouns and words for which a single text supplies more than 70% of the occurrences removed.
8. Contractions, personal pronouns, and words for which a single text supplies more than 70% of the occurrences removed.
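
The eight kinds of sets listed above are the combinations of three independent filters, which can be sketched in Python as follows. The details are assumptions rather than the procedure actually used: a contraction is taken to be any word containing an apostrophe, the pronoun list is one plausible inventory, the function name `cull` is hypothetical, and the counts are invented.

```python
from itertools import product

# Assumption: one plausible inventory of personal pronouns.
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves",
    "you", "your", "yours", "yourself", "yourselves", "he", "him", "his",
    "himself", "she", "her", "hers", "herself", "it", "its", "itself",
    "they", "them", "their", "theirs", "themselves",
}

def cull(words, counts_by_text, drop_contractions=False,
         drop_pronouns=False, max_share=None):
    """Filter a word list, keeping its original order.

    counts_by_text: list of dicts mapping word -> raw count, one per text.
    max_share: if set (e.g. 0.70), drop words for which a single text
    supplies more than that share of all occurrences.
    """
    kept = []
    for w in words:
        if drop_contractions and "'" in w:  # assumption: apostrophe marks a contraction
            continue
        if drop_pronouns and w in PERSONAL_PRONOUNS:
            continue
        if max_share is not None:
            per_text = [c.get(w, 0) for c in counts_by_text]
            total = sum(per_text)
            if total and max(per_text) / total > max_share:
                continue
        kept.append(w)
    return kept

# Invented counts for two texts; "rowland" stands in for a character name.
words = ["the", "don't", "she", "rowland"]
counts_by_text = [
    {"the": 100, "don't": 5, "she": 30, "rowland": 40},
    {"the": 90, "don't": 7, "she": 20, "rowland": 2},
]

# The eight sets: every on/off combination of the three filters.
for share, pronouns, contractions in product([None, 0.70], [False, True], [False, True]):
    print(contractions, pronouns, share,
          cull(words, counts_by_text, contractions, pronouns, share))
```

The 70% filter mainly catches words, such as character names, that are frequent in one novel and rare everywhere else.
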
Research is still in progress, but preliminary results are available for a primary set of 22 American authors (represented by one third-person novel each) that have been tested against 20 third-person American novels by authors in the primary set. As Figure 1 shows, Delta is quite effective, in two cases attributing 19 of the 20 novels to the correct author. (The 20th, which surprisingly ranks 5th, is one of two novels by Ellen Glasgow in the secondary set. The other is consistently identified correctly. This case invites further investigation, because the novel that is not correctly identified was written about 10 years later than the other two, and is sometimes considered to be Glasgow's first novel in her mature style.) The preliminary research also shows that removing contractions reduces the accuracy of the analyses, but removing personal pronouns improves it (for similar results, see Hoover, 2002). Finally, removing words for which a single text supplies more than 70% of the occurrences improves the accuracy quite dramatically, producing the only results with 19 of the 20 texts correctly attributed.

Although Burrows showed that texts by authors from the main set usually produce Deltas significantly lower than those by authors from outside the main set, as one would expect, my initial tests on prose show somewhat weaker results: the Deltas of texts by authors from the main set are generally lower, but some are higher than those of some texts by authors from outside the main set. For this set of prose texts, Delta is more effective in suggesting correct authors than in eliminating incorrect potential claimants. More testing is necessary to determine whether the use of a single novel for each author in the main set yields less accurate results than those Burrows achieved when most of his authors were represented by a group of poems, and whether work on attribution should be limited to texts with the same point of view or nationality. Nevertheless, Burrows's Delta seems poised to become a valuable addition to the authorship attribution toolbox.

Bibliography

1. Burrows, J. F. 2001. Questions of Authorship: Attribution and Beyond. Association for Computers and the Humanities and Association for Literary and Linguistic Computing, Joint International Conference, New York, June 14, 2001.
2. Burrows, J. F. 2002. 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing 17(3): 267-287.
3. Burrows, J. F. 2003. Questions of Authorship: Attribution and Beyond. Computers and the Humanities 37(1): 5-32.
4. Hoover, D. L. 2002. Frequent Word Sequences and Statistical Stylistics. Literary and Linguistic Computing 17(2): 157-80.
5. Hoover, D. L. 2003. Frequent Collocations and Authorial Style. Forthcoming, Literary and Linguistic Computing 18(3): 157-80.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
