Short substrings as document discriminators: An empirical study

  1. 1. Richard Sandes Forsyth

    University of the West of England

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Short substrings as document discriminators: An empirical study
Richard S. Forsyth
University of the West of England
Keywords: Monte-Carlo methods, stylometry, text categorization

1. Introduction
In seeking to capture consistent and distinctive features of linguistic style, stylometrists have used a variety of textual indicators (see: Holmes, 1994). In the majority of stylometric studies, however, the choice of which indicators to use in a given problem is left to the discretion of the investigator. This practice permits the exercise of human judgement, and thus can sometimes save a time-consuming search for suitable descriptors; but it inevitably involves subjectivity. Very often the choice of suitable linguistic markers is crucial to the development of an effective discriminant rule; but, being subjective, it may not be replicable on another problem.
The situation is similar in the related fields of multivariate pattern recognition and machine learning (Everitt & Dunn, 1991; Quinlan, 1993): most studies begin by presuming that a suitable set of attributes or features has already been found. In text analysis this presumption is more than usually questionable.

For these and other reasons, a number of studies have appeared recently (e.g. Burrows, 1992; Binongo, 1994; Burrows & Craig, 1994; Kjell, 1994; Ledger & Merriam, 1994; Bayer et al., 1996) in which the features used as indicators are not imposed by the prior judgement of the analyst but are -- to a large extent -- determined by the texts being analyzed. The main aim of the present paper is to advance this trend.

2. Method of Classification Used
As studying the strengths and weaknesses of different classification techniques was not the main focus of this investigation, a simple nearest-centroid classifier was used for the tests reported in this paper. This algorithm, as implemented here, uses a training set of examples with known class membership to compute a centroid (multi-dimensional average) for each category. Then on a fresh or unseen case a measure of distance from each class centroid is computed and the current case is assigned to the category of the centroid which it most resembles. This method is described as "simultaneous key identification" by Sneath & Sokal (1973). It belongs to a class of similarity-based classifiers which McKenzie & Forsyth (1995), among others, have shown to be generally robust in practice.
3. The Bristol Benchmark Suite
In many areas of computing, benchmarking is a routine practice. For instance, when compilers are tested for compliance to a programming-language specification, it is normal to apply them to a suite of benchmark cases and record any divergence from expected behaviour.
Although benchmarking does have drawbacks as well as advantages, it does help in setting objective standards. For example, it is arguable that in the field of forecasting, the work of Makridakis and colleagues (e.g. Makridakis & Wheelwright, 1989), who tested a number of forecasting methods on a wide range of (mostly economic) time series, transformed the field -- leading to both methodological and practical advances.

Even though billion-byte public-domain archives of text exist, e.g. Project Gutenberg and the Oxford Text Archive, stylometry currently lacks an equivalent set of accepted test problems. Therefore we at Bristol have compiled a textual benchmark suite to meet this need. The current version of this suite is known as Tbench96. Despite its deficiencies, it does present a broader variety of test problems than other workers in stylometry and allied fields have previously used.

4. Data-Driven Feature-Finding
The main focus of this investigation is a comparative examination of a number of methods of data-driven feature-finding. Only methods that do not require the exercise of judgemental expertise on the part of an investigator were considered. Thus the feature types examined below share the following desirable properties: (1) they are easy to compute; (2) they are interlingual, i.e. they are not limited to the English language.
Perhaps the simplest possible method of obtaining a set of textual features is simply to treat each letter of the alphabet as a feature, i.e. to count the frequency of each of the 26 letters in each kilobyte line. This requires no background knowledge except that the texts concerned are encoded using the Roman alphabet. At first glance this would seem not just simple, but simplistic. However, some previous studies -- most notably Ledger (1989) and Ledger & Merriam (1994) -- have reported surprisingly good results when using letter-counts as stylistic indicators. At the very least, this establishes a baseline level of performance: more sophisticated features sets need to outperform letter counting to justify their added complexity.

To avoid restricting other marker sets to an unrealistic size, the baseline technique included here is actually that of Ule (1982) rather than that of Ledger & Merriam (1994). Ule counted not just the frequencies of the 26 latin letters A to Z (ignoring the distinction between upper and lower case) but also the ten arabic numerals 0 to 9.

A natural progression from counting the 36 pre-specified alphanumeric characters in each text block is simply to count the frequencies, in each block, of the 36 commonest characters in the training text as a whole. This allows spaces and other punctuation symbols to play a role as discriminators. The danger here is that such characters may be less intrinsic to the original texts than letters. For instance, Shakespeare's punctuation is mostly the work of subsequent editors. Nevertheless, it was decided to investigate the effect of moving from letter-counting to character counting. (In the two cases in TBench96 where punctuation is not original, punctuation marks have been removed from the files held on disk.)

If counts of single characters are useful as features then so might counts of digrams be useful, or of higher-order n-grams. Kjell (1994) reported good results in assigning Federalist essays written either by Hamilton or Madison to the correct authors using a neural-network classifier to which letter-pair frequencies were given as input features. This approach was extended to characters (not just letters) and generalized to include comparison of trigrams and tetragrams as well.

Another type of textual feature has been used extensively by Burrows (1992) as well as Binongo (1994), among others. Essentially it involves finding the most frequently used words and treating the rate of usage of each such word in a given text as a quantitative attribute. The exact number of common words used varies by author and application. Burrows and colleagues (Burrows, 1992; Burrows & Craig, 1994) discuss examples using anywhere from the 50 most common to the 100 most common words. Binongo (1994) uses the commonest 36 words (after excluding pronouns). Greenwood (1995) uses the commonest 32 (in New Testament Greek). In the present study, the most frequent 36 words in the combined training samples of each problem were used.

In addition, two novel approaches to textual feature-finding were tested in the present study. These are described in sections 5 and 6.

5. Progressive Pairwise Chunking
The first of these derives what are called DOUBLETS in Table 1 from text samples, by adapting a method proposed independently by Wolff (1975) and Dawkins (1976) in different contexts.
This algorithm scans byte-encoded text sequences, seeking the most common pair of symbols. At the end of each scan it replaces all occurrences of that pair by a newly allocated code. This process is repeated for the next most common pair, and so on till the requested number of pairings have been made. The program used here assumes that ASCII codes from 128 onwards are free for reassignment (as is the case with Tbench96) so byte codes from 128 upwards are allocated sequentially. As output the program produces a list of doublets. These need not be digrams, since previously joined doublets can be linked in later passes -- thus building up quite long chains, if they occur in the data, without demanding great computational resources.

The program thus gets away from the artificiality of using fixed- length substrings such as digrams, trigrams or tetragrams. It is able to find markers that are shorter than words (e.g. an affix such as `ed ') or longer (e.g. a collocation such as `of the').

6. Monte-Carlo Feature-Finding
The last method of feature-finding tested in this study takes the idea implicit in the progressive chunking method to what may be thought its logical conclusion. Monte-Carlo Feature-Finding is simply a random search for substrings that exist in the training data. Here this process is implemented by a program called TEFF (Text Extending Feature Finder). This finds markers merely by searching through a given set of training texts.
Essentially TEFF picks many substrings (3600 in the experiment reported here) from the combined training text completely at random. The length of each substring is also a random number from 1 to 8. All distinct substrings thus found are saved and then ranked according to a distinctiveness measure, i.e. according to a measure of their differential rate of occurrence in the different text categories for the problem concerned. Chi-squared is used to measure distinctiveness. For the present experiment only the most distinctive 36 substrings were kept.

6.1 Elastic Strings
However, earlier trials of Monte-Carlo Feature-Finding (Forsyth, 1995) have indicated that the basic procedure, as described above, tends to generate substrings that are fragmented at what seem linguistically inappropriate boundary points, even when they prove effective as discriminators. For instance, with samples from the Federalist Papers, a precursor of TEFF often found `pon' or `upo' instead of `upon '. So TEFF incorporates a procedure that extends each substring, as soon as it is generated, to the maximal length justified by the training text. This refinement does not require the program to have any knowledge of English orthography or morphology, but it does eliminate the most glaring examples of improper text fragmentation. For instance, applying this procedure to a list of substrings produced by from the Federalist data the substring `deraci` became ` confederacies`. Other expansions included `partmen' to ` department' and `ongres' to ` congress'.
6.2 Measuring Distinctiveness
The measure of distinctiveness used in TEFF is Chi-squared (Pearson, 1900), with expected values computed on the basis of equal rates of occurrence in the various categories of text being examined. This has been reported as effective in characterizing textual differences by McMahon et al. (1979) and Hofland & Johansson (1982). However, Chi-squared has also been criticized, e.g. by Church & Hanks (1989) and by Kilgarriff (1996), on the grounds that overall frequency does not distinguish words or strings that occur plentifully in only one section of a text from those that occur more steadily throughout it. Ideally we want markers that permeate a particular kind of text. Thus:
"an ideal index should actually be a measure of permeation; that is, it should consider both frequency and distribution information." Stone & Dunphy (1966), p. 33.

In pursuit of this ideal, TEFF offers two measures of distinctiveness. In the first, Chi-squared is computed using unadjusted frequency counts; in the second, the square root of the count for each kilobyte line is accumulated rather than the frequency count itself. This latter measure penalizes strings that occur very often but only in a narrow segment of the text. It thus has a better claim to be an index of permeation than the unmodified Chi-squared score. Both variants were compared in the present experiment.

7. Results
The main response variable measured is the percentage of correct classifications made on unseen test data. Mean values are shown in Table 1. Table 1 -- Mean Success Rates on Test Data. Marker Type Percent Correct Klecka's tau
0. ALPHANUM 69.15 0.4369
1. UNIGRAMS 73.41 0.5078
2. DIGRAMS 70.11 0.4528
3. TRIGRAMS 69.65 0.4426
4. TETRAGRAMS 68.89 0.4301
5. WORDS 68.99 0.4309
6a. DOUBLETS 70.26 0.4527
6b. DOUBLETS 69.38 0.4400
7a. TEFFSUBS 73.77 0.5203
(raw counts)
7b. TEFFSUBS 75.00 0.5325
(sqrt counts)
These figures are aggregated over all 13 problems in Tbench96. Percentage success rate is given because it is familiar and thus easy to interpret. However, as different problems vary in the number of categories to be distinguished (2, 3 or 4), the rightmost column gives a measure of accuracy that compensates for unequal numbers of categories (Klecka, 1980).

Using an overall significance level of 0.05 Hsu's multiple- comparison method indicated that all but three methods were significantly different from the best. These three were: UNIGRAMS, TEFFSUBS using raw counts, and TEFFSUBS using the square-root transformation. Only these three candidates remain as realistic contenders to be the best method of the 10 tested here.

8. Discussion
A variety of stylometric indicator types have been examined empirically. This has given rise to a preliminary `pecking order' among marker types based on results obtained on a benchmark suite of text classification problems. This ranking suggests firstly that it is worthwhile to allow non-alphanumeric characters as textual features as well as alphanumeric ones, and secondly that it is worthwhile to select textual features for distinctiveness rather than simply by frequency.
It is also worth noting that one of the simplest ways of finding textual features (UNIGRAMS, which just uses the commonest 36 characters in the training texts) is also one of the best. Moreover, no advantage was evident from using longer n-grams.

The two novel types of textual marker tested performed creditably. Both merit serious consideration by future researchers in this area. DOUBLETS obtained by progressive pairwise chunking would appear to be at least as informative as common words, which have previously held pride of place in stylometry.

Of course this conclusion only carries weight to the extent that the benchmark suite used here (Tbench96) is adequate, and it can only be regarded as a prototype. Nevertheless the very idea of using an agreed suite of benchmark problems is in itself a contribution to stylometry; and while Tbench96 is admittedly imperfect, it already presents a wider range of problems than customarily used, and provides a starting point for future developments.


I thank Dr David Holmes and Professor Colin Martindale for providing some of the text files used in this benchmarking exercise, as well as for helpful comments.


Bayer, T., Renz, I., Stein, K. & Kressel, U. (1996). Domain and Language Independent Feature Extraction for Statistical Text Categorization. In: L.J. Evett & T.G. Rose, eds., Language Engineering for Document Analysis and Recognition. Nottingham Trent University, Nottingham.

Binongo, J.N.G. (1994). Joaquin's Joaquinesquerie, Joaquinesquerie's Joaquin: A Statistical Expression of a Filipino Writer's Style. Literary & Linguistic Computing, 9(4), 267-279.

Burrows, J.F. (1992). Not unless you Ask Nicely: the Interpretive Nexus between Analysis and Information. Literary & Linguistic Computing, 7(2), 91-109.

Burrows, J.F. & Craig, D.H. (1994). Lyrical Drama and the "Turbid Montebanks": Styles of Dialogue in Romantic and Renaissance Tragedy. Computers & the Humanities, 28, 63-86.

Church, K. & Hanks, P. (1989). Word Association Norms, Mutual Information and Lexicography. Proc. 27th Annual ACL Meeting, Vancouver, 76-83.

Dawkins, R. (1976). Hierarchical Organisation: a Candidate Principle for Ethology. In: P.P.G. Bateson & R.A. Hinde, eds., Growing Points in Ethology. Cambridge University Press.

Everitt, B.S. & Dunn, G. (1991). Applied Multivariate Data Analysis. Edward Arnold, London.

Forsyth, R.S. (1995). Stylistic Structures: a Computational Approach to Text Classification. Unpublished Doctoral Thesis, Faculty of Science, University of Nottingham.

Greenwood, H.H. (1995). Common Word Frequencies and Authorship in Luke's Gospel and Acts. Literary & Linguistic Computing, 10(3), 183-187.

Holmes, D.I. (1994). Authorship Attribution. Computers & the Humanities, 28, 1-20.

Kilgarriff, A. (1996). Which Words are Particularly Characteristic of a Text? In: L.J. Evett & T.G. Rose, eds., Language Engineering for Document Analysis and Recognition. Nottingham Trent University, Nottingham.

Kjell, B. (1994). Authorship Determination Using Letter Pair Frequency Features with Neural Net Classifiers. Literary & Linguistic Computing, 9(2), 119-124.

Klecka, W.R. (1980). Discriminant Analyis. Sage Publications, Newbury Park.

Ledger, G.R. (1989). Re-Counting Plato. Oxford University Press, Oxford.

Ledger, G.R. & Merriam, T.V.N. (1994). Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary & Linguistic Computing, 9(3), 235-248.

Makridakis, S. & Wheelwright, S.C. (1989). Forecasting Methods for Managers, fifth edition. John Wiley & Sons, New York.

McKenzie, D.P. & Forsyth, R.S. (1995). Classification by Similarity: An Overview of Statistical Methods of Case-Based Reasoning. Computers in Human Behavior, 11(2), 273-288.

McMahon, L.E., Cherry, L.L. & Morris, R. (1978). Statistical Text Processing. Bell System Technical Journal, 57(6), 2137-2154.

Mosteller, F. & Wallace, D.L. (1984). Applied Bayesian and Classical Inference: the Case of the Federalist Papers. Springer-Verlag, New York. [Extended edition of: Mosteller & Wallace (1964). Inference and Disputed Authorship: the Federalist. Addison-Wesley, Reading, Massachusetts.]

Pearson, K. (1900). On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that It Can Reasonably Be Supposed to Have Arisen from Random Sampling. Philosophical Magazine, 6(2), 157-176.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.

Sneath, P.H.A. & Sokal, R.R. (1973). Numerical Taxonomy. W.H. Freeman & Co., San Francisco.

Stone, P.J., Dunphy, D.C., Smith, M.S. & Ogilvie, D.M. (1966). The General Inquirer: a Computer Approach to Content Analysis in the Behavioral Sciences. MIT Press, Cambridge, Mass.

Ule, L. (1982). Recent Progress in Computer Methods of Authorship Determination. ALLC Bulletin, 10(3), 73-89.

Wolff, J.G. (1975). An Algorithm for the Segmentation of an Artificial Language Analogue. Brit. J. Psychology, 66(1), 79-90.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review


Hosted at Queen's University

Kingston, Ontario, Canada

June 3, 1997 - June 7, 1997

76 works by 119 authors indexed

Series: ACH/ALLC (9), ACH/ICCH (17), ALLC/EADH (24)

Organizers: ACH, ALLC