Feature Selection in Authorship Attribution: Ordering the Wordlist

Authorship
  1. Maciej Eder

    Institute of Polish Language - Polish Academy of Sciences

  2. Joanna Byszuk

    Institute of Polish Language - Polish Academy of Sciences



Introduction
Feature selection in machine learning is one of the best covered topics in the general statistics literature and, next to the choice of classification algorithm, the most important factor to consider. To name but a few, the techniques for identifying the most efficient features include dimension reduction, shrinkage, and penalization (James et al., 2013). In stylometric investigations, however, the selection of the best performing style-markers is usually narrowed down to the choice between word frequencies, character n-grams, POS-tag n-grams (Hirst and Feiguina, 2007; Stamatatos, 2009), etc., without devoting much attention to their statistical properties.
The purpose of this study is somewhat different: apart from identifying meaningful features – that is, features that facilitate telling the analyzed classes apart – we would like to provide some deeper linguistic understanding of the most distinctive features, i.e. to discover whether the words that perform well in classification share any linguistic properties.
In stylometry, and particularly in authorship attribution, most frequent words (MFWs), or more specifically their mean frequencies in the corpus, are traditionally claimed to exhibit strong discriminative power. In the vast majority of studies following the seminal work of Mosteller and Wallace (1964), the feature selection procedure starts with preparing a joint frequency list containing words ordered by decreasing number of occurrences in the entire corpus, from the most to the least frequent. It has been shown that MFWs are mostly function (synsemantic) words, bearing meaning only in the company of other words, which provides a plausible theoretical justification for simply taking a considerable number of top words from a frequency list as potential features. Even though content (autosemantic) words are also considered important, and, as claimed in a recent study (Lestrade, 2017), crucial for the Zipfian distribution of words in a dataset, they are very sensitive to topic, theme, and other factors that might overshadow the authorial signal.
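As a concrete illustration, here is a minimal sketch of that classic procedure (in Python; the function name and toy corpus are ours, not the authors'): pooling token counts over the whole corpus and taking the top of the joint frequency list.

```python
from collections import Counter

def mfw_list(tokenized_texts, n=500):
    """Pool token counts over the whole corpus and return the n most
    frequent words, in decreasing order of occurrence."""
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]

# Toy corpus of three pre-tokenized "texts":
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog met on the mat".split(),
]
print(mfw_list(corpus, n=5))
# -> ['the', 'on', 'cat', 'sat', 'mat'] (ties resolved by insertion order)
```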
However, there is no simple answer to the question of how many MFWs should be taken into consideration. Consequently, there is no consensus among scholars regarding the choice of the MFW vector, which ranges from a very limited number of top words (Juola, 2008) to long vectors of 1,000 or so features (Hoover, 2004). Rybicki and Eder (2011) tested hundreds of combinations of both frequent and not-so-frequent words, and came to the conclusion that there is no universal number of features that would lead to satisfying results: it always depends on the language and the corpus, although a vector of between 100 and 1,000 MFWs usually yields acceptable performance. Further studies (Evert et al., 2017) corroborate these findings using various measures of textual similarity.
There are several ways to balance the influence of certain words in the corpus and lessen the impact of less important ones. Undoubtedly, the most popular is tf-idf (term frequency–inverse document frequency), commonly used in information retrieval systems. It assumes that a word which is attested in few documents, yet is relatively frequent in them (e.g. onto, upon, therefore), contributes much more to our knowledge of the text than a popular word evenly distributed across the corpus (e.g. the, a, an). By definition, this method of weighting culls the most frequent words and boosts the weights of “keywords”, or unusual words. A possibly unwanted effect is that such “keywords” also include proper nouns, personal names, and so on.
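A minimal sketch of the standard tf-idf scheme described above (our own illustrative code, using relative term frequency and a logarithmic inverse document frequency; real systems differ in the exact normalization):

```python
import math

def tf_idf(doc_term_counts):
    """Weight raw term counts with a standard tf-idf scheme:
    tf-idf(w, d) = tf(w, d) * log(N / df(w)), where tf is the relative
    frequency of w in document d, df(w) the number of documents
    containing w, and N the number of documents."""
    n_docs = len(doc_term_counts)
    df = {}
    for counts in doc_term_counts:
        for word in counts:
            df[word] = df.get(word, 0) + 1
    weighted = []
    for counts in doc_term_counts:
        total = sum(counts.values())
        # A word attested in every document gets idf = log(1) = 0:
        # this is exactly how the scheme culls ubiquitous function words.
        weighted.append({w: (c / total) * math.log(n_docs / df[w])
                         for w, c in counts.items()})
    return weighted
```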

A different approach assumes that the use of some words – no matter how frequent they actually are – does not differ significantly across the corpus, whereas other words are over- or underused. The variability of a given word in a text collection might then be a good indication of its discriminative power. However, the standard deviation of a sequence of numbers strongly depends on their actual values, and the same holds for word frequencies. For example, the standard deviation of the set x = {1, 2, 3, 4} is 1.290994, while the standard deviation of the set y = {10, 20, 30, 40} is ten times larger; back in the realm of word frequencies, this would mean that the variability of the word the would be orders of magnitude higher than that of infrequent words. This can be corrected with the coefficient of variation (CoV), i.e. the standard deviation of a set of values divided by its arithmetic mean. First attempts to tune the word list according to the CoV were carried out by Hoover (2014).
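The numbers above can be verified, and the correction illustrated, with a few lines of Python (using the sample standard deviation, which reproduces the 1.290994 quoted above):

```python
import statistics

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

# Sample standard deviations: 1.2909944... for x, 12.909944... for y;
# ten times larger, although the relative spread is identical.
print(statistics.stdev(x), statistics.stdev(y))

def cov(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# CoV removes the scale effect: both sets now score ~0.516.
print(cov(x), cov(y))
```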

In the following experiments, we explore the above three basic ways of ordering the features: (i) according to their mean frequency in the corpus, (ii) according to their mean TF-IDF score, and (iii) according to their CoV.
Machine learning is always burdened with the problem of over- or underfitting the model: using too few features usually makes it difficult to recover the signal from the corpus, but relying on too many without any pruning might overshadow the signal with noise. Rather than looking for the optimal length of the feature vector, we approach this problem by rearranging the features from the most to the least meaningful, which helps neutralize the issue of unimportant words landing high on the list of considered features. While it is possible to balance the impact of particular, not necessarily very important, features with a proper choice of distance measure and classification method, such solutions require advanced knowledge of existing practices and the ability to critically assess the method and results. A better method of feature selection, more resistant to words that carry importance only for individual works or authors, makes stylometry more accessible to beginners and provides more objectivity, reducing the opportunity and urge for “cherry-picking”.

Dataset

We used four datasets: 100 Polish, 100 English, and 75 French novels for the main analysis, and a smaller set of 28 English novels for test purposes, as the computations were time-consuming. The datasets were retrieved from resources used as benchmarks by the Computational Stylistics Group and in the Evert et al. (2017) comparison of Delta measures. The texts in these corpora are balanced for time period (the turn of the 19th and 20th centuries), number of texts per author (3, with one additional book for Polish and English), and, to some extent, style and topic – all are canonical literary texts dealing with social issues, with some influence of history.

Ordering          | TF | z-scores                | TF-IDF | z-scored TF-IDF
------------------+----+-------------------------+--------+----------------
mean TF (=MFWs)   | x  | x (discussed in detail) | x      | x
TF-IDF            | x  | x (discussed in detail) | x      | x
CoV               | x  | x (discussed in detail) | x      | x

Table 1: Considered scenarios (feature orderings in rows, weighting schemes in columns); only the z-score weighting is discussed in this paper.
Although we tested all the above-mentioned scenarios, due to the limited amount of space in this abstract we focus on the behavior of the various orderings of features that are subsequently transformed to z-scores, which proved most interesting in terms of results. The other weightings and their combinations with different arrangements of the features will be discussed in the full-length version of this paper.

Experiment design

In our approach, we performed a series of controlled attribution tests using the above-mentioned corpora of known authorship. Over multiple iterations we recorded the number of texts that were correctly attributed by our chosen supervised classifier. To neutralize the impact of any local idiosyncrasies in our corpora, we applied a leave-one-out cross-validation scenario, meaning that every single text from a corpus was checked against a slightly altered training set.
As for the classification method, we use the k-NN supervised lazy learner with k = 1, chosen for its conceptual simplicity and the intuitive interpretation of its results. In the stylometric community this classifier is well known under the name Delta (Burrows, 2002), and widely used. Since it is somewhat difficult to test multidimensional methods using one feature at a time, we apply a moving-window approach: for every single feature to be tested, we perform a test in 10-dimensional space for the feature in question and its 9 subsequent features (i.e. a total of 10 words in the order in which they appear after the weighting and ordering procedure). In other words, in each iteration we test the combination of features w_i, w_{i+1}, …, w_{i+9}.
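A sketch of this evaluation loop, under our reading of the setup (Delta implemented as 1-NN with a Manhattan distance on corpus-wide z-scores; array and function names are ours):

```python
import numpy as np

def moving_window_scores(freqs, authors, window=10):
    """Leave-one-out 1-NN ('Delta') accuracy for every window of
    `window` consecutive features. `freqs` is a (texts x features)
    relative-frequency matrix whose columns are already ordered by the
    weighting scheme under scrutiny; `authors` holds the class labels."""
    # z-score each feature across the whole corpus (Burrows's Delta setup)
    z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)
    n_texts, n_feats = z.shape
    accuracies = []
    for start in range(n_feats - window + 1):
        sub = z[:, start:start + window]
        # pairwise Manhattan distances between all texts
        dist = np.abs(sub[:, None, :] - sub[None, :, :]).sum(axis=2)
        np.fill_diagonal(dist, np.inf)   # never match a text with itself
        nearest = dist.argmin(axis=1)    # 1-NN under leave-one-out
        hits = sum(authors[i] == authors[j] for i, j in enumerate(nearest))
        accuracies.append(hits / n_texts)
    return accuracies
```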

Results

The evaluation of different rearrangements of the list of features starts with a classic MFW-centric approach, or ordering the features according to their mean term frequency (TF). The results turned out to be surprisingly unsurprising: mere word frequencies outperform other approaches (see Fig. 1).

Fig. 1: Attribution scores for the words ordered based on MFWs (mean TF), weighted with z-scores. One circle represents one window of 10 subsequent features.
TF-IDF reveals reasonable predictive power for the features ranked at the top of the list, as evidenced by the first few hundred words (Fig. 2). However, the obtained values are significantly lower than for the regular MFWs. It is worth noting that the end of the feature list spikes up: the very frequent function words down-weighted by TF-IDF and clustered at the end of the list actually outperform all the other features in this picture, once again highlighting the relevance of very common words for attribution.

Fig. 2: Attribution scores for the words ordered based on mean TF-IDF, weighted with z-scores.
Interesting and counter-intuitive are the results for the coefficient of variation (Fig. 3): attribution scores similar to, or even higher than, those obtained for TF occur only for the features grouped at the end of the considered wordlist. Moreover, the significant features are distributed more densely than in the other cases. Essentially, then, an inverse version of CoV, computed as 1 / CoV, would serve as an efficient feature harvester.

Fig. 3: Attribution scores for the words ordered based on the coefficient of variation (CoV), weighted with z-scores.
While the overall attribution success rates are higher for TF than for CoV, the tail of non-distinctive features is longer in the case of CoV than for the TF-ordered list, which suggests that, despite their differences, the two methods can both harvest meaningful features quite effectively. This brings us to the question: what if we combined their potential to extract the right features?
We propose to simply multiply TF by the inverse CoV. Since CoV is in fact the standard deviation of a given feature divided by its mean, we can denote the formula as follows:

ω_i = μ_i × (1 / (σ_i / μ_i))

where ω_i is the new weight of feature i, μ_i its mean TF, and σ_i its standard deviation. With a little bit of algebra, we observe that:

ω_i = μ_i² / σ_i

which, we believe, will further aggregate meaningful features at the beginning of the wordlist. We additionally tested a similar idea of boosting the meaningful features by simply magnifying TF by the standard deviation: μ_i × σ_i.
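The candidate orderings can be expressed in a few lines (an illustrative sketch with our own scheme labels; it assumes no feature has zero variance):

```python
import numpy as np

def order_features(freqs, scheme="mu2_over_sigma"):
    """Return column indices sorting features from most to least
    meaningful. `freqs` is a (texts x features) relative-frequency
    matrix; assumes no feature has zero variance."""
    mu = freqs.mean(axis=0)
    sigma = freqs.std(axis=0)
    if scheme == "mean_tf":            # classic MFW ordering
        weight = mu
    elif scheme == "mu2_over_sigma":   # proposed: mu * (1/CoV) = mu^2 / sigma
        weight = mu ** 2 / sigma
    elif scheme == "mu_times_sigma":   # alternative: TF magnified by sd
        weight = mu * sigma
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return np.argsort(weight)[::-1]    # indices by decreasing weight
```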

Fig. 4: Cumulative attribution scores for different scenarios discussed in the paper, for the first 500 features. Inset: an overview of the entire feature set. Results obtained on the test corpus of 28 English novels.
The comparison of the above three scenarios plus the two newly introduced ones is shown in Fig. 4, where cumulative sums of attribution accuracies are presented. If the features were spread randomly over the word list, the observed results would follow the grey dashed line. The higher a given trajectory, the better the ordering of features from most to least meaningful. An overview of the entire feature set (Fig. 4, inset) shows good performance for TF, but a closer look at the top few dozen features (Fig. 4, main) shows that our newly introduced weighting outperforms the time-proven MFW-centric approach. The suboptimal behavior of CoV (meaningful features clustered at the end of the list, see also Fig. 3) is mirrored by its very low trajectory.
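For reference, a sketch of how such cumulative trajectories and the random-order baseline can be computed from per-window accuracies (our own helper, not the authors' code):

```python
import numpy as np

def cumulative_curve(accuracies):
    """Cumulative sum of per-window accuracies (e.g. the output of
    moving_window_scores) plus the baseline expected under a random
    ordering of features: a straight line with slope mean(accuracy)."""
    acc = np.asarray(accuracies)
    observed = np.cumsum(acc)
    baseline = np.arange(1, len(acc) + 1) * acc.mean()
    return observed, baseline
```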

Conclusions

In this paper, we experimentally confirmed that the intuition of ordering a list of features according to their decreasing frequencies has solid grounds. An alternative approach – ordering features according to their “keyness”, i.e. TF-IDF scores – turned out to be questionable. The major observation formulated here, however, was that combining the usual MFW approach with an inverse CoV weighting arranges the features even more efficiently.

Acknowledgments
The study was conducted as part of the Large-Scale Text Analysis and Methodological Foundations of Computational Stylistics project (SONATA-BIS 2017/26/E/HS2/01019) funded by the Polish National Science Center (NCN), for whose support we are very grateful.

Bibliography

Burrows, J. (1987). Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. Oxford: Clarendon Press.

Burrows, J. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87.

Eder, M. (2011). Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics, 6: 99–114.

Eder, M. and Rybicki, J. (2013). Do birds of a feather really flock together, or how to choose training samples for authorship attribution. Literary and Linguistic Computing, 28: 229–36.

Eder, M., Kestemont, M. and Rybicki, J. (2013). Stylometry with R: a suite of tools. Digital Humanities 2013: Conference Abstracts. Lincoln: University of Nebraska-Lincoln, pp. 487–89.

Evert, S., Jannidis, F., Proisl, T., Vitt, T., Schöch, C., Pielström, S. and Reger, I. (2016). Outliers or key profiles? Understanding distance measures for authorship attribution. Digital Humanities 2016: Conference Abstracts. Kraków, pp. 188–91.

Evert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C. and Vitt, T. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities, 32: 4–16.

Hirst, G. and Feiguina, O. (2007). Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, 22: 405–17.

Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19: 453–75.

Hoover, D. L. (2014). Tuning the word frequency list. Digital Humanities 2014: Book of Abstracts. Lausanne, pp. 200–202.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.

Juola, P. (2008). Authorship attribution. Foundations and Trends in Information Retrieval, 1: 233–334.

Lestrade, S. (2017). Unzipping Zipf's law. PLOS ONE, 12(8): e0181987.

Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley.

Rybicki, J. and Eder, M. (2011). Deeper Delta across genres and languages: do we really need the most frequent words? Literary and Linguistic Computing, 26(3): 315–21.

Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60: 538–56.

Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21(2).

