The present work explores Dickens' style using the statistical method of Representativeness & Distinctiveness 1 to detect words that Dickens either particularly preferred or avoided compared with other writers of his time. It takes as its starting point research reported at Digital Humanities 2012 2, where Tabata used the classification algorithm Random Forests 3 to determine words able to distinguish Dickens' works both from those of his contemporary Wilkie Collins and from those of a larger set of 18th- and 19th-century authors.
Representativeness & Distinctiveness was originally conceived in the realm of dialectometry, where it has been shown to detect lexical items distinguishing different dialectal areas. In the context of stylometry, the method detects elements for which an author is consistent throughout his own works while also separating him from others. Consider, for instance, a comparison between Dickens and fellow writer Collins on word features, using a few novels by each writer. One first determines Dickens' representative terms, i.e. those words which he uses consistently, either frequently or infrequently, across his works. To arrive at a combined measure, one then favours those representative terms of Dickens that Collins uses either inconsistently, or consistently but with a different frequency, across his novels. The resulting words are considered to be Dickens' representative and distinctive terms when compared with Collins. Since the analysis is directional, with the degree of Representativeness of individual features differing for each author, the comparison is made twice: once from Dickens' set to Collins' and once from Collins' set to Dickens'. This returns two individual author profiles, where features occurring in both profiles are consistent for both writers.
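The directional comparison described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the exact formulation of Prokić et al. (2012): the per-word distance is taken to be the absolute difference in relative frequency between two documents, both averages are standardised across all words, and the function names `standardise` and `profile_scores` are hypothetical. A high score marks a word that varies little within the author's own documents but differs strongly from the other author's documents.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardise(values):
    """Z-score a dict of word -> value across all words."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return {w: 0.0 for w in values}
    return {w: (v - mu) / sigma for w, v in values.items()}


def profile_scores(own_docs, other_docs):
    """Score each word for the author of own_docs against other_docs.

    Each document is a dict mapping word -> relative frequency.
    Assumption: distance between two documents on a word is the
    absolute difference of the two relative frequencies.
    """
    if len(own_docs) < 2 or not other_docs:
        raise ValueError("need >= 2 own documents and >= 1 other document")
    words = sorted(set().union(*own_docs, *other_docs))
    within, between = {}, {}
    for w in words:
        # Representativeness: small average difference among the
        # author's own documents.
        within[w] = mean(abs(a.get(w, 0.0) - b.get(w, 0.0))
                         for a, b in combinations(own_docs, 2))
        # Distinctiveness: large average difference to the other
        # author's documents.
        between[w] = mean(abs(a.get(w, 0.0) - b.get(w, 0.0))
                          for a in own_docs for b in other_docs)
    z_within, z_between = standardise(within), standardise(between)
    return {w: z_between[w] - z_within[w] for w in words}


# Toy relative frequencies, invented for illustration (not corpus data).
author_a = [{"alpha": 0.010, "beta": 0.002, "gamma": 0.005},
            {"alpha": 0.011, "beta": 0.009, "gamma": 0.005},
            {"alpha": 0.010, "beta": 0.005, "gamma": 0.005}]
author_b = [{"alpha": 0.002, "beta": 0.004, "gamma": 0.005},
            {"alpha": 0.003, "beta": 0.006, "gamma": 0.005}]

# The analysis is directional, so it is run once in each direction.
a_vs_b = profile_scores(author_a, author_b)
b_vs_a = profile_scores(author_b, author_a)
print(sorted(a_vs_b, key=a_vs_b.get, reverse=True))
```

In this toy setting, `alpha` is used steadily by author A and at a clearly different rate by author B, so it ranks first; `beta` fluctuates within author A's own documents and ranks last.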
Thus, Representativeness & Distinctiveness bears similarities to both Burrows's Delta 4 and Zeta 5 in so far as it favours consistent terms that are irregular in the opposing author's set. It is also similar to Zeta in depending on the other set for the selection of distinctive terms from among the representative ones. However, rather than preselecting words according to different frequency strata, it is applied here to the 5000 most frequent words.
We compare our results on "Dickens vs. Collins" and "Dickens vs. World" to the earlier study using Random Forests (RF) classification, which was also based on frequency comparisons. RF is a machine-learning technique based on ensemble learning from a large number of decision trees, hence "forests". Each tree is trained on a different subset of the training data and subsequently evaluated on the remainder. At each node a different subset of the features is considered, and the feature giving the best split is selected. Each feature's importance is averaged over all trees, so that, as in our method, there is a measure of how useful a particular feature is for classification. Representativeness & Distinctiveness and Random Forests are conceptually similar in that the document space is treated as a set of smaller comparisons between documents and distinguishing features are chosen accordingly. While RF covers the complete process of classification and evaluation, Representativeness & Distinctiveness, although less straightforward to evaluate, may have advantages in terms of interpretability.
Since a gold standard indicating the most prominent stylistic features of an author is generally not available, we evaluate author profiles over different iterations of cross-validation by testing how well the selected words separate authors in clustering. For this purpose, we retain only the words shared by both profiles, since the values for terms found in only one author profile might not be consistent in the other author's documents. We then cluster all documents based on the relative frequencies of those shared terms. From the ideal clustering into the two author groups and the clustering obtained in the present iteration, we compute the Adjusted Rand Index 6. The evaluation technique proposed here is intended to measure the classification ability of the author profiles and thus also the appropriateness of the method responsible for choosing them.
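For readers unfamiliar with the Adjusted Rand Index, the following Python sketch computes it from two label lists using the pair-counting formula of Hubert and Arabie (1985). The function name `adjusted_rand_index` and the example labels are our own illustration; the clustering step itself is not shown, and the predicted labels below are simply written by hand.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand Index between two labelings of the same documents."""
    if len(truth) != len(predicted):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: nothing to disagree on

    # Pairs of documents grouped together in both labelings.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(truth, predicted)).values())
    # Pairs grouped together in each labeling taken on its own.
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = together_truth * together_pred / total_pairs
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Hypothetical example: six documents, three by each of two authors.
ideal = ["D", "D", "D", "C", "C", "C"]
perfect = [1, 1, 1, 0, 0, 0]    # cluster ids need not match the author labels
one_error = [1, 1, 0, 0, 0, 0]  # one document placed in the wrong cluster

print(adjusted_rand_index(ideal, perfect))
print(adjusted_rand_index(ideal, one_error))
```

A value of 1 indicates that the clustering reproduces the author groups exactly, while values near 0 indicate agreement no better than chance, which is why the index is suited to comparing cross-validation iterations.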
Applying this evaluation to our representative and distinctive profiles of Dickens and his contemporaries indicates a high degree of separation ability in clustering. Although a different method was used to determine the author profiles, there is a fair overlap of items with the earlier study using Random Forests, which supports the assumption that they are genuine stylistic elements of Dickens and his peers.
1. Jelena Prokić, Çağrı Çöltekin, and John Nerbonne (2012). Detecting shibboleths. In: Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH. Avignon, France: Association for Computational Linguistics, pp. 72–80.
2. Tomoji Tabata (2012). Approaching Dickens' Style through Random Forests. In: Proceedings of Digital Humanities 2012. Hamburg, Germany.
3. Leo Breiman (2001). Random Forests. In: Machine Learning 45.1, pp. 5–32.
4. John Burrows (2002). 'Delta': a measure of stylistic difference and a guide to likely authorship. In: Literary and Linguistic Computing 17.3, pp. 267–287.
5. John Burrows (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. In: Literary and Linguistic Computing 22.1, pp. 27–47.
6. Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. In: Journal of Classification 2.1, pp. 193–218.
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
July 7, 2014 - July 12, 2014
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)