Stylometry of Collaborations: Dickens, Collins and their collaborative writings

paper, specified "long paper"
  1. 1. Tomoji Tabata

    Graduate School of Language and Culture - University of Osaka

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

1. Introduction
The Victorian author Charles Dickens was among the first publishing entrepreneurs to run mass- produced weekly/monthly magazines on a successful commercial basis. He employed many ‘salaried staff writers’ (Nayder, 2002), who had to write under anonymity, including Elizabeth Gaskell, Adelaide Anne Proctor et al., in Household Words and All the Year Round, the journals ‘conducted by’ Dickens (Stone, 1968; Thomas, 1982; Allingham, 2011).

On the other hand, Dickens collaborated with his younger contemporary Wilkie Collins on a number of stories, typically for the Christmas Numbers of his journals. While some of their collaborative pieces were written with the assistance of other staff writers, four works are known to have been co-authored by Dickens and Collins alone (Nayder, 2002): The Frozen Deep (1857), The Lazy Tour of Two Idle Apprentices (1857), The Perils of Certain English Prisoners’ (1857), and No Thoroughfare (1867). The four collaborations can be seen as betokening what appears to be a firm presence of Collins, a foothold he had gained, in the Dickens circle by the time he and Dickens launched into the joint works beginning in 1857.

These collaborative writings vary in design and style from one another as well as in theme and setting. In some cases, one chapter can be read as radically different from another due in large to the varying proportion of contribution by each of the duo: some chapters were written either Dickens or Collins alone, while there are other chapters that were jointly written although the extent of collaboration has yet to be identified quantitatively.

In order to provide new insight to the nature of collaborative authorship, the present study applies s series of stylometric techniques: (1) Craig’s extension of Burrows’s Zeta test for reliably extracting author markers from a large number of candidate words; (2) Cluster analysis based on Burrows’s Delta distance measure (Burrows, 2002) to compare the collaborations with the canonical works of Dickens and Collins; and (3) Rolling Delta (Eder, Kestement and Rybicki, 2013) in an effort to detect authorial takeovers or to estimate the extent of contribution by each of the two authors in their collaborative writings.

2. Single authorship and mixed authorship
Although the lack of byline makes it difficult to determine the authorship of Christmas numbers for Dickens’s periodicals, the account book in the office of Household Words helps identify many of his collaborators (Thomas, 1982). Table 1 shows bibliographic details for the four collaborative works between Dickens and Collins.

The Frozen Deep was originally written as a drama in three acts. Collins drafted the manuscript and Dickens heavily revised it. The script of the drama remained unpublished until 1866, when Collins altered it single-handedly, getting rid of Dickens’s hand. After Dickens’s death, Collins adapted the play as a novella for use in his public reading tour in 1874 (Brannan, 1966). No Thoroughfare was also written first as a drama and then rewritten into a novella form.

When the collaborative pieces are divided into smaller units like a chapter or act, eight units (including The Frozen Deep) are of single authorship, with the remaining six units being a case of mixed authorship.

Fig. 1:

3. Testing the authorship of collaborative chapters
The following experiments draw on a Dickens corpus comprising 22 texts and a Collins corpus with the same number of texts as a basis of reference, with which we compare the style of the collaborative chapters (see Tables 3 and 4 in Appendix for details). The first round of analysis is to run Craig’s version of Burrows’s Zeta test in order to extract Dickens markers as well as Collins markers. The vast majority of authorship attribution studies have relied on n most common words in the corpus/text in question (Burrows, 2002/2005; Eder, 2010; Eder & Rybicki, 2009; Hoover, 2003/2004; Rybicki, 2009; to name but a few). The recent works by Craig & Kinney (2009) and Hoover (2010; 2011; 2013), on the other hand, have demonstrated Zeta’s strong power of differentiation between two sets of text samples. Other keywords extraction techniques include the use of Log-likelihood ratio (Dunning, 1993) criticised by Tabata (2012) for being prone to burstiness (and for its tendency to produce too many false-positives); Mann-Whiney U test (Kilgariff, 2001); t-test (Hoover, 2010); bootstrap test (Lijffijt et al., 2012); Random forests (Tabata, 2012), and so forth. The particular strength of Craig’s version of Zeta analysis is its simplicity and effectiveness well documented in Hoover’s studies mentioned above. A Zeta distinctiveness ratio for the word i is calculated in the following formula:

Fig. 2:

where N(x) = Total number of text-segments in the corpus x; DF(x)i = Document (or Segment) frequency for the word i in the corpus x; N(y) = Total number of text-segments in the corpus y; N(y) – DF(y)i = Document (or Segment) frequency negative for the word i in the corpus y.

The R package stylo (a suite of tools for stylometric analyses) (Eder, Rybicki, and Kestmont, 2013) includes the function oppose() with an option to calculate Zeta scores. In the present case, each text was sliced into 10,000-word segments with rare occurrence threshold set to 2 so as to exclude hapax legomena and filtering threshold set to 0.5 to extract strongly discriminating words (thus only words(i) with Zeta(i) < 0.5 or Zeta(i) > 1.5 are picked up). The procedure detected 122 authorial markers: 61 Dickens markers and 61 Collins markers as listed in Table 2.

Fig. 3:

When the marker words were fed into a cluster analysis, the resulting dendrogram (Fig. 1) clearly differentiated the two authors in the distinct clusters. The marker words are also sensitive enough to show sub-patterns: both in Dickens and Collins clusters, the texts written in their early career flock together (Dickens’s texts in the 1830’s and Collins’s texts published in the 1850’s). Dickens’s early works branch close to the sketches/travelogues as opposed to fictions.

Fig. 4: Cluster analysis: Dickens versus Collins (Distance: Burrows’s Delta)

Fig. 4 is a cluster dendrogram with the collaborative pieces included in the analysis. The prominent feature of this result is that:

1. The single-handed chapters/acts fit in well with the author’s cluster (Chapters 1 and 3 of The Perils of Certain English Prisoners perch in the topmost sub-cluster together with the Overture and Act 3 of No Thoroughfare, whereas The Frozen Deep and Act 2 of No Thoroughfare are found as the nearest neighbours to each other, with Chapter 2 of The Perils placed in the Collins cluster.
2. The co-authored chapters/acts form slightly distant sub-clusters in each of the two author’s main clusters
4.Letting the Delta roll through to find dynamic shifts in style
Although the result of cluster analysis, being in consonant with the bibliographical details given in Table 1, helps confirm the effectiveness of style markers found through a Zeta test, it inevitably tells us about the limitation of the procedure that captures only a static snapshot of stylistic similarity or difference between texts. Language is never monolithic throughout a text. Language in fiction varies wildly from narratives to dialogue, or vice versa. Language in fiction is indeed quite mobile.

We need, therefore, a technique to capture dynamic style change. If the collaborated chapters can be sliced into consecutive segments of n words with a partial over- lapping so that we can roll through to focus upon a certain stretch of text like a moving camera rather than take the entire text/chapter in one snapshot, it will be possible to detect subtle fluctuations of style between one stretch and another as well as to pinpoint where one author takes over from another, etc. Eder, Rybicki, and Kestemont’s Rolling Delta was developed exactly for this purpose.

Figure 5 shows a result of Rolling Delta run with a window size set to 3,000 words, a step size of 300 words, using 100 most common words as variables. The whole text of The Perils of Certain English Prisoners is cut into consecutive 3,000 word-segments, with each segment compared with the centroids of Dickens and Collins, respectively.

Fig. 5: Rolling Delta: The Perils of Certain English Prisoners and the centroids of Dickens and Collins

What strikes us is that the two major intersections (marked with green vertical lines) roughly correspond to the chapter boundaries in the text, a result that illustrates how the technique is capable of pinpointing possible authorial takeovers.

Fig. 6: Rolling Delta: The collaborated chapters of the Lazy Tour (The thicker line indicates Dickens’s centroid)

Fig 6 displays Delta polygons of the four collaborated chapters of the Lazy Tour of Two Idle Apprentices. Of particular interest is that a remarkably similar pattern holds throughout the four diagrams: it is always Dickens who takes the lead at the outset of a chapter. He runs quarter or halfway (at most) into each chapter before passing over to Collins. The series of Rolling Delta plots seem to reflect an interesting nature of collaboration as well as the unequal partnership (Nayder, 2002) between Dickens and Collins: Dickens always takes initiative, sets a keynote for the whole chapter, which Collins takes over and continues the rest of chapters, a typical relationship between a master and his disciples.

Allingham, P. V. (2011). A Comprehensive List of Dickens’s Short Fiction, 1833-1868. The Victorian Web. [Online] (Last accessed 31 October 2013.)

Brannan, R. L. (1966). Under the Management of Mr. Charles Dickens: His Production of "The Frozen Deep". Ithaca, New York: Cornell University Press.

Burrows, J. (2002). Delta: a Measure of Stylistic Difference and a Guide to Likely Authorship. LLC, 17. 267-287.

Burrows, J. (2005). Who wrote Shamela? Verifying the Authorship of a Parodic Text, LLC, 19/4: 453–475

Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata, LLC, 22/1: 27–47.

Craig, H. and A. Kinney (eds.) (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence, Computational Linguistics, 19/1: 61–74.

Eder, M. (2010). Does Size Matter? Authorship Attribution, Small Samples, Big Problem, Digital Humanities 2010 Conference Abstracts, King’s College London, 132–5.

Eder, M. and J. Rybicki (2009). PCA, Delta, JGAAP and Polish Poetry of the 16th and the 17th Centuries: Who Wrote the Dirty Stuff?, Digital Humanities 2009 Conference Abstracts, University of Maryland, College Park, 242–4.Eder, M., J. Rybicki, and M. Kestemont (2013) Computational Stylistics [Online] (Last accessed 31 October 2013.)

Eder, M., M. Kestemont and J. Rybicki (2013). Stylometry with R: a suite of tools. Digital Humanities 2013 Conference Abstracts. University of Nebraska–Lincoln, NE, 487–9.

Hoover, D. L. (2003). Multivariate Analysis and the Study of Style Variation, Literary and Linguistic Computing, 18/4: 341–360.

Hoover, D. L. (2004). Testing Burrows’s Delta, Literary and Linguistic Computing, 19/4: 453–475. Hoover, D. (2010). Teasing out authorship and style with t-tests and Zeta. Digital Humanities 2010 Conference Abstracts, King’s College London, 168–70.

Hoover, D. (2011). The Tutor’s Story: a case study of mixed authorship. Digital Humanities 2012 Conference Abstracts, Stanford University, Stanford, CA, 149–51.

Hoover, D. (2012). The rarer they are, the more they are, the less they matter. Digital Humanities 2012 Conference Abstracts, Hamburg University, Hamburg, 218–21.

Hoover, D. (2013). Almost All the Way Through — All at Once. Digital Humanities 2013 Conference Abstracts, University of Nebraska–Lincoln, NE, 223–6

Kilgariff, A. (2001). Comparing Corpora. International Journal of Corpus Linguistics 6 (1): 1–37.

Lane, M. (1956). Introduction. In The Oxford Illustrated Dickens, Christmas Stories. Oxford: OUP

Lijffjt, J., T. Sälly, and T. Nevalainen (2012) Chi-square test considered harmful: Better methods fo testing the significance of word frequencies, A paper presented at ICAME 33, 30 May–3 June 2012, Leuven, Belgium.

Nayder, L. (2002). Unequal Partners: Charles Dickens, Wilkie Collins, and Victorian Authorship. Ithaca/London: Cornell UP.

Rayson, P. and R. Garside (2000). Comparing Corpora Using Frequency Profiling, Proceedings of the Workshop on Comparing Corpora, Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), 1–8 October 2000, Hong Kong. 1–6. Available online at

Rybicki, J. (2009). Translation and Delta Revisited: When We Read Translations, Is It the Author or the Translator that We Really Read?, Digital Humanities 2009 Conference Abstracts, University of Maryland, College Park, 245–7.

Rybicki, J.and M. Eder (2011). Deeper Delta across genres and languages: do we really need the most frequent words?, Literary and Linguistic Computing, 26/3: 315–321.

Stone, H. (ed.) (1968). Charles Dickens’ Uncollected Writings from “Household Words” 1850–1859. 2 vols. Bloomington: Indiana UP.

Tabata, T.Approaching Dickens’s Style through Random Forests, Digital Humanities 2012 Conferenc Abstracts, University of Hamburg, Germany, 388–91.

Thomas, D. A. (1982). Dickesn and the Short Story. Philadelphia: University of Pennsylvania Press.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from (needs to replace plaintext)

Conference website:

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO