Gentleman in Dickens: A Multivariate Stylometric Approach to its Collocation

  1. Tomoji Tabata

    University of Osaka

Work text

This study proposes a multivariate approach to the collocation
of gentleman in Dickens. By applying a stylo-statistical analysis
model based on correspondence analysis (Tabata, 2004, 2005,
2007a, and 2007b) to the investigation of the collocation
of the word gentleman, the present study visualizes the
complex interrelationships among gentleman’s collocates,
interrelationships among texts, and the association patterns
between gentleman’s collocates and texts in multidimensional
space. By so doing, I shall illustrate how the
collocational patterns of gentleman reflect stylistic variation
over time as well as the stylistic fingerprint of the author.
One thing that strikes readers of Dickens in terms of his style
is the way he combines words (i.e. collocation).
Dickens sometimes combines words in a quite unique manner,
in a way that contradicts our expectations, to create humour, to
imply irony or satire, or for other effects. Dickens also uses
fixed/repetitive collocations for characterization: to make particular
characters memorable by making them stand out from others.
The collocation of the word gentleman, one of the most frequent
‘content’ words in Dickens, is full of such examples, and is
therefore worthy of stylistic investigation.
The significance of collocation in stylistic studies was first
suggested by J. R. Firth (1957) more than half a century ago.
Thanks to the efforts of Sinclair (1991) and his followers,
who have developed empirical methodologies based on large-scale
corpora, the study of collocation has become widely
acknowledged as an important area of linguistics. However,
the vast majority of collocation studies (Sinclair, 1991;
Kjellmer, 1994; Stubbs, 1995; Hunston and Francis, 1999, to
name but a few) have been concerned with grammar/syntax,
lexicography, and language pedagogy, showing the company a
word keeps, rather than with stylistic aspects of collocation. Notable
exceptions are Louw (1993), Adolphs and Carter (2003), Hori
(2004), and Partington (2006). The emergence of these recent
studies seems to indicate that collocation is beginning to draw
increasing attention from stylisticians.
The major tools used for collocation studies are KWIC
concordances and statistical measures, which have been
designed to quantify collocational strength between words
in texts. With the wide availability of highly functional
concordancers, it is now conventional practice to examine
collocates in a span of, say, four words to the left, and four
words to the right, of the node (target word/key word) with
the help of statistics for filtering out unimportant information.
This conventional approach makes it possible to detect a
grammatical/syntactic, phraseological, and/or semantic relation
between the node and its collocates. However, a conventional
approach would not be suitable if one wanted to explore
whether the company a word keeps remains unchanged,
or whether its collocation changes over time or across texts.
Including a diachronic or a cross-textual/authorial perspective
when retrieving collocational information from a large set of
texts would involve too many variables to process with a
concordancer. Such an enterprise would
require a multivariate approach.
Various multivariate analyses of texts have been successful
in elucidating linguistic variation over time, variation across
registers, variation across oceans, to say nothing of linguistic
differences between authors (Brainerd, 1980; Burrows, 1987
& 1996; Biber and Finegan, 1992; Craig, 1999a, b, & c; Hoover,
2003a, b, & c; Rudman, 2005). My earlier attempts used
correspondence analysis to accommodate low-frequency
variables (words) in profiling authorial/chronological/cross-register
variations in Dickens and Smollett (Tabata, 2005,
2007a, & c). Given that most collocates of content
words tend to be low in frequency, my methodology based
on correspondence analysis can usefully be applied to a
macroscopic analysis of the collocation of gentleman.
This study uses Smollett’s texts as a control set against which
the Dickens data is compared, in keeping with my earlier
investigations. Dickens and Smollett stand in contrast in the
frequency of gentleman. In the 23 Dickens texts used in this study,
the number of tokens of gentleman amounts to 4,547, whereas
Smollett employs the word 797 times in his seven works. Normalised
per million words, the frequency
in Dickens is 961.2, while the frequency in Smollett is 714.3.
However, if one compares the first seven Dickens texts with
the Smollett set, the discrepancy is even greater: 1792.0 versus
714.3 per million words. The word gentleman is significantly
more frequent in early Dickens than in his later works.
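The normalisation above is simple arithmetic, and a minimal sketch may make it concrete. The function names are illustrative assumptions, and the corpus sizes are not stated in the text: the sketch only back-calculates the sizes implied by the reported counts and per-million rates.

```python
def per_million(tokens: int, corpus_size: float) -> float:
    """Frequency of a word normalised to occurrences per million words."""
    return tokens / corpus_size * 1_000_000


def implied_corpus_size(tokens: int, rate_per_million: float) -> float:
    """Corpus size (in words) implied by a raw count and a per-million rate."""
    return tokens / rate_per_million * 1_000_000


# Raw counts and normalised rates as reported in the text.
dickens_size = implied_corpus_size(4547, 961.2)
smollett_size = implied_corpus_size(797, 714.3)

# Ratio of the two implied corpus sizes (Dickens relative to Smollett).
size_ratio = dickens_size / smollett_size
```

Because the sizes are derived from rounded rates, they are approximations only; they suggest a Dickens set a little over four times the size of the Smollett set.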
Stubbs (2001: 29) states that “[t]here is some consensus, but
no total agreement, that significant collocates are usually found
within a span of 4:4”. Following the conventional practice for
examining collocation (Sinclair, 1991; Stubbs, 1995 & 2001), the
present study treats words occurring within a span of four
words prior to, and four words following, the node (gentleman)
as variables (collocates) to be fed into correspondence analysis.
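The 4:4 span can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the tokenizer, the function names, and the sample sentence are assumptions, and the window here ignores sentence boundaries.

```python
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def span_collocates(tokens: list[str], node: str, span: int = 4) -> Counter:
    """Count words occurring within `span` tokens either side of each node."""
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
        counts.update(left + right)
    return counts


# Invented sample sentence, for illustration only.
sample = "The old gentleman smiled, and the young gentleman bowed to the old lady."
profile = span_collocates(tokenize(sample), "gentleman")
```

Running the function over each text in turn would yield one collocate frequency profile per text. A word that falls inside the windows of two nearby nodes is counted once for each node, which is one of several possible design choices.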
The respective frequency of each collocate is arrayed to form
the frequency profile for 29 texts (Smollett’s The Adventure of
an Atom is left out of the data set since it contains no instance of
gentleman). The set of 29 collocate frequency profiles (collocate
frequency matrix) is then transposed and submitted to
correspondence analysis (CA), a technique for data reduction.
CA allows examination of the complex interrelationships
between row cases (i.e., texts), interrelationships between
column variables (i.e., collocates), and association between
the row cases and column variables graphically in a multi-dimensional
space. It computes the row coordinates (word scores) and
column coordinates (text scores) in a way that
permutes the original data matrix so that the correlation
between the word variables and text profiles is maximized.
In a permuted data matrix, collocates with a similar pattern of
distribution make the closest neighbours, and so do texts of
similar profile. When the row/column scores are projected in
multi-dimensional charts, relative distance between variable
entries indicates affinity, similarity, association, or otherwise
between them.
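For readers unfamiliar with CA, the computation can be sketched with a singular value decomposition. This is a generic textbook formulation under stated assumptions, not the software used in the study: it relies on NumPy, the function name is hypothetical, the toy counts are invented, and the table is assumed to have no all-zero rows or columns.

```python
import numpy as np


def correspondence_analysis(table):
    """Return row scores, column scores and principal inertias for a count table."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()                  # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    # Standardised residuals from the independence model r * c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_scores = (U * sigma) / np.sqrt(r)[:, None]      # principal coordinates
    col_scores = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return row_scores, col_scores, sigma ** 2


# Toy collocate-by-text counts (rows: collocates, columns: texts); invented data.
counts = np.array([
    [12, 9, 1, 0],
    [8, 11, 2, 1],
    [1, 2, 10, 7],
    [0, 1, 6, 9],
    [3, 3, 3, 3],
])
word_scores, text_scores, inertias = correspondence_analysis(counts)

# The first two columns of each score matrix give coordinates for a 2-D map.
word_map, text_map = word_scores[:, :2], text_scores[:, :2]
```

Plotting `text_map` and `word_map` would give a text-map and a word-map of the kind described here, with the share of each dimension given by its inertia divided by the total.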
Figure 1. Correspondence analysis of the
collocates of gentleman: Text-map
Figures 1 and 2 show the results of a correspondence
analysis based on 1,074 collocates of gentleman across 29
texts. The horizontal axis of Figure 1, labelled Dimension
1, is the most powerful axis and visualizes the difference between
the two authors in the distribution of gentleman’s collocates.
It is also interesting that the early Dickensian texts, written in
the 1830s and early 1840s, lie in the bottom half of the
diagram. The same holds true for Smollett’s early texts. Thus, the horizontal
axis can be interpreted as indicating authorial variation in the
collocation of gentleman, whereas the vertical axis can be
interpreted as representing variation over time, although text
entries do not find themselves in an exact chronological order.
Figure 2, on the other hand, is too densely populated to identify
each collocate, except for the outlying collocates, those
with stronger “pulling power”. However, it is possible
to make up for this by inspecting a diagram derived from a
much smaller number of variables (say, 100 variables, as
shown in Figure 4), whose overall configuration
is remarkably similar to Figure 2 despite the decrease in the
number of variables computed.
Figure 2. Correspondence analysis of the collocates
of gentleman: A galaxy of collocates
The Dickens corpus is more than four times the size of the
Smollett corpus, and the numbers of types and tokens
of gentleman’s collocates in Dickens are more than four times
those in Smollett. It is necessary to ensure that
a size factor does not come into play in the outcome of the
analysis. Figures 3 and 4 are derived from the variables of 100
collocates common to both authors. Despite the decrease in
the number of variables from 1,074 to 100, the configuration
of texts and words is remarkably similar to that based on
1,074 items. The Dickens set and the Smollett set, once again,
can be distinguished from each other along Dimension 1.
Moreover, in each of the two authors’ sets, early works tend
to have lower scores, with later works scoring higher, along
Dimension 2. The results of the present analysis are consistent
with my earlier studies based on different variables, such as –ly
adverbs, superlatives, and high-frequency function words.
It would be possible to assume that authorial fingerprints
are as firmly set in the collocation of gentleman as in other
components of vocabulary. These results seem to illustrate that
multivariate analysis of collocates could provide interesting
new perspectives on the study of collocation.
Figure 3. Correspondence analysis of 100 most
common collocates of gentleman: Text-map
Figure 4. Correspondence analysis of 100 most common
collocates of gentleman: Word-map of 100 collocates
References
Adolphs, S. and R. A. Carter (2003) ‘Corpus stylistics: point of
view and semantic prosodies in To The Lighthouse’, Poetica, 58:
Biber, D. and E. Finegan (1992) ‘The Linguistic Evolution of
Five Written and Speech-Based English Genres from the
17th to the 20th Centuries,’ in M. Rissanen (ed.) History
of Englishes: New Methods and Interpretation in Historical
Linguistics. Berlin/New York: Mouton de Gruyter. 668–704.
Brainerd, B. (1980) ‘The Chronology of Shakespeare’s Plays: A
Statistical Study,’ Computers and the Humanities, 14: 221–230.
Burrows, J. F. (1987) Computation into Criticism: A study of
Jane Austen’s novels and an experiment in method. Oxford:
Clarendon Press.
Burrows, J. F. (1996) ‘Tiptoeing into the Infinite: Testing
for Evidence of National Differences in the Language of
English Narrative’, in S. Hockey and N. Ide (eds.) Research in
Humanities Computing 4. Oxford/New York: Oxford UP. 1–33.
Craig, D. H. (1999a) ‘Jonsonian chronology and the styles of
A Tale of a Tub,’ in M. Butler (ed.) Re-Presenting Ben Jonson: Text,
History, Performance. London: Macmillan, 210–232.
Craig, D. H. (1999b) ‘Authorial Attribution and Computational
Stylistics: If You Can Tell Authors Apart, Have You Learned
Anything About Them?’ Literary and Linguistic Computing, 14:
Craig, D. H. (1999c) ‘Contrast and Change in the Idiolects of
Ben Jonson Characters,’ Computers and the Humanities, 33. 3:
Firth, J. R. (1957) ‘Modes of Meaning’, in Papers in Linguistics
1934–51. London: OUP. 191–215.
Hoover, D. L. (2003a) ‘Frequent Collocations and Authorial
Style,’ Literary and Linguistic Computing, 18: 261–286.
Hoover, D. L. (2003b) ‘Multivariate Analysis and the Study of
Style Variation,’ Literary and Linguistic Computing, 18: 341–360.
Hori, M. (2004) Investigating Dickens’ style: A Collocational
Analysis. New York: Palgrave Macmillan.
Hunston, S. and G. Francis (1999) Pattern Grammar: A Corpus-
Driven Approach to the Lexical Grammar of English. Amsterdam:
John Benjamins.
Kjellmer, G. (1994) A Dictionary of English Collocations: Based on
the Brown Corpus (3 vols.). Oxford: Clarendon Press.
Louw, W. (1993) ‘Irony in the text or insincerity in the writer?
The diagnostic potential of semantic prosodies’, reprinted in
G. Sampson and D. McCarthy (eds.) (2004) Corpus Linguistics:
Readings in a Widening Discipline. London and New York:
Continuum. 229–241.
Partington, A. (2006) The Linguistics of Laughter: A Corpus-
Assisted Study of Laughter-Talk. London/New York: Routledge.
Rudman, J. (2005) ‘The Non-Traditional Case for the
Authorship of the Twelve Disputed “Federalist” Papers:
A Monument Built on Sand?’, ACH/ALLC 2005 Conference
Abstracts, Humanities Computing and Media Centre,
University of Victoria, Canada, 193–196.
Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: OUP.
Stubbs, M. (1995) ‘Corpus evidence for norms of lexical
collocation’, in G. Cook and B. Seidlhofer (eds.) Principle
and Practice in Applied Linguistics: Studies in Honour of H. G.
Widdowson. Oxford: OUP. 243–256.
Stubbs, M. (2001) Words and Phrases: Corpus studies of lexical
semantics. Oxford: Blackwell.
Tabata, T. (2004) ‘Differentiation of Idiolects in Fictional
Discourse: A Stylo-Statistical Approach to Dickens’s Artistry’,
in R. Hiltunen and S. Watanabe (eds.) Approaches to Style and
Discourse in English. Osaka: Osaka University Press. 79–106.
Tabata, T. (2005) ‘Profiling stylistic variations in Dickens and
Smollett through correspondence analysis of low frequency
words’, ACH/ALLC 2005 Conference Abstracts, Humanities
Computing and Media Centre, University of Victoria, Canada,
Tabata, T. (2007a) ‘A Statistical Study of Superlatives in
Dickens and Smollett: A Case Study in Corpus Stylistics’,
Digital Humanities 2007 Conference Abstracts, The 19th Joint
International Conference of the Association for Computers and
the Humanities and the Association for Literary and Linguistic
Computing, University of Illinois, Urbana-Champaign, June 4–June
8, 2007, 210–214.
Tabata, T. (2007b) ‘The Cunningest, Rummest, Superlativest Old
Fox: A multivariate approach to superlatives in Dickens and
Smollett ’, PALA 2007—Style and Communication—Conference
Abstracts, The 2007 Annual Conference of the Poetics and
Linguistics Association, Kansai Gaidai University, Hirakata,
Osaka, Japan, 31 July–4 August 2007, 60–61.


Conference Info


ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

Series: ADHO (3)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None