Profiling Stylistic Variations in Dickens and Smollett through Correspondence Analysis of Low Frequency Words

  Tomoji Tabata

    University of Osaka

The aim of this paper is to present the result of a
corpus-driven, quantitative analysis of the style of
Dickens in comparison with the style of Smollett. The particular
problem discussed is the differing distribution of -ly adverbs
in the texts written by the two authors. By applying a
multivariate stylo-statistics model, this study illustrates how
sharply the two authors differ in their uses of adverbs as well
as how texts are differentiated according to genre and
chronology within authorial groups.
On the relationship between linguistic registers and adverbs,
Biber et al. (1999, 541) present interesting findings from a
large-scale corpus:
"It is interesting to note that, overall, fiction ... uses many different
descriptive -ly adverbs, although few of these are notably common
(occurring over 50 times per million words). Rather, fiction shows
great diversity in its use of -ly adverbs. In describing fictional
events and the actions of fictional characters, writers often use
adverbs with specific descriptive meanings."
In fact, -ly adverbs found in Dickens are quite diverse. In the
23 texts used in this study, the number of types amounts to 1,728;
Smollett employed 634 types. Among those, a few types are
highly frequent, such as really and certainly, occurring more
than one thousand times. Conversely, a large number of adverbs
occur only once. Such hapax legomena include a few types
which sound very much Dickensian, such as evil-adverbiously,
patientissamentally, and Shakespearianly. Although tokens of
-ly adverbs account for only a little more than 1% of the
total word-tokens in the texts, the findings by Biber et al.
suggest that -ly adverbs deserve special attention in stylistic
study of fiction.
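The notions of token, type, and hapax legomenon used above can be made concrete with a minimal sketch. The function name `adverb_profile` and the toy input are hypothetical, chosen only for illustration; the figures reported in this paper were not produced with this code.

```python
from collections import Counter

def adverb_profile(adverb_tokens):
    """Summarise a list of -ly adverb tokens.

    Returns the number of tokens (running words), the number of
    types (distinct word forms, case-folded), and the hapax
    legomena (types that occur exactly once).
    """
    counts = Counter(tok.lower() for tok in adverb_tokens)
    hapaxes = sorted(w for w, n in counts.items() if n == 1)
    return {
        "tokens": sum(counts.values()),
        "types": len(counts),
        "hapax_legomena": hapaxes,
    }

# Toy input, invented for illustration (not corpus data).
sample = ["really", "certainly", "Really", "Shakespearianly", "really"]
profile = adverb_profile(sample)
```

Case-folding before counting is an assumption of the sketch; whether sentence-initial and mid-sentence forms should be merged depends on the research question.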
This study deals with a corpus of texts comprising Dickens'
and Smollett's major works. Dickens' set includes fifteen 'serial
fictions', six 'sketches', one 'miscellany', and one 'history'.
Smollett's contains six 'fictions' and one 'sketch'. The total
word-tokens in the corpus amount to 5.8 million, with the
Dickens component containing 4.7 million tokens and the
Smollett component totalling 1.1 million word-tokens. The
present project was initiated as a study based on a
comprehensive collection, not a sample corpus, of texts by the
targeted authors. Therefore, the imbalance in the number of
texts as well as tokens is inevitable. However, due attention
is paid in the choice of variables to minimize any potential
effect of the difference in size between the two sets. All
the texts in the corpus have been annotated with POS tags,
using Eric Brill's Rule-Based Tagger (also known as the Brill
Tagger). Manual post-editing has been conducted to eliminate
a number of ill-assigned tags.
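Once a text is tagged, -ly adverbs can be retrieved by combining the tag with the word ending. The sketch below is illustrative only and rests on two assumptions not stated in this paper: that the tagged text is stored as whitespace-separated `word/TAG` items, and that general adverbs carry the Penn Treebank tag `RB`. The function name `count_ly_adverbs` is hypothetical.

```python
from collections import Counter

def count_ly_adverbs(tagged_text):
    """Count -ly adverbs in text tagged as whitespace-separated word/TAG items.

    Assumes the Penn Treebank tag RB marks adverbs. Filtering on the
    tag keeps out non-adverbs that merely end in -ly, such as 'family'.
    """
    counts = Counter()
    for item in tagged_text.split():
        # Split on the last slash so that words containing '/' survive.
        word, sep, tag = item.rpartition("/")
        if not sep:
            continue  # skip items without a tag
        word = word.lower()
        if tag == "RB" and word.endswith("ly"):
            counts[word] += 1
    return counts

# Toy sentence, invented for illustration.
tagged = ("He/PRP spoke/VBD slowly/RB and/CC really/RB loved/VBD "
          "his/PRP$ family/NN ;/: slowly/RB he/PRP left/VBD")
counts = count_ly_adverbs(tagged)
```

Running the function over each text in turn yields one row of a texts-by-adverbs frequency table of the kind analysed later in the paper.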
In an early successful attempt at a computational description
of literary style, Milic compared the style of Jonathan Swift
with the writings of his contemporaries, with special reference
to the relative frequencies of word-classes in the texts and to
grammatical features such as seriation and connection. Cluett
(1971 & 1976) adopted a similar approach to conduct a
diachronic study of prose style across 4 centuries: from the 16th
to the 20th centuries. Brainerd's works (1979 & 1980) are
ambitious attempts to apply discriminant analysis to the question
of genre and chronology in Shakespeare plays. Takefuta's
approach to text typology, or register variation, is among the
first to successfully apply factor/cluster analysis to the lexical
differences between registers. His pioneering work, however,
is not widely acknowledged because it was written in Japanese.
Since Burrows (1987) and Biber (1988), it has become popular
practice to employ multivariate techniques in quantitative
studies of texts. Biber carried out factor analysis (FA) on 67
linguistic features to identify co-occurring linguistic features
that account for dimensions of register variation. A series of
research projects based on Biber's
Multi-Feature/Multi-Dimensional approach have been
successful in elucidating many interesting aspects of linguistic
variation, such as language acquisition, ESP, diachronic change
of prose style, and differences between conversational styles
in British and American English, to give a handful of examples
(Biber & Finegan; Conrad & Biber eds.).
The Biber model is one of the most sophisticated approaches by
far. Yet it is not without its critics. Nakamura (1995) raises a
major objection. He argues that Biber's variables are "quite
arbitrarily selected with no definite criterion and mixed levels"
(1995, 77-86). Further, Sigley (1997) notes that almost half of
Biber's 67 linguistic features are too rare in texts of 2,000 words.
Burrows (1987), on the other hand, applied a Principal
Component Analysis (PCA) to the thirty most common words
in the language of Jane Austen. The method demonstrates that
differing frequency patterns in these very common words show
significant differentiations among Austen's characters, and that
the statistical analysis of literary style may lead not only to a
deeper understanding of the novel itself but may also contribute
to our deeper appreciation of it. In this use of a PCA, the
frequencies of common words are used as variables. The
Burrows method seems to have higher replicability and
feasibility; since it focuses on common words, most of the
variables are frequent enough to produce stable statistical
results. In addition, it does not require a multi-layered tagging
scheme optimised for Biber's MF/MD approach.
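The general idea of a PCA on common-word frequencies can be sketched as follows. This is not Burrows's own procedure or code: the function name `pca_scores`, the three word columns, and the rates are invented for illustration, and standardising each column (equivalent to a PCA on the correlation matrix) is an assumption of the sketch.

```python
import numpy as np

def pca_scores(freqs, n_components=2):
    """PCA of a segments-by-words table of relative frequencies.

    Each word column is standardised, so the analysis is equivalent
    to a PCA on the correlation matrix of the word variables.
    """
    X = np.asarray(freqs, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # text positions
    loadings = Vt[:n_components].T                    # word weights
    explained = (s ** 2) / np.sum(s ** 2)             # variance shares
    return scores, loadings, explained[:n_components]

# Toy rates per 1,000 words for three common words in five
# text segments (invented values, not drawn from any corpus).
rates = [[38.2, 21.5, 30.1],
         [36.9, 22.8, 29.4],
         [41.0, 18.7, 33.6],
         [40.3, 19.9, 32.8],
         [35.5, 24.1, 28.0]]
scores, loadings, explained = pca_scores(rates)
```

Plotting the first two columns of `scores` gives the kind of chart in which segments with similar common-word habits fall close together.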
A particular strength of the Burrows methodology is in testing
cases of disputed authorship and national differences in the
English first-person retrospective narrative, known as 'history'.
Among the most successful applications are Burrows (1989,
1992 & 1996) and Craig (1999a, b, & c). The Burrows approach
or similar methodology has been applied to Bible stylometry.
Some scholars like Linmans, Merriam, and Mealand use
Correspondence Analysis (CA) instead of PCA. In the context
of text typology, Nakamura (1993) applied CA to the frequency
distribution of personal pronouns to visualize association
between personal pronouns and 15 text categories in the LOB corpus.
My earlier work (Tabata) also used CA to analyse the
distribution patterns of Part-of-Speech in Dickens's 23 texts
and identified a contrast between serial fiction and sketches.
The present study is different from the Burrows model in that
it extends the range of variables to include low-frequency
words, or rare words, by applying CA in the analysis of -ly
adverbs. CA is one of the techniques for data reduction,
alongside PCA and FA. Unlike PCA and FA, however, CA
does not require the intervening step of calculating a correlation
matrix or covariance matrix, and can therefore process the data
directly to obtain a solution. CA allows examination of the
complex interrelationships between row cases (i.e., texts),
interrelationships between column variables (i.e., adverbs), and
association between the row cases and column variables
graphically in a multi-dimensional space. It computes the row
coordinates (text scores) and column coordinates (word scores)
in a way that permutes the original data matrix so that the
correlation between the word variables and text profiles is
maximized. In a permuted data matrix, adverbs with a similar
pattern of distribution make the closest neighbours, and so do
texts of similar profile. When the row/column scores are
projected in multi-dimensional charts like Figures 1 to 4,
relative distance between variable entries indicates affinity,
similarity, association, or otherwise between them. One
advantage CA has over PCA and FA is that PCA and FA cannot
be computed on a rectangular matrix where the number of
columns exceeds the number of rows, a concern of the present
study. Yet CA can handle such a data table with, for
example, the row cases consisting of thirty texts and the column
variables consisting of hundreds of adverbs.
Figures 1-4 summarise the results of applying a CA model in
the frequency analysis of -ly adverbs in texts. Figures 1 and 2,
based on 1,278 -ly adverbs which occur in more than one text,
clearly differentiate between the Dickens and Smollett sets.
The pattern along the horizontal axis allows quite
straightforward interpretation. A more sceptical mind, however,
might attribute it to the imbalance in the number of texts
between the authorial sets as well as in the number of types of
adverbs, with the Dickens corpus at roughly 4 times the size of the
Smollett corpus. One might respond to such
scepticism with Figures 3 and 4, which are based on the 99 most
common -ly adverbs used by both Dickens and Smollett.
Despite the decrease in the number of variables from 1,278 to
99, the configuration of Figure 3 is remarkably similar to that
of Figure 1. Of further interest is that, in each of the two
authors’ sets, earlier works tend to be found towards the bottom
of the chart with later works in the upper half of the diagram.
Additionally, in the Dickensian territory of Figures 1 and 3,
serial fiction texts occupy the right end while other genres, such
as sketches and history, are located slightly towards the left.
The series of results seems to illustrate how the authorial
difference, text genre, and chronology are reflected in the
frequency pattern of -ly adverbs in the texts written by Dickens
and Smollett. This pilot study might suggest the effectiveness
of the stylo-statistical approach based on correspondence
analysis of lower frequency words in texts.
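The correspondence analysis described earlier can be illustrated with a minimal sketch. It uses one common formulation, a singular value decomposition of the standardised residuals of the frequency table; the paper does not state which software or variant produced Figures 1 to 4, so this should not be read as the procedure behind them. The function name, text labels, and counts are all invented for illustration.

```python
import numpy as np

def correspondence_analysis(table, n_axes=2):
    """Simple CA of a texts-by-adverbs count table via SVD.

    Returns principal coordinates for rows (texts) and columns
    (adverbs), the inertia of each retained axis, and total inertia.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses (texts)
    c = P.sum(axis=0)                        # column masses (adverbs)
    expected = np.outer(r, c)                # independence model
    S = (P - expected) / np.sqrt(expected)   # standardised residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = U[:, :n_axes] * s[:n_axes] / np.sqrt(r)[:, None]
    col_coords = Vt[:n_axes].T * s[:n_axes] / np.sqrt(c)[:, None]
    inertia = s ** 2
    return row_coords, col_coords, inertia[:n_axes], inertia.sum()

# Toy table: 3 texts (rows) by 5 adverbs (columns), invented counts.
texts = ["Dickens A", "Dickens B", "Smollett A"]   # hypothetical labels
counts = np.array([[12, 5, 0, 3, 1],
                   [10, 6, 1, 2, 0],
                   [1, 0, 9, 7, 0]])

# Keep only adverbs that occur in more than one text.
in_several_texts = (counts > 0).sum(axis=0) > 1
filtered = counts[:, in_several_texts]

row_coords, col_coords, axis_inertia, total_inertia = (
    correspondence_analysis(filtered))
```

The table has more columns than rows, the situation the paper highlights as a reason for choosing CA. Plotting `row_coords` and `col_coords` on the same axes gives a joint map of texts and adverbs; in this toy table the two hypothetical Dickens rows fall on one side of the first axis and the Smollett row on the other.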
