A Statistical Study of Superlatives in Dickens and Smollett: A Case Study in Corpus Stylistics

Tomoji Tabata

Authorship

1. Tomoji Tabata

University of Osaka

Work text

1. Introduction
This study gives a quantitative overview of the use of
superlatives in Dickens in comparison with Smollett. The
focus is laid on the differing distribution of superlatives in the
texts written by the two authors. By applying correspondence
analysis, this study tries to illustrate how sharply the two authors
differ in their uses of superlatives as well as how texts are
clustered according to chronology within authorial sets.
Although a number of studies on Dickens’ style have noted a
tendency for overstatement in his fiction (Brook 1970; Sørensen
1985; Golding 1985; Hori 2004, etc.), surprisingly little
attention has been paid to superlatives as a whole. Outside
Dickens studies, however, Biber et al. (1999) give an
interesting account of superlatives in four linguistic registers:
conversation, fiction, news, and academic prose. According to
Biber et al., -est superlative adjectives are most frequent in
news reportage (c. 1400 times per million words) while “the
comparatively low frequency of superlatives in academic
writing (c. 800 per million) reflects a general reluctance to make
extreme claims” (Biber et al., 521), with fiction showing even
lower frequency for the word class (c. 700 per million).
Dickens and Smollett stand in contrast in the frequency of
superlative forms. In Dickens’ 23 texts used in this study, the
number of tokens for superlatives amounts to 4,960, whereas
Smollett employs them 634 times in his seven works. Normalised
to a rate per million words, the frequency in Dickens is nearly
twice as high as in Smollett: 1,049 versus 568.
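The normalisation above can be sketched in a few lines of Python. This is only an illustration: the function name `per_million` is hypothetical, and the corpus sizes are the rounded figures given in this abstract (4.7 and 1.1 million tokens), so the rates come out close to, but not identical with, the reported 1,049 and 568, which were presumably computed from exact word counts.

```python
def per_million(token_count: int, corpus_size: int) -> float:
    """Return a raw count normalised to a rate per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return token_count / corpus_size * 1_000_000


# Superlative token counts are taken from the text; the corpus sizes are the
# ROUNDED figures quoted in the abstract, not exact word counts (assumption).
dickens_rate = per_million(4_960, 4_700_000)
smollett_rate = per_million(634, 1_100_000)
ratio = dickens_rate / smollett_rate

print(f"Dickens:  {dickens_rate:.0f} per million")
print(f"Smollett: {smollett_rate:.0f} per million")
print(f"Ratio:    {ratio:.2f}")
```

Normalising in this way is what makes the two sets comparable despite the roughly fourfold difference in corpus size.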
Table 2
With regard to the number of types, 423 different superlative
forms are found in total. Among those, a few types, such as
most, best, and least, are highly frequent, occurring more than
one thousand times. Conversely, more than one third of all
types occur only once. Such hapax legomena include unique
words such as superlativest and unfortunatest.
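The distinction between tokens, types, and hapax legomena can be illustrated with a minimal sketch. The function `type_profile` and the toy token list are invented for illustration and are not the study's data; the sketch assumes the superlative forms have already been extracted from the texts.

```python
from collections import Counter


def type_profile(tokens):
    """Summarise a list of word tokens: token count, type count, hapaxes."""
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {
        "tokens": sum(counts.values()),   # running words
        "types": len(counts),             # distinct forms
        "hapaxes": hapaxes,               # forms occurring exactly once
        "hapax_share": len(hapaxes) / len(counts) if counts else 0.0,
    }


# Toy input (hypothetical): eight superlative tokens, five distinct types.
toy_tokens = ["most", "best", "most", "least", "best", "most",
              "superlativest", "unfortunatest"]
profile = type_profile(toy_tokens)
print(profile)
```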
This study deals with a corpus of texts comprising Dickens’
and Smollett’s major works. Dickens’ set includes fifteen "serial
fictions", six "sketches", one "miscellany", and one "history".
Smollett’s contains six "fictions" and one "sketch". The total
word-tokens in the corpus amount to 5.8 million, with the
Dickens component containing 4.7 million tokens and the
Smollett component totalling 1.1 million word-tokens. The
present project was initiated as a study based on a
comprehensive collection, not a sample corpus, of texts by the
targeted authors. Therefore, the imbalance in the number of
texts as well as tokens is inevitable. However, due attention
will be paid to the choice of variables to minimize any potential
effect of the difference in size between the two sets.
2. Quantitative approaches to style/register variation
Milic (1967) is among the earliest successful specimens
of a quantitative description of style. He compared the
style of Jonathan Swift with the writings of his contemporaries,
with special reference to the relative frequencies of word-classes
in the texts and to grammatical features such as seriation and
connection. Cluett (1971 & 1976) adopted a similar approach
to conduct a diachronic study of prose style across four
centuries, from the 16th to the 20th. Brainerd’s works (1979 &
1980) are ambitious attempts to apply discriminant analysis to
the question of genre and chronology in Shakespeare’s plays.
Takefuta’s (1981) approach to text typology, or register
variation, is among the first to successfully apply factor/cluster
analysis to the lexical differences between registers. His
pioneering work, however, is not widely acknowledged because
it was written in Japanese.
Since Burrows (1987) and Biber (1988), it has become popular
practice to employ multivariate techniques in quantitative
studies of texts. Biber carried out factor analysis (FA) on 67
linguistic features to identify co-occurring linguistic features
that account for dimensions of register variation. A series of
studies based on Biber’s Multi-Feature/Multi-Dimensional
approach has been successful in elucidating many interesting
aspects of linguistic variation, such as diachronic change of
prose style, variation within a single author, and differences
between conversational styles in British and American English,
to give a few examples (Biber & Finegan 1992;
Opas-Hänninen 1996; Watson 1997; Conrad & Biber, eds.
2001).
The Biber model is one of the most sophisticated approaches
by far. Yet it is not without its critics. Nakamura (1995) raises
a major objection. He argues that Biber’s variables are “quite
arbitrarily selected with no definite criterion and mixed levels”
(1995: 77-86). Further, Sigley (1997) notes that almost half of
Biber’s 67 linguistic features are too rare in texts of 2,000
words.
Burrows (1987), on the other hand, applied a Principal
Component Analysis (PCA) to the thirty most common words
in the language of Jane Austen. The method demonstrates that
differing frequency patterns in these very common words show
significant differentiations among Austen’s characters, and that
the statistical analysis of literary style may lead not only to a
deeper understanding of the novels themselves but also to a
deeper appreciation of them. In this use of a PCA, the frequencies of common words are used as variables. The
Burrows method seems to have higher replicability and
feasibility; since it focuses on common words, most of the
variables are frequent enough to produce stable statistical
results. In addition, it does not require a multi-layered tagging
scheme optimised for Biber’s MF/MD approach.
A particular strength of the Burrows methodology is in testing
cases of disputed authorship and national differences in the
English first-person retrospective narrative, known as ‘history’.
Among the most successful applications are Burrows (1989,
1992 & 1996) and Craig (1999a, b, & c). The Burrows approach
or a similar methodology has also been applied to Bible stylometry.
Some scholars like Linmans (1998), Merriam (1998), and
Mealand (1999) use Correspondence Analysis (CA) instead of
PCA. In the context of text typology, Nakamura (1993) applied
CA to the frequency distribution of personal pronouns to
visualize association between personal pronouns and 15 text
categories in the LOB corpus.
3. Methodology
The present study is different from the Biber and Burrows
models in that it extends the range of variables to include
low-frequency words, or rare words, by applying CA in the
analysis of superlatives. CA is one of the techniques for
data-reduction alongside PCA and FA. CA allows examination
of the complex interrelationships between row cases (i.e., texts),
interrelationships between column variables (i.e., words), and
association between the row cases and column variables
graphically in a multi-dimensional space. It computes the row
coordinates (text scores) and column coordinates (word scores)
in a way that permutes the original data matrix so that the
correlation between the word variables and text profiles is
maximized. In a permuted data matrix, words with a similar
pattern of distribution make the closest neighbours, and so do
texts of similar profile. When the row/column scores are
projected in multi-dimensional charts, relative distance between
variable entries indicates affinity, similarity, association, or
otherwise between them. One advantage CA has over PCA and
FA is that PCA and FA cannot be computed on a rectangular
matrix where the number of columns exceeds the number of
rows, a concern of the present study. CA, by contrast, can handle
such a data table, with, for example, the row cases consisting
of thirty texts and the column variables consisting of hundreds
of words.
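As a rough illustration of the technique described above, the following sketch computes a simple correspondence analysis with NumPy, using the standard formulation via a singular value decomposition of the standardised residuals. It is not the software used in this study: the function name `correspondence_analysis` and the toy matrix are assumptions made for illustration, and rows are taken to be texts and columns to be words, as in the description above.

```python
import numpy as np


def correspondence_analysis(counts, n_dims=2):
    """Simple CA of a texts-by-words count matrix.

    Returns principal coordinates for rows (texts) and columns (words),
    plus the share of total inertia carried by each retained dimension.
    """
    N = np.asarray(counts, dtype=float)
    if (N.sum(axis=0) == 0).any() or (N.sum(axis=1) == 0).any():
        raise ValueError("every row and column needs at least one count")
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (texts)
    c = P.sum(axis=0)                    # column masses (words)
    expected = np.outer(r, c)            # values expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardised residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]      # text scores
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]   # word scores
    inertia = sv ** 2
    return (row_coords[:, :n_dims], col_coords[:, :n_dims],
            inertia[:n_dims] / inertia.sum())


# Toy matrix (hypothetical): 3 texts x 5 word types, i.e. more columns than rows.
toy = np.array([
    [30, 12, 0, 4, 1],
    [28, 10, 1, 5, 0],
    [5, 2, 9, 0, 6],
])
text_scores, word_scores, explained = correspondence_analysis(toy)
print(text_scores)
print(explained)
```

In the toy matrix the first two texts have similar profiles and the third differs, so the first two should lie closer to each other than to the third in the resulting text-map. The matrix also has more columns than rows, the situation the paragraph above singles out.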
Table 3: Frequency matrix for 242 types of superlatives across 30 texts: raw
frequency scores
4. Results
Figures 1 and 2 show the results of CA based on 242
superlative forms across 30 texts. The solution given as
Dimension 1, the most powerful axis, allows quite
straightforward interpretation: the horizontal axis of Figure 1
distinguishes between the Dickens and the Smollett sets. It is
also interesting that the early Dickensian texts, such as Sketches
by Boz, Pickwick Papers, and Nicholas Nickleby, are among
the closest to Smollett’s texts along the horizontal axis.
Figure 1: Correspondence Analysis of superlatives in Dickens & Smollett based
on 242 types that appear in two or more texts: Text-map showing
interrelationships between 30 texts
Figure 2: Correspondence Analysis of superlatives in Dickens & Smollett based
on 242 types that appear in two or more texts: Word-map showing
interrelationships between 242 superlatives
The Dickens corpus is more than four times the size of the
Smollett corpus, and the number of types used by Dickens is
nearly four times that used by Smollett (see Table
2). It is necessary to ensure that a size factor does not come
into play in the outcome of analysis. Figures 3 and 4 are derived
from 105 superlatives common to both authors. Despite the
decrease in the number of variables from 242 to 105, the
configuration of texts and words is remarkably similar to that
based on 242 items. Of further interest is that, in each of the
two authors’ sets, early works tend to have lower scores with
later works scoring higher along Dimension 2.
These results seem to illustrate how authorial difference and
chronology are reflected in the frequency pattern of superlatives
in the texts written by Dickens and Smollett. The study thus
suggests the effectiveness of a stylo-statistical approach based
on correspondence analysis of texts.
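The two variable sets used in this section (types that appear in two or more texts, and types common to both authors) can be derived from a frequency matrix along the following lines. This is a sketch under assumptions: the function `select_types`, its parameters, and the toy matrix are hypothetical and do not reproduce the procedure or data actually used.

```python
import numpy as np


def select_types(counts, author_of_text, min_texts=2, shared_only=False):
    """Return column indices of word types to keep as CA variables.

    counts          texts-by-types matrix of raw frequencies
    author_of_text  one author label per row of `counts`
    min_texts       keep types attested in at least this many texts
    shared_only     additionally require attestation in every author's set
    """
    N = np.asarray(counts)
    authors = np.asarray(author_of_text)
    keep = (N > 0).sum(axis=0) >= min_texts
    if shared_only:
        for author in np.unique(authors):
            keep &= N[authors == author].sum(axis=0) > 0
    return np.flatnonzero(keep)


# Toy matrix (hypothetical): 4 texts x 5 superlative types.
toy_counts = np.array([
    [5, 1, 0, 2, 0],   # Dickens text
    [4, 0, 0, 3, 1],   # Dickens text
    [3, 0, 1, 0, 0],   # Smollett text
    [2, 0, 1, 0, 0],   # Smollett text
])
toy_authors = ["Dickens", "Dickens", "Smollett", "Smollett"]

in_two_or_more = select_types(toy_counts, toy_authors)
common_to_both = select_types(toy_counts, toy_authors, shared_only=True)
print(in_two_or_more, common_to_both)
```

Restricting the analysis to shared types is one way to check that a map is not driven simply by the larger vocabulary of the larger corpus.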
Figure 3: Correspondence Analysis of superlatives in Dickens & Smollett based
on the 105 types common to both authors: Text-map showing interrelationships
between 30 texts
Figure 4: Correspondence Analysis of superlatives in Dickens & Smollett based
on the 105 types common to both authors: Word-map showing interrelationships
among 105 superlatives
Bibliography
Biber, Douglas. Variation Across Speech and Writing.
Cambridge: Cambridge University Press, 1988.
Biber, Douglas, and Edward Finegan. "The Linguistic Evolution
of Five Written and Speech-Based English Genres from the
17th to the 20th Centuries." History of Englishes: New Methods
and Interpretation in Historical Linguistics. Ed. Matti Rissanen.
Berlin: Mouton de Gruyter, 1992. 668–704.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad,
and Edward Finegan. Longman Grammar of Spoken and Written
English. Harlow: Pearson Education Ltd, 1999.
Brainerd, Barron. "The Chronology of Shakespeare’s Plays: A
Statistical Study." Computers and the Humanities 14.4 (1980):
221-230.
Brainerd, Barron. "Pronouns and Genre in Shakespeare’s
Drama." Computers and the Humanities 13.1 (1979): 3-16.
Brook, G. L. The Language of Dickens. London: Andre
Deutsch, 1970.
Burrows, John F. Computation into Criticism: A Study of Jane
Austen’s Novels and an Experiment in Method. Oxford:
Clarendon Press, 1987.
Burrows, John F. "“A Vision” as a Revision?"
Eighteenth-Century Studies 22.4 (1989).
Burrows, John F. "Computers and the Study of Literature."
Computers and Written Texts. Ed. Christopher S. Butler.
Oxford: Blackwell, 1992. 167-204.
Burrows, John F. "Tiptoeing into the Infinite: Testing for
Evidence of National Differences in the Language of English
Narrative." Ed. Susan Hockey and Nancy Ide. Research in
Humanities Computing 4: Selected Papers from the ALLC/ACH
Conference, Christ Church, Oxford, April 1992. Oxford: Oxford
University Press, 1996. 1-33.
Cluett, Robert. "Style, Precept, Personality: A Test Case
(Thomas Sprat, 1635-1713)." Computers and the Humanities
5.5 (1971).
Cluett, Robert. Prose Style and Critical Reading. New York:
Teachers College Press, 1976.
Conrad, Susan, and Douglas Biber, eds. Variation in English:
Multi-Dimensional Studies. Harlow: Pearson Education Ltd,
2001.
Craig, D. H. "Authorial Attribution and Computational
Stylistics: If You Can Tell Authors Apart, Have You Learned
Anything About Them?" Literary & Linguistic Computing 14.1
(1999b): 103-13.
Craig, D. H. "Contrast and Change in the Idiolects of Ben
Jonson Characters." Computers and the Humanities 33.3
(1999c): 221-40.
Craig, D. H. "Jonsonian Chronology and the Styles of A Tale
of a Tub." Re-Presenting Ben Jonson: Text, History,
Performance. Ed. Martin Butler. London: Macmillan, 1999a.
210-32.
Hori, Masahiro. Investigating Dickens’ Style: A Collocational
Analysis. New York: Palgrave Macmillan, 2004.
Hori, Masahiro. "Collocational Patterns of –ly Manner Adverbs
in Dickens." English Corpus Linguistics in Japan 15 (2002):
149–163.
Hori, Masahiro. "Collocational Patterns of Intensive Adverbs
in Dickens: A Tentative Approach." English Corpus Studies 6
(1999): 51–65.
Linmans, A. J. M. "Correspondence Analysis of the Synoptic
Gospels." Literary & Linguistic Computing 13.1 (1998): 1-13.
Mealand, D. L. "Style, Genre, and Authorship in Acts, the
Septuagint, and Hellenistic Historians." Literary & Linguistic
Computing 14.4 (1999): 479–505.
Merriam, Thomas. "Heterogeneous Authorship in Early
Shakespeare and the Problem of Henry V." Literary &
Linguistic Computing 13.1 (1998): 15–28.
Milic, Louis Tonko. A Quantitative Approach to the Style of
Jonathan Swift. The Hague: Mouton, 1967.
Nakamura, Junsaku. "Text Typology and Corpus: A Critical
Review of Biber’s Methodology." English Corpus Studies 2
(1995): 75-90.
Nakamura, Junsaku. "Statistical Methods and Large Corpora:
A New Tool for Describing Text Types." Text and Technology:
In Honour of John Sinclair. Ed. Mona Baker, Gill Francis and
Elena Tognini-Bonelli. Amsterdam: John Benjamins, 1993.
293–312.
Opas-Hänninen, Lisa Lena. "A Multi-Dimensional Analysis
of Style in Samuel Beckett’s Prose Works." Research in
Humanities Computing 4: Selected Papers from the ALLC/ACH
Conference, Christ Church, Oxford, April 1992. Ed. Susan
Hockey and Nancy Ide. Oxford: Oxford University Press, 1996.
81-114.
Sigley, Robert. "Text Categories and Where You Can Stick
Them: A Crude Formality Index." International Journal of
Corpus Linguistics 2.2 (1997): 199–237.
Sørensen, Knud. Charles Dickens: Linguistic Innovator. Aarhus:
Aarhus Universitet, 1985.
Takefuta, Y. Kompyuta no mita gendai eigo: bokyaburari no
kagaku [The Computer Analysis of the Contemporary English
Language: A Quantitative Study of Vocabulary]. Tokyo: Educa,
1981.
Watson, Greg. Doin' Mudrooroo: Elements of Style and
Involvement in the Early Prose Fiction of Mudrooroo.
Publications in the Humanities, No 19. Joensuu, FI: University
of Joensuu, 1997.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Conference website: http://www.digitalhumanities.org/dh2007/

References: http://web.archive.org/web/20070810143343/http://digitalhumanities.org/dh2007/DH2007.detail.html http://web.archive.org/web/20080703194728/http://www.digitalhumanities.org/dh2007/abstracts/titles.xq

Series: ADHO (2)

Organizers: ADHO
