Stylometric Analysis Using Discriminant Analysis: A Study of Sherlock Holmes Stories

paper
Authorship
  1. Peter Smith

    City University London

Peter Smith
The City University London
peters@soi.city.ac.uk

ALLC/ACH 2002, University of Tübingen, Tübingen, 2002

Editor: Harald Fuchs
Encoder: Sara A. Schmidt

Introduction
Stylometric analysis may be defined as the quantitative analysis of the
recurrence of particular features of style for the purpose of deducing
authorship and/or chronology of texts. Methods of stylometric analysis may be
broadly subdivided into lexical and non-lexical approaches (Binongo
2000). The approach taken in this study is to apply discriminant analysis to
the frequencies of function words. The use of function words has
often been criticised by stylistic experts as lacking in scientific
validity; see, for example, Waith (1984). An ultimate aim of this work is to
establish scientific foundations for the use of function words in
stylometric analysis. One promising line of enquiry is the study of
aphasic patients, where it has been suggested that function words may be
processed by the brain separately from lexical words (Garrett 1982). If so,
we may have less conscious control over our use of them.
Three plausible variables for stylometric analysis to consider are:
chronology, genre and author. A writer's stylistic tendencies cannot be
assumed to remain constant over an entire writing career, hence when a work
was written is just as important as who wrote it. Similarly, it is
unreasonable to assume that style will be the same across different genres.
One aim of this study is to fix two of the three variables as closely as
possible while studying the effect of the third.

The Basis for this Study
The primary aim of this study is to examine the effectiveness of
discriminant analysis using function words as a means of distinguishing
authors while examining the effect of genre and time. The Sherlock Holmes
stories of Arthur Conan Doyle were chosen as subjects for study because they
were freely available in machine-readable form and there were several
comparable texts written at the same time in the same basic genre. Arthur
Conan Doyle was also a prolific writer who wrote different types of
fiction very close together in time.
The Sherlock Holmes stories were originally serialised in The Strand
Magazine. However, when "The Hound of the Baskervilles" appeared it was
plagued with controversy. Arthur Conan Doyle had insisted that his name
appear jointly with that of a Mr. Fletcher Robinson. It has been suggested
that Fletcher Robinson had written part of the story, in which case
stylometric analysis might be able to throw light on this mystery. However,
if Conan Doyle was just given an idea for a story then there is not much
that can be discovered by this technique.
Nine texts were chosen for analysis, which can be divided into three equal
groups:
  1. Sherlock Holmes stories, including The Hound of the Baskervilles, as
     close in time as possible to it.
  2. Other works written by Conan Doyle at the same time as the three
     Sherlock Holmes stories.
  3. Other works by different writers in the same or similar genre
     (including some published in The Strand Magazine).

The sets of Sherlock Holmes stories that were written closest in time were
chosen as comparand texts (note that all stories were serialised and
published monthly). Thus the three Sherlock Holmes texts were: The Memoirs
of Sherlock Holmes (1892/3), The Hound of the Baskervilles (1902) and The
Return of Sherlock Holmes (1904) (Conan Doyle 1986). The three comparand
texts written by Conan Doyle around the same time were: The Parasite (1894),
The Adventures of Gerard (1903) and Sir Nigel (1906). The three texts
written in a similar genre around the same time but by different authors
were: The Ponsonby Diamonds (Meade and Halifax 1894, published in The Strand
Magazine), The Old Man in the Corner (Baroness Orczy 1901) and The Scarlet
Pimpernel (Baroness Orczy 1905).
The texts were prepared in a manner that follows the technique employed by
Smith (1993) very closely. The 20 most commonly occurring function words
from "The Hound of the Baskervilles" were chosen. A discriminant analysis
was then run using SPSS 10.00 with the three texts forming three groups.
This produced a "nearness" metric for the texts. A series of tests will be
presented that provide a strong basis for the use of discriminant analysis.
The tests demonstrate the ability of this technique to separate texts by
author, by genre (when author is fixed), or even by time. Checks were also
carried out to ensure that the results were not just arbitrary, but showed a
real variation in texts.
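
The feature-extraction step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function-word list, the block size of 1000 tokens, and the names tokenize, top_function_words and block_rates are all hypothetical, since the abstract names neither its 20 words nor the size of its text samples.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption; the paper does not list its words).
FUNCTION_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it", "was", "i",
                  "he", "his", "is", "with", "as", "had", "for", "at", "but", "not"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def top_function_words(tokens, n=20):
    """Return the n most frequent function words found in the token list."""
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return [word for word, _ in counts.most_common(n)]

def block_rates(tokens, words, block_size=1000):
    """Split tokens into consecutive full blocks and return, for each block,
    the rate of every chosen word per 1000 tokens."""
    rows = []
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = Counter(tokens[start:start + block_size])
        rows.append([1000.0 * block[w] / block_size for w in words])
    return rows
```

Each row returned by block_rates is one observation, and the blocks taken from one text form one group for a subsequent discriminant analysis. Choosing the word list from a single reference text, as the study does with "The Hound of the Baskervilles", keeps the variables identical across all groups.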

Discussion and Further Work
The starting point for this research was to investigate whether there is any
evidence to support the thesis that Conan Doyle may not have written all of
"The Hound of the Baskervilles". There is absolutely no support for this in
the results. The technique tested for authorship while attempting to
control two other major variables: time and genre. It was capable of
distinguishing texts by author consistently. It also appeared to be capable
of separating texts by genre, and it separated Conan Doyle's works by time
as well: The Hound of the Baskervilles was closer to The Return of Sherlock
Holmes, written only two years later, than it was to The Memoirs of Sherlock
Holmes, written some 9 years earlier.
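
The abstract does not say how its "nearness" metric is defined. One common choice in linear discriminant analysis, assumed here purely for illustration, is the squared Mahalanobis distance from an observation to each group centroid under the pooled within-group covariance; with equal priors, the nearest centroid is the predicted group. The function names below are hypothetical and the sketch uses only the Python standard library.

```python
def mean_vector(rows):
    """Column means of a list of equal-length observation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(groups):
    """Pooled within-group covariance matrix across all groups."""
    p = len(groups[0][0])
    total = [[0.0] * p for _ in range(p)]
    dof = 0
    for rows in groups:
        centre = mean_vector(rows)
        for row in rows:
            d = [x - m for x, m in zip(row, centre)]
            for i in range(p):
                for j in range(p):
                    total[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[v / dof for v in row] for row in total]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Fails with ZeroDivisionError if a is singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def mahalanobis_sq(u, v, cov):
    """Squared Mahalanobis distance between u and v under covariance cov."""
    d = [a - b for a, b in zip(u, v)]
    return sum(di * si for di, si in zip(d, solve(cov, d)))

def nearest_group(sample, groups):
    """Return (index of nearest group, list of squared distances)."""
    cov = pooled_covariance(groups)
    dists = [mahalanobis_sq(sample, mean_vector(g), cov) for g in groups]
    return dists.index(min(dists)), dists
```

The same distance applied between two centroids gives a text-to-text nearness. The pooled covariance is only invertible when the total number of observations comfortably exceeds the number of function words, which is one reason the approach needs long texts.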
One possible reason why the Sherlock Holmes stories differ from the story
used as a comparand text in a different genre might be that they employ a
considerable amount of spoken dialogue, whereas the other story chosen
contains far more narrative text. This has yet to be investigated.
Principal Components Analysis (PCA) has been successfully used in stylometric
analysis (Burrows 1987, Binongo 2000, Binongo and Smith 1999). Binongo
(2000) seems overly pessimistic in his dismissal of discriminant analysis,
arguing against its use because of the assumption of multivariate normality.
However, his work reveals several worrying aspects of the use of principal
components analysis. First, the most frequent words tend to have the least
discriminatory power, and the first principal component may not be able to
reveal authorship accurately. Second, if the frequencies of function words
are standardised, this in turn may lead more frequent words to be swamped by
less frequent words. Binongo and Smith (1999) demonstrated that PCA was
capable of distinguishing differences in genre in a comparison between the
essays and plays of Oscar Wilde. In a later study (Binongo and Smith 1999)
they also demonstrated the success of this technique in a comparison of the
works of two contemporaneous American authors, Nathaniel Hawthorne and Herman
Melville, using 25 function words.
Principal Components Analysis was also employed on the texts used in this
study. Although it appeared to differentiate reliably between three
different authors, the principal components appeared to be less reliable in
tests across genres, and especially where two groups were drawn from the same
text. In this case the Kaiser-Meyer-Olkin (KMO) statistic, which measures
sampling adequacy (Kaiser 1970), indicated that the extracted components
might be unreliable. Kaiser (1974) recommends accepting values greater than
0.5, and even values between 0.5 and 0.7 are considered mediocre (see also
Field 2000). When PCA was applied to the same text split into two groups, KMO
scores of between 0.4 and 0.55 were observed.
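
For readers unfamiliar with the KMO statistic, the sketch below shows the overall measure in its usual form: the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, where the partial correlations are obtained from the inverse of the correlation matrix. The function names are hypothetical, the code uses only the Python standard library, and it is not a reproduction of the SPSS output reported in the study.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of rows."""
    cols = list(zip(*rows))
    n = len(rows)
    means = [sum(c) / n for c in cols]
    dev = [[x - m for x in c] for c, m in zip(cols, means)]
    norm = [math.sqrt(sum(d * d for d in dv)) for dv in dev]
    p = len(cols)
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norm[i] * norm[j])
             for j in range(p)] for i in range(p)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix."""
    inv = invert(corr)
    p = len(corr)
    r2 = q2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                # Partial correlation of variables i and j given all the others.
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                r2 += corr[i][j] ** 2
                q2 += partial ** 2
    return r2 / (r2 + q2)
```

A value near 1 means the correlations are largely shared among variables, so components are likely to be stable; a value near or below 0.5, as observed when one text was split into two groups, means the partial correlations are large relative to the raw correlations.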
Experiments with the number of function words were also tried. It was found
that increasing the set of function words from 20 to 25 or 30 made only a
marginal difference. Running a MANOVA test on the function word data allowed
us to identify function words that were unreliable, and re-running the
discriminant analysis with these words removed produced a slight
improvement. As the number of function words was progressively decreased, the
sensitivity of the test diminished. The drawback of this approach is
that it requires large amounts of text to produce reliable results
(something approaching the size of a novella or short novel as a minimum).
The test will therefore not be very sensitive to the insertion or
interleaving of texts by different authors. If a form of dimension reduction
can be established, then it might be possible to treat a text as a time
series and slide a window over the text to look for anomalous sections that
might indicate a change of author. A further way in which this work can be
developed is to examine the function words themselves and ask why authors
vary in their usage. Some function words are used as what Schiffrin (1987)
calls discourse markers, and as higher-level indicators of structure their
use may well vary from one writer to the next. A linguistic basis for the
variation in function words needs to be established, if only to demonstrate
the scientific credentials of stylometric analysis.
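
The windowing idea raised above, treating a text as a time series, could be sketched as follows. This is a speculative illustration only: the study proposes the idea as further work and does not specify a method. The sketch skips the dimension-reduction step and simply flags windows in which any word's rate lies more than a chosen number of standard deviations from its mean over all windows. The window size, step, threshold and function names are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

def window_rates(tokens, words, window=2000, step=500):
    """Rate per 1000 tokens of each word in successive (possibly overlapping)
    windows. Returns the window start offsets and one row of rates per window."""
    starts = list(range(0, len(tokens) - window + 1, step))
    rates = []
    for s in starts:
        counts = Counter(tokens[s:s + window])
        rates.append([1000.0 * counts[w] / window for w in words])
    return starts, rates

def anomalous_windows(starts, rates, threshold=2.0):
    """Return start offsets of windows where some word's absolute z-score,
    computed across all windows, exceeds the threshold."""
    columns = list(zip(*rates))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    flagged = []
    for s, row in zip(starts, rates):
        z = [abs(x - m) / sd if sd > 0 else 0.0
             for x, m, sd in zip(row, mus, sds)]
        if max(z) > threshold:
            flagged.append(s)
    return flagged
```

Small windows give noisy rates for all but the commonest words, so any flagged section would be a prompt for closer reading rather than evidence of a change of author.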

Bibliography

Binongo, J. N. G. and Smith, W. (1999). The Application of principal
components analysis to stylometry. Literary & Linguistic Computing, 14(4):
445-466.

Binongo, J. N. G. (2000). Stylometry and its implementation by Principal
Components Analysis. Ph.D. Thesis, University of Ulster, Co. Antrim,
Northern Ireland, UK.

Burrows, J. F. (1987). Computation into Criticism: A Study of Jane Austen's
Novels and an Experiment in Method. Oxford: Clarendon.

Conan Doyle, Arthur (1986). The Illustrated Sherlock Holmes. London: Omega
Books.

Field, A. (2000). Discovering Statistics using SPSS for Windows. London:
Sage Publications.

Garrett, M. F. (1982). Production of Speech: Observations from Normal and
Pathological Language Use. In A. Ellis (ed.), Normality and Pathology in
Cognitive Functions. London: Academic Press.

Kaiser, H. F. (1970). A Second Generation little jiffy. Psychometrika, 35:
401-415.

Kaiser, H. F. (1974). An Index of factorial simplicity. Psychometrika, 39:
31-36.

Schiffrin, D. (1987). Discourse Markers. Cambridge: Cambridge University
Press.

Smith, W. (1993). Edmund Ironside. Notes and Queries, 238: 202-5.

Waith, E. M. (1984). Titus Andronicus: The Oxford Shakespeare. Oxford
University Press.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2002
"New Directions in Humanities Computing"

Hosted at Universität Tübingen (University of Tubingen / Tuebingen)

Tübingen, Germany

July 23, 2002 - July 28, 2002

Conference website: http://web.archive.org/web/20041117094331/http://www.uni-tuebingen.de/allcach2002/

Series: ALLC/EADH (29), ACH/ICCH (22), ACH/ALLC (14)

Organizers: ACH, ALLC
