Shakespeare on the tree

poster / demo / art installation
  1. Giuliano Pascucci

    Sapienza Università di Roma (Sapienza University of Rome)

Work text

This work illustrates some computer-based tools and
procedures that have been applied to the complete corpus
of Shakespeare’s works in order to create a phylogenetic tree [1]
of his works and to discriminate between Shakespeare’s and
Fletcher’s authorship in the writing of All is True, which is known
to be the result of a collaboration between the two authors.
The general procedure applied in this study was devised in 2001
by a group of scholars at the University of Rome La Sapienza [2], who
asked whether, starting from a string of graphic characters,
it was possible to extract information such as
the language in which the string was written, whether or
not it belonged to a wider context (e.g. whether it is part of a literary
work), its author and its filiation from other texts.
Their method was based on the idea of linguistic entropy as
defined by Chaitin and Kolmogorov [3] and was applied to
translations of a single text, the Universal Declaration of
Human Rights, into 55 different languages.
Following Kolmogorov’s theories and treating each of
the 55 texts as a mere string of characters, they decided to
study the linguistic entropy of each text, that is to say the
relationship between the information contained in a string
of characters and the ultimate compressibility limit of the
string itself. Because this limit, i.e. the complete removal of
all redundancies, is also the ideal limit that zipping software
should reach, the authors used a compression algorithm to
find out what kind of information could be extracted from a
given text, starting from what they could learn about its
linguistic entropy.
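The link between redundancy and compressibility can be illustrated with a minimal sketch. It assumes Python's standard zlib module as a stand-in for the zipping software; it is not the authors' tool, and the function names are hypothetical.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size in bytes (lower = more redundant)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot measure an empty text")
    return len(zlib.compress(raw, 9)) / len(raw)


def bits_per_character(text: str) -> float:
    """Rough estimate of entropy per character, taken from the zlib output size.

    A real compressor never reaches the ideal limit, so this is only an
    upper-bound style approximation.
    """
    if not text:
        raise ValueError("cannot measure an empty text")
    compressed_bits = 8 * len(zlib.compress(text.encode("utf-8"), 9))
    return compressed_bits / len(text)
```

A highly repetitive string yields a ratio close to zero, whereas a very short string can yield a ratio above 1 because of the fixed overhead of the compressed format. This is one reason why such estimates are more meaningful on long texts.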
The algorithm they used was an appropriately modified
version of LZ77 [4], which was able to index the length and
position of redundant strings within a text and also, among
other things, to extract and collect them. This modified
version of LZ77 was called BCL1, after the initials
of the authors’ names. The algorithm was complemented by two further
programs (FITCH and CONSENSE) [5], created by
Joe Felsenstein for inferring phylogenies and made
available online by the University of Washington. After
processing the texts with both BCL1 and FITCH, Benedetto,
Caglioti and Loreto obtained a graph in which the 55 languages
were grouped in a way that matches philological classifications
of the same languages to a surprising extent.
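The idea of indexing the length and position of redundant strings can be sketched as follows. This is a naive, quadratic-time LZ77-style scan written for illustration only; it is not BCL1, whose internals are not described here, and the function name and the min_len parameter are assumptions.

```python
from typing import List, Tuple


def repeated_strings(text: str, min_len: int = 3) -> List[Tuple[int, int, int]]:
    """Greedy LZ77-style scan over a text.

    Returns (position, source_position, length) triples: at `position` the
    text repeats `length` characters that already occurred starting at
    `source_position`. Matches shorter than `min_len` are skipped.
    """
    matches = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting point and keep the longest match.
        for src in range(i):
            length = 0
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len >= min_len:
            matches.append((i, best_src, best_len))
            i += best_len  # jump past the repeated string
        else:
            i += 1  # no useful repetition: move one character forward
    return matches


print(repeated_strings("to be or not to be"))  # [(13, 0, 5)]
```

The triple (13, 0, 5) says that the five characters starting at position 13 ("to be") repeat those starting at position 0. The repeated strings themselves can be collected by slicing the text with each triple.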
Although the authors had no specific interest in literature, at
the end of their paper they expressed the hope that their
method might be applied to a wider range of areas, such as DNA
and protein sequences, time sequences and literature.
The most relevant difference between the present case study
and previous stylometric studies is that the latter have only
dealt with the analysis of single words or sentences. More
precisely, the studies that investigated single words
focussed mainly on features such as length, occurrence and frequency,
whereas the works that dealt with phrase or sentence analysis
mainly studied features such as the average number of
words in a sentence, the average length of sentences, etc.
These procedures have been followed in many seminal
studies that gave birth to modern stylometry: Ellegård’s
study of the Junius Letters [6], Mosteller’s investigation of the
Federalist Papers [7], Marriott’s analysis of the Historia Augusta [8] and,
last but not least, Mendenhall’s scrutiny of Shakespeare’s works [9].
During the last decades of the last century, some new studies
were carried out using computer-based tools. However,
such studies do not differ from those dating back to the 18th
and 19th centuries, in that they use computers as fast helpers
instead of advancing new hypotheses based on the specific
characteristics and potential of the computer. This is the case,
for example, with Thomas Horton’s [10] study of function words
in Shakespeare’s All is True and Two Noble Kinsmen.
By contrast, the present study does not analyse function
words or any particular class of words, nor does it simply deal
with phrase or sentence analysis. The investigation here is
based on the ratio of identical character strings shared by two or
more texts. Moreover, a character string cannot be identified
with a single word or phrase, in that it may contain diacritical
signs, punctuation, blanks and even chunks of words or phrases.
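As a rough illustration of what a ratio of shared character strings might look like, the sketch below compares the sets of fixed-length character sequences found in two texts. The fixed length n, the set-based ratio and the function names are assumptions made for this example; the study itself relies on the strings found by the compression algorithm.

```python
from typing import Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """All distinct substrings of length n, including blanks and punctuation."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_string_ratio(a: str, b: str, n: int = 8) -> float:
    """Shared character strings divided by all character strings of both texts."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both texts are shorter than n characters
    return len(grams_a & grams_b) / len(union)


# "abcde" and "bcdef" share 3 of their 5 distinct two-character strings.
print(shared_string_ratio("abcde", "bcdef", n=2))  # 0.6
```

Because the strings are cut at the character level, a sequence such as ", or n" counts as a unit even though it spans punctuation, blanks and parts of words.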
A few years ago Susan Hockey clearly stated [11] that, if
investigated deeply enough, a text must show on some level its author’s
DNA or fingerprint. It goes without saying that the greater
the number of DNA sequences common to two biological
entities, the greater their phenotypical resemblance. As a
consequence, it is also very likely that the two entities are in
some way related.
Based on this conviction, which has nowadays become self-evident
in both Biology and Genetics, the present study
analyses Shakespearean texts as though they were mere DNA
strings. The texts are then automatically grouped into families
and placed on a phylogenetic tree, so as to account both for
their evolution and for their deepest similarities.
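Before texts can be placed on a tree, a table of pairwise distances is needed. The sketch below shows one way such a table could be built and written out in the square distance-matrix layout read by PHYLIP's distance programs (a count line, then one row per text with the name padded to ten characters). The distance used here is the normalized compression distance computed with zlib, which is a substitute chosen for illustration and not the BCL1 measure; the text names in the usage example are hypothetical.

```python
import zlib
from typing import Dict, List


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def distance(a: str, b: str) -> float:
    """Normalized compression distance: near 0 for similar texts, near 1 otherwise."""
    if a == b:
        return 0.0
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    # Take the better of both concatenation orders so the measure is symmetric.
    cxy = min(compressed_size(x + y), compressed_size(y + x))
    return max(0.0, (cxy - min(cx, cy)) / max(cx, cy))


def distance_matrix(texts: Dict[str, str]) -> List[List[float]]:
    """Square matrix of pairwise distances, in the insertion order of `texts`."""
    names = list(texts)
    return [[distance(texts[p], texts[q]) for q in names] for p in names]


def to_phylip(names: List[str], matrix: List[List[float]]) -> str:
    """Serialize a square distance matrix with ten-character name fields."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(name[:10].ljust(10) + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"
```

A call such as to_phylip(list(texts), distance_matrix(texts)), with texts mapping names like "play_a" to full texts, produces a string that can be saved as an input file for a distance-based tree program. Note that zlib only looks back 32 KiB, so this particular distance becomes unreliable for full-length plays unless the texts are chunked or a compressor with a larger window is used.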
Results are surprisingly precise, and the families thus created
confirm the groupings on which textual critics and philologists
have agreed over the last few centuries. Other interesting
similarities have also been brought to light. The algorithms, for
instance, were able to recognise and group on a
single branch of the phylogenetic tree Shakespeare’s so-called
Roman Plays, which share the same setting, some themes and
a number of characters. The system also grouped together
the Historical plays, works whose similarities have also been
acknowledged by textual and literary criticism. Furthermore,
the experiment has pointed out a linguistic similarity between
the tragedy of Romeo and Juliet and some of Shakespeare’s
comedies. Although this similarity has never been studied in
depth and certainly deserves further investigation, it would
not come across as unlikely or bizarre to the Shakespearean scholar.
Experiments have also been carried out to test the validity of
the algorithms on shorter texts. In this second phase of the
study, the complete corpus of Shakespeare’s works was split
into 16 KiB text chunks. This time the system was able to
create 37 subsets of chunks, each of which coincided with a
play, and appropriately placed each subset on the phylogenetic
tree.
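The chunking step can be sketched as below. It assumes that chunks are cut at fixed byte offsets of the UTF-8 text and that a shorter final chunk is kept; how the study handled chunk boundaries is not stated, and the function name is hypothetical.

```python
from typing import List

CHUNK_BYTES = 16 * 1024  # 16 KiB


def split_into_chunks(text: str, chunk_bytes: int = CHUNK_BYTES) -> List[bytes]:
    """Cut a text into consecutive byte chunks of at most `chunk_bytes` bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    data = text.encode("utf-8")
    return [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]
```

Working on bytes keeps every chunk at exactly the stated size, at the cost of occasionally cutting a multi-byte character in two at a boundary.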
In a final phase of the experiment, the effectiveness of the
algorithms for authorship attribution was tested on a corpus of 1200
modern English texts by known authors, with positive results.
The procedure was then applied to the play All is True to discriminate
between Shakespeare’s and Fletcher’s authorship of single
parts of the play.
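One common way to turn a compressor into an authorship test is to append the disputed fragment to a reference text by each candidate author and see which reference makes the fragment cheapest to encode. The sketch below follows that idea with zlib; it is an illustration under that assumption, not the procedure actually run in this study, and the author labels and sample texts are invented.

```python
import zlib
from typing import Dict


def extra_bytes(reference: str, fragment: str) -> int:
    """Growth of the compressed size when `fragment` is appended to `reference`."""
    ref = reference.encode("utf-8")
    both = ref + fragment.encode("utf-8")
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))


def attribute(fragment: str, references: Dict[str, str]) -> str:
    """Return the author whose reference text encodes the fragment most cheaply."""
    if not references:
        raise ValueError("at least one reference text is required")
    return min(references, key=lambda author: extra_bytes(references[author], fragment))


# Hypothetical reference texts for two candidate authors.
references = {
    "author_a": "the quick brown fox jumps over the lazy dog. " * 10,
    "author_b": "colourless green ideas sleep furiously tonight. " * 10,
}
print(attribute("the lazy dog jumps over the quick brown fox.", references))
```

The fragment reuses strings from the first reference, so it is assigned to author_a. Since zlib only looks back 32 KiB, reference texts longer than that would need a compressor with a larger window for the comparison to remain meaningful.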
Whereas the results achieved during the testing phase were
completely successful (100% correct attributions), those obtained for
Shakespeare’s All is True were a little less
satisfactory (about 90%). Two factors may account for this:
on the one hand, the fluidity of English
morphology in that period; on the other hand, this specific text
may have undergone the intervention of other authors, as a few
critics suggested during the last century.
Experiments are still being carried out to refine the procedure
and improve the performance of the algorithms.
[1] A phylogenetic tree is a graph used in biology and genetics to
represent the profound relationship (e.g. in the DNA) between two
phenotypically similar entities which belong to the same species or
[2] D. Benedetto, E. Caglioti (Department of Mathematics), V. Loreto
(Department of Physics)
[3] A.N. Kolmogorov, Probl. Inf. Transm. 1, 1 (1965) and G.J. Chaitin,
Information, Randomness and Incompleteness (World Scientific,
Singapore, 1990), 2nd ed.
[4] LZ77 is a lossless data compression algorithm published in a paper
by Abraham Lempel and Jacob Ziv in 1977.
[5] Both programs are based on algorithms used to build phylogenetic
trees and are contained in a software package called PHYLIP.
[6] A. Ellegård, A Statistical Method for Determining Authorship: The Junius
Letters 1769-1772, Gothenburg, Gothenburg University, 1962
[7] F. Mosteller, D. Wallace, Inference and Disputed Authorship: The
Federalist, Reading (Mass.), Addison-Wesley, 1964
[8] I. Marriott, “The Authorship of the Historia Augusta: Two Computer
Studies”, Journal of Roman Studies, 69, pp. 65-77
[9] T. C. Mendenhall, “A Mechanical Solution of a Literary Problem”,
The Popular Science Monthly, 60, pp. 97-105
[10] T. Horton, The Effectiveness of the Stylometry of Function Words in Discriminating
Between Shakespeare and Fletcher, Edinburgh, Department of Computer
Science, 1987. This text can be found online at:
emls/iemls/shaksper/files/STYLOMET%20FLETCHER.txt
[11] S. Hockey, Electronic Texts in the Humanities, New York, Oxford
University Press, 2000


Conference Info


ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

Series: ADHO (3)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None