Reconstructing the stemma of a textual tradition from the order of sections in manuscripts

  1. Matthew Spencer

    Department of Biochemistry - Cambridge University

  2. Barbara Bordalejo

    De Montfort University, University of Saskatchewan

  3. Adrian C. Barbrook

    Department of Biochemistry - Cambridge University

  4. Linne R. Mooney

    Department of English - University of Maine

  5. Christopher J. Howe

    Department of Biochemistry - Cambridge University

  6. Peter Robinson

    De Montfort University, Institute for Textual Scholarship - University of Saskatchewan

Work text

Geoffrey Chaucer's Canterbury Tales exists in over 50 complete manuscripts
and early printed editions. It consists of a series of tales told by
pilgrims travelling from London to Canterbury, more or less loosely
connected by linking passages. Unlike other contemporary works of
literature having a similar form, the surviving manuscripts show many
different orders of the tales and links [Manly, J. M. & Rickert, E., pages
475-494 in Volume II of The text of the Canterbury Tales: studied on the
basis of all known manuscripts (eds. Manly, J. M. & Rickert, E.),
University of Chicago Press, Chicago, 1940]. Tales and links have been
inserted, deleted or moved from one position to another. Previous studies
of the order of these sections have focussed on two main questions: the
order intended by Chaucer, and the possibility that differences in order
among manuscripts can reveal the genealogy of the manuscripts (the
stemma). These studies have used verbal arguments about the internal
consistency of different orders, geographical references within the links,
and the plausibility of hypothetical rearrangements [e.g. Benson, L. D.
The order of The Canterbury Tales. Studies in the Age of Chaucer 3, 77-120

An analogous problem in evolutionary biology is the reconstruction of the
phylogeny (family tree) of a set of species from the order of genes on a
genome [Sankoff, D. Edit distance for genome comparison based on non-local
operations. Lecture Notes in Computer Science 644, 121-135 (1992)]. Genes
may be inserted, deleted, transposed (moved from one position to another)
or inverted (flipped from forward to reverse order). One could measure
the edit distance between two genome orders as the minimum number of these
operations needed to convert one order into the other. Similarly, the
distance between a pair of manuscripts could be measured as the minimum
number of insertions, deletions and transpositions needed to convert one
order into the other (inversions are not possible in manuscripts). One
could then use well-known methods from evolutionary biology to reconstruct
a stemma from the matrix of pairwise distances among manuscripts.
Unfortunately, calculating edit distance is computationally very
difficult. We describe two approaches to this problem.
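To make the definition of edit distance concrete, the following Python sketch finds the minimum number of operations by exhaustive breadth-first search. It is an illustration only, not the method used in this work. It assumes that each section occurs at most once in an order and that a transposition moves a single section; the names `neighbours` and `edit_distance` are hypothetical. The number of orders the search may visit grows factorially with the number of sections, which is one way to see why this approach is impractical beyond toy examples.

```python
from collections import deque


def neighbours(order, alphabet):
    """Yield every order reachable from `order` in one operation."""
    order = list(order)
    n = len(order)
    # Deletion: drop one section.
    for i in range(n):
        yield tuple(order[:i] + order[i + 1:])
    # Insertion: add one absent section at any position.
    for item in alphabet - set(order):
        for i in range(n + 1):
            yield tuple(order[:i] + [item] + order[i:])
    # Transposition (assumed here to move a single section).
    for i in range(n):
        rest = order[:i] + order[i + 1:]
        for j in range(n):
            if j != i:
                yield tuple(rest[:j] + [order[i]] + rest[j:])


def edit_distance(source, target):
    """Minimum number of insertions, deletions and single-section moves."""
    source, target = tuple(source), tuple(target)
    alphabet = set(source) | set(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, steps = queue.popleft()
        if current == target:
            return steps
        for nxt in neighbours(current, alphabet):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    raise ValueError("target not reachable")  # should not happen


# Hypothetical section labels, not real manuscript orders.
one_move = edit_distance("ABCD", "ACBD")   # moving B past C
one_loss = edit_distance("ABC", "AC")      # deleting B
```

With the four-section example above the search visits at most 65 distinct orders; with tens of sections the same search would be infeasible.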

First, the edit distance can be separated into the number of
transpositions and the number of insertions/deletions. The breakpoint
distance between
a pair of sequences (manuscripts or genomes) is the proportion of items
(tales or genes) common to both sequences but having different right-hand
neighbours in the two sequences. Breakpoint distance is simple to
calculate, and gives an approximate measure of the number of
transpositions as long as few transpositions have occurred [Blanchette,
M., Kunisawa, T. & Sankoff, D. Gene order breakpoint evidence in animal
mitochondrial phylogeny. Journal of Molecular Evolution 49, 193-203
(1999)]. For damaged manuscripts from which sections may have been lost
after writing, we cannot calculate the exact breakpoint distance but we
can set bounds on its possible value. The deletion distance is the
proportion of items present in one but not both of the pair of sequences,
and is an estimate of the number of insertions/deletions. However, we
cannot calculate the deletion distance for damaged manuscripts. We
therefore use breakpoint distances alone to estimate the number of
transpositions among each pair of extant manuscripts in the Canterbury
Tales tradition, and produce stemmata based on these distances.

Second, we present new maximum likelihood methods for estimating edit
distance. These methods are computationally intensive, but are more
reliable than breakpoint distance when the number of transpositions is
large.
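The breakpoint and deletion distances defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each section is assumed to occur once per manuscript, sections missing from either manuscript are dropped before neighbours are compared, the last section is given an end-of-text marker as its right-hand neighbour, and the deletion distance is taken as a proportion of all sections found in either manuscript. The function names and section labels are hypothetical, and the bounds needed for damaged manuscripts are not handled.

```python
def breakpoint_distance(order_a, order_b):
    """Proportion of shared sections whose right-hand neighbour differs."""
    common = set(order_a) & set(order_b)
    if not common:
        raise ValueError("the two orders share no sections")
    # Assumption: compare neighbours within the shared sections only.
    a = [item for item in order_a if item in common]
    b = [item for item in order_b if item in common]
    end = object()  # assumed end-of-text marker for the last section
    right_a = dict(zip(a, a[1:] + [end]))
    right_b = dict(zip(b, b[1:] + [end]))
    breaks = sum(1 for item in common if right_a[item] != right_b[item])
    return breaks / len(common)


def deletion_distance(order_a, order_b):
    """Proportion of sections present in one order but not in both."""
    items_a, items_b = set(order_a), set(order_b)
    union = items_a | items_b
    if not union:
        raise ValueError("both orders are empty")
    # Assumption: the denominator is the number of distinct sections.
    return len(items_a ^ items_b) / len(union)


# Hypothetical orders, not taken from any real manuscript.
ms_1 = ["S1", "S2", "S3", "S4", "S5"]
ms_2 = ["S1", "S3", "S2", "S4", "S5"]   # S2 and S3 exchanged
ms_3 = ["S1", "S2", "S3", "S6"]         # S4, S5 absent; S6 added

moved = breakpoint_distance(ms_1, ms_2)    # 3 of 5 neighbours differ
unmoved = breakpoint_distance(ms_1, ms_3)  # shared sections in same order
changed = deletion_distance(ms_1, ms_3)    # 3 of 6 sections not shared
```

In this toy example a single exchange of adjacent sections changes three right-hand neighbours, which shows why the breakpoint distance tracks the number of transpositions only approximately.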

We compare stemmata produced using these more sophisticated methods with
those from breakpoint distance. For both methods, we use biological
software [Swofford, D. L. PAUP*: Phylogenetic Analysis Using Parsimony
(*and other methods), Sinauer Associates, Sunderland, MA, 1999] to search
for the stemma with the smallest sum of edge lengths that reproduces the
observed pattern of distances among manuscripts. The distance between two
manuscripts on the stemma is measured as the sum of the lengths of the
edges (branches) linking them. Although we allow only topologically binary
stemmata, edge lengths may be arbitrarily close to zero, so relationships
in which a single manuscript has many descendants can be represented. We
do not discuss the methods
used by scholars such as Quentin, Dearing and Zarri, as we are working
with a distance matrix rather than a set of variants.
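The way distances are read off a candidate stemma can be sketched as follows. This Python fragment does not reproduce the PAUP* search; it only shows, for one hypothetical stemma, the distance between two manuscripts as the sum of edge lengths along the path joining them, and the total edge length that the search seeks to minimize. The labels A to D (extant manuscripts) and x, y (hypothetical ancestors), the edge lengths, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (node, node, length) triples into a symmetric adjacency map."""
    adjacency = defaultdict(dict)
    for u, v, length in edges:
        adjacency[u][v] = length
        adjacency[v][u] = length
    return adjacency


def path_distance(adjacency, start, goal):
    """Sum of edge lengths on the unique path between two tree nodes."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, length in adjacency[node].items():
            if nxt != parent:  # never walk back along the same edge
                stack.append((nxt, node, dist + length))
    raise ValueError("no path between the two nodes")


def tree_length(edges):
    """Total edge length of the stemma."""
    return sum(length for _, _, length in edges)


# A binary stemma whose internal edge x-y has length zero, so that
# A, B, C and D effectively descend from a single ancestor.
stemma = [("A", "x", 1), ("B", "x", 2), ("x", "y", 0),
          ("C", "y", 3), ("D", "y", 4)]
adjacency = build_adjacency(stemma)
a_to_c = path_distance(adjacency, "A", "C")   # 1 + 0 + 3
total = tree_length(stemma)                   # 1 + 2 + 0 + 3 + 4
```

The zero-length internal edge shows how a strictly binary stemma can still represent one manuscript with more than two descendants.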

We suggest two methods for comparing different stemmata produced from the
same data set. Again, these methods were developed for analogous problems
in evolutionary biology. First, one can define a partition distance
between two stemmata [Penny, D. & Hendy, M. D. The use of tree comparison
metrics. Systematic Zoology 34, 75-82 (1985)]. Removing any edge linking
two manuscripts (whether extant or hypothetical) divides a stemma into two
sets of manuscripts. The order of manuscripts within the sets is not
important. If we can divide two stemmata into the same two sets of
manuscripts in this way, we say that the stemmata have an edge in common.
If there is no edge in the second stemma whose removal produces the same
two sets of manuscripts as the removal of an edge in the first stemma, we
say that the edge occurs in only one of the two stemmata. The partition
distance is then the proportion of edges that occur in only one of the two
stemmata. Partition distance is a simple summary of the number of
differences between two stemmata, and can be used to decide whether two
stemmata are more similar than one would expect by chance alone. Second,
a consensus stemma [Page, R. D. M. & Holmes, E. C. Molecular evolution: a
phylogenetic approach (Blackwell Science, Oxford, 1998)] includes only
those edges that occur in all (or a specified proportion of) the stemmata
to be compared. Where the compared stemmata contradict each other, the
corresponding parts of the consensus stemma are left as unresolved,
star-like groups. A consensus stemma provides a good visual representation of the
areas of uncertainty in the relationships among manuscripts.
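Both comparison methods rest on the same operation: listing the two-way divisions of the manuscripts produced by removing each edge. The Python sketch below illustrates this under stated assumptions. Divisions are compared on extant manuscripts only (hypothetical ancestors are ignored because their labels are arbitrary), every edge is counted, including those leading to a single manuscript, and the proportion is taken over all distinct divisions found in either stemma. The function names and the two example stemmata are hypothetical.

```python
def splits(edges, manuscripts):
    """Set of two-way divisions of `manuscripts`, one per tree edge."""
    manuscripts = frozenset(manuscripts)
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    result = set()
    for u, v in edges:
        # Collect every node reachable from u without crossing edge u-v.
        side, stack = {u}, [u]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in side and not (node == u and nxt == v):
                    side.add(nxt)
                    stack.append(nxt)
        part = frozenset(side) & manuscripts
        if part and part != manuscripts:  # skip uninformative divisions
            result.add(frozenset({part, manuscripts - part}))
    return result


def partition_distance(edges_1, edges_2, manuscripts):
    """Proportion of divisions found in only one of the two stemmata."""
    s1, s2 = splits(edges_1, manuscripts), splits(edges_2, manuscripts)
    return len(s1 ^ s2) / len(s1 | s2)


def strict_consensus_splits(stemmata, manuscripts):
    """Divisions present in every stemmma of the list (strict consensus)."""
    return set.intersection(*(splits(e, manuscripts) for e in stemmata))


extant = "ABCDE"
# Two hypothetical stemmata that disagree on whether A groups with B or C.
stemma_1 = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"),
            ("y", "z"), ("D", "z"), ("E", "z")]
stemma_2 = [("A", "x"), ("C", "x"), ("x", "y"), ("B", "y"),
            ("y", "z"), ("D", "z"), ("E", "z")]

distance = partition_distance(stemma_1, stemma_2, extant)
shared = strict_consensus_splits([stemma_1, stemma_2], extant)
```

In this example the two stemmata agree that D and E belong together, so that division survives in the consensus, while the conflicting groupings of A, B and C are dropped and would appear as an unresolved group.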

We also consider methods for dealing with contamination. For the distance
measure we discuss here, our preferred technique is to add edges to the
tree so as to minimize the sum of squared differences between observed
distances and shortest distances on the resulting network [Makarenkov, V.
& Legendre, P. in Data analysis, classification and related methods (eds.
Kiers, H. A. L., Rasson, J. P., Groenen, P. J. F. & Schader, M.) 35-40
(Springer, New York, 2000)].
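The least-squares criterion described above can be evaluated as in the following Python sketch. It does not implement the algorithm of Makarenkov and Legendre; it only computes, for a given network, the sum of squared differences between observed distances and shortest-path distances, so that a tree can be compared with the same tree plus one added edge. The node labels, edge lengths, observed distances, and the length given to the added edge are all hypothetical.

```python
def shortest_paths(nodes, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) on a network."""
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, length in edges:
        dist[u][v] = min(dist[u][v], length)
        dist[v][u] = min(dist[v][u], length)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def squared_error(nodes, edges, observed):
    """Sum of squared differences between observed and network distances."""
    dist = shortest_paths(nodes, edges)
    return sum((d - dist[u][v]) ** 2 for (u, v), d in observed.items())


nodes = ["A", "B", "C", "D", "x", "y"]
tree = [("A", "x", 1.0), ("B", "x", 1.0), ("x", "y", 1.0),
        ("C", "y", 1.0), ("D", "y", 1.0)]
# Hypothetical observed distances: B and C are closer than the tree allows,
# as might happen if one was contaminated from the other.
observed = {("A", "B"): 2.0, ("C", "D"): 2.0, ("A", "C"): 3.0,
            ("A", "D"): 3.0, ("B", "C"): 2.0, ("B", "D"): 3.0}

error_tree = squared_error(nodes, tree, observed)
error_network = squared_error(nodes, tree + [("B", "C", 2.0)], observed)
```

Here the added edge between B and C removes the only misfit without disturbing the other distances, which is the situation in which adding an edge to the tree is worthwhile.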

The computer-based methods we describe here will not automatically tell us
why the manuscripts of the Canterbury Tales show so many different orders,
or decide which, if any, order best represents Chaucer's original
intention. However, the combination of traditional scholarship and modern
technology may allow great progress towards resolving these long-standing
problems. In a companion paper [Bordalejo, B. and Spencer, M. The Order
of the Canterbury Tales: Praxis of Computer Analysis] we discuss the
implications of our analyses for Chaucer studies.


Conference Info

In review


Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

94 works by 167 authors indexed

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC