Authorship

###### 1. Matthew Spencer

Department of Biochemistry - Cambridge University

###### 2. Barbara Bordalejo

De Montfort University, University of Saskatchewan

###### 3. Adrian C. Barbrook

Department of Biochemistry - Cambridge University

###### 4. Linne R. Mooney

Department of English - University of Maine

###### 5. Christopher J. Howe

Department of Biochemistry - Cambridge University

###### 6. Peter Robinson

De Montfort University, Institute for Textual Scholarship - University of Saskatchewan

Work text

Geoffrey Chaucer's Canterbury Tales exists in over 50 complete manuscripts and early printed editions. It consists of a series of tales told by pilgrims travelling from London to Canterbury, more or less loosely connected by linking passages. Unlike those of other contemporary works of literature having a similar form, the surviving manuscripts show many different orders of the tales and links [Manly, J. M. & Rickert, E., pages 475-494 in Volume II of The text of the Canterbury Tales: studied on the basis of all known manuscripts (eds. Manly, J. M. & Rickert, E.), University of Chicago Press, Chicago, 1940]. Tales and links have been inserted, deleted or moved from one position to another. Previous studies of the order of these sections have focussed on two main questions: the order intended by Chaucer, and the possibility that differences in order among manuscripts can reveal the genealogy of the manuscripts (the stemma). These studies have used verbal arguments about the internal consistency of different orders, geographical references within the links, and the plausibility of hypothetical rearrangements [e.g. Benson, L. D. The order of The Canterbury Tales. Studies in the Age of Chaucer 3, 77-120 (1981)].

An analogous problem in evolutionary biology is the reconstruction of the phylogeny (family tree) of a set of species from the order of genes on a genome [Sankoff, D. Edit distance for genome comparison based on non-local operations. Lecture Notes in Computer Science 644, 121-135 (1992)]. Genes may be inserted, deleted, transposed (moved from one position to another) or inverted (flipped from forward to reverse order). One could measure the edit distance between two genome orders as the minimum number of these operations needed to convert one order into the other. Similarly, the distance between a pair of manuscripts could be measured as the minimum number of insertions, deletions and transpositions needed to convert one order into the other (inversions are not possible in manuscripts). One could then use well-known methods from evolutionary biology to reconstruct a stemma from the matrix of pairwise distances among manuscripts. Unfortunately, calculating edit distance is a very difficult computational problem. We describe two solutions.

First, the edit distance can be separated into the number of transpositions and the number of insertions/deletions. The breakpoint distance between a pair of sequences (manuscripts or genomes) is the proportion of items (tales or genes) common to both sequences but having different right-hand neighbours in the two sequences. Breakpoint distance is simple to calculate, and provides an approximate measure of the number of transpositions provided that few transpositions have occurred [Blanchette, M., Kunisawa, T. & Sankoff, D. Gene order breakpoint evidence in animal mitochondrial phylogeny. Journal of Molecular Evolution 49, 193-203 (1999)]. For damaged manuscripts from which sections may have been lost after writing, we cannot calculate the exact breakpoint distance but we can set bounds on its possible value. The deletion distance is the proportion of items present in one but not both of the pair of sequences, and is an estimate of the number of insertions/deletions. However, we cannot calculate the deletion distance for damaged manuscripts. We therefore use breakpoint distances alone to estimate the number of transpositions between each pair of extant manuscripts in the Canterbury Tales tradition, and produce stemmata based on these distances. Second, we present new maximum likelihood methods for estimating edit distance. These methods are computationally intensive, but are more reliable than breakpoint distance when the number of transpositions is large.
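The two simple distances defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each item occurs at most once per sequence, right-hand neighbours are determined after restricting both sequences to their common items, the last item is given a sentinel neighbour, and the deletion distance is normalised by the number of items in either sequence. The function names and the section labels in the usage note are hypothetical.

```python
END = object()  # sentinel standing in for "no right-hand neighbour"


def right_neighbours(order, keep):
    """Map each kept item to the kept item that follows it in `order`."""
    kept = [item for item in order if item in keep]
    return dict(zip(kept, kept[1:] + [END]))


def breakpoint_distance(a, b):
    """Proportion of common items whose right-hand neighbour differs."""
    common = set(a) & set(b)
    if not common:
        raise ValueError("the two sequences share no items")
    next_in_a = right_neighbours(a, common)
    next_in_b = right_neighbours(b, common)
    changed = sum(1 for item in common if next_in_a[item] != next_in_b[item])
    return changed / len(common)


def deletion_distance(a, b):
    """Proportion of items present in one sequence but not both.

    Assumption: the denominator is the number of distinct items found
    in either sequence.
    """
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    if not union:
        raise ValueError("both sequences are empty")
    return len(items_a ^ items_b) / len(union)
```

For two hypothetical orders `["S1", "S2", "S3", "S4", "S5"]` and `["S1", "S3", "S2", "S4", "S6"]`, three of the four shared sections change their right-hand neighbour, and two of the six sections occur in only one order.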

We compare stemmata produced using these more sophisticated methods with those from breakpoint distance. For both methods, we use biological software [Swofford, D. L. PAUP*: Phylogenetic Analysis Using Parsimony (*and other methods), Sinauer Associates, Sunderland, MA, 1999] to search for the stemma requiring the smallest sum of edge lengths necessary to reproduce the observed pattern of distances among manuscripts, where distance between two manuscripts on the stemma is measured as the sum of the lengths of the edges (branches) linking the two manuscripts. Although we only allow topologically binary stemmata, edge lengths may be arbitrarily close to zero, so relationships in which a single manuscript has many descendants can be represented. We do not discuss the methods used by scholars such as Quentin, Dearing and Zarri, as we are working with a distance matrix rather than a set of variants.
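The notion of distance on a stemma used above (the sum of the edge lengths linking two manuscripts) can be sketched as follows. This is only an illustration of the definition, not the search procedure carried out by PAUP*. The edge dictionary, the node names and the helper functions are hypothetical. The example includes a zero-length internal edge to show how a binary stemma can stand in for a node with more than two descendants.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn {(u, v): length} into a symmetric adjacency mapping."""
    adjacency = defaultdict(dict)
    for (u, v), length in edges.items():
        adjacency[u][v] = length
        adjacency[v][u] = length
    return adjacency


def path_length(adjacency, start, goal):
    """Sum of edge lengths on the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, travelled = stack.pop()
        if node == goal:
            return travelled
        for neighbour, length in adjacency[node].items():
            if neighbour != parent:  # never walk back along the same edge
                stack.append((neighbour, node, travelled + length))
    raise ValueError("the two nodes are not connected")


def tree_length(edges):
    """Total length of the stemma: the sum of all its edge lengths."""
    return sum(edges.values())


# Hypothetical binary stemma: x and y are hypothetical ancestors, and the
# zero-length edge x-y makes them behave like one node with four neighbours.
edges = {
    ("A", "x"): 1.0,
    ("B", "x"): 2.0,
    ("x", "y"): 0.0,
    ("C", "y"): 1.5,
    ("D", "y"): 0.5,
}
adjacency = build_adjacency(edges)
```

Here `path_length(adjacency, "A", "C")` adds the lengths of the edges A-x, x-y and y-C, and `tree_length(edges)` gives the quantity that a search over stemmata would try to make as small as possible.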

We suggest two methods for comparing different stemmata produced from the same data set. Again, these methods were developed for analogous problems in evolutionary biology. First, one can define a partition distance between two stemmata [Penny, D. & Hendy, M. D. The use of tree comparison metrics. Systematic Zoology 34, 75-82 (1985)]. Removing any edge linking two manuscripts (whether extant or hypothetical) divides a stemma into two sets of manuscripts. The order of manuscripts within the sets is not important. If we can divide two stemmata into the same two sets of manuscripts in this way, we say that the stemmata have an edge in common. If there is no edge in the second stemma whose removal produces the same two sets of manuscripts as the removal of an edge in the first stemma, we say that the edge occurs in only one of the two stemmata. The partition distance is then the proportion of edges that occur in only one of the two stemmata. Partition distance is a simple summary of the number of differences between two stemmata, and can be used to decide whether two stemmata are more similar than one would expect by chance alone. Second, a consensus stemma [Page, R. D. M. & Holmes, E. C. Molecular evolution: a phylogenetic approach (Blackwell Science, Oxford, 1998)] includes only those edges that occur in all (or a specified proportion of) the stemmata to be compared. Parts of the consensus stemma for which the stemmata to be compared contradict each other are left as unresolved, star-like groups. A consensus stemma provides a good visual representation of the areas of uncertainty in the relationships among manuscripts.
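Both comparison methods described above depend on the two-way divisions of the manuscripts induced by removing an edge. The sketch below is one possible reading of them, under stated assumptions: stemmata are written as nested tuples of manuscript labels, only extant (labelled) manuscripts are compared, only internal edges are counted (edges leading to a single manuscript are shared by any two stemmata on the same manuscripts), and the partition distance is divided by the total number of internal edges in the two stemmata. The labels "A" to "E" and all function names are hypothetical.

```python
from collections import Counter


def leaf_set(tree):
    """All manuscript labels below a node of a nested-tuple stemma."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial two-way divisions induced by the internal edges.

    Each division is stored as the side that does not contain a fixed
    anchor label, so the same division always has the same representation.
    """
    everything = leaf_set(tree)
    anchor = min(everything)
    found = set()

    def visit(node):
        below = leaf_set(node)
        side = everything - below if anchor in below else below
        if 1 < len(side) < len(everything) - 1:
            found.add(side)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def partition_distance(first, second):
    """Proportion of internal edges found in only one of two stemmata."""
    if leaf_set(first) != leaf_set(second):
        raise ValueError("stemmata must contain the same manuscripts")
    s1, s2 = splits(first), splits(second)
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0


def consensus_splits(trees, threshold=1.0):
    """Divisions present in at least `threshold` of the stemmata."""
    counts = Counter()
    for tree in trees:
        counts.update(splits(tree))
    return {s for s, n in counts.items() if n / len(trees) >= threshold}


first = ((("A", "B"), "C"), ("D", "E"))
second = ((("A", "C"), "B"), ("D", "E"))
```

The two example stemmata agree that D and E belong together but disagree about A, B and C, so half of their internal edges are unshared and the strict consensus keeps only the D-E grouping, leaving the rest unresolved.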

We also consider methods for dealing with contamination. For the distance measure we discuss here, our preferred technique is to add edges to the tree so as to minimize the sum of squared differences between observed distances and shortest distances on the resulting network [Makarenkov, V. & Legendre, P. in Data analysis, classification and related methods (eds. Kiers, H. A. L., Rasson, J. P., Groenen, P. J. F. & Schader, M.) 35-40 (Springer, New York, 2000)].
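The least-squares criterion mentioned above can be illustrated with a deliberately simplified sketch. It is not the cited algorithm: as an assumption made only for this example, each candidate edge is given a length equal to the observed distance between its endpoints, existing edge lengths are left unchanged, and a single extra edge is chosen by exhaustive trial. All names and numbers are hypothetical.

```python
def shortest_paths(nodes, edges):
    """All-pairs shortest distances on a weighted network (Floyd-Warshall)."""
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), length in edges.items():
        dist[u][v] = min(dist[u][v], length)
        dist[v][u] = min(dist[v][u], length)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def squared_error(observed, nodes, edges):
    """Sum of squared differences between observed and network distances."""
    dist = shortest_paths(nodes, edges)
    return sum((d - dist[u][v]) ** 2 for (u, v), d in observed.items())


def best_extra_edge(observed, nodes, edges):
    """Try each observed pair as an extra edge; keep the best improvement."""
    best_edge, best_error = None, squared_error(observed, nodes, edges)
    for (u, v), d in observed.items():
        if (u, v) in edges or (v, u) in edges:
            continue
        trial = dict(edges)
        trial[(u, v)] = d  # assumption: new edge length = observed distance
        error = squared_error(observed, nodes, trial)
        if error < best_error:
            best_edge, best_error = (u, v), error
    return best_edge, best_error


# Hypothetical stemma with ancestors x and y; B and C are observed to be
# closer than the tree allows, as might happen under contamination.
nodes = ["A", "B", "C", "D", "x", "y"]
tree = {("A", "x"): 1.0, ("B", "x"): 1.0, ("x", "y"): 1.0,
        ("C", "y"): 1.0, ("D", "y"): 1.0}
observed = {("A", "B"): 2.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
            ("B", "C"): 1.5, ("B", "D"): 3.0, ("C", "D"): 2.0}
```

In this toy case the tree alone cannot reproduce the short observed distance between B and C, and adding one edge between them removes the discrepancy without shortening any other path.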

The computer-based methods we describe here will not automatically tell us why the manuscripts of the Canterbury Tales show so many different orders, or decide which, if any, order best represents Chaucer's original intention. However, the combination of traditional scholarship and modern technology may allow great progress towards answering these long-standing questions. In a companion paper [Bordalejo, B. & Spencer, M. The Order of the Canterbury Tales: Praxis of Computer Analysis] we discuss the implications of our analyses for Chaucer studies.

Conference Info

In review

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

94 works by 167 authors indexed

Affiliations need to be double-checked.

Conference website: https://web.archive.org/web/20011127030143/http://www.nyu.edu/its/humanities/ach_allc2001/

Attendance: 289 (https://web.archive.org/web/20011125075857/http://www.nyu.edu/its/humanities/ach_allc2001/participants.html)

Tags

**Keywords:** canterbury tales, computer-assisted stemmatology, sequence rearrangements

**Language:** English

**Topics:** None