The Provenance of Christian Doctrine, attributed to John Milton: An Evaluation of Alternative Statistical Methods

  1. F.J. Tweedie

    University of the West of England

  2. T.N. Corns

    University of Wales, Bangor (Bangor University / University College of North Wales)

  3. J.K. Hale

    University of Otago

  4. G. Campbell

    Leicester University

  5. D.I. Holmes

    University of the West of England

This project attempts to resolve perhaps the most urgent issue in Milton Studies, namely the provenance of the Latin manuscript, found among the English State papers some 150 years after Milton's death, which has since its discovery been regarded as his definitive attempt to define Christian Doctrine.

The inclusion of the text in the Milton canon has never been unproblematic. Initial responses to it regarded the text as a sensational disclosure of a heretical tendency almost unsuspected previously. In the mid- and late twentieth century many significant studies remarked on, and attempted to explain away, discrepancies between that text and other work most certainly by Milton, particularly "Paradise Lost" (for example Hunter et al. 1971). In recent years William B. Hunter, who had formerly sought sophisticated explanations for those contradictions, became convinced that the text is that of an unknown seventeenth-century writer and that it was mistakenly or cynically foisted on Milton shortly after his death, an error compounded on its rediscovery (Hunter 1992, 1993). A group of scholars, Miltonists and statisticians, has been drawn to the question (Campbell et al. 1995), and the text is currently under interrogation from several perspectives, including a study of the amanuenses, a study of the possible circumstances of its composition and early transmission, a study (in part computer-aided) of its Latinity, and a stylometric analysis.

From the earliest days of stylometry, work has been carried out on texts in many different languages, including French, Swedish, Russian, Hebrew and Greek. However, despite the interest shown in the last two, both classical languages, Latin has remained relatively untouched by the stylometric hand. Greek has attracted much attention, perhaps by virtue of its New Testament connections: work has been carried out on the New Testament, poetry and letters, amongst others. Although both Latin and Greek are inflected languages, there are fundamental differences between them. For example, the lack of the article in Latin means that the enthusiasm for sentence-length investigations of Greek writings does not carry over to Latin; see Michaelson et al. (1978) and Wake (1957). In this paper, a development of the work presented as a poster at the ACH/ALLC conference in Santa Barbara in 1995, we investigate the application of various stylometric techniques. Before they can be applied to "De Doctrina Christiana" it is important to validate their effectiveness on texts of known authorship.

"De Doctrina Christiana"
A group of scholars has formed with the intention of investigating the authorship of "De Doctrina Christiana". This neo-Latin manuscript was found in 1823 along with State Papers by Milton, and it is this location that prompted the Miltonic attribution. The investigation has proceeded along several lines; the stylometric work is detailed in the next section.

Stylometric work
Despite the lack of interest shown by the stylometric community, some research has taken place in combining statistics and Latin; see for example Hubka (1985). With a few exceptions, it remains amateurish and open to attack on both statistical and literary grounds. Much of it is detailed in Tweedie et al. (1995).

We were able to obtain a machine-readable version of Milton's Latin prose, in particular the three Defences, "Defensio Prima", "Defensio Secunda" and "Pro Se Defensio". These are the only comparable works by Milton in neo-Latin and fall into the polemic genre. We have also entered samples from other texts for use as controls. Three theological samples were chosen, by Ames, Wolleb and Baxter. The works by Ames and Wolleb were mentioned in "De Doctrina Christiana". Five polemic samples were also identified, works by May, Prynne, Wentworth, Earle and Bate. We also had access to samples of text by Bacon from the Oxford Text Archive. The Milton polemics "Defensio Secunda" and "Pro Se Defensio" were each of the order of 25,000 words and were each split into five samples, while "Defensio Prima" has around 47,700 words and was split into nine samples for analysis. "De Doctrina Christiana" is very much larger and we consider only the first 25,000 words, again split into five samples. The control texts all have around 3000 to 5000 words and were kept as one sample each.
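The division of a long text into samples can be illustrated with a short sketch. The text does not say how the sample boundaries were chosen, so the contiguous, near-equal split below is an assumption, and the function name split_into_samples is hypothetical.

```python
def split_into_samples(words, n_samples):
    """Split a list of word tokens into n_samples contiguous samples.

    Assumption (not stated in the text): samples are contiguous and as
    nearly equal in length as possible.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    base, extra = divmod(len(words), n_samples)
    samples = []
    start = 0
    for i in range(n_samples):
        # The first `extra` samples take one additional word each.
        size = base + (1 if i < extra else 0)
        samples.append(words[start:start + size])
        start += size
    return samples
```

Under this assumption a text of about 47,700 words divided into nine samples gives samples of about 5,300 words, and 25,000 words divided into five gives samples of 5,000 words, comparable in size to the larger control texts.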

The aim of this section is to identify discriminators that are able to separate the known Milton samples from the control samples, and to examine where "De Doctrina Christiana" falls on a similar scale. Measures employed so far include function word doubletons and the most common words used. The function word analysis is detailed in Tweedie et al. (1995); here we concentrate on the analysis of common words.

Most common words
In a technique similar to that of Burrows (1992), we identified the one hundred most frequently occurring words in the corpus of text. Their frequencies were standardized and then examined using principal components analysis. The analysis proceeded in two distinct branches. Firstly, it was important to discover whether this technique was applicable to neo-Latin, by applying it to texts of known authorship. Secondly, we wished to investigate the relative discriminatory ability of subsets of these words: the top fifty, the second fifty, the top twenty-five and so on.
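The procedure just described (count the most frequent words, standardize their frequencies, apply principal components analysis) can be sketched as follows. This is an illustration under stated assumptions, not the software used for the analysis: the function names are hypothetical, frequencies are taken as relative rates per sample and standardized to z-scores, and the components are found by simple power iteration so that only the standard library is needed.

```python
from collections import Counter
import math
import random


def top_words(samples, n):
    """Return the n most frequent words over all samples, most frequent first."""
    counts = Counter()
    for words in samples:
        counts.update(words)
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def standardized_rates(samples, vocab):
    """One row per sample, one column per word; columns are z-scores of
    relative frequency. Assumes at least two non-empty samples."""
    rates = []
    for words in samples:
        counts = Counter(words)
        rates.append([counts[w] / len(words) for w in vocab])
    n = len(rates)
    z_columns = []
    for column in zip(*rates):
        mean = sum(column) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
        # A word used at the same rate everywhere carries no information.
        z_columns.append([(x - mean) / sd if sd > 1e-12 else 0.0 for x in column])
    return [list(row) for row in zip(*z_columns)]


def principal_components(data, k=2, iterations=300, seed=0):
    """Return (scores, loadings) for the first k principal components.

    Uses power iteration with deflation on the covariance matrix.
    The sign of each component is arbitrary.
    """
    n, p = len(data), len(data[0])
    means = [sum(column) / n for column in zip(*data)]
    centred = [[x - m for x, m in zip(row, means)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    rng = random.Random(seed)
    loadings = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iterations):
            w = [sum(c * x for c, x in zip(row, v)) for row in cov]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(vi * sum(c * x for c, x in zip(row, v))
                         for vi, row in zip(v, cov))
        loadings.append(v)
        # Remove the component just found before looking for the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(x * l for x, l in zip(row, v)) for v in loadings]
              for row in centred]
    return scores, loadings
```

Plotting the second column of scores against the first, with one labelled point per sample, gives a plot of the kind discussed in this section. Because the sign of a component is arbitrary, a plot produced this way may be mirrored relative to one produced by other software.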

Initially, therefore, we considered only the texts of known authorship: the Milton polemics, the polemic and theological control texts, and the Bacon samples. The fifty most frequently occurring words were used as input to the principal components analysis, which resulted in the graph below.

[Figure: scatter plot of the samples of known authorship on the first two principal components (vertical axis PCA2), based on the fifty most common words.]

The samples are designated M for Milton polemics, B for Bacon samples, P for polemic controls and T for theological controls. It is clear that the Milton samples are closely clustered together in the right of the graph, while the Bacon samples are towards the left. With the exception of the sample from Baxter, the rest of the polemic and theological control samples are towards the centre of the plot. The clustering of the polemic and theological samples is especially interesting when it is remembered that they are all written by different authors. Indeed, it is clear that the first principal component separates Milton from the polemic and theological controls, as well as from the Bacon samples. This component is based mainly on the use of 'et' (r=-0.98). In Latin the enclitic '-que' is equivalent to 'et', but we have not counted '-que' usage. This appears to be one reason for the very low usage of 'et' evident in the Milton texts. The Bacon corpus, on the other hand, has a very low incidence of '-que'. Other words that are important on this axis include 'quid', 'quo', 'quidem' and 'qui', all words that Milton uses much more than Bacon. Stylistically, Bacon was known to write in the plain style, while Milton used more complex sentences. Our results would seem to confirm this.
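A figure such as r=-0.98 can be obtained by correlating the rate of a word in each sample with that sample's score on a component. Reading the reported value in this way is an assumption, as the text does not define r; the sketch below simply computes the Pearson correlation coefficient, and the function name pearson_r is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Here xs would hold the rate of a word such as 'et' in each sample and ys the scores of the same samples on the first principal component. A value close to -1 or +1 indicates that the component is largely a restatement of that word's rate.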
The clustering of the Milton samples and their separation from the other controls indicates that this technique would be capable of discriminating between Milton and texts by other authors. We are therefore confident in applying it to the text of "De Doctrina Christiana". As mentioned above, the one hundred most commonly occurring words were enumerated. For computational reasons, we were unable to analyse all one hundred words at once. It was therefore decided to split the data in two ways: first into two sets, the top fifty words and the next fifty, and second into four sets, the top twenty-five words, the next twenty-five and so on. The data from "De Doctrina Christiana" was included in the analysis, represented by 'C's, and that from the Bacon samples was removed as it was not directly comparable for these purposes.
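The division of the ranked word list into sets of fifty and of twenty-five can be sketched in a few lines. The function name tranches is hypothetical, and the list is assumed to be ordered from most to least frequent.

```python
def tranches(ranked_words, size):
    """Cut a frequency-ranked word list into consecutive sets of `size` words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [ranked_words[i:i + size] for i in range(0, len(ranked_words), size)]
```

For a list of one hundred words, tranches(ranked_words, 50) gives the top fifty and the second fifty, and tranches(ranked_words, 25) gives the four sets of twenty-five. Each set can then be analysed separately.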

The first analyses were carried out on the twenty-five word samples. Analysis of the most common twenty-five words produced the graph below. It can clearly be seen that the Milton samples are in the top left area of the graph, the "De Doctrina Christiana" samples in the bottom left, and the controls towards the right, the polemics at the top.

[Figure: scatter plot of the samples from the principal components analysis of the twenty-five most common words. M = Milton polemics, C = "De Doctrina Christiana", P = polemic controls, T = theological controls.]

The discriminations are not complete: there are "De Doctrina Christiana" samples mixed with the Milton samples, and a Milton sample lies towards the control area. However, the general structure is clear. Analyses of the other twenty-five word sections resulted in graphs that generally separated the groups, with various facets being demonstrated.

The second stage of the analysis was carried out using the top fifty words. The resulting plot is shown below. It appears that there are four distinct areas of the graph. The upper half contains the polemic samples, both controls and the Milton Defences, while the lower half contains the theological samples, "De Doctrina Christiana" and controls. The exception is again the Baxter sample, which appears higher than it perhaps should for its genre. Also, the right hand side of the graph appears to contain the Milton samples as well as "De Doctrina Christiana", while the left hand side seems to contain the control samples from both genres. This is a development of the above graph from the top twenty-five words. Here the groups are separated better, and clustered more tightly together.

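[Figure: scatter plot of the samples on the first two principal components (vertical axis PCA2), based on the fifty most common words. M = Milton polemics, C = "De Doctrina Christiana", P = polemic controls, T = theological controls.]
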
This PCA plot indicates that the "De Doctrina Christiana" samples are closer to Milton's defences than they are to the theological control samples. This would appear to lead to the conclusion that the author of "De Doctrina Christiana" uses these words in a similar way to that of Milton.
The analysis of the second fifty most common words results in the plot below. It can be seen that the Milton samples are spread over a large area of the first principal component, but in the higher part of the graph, while the other samples, with the exception of one of the polemic samples, Eikon Basilike, are in the lower section. The "De Doctrina Christiana" samples are closely grouped, as are the theological samples, in the lower, right corner of the graph.

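[Figure: scatter plot of the samples on the first two principal components (vertical axis PCA2), based on the second fifty most common words. M = Milton polemics, C = "De Doctrina Christiana", P = polemic controls, T = theological controls.]
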
Investigation of the words affecting the first principal component reveals that 'rex' plays the major part. Thus works of a polemic nature, with reference to the king, score more negatively than works of a theological nature. This may explain the close groupings of the theological samples. On the second principal component, analysis is more difficult. Many words play roles here, but the most significant are 'ego' and 'quis', as well as 'jam'. Again, the Milton samples appear different from the control samples and "De Doctrina Christiana", perhaps due to the complex sentence structure mentioned above.

Conclusions and Future Work
It is clear that this area represents almost virgin territory for the stylometrician. Previous work has been reviewed and found wanting in many areas. The results of applying more appropriate techniques to the case of "De Doctrina Christiana" seem to imply that its author may well be Milton, but genre effects appear to be playing a part. This was revealed by the analysis of the fifty most common words. Consideration was also given to the second fifty and to tranches of twenty-five words. Other techniques should be investigated, but further investigation is needed into Latin stylometry before more can be done.

Some aspects of further work cannot be carried out without access to a morphological parser, to identify parts of speech automatically, and a metrical scanner, which would enable the rhythm of Latin prose to be taken into account in investigations. Certain works have been scanned, but they are limited. Access to these utilities would dramatically open up the subject to further research.

