Modelling Publishing History, 1475-1640: Change Points in the STC

  1. Fiona Tweedie

    Department of Statistics - University of Glasgow

  2. David Bank

    Department of Statistics - University of Glasgow

  3. Brian McIntyre

    Department of Statistics - University of Glasgow

1: Introduction
This paper describes statistical analyses carried out on DEBORAH, a database which gives access to the contents of early printed books, as identified and listed in the (revised) Pollard and Redgrave Short Title Catalogue of English Books, 1475-1640. The total number of books published in this period was c.36,000, 92% of which are now reproduced on microfilm.
DEBORAH is the acronym of "Database of English Books of the Renaissance, their Authors and Histories". In the literature of the period, Deborah was one of many honorific names given to Queen Elizabeth as warrior, judge and leader of her people. The project's interest is historical and cultural rather than narrowly bibliographical.
The number of first or main editions in the revised STC is 18,321 when editions of the Prayer Book, the Bible, and a few other numerically small categories are excluded. To date, some 13,700 of these main editions have been examined on microfilm, providing the database with information such as authorship details, translation history, polemical standpoint, a broad-to-narrow range of subject-index terms, occasion, a detailed abstract of the contents, and illustration. Dates and places of publication are those given in the revised STC, which was computerized as part of the DEBORAH project. The length of the abstracts varies according to the nature of the work analysed, but generally falls within the range of 50 to 200 words. Each record has twenty-five main fields, which can be searched separately or in combination, and each record is linked by program to its corresponding records (all related editions, issues, etc.) in the revised STC database.
The reference uses for such a database are obvious. But the database can also be used for searches which result in frequency tables, which provide the input for graphs. What is the proportion of books on a given topic in relation to the total books produced in a given period of years? How does this proportion change over time? Do the results confirm or disconfirm what, in existing scholarship, would be expected? Early work on quantification was supported by the British Academy. What follows is an account of how quantification and statistical modelling are being applied in new ways.
2: How Many Books? Modelling the Total Output
An obvious first question to ask about the database is the number of books published each year and listed in the revised STC. The number of books published increases from 2 books in 1475 to 789 in 1640. We would like to understand the nature of this increase.
The first step is to fit a straightforward linear regression model (Number of books=-6440.0+4.26*Year, R^2=86.4%). However, despite the apparently large amount of explained variation, the curved pattern of the resulting residuals indicates that a higher-order model is necessary. A quadratic model performs well: the equation Number of books=77162.8-102.3*Year+0.0339*Year^2 explains 93.7% of the variation. The residuals from this model appear to be Normally distributed about zero and do not show evidence of an underlying structure. Differentiating the fitted quadratic with respect to Year, we can therefore conclude that the rate of increase in the number of texts published each year and listed in the STC is Increase=0.0678*Year-102.3. Although Corns [1] found that the effect of censorship could be seen in the volume of material collected by Thomason between 1640 and 1662, we do not find similar results. The imposition or relaxation of censorship during our period produces minor deviations from this model, but none that detracts from the good fit obtained. It may be possible to fit a model that takes the degree of press control into account. In this abstract, however, we shall turn to more specific questions about the books published during this period.
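As a small illustration, the fitted quadratic and its derivative can be evaluated directly. This is only a sketch using the rounded coefficients quoted above; the function names are hypothetical, and because the coefficients are rounded to a few significant figures the fitted counts should be read as approximate.

```python
# Coefficients of the quadratic model as quoted in the text (rounded).
INTERCEPT = 77162.8
LINEAR = -102.3
QUADRATIC = 0.0339


def fitted_books(year):
    """Fitted number of books for a given year under the quadratic model."""
    return INTERCEPT + LINEAR * year + QUADRATIC * year ** 2


def yearly_increase(year):
    """Derivative of the fitted quadratic: extra books per additional year."""
    return LINEAR + 2 * QUADRATIC * year


for year in (1475, 1550, 1640):
    print(year, round(fitted_books(year), 1), round(yearly_increase(year), 2))
```

The derivative is linear in the year, so under this model the yearly increase itself grows steadily across the period.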
3: Other Questions
3.1 Aim
The period covered by the STC is one of the most interesting in the history of the United Kingdom. The times were turbulent and this is reflected in much of the material published during the 165 years from 1475 to 1640.
We shall therefore look at the behaviour of two types of data:
a) The occurrence of words which are known to be of historical import at the time, e.g. "Spain", "Spanish" and "Spaniard"
b) Larger proportions, e.g. percentage of books described as being on "science" or "geography".
3.2 Methodology
As we have shown that the total number of books published increases quadratically, it is important to model the proportion of books published in each year that fall into our categories, rather than the raw number. Because proportions are bounded between 0 and 1, they need to be analysed using logistic regression, rather than the more usual simple linear regression. The change-point methodology of Fortier et al. [2] can be modified to allow us to investigate where the major changes in publishing took place.
Change points are identified by considering the deviance (a measure of how far the data diverge from the fitted model) obtained when fitting models with different change points. Each year in turn is considered as a change point; the year with the minimum deviance is taken to be the best-fitting change point.
This methodology deals with a single change point; to investigate for more than one change point would require permutation of the data, or consideration of subsets, which we shall not cover here.
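The single-change-point scan described above can be sketched as follows. This is a minimal illustration, not the procedure used to obtain the results reported here. The counts are invented. The change point is assumed to enter as a simple step in the log-odds alongside a linear year term, which may differ from the form used by Fortier et al. [2]. The helper names solve, fit_deviance and scan_change_points are hypothetical.

```python
import math


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_deviance(X, hits, totals, iterations=25):
    """Fit a binomial logistic regression by Newton-Raphson; return its deviance."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, y, n in zip(X, hits, totals):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(p):
                grad[i] += (y - n * mu) * x[i]
                for j in range(p):
                    hess[i][j] += n * mu * (1.0 - mu) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    deviance = 0.0
    for x, y, n in zip(X, hits, totals):
        mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        if y > 0:
            deviance += 2.0 * y * math.log(y / (n * mu))
        if y < n:
            deviance += 2.0 * (n - y) * math.log((n - y) / (n * (1.0 - mu)))
    return deviance


def scan_change_points(years, hits, totals):
    """Try each year as a change point; return the no-change deviance and the best year."""
    centre = sum(years) / len(years)
    t = [(yr - centre) / 10.0 for yr in years]  # centred and scaled for numerical stability
    base = fit_deviance([[1.0, ti] for ti in t], hits, totals)
    deviances = {}
    for c in years[1:]:  # skipping the first year keeps the step column non-constant
        X = [[1.0, ti, 1.0 if yr >= c else 0.0] for yr, ti in zip(years, t)]
        deviances[c] = fit_deviance(X, hits, totals)
    best = min(deviances, key=deviances.get)
    return base, best, deviances[best]


# Invented data: 200 books a year, with the topic's share rising from 1610.
years = list(range(1600, 1620))
totals = [200] * len(years)
hits = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4, 12, 11, 13, 12, 12, 13, 11, 12, 13, 12]

base, best, best_dev = scan_change_points(years, hits, totals)
print(f"no change point: deviance {base:.2f}")
print(f"best change point: {best}, deviance {best_dev:.2f}")
```

With real search results, hits would be the yearly count of matching records and totals the yearly count of all records. Years with no matching records on one side of a candidate change point can make the fit diverge, which this sketch does not guard against.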
3.3 Results
We shall consider three questions in this abstract: texts that deal with Spain, texts concerning geography, and texts deemed to be about science.
3.3.1 "Spanish texts"
During much of the STC period, England's relations with Spain were hostile. The major event was the invasion attempted in 1588 by the Armada. The proposed marriage of Prince Charles and the Infanta (1620s) also provoked many anti-Spanish publications. These things are widely known, and there is no need for DEBORAH to confirm them. The interest, rather, is in testing DEBORAH and in seeing what statistical modelling can do with the search results.
There are very few DEBORAH records concerning Spain prior to 1554. In 1554 and 1555 there are seven and thirteen books respectively, then very few until the 1570s, when there are a steady 3-7 books per year. The years surrounding the Spanish Armada, between 1587 and 1592, show a big increase, as do those between 1596 and 1602. Another peak in the numbers of "Spain" books is strongly visible between 1619 and 1630, although, as there were many more books published overall, the percentage of the total is lower than for the Armada period.
Fitting the model described above to the proportion of books which have "Spain", "Spanish" or "Spaniard(s)" in the title or in the abstract results in a minimum deviance of 362.7 in 1604. This compares with a deviance of 649.5 from fitting a model without a change point, a significant decrease in deviance.
3.3.2: Books about Geography
The Renaissance was a great period of geographical discovery, and was indeed partly a response to it. To examine the effect on publishing, we have searched for records whose faculty (i.e. broadest subject) headings include "geography".
Initial inspection of the data reveals that there are some dates in the early sixteenth century with a large proportion of books on geography. These large proportions are often due to a small number of books being published in total in the year, and also coincide with voyages of discovery. In particular, a large spike in 1555 can be explained by the publication of several books-in-one relating to the discoveries of the New World, the Pacific, the North-East and North-West passages. Earlier publications were about voyages and journeys to Palestine, and to kingdoms in Asia, China and various parts of the Middle East.
Fitting the change-point logistic model to this data reveals that there are several local minima in the deviances. These coincide with the peaks in the early sixteenth century. The global minimum is found to be in 1585, with a deviance of 237.0, compared with 271.95 for a model with no change point. Consultation of the data reveals that 1585 is the start of general publishing in geography; before this date there are sporadic publications, but it is only after this date that books on geography appear consistently.
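The reported deviances for the Spain and geography searches can be put side by side with a few lines of arithmetic. The sketch below only computes the absolute and relative reductions (about 286.8, or roughly 44%, for Spain; about 34.95, or roughly 13%, for geography). It is not a significance test, since the reference distribution for a scanned change point is not given here, and the function name deviance_drop is hypothetical.

```python
# Deviances as reported in the text: (no change point, best change point, year).
reported = {
    "Spain": (649.5, 362.7, 1604),
    "geography": (271.95, 237.0, 1585),
}


def deviance_drop(no_change, with_change):
    """Absolute and relative reduction in deviance from adding a change point."""
    drop = no_change - with_change
    return drop, drop / no_change


for topic, (d0, d1, year) in reported.items():
    drop, share = deviance_drop(d0, d1)
    print(f"{topic}: change point {year}, drop {drop:.2f} "
          f"({share:.1%} of the no-change deviance)")
```

The much smaller relative reduction for geography is in keeping with the several local minima noted above.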
3.3.3: Books about Science
Science is an anomalous term when applied to STC books, if we have twentieth-century notions in mind. Science for an Elizabethan might be proto-modern, as in say Harvey on the circulation of the blood, but it might equally be archaic as in astrology, exorcism, or prognostication. These two kinds of knowledge coexisted, with surprisingly little discomfort, even in the most sophisticated minds. DEBORAH does provide a way of separating "science" into the two components, but our first experiments have left "science" as a compound term.
Initial inspection of the data reveals that books on "science" make up 3-5% of the books published each year, with one exception. In 1603, about 17% of books published attract the faculty heading "Science". An explanation may be the long-imminent death of Elizabeth I, and anxieties about the succession; another important consideration was the heightened anxiety of an especially horrible plague year. Much of the increase can be accounted for by science publications of the "occult" variety, in one form or another. As this is such an extreme observation statistically, we will remove it from the data before carrying out the change-point analysis.
Fitting the model reveals that there is no significant change point in the publication of science texts. Although a model with a change point fits better than one using year alone, it is not significantly better, and so we conclude that a model that takes only the year into account is acceptable. There appears to be no significant change in the publishing of science books during this period. But searching DEBORAH again, separating the one "science" component from the other, may be expected to show variation in both cases.
4: Conclusions
We have used statistical techniques to investigate aspects of publishing history using data from the new online version of DEBORAH, incorporating RSTC, and now being prepared for use via the Internet. We have shown that the number of books published between 1475 and 1640 increased quadratically, with little regard for the imposition of censorship.
Examining the texts to do with Spain revealed a change point at 1604, while texts about geography were seen to be consistently published from 1585. The sensitivity of the deviance minima in the geography books to isolated data points prompted the removal of the 1603 data from the "science" books, where no significant change point was found after this outlier was removed.
We have illustrated the kind of quantitative questions that can be asked of the DEBORAH database, and a novel method of statistical analysis that can be applied to the results. The DEBORAH project offers the researcher an unprecedented opportunity to examine, both quantitatively and qualitatively, the publishing history of this period, and will no doubt inspire a large body of work.
1. Corns, T. N. (1986) Publication and Politics, 1640-1661: An SPSS-based account of the Thomason Collection of Civil War Tracts. Literary and Linguistic Computing 1(2): pp. 74-84.
2. Fortier, P. A., Keen, K. J. and Fortier, J. (1997) Change Points: Ageing and Content Words in a Large Database. Literary and Linguistic Computing 12(1): pp. 15-22.

