Some Properties of the Univariate Linear Modelling Approach to Authorship Questions

  1. Roy Felton

    Manukau Institute of Technology


Style data may be modelled in various ways. Pre-modern textual authorship problems are particularly difficult, in that it is impossible to obtain an exhaustive list of potential authors. Moreover, in ancient times authorship was a more flexible notion, in that works by disciples could legitimately be attributed to their master. In addition, a busy person would often employ an amanuensis to fill out the details of a partly completed work, or indeed to construct a work under some given guidelines.
Linear models, which enable various sources of variation to be separated out, are a powerful and flexible method of testing disputed authorship.[1] The simplest such model is:
y_{ij} = m + a_i + e_{ij},
i = 1, 2, ..., I texts and
j = 1, 2, ..., J chunks of text;
where the usual error assumptions may be taken to hold. A feature of applying this linear model approach to texts is that the text effect a_i is assumed to be random, even though only a limited number of the author's texts are extant; it may be reasoned that the author was potentially capable of producing many other texts. (The text effect builds into the model the possibility that an author's style is constant within a given text but differs from text to text. Such an effect has been observed in authors whose works span a considerable period of time: average word length, for example, has sometimes been observed to change in a consistent direction over time.) Other factors, either fixed or random, may be incorporated into the model, such as genre, subject matter and language. If the text effect is considered to be random, then the condition that the undisputed texts be homogeneous in style with respect to the characteristics measured may be relaxed. This means that style characteristics (such as word length, vocabulary measures, or usage of particular key words) which could alter between an author's works, but which remain constant within a work, may be utilised as well as those that remain constant across all works. Furthermore, schools of writers, amanuenses, or other convenient groupings of authors may be accommodated in a similar fashion with this model.
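The simplest model can be made concrete with a short simulation. The following is a minimal sketch only: it assumes that the "usual error assumptions" mean independent, zero-mean normal a_i and e_ij, and it uses the textbook one-way analysis of variance (method of moments) estimates for balanced data. The function names and parameter names are illustrative and do not come from the paper.

```python
import random
import statistics


def simulate_texts(n_texts, n_chunks, m=0.0, sd_text=1.0, sd_error=1.0, rng=None):
    """Draw y[i][j] = m + a_i + e_ij for I = n_texts texts and J = n_chunks chunks.

    Assumption: a_i ~ N(0, sd_text^2) and e_ij ~ N(0, sd_error^2), all independent.
    """
    rng = rng or random.Random()
    data = []
    for _ in range(n_texts):
        a_i = rng.gauss(0.0, sd_text)  # random text effect, shared by every chunk of text i
        data.append([m + a_i + rng.gauss(0.0, sd_error) for _ in range(n_chunks)])
    return data


def variance_components(data):
    """One-way ANOVA estimates for balanced data (needs >= 2 texts and >= 2 chunks)."""
    n_chunks = len(data[0])
    text_means = [statistics.fmean(row) for row in data]
    # Within-text mean square: average of the per-text sample variances.
    ms_within = statistics.fmean([statistics.variance(row) for row in data])
    # Between-text mean square: J times the sample variance of the text means.
    ms_between = n_chunks * statistics.variance(text_means)
    # E[ms_between] = var_error + J * var_text, so solve for var_text (floored at 0).
    var_text = max(0.0, (ms_between - ms_within) / n_chunks)
    return {"mean": statistics.fmean(text_means),
            "var_text": var_text,
            "var_error": ms_within}
```

For example, `variance_components(simulate_texts(200, 50, m=5.0, sd_text=2.0))` should return estimates near the values used to generate the data, which shows how the between-text and within-text sources of variation are separated.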
The null hypothesis assumes that all texts satisfy a linear model; the alternative hypothesis assumes that the undisputed texts satisfy a linear model and that the disputed texts satisfy the same model but with a location shift. A contrast of the form (mean of the means of the undisputed texts) minus (mean of the mean(s) of the disputed texts) may be used to test this hypothesis. An appropriate test statistic is formed when this contrast is divided by an estimate of its standard error. The distributional properties of such a test statistic are not easy to obtain. However, critical values may be established by simulation for a wide range of parameter combinations, style characteristics and error structures. The parameters could be, for example, the mean m or the variances of the random effects, which in the simplest model are the variances of a_i and e_ij. Possible error structures include the standard one, in which the a_i and e_ij are uncorrelated within themselves and between themselves; first order serial autocorrelation within a text, denoted by r; similar autocorrelation between texts, denoted by r_A; or both forms of autocorrelation together. Both r and r_A would be examples of model parameters. Progress has been made in deriving the test statistic distributions for the simplest linear model.
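The contrast and its simulated critical value can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the paper: the standard error is estimated from the sample variance of the undisputed text means alone (one plausible choice when there may be only a single disputed text), the null data are normal with the standard uncorrelated error structure, and all function names are hypothetical.

```python
import random
import statistics


def contrast_statistic(undisputed, disputed):
    """(Mean of undisputed text means - mean of disputed text means) / estimated SE.

    Each argument is a list of texts; each text is a list of chunk measurements.
    Assumption: the SE uses only the variance of the undisputed text means.
    """
    u_means = [statistics.fmean(text) for text in undisputed]
    d_means = [statistics.fmean(text) for text in disputed]
    contrast = statistics.fmean(u_means) - statistics.fmean(d_means)
    s2 = statistics.variance(u_means)  # needs at least two undisputed texts
    se = (s2 * (1 / len(u_means) + 1 / len(d_means))) ** 0.5
    return contrast / se


def simulate_critical_value(n_undisputed, n_disputed, n_chunks, sd_text, sd_error,
                            alpha=0.05, n_sims=2000, seed=0):
    """Two-sided critical value for |statistic| under the null (no location shift)."""
    rng = random.Random(seed)

    def draw(n_texts):
        # The mean m cancels in the contrast, so it is set to zero here.
        texts = []
        for _ in range(n_texts):
            a_i = rng.gauss(0.0, sd_text)
            texts.append([a_i + rng.gauss(0.0, sd_error) for _ in range(n_chunks)])
        return texts

    stats = sorted(abs(contrast_statistic(draw(n_undisputed), draw(n_disputed)))
                   for _ in range(n_sims))
    # Empirical (1 - alpha) quantile of the simulated null statistics.
    return stats[min(n_sims - 1, int((1 - alpha) * n_sims))]
```

Other error structures or non-normal style variables would be handled by replacing the `draw` helper with a generator for the structure of interest; the rest of the sketch is unchanged.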
This approach is particularly relevant to the situation where a corpus of writings has traditionally been grouped together but recent critical work has called the authorship of part of that corpus into question. Since the text effect is taken to be random, confidence interval bounds for the variation in style of a particular author may be established on the basis of that writer's undisputed extant works.
Clearly, the power of the linear model approach increases as the number of undisputed texts and/or their lengths increase. However, because of practical textual limits, maximum power usually cannot be increased to any significant degree by the simple artifice of using more words. Although power can be altered for the same total number of words in the undisputed texts by altering I and J, this is not a particularly fruitful approach, as even greater power is achieved by taking all the available undisputed text.
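The dependence of power on I and J can be explored with a small simulation. The sketch below is illustrative only and rests on several assumptions not taken from the paper: normal random effects with the standard error structure, a single disputed text by default, a standard error based on the undisputed text means, and hypothetical function names. Text means are drawn directly, since under these assumptions each is normal with variance sd_text^2 + sd_error^2 / J.

```python
import random
import statistics


def draw_text_means(n_texts, n_chunks, mean, sd_text, sd_error, rng):
    """Text means under y_ij = m + a_i + e_ij with J = n_chunks chunks per text."""
    sd_chunk_mean = sd_error / n_chunks ** 0.5  # sd of the average of J errors
    return [mean + rng.gauss(0.0, sd_text) + rng.gauss(0.0, sd_chunk_mean)
            for _ in range(n_texts)]


def abs_statistic(u_means, d_means):
    """Absolute contrast of text means divided by its estimated standard error."""
    contrast = statistics.fmean(u_means) - statistics.fmean(d_means)
    se = (statistics.variance(u_means) * (1 / len(u_means) + 1 / len(d_means))) ** 0.5
    return abs(contrast / se)


def estimate_power(n_undisputed, n_chunks, shift, sd_text=1.0, sd_error=1.0,
                   n_disputed=1, alpha=0.05, n_sims=2000, seed=0):
    """Simulated power against a location shift (needs n_undisputed >= 2)."""
    rng = random.Random(seed)

    def one_stat(disputed_shift):
        u = draw_text_means(n_undisputed, n_chunks, 0.0, sd_text, sd_error, rng)
        d = draw_text_means(n_disputed, n_chunks, disputed_shift, sd_text, sd_error, rng)
        return abs_statistic(u, d)

    # Step 1: critical value from the simulated null distribution (no shift).
    null_stats = sorted(one_stat(0.0) for _ in range(n_sims))
    critical = null_stats[min(n_sims - 1, int((1 - alpha) * n_sims))]
    # Step 2: proportion of shifted data sets whose statistic exceeds it.
    rejections = sum(one_stat(shift) > critical for _ in range(n_sims))
    return rejections / n_sims
```

Comparing, say, `estimate_power(3, 10, shift=3.0)` with `estimate_power(20, 10, shift=3.0)` shows the gain from additional undisputed texts. In this sketch, increasing `n_chunks` reduces only the sd_error^2 / J part of the variance of a text mean; the sd_text^2 part is unaffected.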
One parameter under the investigator's control is chunksize. Simulation indicates that, for count data, power is greatest when chunksize is minimised; it decreases slowly until chunksize is about 25% of the text length, whereupon it decreases markedly. A binary model results when chunksize equals one word, for style variables measured at the lexical level. A heuristic argument and simulation both indicate that, for such a model, power is little altered by the presence of first order autocorrelation within a text. The situation is not so simple for larger chunksizes: for certain parameter combinations (within-text and between-text variability in particular, as well as the value and sign of the first order autocorrelation coefficient), maximum power is achieved at different chunksizes if the critical values are based on uncorrelated data. Similarly, as a generalisation, power is slightly increased in the presence of first order autocorrelation.
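The relationship between the binary model and count data at larger chunksizes can be illustrated as follows. This sketch only constructs the data; it does not reproduce the power results. It assumes that first order autocorrelation within a text is generated by a two-state Markov chain, which is one convenient way of obtaining a 0/1 sequence with a given marginal rate p and lag-one autocorrelation r. The function names are hypothetical.

```python
import random


def markov_binary(n_words, p, r, rng):
    """0/1 indicator per word with marginal rate p and lag-one autocorrelation r.

    Assumption: a stationary two-state Markov chain with
    P(1 | previous 1) = p + r * (1 - p) and P(1 | previous 0) = p * (1 - r).
    """
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    p_after_1 = p + r * (1 - p)
    p_after_0 = p * (1 - r)
    if not (0.0 <= p_after_1 <= 1.0 and 0.0 <= p_after_0 <= 1.0):
        raise ValueError("p and r do not give valid transition probabilities")
    seq = [1 if rng.random() < p else 0]  # first word drawn from the stationary rate
    for _ in range(n_words - 1):
        prob = p_after_1 if seq[-1] == 1 else p_after_0
        seq.append(1 if rng.random() < prob else 0)
    return seq


def chunk_counts(seq, chunksize):
    """Counts per non-overlapping chunk; a trailing incomplete chunk is dropped."""
    n_chunks = len(seq) // chunksize
    return [sum(seq[k * chunksize:(k + 1) * chunksize]) for k in range(n_chunks)]
```

With `chunksize=1` the chunk values are the binary sequence itself, which is the binary model; larger chunksizes turn the same text into fewer, more aggregated counts.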
This linear model approach specifies the composition of the two clusters under the alternative hypothesis (the undisputed and the disputed texts), whereas cluster analysis, whilst perhaps specifying the number of clusters under the alternative hypothesis, leaves the composition of the clusters open. The extra specificity of the alternative hypothesis makes the linear model approach in general more powerful than cluster analysis. However, some cluster analysis methods may be modified to incorporate information about which texts are disputed. Under the linear model, the direction of the location shift may also be incorporated into the alternative hypothesis.
When discriminant analysis is applied to author attribution problems, it is assumed that all possible authors are known.[2] Any disputed texts are assigned to the author with the "closest" style. However, it may be that the closest is not close enough. This may be decided by applying the linear model approach, provided each potential author has some extant undisputed texts. The disputed texts are compared with each author in turn, and only those authors for whom the null hypothesis is not rejected are considered as candidates. Again, texts would be assigned to the author within this group with the closest style. If all authors were excluded, then the text would be assigned not to the closest author but to none. This appears to be a more realistic procedure, especially with ancient texts. However, even with more modern texts, it may be prudent to leave open the possibility of rogue authorship.
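The screening procedure just described can be sketched in a few lines. The sketch makes assumptions that the abstract does not specify: each text is summarised by its mean for a single style variable, "closest" is taken to mean the smallest absolute test statistic, and the critical values are supplied by the user (for instance from a simulation). The function name and argument names are hypothetical.

```python
import statistics


def screen_and_assign(disputed_means, authors, critical_values):
    """Return the closest author whose null hypothesis is not rejected, or None.

    disputed_means:  text means of the disputed text(s)
    authors:         dict mapping author name -> text means of undisputed texts
                     (at least two texts per author, not all identical)
    critical_values: dict mapping author name -> two-sided critical value
    """
    d_mean = statistics.fmean(disputed_means)
    candidates = {}
    for name, u_means in authors.items():
        s2 = statistics.variance(u_means)
        if s2 == 0.0:
            raise ValueError(f"no variation among the undisputed texts of {name}")
        contrast = statistics.fmean(u_means) - d_mean
        se = (s2 * (1 / len(u_means) + 1 / len(disputed_means))) ** 0.5
        stat = abs(contrast / se)
        if stat <= critical_values[name]:  # null not rejected: author stays a candidate
            candidates[name] = stat
    if not candidates:
        return None  # every known author is excluded
    # Assumption: "closest style" = smallest absolute statistic among candidates.
    return min(candidates, key=candidates.get)
```

Returning `None` rather than the nearest author is what distinguishes this procedure from a plain discriminant assignment.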
The paper will include the results of simulations (based on the binary model) showing the effect on power of changing chunksize for the linear model approach, with usual error structure and for a variety of other error structures. The results of comparing some cluster analysis methods with the linear model approach will be given. Some real life data will be analysed as an illustration. Linear modelling will be shown to be a powerful and flexible approach to author attribution. Multivariate analogues will be briefly introduced.
1. Felton, R. 'A New Procedure for Author Attribution'. ALLC-ACH'96 Abstracts, pp. 74-75.
2. Neumann, K. J. 'The Authenticity of the Pauline Epistles in the Light of Stylostatistical Analysis'. S.B.L. Dissertation Series no. 120. Atlanta: Scholars Press, 1990, pp. 218.


Conference Info

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC