Authorship Attribution (Juola, 2008) can be defined as the inference of the author, or of the author’s characteristics, from documents produced by that person. It is of course fundamental to the humanities; the more we know about a person’s writings, the more we know about the person, and vice versa. It is also a very difficult task. Recent advances in corpus linguistics have shown that it is possible to perform this task automatically using computational statistics.
A key question, however, in any statistical study (not
just of text statistics) is whether the data or methods will
transfer from one domain to another. Statistical analyses
hinge on assumptions which may or may not be met
by different languages, and of course, data which is representative
of one domain is highly unrepresentative of
any other, by definition. A method that performs well in
one area may fail miserably in another. As an example,
a part-of-speech tagger with 96% accuracy on newswire
data may achieve only 50% on chat logs. (Craig Martell,
p.c., 2008)
In light of this finding, cross-problem transference can be a major concern for statistical authorship attribution.
Should we expect an authorship attribution system that
performs well on, say, English, to also perform well on
Dutch, Serbian, or Chinese? Alternatively, is it reasonable
for a scholar of, say, Polish poetry to have confidence
in methods that have been tested to perform well,
but only on English documents? This is, of course, a major concern, especially for problems of “forensic” interest, where the accuracy rate is one of the chief considerations in the evaluation, or even the admissibility, of evidence.
The JGAAP software framework (Juola et al., 2006), in conjunction with the Ad-hoc Authorship Attribution Competition (AAAC) corpus (Juola, 2004), provides us with some preliminary results. JGAAP (described elsewhere in this
volume) is a modular, Java-based program capable of
performing thousands of different types of authorship attribution
methods on a well-defined corpus. The AAAC
comprises 13 authorship attribution problems in a variety
of languages and genres. This setup makes large-scale
performance comparisons among categories practical.
For example, 8 of the 13 AAAC problems involved English
text (in some form or another), but 5 involved other
languages such as French, Dutch, Latin, or Serbian/Slavonic.
If authorship attribution did not transfer well,
we would expect to see little correlation between the
average performance of a method on English texts and
its performance in other languages, as high-performing
methods would not necessarily remain high-performing
in other environments. Conversely, if we see a high degree
of correlation across languages, this argues for a
high degree of transfer.
As part of some other large-scale technical comparisons
(this volume), we have gathered 281 separate analyses of
the AAAC data using a variety of preprocessors, event
set models, and analytic methods, ranging from simple
lexical statistics or nearest neighbor histogram measures
to complex machine learning models such as support
vector machines on word or character n-grams of
various sizes. In this database, the correlation between a method’s average performance on the English problems and its average performance on the non-English problems was 0.6680, a highly significant (p < 0.0001)
result. More tellingly, the coefficient of determination
(r^2) was 0.4462, meaning that approximately 45% of
the variation in performance of an algorithm across non-
English data could be explained simply by a measure
of its performance on English-only data; variations in genre, register, or even variations across the broad category of languages-that-aren’t-English account, in total, for only a little more of the variation. (We should add that work is ongoing
and the 281 analyses will undoubtedly be expanded
in the next several months; these numbers are therefore
only preliminary and updated results will be presented.)
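To make the comparison concrete, the following sketch shows how such a per-method correlation and coefficient of determination can be computed. It is purely illustrative, not the code used for the study, and the accuracy figures in it are invented for demonstration.

# Illustrative sketch (Python): Pearson correlation between each method's
# average accuracy on the English problems and on the non-English problems.
# The numbers below are hypothetical, not AAAC results.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One entry per analysis/method: mean accuracy over the English problems and
# mean accuracy over the non-English problems (hypothetical values).
english = [0.72, 0.55, 0.81, 0.40, 0.63]
non_english = [0.65, 0.50, 0.70, 0.45, 0.58]

r = pearson_r(english, non_english)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")  # r^2: share of variance explained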
Similarly, we can divide the AAAC problems into
“large” problems (those with training samples of more
than 100,000 characters each) and “small” ones. (Of the
8 English problems, 4 were large; of the 5 non-English problems, 3 were large.) Across the same 281 analyses, the correlation between performance on large problems and performance on small problems was again highly significant (r = 0.6061, r^2 = 0.3674), meaning that almost 37% of the variation in performance on problems of one size was explained by performance on problems of the other size.
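The large/small comparison can be sketched in the same way. Again, this is only an illustration under assumed data, with invented problem sizes and accuracies, using Python’s standard-library correlation function (Python 3.10+).

# Illustrative only: classify hypothetical problems as "large" (training
# samples over 100,000 characters each) or "small", then correlate each
# method's average accuracy across the two groups.
from statistics import correlation, mean  # correlation() requires Python 3.10+

LARGE_THRESHOLD = 100_000  # characters per training sample

problem_size = {"A": 250_000, "B": 40_000, "C": 120_000, "D": 15_000}
accuracy = {  # accuracy[method][problem], hypothetical values
    "method1": {"A": 0.80, "B": 0.55, "C": 0.75, "D": 0.50},
    "method2": {"A": 0.60, "B": 0.45, "C": 0.65, "D": 0.40},
    "method3": {"A": 0.90, "B": 0.70, "C": 0.85, "D": 0.60},
}

large = [p for p, size in problem_size.items() if size > LARGE_THRESHOLD]
small = [p for p in problem_size if p not in large]

large_avgs = [mean(scores[p] for p in large) for scores in accuracy.values()]
small_avgs = [mean(scores[p] for p in small) for scores in accuracy.values()]

r = correlation(large_avgs, small_avgs)
print(f"large vs. small: r = {r:.4f}, r^2 = {r * r:.4f}")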
This provides strong evidence, then, that good algorithms for authorship attribution in one domain will also be good algorithms in other domains. In particular, we have hopes that a set of best practices established by examining one particular data set will be at least a set of “good” practices in other domains, and can serve as a useful starting point in the search for domain-specific best practices in less studied or novel domains. Unfortunately,
the way that the AAAC data was structured
prevents direct comparisons of accuracy (although it is
hard to imagine ways to establish that two authorship attribution
tasks are “comparably difficult” to enable such
direct comparisons). Of course, at one level, one could
make the argument that a bad algorithm for English
should not be expected miraculously to perform better
when transferred to a language the original designer cannot
speak or read. But it is still encouraging to find that
a good algorithm for English can be expected to perform
well in that same unknown language.
References
Juola, Patrick. (2004). “Ad-Hoc Authorship Attribution Competition.” ALLC/ACH 2004 Conference Abstracts. Gothenburg: University of Gothenburg.
Juola, Patrick. (2008). Authorship Attribution. Foundations and Trends in Information Retrieval 1:3. Delft: NOW Publishing.
Juola, Patrick, John Sofko, and Patrick Brennan. (2006). “A Prototype for Authorship Attribution Studies.” Literary and Linguistic Computing 21:169-78.