Duquesne University
Authorship Attribution (Juola, in press) can be defined
as the inference of the author or her characteristics by
examining documents produced by that person. It is of course
fundamental to the humanities; the more we know about a
person’s writings, the more we know about the person and
vice versa. It is also a very difficult task. Recent advances in
corpus linguistics have shown that it is possible to do this task
automatically by computational statistics.
Unfortunately, the statistics necessary for performing this task
can be onerous and mathematically formidable. For example, a
commonly used analysis method, Principal Component Analysis
(PCA), requires the calculation of “the eigenvectors of the
covariance matrix with the largest eigenvalues,” a phrase not
easily distinguishable from Star Trek technobabble. In previous
work (Juola, 2004; Juola et al., 2006) we have proposed a model
and a software system to hide many of the details from
non-specialists, while specifically being modularly adaptable
to incorporate new methods and technical improvements.
This system uses a three-phase framework to canonicize
documents, create an event set, and then apply inferential
statistics to determine the most likely author. Because of the
modular nature of the system, it is relatively easy to add new
components.
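The three-phase framework described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical and are not JGAAP's actual API, the canonicizer is simple case and whitespace neutralization, the event set is relative word frequencies, and the analysis step is a nearest-neighbour comparison using Manhattan (L1) histogram distance.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical three-phase pipeline; names are illustrative, not JGAAP's API. */
class AttributionPipelineSketch {

    /** Phase 1, canonicization: neutralize case and collapse whitespace. */
    static String canonicize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    /** Phase 2, event set: relative frequency of each space-delimited word. */
    static Map<String, Double> wordFrequencies(String canonicalText) {
        Map<String, Double> freq = new HashMap<>();
        for (String word : canonicalText.split(" ")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1.0, Double::sum);
            }
        }
        double total = freq.values().stream().mapToDouble(Double::doubleValue).sum();
        freq.replaceAll((word, count) -> count / total);
        return freq;
    }

    /** Manhattan (L1) distance between two frequency histograms. */
    static double manhattan(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double distance = 0.0;
        for (String key : keys) {
            distance += Math.abs(a.getOrDefault(key, 0.0) - b.getOrDefault(key, 0.0));
        }
        return distance;
    }

    /** Phase 3, analysis: return the author whose sample is closest to the unknown text. */
    static String attribute(Map<String, String> knownByAuthor, String unknown) {
        Map<String, Double> target = wordFrequencies(canonicize(unknown));
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, String> entry : knownByAuthor.entrySet()) {
            double d = manhattan(wordFrequencies(canonicize(entry.getValue())), target);
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy samples invented for this sketch.
        Map<String, String> known = new LinkedHashMap<>();
        known.put("Author A", "the cat sat on the mat and the cat slept");
        known.put("Author B", "whereas a dog may bark a dog may also run");
        System.out.println(attribute(known, "The cat sat on the mat")); // prints "Author A"
    }
}
```

Because each phase is a separate method with a plain input and output, any one of them could be swapped for another canonicizer, event set, or distance without touching the other two.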
We now report (and demonstrate) the recent improvements.
Version 3.0 of the JGAAP (Java Graphical Authorship Attribution
Program) system incorporates over fifty different methods
with a GUI allowing easy user selection of the appropriate
ones to use. Included are some of the more popular and/or
well-performing methods such as Burrows’ function word
PCA (Burrows, 1989), Burrows’ Delta (Burrows, 2002; Hoover, 2004a,
2004b), Juola’s cross-entropy (2004), and Linear Discriminant
Analysis (Baayen et al., 2002). The user can also mix and
match components to produce new methods of analysis; for
example, applying PCA (following Burrows), but to an entirely
different set of words, such as all the adjectives in a document
as opposed to all the function words.
With the current user interface, each of these phases is
independently selectable by the user via a set of tabbed radio
buttons. The user first defines the document set of interest,
then selects any necessary canonicization and pre-processing,
such as case neutralization and/or stripping HTML markup
from the documents. The user then selects a particular event
set, such as characters, words, character or word N-grams, the
K most common words/characters in the document, part of
speech tags, or even simple word/sentence lengths. Finally, the
user selects an analysis method such as PCA, LDA, histogram
distance using a variety of metrics, or cross-entropy.
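Several of the event sets named above are simple to state in code. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical rather than JGAAP's own, the text is assumed to be already canonicized, and ties among equally common events are broken alphabetically, which is a choice made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical event-set extractors; names are illustrative, not JGAAP's API. */
class EventSetSketch {

    /** Overlapping character N-grams, e.g. "abcab" with n=2 gives ab, bc, ca, ab. */
    static List<String> characterNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Overlapping word N-grams over whitespace-delimited words. */
    static List<String> wordNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    /** The K most common events, most frequent first; ties broken alphabetically. */
    static List<String> mostCommon(List<String> events, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String event : events) {
            counts.merge(event, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Long.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(characterNGrams("abcab", 2));              // [ab, bc, ca, ab]
        System.out.println(wordNGrams("to be or not to be", 2));      // [to be, be or, or not, not to, to be]
        System.out.println(mostCommon(characterNGrams("abcab", 2), 1)); // [ab]
    }
}
```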
More importantly, the JGAAP framework can hide this
complexity from the user; users can select “standard” analysis
methods (such as “PCA on function words”) from a set of
menus, without needing to concern themselves with the
operational details and parameters. Most importantly of all,
the framework remains modular and easily modified; adding
new modules, event models, and analytic methods can be done
in minutes by Java programmers of only moderate skill. We will
demonstrate this by adding new capacity on the fly.
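To give a sense of what adding a module to a modular design can involve, here is a minimal sketch. It assumes a hypothetical plug-in interface and registry invented for this example (EventExtractor and REGISTRY are not JGAAP names), and it adds a word-length event set of the kind mentioned earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plug-in registry; the interface and names are assumptions for illustration. */
class ModularitySketch {

    /** Any event set only needs to turn a text into a list of events. */
    interface EventExtractor {
        List<String> extract(String text);
    }

    /** Extractors available for selection, keyed by a display name. */
    static final Map<String, EventExtractor> REGISTRY = new LinkedHashMap<>();

    static {
        // An existing module: whitespace-delimited words.
        REGISTRY.put("words", text -> Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // A new module added later: the length of each word, as a string event.
        REGISTRY.put("word lengths", text -> {
            List<String> lengths = new ArrayList<>();
            for (String word : text.trim().split("\\s+")) {
                lengths.add(Integer.toString(word.length()));
            }
            return lengths;
        });
        System.out.println(REGISTRY.get("word lengths").extract("a short example")); // [1, 5, 7]
    }
}
```

In a design like this, the new event set needs no changes to the canonicization or analysis code, which is the property the modular framework relies on.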
Perhaps most importantly, we submit that the software
has achieved a level of functionality and stability sufficient
to make it useful to interested non-specialists. Like the
Delta spreadsheet (Hoover, 2005), JGAAP provides general
support for authorship attribution. It goes beyond the Delta
spreadsheet in the variety of methods it provides. It has also
been tested (using the University of Wisconsin-Madison NMI Build-and-Test
suite) and operates successfully on a very wide range
of platforms. By incorporating many methods that can be run side by side,
it also encourages the use of multiple methods, a technique
(often called “mixture of experts”) that has been shown to
be more accurate than reliance on any single technique (Juola,
2008).
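One simple way to combine several methods is a plurality vote over their individual answers. The sketch below assumes that scheme purely for illustration; the abstract does not say how results should be combined, the class and method names are hypothetical, and the alphabetical tie-break is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical combiner: plurality vote over the authors named by several methods. */
class MixtureOfExpertsSketch {

    /**
     * Returns the author named most often, or null if there are no predictions.
     * Ties go to the alphabetically first author because the TreeMap iterates
     * in sorted key order and only a strictly larger count replaces the leader.
     */
    static String pluralityVote(List<String> predictions) {
        Map<String, Integer> votes = new TreeMap<>();
        for (String author : predictions) {
            votes.merge(author, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented answers from three different analysis methods on one document.
        List<String> predictions = Arrays.asList("Author A", "Author B", "Author A");
        System.out.println(pluralityVote(predictions)); // prints "Author A"
    }
}
```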
Of course, the software is not complete and we hope to
demonstrate some of its weaknesses as well. The user interface
is not as clear or intuitive as we hope eventually to achieve,
and we invite suggestions and comments for improvement. As
the name suggests, the software is written in Java, and while
Java programs are not as slow as is sometimes believed, the
program is nevertheless not speed-optimized and can take a
long time to perform its analysis. Analysis of large documents
(novels or multiple novels) can exhaust the computer’s
memory. Finally, no authorship attribution program can offer
a complete survey of the methods proposed in the literature, and we invite
suggestions about additional methods to incorporate.
Despite these weaknesses, we nevertheless feel that the
new version of JGAAP is a useful and reliable tool, that the
community at large can benefit from its use, and that the
development of this tool can similarly benefit from community
feedback.
References
Baayen, Harald et al. (2002). “An experiment in authorship
attribution.” Proceedings of JADT 2002.
Burrows, John F. (1989). “‘An Ocean where each Kind...’:
Statistical Analysis and Some Major Determinants of Literary
Style.” Computers and the Humanities, 23:309-21.
Burrows, John F. (2002). “Delta: A Measure of Stylistic
Difference and a Guide to Likely Authorship.” Literary and
Linguistic Computing, 17:267-87.
Hoover, David L. (2004a). “Testing Burrows’s Delta.” Literary
and Linguistic Computing, 19:453-75.
Hoover, David L. (2004b). “Delta Prime?” Literary and Linguistic
Computing, 19:477-95.
Hoover, David L. (2005). “The Delta Spreadsheet.” ACH/ALLC
2005 Conference Abstracts. Victoria: University of Victoria
Humanities Computing and Media Centre, pp. 85-86.
Juola, Patrick. (2004). “On Composership Attribution.” ALLC/
ACH 2004 Conference Abstracts. Gothenburg: University of
Gothenburg.
Juola, Patrick. (2008). “Authorship Attribution: What Mixture-of-Experts
Says We Don’t Yet Know.” Presented at American
Association of Corpus Linguistics 2008.
Juola, Patrick. (in press). Authorship Attribution. Delft: NOW
Publishing.
Juola, Patrick, John Sofko, and Patrick Brennan. (2006). “A
Prototype for Authorship Attribution Studies.” Literary and
Linguistic Computing, 21:169-78.
Hosted at University of Oulu
Oulu, Finland
June 25, 2008 - June 29, 2008
Conference website: http://www.ekl.oulu.fi/dh2008/
Series: ADHO (3)
Organizers: ADHO