JGAAP 4.0 - A Revised Authorship Attribution Tool

poster / demo / art installation
Authorship
  1. Patrick Juola

    Duquesne University

  2. John Noecker Jr.

    Duquesne University

  3. Michael Ryan

    Duquesne University

  4. Sandy Speer

    Duquesne University

Work text

Authorship attribution (Juola, 2006) can be defined
as the inference of an author's identity or characteristics
from documents produced by that person. For
some time, we have been working on a system (JGAAP,
the Java Graphical Authorship Attribution Program) that
uses advanced statistics to perform this task without demanding
a high degree of expertise from the user (Juola
et al., 2008). With the recent release of JGAAP 3.2 and
the planned near-term release of JGAAP 4.0, we are finally
confident that we have a production-quality system
for general-purpose use.
We now report (and demonstrate) these recent improvements.
JGAAP now incorporates nearly 20 different analytic
methods (including eight different distance-based
nearest-neighbor algorithms), more than 20 different
event sets and models ranging from character- and word-based
N-grams to reaction times, and several different
preprocessors that handle a wide variety of
document types, including remote (Web-accessible) files
and text extracted from different formats. We estimate
that JGAAP is capable of performing more than 20,000
different types of analysis for authorship attribution or
similar text classification tasks, with more being added
as development continues.
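The stages named above (preprocessing, event-set extraction, and analysis over the resulting event frequencies) can be illustrated with a minimal sketch. This is not JGAAP's code or API (JGAAP itself is a Java program); the function names `canonicize`, `char_ngrams`, `word_ngrams`, and `histogram` are hypothetical, and the specific preprocessing choices (lowercasing, whitespace collapsing) are assumptions made only for illustration.

```python
import re
from collections import Counter


def canonicize(text):
    """Preprocess: lowercase and collapse whitespace (illustrative choices only)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def char_ngrams(text, n):
    """Event set: every overlapping character n-gram in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Event set: every overlapping n-gram of whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def histogram(events):
    """Relative frequency of each event; this is what an analysis method consumes."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}


# The same document yields a different profile for each event set chosen.
doc = canonicize("The  cat sat.\nThe cat ran.")
char_profile = histogram(char_ngrams(doc, 2))
word_profile = histogram(word_ngrams(doc, 1))
```

Because each stage is an independent choice, swapping one preprocessor, event set, or analysis method produces a new type of analysis, which is presumably why the number of possible combinations grows so quickly.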
Other improvements include:
• GUI improvements to enhance user-friendliness
• Enhanced graphical output capabilities
• Full report generation capacity for scholarly inspection
of the results
• Creation of a command-line interface
• Automatic batch processing capacity for large-scale
comparative testing
• Incorporation of the AAAC (Juola, 2004) test corpus
into the demo for comparative testing purposes
• Dynamic loading of new methods to encourage new
development
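As a rough illustration of how pluggable methods and batch processing fit together, the sketch below registers event sets and distance functions by name and then runs every combination against the same data. It is a toy under stated assumptions, not JGAAP's implementation: the registries, the `register` decorator, `run_batch`, and the two stand-in distances (Jaccard and a vocabulary-size difference) are all hypothetical and do not come from the abstract.

```python
from itertools import product

# Hypothetical registries; new entries can be added without touching run_batch.
EVENT_SETS = {}
DISTANCES = {}


def register(registry, name):
    """Decorator that adds a function to a registry under the given name."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register(EVENT_SETS, "words")
def words(text):
    return text.lower().split()


@register(EVENT_SETS, "char_bigrams")
def char_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]


@register(DISTANCES, "jaccard")
def jaccard(a, b):
    # Toy stand-in distance: 1 minus the overlap of the two event vocabularies.
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0


@register(DISTANCES, "size_difference")
def size_difference(a, b):
    # Deliberately weak stand-in: difference in vocabulary size.
    return abs(len(set(a)) - len(set(b)))


def run_batch(known, unknown):
    """Attribute `unknown` under every registered (event set, distance) pair.

    `known` maps author name to a sample text; the result maps each
    (event set name, distance name) pair to the nearest author.
    """
    results = {}
    for (es_name, es), (d_name, dist) in product(EVENT_SETS.items(), DISTANCES.items()):
        target = es(unknown)
        results[(es_name, d_name)] = min(
            known, key=lambda author: dist(es(known[author]), target)
        )
    return results
```

Calling `run_batch({"A": "the cat sat on the mat", "B": "quantum flux density"}, "the cat on the mat")` returns one attribution per combination, which is the shape of output a comparative test needs.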
We are finally able to perform large-scale comparative
analyses of different processing methods. We include
here a short list of some JGAAP-related findings (published,
submitted, or in preparation):
• Introduction of a small number of character errors
(as exemplified by modern OCR systems) does not
substantially reduce accuracy with most methods.
• Symmetric (“commutative”) distance-based methods
tend to outperform asymmetric ones.
• Linear classifiers such as LDA tend to outperform
nonlinear classifiers despite the apparent oversimplicity
of the underlying model.
• Character-based methods tend to outperform word-based
ones for authorship attribution in Chinese.
• Both cosine distance (normalized dot product) and
simple event-based Kullback-Leibler divergence
tend to be the best-performing methods for distance-based
nearest-neighbor methods.
• The seminal word list of Mosteller and Wallace does
not generally perform well for texts other than the
Federalist Papers.
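The two distances singled out above, and the symmetric/asymmetric distinction, can be made concrete with a small sketch over event-frequency histograms. This is an illustration under stated assumptions, not the formulation used in JGAAP: the additive `epsilon` smoothing in the Kullback-Leibler function is an assumption added here to avoid taking the logarithm of zero, and the names `histogram`, `cosine_distance`, `kl_divergence`, and `nearest_author` are hypothetical.

```python
import math
from collections import Counter


def histogram(events):
    """Relative frequency of each event in a document."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}


def cosine_distance(p, q):
    """1 minus the normalized dot product; symmetric in p and q.

    Assumes both histograms are non-empty.
    """
    dot = sum(p[e] * q.get(e, 0.0) for e in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def kl_divergence(p, q, epsilon=1e-6):
    """KL(p || q) over the union of events; asymmetric in p and q.

    Assumption: epsilon is added to every probability and the result
    renormalized, so events missing from one document do not cause log(0).
    """
    events = set(p) | set(q)
    ps = {e: p.get(e, 0.0) + epsilon for e in events}
    qs = {e: q.get(e, 0.0) + epsilon for e in events}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[e] / zp) * math.log((ps[e] / zp) / (qs[e] / zq)) for e in events
    )


def nearest_author(unknown, known, distance):
    """Nearest-neighbor attribution: the author whose profile is closest."""
    return min(known, key=lambda author: distance(unknown, known[author]))


known = {
    "A": histogram("the cat sat on the mat".split()),
    "B": histogram("a dog ran in a park".split()),
}
unknown = histogram("the cat on the mat".split())
by_cosine = nearest_author(unknown, known, cosine_distance)
by_kl = nearest_author(unknown, known, kl_divergence)
```

Swapping the arguments of `cosine_distance` leaves the value unchanged, while swapping the arguments of `kl_divergence` generally does not, which is what "asymmetric" means in the list above.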
Some of our findings have been submitted under separate
cover to this conference, but we hope to present a summary
of major results that have been achieved by June
2009 along with a demonstration of the newest version
of the program. We also hope to provide examples of
the sorts of analyses that have been performed with JGAAP
(and invite cooperation from interested researchers for
further study).
Finally, we hope to demonstrate some example ad-hoc
analyses during the session; it should be possible, for example,
to demonstrate in less than ten minutes that “document length”
or “words that are palindromes” do not perform well as
event/feature sets. While this is perhaps
not interesting (no sensible person has proposed palindromes
for authorship attribution), it clearly illustrates
the ease of use and of result generation.
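To give a sense of how small such an ad-hoc event set can be, here is a minimal sketch of the two examples mentioned. It is not taken from JGAAP; the function names and the choices to ignore single-letter words and to tokenize on runs of ASCII letters are assumptions made for illustration.

```python
import re


def palindrome_events(text, min_length=2):
    """Event set: words that read the same forwards and backwards.

    Assumptions: words are runs of ASCII letters, case is ignored, and
    single-letter words are skipped because they are trivially palindromes.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_length and w == w[::-1]]


def document_length(text):
    """Feature: number of whitespace-separated words in the document."""
    return len(text.split())


sample = "Anna saw a kayak at noon"
events = palindrome_events(sample)   # ['anna', 'kayak', 'noon']
length = document_length(sample)     # 6
```

Most ordinary prose yields few or no palindrome events, which gives an intuition for why such a feature set would carry little information about authorship.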


Conference Info


ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO
