How much of an effect do transcription errors in a
text document have on the ability to do useful statistical
analysis on that document? In order to perform
authorship attribution, it is often necessary to first have
a digital copy of the documents in question. The task
of authorship attribution is to assign an authorship tag
to a document of unknown origin based on statistical
analysis of the text and comparison with documents of
known authorship. This is often automated by means of
a computer, which necessitates the existence of digital
copies of all the works to be analyzed. The success rates
of optical character recognition (OCR) systems make
them an attractive choice for the creation of these digital
copies. The various documents can be scanned into a
computer and converted automatically to text. Rice et
al. (1996) documented per-character accuracy rates of
greater than 90% for nearly all commercial OCR systems
tested, most of which scored in the range of 95-98%.
More recent claims by OCR companies suggest that
accuracy rates are above 98%.
However, is a 5% or even 2% error rate acceptable when
creating a statistical authorship model? Is it necessary to
proofread each scanned document by hand before performing
authorship attribution, or is the error rate small
enough that it is unlikely to affect the overall result?
We intend to present new results showing that it is not
necessary to proofread scanned documents before using
them to perform statistical authorship attribution. In fact,
these results suggest that no significant performance
degradation occurs when analyzing documents with per-character
error rates of less than 15%. As this is well below
the published averages for OCR systems, there is
little need to worry about the few errors which will be
introduced during the automated image-to-text conversion.
Accuracy of study materials is of course crucial for
an ideal authorship attribution study. As Rudman (1997)
put it, “most non-traditional authorship attribution researchers
do not understand what constitutes a valid
study.” In 2002, he wrote, “do not include any dubitanda
—a certain and stylistically pure Defoe sample must be
established—all decisions must err on the side of exclusion.
If there can be no certain Defoe touchstone, there
can be no ... authorship attribution studies on his canon,
and no wide ranging stylistic studies.” Ideally, we would
have access to the original manuscripts to make sure that
what we have is the pure work, but in the real world,
we may not have such access. We argue, in contradiction
to Rudman, that by assessing the likely contribution of
types of error—such as errors introduced by bad OCR
technology—we can determine whether such errors are
likely to shake our confidence in our results. If we can
give 10:1 odds that a given paper was written by
Defoe, we will remain confident even if we learn that our
odds might be as low as 9.8:1 or 9.9:1.
For this experiment, we made use of the Java Graphical
Authorship Attribution Program (JGAAP, www.jgaap.com),
a freely available Java program for performing
authorship attribution created by Patrick Juola
of Duquesne University. This modular program breaks
the task of authorship attribution into three subtasks,
described as ‘Canonicization’, ‘Event Generation’, and
‘Statistical Analysis’. During the Canonicization step,
documents are standardized and various preprocessing
steps can occur. For this experiment, we created a Canonicization
method which randomly changed a percentage
of the characters in each document, simulating the
per-character error of an OCR system. We also converted
all characters to lower case during this step. We generated
a feature set of ‘Words’, which JGAAP defines as
any string of characters separated by whitespace. Finally,
we used a normalized dot product as a nearest neighbor
algorithm to assign authorship tags to unknown documents.
Noecker and Juola (personal correspondence)
have suggested that this normalized dot product scoring, which they refer to as the ‘Cosine Distance’, outperforms
many more complicated techniques and is especially
well suited to the feature set we chose.
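To make the pipeline concrete, the following is a Python sketch of the process just described: error injection with lowercasing, whitespace-delimited word features, and normalized-dot-product nearest-neighbor attribution. This is our own approximation, not JGAAP's actual implementation; the function names and the choice to leave whitespace uncorrupted are assumptions on our part.

```python
import random
from collections import Counter
from math import sqrt

def inject_errors(text, rate, rng=None):
    """Simulated OCR canonicizer (our stand-in for the one described
    above): lowercase the text, then replace a fraction `rate` of the
    non-whitespace characters with random letters. Preserving
    whitespace is an assumption, so word boundaries survive."""
    rng = rng or random.Random()
    chars = list(text.lower())
    for i in range(len(chars)):
        if not chars[i].isspace() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def word_counts(text):
    """'Words' feature set: any string of characters separated by
    whitespace, tallied into a frequency vector."""
    return Counter(text.split())

def cosine_similarity(a, b):
    """Normalized dot product of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def attribute(unknown, known_docs):
    """Nearest-neighbor attribution: return the author (a key of
    `known_docs`) whose sample text is most similar to `unknown`."""
    u = word_counts(unknown)
    return max(known_docs,
               key=lambda a: cosine_similarity(u, word_counts(known_docs[a])))
```

In this sketch a single sample per author stands in for a full training set; extending it to multiple known documents per author is straightforward.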
In order to test this experiment on real world data, we
have used the Ad-hoc Authorship Attribution Competition
(AAAC) corpus. The AAAC was an experiment
in authorship attribution held as part of the 2004 Joint
International Conference of the Association for Literary
and Linguistic Computing and the Association for
Computers and the Humanities. The AAAC corpus
provides texts from a wide variety of different genres,
languages and document lengths, assuring that the results
would be useful for a wide variety of applications.
The AAAC corpus consists of 98 unknown documents,
distributed across 13 different problems (labeled A-M).
An analysis method’s AAAC score is calculated as the
sum of the percent accuracy for each problem. Hence, an
AAAC score of 1300% represents 100% accuracy on all
problems. This score was designed to weight both small
problems (those with only one or two unknown documents)
and large problems equally. Because this score is
not always sufficiently descriptive on its own, we have
also included an overall accuracy rate in our experiment.
That is, we calculate both the AAAC scoring and the total
percentage of unknown documents which were assigned
the correct authorship labels. These two scores provide a
fair assessment of how the technique performed both on
a per-problem and per-document basis.
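The two scores can be sketched as follows, assuming results are stored as (predicted, true) label pairs grouped by problem; the data layout is hypothetical, but the arithmetic matches the definitions above.

```python
def aaac_score(results):
    """AAAC score: sum of per-problem percent accuracies.
    `results` maps a problem label (e.g. 'A'..'M') to a list of
    (predicted, true) author pairs. With 13 problems, a perfect
    run scores 1300."""
    return sum(100.0 * sum(p == t for p, t in pairs) / len(pairs)
               for pairs in results.values())

def overall_accuracy(results):
    """Overall accuracy: percentage of all unknown documents,
    pooled across problems, that were labeled correctly."""
    total = sum(len(pairs) for pairs in results.values())
    correct = sum(p == t for pairs in results.values() for p, t in pairs)
    return 100.0 * correct / total
```

The contrast is visible on a toy example: a one-document problem contributes up to 100 points to the AAAC score, the same as a ten-document problem, whereas the overall accuracy weights every document equally.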
The experiment itself consisted of fifty-two iterations
of the authorship attribution process over the entire
AAAC corpus. The results presented show the effect of
random transcription errors ranging from 1% to
100% of the characters in each document. As previously
noted, we calculated both the AAAC score and overall
percentage correct over some 5,096 experiments (98 experiments
per iteration). These overall results suggested
that there was essentially no decrease in performance on
either the AAAC score or the overall percentage for error
rates of 1% to 2%, which many commercial OCR
systems claim to achieve. Even at a more pessimistic error
rate of 5%, the overall percentage correct decreased
by only about 1%, and the AAAC score by only about 20 points.
This trend continues until roughly a 15% error rate, after
which the performance drops off rather considerably.
Still, as Rice et al. have reported, all major OCR systems
are capable of error rates considerably lower than
15%, which strongly suggests that there is little reason to
spend additional time proofreading scanned documents
before performing authorship attribution.
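An intuition for why word-level features tolerate moderate character noise can be had from a toy measurement (ours, not part of the experiment): corrupt a document at increasing rates and compare its word-frequency profile against the clean original with the same cosine measure used above.

```python
import random
from collections import Counter
from math import sqrt

def corrupt(text, rate, rng):
    """Lowercase, then replace a fraction `rate` of non-space
    characters with random letters (whitespace preserved)."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                   if (not c.isspace() and rng.random() < rate) else c
                   for c in text.lower())

def cosine(a, b):
    """Normalized dot product of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_curve(text, rates, seed=0):
    """For each error rate, measure how similar the corrupted
    word profile remains to the clean one."""
    rng = random.Random(seed)
    clean = Counter(text.lower().split())
    return {r: cosine(Counter(corrupt(text, r, rng).split()), clean)
            for r in rates}
```

At low rates most words survive intact and the profile barely moves; only at high rates does the vocabulary overlap collapse, which is consistent in spirit with the degradation pattern reported above, though this toy does not reproduce the experiment's numbers.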
Hosted at University of Maryland, College Park
College Park, Maryland, United States
June 20, 2009 - June 25, 2009
176 works by 303 authors indexed
Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/
Series: ADHO (4)