Duquesne University
A common method of performing authorship attribution
(or text classification in general) involves
embedding the documents in a high-dimensional feature
space and calculating similarity judgments in the
form of numeric “distances” between them. Using (for
example) a k-nearest neighbor algorithm, an unknown
document can be assigned to the “closest” (in similarity
or distance) group of reference documents. However, the
word “distance” is ill-defined and can be implemented
conceptually in many different ways. We examine the
implications of one broad category of “distances”.
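To make the embedding-and-distance approach concrete, here is a minimal sketch in Python; the function names, the word-frequency embedding, and the Manhattan distance are all illustrative choices of ours, not a description of any particular attribution system:

    from collections import Counter

    def word_freqs(text):
        # Embed a document as relative word frequencies
        # (one simple high-dimensional feature space).
        words = text.lower().split()
        return {w: c / len(words) for w, c in Counter(words).items()}

    def manhattan(p, q):
        # One conventional, well-behaved distance between two embeddings.
        keys = set(p) | set(q)
        return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    def nearest_author(unknown, references):
        # references: list of (author, text) pairs; 1-nearest-neighbor.
        u = word_freqs(unknown)
        best = min(references, key=lambda ar: manhattan(u, word_freqs(ar[1])))
        return best[0]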
This notion of distance can be generalized to dissimilarity judgments made without a prior embedding in a space. An example is the Kullback-Leibler divergence, which calculates an information-theoretic dissimilarity measure (an effect size) between two event streams whose events are not necessarily independent and thus cannot be directly tabulated as simple histograms. This kind of “distance” can easily be incorporated into a text classification system.
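As a simplified sketch of how such a measure can be computed in practice, the following Python function estimates D(A||B) from smoothed unigram counts; treating tokens as independent unigram events is a simplification we adopt here for brevity, and is precisely the simplification the full measure is meant to avoid:

    import math
    from collections import Counter

    def kl_divergence(tokens_a, tokens_b, smoothing=1e-6):
        # D(A || B): expected extra information, in nats, needed to encode
        # events from A with a code optimized for B. Smoothing keeps every
        # probability strictly positive so the logarithm is defined.
        vocab = set(tokens_a) | set(tokens_b)
        ca, cb = Counter(tokens_a), Counter(tokens_b)
        na, nb = len(tokens_a), len(tokens_b)
        total = 0.0
        for w in vocab:
            p = (ca[w] + smoothing) / (na + smoothing * len(vocab))
            q = (cb[w] + smoothing) / (nb + smoothing * len(vocab))
            total += p * math.log(p / q)
        return total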
To a topologist, a “distance” is a numeric function D(x,y) between two points or objects such that:
• D(x,y) is always nonnegative, and is positive whenever x ≠ y
• D(x,y) = D(y,x) (symmetry)
• D(x,y) + D(y,z) ≥ D(x,z) (the triangle inequality)
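These properties can be tested mechanically on sample data; a small checker, sketched below with names of our own choosing, reports the first axiom a candidate function violates:

    import itertools

    def check_metric_axioms(D, points, tol=1e-12):
        # A pass on a finite sample is only evidence, not a proof;
        # a failure, however, is a genuine counterexample.
        for x, y in itertools.product(points, repeat=2):
            if D(x, y) < -tol or (x != y and D(x, y) <= tol):
                return "fails nonnegativity/positivity"
            if abs(D(x, y) - D(y, x)) > tol:
                return "fails symmetry"
        for x, y, z in itertools.product(points, repeat=3):
            if D(x, y) + D(y, z) < D(x, z) - tol:
                return "fails the triangle inequality"
        return "all three axioms hold on this sample"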
However, there are many useful distance-like measures (technically known as “divergences”) that do not have all of these properties. In particular, divergences such as the Kullback-Leibler divergence and vocabulary overlap are not symmetric: they yield different values depending on which document is taken as the point of reference.
So, assume that you have two documents A and B and want to find the divergence between them. You could measure the divergence of B from A, D(A,B), or the divergence of A from B, D(B,A). These two ways of applying the same divergence will, in general, give different results.
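A toy numeric example (the two distributions are chosen arbitrarily, purely for illustration) makes the asymmetry concrete:

    import math

    def kl(p, q):
        # Kullback-Leibler divergence D(p || q), in nats.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

    P = [0.5, 0.5]      # say, feature frequencies in document A
    Q = [0.9, 0.1]      # say, feature frequencies in document B
    print(kl(P, Q))     # ~0.511 nats
    print(kl(Q, P))     # ~0.368 nats: the same divergence, reversed, differs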
That the two results differ is important, because it implies that each direction captures different information. This has been shown to be true for some divergence measures (Juola & Ryan, 2008), but upon reviewing those previous results, it seems there is rarely a case where the information is spread across both divergences; that is, the information gained from one direction alone is usually better than an average of the two. The problem then becomes that we do not know in advance which direction will contain more information. Because of this, we have devised criteria for selecting one over the other. The criteria are simple: we will use either the maximum of the two values or the minimum of the two values.
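Both criteria can be written as one-line wrappers around any directional divergence D; this sketch uses our own naming and is not code from any existing system:

    def max_divergence(D, a, b):
        # Keep the larger of the two directional values.
        return max(D(a, b), D(b, a))

    def min_divergence(D, a, b):
        # Keep the smaller of the two directional values.
        return min(D(a, b), D(b, a))

Either wrapper is symmetric by construction, so it can be dropped directly into a nearest-neighbor classifier that expects a symmetric distance.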
To test this, we are in the process of applying several divergence functions to a standardized corpus [the AAAC corpus (Juola, 2004)] of authorship attribution problems using the JGAAP framework (Juola et al., this conference). We will compare each divergence run in the forward direction, run in the backward direction, and combined by taking the max and the min of the two.
Preliminary results using the Kullback-Leibler divergence and the LZW distance indicate that using the max of the two divergences can increase accuracy up to fourfold on some problems. We plan to continue this work using other divergences; if this finding continues to hold, we consider it an important step toward eliminating some of the “ad-hoc-ness” of the current state of authorship attribution, as we will be able to analyze not merely which methods perform best, but which extensions of those methods can be used to improve their performance.
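For reference, since the abstract does not spell out the LZW distance formula, the sketch below shows one plausible compression-based construction: the extra LZW coding cost of B once the dictionary has been built up on A, normalized by B's cost alone. The formulation and normalization are assumptions of ours and may differ from the measure actually implemented in JGAAP:

    def lzw_code_count(data: bytes) -> int:
        # Number of codes a basic LZW compressor would emit for `data`.
        table = {bytes([i]) for i in range(256)}
        w, count = b"", 0
        for byte in data:
            wc = w + bytes([byte])
            if wc in table:
                w = wc
            else:
                count += 1        # emit the code for w
                table.add(wc)     # grow the dictionary
                w = bytes([byte])
        return count + (1 if w else 0)

    def lzw_divergence(a: bytes, b: bytes) -> float:
        # Directional by design: lzw_divergence(a, b) != lzw_divergence(b, a)
        # in general, since the dictionary is primed on the first argument.
        return (lzw_code_count(a + b) - lzw_code_count(a)) / lzw_code_count(b)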