Authorship

###### 1. John Noecker Jr.

Duquesne University

###### 2. Patrick Juola

Duquesne University


There is a significant overhead cost when using techniques like support vector machines on large data sets. It would certainly be convenient to be able to use a less involved technique, but this often comes at the cost of less desirable performance. The result is a trade-off between time and memory constraints on one side and prediction accuracy on the other.

It seems that a sophisticated technique like support vector machines must surely outperform a simple nearest neighbor classification. Fortunately, some recent results suggest that in fact a simple nearest neighbor classification using the normalized dot product (the so-called ‘cosine distance’) as a ‘distance’ performs comparably to radial basis function support vector machines for the task of authorship attribution. In some cases, this cosine distance classification actually outperforms SVMs.
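
As a minimal sketch of the method (in Python, for illustration only; JGAAP itself is written in Java, and the feature-vector representation here is an assumption), the whole classifier fits in a few lines:

```python
import numpy as np

def cosine_similarity(u, v):
    """Normalized dot product of two feature vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_neighbor_author(unknown, train_vectors, train_labels):
    """Assign the unknown document the label of the training document
    with the highest cosine similarity (the 'nearest' neighbor)."""
    scores = [cosine_similarity(unknown, t) for t in train_vectors]
    return train_labels[int(np.argmax(scores))]
```

Note that the normalized dot product is a similarity rather than a true distance, so the nearest neighbor is the training document with the highest score.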

Whether or not a space is linearly separable can have important consequences for performing classification within that space. An n-dimensional space containing two classes of points is said to be linearly separable if there exists an (n-1)-dimensional hyperplane which separates the classes. A linearly separable space has the advantage that simpler classification methods will work within it: a simple distance metric may be sufficient to distinguish between two classes in a linearly separable space, while a more complex method like a support vector machine or a neural network is necessary to capture nonlinear class boundaries. The primary advantage of linear separability is that it allows us to develop classification algorithms that are less computationally intensive. That is, instead of taking hours or even days to model the class boundaries, we can use simple algorithms which will achieve comparable results in only a few minutes. Linear classifiers tend to scale considerably better than their more complex counterparts. This is especially important when working with very large corpora, where training a support vector machine could take several days, while evaluating the cosine distance between documents in the corpus may take less than an hour.
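
To make the definition concrete, here is a small self-contained sketch (our own illustration, not part of the experiment) using the classic perceptron algorithm, which is guaranteed to converge on a separating hyperplane whenever one exists:

```python
import numpy as np

def perceptron_separable(X, y, max_epochs=1000):
    """Heuristic linear-separability test. X is an (m, n) array of points,
    y an array of labels in {-1, +1}. A full error-free pass proves a
    separating hyperplane (w, b) exists; failing to converge within
    max_epochs only suggests the classes are not linearly separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified point
                w += yi * xi                    # standard perceptron update
                b += yi
                errors += 1
        if errors == 0:
            return True
    return False
```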

We intend to present recent results in the field of authorship attribution suggesting that the normalized dot product nearest neighbor classification method is comparable to radial basis function support vector machine classification methods. For this experiment, we made use of the Java Graphical Authorship Attribution Program (JGAAP, www.jgaap.com), a freely available Java program for performing authorship attribution created by Patrick Juola of Duquesne University. This modular program breaks the task of authorship attribution into three subtasks, described as ‘Canonicization’, ‘Event Generation’ and ‘Statistical Analysis’. During the Canonicization step, documents are standardized and various preprocessing steps can occur. For this experiment, we used a variety of combinations of three preprocessing steps: ‘Strip Punctuation’, ‘Unify Case’ and ‘Normalize Whitespace’. Although the choice of canonicizers had some effect on the overall performance of the statistical analysis methods, it did not significantly affect the results. For the feature sets, we used characters, character bigrams, word lengths, word bigrams and words. We performed the experiments both with the full feature sets and with only the 50 most common features. For the statistical methods, as previously discussed, we used both radial basis function SVMs and normalized dot product scoring for nearest neighbor classification. We made use of the libSVM package for our SVMs.
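
For illustration, the first two stages might look roughly like the following in Python (JGAAP is a Java program; these are simplified stand-ins for its canonicizers and event generators, not its actual code, and selecting the top 50 events per document is an assumption about the setup):

```python
import re
from collections import Counter

def canonicize(text, strip_punct=True, unify_case=True, norm_whitespace=True):
    """Rough analogues of the 'Strip Punctuation', 'Unify Case' and
    'Normalize Whitespace' canonicizers."""
    if strip_punct:
        text = re.sub(r"[^\w\s]", "", text)
    if unify_case:
        text = text.lower()
    if norm_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

def word_bigram_events(text):
    """Event Generation: count word-bigram events in a canonicized text."""
    words = text.split()
    return Counter(zip(words, words[1:]))

def most_common_events(counts, k=50):
    """Restrict an event profile to its k most common events."""
    return Counter(dict(counts.most_common(k)))
```

Profiles produced this way could then be vectorized and fed to the nearest neighbor scorer sketched earlier, or to an SVM.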

In order to test this experiment on real-world data, we used the Ad-hoc Authorship Attribution Competition (AAAC) corpus. The AAAC was an experiment in authorship attribution held as part of the 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. The AAAC corpus provides texts from a wide variety of genres, languages and document lengths, ensuring that the results would be useful for a wide variety of applications.

The AAAC corpus consists of 98 unknown documents, distributed across 13 different problems (labeled A-M). An analysis method’s AAAC score is calculated as the sum of the percent accuracy for each problem; hence, an AAAC score of 1300% represents 100% accuracy on all problems. This score was designed to weight small problems (those with only one or two unknown documents) and large problems equally. Because this score is not always sufficiently descriptive on its own, we have also included an overall accuracy rate in our experiment. That is, we calculate both the AAAC score and the total percentage of unknown documents which were assigned the correct authorship labels. These two scores provide a fair assessment of how the technique performed on both a per-problem and a per-document basis.
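
Both measures are easy to state in code. The sketch below assumes a simple mapping from problem labels to (predicted, true) author pairs, which is our own illustrative layout:

```python
def aaac_score(results):
    """AAAC score: the sum of per-problem percent accuracies, so 1300.0
    means 100% accuracy on all 13 problems. `results` maps each problem
    label ('A'..'M') to a list of (predicted_author, true_author) pairs."""
    return sum(
        100.0 * sum(p == t for p, t in pairs) / len(pairs)
        for pairs in results.values()
    )

def overall_accuracy(results):
    """Per-document accuracy over all 98 unknown documents."""
    pairs = [pair for pairs in results.values() for pair in pairs]
    return sum(p == t for p, t in pairs) / len(pairs)
```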

In those cases where all events were included, the cosine distance classification actually outperformed radial basis function SVMs, while performing only slightly worse on the most-common-event sets. This leads us to the conclusion that although much of the information necessary for authorship attribution is contained within the 50 most common events, it is the less common events that actually produce an empirically linearly separable clustering of the data points. That is, when presented only with the 50 most common events as a feature space, we require a more complex classifier to model the class boundaries between different authors. However, as we increase the number of dimensions of this space by adding the less frequently used features, it becomes possible to model these boundaries with only simple linear classifiers. Hence there is some important information contained even within the rarely occurring events in the feature set.
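
As a toy illustration of the general principle at work here (unrelated to the AAAC data), the familiar XOR configuration is not linearly separable in two dimensions, but adding one more dimension makes it separable by a plane; the perceptron check sketched earlier would confirm this:

```python
import numpy as np

# XOR-style points: no line separates the two classes in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Lift to 3-D by appending the product feature x1 * x2.
X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

# The hyperplane w.x + b = 0 with w = (1, 1, -2), b = -0.5 now separates them.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(X_lifted @ w + b))  # matches y: [-1.  1.  1. -1.]
```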

The finding that the normalized dot product performs comparably to support vector machines is important because it is very simple and computationally tractable even in high-dimensional spaces. It is much quicker to calculate the dot product between two vectors than it is to train a support vector machine. Using a nearest neighbor algorithm with normalized dot product scoring allows very large data sets to be processed much more quickly and does not seem to cause a significant loss of accuracy. Hence, we propose that for some applications, normalized dot product scoring may be an acceptable substitute for a support vector machine.

