Authorship

###### 1. Mike Ryan

Duquesne University

###### 2. Patrick Juola

Duquesne University

Work text


A common method of performing authorship attribution (or text classification in general) involves embedding the documents in a high-dimensional feature space and calculating similarity judgments in the form of numeric “distances” between them. Using (for example) a k-nearest neighbor algorithm, an unknown document can be assigned to the “closest” (in similarity or distance) group of reference documents. However, the word “distance” is ill-defined and can be implemented conceptually in many different ways. We examine the implications of one broad category of “distances”.
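As a sketch of this generic pipeline (the feature words, the Euclidean distance, and all names here are illustrative choices, not the setup actually used in this study):

```python
import math
from collections import Counter

def embed(text, feature_words):
    """Embed a document as relative frequencies of chosen feature words."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in feature_words]

def euclidean(u, v):
    """Ordinary Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_author(unknown, labelled, feature_words):
    """1-nearest-neighbor attribution over embedded documents.
    `labelled` is a list of (author, text) pairs."""
    u = embed(unknown, feature_words)
    author, _ = min(((a, euclidean(u, embed(t, feature_words)))
                     for a, t in labelled),
                    key=lambda pair: pair[1])
    return author
```

Swapping in a different distance function changes the behavior of the whole classifier, which is exactly why the choice of “distance” matters.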

This notion of distance can be generalized to dissimilarity judgments made without a previous embedding in a space. An example is the Kullback-Leibler divergence, which calculates an information-theoretic dissimilarity measure (an effect size) between two event streams in which the events are not necessarily independent and thus cannot be directly tabulated as simple histograms. This kind of “distance” can easily be incorporated into a text classification system.
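As an illustration of the idea (a minimal sketch over unigram distributions with add-epsilon smoothing, not the implementation used in this work):

```python
import math
from collections import Counter

def kl_divergence(tokens_p, tokens_q, smoothing=1e-9):
    """Smoothed Kullback-Leibler divergence D(P || Q) between the
    unigram distributions of two token streams. The smoothing term
    keeps the divergence finite when a token of P never occurs in Q."""
    p_counts, q_counts = Counter(tokens_p), Counter(tokens_q)
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    d = 0.0
    for w in vocab:
        p = (p_counts[w] + smoothing) / p_total
        q = (q_counts[w] + smoothing) / q_total
        d += p * math.log(p / q)
    return d
```

Note that `kl_divergence(a, b)` and `kl_divergence(b, a)` are, in general, different numbers, which is the asymmetry discussed below.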

To a topologist, a “distance” is a numeric function D(x,y) between two points or objects, such that:

• D(x,y) is always nonnegative, and always positive if x != y
• D(x,y) = D(y,x)
• D(x,y) + D(y,z) >= D(x,z)
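For concreteness, the three axioms can be checked mechanically over a finite sample of points; a hypothetical helper (ours, not from any attribution toolkit) might look like:

```python
import itertools

def is_metric(dist, points, tol=1e-12):
    """Check the three metric axioms for `dist` over a finite sample:
    nonnegativity and positivity for distinct points, symmetry, and
    the triangle inequality."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:                       # must be nonnegative
            return False
        if x != y and d <= tol:            # positive when points differ
            return False
        if abs(d - dist(y, x)) > tol:      # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) + dist(y, z) + tol < dist(x, z):  # triangle inequality
            return False
    return True
```

Absolute difference on numbers passes all three checks; squared difference fails the triangle inequality, so it is a divergence rather than a metric.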

However, there are many useful distance-like measures (technically known as “divergences”) that do not have all of these properties. In particular, divergences such as the Kullback-Leibler divergence and vocabulary overlap are not symmetric: they give different values depending on which document is taken as the base. So, assume that you have two documents A and B and want to find the divergence between them. You could either find the divergence of B from A, D(A,B), or the divergence of A from B, D(B,A). These two ways of applying the same divergence will, in general, give different results.
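Vocabulary overlap makes this direction-dependence easy to see. A toy version (one plausible definition for illustration, not JGAAP's):

```python
def vocab_overlap_divergence(tokens_a, tokens_b):
    """One plausible vocabulary-overlap dissimilarity: the fraction of
    the first document's vocabulary that does NOT appear in the second.
    Normalizing by the first document's vocabulary size makes the
    measure depend on direction."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    return len(vocab_a - vocab_b) / len(vocab_a)

doc_a = "the cat sat on the mat".split()
doc_b = "the cat ran".split()
# Most of doc_a's vocabulary is missing from doc_b, but most of
# doc_b's vocabulary appears in doc_a, so the two directions differ.
```

Here D(A,B) = 3/5 while D(B,A) = 1/3: the same pair of documents, two different numbers.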

The fact that the two results differ is important, because it implies that each direction captures different information. This has been shown to be true for some divergence measures (Juola & Ryan, 2008), but upon reviewing those previous results, it seems there is rarely a case where the information is spread evenly across both directions; i.e., the information gained from one direction alone is usually better than an average of the two. The problem then becomes that we do not know in advance which direction will contain more information. Because of this, we have devised criteria for selecting one over the other. The criteria are simple: we use either the max of the two directions or the min of the two.
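These selection criteria can be read as symmetrizing wrappers around any directional divergence. The sketch below is our illustration of that reading (the toy `vocab_div` and all names are hypothetical, not JGAAP code):

```python
def attribute(unknown, references, divergence, combine=max):
    """Nearest-neighbor attribution where each distance is the max (or
    min, via `combine`) of the two directions of an asymmetric
    divergence. `references` is a list of (label, document) pairs."""
    def score(doc):
        return combine(divergence(unknown, doc), divergence(doc, unknown))
    best_label, _ = min(((label, score(doc)) for label, doc in references),
                        key=lambda pair: pair[1])
    return best_label

# Toy directional divergence for demonstration: the fraction of the
# first document's vocabulary missing from the second.
def vocab_div(x, y):
    vx, vy = set(x.split()), set(y.split())
    return len(vx - vy) / len(vx)
```

Either `combine=max` or `combine=min` yields a symmetric measure, so the unresolved question of which direction carries more information never has to be answered per problem.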

To test this, we are in the process of applying several divergence functions to a standardized corpus [the AAAC corpus (Juola, 2004)] of authorship attribution problems using the JGAAP framework (Juola et al., this conference). We will compare each run in the normal direction, in the backward direction, and in both directions combined by taking the max and the min.

Preliminary results using the Kullback-Leibler divergence and the LZW distance indicate that, on some problems, using the max of the two divergences increases accuracy by up to four-fold. We plan to continue this work using other divergences; if this finding continues to hold, we consider it an important step toward eliminating some of the “ad-hoc-ness” of the current state of authorship attribution, as we will be able to analyze not merely which methods perform best, but which extensions of these methods can be used to improve their performance.
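The abstract names the LZW distance but does not define it; a related compression-based dissimilarity in the same family is the normalized compression distance, sketched here with zlib's DEFLATE standing in for an LZW compressor (an assumption for illustration, not the measure as implemented in JGAAP):

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: how much better the
    concatenation a+b compresses than its parts predict. Similar
    inputs share structure the compressor can reuse, giving a
    smaller value. zlib's DEFLATE stands in for LZW here."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)
```

Like the divergences above, compression-based measures are direction-sensitive in practice (compressing a+b versus b+a can give slightly different sizes), so the max/min criteria apply to them as well.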

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/

Series: ADHO (4)

Organizers: ADHO

Tags

**Keywords:** None
**Language:** English
**Topics:** None