PhiloMine: An Integrated Environment for Humanities Text Mining

Charles Cooney; Russell Horton; Mark Olsen; Robert Voyer; Glenn Roe

Authorship

1. Charles Cooney, University of Chicago
2. Russell Horton, University of Chicago
3. Mark Olsen, University of Chicago
4. Robert Voyer, University of Chicago
5. Glenn Roe, University of Chicago

Work text


PhiloMine [http://philologic.uchicago.edu/philomine/] is a set
of data mining extensions to the PhiloLogic [http://philologic.uchicago.edu/]
full-text search and retrieval engine, providing
middleware between PhiloLogic and a variety of data mining
packages that allows text mining experiments to be run on
documents loaded into a PhiloLogic database. We would
like to present a poster describing and demonstrating how
PhiloMine works.
The text mining process under PhiloMine has three main
components -- corpus selection, feature selection and
algorithm selection. Experimental corpora can be constructed
from the documents in the PhiloLogic database using standard
bibliographic metadata criteria such as date of publication or
author gender, as well as by attributes of sub-document level
objects such as divs and paragraphs. This makes it possible,
for example, to compare poetry line groups in male-authored
texts from 1800 - 1850 with those in female-authored texts
from that period or any other. The PhiloMine user then selects
the features to use for the experiment, choosing some or all
feature sets including surface forms, lemmas, part-of-speech
tags, and bigrams and trigrams of surface forms and lemmas.
Once the corpus and feature sets are selected, the machine
learning algorithm and implementation are chosen. PhiloMine
can talk to a range of freely available data mining packages such
as WEKA, Ken Williams's Perl modules, the CLUTO clustering
engine and more. Once the learning process has executed, the
results are redirected back to the browser and formatted to
provide links to PhiloLogic display of the documents involved
and queries for the individual words in each corpus. PhiloMine
provides an environment for the construction, execution
and analysis of text mining experiments by bridging the gap
between the source documents, the data structures that form
the input to the learning process, and the generated models
and classifications that are its output.
Corpus Selection
Under PhiloMine, the text mining corpus is created by
selecting documents and sub-document level objects from a
particular PhiloLogic database. PhiloLogic can load documents
in a number of commonly used formats such as TEI, RTF,
DocBook and plain text. From all documents in a particular
PhiloLogic collection, the text mining corpus is selected using
bibliographic metadata to choose particular documents, and
sub-document object selectors to choose objects such as divs
and line groups. The two levels of criteria are merged, so that
the experimenter may easily create, for example, a corpus of
all divs of type “letter” appearing within documents by female
authors published in Paris in the 19th century. For a supervised
mining run, the PhiloMine user must enter at least two sets of
such criteria, and a corpus is created which contains multiple
sub-corpora, one for each of the classes. For an unsupervised
mining run, such as clustering, one corpus is created based on
one set of criteria.
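The merging of document-level and object-level criteria can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation; the names TextObject, matches, select_corpus and build_subcorpora and the metadata keys are illustrative assumptions, not PhiloMine's actual data structures or API.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    """A sub-document object plus the metadata of its parent document (hypothetical)."""
    doc_meta: dict   # bibliographic metadata, e.g. author_gender, pub_place, year
    attrs: dict      # object-level attributes, e.g. div type
    text: str


def matches(record, criteria):
    """True if the record satisfies every criterion (a literal value or a predicate)."""
    for key, wanted in criteria.items():
        if key not in record:
            return False
        value = record[key]
        if callable(wanted):
            if not wanted(value):
                return False
        elif value != wanted:
            return False
    return True


def select_corpus(objects, doc_criteria, obj_criteria):
    """Keep objects that pass both the document-level and the object-level criteria."""
    return [o for o in objects
            if matches(o.doc_meta, doc_criteria) and matches(o.attrs, obj_criteria)]


def build_subcorpora(objects, class_criteria):
    """Supervised run: one sub-corpus per class label."""
    return {label: select_corpus(objects, doc_c, obj_c)
            for label, (doc_c, obj_c) in class_criteria.items()}


def in_19th_century(year):
    return 1801 <= year <= 1900


# Invented sample data, for illustration only.
objects = [
    TextObject({"author_gender": "F", "pub_place": "Paris", "year": 1832}, {"type": "letter"}, "letter A"),
    TextObject({"author_gender": "M", "pub_place": "Paris", "year": 1840}, {"type": "letter"}, "letter B"),
    TextObject({"author_gender": "F", "pub_place": "Paris", "year": 1832}, {"type": "chapter"}, "chapter C"),
    TextObject({"author_gender": "F", "pub_place": "Lyon", "year": 1790}, {"type": "letter"}, "letter D"),
]

# Unsupervised run: a single set of criteria yields a single corpus.
letters = select_corpus(
    objects,
    {"author_gender": "F", "pub_place": "Paris", "year": in_19th_century},
    {"type": "letter"},
)

# Supervised run: at least two sets of criteria, one per class.
subcorpora = build_subcorpora(objects, {
    "female": ({"author_gender": "F"}, {"type": "letter"}),
    "male": ({"author_gender": "M"}, {"type": "letter"}),
})
```

Allowing a criterion to be either a literal or a predicate keeps exact matches (place of publication) and ranges (a century) in the same mechanism.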
The PhiloMine user is also able to specify the granularity of
text object size which is presented to the learning algorithm,
the scope of the “instance” in machine learning terminology.
Under PhiloMine, an instance may be either an entire
document, a div or a paragraph. A single document consisting
of a thousand paragraphs may be presented to a machine
learner as a thousand distinct text objects, a thousand vectors
of feature data, and in that case the learner will classify or
cluster on the paragraph level. Similarly, even if the user has
chosen to use sub-document level criteria such as div type,
the selected text objects can be combined to reconstitute a
document-level object. Thus the PhiloMine experimenter can
set corpus criteria at the document, div and/or paragraph level
and independently decide which level of text object to use as
instances.
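A rough sketch of how the same selected text could be cut into instances at different levels follows. The nested-list document layout and the function name make_instances are assumptions made for illustration, not PhiloMine internals.

```python
def make_instances(documents, level):
    """Turn documents into (instance_id, text) pairs at the requested granularity.

    documents: {doc_id: [[paragraph, ...], ...]}, i.e. each document is a list
    of divs and each div is a list of paragraph strings (hypothetical layout).
    """
    instances = []
    for doc_id, divs in documents.items():
        if level == "document":
            # Recombine all selected objects into one document-level instance.
            text = " ".join(p for div in divs for p in div)
            instances.append((doc_id, text))
        elif level == "div":
            for i, div in enumerate(divs):
                instances.append((f"{doc_id}/div{i}", " ".join(div)))
        elif level == "paragraph":
            for i, div in enumerate(divs):
                for j, para in enumerate(div):
                    instances.append((f"{doc_id}/div{i}/p{j}", para))
        else:
            raise ValueError(f"unknown level: {level}")
    return instances


# Invented sample data: two documents, three divs, four paragraphs.
docs = {
    "doc1": [["first paragraph", "second paragraph"], ["third paragraph"]],
    "doc2": [["fourth paragraph"]],
}
by_document = make_instances(docs, "document")
by_div = make_instances(docs, "div")
by_paragraph = make_instances(docs, "paragraph")
```

The point of the sketch is that the selection criteria and the instance level are independent choices: the same input yields two, three or four instances depending only on the level argument.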
Several filters are available to ensure that selected text
objects suit the experimental design. PhiloMine users may set
minimum and maximum feature counts per instance. They may
also balance instances across classes, which is useful to keep
your machine learner honest when dealing with classifiers
that will exploit differential baseline class frequencies. Finally,
instance class labels may be shuffled for a random falsification
run, to make sure that your accuracy is not a result of an
overfitting classifier.
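Class balancing and label shuffling can be sketched as below. The abstract does not say how PhiloMine balances classes; random downsampling to the smallest class is an assumption made here, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def balance_classes(instances, seed=0):
    """Downsample every class to the size of the smallest class.

    instances: list of (label, features) pairs. Downsampling is an assumed
    strategy; other balancing schemes are possible.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, features in instances:
        by_label[label].append((label, features))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced


def shuffle_labels(instances, seed=0):
    """Randomly reassign the existing labels to instances (falsification run)."""
    rng = random.Random(seed)
    labels = [label for label, _ in instances]
    rng.shuffle(labels)
    return [(label, features) for label, (_, features) in zip(labels, instances)]


# Invented sample data: five instances of class "a", two of class "b".
data = [("a", {"w": i}) for i in range(5)] + [("b", {"w": i}) for i in range(2)]
balanced = balance_classes(data)
shuffled = shuffle_labels(data)
```

If a classifier trained on the shuffled labels scores about as well as one trained on the true labels, the reported accuracy is unlikely to reflect a real difference between the classes.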
Feature Selection
In machine learning generally, features are attributes of an
instance that take on certain values. The text mining process
often involves shredding the documents in the corpus into
a bag-of-words (BOW) representation, wherein each unique
word, or type, is a feature, and the number of tokens, or occurrences of each type in a given document, is the value
of that feature for that document. This data structure can
be envisioned as a matrix, or spreadsheet, with each row
corresponding to a text object, or instance, and each column
representing a type, or feature, with an extra column for class
label if supervised learning is being undertaken. PhiloMine
generates a BOW matrix for the user-selected corpus which
serves as the input to the machine learner.
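As a small illustration of the matrix described above, the following sketch builds a bag-of-words table from raw strings. Lowercasing and whitespace tokenization are simplifying assumptions; PhiloMine draws its counts from PhiloLogic's indices rather than re-tokenizing text, and the function name bow_matrix is hypothetical.

```python
from collections import Counter


def bow_matrix(texts, labels=None):
    """Return (header, rows): one row per instance, one column per word type.

    If labels are given, an extra final column holds the class label.
    """
    counts = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*counts))          # every type seen in the corpus
    header = vocab + (["class"] if labels is not None else [])
    rows = []
    for i, count in enumerate(counts):
        row = [count[word] for word in vocab]     # token count per type, 0 if absent
        if labels is not None:
            row.append(labels[i])
        rows.append(row)
    return header, rows


header, rows = bow_matrix(["the cat sat", "the the dog"], labels=["a", "b"])
# header: ['cat', 'dog', 'sat', 'the', 'class']
# rows:   [[1, 0, 1, 1, 'a'], [0, 1, 0, 2, 'b']]
```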
Because PhiloLogic creates extensive indices of document
content as part of its loading process, its internal data structures
already contain counts of words for each document in a given
database. PhiloMine extends PhiloLogic so that these counts are also available for
divs and paragraphs. In addition to the surface forms of words,
PhiloMine will also generate vectors for lemmas or part-of-speech
tags, provided by TreeTagger, and bigrams and trigrams
of surface forms or lemmas. The user may select one or more
of these feature sets for inclusion in a given run.
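Combining several feature sets in one vector might look like the sketch below. It works on any token sequence (surface forms or lemmas); the function names are hypothetical, and lemmatization and tagging, which the abstract attributes to TreeTagger, are not reproduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def feature_vector(tokens, sizes=(1, 2, 3)):
    """Count unigrams, bigrams and trigrams (or any chosen sizes) in one vector."""
    vector = Counter()
    for n in sizes:
        vector.update(ngrams(tokens, n))
    return vector


tokens = "to be or not to be".split()
vector = feature_vector(tokens)
bigrams_only = feature_vector(tokens, sizes=(2,))
```

Because n-grams of different sizes contain different numbers of spaces, their keys cannot collide in the combined vector. Mixing surface forms and lemmas in one run would need a prefix per feature set, which this sketch omits.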
One practical concern in machine learning is the dimensionality
of the input matrix. Various algorithms scale in different ways,
but in general adding a new instance or feature will increase the
time needed to generate a classificatory model or clustering
solution, sometimes exponentially so. For this reason, it can
be very helpful to limit a priori the number of features in
the matrix before presenting it to the machine learner, and
PhiloMine provides the capability to filter out features based
on a number of criteria. For each feature set, the user may
limit the features to use by the number of instances in which
the feature occurs, eliminating common or uncommon
features. Additionally, include lists and/or exclude lists may be
submitted, and only features on the include list and no features
on the exclude list are retained. Finally, features may be filtered
by their value on a per-instance basis, so that all features
that occur more or fewer times than the user desires may be
removed from a given instance, while remaining present in
other instances.
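The three kinds of feature filtering described above can be sketched together. The function name and parameter names (min_df, max_df, include, exclude, min_count, max_count) are assumptions chosen for illustration, not PhiloMine's form fields.

```python
from collections import Counter


def filter_features(vectors, min_df=1, max_df=None, include=None, exclude=(),
                    min_count=0, max_count=None):
    """Filter a list of {feature: count} vectors.

    min_df / max_df: bounds on the number of instances a feature occurs in.
    include / exclude: optional include list and exclude list.
    min_count / max_count: per-instance bounds on a feature's value.
    """
    doc_freq = Counter()
    for vector in vectors:
        doc_freq.update(vector.keys())
    if max_df is None:
        max_df = len(vectors)

    # Corpus-wide decisions: a feature is kept or dropped for all instances.
    keep = {feature for feature, df in doc_freq.items()
            if min_df <= df <= max_df
            and (include is None or feature in include)
            and feature not in exclude}

    # Per-instance decisions: a feature may be dropped here but kept elsewhere.
    filtered = []
    for vector in vectors:
        filtered.append({feature: count for feature, count in vector.items()
                         if feature in keep
                         and count >= min_count
                         and (max_count is None or count <= max_count)})
    return filtered


# Invented sample data.
vectors = [
    {"the": 5, "power": 1, "king": 2},
    {"the": 7, "power": 3},
    {"the": 2, "queen": 1},
]
by_doc_freq = filter_features(vectors, min_df=2, max_df=2)
by_value = filter_features(vectors, min_df=2, max_df=2, min_count=2)
no_king = filter_features(vectors, exclude={"king"})
```

In the sample, "the" occurs in all three instances and is dropped as too common, while "king" and "queen" occur in only one and are dropped as too rare. The per-instance filter then removes "power" from the first instance only.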
Algorithm and Implementation Selection
PhiloMine can wrap data for, and parse results from, a
variety of machine learning implementations. Native Perl
functions currently include Ken Williams's naive Bayesian and
decision tree classifiers, a vector space implementation and a
differential relative rate statistics generator. The WEKA toolkit
provides numerous implementations and currently PhiloMine
works with the information gain, naive Bayes, SMO support
vector machine, multilayer perceptron, and J48 decision tree
WEKA components. PhiloMine also can talk to the compiled
SVMLight support vector machine and CLUTO clustering
engine. Relevant parameters for each function may also be set
on the PhiloMine form.
When the user selects an implementation, the feature vectors
for each instance are converted from PhiloMine’s internal
representation into the format expected by that package,
generally a sparse vector format, such as the sparse ARFF
format used by WEKA. The mining run is initiated either by
forking a command to the system shell or by the appropriate
Perl method call. Results of the run are displayed in the
browser, typically including a list of text object instances with
classification results from the model and a list of features used
by the classifier. Each instance is hyperlinked to the PhiloLogic
display for that text object, so that the user can easily view
that document, div or paragraph. Similarly, the user can push a
query to PhiloLogic to search for any word used as a feature,
either in the entire corpus or in any of the classed sub-corpora.
If results show, for instance, that a support vector machine has
heavily weighted the word “power” as indicative of a certain
class of documents, the experimenter can quickly get a report
of all occurrences of “power” in that class of documents, any
other class or all classes.
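The conversion step can be sketched for the sparse ARFF case. This is an illustrative serializer written for this example, not PhiloMine's converter; the attribute name class_label is a hypothetical choice, and class labels are assumed to be simple tokens that need no quoting.

```python
def quote(name):
    """Single-quote an attribute name, escaping backslashes and quotes."""
    return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_sparse_arff(relation, vocab, vectors, labels, class_attr="class_label"):
    """Serialize {feature: count} vectors as sparse ARFF text.

    Each data row lists only non-zero cells as "index value" pairs in
    ascending index order; the class attribute comes last.
    """
    classes = sorted(set(labels))
    index = {feature: i for i, feature in enumerate(vocab)}
    lines = ["@relation " + relation, ""]
    lines += ["@attribute " + quote(feature) + " numeric" for feature in vocab]
    lines.append("@attribute " + class_attr + " {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for vector, label in zip(vectors, labels):
        cells = sorted((index[f], c) for f, c in vector.items() if f in index and c)
        parts = [f"{i} {c}" for i, c in cells]
        # The class label is written explicitly on every row.
        parts.append(f"{len(vocab)} {label}")
        lines.append("{" + ", ".join(parts) + "}")
    return "\n".join(lines) + "\n"


arff = to_sparse_arff(
    "philomine_demo",
    ["king", "power", "the"],
    [{"the": 5, "power": 1}, {"king": 2}],
    ["b", "a"],
)
```

A sparse layout suits bag-of-words data because most instances contain only a small fraction of the vocabulary, so most cells in the dense matrix would be zero.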
This ability to easily move between the text mining results
and the context of the source documents is meant to mitigate
some of the alienating effects of text mining, where documents
become anonymous instances and words are reduced to
serialized features. For industrial applications of text mining,
the accuracy of a certain classifier may be the only criterion
for success, but for the more introspective needs of the digital
humanist, the mining results must be examined and interpreted
to further the understanding of the original texts. PhiloMine
allows researchers to frame experiments in familiar terms
by selecting corpora with standard bibliographic criteria, and
then relate the results of the experiment back to the source
texts in an integrated environment. This allows for rapid
experimental design, execution, refinement and interpretation,
while retaining the close association with the text that is the
hallmark of humanistic study.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO
