Feature Creep: Evaluating Feature Sets for Text Mining Literary Corpora

  1. 1. Charles Cooney

    University of Chicago

  2. 2. Russell Horton

    University of Chicago

  3. 3. Mark Olsen

    University of Chicago

  4. 4. Robert Voyer

    University of Chicago

  5. 5. Glenn Roe

    University of Chicago

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Machine learning offers the tantalizing possibility of discovering
meaningful patterns across large corpora of literary texts.
While classifi ers can generate potentially provocative hints
which may lead to new critical interpretations, the features sets
identifi ed by these approaches tend not to be able to represent
the complexity of an entire group of texts and often are too
reductive to be intellectually satisfying. This paper describes
our current work exploring ways to balance performance of
machine learning systems and critical interpretation. Careful
consideration of feature set design and selection may provide
literary critics the cues that provoke readings of divergent
classes of texts. In particular, we are looking at lexical groupings
of feature sets that hold the promise to reveal the predominant
“idea chunklets” of a corpus.
Text data mining and machine learning applications are
dependent on the design and selection of feature sets.
Feature sets in text mining may be described as structured
data extracted or otherwise computed from running or
unstructured text in documents which serves as the raw data
for a particular classifi cation task. Such features typically include
words, lemmas, n-grams, parts of speech, phrases, named
entities, or other actionable information computed from the
contents of documents and/or associated metadata. Feature
set selection is generally required in text mining in order to
reduce the dimensionality of the “native feature space”, which
can range in the tens or hundreds thousands of words in
even modest size corpora [Yang]. Not only do many widely
used machine learning approaches become ineffi cient when
using such high dimensional feature spaces, but such extensive
feature sets may actually reduce performance of a classifi er.
[Witten 2005: 286-7] Li and Sun report that “many irrelevant
terms have a detrimental effect on categorization accuracy
due to overfi tting” as well as tasks which “have many relevant
but redundant features... also hurt categorization accuracy”. [Li
Feature set design and selection in computer or information
science is evaluated primarily in terms of classifi er performance
or measures of recall and precision in search tasks. But the
features identifi ed as most salient to a classifi cation task may be
of more interest than performance rates to textual scholars in
the humanities as well as other disciplines. In a recent study of
American Supreme Court documents, McIntosh [2007] refers
to this approach as “Comparative Categorical Feature Analysis”.
In our previous work, we have used a variety of classifi ers and
functions implemented in PhiloMine [1] to examine the most
distinctive words in comparisons of wide range of classes, such
as author and character genders, time periods, author race
and ethnicity, in a number of different text corpora [Argamon
et. al. 2007a and 2007b]. This work suggests that using different
kinds of features -- surface form words compared to bilemmas
and bigrams -- allows for similar classifi er performance on a
selected task, while identifying intellectually distinct types of
differences between the selected groups of documents.
Argamon, Horton et al. [Argamon 2007a] demonstrated that
learners run on a corpus of plays by Black authors successfully
classifi ed texts by nationality of author, either American or
non-American, at rates ranging between 85% and 92%. The
features tend to describe gross literary distinctions between
the two discourses reasonably well. Examination of the top 30
features is instructive.
American: ya’, momma, gon’, jones, sho, mississippi, dude,
hallway, nothin, georgia, yo’, naw, alabama, git, outta, y’,
downtown, colored, lawd, mon, punk, whiskey, county, tryin’,
runnin’, jive, buddy, gal, gonna, funky
Non-American: na, learnt, don, goat, rubbish, eh, chief,
elders, compound, custom, rude, blasted, quarrel, chop, wives,
professor, goats, pat, corruption, cattle, hmm, priest, hunger,
palace, forbid, warriors, princess, gods, abroad, politicians
Compared side by side, these lists of terms have a direct
intuitive appeal. The American terms suggest a body of plays
that deal with the Deep South (the state names), perhaps the
migration of African-Americans to northern cities (hallway and
downtown), and also contain idiomatic and slang speech (ya’,
gon’, git, jive) and the language of racial distinction (colored). The
non-American terms reveal, as one might expect, a completely
different universe of traditional societies (chief, elders,
custom) and life under colonial rule (professor, corruption,
politicians). Yet a drawback to these features is that they have a
stereotypical feel. Moreover, these lists of single terms reduce
the many linguistically complex and varied works in a corpus
to a distilled series of terms. While a group of words, in the
form of a concordance, can show something quite concrete
about a particular author’s oeuvre or an individual play, it is
diffi cult to come to a nuanced understanding of an entire corpus through such a list, no matter how long. Intellectually,
lists of single terms do not scale up to provide an adequate
abstract picture of the concerns and ideas represented in a
body of works.
Performing the same classifi cation task using bilemmas (bigrams
of word lemmas with function words removed) reveals both
slightly better performance than surface words (89.6% cross
validated) and a rather more specifi c set of highly ranked
features. Running one’s eye down this list is revealing:
American: yo_mama, white_folk, black_folk, ole_lady,
st_louis, uncle_tom, rise_cross, color_folk,front_porch,
jim_crow, sing_blue black_male, new_orleans, black_boy,
cross_door, black_community, james_brown,
Non-American: palm_wine, market_place, dip_hand,
cannot_afford, high_priest, piece_land, join_hand,bring_
water, cock_crow, voice_people, hope_nothing, pour_
libation, own_country, people_land, return_home
American (not present in non-American): color_boy,
color_girl, jive_ass, folk_live
Here, we see many specifi c instances of Africian-American
experience, community, and locations. Using bigrams instead
of bilemmas delivers almost exactly the same classifi er
performance. However, not all works classifi ed correctly using
bilemmas are classifi ed correctly using bigrams. Langston
Hughes, The Black Nativity, for example, is correctly identifi ed as
American when using bigrams but incorrectly classifi ed when
using bilemmas. The most salient bigrams in the classifi cation
task are comparable, but not the same as bilemmas. The
lemmas of “civil rights” and “human rights” do not appear in
the top 200 bilemmas for either American or non-American
features, but appear in bigrams, with “civil rights” as the 124th
most predictive American feature and “human rights” as 111th
among non-American features.
As the example of The Black Nativity illustrates, we have found
that different feature sets give different results because, of
course, using different feature sets means fundamentally
changing the lexically based standards the classifi er relies on
to make its decision. Our tests have shown us that, for the
scholar interested in examining feature sets, there is therefore
no single, defi nitive feature set that provides a “best view” of
the texts or the ideas in them. We will continue exploring
feature set selection on a range of corpora representing
different genres and eras, including Black Drama, French
Women Writers, and a collection of American poetry. Keeping
in mind the need to balance performance and intelligibility, we
would like to see which combinations of features work best
on poetry, for example, compared to dramatic writing. Text
classifi ers will always be judged primarily on how well they
group similar text objects. Nevertheless, we think they can
also be useful as discovery tools, allowing critics to fi nd sets of
ideas that are common to particular classes of texts.
1. See http://philologic.uchicago.edu/philomine/. We have largely
completed work on PhiloMine2, which allows the user to perform
classifi cation tasks using a variety of features and fi ltering on entire
documents or parts of documents. The features include words,
lemmas, bigrams, bilemmas, trigrams and trilemmas which can be
used in various combinations.
[Argamon 2007a] Argamon, Shlomo, Russell Horton,
Mark Olsen, and Sterling Stuart Stein. “Gender, Race, and
Nationality in Black Drama, 1850-2000: Mining Differences
in Language Use in Authors and their Characters.” DH07,
Urbana-Champaign, Illinois. June 4-8, 2007.
[Argamon 2007b] Argamon, Shlomo, Jean-Baptiste Goulain,
Russell Horton, and Mark Olsen. “Discourse, Power, and
Ecriture Féminine: Text Mining Gender Difference in 18th and
19th Century French Literature.” DH07, Urbana-Champaign,
Illinois. June 4-8, 2007.
[Li 2007] Li, Jingyang and Maosong Sun, “Scalable Term
Selection for Text Categorization”, Proceedings of the 2007
Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning
(EMNLP-CoNLL), pp. 774-782.
[McIntosh 2007] McIntosh, Wayne, “The Digital Docket:
Categorical Feature Analysis and Legal Meme Tracking in
the Supreme Court Corpus”, Chicago Colloquium on Digital
Humanities and Computer Science, Northwestern University,
October 21-22, 2007
[Witten 2005] Witten, Ian H. and Eibe Frank. Data Mining:
Practical Machine Learning Tools and Techniques, 2nd ed. San
Francisco, CA: Morgan Kaufmann, 2005.
Yang, Yiming and Jan Pedersen, “A Comparative Study on
Feature Selection in Text Categorization” In D. H. Fisher
(ed.), Proceedings of ICML-97, 14th International Conference
on Machine Learning, pp. 412-420, Nashville, US. Morgan
Kaufmann Publishers, San Francisco, US.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None