An Evaluation of Text Classification Methods for Literary Study

  1. Bei Yu

    University of Illinois, Urbana-Champaign

  2. John Unsworth

    University of Illinois, Urbana-Champaign


A survey study [1] shows that text classification is a typical
scholarly activity in literary study, and that automatic text
classification methods can be used in three scenarios. The first
is information organization - a classifier can learn the target
category concepts (e.g. news articles about trade, acquisition,
etc.) from the training documents, and then assign new
documents to these predefined categories. The second purpose
is knowledge discovery - a successful classifier can provide
insight into a target concept by revealing the
correlations between the features and the concept. The third
purpose is example-based retrieval - a classifier might be able
to learn a concept from a small number of training documents
with the help of semi-supervised learning or active learning
methods, and then retrieve more documents similar to the
training examples from a large collection.
Text classification techniques have been well developed over the
past twenty years. With so many text classification methods
available, empirical evaluation is important to
provide guidance for method selection in applications. Because
class concept definitions are subjective, analytical
evaluation of text classifiers is difficult. Empirical
experiments have therefore become the common evaluation
method for text classification [2]. The major text classification
methods have been evaluated on topic classification tasks using
benchmark data sets such as the Reuters-21578 news collection
and the Usenet newsgroup collection. Some topic classification
evaluation results have been widely accepted. For example,
SVM is currently the best text classifier [3]; no feature selection
method improves SVM performance [4]; and SVM feature selection
is better than Odds Ratio for naive Bayes [5]. There are mixed
conclusions regarding some document preprocessing techniques,
such as stemming and stop word removal.
However, these evaluation data sets were limited to news and
web documents, and the evaluation tasks were limited to topic
classification for the purpose of information organization. The
target concepts in literary text classification range from topic
to style, genre, emotion, and more. These different types of target
concepts can also be called document properties. A previous study
[6] showed that document properties interact with clustering
methods. Will the various document properties in literary text
classification tasks affect the classification methods? Are the
previous evaluation results still valid for literary text
classification?
This paper describes an empirical evaluation of text
classification methods for literary study. We choose a new kind
of data - the literary documents - to evaluate classification
methods. Because no benchmark data is available in the literary
domain, we select two literary text classification problems -
the eroticism classification in Dickinson’s poems and the
sentimentalism classification in early American novels - as two
cases for this study. Both problems focus on identifying certain
kinds of emotion - a document property other than topic.
We also choose two popular text classification algorithms -
naive Bayes and Support Vector Machines (SVM) - and three
feature engineering options - feature merging (stemming),
stop word removal, and statistical feature selection (Odds Ratio
and SVM) - as the subjects of evaluation. We aim to examine
the effects of the chosen classifiers and feature engineering
options on the two emotion classification problems, and the
interaction between the classifiers and the feature engineering
options. As a special case of feature merging, we also examine
the impact of Dickinson’s unconventional capitalizations on
classification performance. We choose the bag-of-words (BOW)
model for document representation.
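As a rough illustration of this setup, the sketch below builds a bag-of-words representation and trains a multinomial naive Bayes classifier with add-one smoothing. The tokenizer, the smoothing choice, the function names, and the toy sentences are all assumptions made for illustration; they are not the study's data or implementation.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, strip edge punctuation, and count the tokens."""
    tokens = (tok.strip(".,;:!?\"'()-") for tok in text.lower().split())
    return Counter(tok for tok in tokens if tok)

def train_naive_bayes(labeled_docs):
    """Collect class priors, per-class word counts, and the vocabulary."""
    class_sizes, word_counts, vocab = Counter(), {}, set()
    for text, label in labeled_docs:
        bow = bag_of_words(text)
        class_sizes[label] += 1
        word_counts.setdefault(label, Counter()).update(bow)
        vocab.update(bow)
    total = sum(class_sizes.values())
    priors = {label: n / total for label, n in class_sizes.items()}
    return priors, word_counts, vocab

def predict(model, text):
    """Return the class with the highest smoothed log posterior."""
    priors, word_counts, vocab = model
    scores = {}
    for label, prior in priors.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(prior)
        for word, n in bag_of_words(text).items():
            if word in vocab:  # words unseen in training are ignored
                score += n * math.log((counts[word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training sentences invented for this sketch (not the study's corpus).
training = [
    ("tears and sighs filled her tender heart", "sentimental"),
    ("her heart wept with tender sorrow", "sentimental"),
    ("the ship sailed north at dawn", "other"),
    ("the captain counted the cargo at dawn", "other"),
]
model = train_naive_bayes(training)
print(predict(model, "a tender heart in tears"))
print(predict(model, "the cargo ship at dawn"))
```

Because the representation discards word order, the classifier sees only which words occur and how often, which is why the choice of features matters so much in the experiments.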
We seek empirical answers to the following research questions:
1. Is SVM a better classifier than naive Bayes regarding
classification accuracy, new literary knowledge discovery,
and potential for example-based retrieval?
2. Is SVM a better feature selection method than Odds Ratio
regarding feature reduction rate and classification accuracy?
3. Does stop word removal affect the classification performance?
4. Does stemming affect the performance of classifiers and
feature selection methods?
Our experiment results show that SVM is not a universal winner
in literary text classification. After feature reduction, naive
Bayes achieves high accuracy in both cases, while SVM
succeeds in the sentimentalism classification only. Figures 1
and 2 show that SVM and naive Bayes select their top features
from different frequency ranges. Naive Bayes tends to pick
unique words, which are often infrequent. The large number
of low-frequency words results in the success of naive Bayes
in the eroticism classification. These unique words also
surprised the Dickinson scholars, who eventually found some new
erotic indicators among them. SVM tends to pick highly frequent
and discriminative words, which are scarce in the Dickinson
collection. These words (such as personal pronouns) are within
the scholars’ expectations and therefore no longer interesting.
Figure 1: Dickinson feature ranks and frequencies
Figure 2: Sentimentality feature ranks and frequencies
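A small sketch can show why a statistical score such as Odds Ratio may favor words that are unique to one class even when they are rare. The log form of the score, the document-frequency probability estimates, the add-one smoothing, and the toy documents below are assumptions chosen for illustration, not the exact configuration used in the experiments.

```python
import math
from collections import Counter

def doc_frequencies(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df

def odds_ratio_scores(pos_docs, neg_docs):
    """Log odds ratio of each word for the positive class."""
    pos_df, neg_df = doc_frequencies(pos_docs), doc_frequencies(neg_docs)
    scores = {}
    for word in set(pos_df) | set(neg_df):
        # Add-one smoothing keeps both probabilities strictly inside (0, 1).
        p_pos = (pos_df[word] + 1) / (len(pos_docs) + 2)
        p_neg = (neg_df[word] + 1) / (len(neg_docs) + 2)
        scores[word] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

# Toy documents invented for this sketch.
pos_docs = ["i long for thee", "i wait for thee tonight", "i dream of pearl"]
neg_docs = ["i walk to town", "i see the hills", "i hear the town"]

scores = odds_ratio_scores(pos_docs, neg_docs)
total_df = doc_frequencies(pos_docs + neg_docs)
for word in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(word, round(scores[word], 2), "document frequency:", total_df[word])
```

In this toy collection the word "pearl" occurs in a single document yet outranks "i", which occurs in every document of both classes and therefore carries no class information.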
Despite its high classification accuracy, the learned naive
Bayes eroticism classifier, like the low-accuracy SVM
classifier, is useless for example-based retrieval.
Figure 3 shows that the concept of eroticism cannot be learned
from a small number of examples, and the classifiers’ prediction
confidence drops quickly with the expanding prediction
Both classifiers achieve high classification accuracy in the
sentimentalism classification task, which indicates that
sentimentalism is a more straightforward concept than eroticism
for the bag-of-words representation. The two classifiers still
choose different top features but reach comparable performance
for the sentimentalism classification. So for the purpose of
feature-category correlation analysis, the two methods should
be used as complements to each other rather than one in place
of the other. This time the unique words picked by naive Bayes are
so strange that the scholars cannot make sense of them. The
common but discriminative words picked by SVM are still within
the scholars’ expectations. The learning curves and confidence
curves in Figure 4 show that both classifiers have high potential
for example-based retrieval.
The experiment results also show that self feature selection
helps both naive Bayes and SVM improve classification
accuracy. For SVM the improvement is not as significant as
for naive Bayes. Odds Ratio is better than SVM as a feature
selection method for naive Bayes. However, Odds Ratio cannot
improve SVM performance. Without feature selection, the
stemmed and unstemmed features obtain similarly low
classification accuracy in both cases, as do the case-merged
features in the Dickinson case. The micro-level analysis finds
that the effects of good mergings and bad mergings neutralize
each other overall. Stemming does not affect either feature
selection method in the eroticism classification case, but we
are surprised to find that stemming negatively affects both
feature selection methods, especially SVM, in the sentimentalism
classification case.
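Feature merging of the kind discussed above can be pictured as mapping several surface forms onto one feature before counting. In the sketch below, `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, and the token list is invented; neither comes from the study.

```python
from collections import Counter

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def count_features(tokens, merge_case=False, stem=False):
    """Count features, optionally merging case variants and stems."""
    counts = Counter()
    for tok in tokens:
        if merge_case:
            tok = tok.lower()
        if stem:
            tok = crude_stem(tok)
        counts[tok] += 1
    return counts

# Invented tokens; "Sea" and "sea" mimic an unconventional capitalization.
tokens = "The Sea sighed and the sea sighs while Sailors sailed".split()

raw = count_features(tokens)
merged = count_features(tokens, merge_case=True)
stemmed = count_features(tokens, merge_case=True, stem=True)
print(len(raw), len(merged), len(stemmed))
```

Each merging step shrinks the vocabulary, but it does so without looking at the class labels, so a merge can just as easily blur a useful distinction as strengthen a weak signal.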
We have found that the stop words obtained from the Brown
corpus are also overly common and useless in sentimentalism
classification. However, the Brown stop words are mostly
uncommon in the Dickinson collection. Personal pronouns -
the group of function words usually treated as stop words -
turn out to be highly relevant features for eroticism
classification.
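One way to check how well a stop list fits a collection is to measure how many of its entries actually occur and what share of the tokens they account for. The sketch below does this for a hypothetical eight-word stop list and two invented sentences; the Brown-derived list and the Dickinson poems are not reproduced here.

```python
from collections import Counter

def stop_word_profile(docs, stop_words):
    """Return (share of the stop list that occurs, share of tokens removed)."""
    counts = Counter(tok for text in docs for tok in text.lower().split())
    total_tokens = sum(counts.values())
    present = {word for word in stop_words if counts[word] > 0}
    removed_tokens = sum(counts[word] for word in present)
    return len(present) / len(stop_words), removed_tokens / total_tokens

# Hypothetical stop list and documents, for illustration only.
stop_words = {"the", "of", "and", "he", "she", "however", "therefore", "which"}
docs = ["she gave me the rose", "he and she walked"]

coverage, token_share = stop_word_profile(docs, stop_words)
print(f"stop list entries present: {coverage:.0%}")
print(f"tokens removed: {token_share:.0%}")
```

Here only half of the stop list occurs at all, and the entries that do occur include the pronouns "he" and "she", which would be discarded even if they were informative for the target concept.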
Our study extends the empirical evaluation of text classification
methods to emotion classification tasks in the literary domain.
Some conclusions are consistent with those of previous
research, such as that Odds Ratio does not improve SVM
performance and that stop word removal might harm classification.
Some conclusions contradict previous results, such as that SVM
does not beat naive Bayes in both cases. Some findings are new
to this area - SVM and naive Bayes select top features in
different frequency ranges; stemming might harm feature
selection methods. These experiment results provide new
insights into the relation between classification methods, feature
engineering options, and non-topic document properties.
Figure 3: Potential for Dickinson example-based retrieval
Figure 4: Potential for sentimentalism example-based retrieval
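The potential for example-based retrieval can be probed with a learning curve: train on progressively larger sets of examples and record the accuracy on held-out documents. The sketch below assumes a small add-one-smoothed naive Bayes classifier and invented toy documents; it illustrates the procedure only and does not reproduce the curves in Figures 3 and 4.

```python
import math
from collections import Counter

def train(docs):
    """docs is a list of (tokens, label) pairs."""
    n_docs, words, vocab = Counter(), {}, set()
    for tokens, label in docs:
        n_docs[label] += 1
        words.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return n_docs, words, vocab

def classify(model, tokens):
    """Pick the label with the highest smoothed log posterior."""
    n_docs, words, vocab = model
    best, best_score = None, -math.inf
    for label in sorted(n_docs):
        denom = sum(words[label].values()) + len(vocab)
        score = math.log(n_docs[label] / sum(n_docs.values()))
        score += sum(math.log((words[label][t] + 1) / denom)
                     for t in tokens if t in vocab)
        if score > best_score:
            best, best_score = label, score
    return best

def learning_curve(train_docs, test_docs, sizes):
    """Accuracy on test_docs after training on the first n training docs."""
    curve = []
    for n in sizes:
        model = train(train_docs[:n])
        hits = sum(classify(model, toks) == label for toks, label in test_docs)
        curve.append((n, hits / len(test_docs)))
    return curve

# Invented toy data; the classes alternate so small prefixes stay mixed.
train_docs = [
    ("tears sorrow heart".split(), "sentimental"),
    ("ship cargo dawn".split(), "other"),
    ("tender heart sighs".split(), "sentimental"),
    ("captain ship north".split(), "other"),
]
test_docs = [
    ("tender tears".split(), "sentimental"),
    ("captain cargo".split(), "other"),
    ("sighs sorrow".split(), "sentimental"),
    ("north dawn".split(), "other"),
]
curve = learning_curve(train_docs, test_docs, [1, 2, 4])
print(curve)
```

A concept that suits example-based retrieval shows a curve that rises quickly with few examples; a curve that stays flat until most of the data has been seen suggests the concept cannot be captured from a handful of seed documents.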
Our experiment results also provide guidance for classification
method selection in literary text classification applications.
We suggest that both SVM and naive Bayes be used for
feature-category correlation analysis. The number of
support vectors in the SVM model indicates the complexity of
the target concept. A complex concept is hard to learn from
a small training set. Feature reduction produces smaller and more
generalizable models, but statistical methods are a better choice
than arbitrary feature reduction (such as stemming and stop
word removal), which is insensitive to particular classification
tasks.
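Statistical feature selection with a linear classifier can be sketched by training the classifier and ranking features by the absolute value of their learned weights, in the spirit of the linear-weight approach cited in note 5. The batch subgradient trainer, its hyperparameters, and the toy term-document matrix below are assumptions for illustration; the SVM implementation used in the experiments is not specified in this abstract.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Batch subgradient descent on L2-regularized hinge loss (no bias)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regularizer
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # this example violates the margin
                for j, xj in enumerate(xi):
                    grad[j] -= yi * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Invented binary bag-of-words rows; columns follow `vocab`.
vocab = ["she", "thee", "the", "town"]
X = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
y = [1, 1, 1, -1, -1, -1]  # +1 and -1 mark the two classes

weights = dict(zip(vocab, train_linear_svm(X, y)))
ranked = sorted(vocab, key=lambda term: abs(weights[term]), reverse=True)
for term in ranked:
    print(term, round(weights[term], 3))
```

Unlike stemming or stop word removal, this ranking depends on the labels: a word is kept because it separates the two classes in this particular task, not because of its form or its general frequency.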
1. Bei Yu and John Unsworth, "Toward Discovering Potential Data
Mining Applications in Literary Criticism," Digital Humanities
2006 Conference Abstracts (Paris: CATI, Université
Paris-Sorbonne, 2006): 237–239.
2. Fabrizio Sebastiani, "Machine Learning in Automated Text
Categorization," ACM Computing Surveys 34.1 (2002): 1–47.
3. Thorsten Joachims, "Text Categorization with Support Vector
Machines: Learning with Many Relevant Features," Proceedings
of ECML-98, 10th European Conference on Machine Learning
(1998), and Yiming Yang and Xin Liu, "A Re-examination of Text
Categorization Methods," Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR’99) (Berkeley, CA:
ACM, 1999): 42–49.
4. George Forman, "An Extensive Empirical Study of Feature
Selection Metrics for Text Categorization," Journal of Machine
Learning Research 3 (2003): 1289–1305, and Dunja Mladenic
and Marko Grobelnik, "Feature Selection for Unbalanced Class
Distribution and Naive Bayes," Proceedings of the Sixteenth
International Conference on Machine Learning (ICML) (1999).
5. Dunja Mladenic, Janez Brank, Marko Grobelnik, and Natasa
Milic-Frayling, "Feature Selection Using Linear Classifier
Weights: Interaction with Classification Models," ACM SIGIR
’04 (2004): 234–241.
6. J. Morato, J. Llorens, G. Genova, and J. A. Moreiro, "Experiments
in Discourse Analysis Impact on Information Classification and
Retrieval Algorithms," Information Processing and Management
39 (2003): 825–851.

Conference Info


ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

Series: ADHO (2)

Organizers: ADHO
