University at Buffalo, State University of New York (SUNY)
Janya, Inc
Janya, Inc
Introduction
There has been an increasing interest in applying automatic
text analysis techniques to various text classification
problems in literature and the social sciences.
Examples of such tasks include determining biases in
political coverage, or analyzing mood in literature. Earlier
techniques were based on simplistic corpus analysis
techniques such as counting word frequencies and
co-occurrences. Access to robust machine learning
technology and tools has enabled more sophisticated text
mining techniques to be developed. Yu (2008) discusses
the use of text classification methods in the literary
domain. That study compared the performance of two
popular algorithms, naïve Bayes and support vector machines
(SVMs), in two literary text classification tasks.
While this trend represents progress in automatic text
mining, it still reflects a reliance on primitive features
such as the bag-of-words model. In such models, text is
represented as a vector of weighted words; word order
is disregarded and only frequency information is used.
Such techniques are inherently limited in the granularity
of the analysis they can perform, which is typically restricted to
the document level. For more fine-grained tasks such as
sentiment analysis with respect to people, characters or
topics, a more sophisticated model of relevant context is
required.
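The limitation described above can be illustrated with a minimal sketch. The two sentences are invented for illustration (they are not taken from any novel), and `bag_of_words` is a hypothetical helper, not part of any system discussed here. Because word order is discarded, the two sentences receive identical representations even though they attribute praise and blame to different characters.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a text to word counts; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Invented sentences: same words, opposite attribution of praise and blame.
s1 = "Edmund praised Mary and blamed Fanny."
s2 = "Edmund praised Fanny and blamed Mary."

# Both sentences collapse to the same bag of words.
print(bag_of_words(s1) == bag_of_words(s2))
```

A classifier that sees only these counts cannot tell which character is being praised, which is why a model of the context around each entity is needed.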
This work discusses the use of entity profiles to represent
the context in which to make judgments regarding an entity,
where an entity can represent an individual or an organization,
or other salient entity types. An entity profile
reflects a consolidation of all important information pertaining
to an entity within a document. For a person (or a
character in a novel), the entity profile would include all
mentions of the individual, including co-referential mentions,
as well as relationships and events involving the
person. The representation of such information is typically
highly structured, such as spouse_of(Maria
Bertram, Mr. Rushworth), with a link to the text
snippet or sentence from which the relationship or event
was extracted. An entity profile, when compiled from a
collection of documents or a lengthy novel, is a rich information
object that provides the required context in which to
compare two individuals, classify human behaviour, etc.
Automatically extracting entity profiles (and associated
text snippets) is a challenging task in information extraction;
the next section describes a system which has been
designed for this purpose. The rest of the paper describes
the use of entity profiles as the context in which automatic
sentiment analysis (Chesley et al., 2006) of fictional
characters can be computed. The example is from Jane
Austen’s Mansfield Park.
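One possible way to represent an entity profile as a data structure is sketched below. The class and field names (`Relation`, `EntityProfile`, `add_relation`) are assumptions made for illustration and do not reflect the actual representation used by the extraction system. The snippet text is a placeholder rather than a quotation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A structured relationship linked to its source sentence (hypothetical)."""
    predicate: str  # e.g. "spouse_of"
    subject: str
    obj: str
    snippet: str    # sentence from which the relation was extracted


@dataclass
class EntityProfile:
    """Consolidated information about one entity (hypothetical layout)."""
    name: str
    mentions: set[str] = field(default_factory=set)
    relations: list[Relation] = field(default_factory=list)
    snippets: list[str] = field(default_factory=list)

    def add_relation(self, relation: Relation) -> None:
        """Record a relation and keep its source snippet as context."""
        self.relations.append(relation)
        if relation.snippet not in self.snippets:
            self.snippets.append(relation.snippet)


# Placeholder standing in for the extracted sentence.
PLACEHOLDER = "<sentence from which the relation was extracted>"

profile = EntityProfile(name="Maria Bertram", mentions={"Maria Bertram"})
profile.add_relation(
    Relation("spouse_of", "Maria Bertram", "Mr. Rushworth", PLACEHOLDER)
)
print(profile.relations[0].predicate, len(profile.snippets))
```

Keeping the snippet alongside each relation preserves the link back to the text, so later analysis can operate on the sentences rather than on the structured facts alone.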
Semantex: An Information Extraction
Engine
Semantex (Srihari et al., 2008) is a domain-independent,
intermediate-level information extraction (IE) engine. The
linguistic processor modules support different levels
of natural language processing, including orthography,
morphology, syntax, co-reference resolution, semantics,
and discourse. Semantex creates five categories of information objects:
(i) Named Entities (NE): proper names of persons, organizations, products, locations, etc.
(ii) Correlated Entity (CE) relationships: local relationships between entities within sentence boundaries. The results are consolidated into EPs based on co-reference and alias support.
(iii) Entity Profiles (EP): complex, rich information objects that collect entity-centric information, in particular all the individual mentions of an entity in a document and any CE relationships the entity is involved in.
(iv) Subject-Verb-Object (SVO) triples: SVO triples decoded by Semantex are logical rather than syntactic; surface variations such as active voice vs. passive voice are decoded into the same underlying logical relationships.
(v) General Events (GE): verb-centric information objects representing "who did what to whom, when and where".
These five types of information objects capture the key content of
the processed text. For this project, the most relevant objects
are CEs, EPs, and SVOs.
Sentiment Analysis based on Context
provided by Entity Profiles
We use the set of text snippets (or sentences) from an entity
profile as the context in which features for sentiment
analysis are computed. Sentiment analysis is performed in
two phases: (i) the first phase, training, focuses on
compiling a lexicon of subjective words and phrases
along with their polarities (positive/negative) and associated
weights; (ii) in the second phase, sentiment association,
a text document collection is processed and
sentiment is assigned to the entity profiles of interest.
For sentiment analysis, a lexicon of subjective words/
phrases (those with positive or negative polarity associated
with them) is first compiled through (i) expansion
from adjectives in WordNet using synonyms based on
positive and negative seed adjectives and (ii) use of a
search engine to find words that appear “near” a known
positive/negative adjective. To associate sentiment with
an entity, we accumulate polarity weights (using a sliding
window) from the sentences within the entity profile;
thresholding results in a final positive, negative or neutral
polarity for the entity in question.
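The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the synonym table is a tiny hypothetical stand-in for WordNet, expanded words simply inherit the seed's weight, the sliding window is interpreted as a fixed number of tokens on either side of a mention, mentions are reduced to single tokens, and the window size and threshold are arbitrary. The example sentences are invented.

```python
import re


def expand_seeds(seeds, synonyms, rounds=1):
    """Grow a polarity lexicon from seed adjectives via a synonym table.

    Assumption: each synonym inherits the weight of the word it came from.
    """
    lexicon = dict(seeds)
    for _ in range(rounds):
        for word, weight in list(lexicon.items()):
            for synonym in synonyms.get(word, ()):
                lexicon.setdefault(synonym, weight)
    return lexicon


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def snippet_score(snippet, mentions, lexicon, window=5):
    """Sum weights of lexicon words within `window` tokens of a mention."""
    tokens = tokenize(snippet)
    positions = [i for i, tok in enumerate(tokens) if tok in mentions]
    score = 0.0
    for i, tok in enumerate(tokens):
        weight = lexicon.get(tok)
        if weight is not None and any(abs(i - p) <= window for p in positions):
            score += weight
    return score


def profile_polarity(snippets, mentions, lexicon, window=5, threshold=1.0):
    """Accumulate snippet scores, then threshold to a three-way label."""
    total = sum(snippet_score(s, mentions, lexicon, window) for s in snippets)
    if total >= threshold:
        return "positive"
    if total <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical seeds and synonym table (stand-in for WordNet lookups).
SEEDS = {"good": 1.0, "bad": -1.0}
SYNONYMS = {"good": ["agreeable", "pleasant"], "bad": ["selfish", "careless"]}
LEXICON = expand_seeds(SEEDS, SYNONYMS)

# Single-token mentions only; multi-word mentions would need phrase matching.
MENTIONS = {"mary", "she", "herself"}
SNIPPETS = ["Mary was agreeable and pleasant.", "She was careless."]
print(profile_polarity(SNIPPETS, MENTIONS, LEXICON))
```

Restricting the sum to words near a mention is one simple way to approximate the association of a subjective word with the entity in question; it does not resolve which character a word actually describes.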
Sentiment Analysis applied to Jane Austen’s
Mansfield Park
In this section, sentiment analysis is applied to
characters in Mansfield Park by Jane Austen. Specifically,
it is applied to the entity profile for the character
Mary Crawford at different points in the novel. The
process was as follows.
1. The text of Mansfield Park, originally consisting
of 159,500 words, was split into four parts at chapter
breaks with some consideration to the progress
of the plot. These breaks were chosen to track the
transformation of the character Mary Crawford
from first meeting through the revelation of some
flaws in the character.
2. Each of the four sections was processed by Semantex;
entity profiles were generated for all the characters,
including Mary Crawford. This resulted in four
entity profiles for Mary Crawford at different stages
in the plot.
3. Sentiment analysis was computed for each of the
entity profiles: the goal was to correlate the output
of automatic sentiment analysis with the transformation
in the character over time.
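The three steps above can be sketched as a small pipeline. The function names, the break positions, and the stand-in profile builder and scorer are all hypothetical; the source does not state where the chapter breaks were placed, so the example uses an arbitrary eight-chapter toy input.

```python
def split_at_chapters(chapters, breaks):
    """Split a list of chapter texts into consecutive parts.

    `breaks` lists the index of the first chapter of every part after the first.
    """
    bounds = [0, *breaks, len(chapters)]
    return [chapters[start:end] for start, end in zip(bounds, bounds[1:])]


def track_character(chapters, breaks, build_profile, score_profile):
    """Build and score one entity profile per part of the text."""
    results = []
    for part in split_at_chapters(chapters, breaks):
        profile = build_profile("\n".join(part))
        results.append(score_profile(profile))
    return results


# Toy input: eight placeholder "chapters" split into four parts.
toy_chapters = [f"chapter {n} text" for n in range(1, 9)]
toy_breaks = [2, 4, 6]

# Stand-ins for the extraction engine and the sentiment scorer.
labels = track_character(
    toy_chapters,
    toy_breaks,
    build_profile=lambda text: text.splitlines(),
    score_profile=lambda profile: len(profile),
)
print(labels)
```

Passing the profile builder and scorer as parameters keeps the splitting logic independent of whichever extraction and sentiment components are actually used.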
The sentiment analysis output based on two entity profiles
for Mary Crawford generated at different stages
(parts one and three) is shown in the table below. Part
three reflects the duration of time just before and after
Maria’s elopement with Henry Crawford, Mary’s
brother. Mary’s reaction to this event exposes flaws in
her nature, and contributes to a reader’s judgment of her
character as negative. In each case, a subset of the subjective
words that contribute to the overall polarity (positive
or negative) is shown, along with snippets of text
(based on the entity profile) in which those words appeared.
These text snippets are a subset of the sentences which
contribute to the entity profile for Mary Crawford. The
entire profile is not shown for reasons of space. It
should be noted that snippets from the entity profile are
not necessarily contiguous.
Our system has judged the first profile to be positive, but
the second one to be neutral rather than negative. This
could be partly due to an aggregation of sentiment that
is performed over the entire section. Considerable
work remains to improve the accuracy of
automated sentiment analysis of fictional characters. For
example, words such as “ashamed” and “embarrassed”
do not necessarily carry negative sentiment;
this depends on the context. Another problem is proper association
of the sentiment with the character in question.
We continue to work on these issues.
Co-referential Mentions: Mary, Mary Crawford, Miss
Crawford, she, herself, his sister
This paper has described an experiment in which automatic
sentiment analysis is used to illustrate either the
change in a character, or the perception of the character
by other characters over the progression of a story. Entity
profiles provide rich context in which to attempt other
tasks, such as measuring the similarity of characters,
both within a novel and across novels. Standard
document similarity measures may be used to accomplish
this.
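As an illustration of the character-similarity idea, cosine similarity over term-frequency vectors is one standard document similarity measure that could be applied to the snippets of two entity profiles. The source does not say which measure would be used, so this choice, the helper names, and the invented snippets are assumptions.

```python
import math
import re
from collections import Counter


def term_vector(snippets):
    """Count word occurrences across all snippets of one entity profile."""
    return Counter(
        tok for s in snippets for tok in re.findall(r"[a-z']+", s.lower())
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented snippets standing in for two characters' profiles.
profile_a = ["She was lively and witty.", "She played the harp."]
profile_b = ["She was quiet and timid.", "She read in the east room."]

print(cosine_similarity(term_vector(profile_a), term_vector(profile_b)))
```

In practice, pronouns and function words would dominate such raw counts, so some weighting or stop-word filtering would likely be needed before the scores become informative.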
The main challenge in making this technique more robust is
the accuracy of coreference, including anaphora resolution.
Mistakes in this module can cause irrelevant sentences
to be pulled into the entity profile, thus rendering
the analysis inaccurate. Efforts are underway to improve
this accuracy. Sentiment analysis can also be improved
by fine tuning the association of subjective words with
the correct character. Nevertheless, entity profiles offer a more sophisticated
basis than bag-of-words models for text analysis
concerned with human behaviour.
References
R. K. Srihari, W. Li, C. Niu and T. Cornell (2008)
"InfoXtract: A Customizable Intermediate Level Information Extraction Engine,"
Journal of Natural Language Engineering, Cambridge
U. Press, 14(1), pp. 33-69.
P. Chesley, B. Vincent, L. Xu, and R. K. Srihari (2006)
"Using Verbs and Adjectives to Automatically Classify
Blog Sentiment," Proc. AAAI-2006 Spring Symposium
on Computational Approaches to Analyzing Weblogs,
Stanford University, CA, March 2006, AAAI Press, TR
SS-06-03, pp. 27-29.
Bei Yu (2008) "An evaluation of text classification methods
for literary study," Literary and Linguistic Computing, 23, pp. 327-343.
Hosted at University of Maryland, College Park
College Park, Maryland, United States
June 20, 2009 - June 25, 2009
176 works by 303 authors indexed
Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/
Series: ADHO (4)
Organizers: ADHO