Central Connecticut State University
The proposed poster will summarize recent work on
finding clusters of Edgar Allan Poe’s short stories.
These are defined by high rates of word usage for five
word collections, each of which is defined by a similar
meaning. For example, one of them focuses on death
and has terms such as corpse, dead and die. By choosing
word groups pertinent to the types of stories that Poe
wrote, the resulting clusters make more intuitive sense
than those produced by the clustering algorithms common
in information retrieval.
Humans enjoy dividing texts into similar clusters. For
example, the literary critic Daniel Hoffman discusses
Poe’s stories in thematic groups (Hoffman, 1990). Similarly,
vendors such as Amazon.com create clusters of
recommendations for users, which are based on
past purchases. Consequently, there is
value in creating a clustering algorithm that can explain
how the grouped texts are similar in terms that a human
can appreciate.
Unfortunately, the most popular techniques in information
retrieval for computing text similarity produce
results that lack appeal to human sensibilities. For example,
the vector space model (Section 2.1 of Grossman
and Frieder, 2004) reduces a text to a table of (usually
weighted) word counts where each column represents a
text and each row represents a word. Each column can
be thought of as a vector, so the geometric idea of angles
between texts can be introduced, where smaller angles
correspond to higher similarity. This geometric point of
view is powerful because it is well understood by mathematicians,
and it already has been applied to studying
information. For example, this is the focus of the book
Geometry and Meaning (Widdows, 2004). However, the
dimensions of these vectors can be quite high (in the tens
of thousands), which makes the results hard to reconcile
with human intuition.
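The angle-based notion of similarity described above can be sketched in a few lines. This is a minimal illustration that assumes unweighted counts and a whitespace tokenizer; the two sample texts and the function names are invented for the example and are not taken from the poster.

```python
import math
from collections import Counter


def word_counts(text):
    """Return a Counter of lowercase tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so report no similarity
    return dot / (norm_a * norm_b)


# Toy texts, invented for illustration only.
text_a = "the dead heart beat beneath the floor"
text_b = "the heart of the dead city"

similarity = cosine_similarity(word_counts(text_a), word_counts(text_b))
# A smaller angle corresponds to a higher similarity.
angle_degrees = math.degrees(math.acos(min(1.0, similarity)))
```

Even in this toy case the shared function word "the" contributes most of the similarity, which is one reason the raw model can be hard to reconcile with intuition.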
The technique used in this poster builds on the vector
space model as follows. Instead of including all the
words that Edgar Allan Poe uses in his short stories
(about 20,000 total), the vocabulary is reduced to a few
word groups, each defined by a shared meaning and
built by using a thesaurus. Note that this removes function
words such as the, of, and, a and to, which primarily
serve grammatical roles and are less interesting to a human
reader. Five groups are used in this poster, which
are based on the following themes: death, body, spiritual,
horror, and family. For example, the top ten most
frequent death words in Poe are death, corpse, dead,
murder, died, die, deceased, dying, fatal, and deadly.
Anyone familiar with his short stories would not be surprised
by these word groups. For example, many of his
stories are about people who die (for instance, “Morella,”
“Eleonora,” and “The Oval Portrait”) or are killed
(for instance, “The Black Cat,” “The Murders in the Rue
Morgue,” and “The Tell-Tale Heart”).
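As a rough sketch of how a word group can be applied to a text, the following counts occurrences of the ten death words listed above. The tokenizer, the function names, and the sample sentence are assumptions made for illustration; the full death group used in the poster contains more than these ten words.

```python
import re

# The ten most frequent death words listed in the text (the full group is larger).
DEATH_WORDS = {
    "death", "corpse", "dead", "murder", "died",
    "die", "deceased", "dying", "fatal", "deadly",
}


def tokenize(text):
    """Split a text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def group_count(tokens, word_group):
    """Number of tokens that belong to the given word group."""
    return sum(1 for token in tokens if token in word_group)


# Invented sample sentence, not a quotation from Poe.
sample = "The corpse lay still; he was dead, and death had come swiftly."
tokens = tokenize(sample)
death_hits = group_count(tokens, DEATH_WORDS)
```

Because only group members are counted, function words such as "the" and "and" drop out without needing a separate stop list.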
Since word frequencies are a function of text length,
Poe’s story lengths must be taken into account. This can
be done with word rates, that is, the count of a word
divided by the length of the text. Unfortunately, these can
also depend on text
length (see discussion in Chapter 1 of Baayen, 2002), so
the stories are divided into groups by length, and each group
is analyzed
separately. Three groups are used in this poster: 2000 to
3000, 3001 to 4200, and 4201 to 6000 words.
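The rate and length-group computations can be sketched as follows. The three ranges are the ones given above; the function names and the example story (2500 words containing 5 death words) are hypothetical.

```python
# Length ranges in words, as stated in the text.
LENGTH_BINS = [(2000, 3000), (3001, 4200), (4201, 6000)]


def rate_per_thousand(group_count, total_words):
    """Word group rate in words per thousand (wpt)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000.0 * group_count / total_words


def length_bin(total_words):
    """Return the (low, high) range containing total_words, or None if outside all ranges."""
    for low, high in LENGTH_BINS:
        if low <= total_words <= high:
            return (low, high)
    return None


# Hypothetical story: 2500 words, 5 of which are death words.
rate = rate_per_thousand(5, 2500)
story_bin = length_bin(2500)
```

Stories whose length falls outside all three ranges get `None` here; how such stories were handled in the poster is not stated, so that choice is an assumption of this sketch.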
For each of these three ranges, matrices are constructed
where each row represents a word group, each column
represents a Poe short story, and the entries are word
group rates. Although the vector space model could be
used in this situation, the alternative method of formal
concepts is used because it is more interpretable. Formal
concept analysis (FCA) is a technique to create a
double lattice of concepts given a list of objects and a
list of attributes. It is a central tool in concept data
analysis (Carpineto and Romano, 2004) and is used to
organize information in a way closer to how humans do.
For this application, the objects are Poe’s short stories,
and each attribute records that a story has a word group rate
above the median.
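One way to turn a matrix of word group rates into objects and attributes is sketched below. The story names and rates are invented, and the dictionary layout is an assumption chosen for readability; only the rule (a story has an attribute when its rate is above the median for that word group) comes from the text.

```python
from statistics import median

# Invented rates in words per thousand: word group -> story -> rate.
rates = {
    "death": {"Story A": 0.5, "Story B": 2.0, "Story C": 1.2},
    "body": {"Story A": 6.1, "Story B": 4.0, "Story C": 5.4},
}


def build_context(rates):
    """Map each story to the set of word groups whose rate is above that group's median."""
    context = {}
    for group, by_story in rates.items():
        threshold = median(by_story.values())
        for story, rate in by_story.items():
            attributes = context.setdefault(story, set())
            if rate > threshold:  # strictly above the median, as in the text
                attributes.add(group)
    return context


context = build_context(rates)
```

With an odd number of stories, the story sitting exactly at the median does not receive the attribute, so slightly fewer than half the stories have it.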
A formal concept consists of a set of objects and a set
of attributes, where each object has all the attributes,
each attribute is shared by all the objects, and neither set
can be enlarged without breaking this property. This set
of objects is a cluster, which is describable by its shared
attributes. Applied to Poe’s short stories, each cluster
is determined by its use of the five word groups defined
above. Since a list of death words is both straightforward
to compute and evocative for a human reader,
such story clusters have intuitive appeal. Moreover,
several published algorithms efficiently
find all the formal concepts. For this project Ganter’s
algorithm is used (Ganter, 1984).
Here is a specific example. There are thirteen Poe stories
that are between 2000 and 3000 words long. We use the
five attributes defined above, one for each word group.
A story has one of these if its word group rate is above
the median rate for all thirteen stories. For example,
the median death word rate is 1.22 words per thousand
(wpt). The six stories with higher rates have the attribute
death; the other seven do not. For body words, the median
is 5.44 wpt, so the six stories above this threshold
have the body attribute. The same computation is done
for the remaining three word groups.
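Once every story has its set of attributes, the formal concepts can be enumerated. The sketch below is not Ganter's algorithm; it is a brute-force substitute that closes every subset of attributes, which is feasible here because five attributes give only 32 subsets. The four stories and their attributes are invented for illustration.

```python
from itertools import combinations


def extent(context, attrs):
    """Stories that have every attribute in attrs."""
    return frozenset(s for s, owned in context.items() if attrs <= owned)


def intent(context, stories, attributes):
    """Attributes shared by every story in stories (all attributes if stories is empty)."""
    shared = set(attributes)
    for story in stories:
        shared &= context[story]
    return frozenset(shared)


def formal_concepts(context, attributes):
    """Return all (extent, intent) pairs by closing each subset of attributes."""
    concepts = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(sorted(attributes), size):
            stories = extent(context, frozenset(subset))
            concepts.add((stories, intent(context, stories, attributes)))
    return concepts


# Invented context: story -> word groups with an above-median rate.
toy_context = {
    "Story A": {"death", "body", "horror"},
    "Story B": {"death", "body"},
    "Story C": {"horror"},
    "Story D": set(),
}
toy_attributes = {"death", "body", "horror"}
concepts = formal_concepts(toy_context, toy_attributes)
```

In this toy context, death and body always occur together, so no concept has death without body. Each resulting pair is a cluster of stories together with the word groups that describe it.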
Figure 1. Formal concept lattice for Edgar Allan Poe
stories with lengths between 2000 and 3000 words.
Formal concepts always form two lattices, which together are
called a Galois
lattice. One focuses on the stories, and the other focuses
on attributes. Both of these are ordered by set inclusion, and
they are related: as the number of attributes increases,
the number of objects decreases, and vice versa. In this
application, nineteen formal concepts are found, which
are given in Fig. 1. We consider one here, which consists
of the Poe stories “The Tell-Tale Heart,” “The Masque of
the Red Death,” and “Morella,” and the attributes death,
body, and horror. Although the plots of these stories differ,
there are similarities apparent to a human. First, all
three stories are about death. In “The Tell-Tale Heart”
the narrator tells how he kills his roommate. The red
death is a pestilence, and in “Morella” the narrator tells
of his wife’s obsession with death and rebirth. Second,
all three stories discuss the body. The first one features
an evil eye and a beating heart, while the red of the second
one refers to blood. Morella dies giving birth, and
her daughter grows to be just like her mother physically,
which is described in the story. Finally, it is no surprise
that stories about death that include bodily descriptions
would have many words related to horror.
The tools of FCA and Galois lattices used in this poster
are flexible. It is this author’s belief that additional applications
will be found, both for other literature and for
other types of texts.
References
Baayen, R. H. (2002). Word Frequency Distributions.
New York: Springer.
Carpineto, C. and Romano, G. (2004). Concept Data
Analysis: Theory and Applications. Chichester: Wiley.
Ganter, B. (1984). Two basic algorithms in concept
analysis. Technical Report FB4 – Preprint No. 831, TU
Darmstadt, Germany.
Grossman, D. A. and Frieder, O. (2004). Information
Retrieval: Algorithms and Heuristics, 2nd Edition. Dordrecht:
Springer.
Hoffman, D. (1990). Poe Poe Poe Poe Poe Poe Poe.
New York: Marlowe & Co.
Widdows, D. (2004). Geometry and Meaning. Palo
Alto: Center for the Study of Language and Information.
Hosted at University of Maryland, College Park
College Park, Maryland, United States
June 20, 2009 - June 25, 2009
Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/
Series: ADHO (4)
Organizers: ADHO