Human Computer Interaction Lab - University of Maryland, College Park
Department of English - University of Maryland, College Park, Maryland Institute for Technology and Humanities (MITH) - University of Maryland, College Park
National Center for Supercomputing Applications (NCSA) - University of Illinois, Urbana-Champaign
Computer Science - University of Maryland, College Park
Graduate School of Library and Information Science (GSLIS) - University of Illinois, Urbana-Champaign
Department of English - University of Maryland, College Park
This paper develops a rationale for “provocational” text mining in literary interpretation; discusses a specific application of the text mining techniques to a corpus of some 200 XML-encoded documents; analyzes the results from the vantage point of a literary scholar with subject expertise; and finally introduces a tool that lets non-specialist users rank a sample set, submit it to a data mining engine, view the results of the classification task, and visualize the interactions of associated metadata using scatterplots and other standard representations.
Text mining, which applies machine learning methods to textual data, is a rapidly expanding field. Canonical applications are
classification and clustering (Weiss 2005, Widdows 2004,
Witten 2000). These applications are becoming common in industry, as well as defense and law enforcement. They are also increasingly used in the sciences and social sciences,
where researchers frequently have very large volumes
of data. The humanities, however, are still only just
beginning to explore the use of such tools. In the
context of the Nora Project, a multidisciplinary team is collaborating to develop an architecture for non-
specialists to employ text mining on some 5 GB of 18th
and 19th century British and American literature. Just as
importantly, however, we are actively working to discover what
unique potential these tools might have for the humanist.
While there are undoubtedly opportunities for all
of the normative text mining applications in large
humanities repositories and digital library collections, their
straightforward implementation is not our primary
objective with Nora. As Jerome McGann and others have argued, computational methods, in order to make
significant inroads into traditional humanities research, must concern themselves directly with matters of
interpretation (2001). Our guiding assumption, therefore, has been that our work should be provocational in spirit—rather than vocational, or merely utilitarian—and that the intervention and engagement of a human subject expert is not just a necessary concession to the limits of machine learning but instead an integral part of the interpretative
loop. In important respects we see this work as an
extension of insights about modeling (McCarty 2004),
deformation (McGann 2001), aesthetic provocation (Drucker 2004), and failure (Unsworth 1997). It also comports with some of the earliest applications of
data mining, such as when Don Swanson associated
magnesium deficiency with migraine headaches, an insight
provoked by patterns uncovered by data mining but only
subsequently confirmed through extensive
traditional medical testing (Hearst 1999).
We began with a corpus of about 200 XML-encoded
letters comprising correspondence between the poet Emily
Dickinson and Susan Huntington (Gilbert) Dickinson, her sister-in-law (married to her brother William Austin). Because debates over what constitutes the erotic in Dickinson have been central to the study of her work for the last half century, we chose to explore
patterns of erotic language in this collection. In a first step our domain expert classified by hand all the documents
into two categories, “hot” and “not-hot.” This provided a baseline for evaluating the automatic classifications performed later.
We then developed a prototype tool that lets users explore automatic classification based on a training set of manually classified documents. The prototype allows users to read a letter and classify it as “hot” or “not-hot” (Fig. 1). After manually classifying
a representative set of examples (e.g., 15 hot and 15
not-hot documents), users submit this training set to the data mining classifier. For every other letter in the
corpus, users can then see the proposed classification,
review the document, and accept or change the proposed classification. Words identified by the data mining engine as possible indicators of erotic language are highlighted in the text of the document.
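The classify-then-propose loop described above can be sketched with a standard multinomial naive Bayes classifier (the paper specifies only “a standard Bayesian algorithm,” so scikit-learn stands in here for the D2K engine; the toy letters and labels below are invented for illustration):

```python
# Sketch of the training/proposal loop, assuming a multinomial naive Bayes
# model over word counts. The example "letters" and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hand-labelled training set: 1 = "hot", 0 = "not-hot"
train_texts = [
    "her breast is fit for pearls, and mine the heart",
    "write to me again, sweet Sue, for you are mine",
    "the weather has been cold and the garden needs tending",
    "Vinnie sends the household accounts and daily news",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# Propose classifications for the remaining, unlabelled letters
unseen = ["you are mine, and I shall write to you always",
          "the garden accounts arrived with the morning post"]
proposed = clf.predict(vectorizer.transform(unseen))
for letter, label in zip(unseen, proposed):
    print("hot" if label == 1 else "not-hot", "-", letter)
```

The user would then review each proposed label and accept or correct it, feeding the corrections back into the training set.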
Importantly, this process can be performed iteratively: users progressively improve the training set and re-run the automatic classification. Results are currently presented as a scatterplot, which lets users see whether the classification correlates with any other metadata attribute of the letters (e.g., date, location, presence of mutilation on the physical document). Users can see which documents
have been classified by hand (marked with triangles) and which have been categorized automatically (shown as circles). Letters classified as not-hot always appear in black, while hot letters appear in color, making it easy to spot the letters of interest.
A key aspect of our work has been to test the feasibility
of this fairly complex distributed process. The Web user
interface for manual and automatic classification is a Java
Web Start application developed at the University of
Maryland, based on the InfoVis Toolkit by Jean-Daniel
Fekete (2004). It can be launched from a normal Web
page and runs on the user’s computer. The automatic
classification is performed using a standard Bayesian
algorithm executed by a data mining tool called D2K,
hosted at the University of Illinois National Center for
Supercomputing Applications. A set of web services
perform the communication functions between the Java
Interface and D2K. The data mining is performed by
accessing a Tamarind data store provided by the
University of Georgia, which has preprocessed and
tokenized the original XML documents. The entire system
is now functional.
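The preprocessing step attributed to the Tamarind store, extracting plain text from the XML-encoded letters and tokenizing it, can be illustrated with Python's standard library (the XML fragment and tag names here are invented; the actual Dickinson encoding differs):

```python
# Sketch of preprocessing an XML-encoded letter: flatten the markup to
# plain text, then tokenize. The sample document is invented.
import re
import xml.etree.ElementTree as ET

letter_xml = """<letter id="hb123">
  <salutation>Dear Sue --</salutation>
  <body>Her breast is fit for pearls, but I was not a diver.</body>
</letter>"""

root = ET.fromstring(letter_xml)
text = " ".join(root.itertext())              # flatten the markup to plain text
tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
print(tokens)
```

Token streams of this kind are what the data mining engine consumes when building its word-count model.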
What of the results? The textual critic Harold Love
has observed of “undiscovered public knowledge”
(consciously employing the aforementioned Don
Swanson’s phrase) that too often knowledge, or its
elements, lies (all puns intended) like scattered pieces
of a puzzle but remains unknown because its logically
related parts are diffused, relationships and correlations
suppressed (1993). The word “mine,” identified by
D2K as a new indicator, is exemplary in this regard. Besides
possessiveness, “mine” connotes delving deep,
plumbing, penetrating--all things we associate with the
erotic at one point or another. So “mine” should have
already been identified as a “likely hot” word, but
has not been, oddly enough, in the extensive critical
literature on Dickinson’s desires. “Vinnie” (Dickinson’s
sister Lavinia) was also labeled by the data mining classifier
as one of the top five “hot” words. At first, this word
appeared to be a mistake, a choice based on proximity
to words that are actually erotic. Many of Dickinson’s
effusive expressions to Susan were penned in her early
years (written when a twenty-something) when her
letters were long, clearly prose, and full of the daily details
of life in the Dickinson household. While extensive
writing has been done on the blending of the erotic with
the domestic, of the familial with the erotic, and so forth,
the determination that “Vinnie” in and of itself was just as
erotic as words like “mine” or “write” was illuminating.
The result was a reminder of how or why some words are
considered erotic: by their relationship to other words.
While a scholar may un-self-consciously divide
epistolary subjects within the same letter, sometimes
within a sentence or two of one another, into completely
separate categories, the data mining classifier will
not. Remembering Dickinson’s “A pen has so many
inflections and a voice but one,” the data mining has made
us, in the words of our subject expert, “plumb much more
deeply into little four and five letter words, the function
of which I thought I was already sure, and has also
enabled me to expand and deepen some critical connections
I’ve been making for the last 20 years.”
Figure 1: Users can select a document in the collection
(here “Her breast is fit for pearls”) and read it.
They can then classify it as hot (red) or not-hot (black),
which helps build the training set.
Figure 2: After requesting that the remaining documents be
automatically classified, purple squares are placed next to each
document that had not been classified manually. Bright colors mean
that the data mining suggests the document might be “hot”; black
means “not-hot.” In the leftmost pane, a list of words is provided,
with the words found to be most representative of the hot documents
in the training set listed at the top.

Figure 3: An alternate scatterplot view of the
collection. Each dot represents a document. Time (i.e. the
median of the estimated date range) is mapped on the X axis,
and the length of the document is mapped on the Y axis.
Color represents hotness, using the same coding as before.
We can see that the longer documents were written earlier.
The display also suggests that there is no correlation
between time and hotness, and no particular time period
in which significantly more hot documents were written.
Zooming is possible to inspect particular clusters.
Drucker, J. and B. Nowviskie. (2004). “Speculative
Computing: Aesthetic Provocations in Humanities
Computing.” In S. Schreibman, R. Siemens, and
J. Unsworth (eds.), The Blackwell Companion to
Digital Humanities (pp. 431-447). Oxford: Blackwell.
Fekete, J-D. (2004). “The Infovis Toolkit.” In Proceedings
of the 10th IEEE Symposium on Information
Visualization (pp. 167-174). Washington, DC: IEEE.
Hearst, M. (1999). “Untangling Text Data Mining.” In
Proceedings of the 37th Annual Meeting of the Association
for Computational Linguistics.
Love, H. (1993). Scribal Publication in Seventeenth-
Century England. Oxford: Clarendon Press.
McCarty, W. (2004). “Modeling: A Study in Words
and Meanings.” In S. Schreibman, R. Siemens, and
J. Unsworth (eds.), The Blackwell Companion to
Digital Humanities (pp. 254-270). Oxford: Blackwell.
McGann, J. (2001). Radiant Textuality: Literature After
the World Wide Web. New York: Palgrave.
Plaisant, C., Rose, J., Yu, B., Auvil, L.,
Kirschenbaum, M.G., Smith, M.N., Clement, T.,
and Lord, G. (2006). “Exploring Erotics in Emily
Dickinson’s Correspondence with Text Mining and
Visual Interfaces.” To appear in Proceedings of the
Joint Conference on Digital Libraries (JCDL ’06).
Unsworth, J. (1997). “The Importance of Failure.” The
Journal of Electronic Publishing 3.2. At < http://
Weiss, S., et al. (2005). Text Mining: Predictive Methods
for Analyzing Unstructured Information. New York: Springer.
Widdows, D. (2004). Geometry and Meaning. Stanford: CSLI Publications.
Witten, I. and E. Frank. (2000). Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations. San Diego: Academic Press.
Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)
July 5, 2006 - July 9, 2006
151 works by 245 authors indexed
The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.
Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/