Mining Shakespeare

Stephen Ramsay

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

"[T]he computer," writes Susan Hockey in her 2000 book Electronic Texts in the Humanities, "is best at finding features or patterns within a literary work and counting occurrences of those features" (Hockey 66). For many areas of inquiry, such finding and counting is eminently useful. Word-frequency analysis in the context of computational linguistics, concordance generation as a prelude to the study of word usage, and the various search functions that have now become an ordinary part of the task of research in so many disciplines represent clear examples of the utility of computational tools. For scholars engaged in the task of literary critical interpretation, however, such finding and counting can seem beside the point. As Hugh Craig put it:

"The leap from frequencies to meanings must always be a risky one — Lower-level features are easy to count but impossible to interpret in terms of style; counting images or other high-level structures brings problems of excessive intervention at the categorization stage, and thus unreliability and circularity." (Craig 103)

The risk of which Craig speaks is not merely a matter of interpretive caution. In most cases, low-level features simply don't assert themselves in any obvious way into the broad, complex patterns upon which literary critical interpretation depends.

Data mining provides a suggestive set of methods for bridging the gulf between low and high. It has its roots in a number of statistical techniques with venerable histories in digital humanities (in particular, the use of factor analysis in the study of literary and philosophical texts)
[Note 1: cf. J. F. Burrows and D. H. Craig's studies of Romantic and Renaissance tragedy and John Bradley and Geoffrey Rockwell's use of cluster analysis for the study of Hume's Dialogues, op cit.]
, but introduces an exploratory dimension far more conformable to the elaborate task of prompting meaningful critical insight. Data mining techniques operate on low-level features, but use a variety of statistical and logic-programming methods to discern broad complex patterns in the data set (such as classifications, categorizations, and prediction models) that are not conceived in advance. In other words, data mining lets us ask what's interesting about an apparently disparate set of low-level features without having to form any concrete expectations in advance.

This paper presents research on the structure of Shakespearean drama and its relation to genre categorization using two programs: StageGraph and D2K. StageGraph generates directed graph visualizations of scene changes and character movements from XML representations of plays (figure 1). It can also use the generated graphs to produce matrices of individual graph properties (e.g. degree number, number of cycles, diameter, chromatic number). While such graphs and matrices provide fertile ground for interpretive reflection, the utility of the graphs is greatly enhanced when the generated properties are themselves mined for broad patterns and features, and the results presented in the form of an open-ended visualization.

[Figure 1]

The author and a research assistant at the National Center for Computing Application used the D2K software to conduct naive Bayesian analysis and decision tree generation (two standard data mining techniques) on the graph property matrices to try to see if the low-level structural features of Shakespeare's plays assemble themselves into clusters that correspond to the traditional genre categories of comedy, tragedy, history, and romance. We then used a technique pioneered by one of the project participants that combines concept-tree clustering with shaded similarity matrices to generate a visualization of the degrees of similarity among Shakespeare's plays in terms of genre (figure 2). The results are extremely suggestive, in that the visualization not only groups the plays broadly into traditional generic categories, but also contains anomalies that correspond to some of the more influential insights into Shakespearean genre (for example, Susan Snyder's argument that Othello possesses the basic structure of comedy appears to be confirmed by the data mining algorithm).

[Figure 2]

Most interesting of all, however, are the anomalies that represent neither traditional classification nor the product of exegetical insight. For it is here, I believe, that data mining and visualization present the most promise for text analysis practitioners in literary study. Though it may be used to support or refute hypotheses, data mining is far more useful in the service of the broad humanistic mandate to find new and insightful ways of looking at textual artifacts.

Bibliography

Bradley, John
Rockwell, Geoffrey
Watching Scepticism: Computer Assisted Visualization and Hume's Dialogues
Research in Humanities Computing
5
32-47
1996

Burrows, J.F.
Craig, D.H.
Lyrical Drama and the 'Turbid Montebanks': Styles of Dialogue in Romantic and Renaissance Tragedy
Computers and the Humanities
28
63-86
1984

Craig, Hugh
Authorial Attribution and Computational Stylistics: If You Can Tell Authors Apart, Have You Learned Anything About Them?
Literary and Linguistic Computing
14
103-13
1999

Hockey, Susan
Electronic Texts in the Humanities
Oxford University Press
Oxford
2000

Synder, Susan
The Comic Matrix of Shakespeare's Tragedies: Romeo and Juliet, Hamlet, Othello, and King Lear
Princeton University Press
Princeton
1979

Wang, Jun
Yu, Bei
Gasser, Les
Concept Tree Based Clustering Visualization with Shaded Similarity Matrices
Proceedings of the 2002 IEEE International Conference on Data Mining (2002)
697-700
2002

Full text license: This text is republished here with permission from the original rights holder.

Mining Shakespeare

1. Stephen Ramsay

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2005