Voice Mining: A Promising New Application of Data Mining Techniques in the Humanities Domain

poster / demo / art installation
  1. J. Stephen Downie

    University of Illinois, Urbana-Champaign

  2. M. Cameron Jones

    University of Illinois, Urbana-Champaign

  3. Xiao Hu

    University of Illinois, Urbana-Champaign

Work text

There is growing interest among digital humanities researchers in examining the cultural computation utility of data mining (DM) techniques traditionally used by the scientific community. The scientific community has long used techniques such as Naïve Bayes, Support Vector Machines, Neural Networks and Decision Trees to build weather prediction systems, classification structures, risk management models and so on. Over the years, it has developed several DM experimentation environments that have matured to the point where one no longer needs to be an expert to use them. One such moderately easy-to-use DM toolkit is Weka (http://www.cs.waikato.ac.nz/ml/weka/) from the University of Waikato in Hamilton, NZ. Another is the Data-to-Knowledge (D2K)/Text-to-Knowledge (T2K) DM toolkit developed by the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC) (Downie et al. 2005) (http://alg.ncsa.uiuc.edu/do/downloads/d2k). The relative maturity of DM environments like these frees humanities researchers from the previously arduous task of writing their own complicated DM algorithms, allowing them to focus on generating new types of research questions that DM methods could address. Witness the NORA project (http://www.nora.lis.uiuc.edu), whose project description states:
“In search-and-retrieval, we bring specific queries to collections of text and get back (more or less useful) answers to those queries; by contrast, the goal of data-mining (including text-mining) is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends.” [Emphasis is ours.]
Inspired by the NORA project’s spirit of “…exposing unanticipated similarities or differences…and trends”, this poster/demonstration is intended to illustrate some novel types of research questions that can be addressed through the intra-textual application of sophisticated DM techniques. More specifically, we are currently exploring the application of a variety of DM procedures to the text associated with utterances made by characters in plays. That is, we are deeming the utterances made by each character (i.e., the words they speak along with their frequency of use) to be the set of attributes that “define” that character.
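As a minimal sketch of what "words plus frequency of use" could look like as an attribute set, the Python fragment below builds a word-count profile for each speaker. The script lines, the speaker names and the `SPEAKER. utterance` line format are illustrative assumptions, not data or conventions taken from the plays studied here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lines in "SPEAKER. utterance" form (not from the plays).
SAMPLE_LINES = [
    "ALICE. I shall take the early train.",
    "BOB. The early train is always late, I fear.",
    "ALICE. I fear nothing.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def character_profiles(lines):
    """Map each speaker to a Counter of the words that speaker utters."""
    profiles = defaultdict(Counter)
    for line in lines:
        # Split on the first ". " to separate the speaker label from the speech.
        speaker, _, utterance = line.partition(". ")
        profiles[speaker].update(tokenize(utterance))
    return dict(profiles)

def relative_frequencies(counts):
    """Convert raw word counts into proportions of the speaker's total words."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

profiles = character_profiles(SAMPLE_LINES)
alice_rel = relative_frequencies(profiles["ALICE"])
print(profiles["ALICE"].most_common(3))
```

Each profile is the kind of attribute vector a DM toolkit would consume once the per-speaker vocabularies are aligned on a shared word list.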
We have coined the phrase “voice mining” as the rubric for this new line of computer-assisted textual inquiry. We chose “voice” to highlight that we are interested in the characters as “individuals” who, through the words they “speak”, create their own individual identities and personas. We are thus adopting an anthropomorphic reading of the play text which constructs it as being analogous to the transcription of real human-to-human interactions. This anthropomorphic reading stance could be considered an extension of those traditional stylometric analyses wherein the collection of words (along with their relative frequencies) written by a given author makes up the set of attributes for describing that author. “Mining” was chosen to denote that DM techniques are the technological mode of exploration.
Character Identification: Is it possible to construct a DM-generated model that can successfully identify the character who uttered a given line of play text? If so, what are the characteristics that contribute to the construction of each character’s unique voice? Even if not entirely successful, are there interesting patterns of confusion (i.e., are two or more “voices” consistently misattributed)? Do the confusions suggest a kind of affinity or clustering among sub-groups of characters (perhaps providing otherwise non-obvious indications of such things as class or gender identity)?
Gender Identification: Is it possible to construct a DM-generated model that can successfully identify the gender of the character who uttered a given line of play text? If so, what are the characteristics that make up the gender identities of the characters? If not, are there consistent and interesting patterns of confusion? Do the confusions suggest a possible “deliberate” subversion of gender roles?
Class/Status Identification: Is it possible to construct a DM-generated model that can successfully identify the socio-economic class or status of the character who uttered a given line of play text? If so, what are the characteristics that make up the socio-economic status or class of the characters? If not, are there consistent and interesting patterns of confusion? Do the confusions suggest a possible “deliberate” subversion of class or status roles?
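To make "patterns of confusion" concrete, the following is a small from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing, together with a confusion tally of true versus predicted labels. It is not Weka's or D2K's implementation; the labels `A` and `B` and all utterances are hypothetical toy data chosen only to show the mechanics.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Collect label counts, per-label word counts and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        tokens = tokenize(text)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def confusion_matrix(model, examples):
    """Count (true label, predicted label) pairs over labelled examples."""
    matrix = Counter()
    for true_label, text in examples:
        matrix[(true_label, predict(model, text))] += 1
    return matrix

# Hypothetical toy data: two speakers, "A" and "B".
TRAIN = [
    ("A", "the garden is lovely this morning"),
    ("A", "what a lovely garden"),
    ("B", "the train leaves at noon"),
    ("B", "I must catch the noon train"),
]
TEST = [("A", "a lovely morning"), ("B", "the train is late")]

model = train(TRAIN)
matrix = confusion_matrix(model, TEST)
for (true_label, predicted), count in sorted(matrix.items()):
    print(true_label, "->", predicted, count)
```

Off-diagonal cells, where the true and predicted labels differ, are the entries one would inspect for consistently misattributed voices.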
As one can see, it is possible to come up with many other similarly framed research questions under the voice mining paradigm. The repetitive nature of the sample questions above is deliberate and is designed to emphasize that our voice mining work is not being presented as a technological magic box that provides the researcher with definitive and positive proofs. Rather, we want to stress that “voice mining” is an exploratory procedure in which it remains very important to use one’s own analytical abilities. As an exploratory procedure, prima facie “failure” to successfully model a given question should be seen as an invitation to explore the potential explanations for the “failure”, because the “failed” model could, in fact, be based upon hitherto unnoticed groupings that indicate important stylistic or subversive intentions of the creator of the play’s characters.
Because there is no a priori way to predict which particular DM techniques are the best at answering the kinds of questions that can be posited under the voice mining paradigm, we have limited our initial proof-of-concept work to three play texts drawn from the Project Gutenberg collection (http://www.gutenberg.org/):
Oscar Wilde: “The Importance of Being Earnest” (http://www.gutenberg.org/dirs/etext97/tiobe10.txt)
Bernard Shaw: “Pygmalion” (http://www.gutenberg.org/dirs/etext03/pygml10.txt)
Bernard Shaw: “Arms and the Man” (http://www.
These plays were chosen for our initial exploratory studies because they have:
a) relatively limited character sets;
b) relatively balanced representations of female and male characters; and,
c) a mix of characters by class and status.
Our poster/demonstration uses these three play texts as illustrative case studies in the application of voice mining techniques. Using examples run through both the Weka and the D2K/T2K toolkits, our poster/demonstration walks the audience step-by-step through the procedures necessary to conduct a voice mining exploration. These procedures include:
a) preprocessing of raw play text using Perl scripts to strip out extraneous text (e.g., stage directions);
b) conversion of clean text into the form needed by the DM tools (e.g., ARFF, CSV);
c) use and effects of possible attribute selection;
d) selection and running of one (or more) DM algorithms; and,
e) interpretation of results output for signs of successful modeling and for indications of meaningful confusions.
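Steps a) and b) might be sketched as follows. The authors used Perl scripts; this Python version is only an illustration. It assumes speaker lines of the form `SPEAKER. utterance` and stage directions enclosed in square brackets, which is a simplification: the Project Gutenberg texts do not all follow one convention. The sample script, the function names and the `voice_mining` relation name are all hypothetical.

```python
import re

# Hypothetical toy script; real Gutenberg play texts vary in layout.
SAMPLE_SCRIPT = """\
ALICE. [Rising.] The carriage is waiting.
BOB. Then we must go. [Exit.]
[Curtain.]
"""

SPEAKER_LINE = re.compile(r"^([A-Z][A-Z ]+)\.\s+(.*)$")
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")

def extract_utterances(script):
    """Return (speaker, utterance) pairs with bracketed directions removed."""
    pairs = []
    for line in script.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if not match:
            continue  # skip lines that carry no speaker label
        speaker, text = match.groups()
        text = " ".join(STAGE_DIRECTION.sub(" ", text).split())
        if text:
            pairs.append((speaker, text))
    return pairs

def quote(value):
    """Single-quote a value for ARFF, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def to_arff(pairs, relation="voice_mining"):
    """Serialize pairs as ARFF with a string attribute and a nominal class."""
    speakers = sorted({speaker for speaker, _ in pairs})
    lines = [
        f"@relation {relation}",
        "",
        "@attribute utterance string",
        "@attribute speaker {" + ",".join(quote(s) for s in speakers) + "}",
        "",
        "@data",
    ]
    lines += [f"{quote(text)},{quote(speaker)}" for speaker, text in pairs]
    return "\n".join(lines) + "\n"

pairs = extract_utterances(SAMPLE_SCRIPT)
arff = to_arff(pairs)
print(arff)
```

In Weka, a string attribute such as `utterance` would typically be converted into word attributes (for example with the `StringToWordVector` filter) before attribute selection and classification in steps c) and d).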
(See Figure 1 for an example output set from our “Importance of Being Earnest” voice mining exploration on gender identity.)
Figure 1. Results from a Naïve Bayes voice mining experiment constructed to explore gender identity in Wilde’s “The Importance of Being Earnest” using the Weka DM toolkit.
Our initial voice mining experiments show great promise, as Figure 1 demonstrates with its strong positive results on the gender identification task. We hope this success will prod other humanities scholars to pose new and interesting questions that voice mining could help them explore. While this poster/demonstration of voice mining techniques is intended to be an illustrative proof-of-concept, we have begun work on selecting more challenging texts from which to extract character voices. We initially chose play texts as our proof-of-concept medium because each character is clearly associated with its own utterances by the convention of script writing. Book texts, for example, pose significant pre-processing challenges in that there are fewer reliable clues that consistently provide labels for “who is saying what” upon which to build DM input.
Downie, J. S., Unsworth, J., Yu, B., Tcheng, D., Rockwell, G., and Ramsay, S. J. (2005). A Revolutionary Approach to Humanities Computing?: Tools Development and the D2K Data-Mining Framework. Proceedings of the 17th Joint International Conference of ACH/ALLC.


Conference Info



Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC

  • Keywords: None
  • Language: English
  • Topics: None