Understanding the Linguistic Construction of Gender in Shakespeare via Text Mining

Authorship
  1. Sobhan Raj Hota

    Linguistic Cognition Lab - Illinois Institute of Technology

  2. Shlomo Argamon

    Linguistic Cognition Lab - Illinois Institute of Technology

  3. Rebecca Chung

    Lewis Department of Humanities - Illinois Institute of Technology

Work text

1. Introduction
Can computational analysis better reveal how
Shakespeare's words and phrases construct characters
clearly gendered as male and female? What happens when
stylistic analysis is brought to bear on a longstanding notion in
literary and cultural studies that gender identity is a discursive
(i.e. a culturally-decodable lexical and semantic) construction?
How helpful can linguistic style be in explaining aspects of
literary style?
Last year we presented our first results (Hota et al. 2006) in
analyzing lexical items of character gender in Shakespeare. Our
observations on character gender were in line with previous
work (Argamon et al. 2003) on discriminating author gender
in modern texts, supporting the idea that Shakespeare projects
character gender in a manner consistent with patterns of
authorial gender projection found in other texts both literary
and nonliterary.
In this abstract, we extend and refine our methods by focusing
on lexical and semantic language use by Shakespeare for
determining the gender of his literary characters. Here we used
a better version of the Shakespeare corpus, the Nameless
Shakespeare. In the Nameless Shakespeare, each lexical item
is fully tagged with lemma entries. Our accuracy has improved
to 81% using lemma features, compared to last year’s results.
We observe that lemmas and tri-grams help identify a
Shakespearean character’s gender. In addition, discrimination
models using lemmas and tri-grams may allow literary and
cultural scholars to see discursive patterns that, while impossible [...] character’s attitudes as male or female.
The fact that these patterns hold across literary and nonliterary
texts, and from early-modern to modern English, supports their
possible significance in understanding discursively-formed
gender identity. We further observe other distinguishing
features, including the fact that some feature constellations
match well to previous reports of features that distinguish male
from female authors (Argamon et al. 2003). We have analyzed
the concordance lines of lexical and lemma trigram occurrences
from the corpus and found patterns of phrasal usage that indicate
significant gender differences in language use in the plays.
We are interested in understanding gender characterization
based on major characters and minor characters, and in
understanding how prose and verse forms impact gendered
speech in Shakespeare. We are also interested in how words
highly associated with a particular gender may also help with
plotting, dramatic tension, and closure. We will present our
findings at the conference.
2. Methodology
We applied text classification methods using machine
learning with feature sets described above, under the
umbrella of a well-tagged corpus. If reasonable classification
accuracy is achieved with these new sets of features, it will
show that Shakespeare used words differently for his male and
for his female characters. If this is the case, then examining the
most discriminating features should give some insight into how
gender stylistics flows into socio-linguistic and literary gender
construction.
2.1 Corpus Construction
We constructed a corpus of characters' speeches from 35
Shakespearean plays, collected from the
lexically-and-lemmatically-tagged Nameless Shakespeare. The
plays are in XML (Extensible Markup Language) format. To
import them into our system, we extracted the speeches and
the gender of each character automatically, cleaning the stage
directions. A text file for each character in each play was
constructed by concatenating all of that character’s speeches
in the play. We only considered characters with 200 or more
words. From that collection, all female characters were chosen.
Then we took the same number of male characters as female
characters from a play, restricted to those not longer than the
longest female character from that particular play. In this way,
we balanced the corpus for gender, giving a total of 101 female
characters and 101 male characters, with equal numbers of
males and females from each play (see Table 1). We balance
the corpus to avoid bias in the automated learning procedure;
this introduces other issues which we address below.
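The selection rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure actually used: the character names and word counts are invented, the function name `balance_play` is hypothetical, and preferring the longest eligible males is an assumption, since the text does not say how males were picked among the eligible candidates.

```python
MIN_WORDS = 200  # minimum speech length stated in the text


def balance_play(characters):
    """Pick a gender-balanced sample from one play.

    characters: list of (name, gender, word_count) tuples, gender "F" or "M".
    Returns (females, males) as lists of the same tuples.
    """
    eligible = [c for c in characters if c[2] >= MIN_WORDS]
    females = [c for c in eligible if c[1] == "F"]
    if not females:
        return [], []
    longest_female = max(c[2] for c in females)
    # Males longer than the longest female character are excluded.
    male_pool = [c for c in eligible
                 if c[1] == "M" and c[2] <= longest_female]
    # Assumption: prefer the longest remaining males; the text does not
    # specify how males were chosen among the eligible candidates.
    male_pool.sort(key=lambda c: c[2], reverse=True)
    # Note: if the pool holds fewer males than there are females, this
    # sketch returns an unbalanced sample; that case is not handled here.
    return females, male_pool[:len(females)]


# Invented example data for a hypothetical play.
play = [
    ("Duke", "M", 5200), ("Rosa", "F", 2400), ("Celia", "F", 900),
    ("Clown", "M", 1500), ("Lord", "M", 450), ("Steward", "M", 300),
    ("Page", "M", 120), ("Maid", "F", 150),
]
females, males = balance_play(play)
print([c[0] for c in females])
print([c[0] for c in males])
```

In this invented example the longest male part ("Duke") is dropped because it exceeds the longest female part, which mirrors the exclusion of major male characters noted under Limitations.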
2.2 Feature Extraction
We processed the text using the ATMan system, a text
processing system in Java that we have developed. The text is
tokenized and the system produces a sequence of tokens. Each
token corresponds to a word in the input text file. We used
lexical and lemma features in n-gram combinations with n = 1, 2,
and 3. In order to understand gendered language more
deeply, we extracted those n-grams most linked to gender
(Tables 2, 3). From the lexical entries we collected the 500 most
frequent words, 2670 bigrams, and 356 trigrams. In the same
way, 2001 unigrams, 2860 bigrams, and 571 trigrams were
collected from the lemmas. We calculated the frequencies of these
various features and computed their relative frequencies. The
list of feature sets with their counts is given in Tables 3-5.
2.3 Text Classification
The classification learning phase of this task is carried out by
Weka's (Frank & Witten 1999) implementation of Sequential
Minimal Optimization (Platt 1998) (SMO) using a linear kernel
and default parameters. The output of SMO is a model linearly
weighting the various features. Testing was done via 10-fold
cross-validation. With this methodology, each
character is tested exactly once, by a model whose training data
does not include it. Table 4 presents the results obtained by running
various experiments.
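The property that every character is held out of training when it is tested can be illustrated with a small Python sketch of a stratified 10-fold partition. This is not Weka's implementation: the round-robin fold assignment, the function name `stratified_folds`, and the fixed random seed are assumptions made for illustration, and the classifier training step is omitted.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each item index to exactly one of k folds.

    Items of each class are shuffled and dealt round-robin, so every
    fold holds a similar number of items from each class.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


# Balanced corpus as described in the text: 101 female, 101 male characters.
labels = ["F"] * 101 + ["M"] * 101
folds = stratified_folds(labels)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A linear classifier would be trained on train_idx and evaluated
    # on test_idx here; that step is left out of this sketch.
    assert held_out.isdisjoint(train_idx)
```

Because the folds partition the corpus, each character appears in the test set of exactly one fold and in the training set of the other nine.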
3. Results: Accuracy and Feature Analysis
Many feature combinations give classification accuracies
near or above 70%, well above the 50% expected from random
guessing on a balanced corpus. The highest accuracy of
all (80.69%) was attained using the 500 most frequent word
lemmas as features. Lexical features (surface tokens) also worked
well, with the combination of unigrams, bigrams, and trigrams
giving the highest lexical accuracy (75.74%). The accuracies are
reported in Table 4.
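As a rough consistency check, the reported percentages can be related to counts of correctly classified characters. The Python sketch below assumes that accuracy is the fraction of the 202 characters in the balanced corpus classified correctly across the cross-validation folds; the resulting counts are inferences from the reported percentages, not figures given in the text.

```python
TOTAL_CHARACTERS = 202  # 101 female + 101 male characters


def implied_correct(accuracy_pct, total=TOTAL_CHARACTERS):
    """Number of correct classifications implied by a percentage accuracy."""
    return round(accuracy_pct / 100 * total)


for pct in (80.69, 75.74):
    n = implied_correct(pct)
    # Print the reported accuracy, the implied count, and the accuracy
    # recomputed from that count.
    print(pct, n, round(100 * n / TOTAL_CHARACTERS, 2))
```

Under this assumption, 80.69% corresponds to 163 of 202 characters and 75.74% to 153 of 202.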
The feature analysis phase is carried out by taking the results
obtained from Weka’s implementation of SMO. SMO provides
weights to the features corresponding to both class labels. To
discriminate binary class labels, SMO uses positive and negative
weight values in a linear model. After sorting the features based
on their weights, we collected the top ten features indicative of
each gender. We have also computed the average value of each
feature for each gender (Tables 5-10). For reasons of space we
consider here just those feature sets giving the most insight (as
well as good classification accuracy). We also ranked features
using information gain (IG), defined as the expected reduction
in entropy caused by partitioning the training set according to
the attribute.
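Information gain as defined above can be written as IG(C, A) = H(C) - sum over v of P(A = v) H(C | A = v), where C is the gender label and A the attribute. The following Python sketch computes it for a discrete attribute. Since the features here are relative frequencies, some discretization is needed; the binarization into "present" (1) and "absent" (0) used in the example is an assumption, not something the text specifies, and the toy data are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Expected reduction in label entropy from partitioning on `values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Invented toy data: six characters, two binarized features.
genders = ["F", "F", "F", "M", "M", "M"]
feature_a = [1, 1, 1, 0, 0, 0]  # separates the two classes perfectly
feature_b = [1, 0, 1, 0, 1, 0]  # only weakly related to the class

ranking = sorted(
    {"feature_a": feature_a, "feature_b": feature_b}.items(),
    key=lambda item: information_gain(item[1], genders),
    reverse=True,
)
print([name for name, _ in ranking])
```

A feature that splits the characters cleanly by gender receives the maximum gain of one bit on a balanced sample, while a feature distributed independently of gender receives a gain of zero.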
Lemma Unigrams:
In Shakespeare, several meaningful clusters of words emerge.
Female lemmas indicate family relationships (‘husband’,
‘mother’, ‘court’), feelings (‘sick’, ‘merry’), emotional
interjections (‘alas’, ‘o’, ‘prithee’), and integration of personal
context (‘he’, ‘you’). Male features indicate concern with
quantification (‘three’) and social status (‘noble’, ‘solemn’,
‘savage’). Male lemmas also include some less-clearly
interpretable verb forms (‘begin’, ‘alight’, ‘beat’).
Lemma Trigrams:
More specific meaning patterns can be seen in lemma triples.
Female trigrams mostly indicate construal of self and others
(‘I/see/you’, ‘for/I/to’, ‘I/know/I’, ‘be/he/not’, ‘say/I/be’),
politeness (‘thank/you/for’), conditionals (‘if/he/have’), and
questions (‘who/be/that’). Male trigrams focus on assertions
(‘I/say/to’) mainly about personal/social status (‘but/I/be’,
‘be/a/very’, ‘be/a/ass’, ‘I/be/he’), possessions (‘have/no/more’,
‘I/have/lose’), and manner (‘the/manner/of’).
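Slash-separated lemma triples such as those quoted above are simply sequences of three consecutive lemmas. A minimal Python sketch of extracting them and computing relative frequencies follows. It is not the ATMan system: the function name `ngram_relative_freqs` is hypothetical, the lemma sequence is invented, and normalizing by the total number of n-grams in a character's text is an assumption, since the text does not state the denominator.

```python
from collections import Counter


def ngram_relative_freqs(tokens, n):
    """Map each slash-joined n-gram to its share of all n-grams in `tokens`."""
    grams = ["/".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


# Invented lemma sequence standing in for one character's speeches.
lemmas = ["I", "see", "you", "and", "I", "see", "you"]
freqs = ngram_relative_freqs(lemmas, 3)
for gram, value in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(gram, value)
```

The same function yields unigram and bigram features with n = 1 and n = 2; restricting the result to a fixed list of the most frequent n-grams in the corpus would give one feature vector per character.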
Analysis of Concordance Lines:
For both lexical and lemmatic trigrams, we contextualize usage
by examining concordance lines. For males, 'the name of' is
usually followed by 'truth', 'hero', 'love', 'justice', or 'whore', while
for females, this trigram is followed by 'wife', 'jesting', and
sometimes the name of a female character. Men use this trigram
to invoke overarching abstractions (or to insult women), while
women talk about “names of” in a less metaphorical and more
neutral fashion. For 'do you know', males tend to follow with
another question, but females do not. Strikingly, males use 'the
manner of' to set up dramatic contrasts or even shifts in
dramatic action, but female use of this trigram is entirely
unremarkable in this way.
4. Discussion
These findings capture word patterning in Shakespeare
inaccessible to non-computational methods of literary
analysis, because of the scale of data processing involved.
Literary scholars work almost exclusively with well-elaborated
methods of semantic analysis (New Criticism, structuralism,
and post-structuralism), developed with all the strengths and
limitations posed by a book-only, eye-centered,
subjectivity-dependent research context. In contrast, these
findings encourage comparisons between non-computational
and computational approaches. It is remarkable that these
findings support aspects of non-computational methodology
(words linked together in meaningful patterns like informational
discourse/male and involved discourse/female), while also
bringing to light new structural features of Shakespeare’s
discursive gender construction through language: part-of-speech
use and trigram combinations of words. These findings may, in
addition, capture creative and literary patterning in greater detail
than is possible with noncomputational literary methods alone.
Since Shakespeare’s plays depend greatly on gender-identified
characters, the words linked to gender most likely also serve
literary purposes. In the plays, heterosexual romance is linked
not only to characterization, but also to action, and in the case
of romantic comedies, to how the plays conclude. In the cases
of complex characterization, a character’s misuse of gendered
speech may also be central to how Shakespeare develops
dramatic action.
Limitations:
The editorial procedures for The Nameless Shakespeare are
sound and practical for the project’s purposes and for the work
of this paper, but they need to be read and understood fully:
both by literary scholars wanting to apply these findings to
particular words in particular plays, and by computational
scholars thinking through the problem of establishing textual
accuracy prior to inviting a wider community to conduct
searches. Also, with respect to Shakespeare’s literary art, the
findings here do not at this stage account for the impact of blank
verse dialogue (for high or elite characters) versus prose
dialogue (for low or common-born characters) on word choices
and the numbers of words. It may be that blank verse fosters
semantically significant trigram constructions because
Shakespeare needed short words to complete plays primarily
written in ten-syllable lines. But at least the question can be
asked, and the answer will tell us something about Shakespeare
both as dramatist and as poet. In addition, our work focuses on
the heteronormativity clearly present in Shakespeare’s plays, but
does not exclude nonheteronormative gender construction.
Finally and significantly, the gender-balancing used here had
the odd result of excluding from the corpus all major male
characters in Shakespeare, including every male character
named in a play’s title, because all these characters speak more
than 600 lines. All major female characters (including females
named in titles) are included. We now know that very long
speech length per play efficiently identifies characters as male,
but we also will test our findings on the males excluded so far
and report the results at the conference.
5. Conclusions
This is the first work, to our knowledge, to analyze
various textual features (lexical and lemma) collected
from a single source in order to understand literary character gender.
We see, as in our earlier work (Hota et al. 2006), that the male
and female language in Shakespeare’s characters is similar to
that found in modern texts by male and female authors
(Argamon et al. 2003). Here we also observed the importance
of trigrams for lexical and lemma features. Trigrams are few
in number, so they are information rich and computationally
efficient for identifying gender. The true import of the features
identified by this analysis needs to be confirmed by more
traditional digital humanities methods such as examining
concordance lines, to allow a more properly contextual
interpretation. In any case, we believe that this study shows
how classification learning can be used as a tool in developing
new ‘statistical’ interpretative methodologies for bodies of
literary works.
Acknowledgements:
Many thanks to Dr. Martin Mueller for providing us the
Nameless Shakespeare corpus and many helpful
comments on gender characterization in Shakespeare.
Bibliography
Argamon, Shlomo, Moshe Koppel, and Galit Avneri. "Routing
Documents According to Style." Proceedings of the First
International Workshop on Innovative Internet Information
Systems (IIIS-98). 1998.
Argamon, Shlomo, Moshe Koppel, Jonathan Fine, and Anat
Rachel Shimoni. "Gender, Genre, and Writing Style in Formal
Written Texts." Text 23.3 (2003): 321–346.
Corney, Malcolm, Olivier de Vel, Alison Anderson, and George
Mohay. "Gender Preferential Text Mining of E-mail Discourse."
Proceedings of 18th Annual Computer Security Applications
Conference ACSAC. 2002.
Hota, Sobhan, Shlomo Argamon, Moshe Koppel, and Iris
Zigdon. "Performing Gender: Automatic Stylistic Analysis of
Shakespeare's Characters." Digital Humanities 2006 Conference
Abstracts. Paris: CATI, Université Paris-Sorbonne, 2006.
100-106.
Joachims, Thorsten. "Text Categorization with Support Vector
Machines: Learning with Many Relevant Features." ECML-98,
Tenth European Conference on Machine Learning. 1998.
Koppel, Moshe, Shlomo Argamon, and Anat Rachel Shimoni.
"Automatically Categorizing Written Texts by Author Gender."
Literary & Linguistic Computing 17.4 (2002): 401-12.
Mueller, Martin. "The Nameless Shakespeare." TEXT
Technology 14.1 (2005): 61-70.
<http://texttechnology.mcmaster.ca/pdf/vol14_1_06.pdf>
Platt, J. Sequential Minimal Optimization: A Fast Algorithm
for Training Support Vector Machines. Microsoft Research
Technical Report MSR-TR-98-14. 1998.
Witten, Ian, and Eibe Frank. Weka3: Data Mining Software in
Java. <http://www.cs.waikato.ac.nz/ml/weka/>


Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Series: ADHO (2)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None