Conjecture Generation in the Digital Humanities

paper
Authorship
  1. Patrick Juola

    Duquesne University

  2. Ashley Bernola

    Duquesne University

Work text

Digital scholarship has been very helpful in the development
of humanities research (Juola, 2008a),
primarily by automating processes such as communication,
text processing, and search and permitting scholars
to concentrate on analysis and explanation. However,
when computers attempt to generate meaning, to perform
analysis themselves, the results are usually less
than satisfactory and don’t actually explain much.
An example of analytic failure can be seen in the reception
of “nontraditional” (i.e. statistical, computer-aided)
authorship attribution. It is now unquestionable that
computers can infer authorship attributes with high accuracy
(see Juola, 2008b), but the accurate inference
processes tend not to inform us about the actual authors
(Craig, 1999). Argamon (2006) has provided a theoretical
analysis of one particular method, but in the unfamiliar
and “inhuman” language of statistics, which again
sheds little light on authorial language and authorial
thought. By contrast, studies of gender differences in
language (e.g., Coates, 2004) offer not only lists of differences,
but explanations in terms of the social environment.
In fact, the interesting part of scholarship is not in
mere observation, but in the refinement and explanation.
This suggests a relatively novel model for computer/human
interaction in scholarship, one in which the computer
is used to identify patterns that are passed to human
experts for validation and explanation. While not widely
used, this model has been successfully applied in mathematics
by the Graffiti program (Fajtlowicz, 1986). This
program generates conjectures (randomly) of the form X
< Y (or similar forms such as X < Y + Z) where X, Y, and
Z are “graph invariants,” simple numeric properties of
graphs such as average number of edges per node, number
of colors necessary to color the graph like a map, average
distance between nodes, number of different paths
between nodes, and so forth. Graffiti then compares this
conjecture against a library of graphs to see if it is true
for all the graphs in the library. Of course, “true for all
the graphs in the library” is not proof of universal truth,
but it provides evidence in support. If the conjecture
is thus supported, Graffiti publishes the conjecture, and
professional mathematicians are invited to prove (or disprove)
it. Since inception, Graffiti has developed and
published more than 1000 conjectures and inspired more
than 100 papers.
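
The generate-and-test loop described above can be sketched roughly as follows. This is a minimal illustration, not Graffiti's actual implementation: the four invariants, the toy three-graph library, the use of a non-strict comparison (X <= Y), and all function names are assumptions made for the example.

```python
import random

# Each graph is an adjacency dict: node -> set of neighbouring nodes.
def edge_count(graph):
    return sum(len(nbrs) for nbrs in graph.values()) // 2

def node_count(graph):
    return len(graph)

def max_degree(graph):
    return max(len(nbrs) for nbrs in graph.values())

def average_degree(graph):
    return 2 * edge_count(graph) / node_count(graph)

# Illustrative invariants only; the real program draws on a far larger set.
INVARIANTS = {
    "average_degree": average_degree,
    "max_degree": max_degree,
    "node_count": node_count,
    "edge_count": edge_count,
}

def holds_for_library(x_name, y_name, library):
    """True if X <= Y for every graph in the library (evidence, not proof)."""
    x, y = INVARIANTS[x_name], INVARIANTS[y_name]
    return all(x(g) <= y(g) for g in library)

def generate_conjectures(library, rng, attempts=20):
    """Draw random ordered pairs of invariants; keep those the library supports."""
    supported = set()
    names = sorted(INVARIANTS)
    for _ in range(attempts):
        x_name, y_name = rng.sample(names, 2)
        if holds_for_library(x_name, y_name, library):
            supported.add((x_name, y_name))
    return supported

# Toy library: a triangle, a path on three nodes, and a star with three leaves.
TRIANGLE = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
PATH = {0: {1}, 1: {0, 2}, 2: {1}}
STAR = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
LIBRARY = [TRIANGLE, PATH, STAR]

if __name__ == "__main__":
    for x_name, y_name in sorted(generate_conjectures(LIBRARY, random.Random(0))):
        print(f"Conjecture: {x_name} <= {y_name}")
```

A conjecture that survives the library check is only supported, not proven; in the model described here, proof or refutation is left to human mathematicians.
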
This model can easily be applied to the humanities.
Application in the field of text analysis is straightforward;
we need analogues to “graph invariants.” Such
“text invariants” might include the frequency of specific
words, phrases, or structures in particular text types. A
simple conjecture might be that “color adjectives” are
more common than “size adjectives,” or that “verbs of
motion” are more common in male-written novels than
in female-written poetry. Of course, text invariants are
not limited to token frequency analysis; any “property”
that can be assessed via computer analysis would be a
possibility. If one can find a way to determine a text’s
eroticism, degree of animacy, personification, etc., any
of those would be potential features.
In addition to text invariants, we need a framework for
conjectures (in analogy to the X < Y + Z framework described
above). Simple comparisons are likely to find
uninteresting conjectures (prepositions are more common
than proper nouns). A more interesting conjecture
(in the authors’ opinion) would be something like
“color adjectives are more common than size adjectives
in women’s writing, but the reverse holds in men’s writing.”
Such a finding would clearly show a relationship,
as yet unexplained, between adjective choice and gender.
If this were true—why? It obviously says something
about gender and culture, but what? Here is where
a traditional humanist could take advantage of the ability
of a computer to read and analyze a huge amount of data
very quickly. In general, conjectures of the form “X > Y
in texts of category A, but Y > X in texts of category B”
(where A and B are non-overlapping categories, ideally
pulled from the document’s metadata) are likely to be of
interest to category A/B specialists, especially if X and
Y are themselves interesting natural properties. Another
framework would be that “X is more common in A-texts
than B-texts,” and it is this framework that we use in our
prototype.
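
As a rough sketch of the crossover framework (“X > Y in texts of category A, but Y > X in texts of category B”), the check below scans all pairs of features for a reversal between two categories. The feature names and the rates are invented purely for illustration and are not results from the paper; the function name is likewise hypothetical.

```python
from itertools import combinations

def crossover_conjectures(rates_a, rates_b):
    """Return (x, y) pairs where x > y in category A but y > x in category B.

    Both arguments map a feature name to its rate in that category and are
    assumed to share the same keys.
    """
    found = []
    for x, y in combinations(sorted(rates_a), 2):
        # Test the pair in both orders, since either feature may lead in A.
        for first, second in ((x, y), (y, x)):
            if rates_a[first] > rates_a[second] and rates_b[second] > rates_b[first]:
                found.append((first, second))
    return found

# Hypothetical rates per 1,000 words, made up for this example.
RATES_A = {"color_adjectives": 4.0, "size_adjectives": 3.0, "motion_verbs": 5.0}
RATES_B = {"color_adjectives": 2.5, "size_adjectives": 3.5, "motion_verbs": 6.0}

if __name__ == "__main__":
    for first, second in crossover_conjectures(RATES_A, RATES_B):
        print(f"{first} > {second} in A, but {second} > {first} in B")
```

The simpler framework used in the prototype (“X is more common in A-texts than B-texts”) reduces to a single comparison of one feature across the two categories.
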
To illustrate this, we have built a simple version of this
conjecture generator (“conjecturator”) using standard
Java technology, much of it drawn from the JGAAP
project (Juola, 2007). The Moby Thesaurus II lists
more than 30,000 different synonym sets: for example,
the word group “raft” (as in “a raft of money”) includes
words/phrases such as “barge,” “boat,” “pile,” “pot,” and
“quite a little.” The word group “take back” includes
terms like “abjure,” “apologize,” “renege,” “disown,” and
“nullify.” We have also collected eight separate (English)
translations of the Bible ranging from the Authorized
(King James) Version to Revised Standard Version
and the Bible in Basic English. Our program selects one
synonym set and two Bible versions (at random), then
counts every appearance of each word token listed in the
synonym set. Our conjectures are therefore of the form:
“Words in <this category> appear at least 50%
more frequently in <this> Bible translation than in
<that> one.”
Our prototype strips punctuation and case differences,
but does not perform morphological analysis or even
word sense disambiguation. Despite this limitation,
simple word-counting reveals that the word group “take
back” appears approximately twice as often in the RSV
as it does in Young’s Literal Translation. We have similarly
found that the word group “rhythmical” occurs
substantially more often in the KJV than in Darby’s
Translation. We have at this writing no explanation, but
offer them (along with many other findings) to interested
Biblical scholars as a potentially unexplored facet of
the differences among different translations. (A list of
conjectures will be available both electronically and at
the conference—about 40% of conjectures appear to be
valid, a percentage we find surprising.)
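
A minimal sketch of this kind of word-group comparison is given below. It is not the authors' Java code. It assumes that frequencies are normalized by text length (the abstract does not say whether raw or relative counts are compared), it handles only single-word entries of a synonym set, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and strip punctuation; no stemming or sense disambiguation."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def group_rate(text, word_group):
    """Occurrences of any word in word_group, per token of text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in word_group)
    return hits / len(tokens)

def conjecture_holds(text_a, text_b, word_group, ratio=1.5):
    """True if the group is at least `ratio` times as frequent in A as in B.

    ratio=1.5 corresponds to "at least 50% more frequently".
    """
    rate_a = group_rate(text_a, word_group)
    rate_b = group_rate(text_b, word_group)
    return rate_a > 0 and rate_a >= ratio * rate_b
```

In use, `text_a` and `text_b` would be the full texts of two translations and `word_group` one synonym set from the thesaurus. Multi-word entries such as “quite a little” would need phrase matching, which this sketch omits.
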
Testing these conjectures has been relatively easy (if
time-consuming): we simply attach the program to a
large database (in this case, of Bibles) and allow it to
sample from the database until it has either confirmed or
rejected the hypothesis to its satisfaction. (For example,
there does not appear to be a significant difference in
the word group “unauthorized” between the RSV and the
American Standard Version.) We can extend such a program
even to help solve the “how do you read a million
books” problem, since the program could not only do
the bulk of the initial reading to see if the hypothesis is
true in the first place, but would automatically generate
a reading list for scholars interested in following up on
the conjecture. (At a second per book, a computer could
analyze all million texts and deliver a list of how each
work fared vs. the conjecture in less than two weeks.
By contrast, a human closely reading one book per day
would require roughly 2,700 years to read a million books.) Even
our prototype system can examine many more categories
and hypotheses than even the most avid and interested
human reader—in its first 24 hours alone, it found more
than 200 possibly interesting differences between Bible
versions.
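
The arithmetic behind the million-book comparison can be checked directly. The sketch below assumes exactly one second per book for the machine, one book per day without breaks for the human reader, and a 365.25-day year; the function names are chosen for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BOOKS = 1_000_000

def machine_days(books, seconds_per_book=1.0):
    """Days needed to process `books` at a fixed number of seconds per book."""
    return books * seconds_per_book / SECONDS_PER_DAY

def human_years(books, books_per_day=1.0, days_per_year=365.25):
    """Years needed to read `books` at a fixed number of books per day."""
    return books / books_per_day / days_per_year

print(f"machine: {machine_days(BOOKS):.1f} days")   # machine: 11.6 days
print(f"human:   {human_years(BOOKS):.0f} years")   # human:   2738 years
```
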
We have since extended the Conjecturator
to include multiple documents and a more robust
form of statistical analysis. Using a collection of more
than 100 Victorian novels (courtesy of David Hoover,
NYU), we now observe mean word usage within a group
such as Bildungsromane or Gothic novels, compute variance
and t-statistics, and accept a conjecture if the computed
p-value is sufficiently low (in either direction).
Results of this further experiment will also be presented.
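
One way such a group comparison might look is sketched below. The abstract does not say which form of t-test is used, so this example assumes Welch's unequal-variance t-statistic. It also approximates the two-sided p-value with a normal distribution, which is rough for small groups, and the threshold alpha=0.05 is an assumption standing in for “sufficiently low.”

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom.

    Each sample is a list of per-novel usage rates for one group; each group
    needs at least two novels and the groups must not both have zero variance.
    """
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (an assumption here)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

def accept_conjecture(sample_a, sample_b, alpha=0.05):
    """Accept if the difference is significant in either direction."""
    t, _ = welch_t(sample_a, sample_b)
    return approx_two_sided_p(t) < alpha
```

For small groups, a t-distribution with the returned degrees of freedom gives a more accurate p-value than the normal approximation used here.
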
Although our prototype is limited to text analysis, the
possibility of automatic conjecture generation may extend
further. A large and rich database of GIS and/or
census information may be able to support, for example,
conjectures of the form “<Object A> is more common
in <Environment X> than <Environment Y>.”
An example of such a conjecture would be a relationship
previously unimagined between the number of veterinarians
and Methodist churches in coastal counties.
What are the benefits of such a program? This conjecture
generator can deliver a set of (partially) validated
observations about easily observable, superficial properties
of the texts in the library (or points in the database
more generally defined). By construction, all published
conjectures are more or less guaranteed to describe
something true, at least about the library. These partial
truths, to humanists interested in the study of the library,
may represent insights that they have not considered
regarding the particular hypothesis under study. Indeed, the
scholars may lack the time to familiarize themselves
with every volume in the library, and may not even be
“digital” enough to understand the computer analysis,
but may be interested enough to seek out the new
material that they now know is there.
At the simplest possible level, 1000 validated conjectures
are 1000 topics for student projects, research papers,
or Ph.D. theses, a partial solution to the “I need to
do a term paper but don’t know what I want to do it on”
question that plagues all supervisors.
More generally, however, this program would also allow
humanists to concentrate their efforts on what is generally
the most interesting and rewarding part of humanities
research: the search for an explanatory theory of human
behavior. Given a list of statements that are probably
true, scholars can devote their attention
to producing statements and theories that are genuinely
meaningful.


Conference Info

Complete

ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

Series: ADHO (4)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None