Corpus Analysis and Literary History

paper
Authorship
  1. Matthew Wilkens

    Rice University

Work text

The problem of periodization has long occupied literary
studies. Our ability to distinguish the cultural
and aesthetic production of one era from that of another
is a basic assumption of our historicizing critical method.
Moreover, there exists a broad consensus concerning
the general arc of literary history and its major moments.
As a practical matter, however, we often find such periods
to be both less mutually distinct and less internally
uniform than we have been led to believe. When, for
instance, does modernism begin and end? And what, exactly,
do Proust, Joyce, Woolf, and Hemingway have in
common, to say nothing of mass-market detective and
shopgirl fiction from the same era?
Such uncertainty is neither insurmountable nor even especially
problematic, but it does emphasize the centrality
and the limitations of both close reading and theorization
as the working methods of literary study. Because we
can read only a finite (and quite small) number of texts,
the specific texts that we do manage to read will have a
disproportionate influence on our understanding of the
larger field of cultural production we understand them
to represent. Theorization is then frequently dedicated
to working back out or up from this restricted corpus
of widely read material to the larger set of social and economic
arrangements that must have been in place so that
work of just such a type could have been produced.
There’s nothing wrong with this approach, but it would
be useful to have relative measures of the extent to which
the periods and associated genres we now commonly
identify in literary history apply to the full field of literary
production over the last several centuries. That is,
we would like to be able to answer questions about the
coherence of literary periods and genres, their internal
variation, their distinction from one another, the sharpness
of the breaks between them, and the mechanics of
the transitions between them, and we would like to be
able to do all this with reference not just to a restricted
corpus or canon, but to the broadest possible survey of
texts. We would also like to know whether or not there
exist discernibly coherent periods, genres, or geographic
regions that we have not yet identified, or if our current
historical/geographic/generic boundaries are the best
possible ones.
The work presented in this paper describes the early
steps and initial results of a long-term project designed
to address these issues. Building on insights and tools
pioneered in corpus linguistics (see especially the work
of Mark Davies at BYU), but aiming squarely at questions
of literary history and theory, this work begins with
the construction of a significant corpus of public-domain
literary texts spanning the sixteenth to twentieth centuries
from the Gutenberg archive. Gutenberg, of course,
is not a typical scholarly resource, but it has the advantage
of being large and unencumbered by intellectual
property restrictions; one of the intermediate products of
the work presented here is an evaluation of its relative
merits and suitability in comparison to the smaller but
fully curated Wright American fiction and Chadwyck-Healey
nineteenth-century fiction collections from the
MONK Project. The paper describes the techniques used
to construct and to characterize this corpus, as well as
the quantitative results of this analysis. It is thus the first
full, computationally assisted description and evaluation
of the Gutenberg English-language fiction holdings as a
resource for digital literary studies.
The paper also advances a set of conclusions based on
these results and in dialogue with existing theoretical
work on literary-cultural periodization. There is reason
to believe that the comparatively brief periods of rapid
change between more stable literary eras should be
marked by increased incidence of allegorical and tropological
language use (this fact of course superimposed
on long-term baseline changes in, for example, the use of metaphor
generally). Although it is as yet difficult to measure such
features directly (but see interestingly related attempts
such as Pasanek and Sculley, “Mining millions of metaphors,”
LLC 23 [2008]: 345-60; Bei Yu’s 2006 dissertation
on literary text mining; and Matthew Jockers and
Franco Moretti’s as-yet unpublished work on automatic
genre classification), the present analysis suggests that
texts drawn from such transitional periods are relatively
poor in adjectives and adverbs, a fact that may correspond
to their increased reliance on tropes (the expressive
ability of which is diminished by increased specificity).
By examining the historical variations of such
features, we may begin to reshape our understanding of
periodization and its mechanisms.
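
The kind of measurement described above, the share of adjectives and adverbs in texts from a given period, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which tagger, tagset, or unit of aggregation was used, so the Penn Treebank-style tags, the pooling by decade, and the names `modifier_share` and `share_by_decade` are all hypothetical, and the sample input is a hand-tagged toy, not data from the corpus.

```python
from collections import defaultdict

# Assumption: Penn Treebank-style tags, where JJ* marks adjectives
# and RB* marks adverbs (wh-adverbs, tagged WRB, are not counted).
MODIFIER_PREFIXES = ("JJ", "RB")


def modifier_share(tagged_tokens):
    """Return the fraction of word tokens tagged as adjective or adverb."""
    total = 0
    modifiers = 0
    for word, tag in tagged_tokens:
        if not any(ch.isalnum() for ch in word):
            continue  # skip punctuation-only tokens
        total += 1
        if tag.startswith(MODIFIER_PREFIXES):
            modifiers += 1
    return modifiers / total if total else 0.0


def share_by_decade(texts):
    """Pool tagged tokens by decade of publication and compute each share.

    texts: iterable of (year, tagged_tokens) pairs.
    """
    pooled = defaultdict(list)
    for year, tagged in texts:
        pooled[year // 10 * 10].extend(tagged)
    return {decade: modifier_share(tokens)
            for decade, tokens in sorted(pooled.items())}


# Toy, hand-tagged input for illustration only.
sample = [
    (1851, [("The", "DT"), ("old", "JJ"), ("ship", "NN"),
            ("sailed", "VBD"), ("slowly", "RB"), (".", ".")]),
    (1922, [("He", "PRP"), ("read", "VBD"), ("the", "DT"),
            ("book", "NN"), (".", ".")]),
]
print(share_by_decade(sample))
```

Pooling tokens by decade weights long texts more heavily than short ones; averaging per-text shares instead would be an equally defensible choice, and the two can diverge on an uneven corpus.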


Conference Info

Complete

ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None