The Good, the Bad and the Ugly: Corralling an Okay Text Corpus from a Whole Heap o' Sources

Glen Worthey

An increasingly important variety of digital humanities activity
has recently arisen among Stanford’s computing humanists, an
approach to textual analysis that Franco Moretti has called
“distant reading” and Matthew Jockers has begun to formally
develop under the name of macro-analysis. At its simplest,
the approach involves making literary-historical arguments
based on statistical evidence from “comprehensive” text
corpora. One of the peculiarities of this activity is evident from
the very beginning of any foray into it: it nearly always requires
the creation of “comprehensive” literary corpora. This paper
discusses the roles and challenges of the digital librarian in
supporting this type of research with specific emphasis on the
curation (creation, licensing, gathering, and manipulation) of a
statistically significant corpus.
Because the arguments at stake are largely statistical, both the
size of the “population” of texts under examination, and any
inherent bias in the composition of the corpus, are potentially
significant factors. Although the number of available electronic
texts from which to draw a research corpus is increasing
rapidly, thanks both to mass digitization efforts and to
expanding commercial offerings of digital texts, the variability
of these texts is likewise increasing: no longer can we count
on working only with well-behaved (or even well-formed)
TEI texts of canonical literary works; no longer can we count
on the relatively error-free re-keyed, proofread, marked-up
editions for which the humanities computing community has
spent so much effort creating standards and textual corpora
that abide by these standards.
In addition to full texts with XML (or SGML) markup that
come to us from the model editions and exemplary projects
that we have come to love (such as the Brown Women Writers
Project), and the high-quality licensed resources of similar
characteristics (such as Chadwyck-Healey collections), we are
now both blessed and cursed with the fruits of
mass digitization (both by our own hands and by commercial
and consortial efforts such as the Google Library Project and
Open Content Alliance), which come, for the most part, as
uncorrected (or very lightly corrected) OCR. Schreibman,
et al., 2008, have described University of Maryland efforts to
combine disparate sources into a single archive. And a similar
effort to incorporate “messy data” (from Usenet texts) into
usable corpora has been described in Hoffmann, 2007; there is
also, of course, a rich literature in general corpus design. This
paper will discuss the application of lessons from these other
corpus-building projects to our own inclusion of an even more
variant set of sources into a corpus suitable for this particular
flavor of literary study.
This digital librarian’s particular sympathies, like perhaps those
of many or most who participate in ADHO meetings, lie clearly
with what we might now call “good,” “old-fashioned,” TEI-based,
full-text digital collections. We love the perfectionism of
“true” full text, the intensity and thoughtfulness and possibility
provided by rich markup, the care and reverence for the text
that have always been hallmarks of scholarly editing, and that
were embraced by the TEI community even for less formal
digital editions.
Now, though, in the days of “more” and “faster” digital text
production, we are forced to admit that most of these
less-than-perfect full-text collections and products offer significant
advantages of scale and scope. While one certainly hesitates to
claim that quantity actually trumps quality, it may be true that
quantity can be transformed into quality of a sort different
from what we’re used to striving for. At the same time, the
“macro-analysis” and “distant reading” approaches demand
corpora on scales that we in the “intensely curated” digital
library had not really anticipated before. As in economics, it is
perhaps difficult to say whether increased supply has influenced
demand or vice versa, but regardless, we are certainly observing
something of a confluence between mass approaches to the
study of literary history and mass digitization of its textual
evidence.
This paper examines, in a few case studies, the gathering
and manipulation of existing digital library sources, and the
creation of new digital texts to fill in gaps, with the aim of
creating large corpora for several “distant reading” projects at
Stanford. I will discuss the assessment of project requirements;
the identification of potential sources; licensing issues; sharing
of resources with other institutions; and the more technical
issues around determining and obtaining “good-enough” text
accuracy; “rich-enough” markup creation, transformation and
normalization; and provision of access to the corpora in ways
that don’t threaten more everyday access to our digital library
resources. Finally, I’ll discuss some of the issues involved in
presenting, archiving, and preserving the resources we’ve
created for “special” projects so that they both fit appropriately
into our “general” digital library, and will be useful for future
research.
References
Sebastian Hoffmann: Processing Internet-derived Text—
Creating a Corpus of Usenet Messages. Literary and
Linguistic Computing Advance Access published on June
1, 2007, DOI 10.1093/llc/fqm002. Literary and Linguistic
Computing 22: 151-165.
Susan Schreibman, Jennifer O’Brien Roper, and Gretchen
Gueguen: Cross-collection Searching: A Pandora’s Box or
the Holy Grail? Literary and Linguistic Computing Advance
Access published on April 1, 2008, DOI 10.1093/llc/fqm039.
Literary and Linguistic Computing 23: 13-25.

Full text license: This text is republished here with permission from the original rights holder.

ADHO - 2008