Text analysis of large corpora using High Throughput Computing

  1. Mark Hedges, King's College London
  2. Tobias Blanke, King's College London
  3. Gerhard Brey, King's College London
  4. Richard Palmer, King's College London


The recent initiative of the US NEH on supercomputing
[1] is just one sign that there is a growing interest
in the use of highly parallel processing in the humanities.
This comes as no surprise if one considers that all over
the world, governments and funding bodies are investing
heavily in digitization of cultural heritage and humanities
research resources. At the same time, the 21st century
sciences are demanding an infrastructure to support
their advanced computational needs; the computational
infrastructure to distribute and process the results from
the Large Hadron Collider is just one example. Humanities
researchers have therefore begun to investigate these
infrastructures to find out whether they can be used to
help analyse the extensive, newly available online resources.
We offer an example of such an infrastructure based on
High Throughput Computing (HTC). HTC differs from
High Performance Computing (HPC), in that the latter
relies on hardware specifically designed with performance
in mind, whereas the former typically uses multiple
instances of more standard computers to accomplish
a single computational task. The de facto standard HTC
implementation is the Condor Toolkit, developed by the
University of Wisconsin-Madison (http://www.cs.wisc.
edu/condor). Condor can integrate both dedicated computational
clusters and standard desktop machines into
one computational resource.
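The project's actual job setup is not described here, but the Condor model can be illustrated with a minimal, hypothetical submit description: one description file spawns many independent jobs, each identified by the $(Process) macro, which Condor schedules onto whatever machines in the pool are idle (the file and script names below are illustrative only):

```
# compare.sub -- hypothetical Condor submit description (sketch)
# Spawns 100 independent jobs; each receives its job index $(Process)
# as an argument and processes one chunk of the workload.
universe   = vanilla
executable = compare_chunk.sh
arguments  = $(Process)
output     = out/chunk_$(Process).out
error      = out/chunk_$(Process).err
log        = compare.log
queue 100
```

Because each job is independent, desktop machines can join or leave the pool without coordinating with the other jobs, which is what makes cycle-scavenging on standard hardware feasible.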
We will present the work of the UK HiTHeR project (http://hither.cerch.kcl.ac.uk/), which created a prototype
of such an infrastructure to demonstrate the utility
of HTC methods to textual scholars, and indeed to humanities
researchers in general. It did this by using Condor
to set up a Campus Grid, which may be defined as
an environment that utilises existing computational and
data resources within a university or other research institution.
The project used the resulting infrastructure to
produce a case study based on a high-profile digitisation
project in the UK, addressing questions of textual analysis
that would not be feasible without this infrastructure.
At the same time, we will show how the application can
be integrated into the web publication of text-based resources.
After demonstrating the power of HTC for standard text
processing in the digital humanities, the main aim of the
project is to show how digital humanities centres can be
served by implementing their own local research infrastructure,
which they can relatively easily build using
existing resources such as standard desktop networks. There
have been many experiments in the digital humanities
using dedicated HPC facilities, but far fewer on the application
of these relatively lightweight computational infrastructures.
We demonstrate the feasibility of such infrastructures
and evaluate their utility for the particular task of
textual analysis.
Only with such local infrastructures will it be possible
to fulfil the often expressed demand of textual studies
researchers to be able to experiment with the statistical
methods of textual analysis rather than to be simply confronted
with the results. Faced with the opportunities of
HPC and HTC, these researchers frequently express the
desire to transform the underlying statistical algorithms
‘interactively’ by changing parameters and constraints,
and in this way to follow their particular interests by experimenting
with the outcome of the analysis and thus
gaining better insights into the structures of the text.
If the humanities researcher has to go to dedicated supercomputing
centres, such an approach is more difficult to
maintain, as it will depend on the relationship with that
supercomputing centre. HiTHeR thus has two research
goals: (1) to carry out textual analysis in a parallel computing
environment and (2) to investigate new types of
e-Infrastructures for supporting the work of digital humanities
researchers.
HiTHeR infrastructure - Case Study
Automatic textual processing is relatively well researched
and can rely on a large range of specialised algorithms
and data structures to process textual data. In
the digital humanities, many of these algorithms have
been tried on complex historical or linguistic collections.
Language modelling, vector space analysis, support
vector machines or LSI are only some of the machine
learning approaches that have attracted growing
interest in the digital humanities. In the HiTHeR project,
we focused on relatively simple processing, which
nevertheless has proven to be highly useful to many humanities
research institutions. A recent study of the ICT
needs of humanities researchers [2] found that text
analysis tools and services still generate the most interest
among humanities researchers. Among these text
analysis tools, the comparison of two or more texts was
seen as the most useful. Such tools help with many
textual studies activities from the comparison of different
versions of texts to finding texts about the same topic
in large textual collections. Such comparisons may rely
on stable algorithms but are often costly in terms of the
computational resources needed, as each document in a
collection needs to be compared to all other documents
in the collection. Also, the digital textual resources processed
are often ‘dirty’, containing a high proportion of
transcription errors, because of the problems of digitising
older, irregular print. This leads to further increases
in computational size and complexity, as more advanced
methods are needed to reduce the “noise” from the OCR
processes. Overcoming the complexities of machine-based
learning in the humanities was therefore recognised
quite early as a use case for an advanced computational
infrastructure.
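The quadratic cost of all-against-all comparison can be made concrete: comparing each of n documents with every other requires n(n-1)/2 similarity computations. A quick sketch (using the approximate NCSE corpus size quoted below) shows why a single machine is not enough:

```python
# Number of unordered pairs in an all-against-all document comparison.
def pair_count(n: int) -> int:
    """n documents compared pairwise -> n*(n-1)/2 comparisons."""
    return n * (n - 1) // 2

n_articles = 430_000  # approximate size of the NCSE corpus
print(f"{pair_count(n_articles):,} pairwise comparisons")  # ~9.2e10
```

At roughly 9.2 x 10^10 pairs, even a fast per-pair similarity computation multiplies out to the kind of running time reported below for the stand-alone server.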
The corpus used for the HiTHeR case study is the Nineteenth
Century Serials Edition (http://www.ncse.ac.uk/),
which contains circa 430,000 articles that originally
appeared in roughly 3,500 issues of six 19th Century
periodicals. Published over a span of 84 years, materials
within the corpus exist in numbered editions, and
include supplements, wrapper materials and visual elements.
Currently, the corpus is explored by means of
a keyword classification, derived by a combination of
manual and automated techniques. A key challenge in
creating a digital system for managing such a corpus is to
develop appropriate and innovative tools that will assist
scholars in finding materials that support their research,
while at the same time stimulating and enabling innovative
approaches to the material. One goal would be to
create a “semantic view” that would allow users of the
resource to find information more intuitively. However,
the advanced automated methods that could help to create
such a semantic view require greater processing power
than is available in standard environments. Prior to the
current case study, we were using a simple document
similarity index to allow journals of similar contents to
be represented next to each other. The program used the
lingpipe (http://alias-i.com/lingpipe/) software to calculate
similarity measures for articles based on frequencies of character n-grams within the corpus. A character
n-gram is any contiguous sequence of n characters.
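The project used LingPipe's n-gram similarity measures; purely as an illustration of the underlying idea (not of LingPipe's actual implementation), a character n-gram cosine similarity can be sketched as follows:

```python
# Sketch of character n-gram document similarity (illustrative only;
# the HiTHeR project itself used LingPipe's similarity measures).
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Frequency counts of all character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram frequency vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = char_ngrams("the english woman's journal")
doc2 = char_ngrams("the journal of english women")
print(round(cosine_similarity(doc1, doc2), 3))
```

Character n-grams are attractive for 'dirty' OCR text because a single transcription error perturbs only the few n-grams that overlap it, leaving the rest of the frequency vector intact.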
Initial benchmarks on a stand-alone server allowed us to
conclude that, assuming the test set was representative,
a complete set of comparisons for the corpus would take
more than 1,000 years. Consequently, we ran a sequence
of systematic experiments, carrying out different text
analysis of the selected corpus, to provide benchmarks
for the throughput improvements provided by the grid
environment. Detailed results will be presented at the
conference.
In the experiments, we have set up an institutional CampusGrid
using Condor at King’s College London on
spare servers and desktops (in use during the day) within
two departments. No new hardware had to be bought. We
then ran several text mining algorithms, adapted locally
so that parts of the code could run in parallel, on a subset
of the data (the "English Women's Journal", which has the
highest OCR quality). This reduced the processing time
from a few days to a few hours.
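The paper does not detail how the code was parallelised, but one natural decomposition, since every pairwise comparison is independent, is to partition the set of index pairs into fixed-size chunks and hand each chunk to one Condor job. A hypothetical sketch of that partitioning:

```python
# Sketch: split an all-against-all comparison into independent chunks,
# one per Condor job (chunk size and job mapping are illustrative).
from itertools import combinations
from typing import Iterator, List, Tuple

def pair_chunks(n_docs: int, chunk_size: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield lists of (i, j) document-index pairs, one list per job."""
    chunk: List[Tuple[int, int]] = []
    for pair in combinations(range(n_docs), 2):
        chunk.append(pair)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly short, chunk
        yield chunk

# e.g. 6 documents -> 15 pairs, split into jobs of at most 4 pairs
jobs = list(pair_chunks(6, 4))
print(len(jobs), sum(len(j) for j in jobs))
```

Because the chunks share nothing but the read-only corpus, they map directly onto the independent jobs of a Condor submit description, which is what turns a serial thousand-year estimate into work that scales with the number of available machines.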
To conclude: One driver of the project is the NCSE corpus,
for which the project addresses a genuine research
need to be able to create new semantic views on textual
resources automatically. But, more widely than this, we
see the project as an opportunity to start building the e-infrastructure
required to support humanities research
that has complex (or simply large) computational requirements.
[1] National Endowment for the Humanities (NEH) Digital
Humanities Initiative, Workshop on Supercomputing
& the Humanities (July 11, 2007), http://www.neh.gov/
[2] Toms, Elaine and O'Brien, Heather L.: Understanding
the information and communication technology
needs of the e-humanist, Journal of Documentation, vol.
64, 2008.


Conference Info


ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

  • Language: English