Text Mining in the Digital Humanities

workshop / tutorial
  1. Gerhard Heyer

    Institute of Mathematics and Computer Science - Universität Leipzig (Leipzig University), Natural Language Processing (NLP) Group - Universität Leipzig (Leipzig University)

  2. Marco Büchler

    Institute of Computer Science - Universität Leipzig (Leipzig University), Natural Language Processing (NLP) Group - Universität Leipzig (Leipzig University)

  3. Thomas Eckart

    Institute of Computer Science - Universität Leipzig (Leipzig University), Natural Language Processing (NLP) Group - Universität Leipzig (Leipzig University)

  4. Charlotte Schubert

    Ancient History Group - Universität Leipzig (Leipzig University), Department of History - Universität Leipzig (Leipzig University), Faculty for History, Art, and Oriental Studies - Universität Leipzig (Leipzig University)


Thinking about text mining and its scope in the
Digital Humanities, a comparison between the
theory-based work of the Humanities and the
model-driven approaches of Computer Science can
highlight the decisive differences. Classicists rely
primarily on manual work, e.g. using a search engine
that finds only what was requested and skips
non-requested but nonetheless interesting results,
whereas an objective model can be applied to the
whole text and comes closer to completeness. Yet
even though the model's results do not depend on
what the researcher does, their quality is typically
worse than that of manual work. The workshop
therefore combines the quality of manual work with
the objectivity of a model.
The workshop contains four sessions of 90 minutes,
as well as one hour for lunch (not provided) and two
half-hour breaks (eight hours in all). Every session is
segmented into three parts:
1. Theoretical background (30 minutes):
This section provides the background knowledge
relevant to the workshop, including a brief
overview of the algorithms working behind the
user interfaces.
2. Introduction of the user interface (15 to 30
minutes): To spare participants reading a manual,
a short introduction to the user interface is
given. The presenter's short introduction can be
followed locally by every participant. When a
problem occurs, the non-active presenters will
help the respective participant.
3. Hands-on section (30-45 minutes): After
receiving the text mining background and a short
introduction to the user interface, the participants
have up to half of a session to work on their
own laptops. Detailed questions can be put to
any of the presenters.
Based on the work within the eAQUA project over
the last years, the modules Explorative Search, Text
Completion, Difference Analysis, and Citation
Detection were chosen to highlight the benefits of
computer-based models. In detail:
- Explorative Search: Using Google in daily life,
almost everything can be found. The basic idea is: if
one web page doesn't contain the information sought,
another will. Searching humanities texts differs
in two main respects: a) the text corpus is closed
and relatively small compared to the Internet;
b) unlike everyday Google queries, fully specified
requests are quite uncommon in the humanities,
since the relevant set of words is often unknown.
For this reason a graph-based approach is used to
find, starting from a single word such as a city or a
person's name, interesting associated words one
would typically not have directly in mind. At the
end of this session it will be discussed briefly how
such an approach can be integrated into teaching,
since especially for students a search like this can
be useful for exploring and learning a domain.
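The graph-based idea sketched above can be illustrated with a small co-occurrence graph: words become nodes, and edge weights count how often two words appear in the same sentence. This is a minimal sketch under the assumption of sentence-level co-occurrence as the association measure; the eAQUA implementation may use a different statistic.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy corpus: each inner list is one tokenized sentence (invented data).
sentences = [
    ["athens", "pericles", "assembly"],
    ["athens", "piraeus", "harbour"],
    ["pericles", "assembly", "speech"],
    ["athens", "pericles", "speech"],
]

# Undirected co-occurrence graph: edge weight = number of sentences
# in which the two words appear together.
graph = defaultdict(Counter)
for tokens in sentences:
    for a, b in combinations(sorted(set(tokens)), 2):
        graph[a][b] += 1
        graph[b][a] += 1

def associated(word, n=3):
    """Return the n words most strongly associated with `word`."""
    return [w for w, _ in graph[word].most_common(n)]

print(associated("athens"))  # 'pericles' ranks first (co-occurs twice)
```

Starting from a single name, following the strongest edges surfaces associated terms the researcher may not have had in mind, which is the exploratory step described above.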
- Text Completion: Because of the degree of
fragmentation of papyri and inscriptions, a
dedicated session on completing texts is on the
agenda. In this session, well-established approaches
from spell checking are combined with dedicated
techniques addressing the properties of ancient texts.
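One simple way to picture text completion is pattern matching against a corpus lexicon. The sketch below assumes lacunae are marked with `.` for one missing letter and ranks candidates by corpus frequency; real papyrological editions use bracket conventions, and the session's techniques are more elaborate. Lexicon and frequencies are invented for illustration.

```python
import re

# Hypothetical lexicon: word -> corpus frequency.
lexicon = {"logos": 42, "logon": 17, "lithos": 5, "nomos": 30}

def complete(fragment):
    """Return lexicon words matching the fragmentary pattern,
    most frequent completion first ('.' = one missing letter)."""
    pattern = re.compile("^" + fragment + "$")
    hits = [w for w in lexicon if pattern.match(w)]
    return sorted(hits, key=lambda w: -lexicon[w])

print(complete("lo.o."))  # -> ['logos', 'logon']
```

Spell-checking-style candidate ranking like this is the baseline that the dedicated techniques for ancient text properties would then refine.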
- Difference Analysis: In this session a web-based
tool is introduced that compares the word lists of,
e.g., two authors, works, or literary classifications.
The result is divided into five categories: two
categories containing words used only in one of the
two text sets, two categories for words used
significantly more often in one of the two sets,
and finally a class of words with similar frequency.
Based on this separation, differences can be
identified faster than by manual reading.
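The five-way split described above can be sketched as follows. This assumes a simple frequency-ratio threshold as the significance criterion; the actual tool likely uses a proper statistical measure (e.g. log-likelihood), and the word counts are invented.

```python
from collections import Counter

def difference_analysis(freq_a, freq_b, ratio=2.0):
    """Split the combined vocabulary of two word-frequency lists
    into the five categories described in the text."""
    cats = {"only_a": [], "only_b": [], "more_a": [], "more_b": [], "similar": []}
    for w in set(freq_a) | set(freq_b):
        fa, fb = freq_a.get(w, 0), freq_b.get(w, 0)
        if fb == 0:
            cats["only_a"].append(w)       # used only in text set A
        elif fa == 0:
            cats["only_b"].append(w)       # used only in text set B
        elif fa / fb >= ratio:
            cats["more_a"].append(w)       # significantly more frequent in A
        elif fb / fa >= ratio:
            cats["more_b"].append(w)       # significantly more frequent in B
        else:
            cats["similar"].append(w)      # similar frequency in both
    return cats

a = Counter({"polis": 10, "theos": 2, "nike": 4})
b = Counter({"polis": 9, "theos": 8, "kosmos": 3})
print(difference_analysis(a, b))
```

Scanning the "only" and "significantly more often" categories first is what makes the comparison faster than manual reading.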
- Citation Detection: The session on detecting
citations covers three different aspects: a) How
can citations be detected? b) How can found
citations be accessed as efficiently as possible
by Ancient Greek philologists (micro view on
citations)? c) How can more global associations
be found, such as dependencies between centuries
and particular passages of works (macro view on
citations)? The main focus of this session is not
on the algorithms for finding citations but on the
user interfaces for different research groups.
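Although the session's focus is on the interfaces rather than the algorithms, a minimal sense of how detection can work is shared word n-grams between a source passage and a candidate text. This sketch assumes bigram overlap as the matching criterion; the eAQUA system is more sophisticated, and the example sentences are invented.

```python
def ngrams(tokens, n=2):
    """Return the set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

source = "know thyself said the oracle".split()
candidate = "the maxim know thyself appears here".split()

# A non-empty intersection flags the candidate as a possible citation.
shared = ngrams(source) & ngrams(candidate)
print(shared)  # contains ('know', 'thyself')
```

The micro view would then present each such match in context for philologists, while the macro view aggregates matches, e.g. per century or per work.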
Full day workshop: Monday, 5 July.


Conference Info


ADHO - 2010
"Cultural expression, old and new"

Hosted at King's College London

London, England, United Kingdom

July 7, 2010 - July 10, 2010

142 works by 295 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (still needs to be added)

Conference website: http://dh2010.cch.kcl.ac.uk/

Series: ADHO (5)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None