Use cases driving the tool development in the MONK project

multipaper session
  1. Catherine Plaisant

    University of Maryland, College Park

Work text

The Mellon-funded MONK project is developing a digital environment to help
humanities scholars discover and analyze patterns. During the past two years of this project, major advances have
been made in the infrastructure needed to 1) normalize and ingest text into a datastore,
2) support data mining and rich text analytics, and 3)
provide a flexible workbench with a variety of user interfaces for scholars to conduct their analysis tasks. The
design and development process of this environment
was guided by a small set of representative use cases.
While the development of the MONK environment is
still ongoing, scholars have been able to use early prototypes as well as other research or off-the-shelf tools to
conduct their scholarly work, while informing the development team about the useful features to be incorporated
in the MONK environment.
In this session we report on three MONK use cases
that are interesting individually and that collectively illustrate the types of tools being developed by MONK
to support other scholars in the future. The use cases
were selected to be diverse, representative of the questions MONK aims to address, and driven by a scholar already actively engaged in the study of the question. The
scholars were selected from the institutions of the project’s principal investigators. In the discussion period we
hope to discuss the struggles and successes our team encountered in the process.
The rest of this statement summarizes the status of the
MONK project.
As of the time of submission, MONK includes approximately 1,200 texts: 300 American novels published between 1851 and 1875, 250 British novels published between 1780 and 1900, 300 plays, 30 works of
16th- and 17th-century poetry, and some 300 works of
16th- and 17th-century prose, including fiction, sermons, travel
literature, and witchcraft texts.
The texts are normalized (using Abbot, a complex XSL
stylesheet) to TEI-A, and each text has been “adorned”
(using MorphAdorner) with tokenization, sentence
boundaries, standard spellings, parts of speech, and lemmata, before being ingested into a database that provides
Java access methods for extracting data for many purposes, including searching for objects; direct presentation in
end-user applications as tables, lists, concordances, or
visualizations; getting feature counts and frequencies for
analysis by data-mining and other analytic procedures;
and getting tokenized streams of text for working with n-gram and other collocation analyses, repetition analyses,
and corpus query-language pattern-matching operations.
Finally, quantitative analytics such as naive Bayesian analysis, support vector machines, and Dunning's log-likelihood
are run through the SEASR environment.
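As a rough illustration of two of the analyses named in this paragraph, the sketch below extracts n-grams from a tokenized stream and scores each word with Dunning's log-likelihood (G2) over a 2x2 contingency table comparing two sets of texts. It is written in Python for brevity and is not the MONK or SEASR implementation, which is Java-based. The function names (ngrams, log_likelihood, keyness) and the toy token lists are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over successive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def log_likelihood(k11, k12, k21, k22):
    """Dunning's G2 statistic for a 2x2 contingency table.

    k11: occurrences of the word in corpus A
    k12: occurrences of the word in corpus B
    k21: all other tokens in corpus A
    k22: all other tokens in corpus B
    """
    observed = ((k11, k12), (k21, k22))
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = sum(rows)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def keyness(tokens_a, tokens_b):
    """Score every word by how unevenly it is distributed across two corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scores = {}
    for word in counts_a.keys() | counts_b.keys():
        a, b = counts_a[word], counts_b[word]
        scores[word] = log_likelihood(a, b, size_a - a, size_b - b)
    return scores


# Hypothetical toy data standing in for two tokenized sets of texts.
corpus_a = "the whale the sea".split()
corpus_b = "the house the garden".split()
bigrams = list(ngrams(corpus_a, 2))
scores = keyness(corpus_a, corpus_b)
```

A word that occurs at the same relative rate in both sets (here, "the") scores zero, while words concentrated in one set score higher. In practice the scores would be computed over lemmata or standardized spellings rather than raw tokens.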
MONK will combine texts and tools to enable literary
research through the discovery, exploration, and visualization of patterns. Users start a project with one of the
toolsets that has been predefined by the MONK team.
Each toolset combines individual tools (e.g. a search
tool, a browsing tool, a rating tool, and a visualization)
that are applied to worksets of texts selected by the user
from the MONK datastore. Worksets and results can be
saved for later use or modification, and results can be
exported in some standard formats (e.g., CSV files).


Conference Info


ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009


Series: ADHO (4)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None