Open Source and Digital Humanities

Amy Earhart; Dominic Forest; James Smith

Authorship

1. Amy Earhart

English Dept - Texas A&M University
2. Dominic Forest

École de bibliothéconomie et des sciences de l'information - Université de Montréal
3. James Smith

Computer Information Services - Texas A&M University

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Funding agencies such as NEH and AHRC have clearly
stated that the open source approach to digital humanities
work is necessary for both short-term financial support of
projects and long-term success of digital humanities work.
However, both agencies have predominantly emphasized the
creation of open source projects rather than asking scholars to
consider the possibilities offered when projects tap into the use
of externally produced open source software. Even less
emphasis has been placed on how academic projects might
structure projects to invite participation from the open source
development community.
There are excellent examples of groups that have modeled the
importance of open source methodologies in Digital Humanities
including the TEI/XML movement to standardize an XML
appropriate to humanities projects and NINES, currently
developing open source tools for use by the broader community.
Less common, however, is the exploration of how we might
engage the developer community with our work. Examples
abound in the larger technology framework. BBC’s Backstage
movement has created a model that asks a free developer
community to produce vast numbers of prototypes based on
existing RSS feeds and open source software. This movement
provides a possible model and possible interfaces that could
benefit digital humanities work. OSCON (Open Source
Convention) offers another possible model. While academics
struggle to find the appropriate tools for digital humanities and
appropriate standardization practices within the context of
academic restraints (budget and promotion issues in particular),
we might consider how to engage the broader open source developer community and how to rewrite and use open source
software for our digital humanities projects.
This panel will explore various approaches to engaging with
open source software and development. As is fitting for a panel
that looks to engage a non-academic audience in the
development of our projects, we have put together three papers
from diverse contributors: a linguist interested in the use of
open source software for text mining; a literary scholar
interested in testing open source programs such as Ruby on
Rails and Google maps, in a digital archive; and a computer
programmer who has actively worked with the open source
development community, and OSCON in particular, over the
past 10 years. The panel participants will discuss the use of
particular pieces of software, benefits and limitations of such
approaches, ways to open digital humanities work to a broader
development community, and how to engage the open source
development community in our work.
Text Mining and Computer-Assisted Thematic
Analysis: Using Open-Source Software to Discover
Knowledge from Unstructured Documents
Dominic Forest
During the last decades, many humanities research initiatives
have been concentrating their efforts on digitalizing large
amounts of text (British National Corpus, Brown corpus,
Gallica project, Oxford Text Archive, etc.). Simultaneously,
technologies were developed for more accurate and faster
digitalization processes, and for more flexible and robust digital
document encoding (XML), storing and retrieval. In the last
years, some results of these digitalization initiatives have been
made available on various supports and in many digital formats.
Digitalization initiatives partly motivated the need for robust
large-scale text analysis software.
The development of text analysis tools has been influenced
(and has often integrated) concepts and techniques from many
academic fields. Among the main disciplines involved in the
conception and development of text analysis software, we find
linguistics (natural language processing), information science,
and computer science. However, in the last years important
research developments in the general field of text analysis
emerged from the areas of data analysis, artificial intelligence
and machine learning. Initially, the concepts and techniques
developed in these areas where mostly applied on numerical
data. This led to a rapidly growing research area known as “data
mining”. The primary objective of data mining tools is to rapidly
extract valuable (and previously unknown) information from
large sets of quantitative data. Many researchers, mostly familiar
with the information retrieval community, saw the potential of
this emerging technology in the domain of text analysis. The
application of data mining concepts and techniques on large
textual databases quickly became an active interdisciplinary
research field known as “text mining”.
Although some more or less efficient commercial text mining
software is becoming available, these tools suffer from two
main problems in an academic research context. First,
commercial text mining suites are very expensive and therefore
have not been fully integrated in academic practices. Secondly,
and more importantly, these suites have been conceived in order
to mainly assist business-related text mining tasks. They are
not adapted to the variety and specificity of humanities
academic research. Nevertheless, very recent academic projects
have started to explore the potential of text mining techniques
applied to large humanities and social sciences corpora. These
research initiatives are mainly based on open-source text mining
modules.
The objectives of our talk are 1) to report an experiment in
which we use open- source text mining tools to assist thematic
analysis from unstructured texts and 2) identify the benefits
and limits of using open-source text mining tools in the context
of text analysis tasks in the humanities.
Our talk will be divided in three main parts. In the first part,
we will present the fundamental concepts and techniques of
text mining. More specifically, we will present how two
fundamental knowledge discovery processes (text clustering
and text categorization) can be assisted – and in some cases
accomplished automatically – using state-of-the-art text mining
techniques.
The second part will be dedicated to the presentation of a
specific text mining methodology to assist thematic analysis
of unstructured documents. This methodology is composed of
four main steps: 1) text pre-processing, 2) text transformation
using vector-space model, 3) knowledge discovery (using
clustering or categorisation processes), and 4) thematic analysis.
In this part, we will present each step in details and we will
present open-source text mining tools that can accomplish each
process.
In the third part, we will report the main results obtained using
this four-step methodology on a Belgian newspaper corpus.
In conclusion, based on experiments using open-source text
mining technology, we will identify and discuss some benefits
and limits of open-source text mining tools in the context of
humanities text analysis. An Open-Source Approach to Digital Humanities:
Testing the Limits of Open Source Ideology in the
19th-Century Concord Digital Archive
Amy Earhart
The 19th-Century Concord Digital Archive collaborators are
exploring the utilization of a wide variety of open source tools
in archive construction as well as following a distribution
method that will encourage the participation of an open source
development community. As the 19th-Century Concord Digital
Archive nears beta test (Summer 2007), it seems an appropriate
moment to consider the impact of the open source ideology of
the project. A core value of many of the currently produced
digital humanities projects is an open and free archive. This
approach generally implies that users experience the free use
of archive materials delivered through the web and, in rare
instances, the development of open source tools that may be
applicable to projects that use similar standards, such as
TEI/XML. What has been given little attention, both in practice
and in theory, has been a careful consideration of how
non-academic open source programs might be leveraged for
digital humanities projects, particularly digital archives, and
what the positive and negative impact of such an approach
might be. Using the Digital Concord project as a test case, the
paper examines the possibilities offered by contemporary
web-based technologies. This paper will discuss two such
possible non-academic open source tools as well as planning
practices. The presentation will then extrapolate from the
experience of archive development to generalize some of the
ways that non-academic open source approaches might
positively and negatively impact digital humanities.
Of the various non-academic open source applications in use
in the construction of the Digital Concord archive, this paper
will focus on the two: Ruby on Rails and Google Map Hacks.
Ruby on Rails
Ruby on Rails has produced some of the hottest buzz of recent
open source technologies. Rails is a web application framework
that not only offers amazing capabilities, but has attracted some
of the best and brightest of the open source development
community. It is particularly useful to digital humanities work
because it shows remarkable promise for rapid development
of real world applications and works well with TEI/XML,
relational databases, and RDF, a means of relating disparate
information. And, given the interest in Rails, the development
community has begun development of a broad set of beta
programs that, with some modification, might prove helpful to
our current digital humanities work. During the presentation,
I will highlight some of the uses that Rails offers and discuss
positive and negative impacts of the approach.
Google Map Hacks
In an attempt to provide multiple user interfaces and remedy
the lack of responsiveness to a user experience with some digital
humanities work, the Digital Concord project has sought to
develop multiple interfaces. One such interface is visual and
based on a manipulation of Google Maps. While GIS is a
well-developed tool and has been applied to some digital
humanities projects, there are other tools that might be of greater
use to digital humanists due to cost, development, and specific
disciplinary needs. Rather than rely on GIS tools, the Concord
archive experiments with programs that are under development
in an open source community in the expectation that such an
approach will offer a way for academics to tap into a previously
unexplored group of participants and developers. The advances
made by individuals interested in such maps are ongoing;
projects chronicled at the blog “Google Maps Mania” <http
://googlemapsmania.blogspot.com/> show the
possibilities for incorporating Google maps with a variety of
data. Using the Concord interface as a starting point, the paper
will discuss multiple possibilities for Google Map Hacks.
Designing Humanities Data for Open Source
Developers
While the use of such open source technologies might have a
very real impact on digital projects, the potential for engaging
the open source community in digital humanities offers an even
greater opportunity. Models from Moodle to the BBC backstage
project <http://backstage.bbc.co.uk/> suggest that
the possibilities for engaging a broad developer community in
digital work is possible with the right set of data and appropriate
interface. While the Digital Concord project has no illusions
of generating the type of interest as seen in previously
mentioned projects, there are ways that the project can open
data and create a community of interest around digital
humanities work that could have positive repercussions. The
paper will close with a discussion of potential strategies that
individual archives might take to interest the open source
community.
Interfacing with the Open Source Community:
An OSCOM approach
James Smith
Academic progress seems to depend on a person's ability to
contribute to society as measured by attributable work such as
articles and monographs in areas that are interesting to society.
These attributions depend on copyrights and patents, the keys
to establishing and protecting intellectual property ownership.
The various print journals and publishing houses act as a
peer-reviewed intermediary. The academy establishes social interest in a project by the project's ability to attract financial
support.
Digital Humanities (DH) relies heavily on what is now being
called Web 2.0: technologies that allow nearly seamless
cross-platform client/server applications (e.g., AJAX and
Comet). Most of these technologies have roots in the Open
Source Community (OSC). Much of the talent is also in the
OSC. Until this talent is brought into DH, DH is in danger of
being a field looking in from outside, watching technology
advance as it tries to catch up. Yet, DH is rooted in the academy
and must meet the expectations of the academy.
The Open Source Community has its roots in the academy, but
has lived for a while in the wild among amateurs in the classic
sense. Many people participate in open source because they
enjoy doing so, though some participate because of their
employment. The OSC is built on a stereotype of the academy:
information is free and everyone is able to work on problems
they find interesting. Just as professors enjoy working in a
University or College with other smart people in their field,
OSC members are attracted to projects with smart people. OSC
members also value the openness with which they can develop
and discuss projects.
The Open Source Community has not abandoned the notion of
attribution and social relevancy. The openness of a project's
development creates a trajectory along which the project travels.
This trajectory is a measure of the talent behind the project and
is apparent to most who are interested in the project.
Misappropriating a particular release of a project captures only
a single point on that trajectory. Because any particular release
of a project does not bring with it any of the talent behind the
project, `stealing' from the project is worth much less.
Attribution in the OSC is much more than just the name beside
the copyright or on the patent.
An Open Source project is socially relevant if it is widely used
and has a strong community. There are no financial
requirements for a project to be successful. Sorceforge.net hosts
Open Source projects at no cost to the project not because any
particular project is worth the cost, but because the OSC itself
is socially relevant. People contribute their time to a project
not necessarily because they get paid but because they enjoy
the project and, if they make significant contributions, they can
become a well known talent in the OSC.
The problem, then, is how to balance the requirements of the
academy against the need to create an environment that is
inviting to the open source community. By openly involving
the open source community, DH can access a wide variety of
talent which will be involved in various projects not because
they are being paid to help, but because they love the project.
At the same time, DH can maintain the attribution required for
academic progress.
One academic field of study that is leading the way with
technology is Physics with the development of the world wide
web at CERN to aid in sharing documents to the electronic
pre-print server (<http://xxx.lanl.gov/> ). The Physics
community can do this because it is small enough that its
members know each other. Reputation is built much as it is in
the Open Source Community: by personal experiences between
members of the community. Formal peer review plays a
secondary role. By reviewing the output of a physicist, another
can see the pattern and tell if something new fits that pattern.
If any company represents the commercial potential of Digital
Humanities, it is Google. They have been able to attract some
of the top talent in the industry by providing a work
environment that resembles the Open Source Community in
many aspects. Some of the more interesting projects for DH
have come from employees' `play time.' Google has brought a
lot of smart people together under one roof, much as a
University or College might do.
By looking to other academic fields and the Open Source
Community, Digital Humanities can create a new environment
encouraging rapid evolution of ideas without sacrificing the
need for attribution, reputation, or social relevance.

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Conference website: http://www.digitalhumanities.org/dh2007/

References: http://web.archive.org/web/20070810143343/http://digitalhumanities.org/dh2007/DH2007.detail.html http://web.archive.org/web/20080703194728/http://www.digitalhumanities.org/dh2007/abstracts/titles.xq

Series: ADHO (2)

Organizers: ADHO

Open Source and Digital Humanities

1. Amy Earhart

2. Dominic Forest

3. James Smith

ADHO - 2007