Digital Editions for Corpus Linguistics: A new approach to creating editions of historical manuscripts

paper
Authorship
  1. Alpo Honkapohja

    University of Helsinki

  2. Ville Marttila

    University of Helsinki

  3. Samuli Kaislaniemi

    University of Helsinki


Introduction
One relatively unexplored area in the expanding field of digital
humanities is the interface between textual and contextual
studies, namely linguistics and history. Up to now, few digital
editions of historical texts have been designed with linguistic
study in mind. Equally few linguistic corpora have been
designed with the needs of historians in mind. This paper
introduces a new interdisciplinary project, Digital Editions for
Corpus Linguistics (DECL). The aim of the project is to create
a model for new, linguistically oriented online digital editions
of historical manuscripts.
Digital editions, while on a theoretical level clearly more
useful to scholars than print editions, have not yet become
the norm for publishing historical texts. To some extent this
can be attributed to the lack of a proper publishing
infrastructure and user-friendly tools (Robinson 2005), which
limits the ability of individual scholars and small-scale
projects to produce digital editions.
The practical approach taken by the DECL project is to develop
a detailed yet flexible framework for producing electronic
editions of historical manuscript texts. Initially, the development
of this framework will be based on and take place concurrently
with the work on three digital editions of Late Middle and Early
Modern English manuscript material. Each of these editions (a
Late Medieval bilingual medical handbook, a family of 15th-century
culinary recipe collections, and a collection of early
17th-century intelligence letters) will serve both as a template
for the encoding guidelines for that particular text type and
as a development platform for the common toolset. Together,
the toolset and the encoding guidelines are designed to enable
editors of manuscript material to create digital editions of
their material with reasonable ease.
Theoretical background
The theoretical basis of the project is dually grounded in the
fields of manuscript studies and corpus linguistics. The aim
of the project is to create a solid pipeline from the former
to the latter: to facilitate the representation of manuscript
reality in a form that is amenable to corpus linguistic study.
Since context and different kinds of metatextual features are
important sources of information for such fields as historical
pragmatics, sociolinguistics and discourse analysis, the focus
must be equally on document, text and context. By document we
refer to the actual manuscript, by text to the linguistic contents
of the document, and by context to both the historical and
linguistic circumstances relating to the text and the document. In
practice this division of focus means that all of these levels are
considered equally important facets of the manuscript reality
and are thus to be represented in the edition.
On the level of the text, DECL adopts the view of Lass
(2004) that a digital edition should preserve the text as
accurately and faithfully as possible, convey it in as flexible a
form as possible, and ensure that any editorial intervention
remains visible and reversible. We also adopt a similar approach
with respect to the document and its context: the editorial
framework should enable and facilitate the accurate encoding
and presentation of both the diplomatic and bibliographical
features of the document, and the cultural, situational and
textual contexts of both the document and the text. In keeping
with the aforementioned aims, the development of both the
editions, and the tools and guidelines for producing them, will
be guided by the following three principles.
Flexibility
The editions seek to offer a flexible user interface that will be
easy to use and enable working with various levels of the texts,
as well as selecting the features of the text, document and
context that are to be included in the presentation or analysis
of the text. All editions produced within the framework will
build on similar logic and general principles, which will be
flexible enough to accommodate the specific needs of any
text type.
Transparency
The user interfaces of the editions will include all the features
that have become expected in digital editions. But in addition
to the edited text and facsimile images of the manuscripts,
the user will also be able to access the raw transcripts and
all layers of annotation. This makes all editorial intervention
transparent and reversible, and enables the user to evaluate
any editorial decisions.
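As a minimal sketch of how editorial intervention can remain visible and reversible, the example below stores the manuscript reading and the editorial reading side by side and derives both views from the same encoded line. It assumes TEI-style choice, orig, reg, abbr and expan elements; the sample line, the view names and the render function are hypothetical illustrations, not part of the DECL toolset, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line; wording and spellings are invented for illustration.
SAMPLE = (
    '<l>Take <choice><orig>ye</orig><reg>the</reg></choice> rosewater '
    '<choice><abbr>wt</abbr><expan>with</expan></choice> sugar</l>'
)

# Which alternatives inside <choice> each view prefers (assumed view names).
VIEWS = {
    "diplomatic": ("orig", "abbr"),
    "normalised": ("reg", "expan"),
}


def render(elem, view):
    """Return the text of elem, resolving every <choice> for the given view."""
    keep = VIEWS[view]
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            # Emit only the first alternative that belongs to this view.
            for alt in child:
                if alt.tag in keep:
                    parts.append(render(alt, view))
                    break
        else:
            parts.append(render(child, view))
        # Text following the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
diplomatic = render(line, "diplomatic")
normalised = render(line, "normalised")
print(diplomatic)
print(normalised)
```

Because neither reading is discarded, a user who disagrees with a regularisation or an expansion can always fall back on the diplomatic view of the same file.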
Expandability
The editions will be built with future expansion and updating in
mind. This expandability will be three-dimensional in the sense
that new editions can be added and linked to existing ones,
and both new documents and new layers of annotation or
information can be added to existing editions. Furthermore, the
editions will not be hardwired to a particular software solution,
and their texts can be freely downloaded and processed for
analysis with external software tools. The editions will be
maintained on a web server and will be compatible with all
standards-compliant web browsers.
Technical methods
Following the aforementioned principles, the electronic
editions produced by the project will reproduce the features
of the manuscript text as a faithful diplomatic transcription,
into which linguistic, palaeographic and codicological features
will be encoded, together with associated explanatory notes
elucidating the contents and various contextual aspects of the
text. The encoding standard used for the editions will be based
on and compliant with the latest version of the TEI XML
standard (P5, published 1 November 2007), with any text-type-specific
features incorporated as additional modules to the TEI schema.
The XML-based encoding will enable the editions to be used
with any XML-aware tools and easily converted to other
document or database standards. In addition to the annotation
describing the properties of the document, text and context,
further layers of annotation—e.g. linguistic analysis—can be
added to the text later on utilising the provisions made in the
TEI P5 standard for standoff XML markup.
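The idea of standoff annotation can be sketched as follows: the base transcription is left untouched, and a later layer refers to its tokens by identifier only. This is an illustrative sketch under stated assumptions. The token identifiers, the part-of-speech labels and the merge function are hypothetical, and the layer markup is only loosely modelled on TEI pointer mechanisms and has not been validated against the TEI schema.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Base transcription: each word carries an identifier and is never modified.
BASE = (
    '<l>'
    '<w xml:id="w1">Take</w> <w xml:id="w2">ye</w> <w xml:id="w3">rosewater</w>'
    '</l>'
)

# A separate layer that only points at the base text (hypothetical tag set).
POS_LAYER = (
    '<spanGrp type="pos">'
    '<span target="#w1" ana="VERB"/>'
    '<span target="#w2" ana="DET"/>'
    '<span target="#w3" ana="NOUN"/>'
    '</spanGrp>'
)


def merge(base_xml, layer_xml):
    """Pair each base token with the label that points at it, if any."""
    labels = {
        span.get("target").lstrip("#"): span.get("ana")
        for span in ET.fromstring(layer_xml).iter("span")
    }
    return [
        (w.text, labels.get(w.get(XML_ID)))
        for w in ET.fromstring(base_xml).iter("w")
    ]


tagged = merge(BASE, POS_LAYER)
print(tagged)
```

Since the layer lives apart from the transcription, it can be added, replaced or ignored without touching the base text, which is what allows further annotation to be added later on.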
The editorial strategies and annotation practices of the three
initial editions will be carefully coordinated and documented
to produce detailed guidelines, enabling the production of
further compatible electronic editions. The tools developed
concurrently with and tested on the editions themselves
will make use of existing open source models and software
projects—such as GATE or Heart of Gold, teiPublisher and
Xaira—to make up a sophisticated yet customisable annotation
and delivery system. The TEI-based encoding standard will also
be compatible with the standards developed by ISO/TC 37/SC 4,
facilitating the linguistic annotation of the text.
Expected results
One premise of this project is that creating digital editions
based on diplomatic principles will increase the usefulness
of digitised historical texts by broadening their scope and
therefore also interest in them. Faithful reproduction of the
source text is a prerequisite for historical corpus linguistics,
but editions based on diplomatic transcripts of manuscript
sources are equally amenable to historical or literary enquiry.
Combining the approaches of different disciplines—historical
linguistics, corpus linguistics, history—to creating electronic
text databases should lead to better tools for all disciplines
involved and increase interdisciplinary communication and
cooperation. If they prove to be successful, the tools and
guidelines developed by DECL will also be readily applicable
to the editing and publication of other types of material,
providing a model for integrating the requirements and desires
of different disciplines into a single solution.
The first DECL editions are being compiled at the Research
Unit for Variation, Contacts and Change in English (VARIENG)
at the University of Helsinki, and will form the bases for three
doctoral dissertations. These editions, along with a working
toolset and guidelines, are scheduled to be available within the
next five years.
Since the aim of the DECL project is to produce an open
access model for multipurpose and multidisciplinary digital
editions, both the editions created by the DECL project
and the tools and guidelines used in their production will
be published online under an open access license. While the
project strongly advocates open access publication of scholarly
work, it also acknowledges that this may not always be possible due
to ongoing issues with copyright, for example in the case of
facsimile images.
The DECL project is also intended to be open in the sense that
participation or collaboration by scholars or projects working
on historical manuscript materials is gladly welcomed.
References
DECL (Digital Editions for Corpus Linguistics).
<http://www.helsinki.fi/varieng/domains/DECL.html>.
GATE (A General Architecture for Text Engineering).
<http://gate.ac.uk>. Accessed 15 November 2007.
Heart of Gold. <http://www.delph-in.net/heartofgold/>.
Accessed 23 November 2007.
ISO/TC 37/SC 4 (Language resource management).
<http://www.iso.org/iso/iso_technical_committee.html?commid=297592>.
Accessed 15 November 2007.
Lass, Roger. 2004. “Ut custodiant litteras: Editions, Corpora
and Witnesshood”. Methods and Data in English Historical
Dialectology, ed. Marina Dossena and Roger Lass. Bern: Peter
Lang, 21–48. [Linguistic Insights 16].
Robinson, Peter. 2005. “Current issues in making digital
editions of medieval texts—or, do electronic scholarly
editions have a future?”. Digital Medievalist 1:1 (Spring 2005).
<http://www.digitalmedievalist.org>. Accessed 6 September
2006.
TEI (Text Encoding Initiative). <http://www.tei-c.org>.
Accessed 15 November 2007.
teiPublisher. <http://teipublisher.sourceforge.net/docs/index.php>.
Accessed 15 November 2007.
VARIENG (Research Unit for Variation, Contacts and Change
in English). <http://www.helsinki.fi/varieng/>.
Xaira (XML Aware Indexing and Retrieval Architecture).
<http://www.oucs.ox.ac.uk/rts/xaira/>. Accessed 15
November 2007.


Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO
