Digital Editions for Corpus Linguistics: Encoding Abbreviations in TEI XML Mark-up

poster / demo / art installation
  1. Alpo Honkapohja

    University of Helsinki

This poster will present the Digital Editions for Corpus
Linguistics (DECL) system for encoding manuscript
abbreviations in TEI-conformant XML. First, I
will briefly describe the DECL project, presenting its
aims and editorial policies. Secondly, I will go through
the problems resulting from the silent expansion of abbreviations,
an approach some digital editions inherit
from traditional editing. Finally, I will describe the
options TEI P5 offers for encoding abbreviations, the
DECL application of the guidelines, and the benefits
it has for the type of historically oriented,
corpus-searchable editions we are compiling. The examples
will come from a digital edition of Trinity College Cambridge
MS O.1.77, a pocket-sized late medieval medical
handbook in Middle English and Latin, which I am editing
for my PhD thesis.
Digital Editions for Corpus Linguistics
DECL is a project, based at the Research Unit for Variation,
Contacts and Change (VARIENG) at the University
of Helsinki, which aims at developing online editions
that combine the accurate description of historical documents
with the flexibility of search tools developed for
linguistic computing. It was formed by three postgraduate
students at VARIENG in 2007, who shared a dissatisfaction
with extant tools and resources, and aimed
to develop a more versatile and user-friendly model for
digitised manuscripts of historical texts. The tools and
framework are designed to meet the needs of small-scale
research projects and individual scholars. They are based
on and compatible with version P5 of the TEI guidelines.
On the level of editorial principles, DECL editions adopt
the view of Lass (2004) that digital editions should
preserve the text as faithfully as possible, convey it in as
flexible a form as possible, and ensure that any editorial
intervention remains visible and reversible. DECL formulates
this into three central principles: Transparency, Flexibility
and Expandability. DECL editions aim to offer the
user diplomatic transcriptions of the manuscripts into
which linguistic, palaeographic and codicological features
will be encoded. Additional layers of contextual, codicological and linguistic annotation can be added to
the editions using standoff XML tagging.
One of the most ubiquitous problems encountered in editing
medieval manuscripts is how to represent the numerous
abbreviations in them. There is no established
standard for encoding these abbreviations in digital format,
and many digital editions still follow the practice
inherited from traditional book editions of expanding
them, either silently or in italics. From the point of view
of historical linguistics this is somewhat problematic, especially
in the light of some recent discussion over what
is required of an edition or corpus in order to constitute
reliable data (cf. i.a. Bailey 2004; Curzan and Palmer
2006; Dollinger 2004; Grund 2006). Most vocal in his
criticism of existing practices has been Lass (2004),
who demands that in order to serve as valid data for the
historiography of language, a digital edition or a corpus
should not contain any editorial intervention that results
in substituting the scribal text with a modern equivalent.
Expanding abbreviations substitutes a symbol used by
the scribe with a modern reading of it, which may, in
the vast majority of cases, be obvious, and supported
by research, but, by definition, also contains an element
of editorial interpretation. In some cases this may have
an impact on the data. For example, the irregularity of
Middle English spelling may force the editor to decide
which combination of letters a particular
abbreviation stands for in a text in which the abbreviated
word may appear in several spelling variants.
The TEI P5 module for encoding glyphs and non-standard
characters offers a few alternative ways of annotating
them. The abbreviations may be annotated with the <g>
element, indicating that they are glyphs, in which case they
are defined in the TEI header using the gaiji module. Alternatively,
the <am> and <ex> elements can be used to mark the abbreviation
sign and its editorial expansion. In these cases, the
<choice> element may also be used to indicate that
some of the elements are alternatives to each other. Or
the whole word may simply be annotated as <abbr> to
show that it contains abbreviations. The editor may also
use the <expan> element, indicating items which have been
expanded without recording the abbreviation symbol.
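As a rough illustration of these options, the fragment below encodes one abbreviated word in each of the ways just listed. It is a sketch rather than an excerpt from the edition: the word "super" and the glyph identifier crossed-p are taken from the DECL example in this abstract, and the fragment assumes that a glyph with that identifier has been declared in the header.

```xml
<!-- Illustrative fragment only; assumes a glyph with xml:id="crossed-p"
     is declared in the TEI header. -->
<p>
  <!-- (a) Abbreviation sign recorded as a glyph reference -->
  su<g ref="#crossed-p"/>

  <!-- (b) Abbreviation sign and editorial expansion given as alternatives -->
  su<choice>
    <am><g ref="#crossed-p"/></am>
    <ex>per</ex>
  </choice>

  <!-- (c) Whole word flagged as containing an abbreviation -->
  <abbr>su<g ref="#crossed-p"/></abbr>

  <!-- (d) Expansion only; the abbreviation sign is not recorded -->
  <expan>super</expan>
</p>
```

Only options (a) to (c) keep a record of the scribal sign; option (d) is the digital counterpart of silent expansion.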
The DECL guidelines use an application of this that
marks both the symbol and its content, but does not require
multiple elements inside a single word, as they can
cause difficulties with stand-off tagging. We use
the <abbr> element to mark that a word contains an
abbreviation and the <g> element to tag the content of each
abbreviation, giving the abbreviation symbol used for
it as an attribute value.
<abbr>su<g ref="#crossed-p">per</g></abbr>
The expanded part of the abbreviation, which is in fact
the editor's reconstruction and in some cases may be up for debate,
is enclosed in the <g> tags and is thus also marked
as editorial, which is in accordance with the DECL
principle of transparency.
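The value of the ref attribute is a pointer, so it needs a target. A minimal sketch of how such a target might be declared, and how the word relates to it, is given below. The wording of the description and the comments on placement are assumptions made for illustration, not the project's actual header.

```xml
<!-- Sketch only: the glyph description is an assumption, not DECL's wording. -->

<!-- In the TEI header (inside encodingDesc): declare the symbol once -->
<charDecl>
  <glyph xml:id="crossed-p">
    <desc>p with a stroke through its descender</desc>
  </glyph>
</charDecl>

<!-- In the transcription: one abbr per abbreviated word, one g per
     abbreviation, with the editorial expansion as the content of g -->
<abbr>su<g ref="#crossed-p">per</g></abbr>
```

Because the symbol is identified by reference and the expansion is element content, both readings can be recovered from the same encoded word.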
Aims and Benefits
The XML code can be dynamically processed via XSLT
transformation scripts to create documents which display
either the abbreviations or expanded words according
to the needs of the user, and DECL editions will also
offer the user a customisable online interface, capable of
displaying both. In addition to visual presentation and
browsing, the interface will also offer corpus search and
analysis functions, which can be extended to searches
on the specified elements or attributes. Following the
principles of open source and open access, the users of
DECL editions will have full access to the code and may
download and alter it, meaning that it is possible to revise
the editorial decisions if the user is not satisfied with them.
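To make the idea of switching between views concrete, here is a minimal XSLT 1.0 sketch. It is not the DECL stylesheet: the view parameter, the class names and the bracketed placeholder for the abbreviation symbol are all hypothetical, and the sketch assumes the edition is in the standard TEI namespace and uses the abbr and g pattern described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <xsl:output method="html" encoding="UTF-8"/>

  <!-- Hypothetical switch: 'expanded' (default) or 'abbreviated' -->
  <xsl:param name="view" select="'expanded'"/>

  <!-- Wrap each abbreviated word so it can be styled or searched -->
  <xsl:template match="tei:abbr">
    <span class="abbr"><xsl:apply-templates/></span>
  </xsl:template>

  <!-- Render the abbreviation itself according to the chosen view -->
  <xsl:template match="tei:abbr/tei:g">
    <xsl:choose>
      <xsl:when test="$view = 'expanded'">
        <!-- Editorial expansion, italicised to keep it visible -->
        <i><xsl:apply-templates/></i>
      </xsl:when>
      <xsl:otherwise>
        <!-- Placeholder naming the symbol; expansion kept as a tooltip -->
        <span class="glyph" title="{.}">
          <xsl:text>[</xsl:text>
          <xsl:value-of select="substring-after(@ref, '#')"/>
          <xsl:text>]</xsl:text>
        </span>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

With a processor such as xsltproc, the view could be chosen on the command line, for example `xsltproc --stringparam view abbreviated views.xsl edition.xml`, where both file names are placeholders. A production stylesheet would presumably look up the glyph declaration and show the actual symbol or an image of it instead of the bracketed identifier.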
In the poster I will give an illustrated presentation of how
the process of encoding abbreviations progresses from
manuscript images, via TEI XML code to its various
forms of presentation.
References
Bailey, Richard W. (2004). The need for good texts:
The case of Henry Machyn’s Day Book, 1550–1563,
Studies in the history of the English
language II: Unfolding conversations (Topics in
English Linguistics 45): 217–228.
Curzan, Anne and Palmer, Chris C. (2006). The importance
of historical corpora, reliability, and reading,
Corpus-based Studies of Diachronic English: 17–34.
DECL (Digital Editions for Corpus Linguistics).
Dollinger, Stefan. (2004). ‘Philological computing’ vs.
‘philological outsourcing’ and the compilation of historical
corpora: A Late Modern English test case, Vienna English Working Papers (VIEWS), 13(2): 3–23.
Grund, Peter. (2006). Manuscripts as sources for linguistic
research: A methodological case study based on
the Mirror of Lights, Journal of English Linguistics, 34:
Lass, Roger. (2004). ‘Ut custodiant litteras: Editions,
Corpora and Witnesshood’, in: Marina Dossena and
Roger Lass (eds.) Methods and Data in English Historical
Dialectology (Linguistic Insights 16): 21–48.
TEI (Text Encoding Initiative). <http:/>.
VARIENG (Research Unit for Variation, Contacts
and Change in English). <

Conference Info


ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

Series: ADHO (4)

Organizers: ADHO

  • Language: English