Management of Data for Building Electronic Editions of Historic Manuscripts

paper
Authorship
  1. Alex Dekhtyar

    Computer Science - University of Kentucky

  2. Ionut Emil Iacob

    No affiliation given

Work text

The process of preparing electronic editions [SGK00] is long, time-consuming, and arduous for an editor or
editors. Some of the work cannot be automated—editors, for example, must scrutinize every single letter of a
document numerous times to come to fully informed decisions concerning script, meaning, and spelling
[Hay01]. The success of the ARCHway project, designed to alleviate unnecessary tedium in the editorial
process, lies in large part in the correct manipulation of the data that forms the electronic edition.
At the outset of the process of preparing an electronic edition, an editor takes raw data in the form of
digital images of the manuscript folios and precise transcriptions of them and proceeds to encode the
transcriptions with descriptive markup by:
• identifying the folios and folio lines, and the prose and/or verse lines;
• associating folio images with the text that they contain;
• creating a full glossary entry for every word in the text;
• recording a wide range of features of the manuscript, such as (among many other things) its
script, its legibility, any damage to the manuscript, any technological means used to restore
damaged readings, as well as editorial emendations and conjectural restorations.
This encoding must be stored in a way that
1. allows efficient retrieval of information (e.g., “Display all characters written by the second
scribe that are visible only under ultra-violet light”);
2. ensures efficient and convenient manipulation of the data;
3. provides support for editorial work by more than one person at the same time.
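The retrieval requirement can be illustrated with a minimal sketch. It assumes, purely for illustration, that each recorded feature is kept as an annotation over a range of character offsets in the transcription; the class name, the feature names ("scribe", "visibility"), and the sample text are hypothetical and not taken from the ARCHway software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    feature: str  # hypothetical feature name, e.g. "scribe" or "visibility"
    value: str
    start: int    # inclusive character offset into the transcription
    end: int      # exclusive character offset


def offsets(annotations, feature, value):
    """Return the set of character offsets covered by feature=value."""
    covered = set()
    for a in annotations:
        if a.feature == feature and a.value == value:
            covered.update(range(a.start, a.end))
    return covered


def query(text, annotations, *conditions):
    """Return (offset, character) pairs that satisfy every (feature, value) condition."""
    sets = [offsets(annotations, f, v) for f, v in conditions]
    hits = set.intersection(*sets) if sets else set()
    return [(i, text[i]) for i in sorted(hits)]


# Illustrative data only: offsets 0-8 by scribe 1, 9-15 by scribe 2,
# offsets 7-11 legible only under ultra-violet light.
text = "hwaet we gardena"
annotations = [
    Annotation("scribe", "1", 0, 9),
    Annotation("scribe", "2", 9, 16),
    Annotation("visibility", "uv-only", 7, 12),
]

result = query(text, annotations, ("scribe", "2"), ("visibility", "uv-only"))
print(result)
```

Intersecting offset sets is the simplest way to show the idea; an index over ranges would be needed for a full edition.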
The challenges the editorial process presents can be broken into three broad categories:
• management of XML markup produced during the editorial process [BSK01],
• maintenance of associations between XML markup and manuscript images [YS01], and
• support for a multi-user environment.
MANAGEMENT OF XML MARKUP FOR ELECTRONIC EDITIONS
Because the editor and research team will record a highly diverse and extensive set of manuscript features, the
markup of the edition text is bound to be complex. The features being described often overlap in the
actual manuscript rather than nesting inside one another, so they do not follow the nesting patterns that
well-formed XML requires, and a single XML document of the edition ends up with clashing hierarchies. There are two approaches to
overcoming this problem. One is to maintain an extremely complicated DTD (or XML Schema) [BPS00,
BM01] for all markup and use specific, at times cumbersome, markup conventions to overcome clashing
hierarchies in the tagging. While this approach has the benefit of storing all annotation in the same XML
document, the ad hoc solutions adopted to keep the markup well-formed reflect negatively on the clarity of
the XML and may adversely affect subsequent searches and retrieval of information. The tagging software for
the editor may also become too tied to a particular XML Schema to be of generic use.
Another approach is to separate markup for different features into different DTDs (or XML Schemas)
and maintain parallel markup of the edition text automatically. This approach results in clear, concise, easily
maintained DTDs or XML Schemas. It is also an excellent stepping stone for supporting simultaneous work of
many editors on one edition: for example, paleography markup created by one editor and damage markup
created by another at the same time can be stored separately and thus will create no conflicts in data. This
approach shifts most of the data maintenance burden from the shoulders of the editorial team to the software
that supports the process. It is currently being implemented in ARCHway.
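A minimal sketch of the parallel-markup idea follows. It assumes each feature layer is stored as its own list of non-overlapping character spans over the shared transcription and is serialized to its own well-formed XML document. The layer names, tag names, and sample text are hypothetical, and this is not a description of the ARCHway implementation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render_layer(text, root_tag, spans):
    """Serialize one markup layer as a well-formed XML string.

    spans: (start, end, tag) tuples that do not overlap within this layer.
    """
    parts, pos = [], 0
    for start, end, tag in sorted(spans):
        if start < pos:
            raise ValueError("spans within one layer must not overlap")
        parts.append(escape(text[pos:start]))
        parts.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        pos = end
    parts.append(escape(text[pos:]))
    return f"<{root_tag}>{''.join(parts)}</{root_tag}>"


text = "hwaet we gardena"

# The damage span (6-11) straddles the boundary between the two hands (8),
# so the two hierarchies could not nest cleanly inside a single document.
layers = {
    "paleography": [(0, 8, "hand1"), (8, 16, "hand2")],
    "damage": [(6, 11, "dmg")],
}

documents = {name: render_layer(text, name, spans) for name, spans in layers.items()}

for name, doc in documents.items():
    # Every layer parses on its own and carries exactly the same base text.
    assert "".join(ET.fromstring(doc).itertext()) == text
    print(name, doc)
```

Because every layer wraps the same base text, software can check that the layers stay aligned by comparing the text each one yields once tags are stripped.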
BUILDING ELECTRONIC EDITIONS AROUND IMAGES
The images of manuscript folios [BSe01] are by far the most important component of the electronic edition,
as they provide the opportunity for researchers to see and study the actual manuscript. The lion’s share of the
editor’s time in preparation of the edition is spent studying the images and creating annotations in the form of
searchable descriptive markup based on close scrutiny of the manuscript images. We must therefore ensure
that the XML markup in our electronic editions is pervasively associated with the images or parts of images
on which the markup is based. To support this association, we introduce some supplemental data into the
edition dataset.
First come folio layouts, which store spatial associations between different images available for the
same folio. These images include full folio images under different lighting conditions as well as
higher-resolution images of certain important folio fragments. An individual layout is created for every
manuscript folio. Next, we establish a more detailed association between the manuscript text and its recorded
features and the folio layouts. This association is achieved by introducing indexing conventions that tie parts
of the XML documents to the folio layouts. These conventions can be implemented either by creating and
maintaining special “linking” XML markup or by storing linking information in index structures (such as
quad-trees and inverted indices). The solution finally adopted for the project will be whichever combination of
the two approaches proves in tests to give the right balance between convenience of use and efficiency of
retrieval.
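As an illustration of the index-structure option, the sketch below links rectangular image regions to markup element identifiers. It uses a uniform grid that acts as an inverted index from grid cells to elements, a deliberately simpler stand-in for a quad-tree. The class name, cell size, element ids, and pixel coordinates are all assumptions made for the example.

```python
from collections import defaultdict

CELL = 100  # grid cell size in pixels (arbitrary illustrative choice)


class RegionIndex:
    """Maps rectangular folio-image regions to markup element ids."""

    def __init__(self):
        self.cells = defaultdict(set)  # (col, row) -> element ids touching that cell
        self.boxes = {}                # element id -> (x0, y0, x1, y1), inclusive

    def add(self, element_id, box):
        x0, y0, x1, y1 = box
        self.boxes[element_id] = box
        for col in range(x0 // CELL, x1 // CELL + 1):
            for row in range(y0 // CELL, y1 // CELL + 1):
                self.cells[(col, row)].add(element_id)

    def at(self, x, y):
        """Return the sorted ids of all elements whose region contains pixel (x, y)."""
        candidates = self.cells.get((x // CELL, y // CELL), set())
        return sorted(
            e for e in candidates
            if self.boxes[e][0] <= x <= self.boxes[e][2]
            and self.boxes[e][1] <= y <= self.boxes[e][3]
        )


index = RegionIndex()
index.add("line-1", (50, 40, 900, 90))      # hypothetical folio line
index.add("word-1.3", (300, 45, 420, 88))   # hypothetical word within that line
index.add("line-2", (50, 100, 900, 150))

hits = index.at(350, 60)
print(hits)
```

A lookup touches only the elements registered in one grid cell, so a click on the folio image can be resolved to the markup it supports without scanning every region.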
HELPING RESEARCH TEAMS WORK TOGETHER
Creation of an electronic edition of a manuscript can be significantly sped up if all members of a research
team, rather than a single editor, can work on parts of it simultaneously. As a collaborative process leads to
situations where different researchers contribute to the editing process at the same time, edition production
software must support concurrent work and be capable of helping the editor resolve data conflicts and prevent
resulting loss of data. Known in database research as concurrency control [EGLT76], this problem
presents a number of interesting challenges for our framework, stemming from the complexity of the edition
data set and the need to provide as much flexibility as possible to the editor and the research team.
While leaving concurrency control to the editor and the research team may well be the easiest
solution to this problem, it is excessively prone to human error and puts an unacceptable burden on the editor.
It is instead the task of the back-end of the EPT workbench to automatically assure data integrity at every step
of the editorial process and provide adequate means for concurrency control. This support is facilitated, in
part, by having multiple DTDs or XML Schemas for editorial markup of the manuscript. Beyond that, we
must design and implement specific concurrency control protocols to work with the electronic edition dataset.
These protocols must assure consistency of all changes made to the XML documents and provide significant
flexibility for the editor and research team to work simultaneously.
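One possible shape for such a protocol is sketched below. It is an optimistic version check applied per markup layer, chosen here only because it is short. It is neither the locking scheme of [EGLT76] nor a protocol the project is known to use, and all class, layer, and tag names are hypothetical.

```python
import threading


class StaleEditError(Exception):
    """Raised when a commit is based on an outdated version of a layer."""


class LayerStore:
    """Keeps each markup layer with a version number (illustrative only)."""

    def __init__(self, layer_names):
        self._lock = threading.Lock()
        self._layers = {name: (0, []) for name in layer_names}  # name -> (version, spans)

    def read(self, name):
        with self._lock:
            version, spans = self._layers[name]
            return version, list(spans)

    def commit(self, name, base_version, spans):
        """Store new spans only if the layer is unchanged since base_version."""
        with self._lock:
            current, _ = self._layers[name]
            if base_version != current:
                raise StaleEditError(f"{name}: edit based on v{base_version}, layer is at v{current}")
            self._layers[name] = (current + 1, list(spans))
            return current + 1


store = LayerStore(["paleography", "damage"])
v_pal, pal = store.read("paleography")
v_dmg, dmg = store.read("damage")

# Editors working on different layers never conflict with each other.
store.commit("paleography", v_pal, pal + [(0, 8, "hand1")])
store.commit("damage", v_dmg, dmg + [(6, 11, "dmg")])

# A second paleography commit based on the old version is rejected, not silently lost.
try:
    store.commit("paleography", v_pal, pal + [(8, 16, "hand2")])
    conflict = False
except StaleEditError:
    conflict = True
print(conflict)
```

Keeping one version per layer reflects the point made above: separate markup layers remove most conflicts, and the protocol only has to arbitrate edits within the same layer.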
REFERENCES
[BM01] P. V. Biron and A. Malhotra, eds. “XML Schema Part 2: Datatypes. W3C Recommendation.” 2 May
2001. <http://www.w3.org/tr/xmlschema-2/>.
[BPS00] T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler, eds. “Extensible Markup Language
(XML) 1.0 (Second Edition). W3C Recommendation.” 6 October 2000.
<http://www.w3.org/tr/rec-xml>.
[BSe01] M. S. Brown and W. B. Seales. “The Digital Atheneum: New Approaches for Preserving, Restoring,
and Analyzing Damaged Manuscripts.” Proceedings of the First ACM/IEEE-CS Joint Conference on
Digital Libraries. New York: ACM Press, 2001. 437–443.
[BSK01] M. S. Brown, W. B. Seales, K. Kiernan, and J. Griffioen. “3D Acquisition and Restoration of
Medieval Manuscripts.” Communications of the ACM: Special Issue on Digital Libraries. May 2001.
[EGLT76] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. “The Notions of Consistency and Predicate
Locks in a Database System.” Communications of the ACM 19:11 (November 1976).
[Hay01] D. Hayes. “Glossing Damaged Manuscripts: an Example from Ælfric's Lives of Saints.” Digital
Resources for the Humanities (DRH01). University of London, London, UK. 10 July 2001.
[SGK00] W. B. Seales, J. Griffioen, K. Kiernan, C. J. Yuan, and L. Cantara. “The Digital Atheneum: New
Technologies for Restoring and Preserving Old Documents.” Computers in Libraries 20:2 (February
2000), 26-30. <http://www.infotoday.com/cilmag/feb00/seales.htm>.
[YS01] C. J. Yuan and W. B. Seales. “Guided Linking: Efficiently Making Image-to-Transcript
Correspondence.” Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries.
New York: ACM Press, 2001. 471.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003
"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003


Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC
