Legacy Data Migration: A pilot study on the methodological feasibility of conversion and enhancement of electronic resources.

  1. James Cummings

    Oxford Text Archive - Oxford University

  2. Monica Langerth Zetterman

    Digital Literature - Uppsala University

In this paper we describe a pilot study based on two subsets of electronic resources to be used in the Virtual Corpus system developed at the Oxford Text Archive (OTA). The Virtual Corpus system (VC) is designed to make the OTA more useful for researchers by enabling the selection of texts for a corpus on the basis of metadata categories in the TEI header resource description. Currently these categories include fields such as language, date, genre and author (Berglund & Wynne, 2003). To make the VC even more useful, the texts would benefit from data enhancement. The aim of this pilot study is to evaluate the procedures needed to enhance the metadata in the TEI header with further categories, and to explore the possibilities for migrating legacy data in a wide range of formats into a TEI-conformant XML format.


Founded in 1976 by Lou Burnard, the OTA has over twenty years' experience of serving the research and teaching needs of electronic text users within the scholarly community. We have witnessed and encouraged the widespread acceptance of digital resources within academia. Formerly the preserve of small research-oriented groups of specialists, electronic text is now the common currency of academics. More recently the OTA has become the hosting subject centre for the UK Arts and Humanities Data Service: Literature, Languages, Linguistics.

The OTA currently holds several thousand electronic texts and linguistic corpora, in a variety of languages. Its holdings include electronic editions of works by individual authors, standard reference works such as the Bible and mono-/bilingual dictionaries, and a range of language corpora. The OTA does not produce digital resources, and instead relies upon deposits from the wider community as the primary source of high-quality materials.

Although the resources deposited with the OTA may be of very good quality, many were originally encoded in highly individualistic markup schemes.

In this case study we evaluate the proposed process for converting legacy data and the viability of enhancement. Many issues are at hand: for example, how do we categorise formats? How much time and effort is involved in evaluating the formats and any markers or markup used in the texts? Two subsets of the OTA's holdings were chosen to give a smaller, fixed amount of material to evaluate. One subset contains biographies and shorter works, while the other contains sixteenth-century English drama. The subsets were determined by examining the Library of Congress subject headings, which the OTA adds to the TEI header of each of its resources.


When evaluating legacy data one has to consider a number of issues related to the viability and usability of the texts. In addition, any endeavour of metadata enhancement and format conversion raises further issues, such as estimating the amount of work involved and finding appropriate methods for converting a number of formats and unconventional markup schemes into XML.

Our aim is to evaluate whether individual texts should be considered for enhancement and migration to XML. Part of this decision relies on accurate information concerning their format, the quality of the markup (if any), the uniqueness or significance of the text, and the features that have been encoded.[1] However, before considering any format conversion we also need to check whether a better version of the text is freely available elsewhere.[2] Whether a currently available text can be considered 'better' than the version originally deposited with the OTA is in itself problematic. The availability of a richly encoded XML text does not mean that the textual edition itself is of higher quality. An increase in functionality should not be coupled with a downgrading of textual integrity or academic merit.

The pilot study

For the selected subsets of texts our evaluation and enhancement process passes through a number of checkpoints: we check the retrieval restrictions, the text format, and whether a text exists elsewhere that could be considered a better version. If a text is not freely available elsewhere in a better form, the viability of enhancing the existing resource is evaluated. If it is not feasible to enhance the text, it is flagged as having limited usability. When the text is in XML/SGML format and evaluated as suitable for enhancement, the new metadata is added and the text is flagged as "completed" in the VC.

However, if the text is in an unknown or un-encoded format, it has to be carefully analysed in terms of the quality of any embedded markers or markup. Thus, the evaluation will also consider the workload and time spent on analysing the texts. The document analysis focuses on whether the embedded markup or markers are sufficiently consistent and distinct for conversion into XML format.
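The checkpoints described above can be sketched as a small decision procedure. This is an illustration under our own assumptions, not a description of the VC implementation: the names TextRecord, Outcome and triage are hypothetical, the outcome for a text with a better version elsewhere is assumed to be that it is set aside, and retrieval restrictions are left out because their effect on the outcome is not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels; only "completed" and "limited usability"
    # correspond to flags named in the text.
    SET_ASIDE = "better version freely available elsewhere"
    LIMITED_USABILITY = "flagged as having limited usability"
    COMPLETED = "metadata added, flagged as completed"
    NEEDS_ANALYSIS = "markup must be analysed before conversion"


@dataclass
class TextRecord:
    # Hypothetical field names for the checkpoints in the evaluation.
    text_format: str                 # e.g. "xml", "sgml", "cocoa", "plain"
    better_version_elsewhere: bool   # result of the external check
    enhancement_feasible: bool       # result of the viability evaluation


def triage(record: TextRecord) -> Outcome:
    """Apply the checkpoints in the order in which they are described."""
    if record.better_version_elsewhere:
        return Outcome.SET_ASIDE
    if not record.enhancement_feasible:
        return Outcome.LIMITED_USABILITY
    if record.text_format.lower() in {"xml", "sgml"}:
        return Outcome.COMPLETED
    # Unknown or un-encoded formats go on to document analysis.
    return Outcome.NEEDS_ANALYSIS
```

For instance, triage(TextRecord("cocoa", False, True)) yields Outcome.NEEDS_ANALYSIS, which corresponds to the document analysis step for texts in unknown or un-encoded formats.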

One of the tasks in the analysis is to identify the format of the text, or to check whether the format is the same as that stated in the TEI header, and to examine any variance from that format.[3]

Electronic resources in the OTA are prepared for computer analysis and/or retrieval in a number of ways. The level of encoding varies and while some texts are in well-formed XML or SGML formats, many are in plain text/ASCII with little or no markup embedded in the text.

Some plain text resources are not encoded at all, some of the older texts are entirely in upper case, and others are sparsely marked up through either conventional or individual markup schemes. Different schemes may use the same markup to denote significantly different aspects of the texts, for example italic words, stage directions or numbered lines.

A significant number of the earlier electronic texts are prepared in COCOA (named after an early general-purpose concordance program from the 1960s). The Oxford Concordance Program (OCP) used COCOA as its referencing scheme, and an expanded version of COCOA was also used by the text retrieval program TACT (Oxford University Computing Service, 1988; Bradley, 1996).

The COCOA method uses angle brackets to enclose references in texts. In addition, extra characters are sometimes used to mark certain features, such as italic words, proper names, diacritics, editorial marks, grammatical categories or foreign words (Oxford University Computing Service, 1988, pp. 11-21). However, the additional characters used for marking these aspects of the text frequently vary between different resource creators.
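As a minimal sketch of what an automated first pass over such references might look like, the following assumes references of the form <T Hamlet>, that is, a category code followed by a value inside angle brackets. The function name cocoa_to_milestones is hypothetical, and mapping every category code directly onto the unit attribute of an empty milestone element is our simplification; a real conversion would need a per-resource mapping from categories to suitable TEI elements.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed shape of a COCOA-style reference: "<", a category code,
# whitespace, a value, ">".
COCOA_REF = re.compile(r"<([A-Za-z]+)\s+([^<>]*)>")


def cocoa_to_milestones(line: str) -> str:
    """Rewrite the COCOA references in one line as empty XML elements."""
    parts = []
    pos = 0
    for match in COCOA_REF.finditer(line):
        # Escape the running text before the reference so that stray
        # "&" or "<" characters cannot break well-formedness.
        parts.append(escape(line[pos:match.start()]))
        category = match.group(1)
        value = match.group(2).strip()
        parts.append(
            f"<milestone unit={quoteattr(category)} n={quoteattr(value)}/>"
        )
        pos = match.end()
    parts.append(escape(line[pos:]))
    return "".join(parts)
```

The sketch does not handle the extra marker characters for italics, proper names and similar features. Because those vary between resource creators, they would need a separate table of conventions for each resource.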

Many of these texts can be converted through a variety of electronic means, but because of their lack of consistency they would still need to be individually proofread. In the case of the numerous Shakespeare resources held by the OTA, it is unlikely that the majority of them will be deemed of significant academic merit over the many versions freely available elsewhere. Thus, little effort should be expended on enhancing such texts, unless the OTA resource is a uniquely interesting version without any distribution restrictions.

While the texts are being examined so closely, new metadata is also being added. The categories added are the gender of the author of the original text, the author's country of birth, and the original publication or printing date of the text on which the electronic edition is based.
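To illustrate how the three new categories might be recorded, the sketch below adds them to a header using Python's standard xml.etree.ElementTree module. The placement under profileDesc/textClass/keywords, the scheme value and the term type names are all assumptions made for illustration and do not describe the OTA's actual encoding. The sketch also assumes a header without XML namespaces.

```python
import xml.etree.ElementTree as ET


def add_metadata(header, author_gender, birth_country, source_date):
    """Append the three new categories to a TEI header element.

    The element placement and the type names are hypothetical.
    """
    profile = header.find("profileDesc")
    if profile is None:
        profile = ET.SubElement(header, "profileDesc")
    text_class = profile.find("textClass")
    if text_class is None:
        text_class = ET.SubElement(profile, "textClass")
    # Hypothetical scheme identifier for the added categories.
    keywords = ET.SubElement(
        text_class, "keywords", {"scheme": "hypothetical-vc-categories"}
    )
    categories = (
        ("authorGender", author_gender),
        ("authorBirthCountry", birth_country),
        ("sourceDate", source_date),
    )
    for term_type, value in categories:
        term = ET.SubElement(keywords, "term", {"type": term_type})
        term.text = value
    return header
```

Keeping the additions in one clearly labelled group would make it straightforward to apply them consistently across resources and to revise them later.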


It is intended that this evaluation will lead to the conversion and enhancement of a significant proportion of the OTA's holdings. The benefits of conversion to XML include more sophisticated virtual corpus manipulation as well as more detailed search and retrieval options. Increasing the amount of metadata applied in a consistent and coherent manner within the TEI header will also give users more ways to work with these texts.

The paper will conclude with an examination of the possibilities and pitfalls for the future.


1. Berglund, Y. & Wynne, M. (2003). Virtual Corpora from the Oxford Text Archive. Presented at the 24th International ICAME Conference (April 23-27, 2003, Guernsey, UK). http://www.rdues.liv.ac.uk/icame2003/
2. Bradley, J. (1996). TACT Design. Computing in the Humanities Working Papers (May 1996). http://www.chass.utoronto.ca/epc/chwp/bradley/index.html
3. Morrison, A., Popham, M. & Wikander, K. (2000). Guide to Good Practice 1: Creating and Documenting Electronic Texts. Oxford University: Oxford Text Archive.
4. Oxford University Computing Service (1988). Oxford Concordance Program. Users' manual. Version 2. Susan Hockey & Jeremy Martin. Oxford University.

1. Cf. the OTA Guide to Good Practice, where several different aspects of markup and reusability are elaborated (Morrison, Popham & Wikander, 2000).
2. See the Oxford Text Archive Collections Policy (Version 1.1) on evaluating viability for management, preservation and distribution. Available at: http://ota.ahds.ac.uk/publications/ID_AHDS-Publications-Collections-Policy.html
3. See the TEI guidelines at http://www.tei-c.org/P4X/HD.html and the Oxford Text Archive Collections Policy at http://ota.ahds.ac.uk.

