Code, Comments and Consistency, a Case Study of the Problems of Reuse of Encoded Texts

paper
Authorship
  1. Claire Warwick

    School of Library, Archive and Information Studies - University of Sheffield

  2. George Buchanan

    Dept of Computer Science - University of Swansea

  3. Jon Rimmer

    School of Library, Archive and Information Studies - University College London

  4. Ann Blandford

    UCL Interaction Centre - University College London

  5. Jeremy Gow

    UCL Interaction Centre - University College London

Work text

Introduction
It has long been an article of faith in computing that when a resource, a program or code is created, it ought to be documented (Raskin, 2005). It is also an article of faith in humanities computing that markup should be non-platform-specific (e.g. SGML or XML). One important reason for both practices is to make reuse of resources easier, especially when the user may have no knowledge of, or access to, the original resource creator (Morrison et al., n.d., chapter 4).
However, our paper describes the problems that may emerge when such good practice is not followed. Through a case study of our experience on the UCIS project, we demonstrate why documentation, code commenting and the accurate use of SGML and XML markup are vital if there is to be any realistic hope of reusing digital resources.
Background to the Project
The UCIS project (www.uclic.ucl.ac.uk/annb/DLUsability/UCIS) is studying the way that humanities researchers interact with digital library environments. We aim to find out how the contents and interface of such collections affect the way that humanities scholars use them, and what factors inhibit their use (Warwick et al., 2005). An early work-package of the project was to build a digital text collection for humanities users, delivered via the Greenstone digital library system. We chose to use texts from the Oxford Text Archive (OTA), because this substantial collection is freely available and contains at least basic levels of XML markup. However, this task was to prove unexpectedly difficult, for reasons that extend beyond the particular concerns of UCIS.
Findings
On examination of a sample of the files, we found that although they appeared to be in well-formed XML, there were many inconsistencies in the markup.
These inconsistencies often arise from the electronic history of the documents. The markup of older (Early and Middle) English texts is complex, and many of the problems stem from successive revisions to the underlying content. One common early standard was Cocoa markup, and many of the documents still contain Cocoa tags, which means that the files will not parse as XML. In Cocoa, the (human) encoder can provide tags that indicate parts of the original document, their form and clarity. These tags were retained in their original Cocoa form, which the processing software mistook for potential TEI tags. Many characters found in earlier English were encoded using idiosyncratic forms for which modern (Unicode or SGML entity) alternatives now exist. The earlier Cocoa form may render the modern electronic encoding unparsable as either XML or SGML.
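A minimal sketch of how such a failure surfaces, using an invented Cocoa-style fragment rather than any actual OTA file: a strict XML parser rejects the document as soon as it meets a Cocoa tag.

    import xml.etree.ElementTree as ET

    # Hypothetical fragment: Cocoa-style tags embedded in otherwise
    # XML-like text. '<A JONES>' is a Cocoa author tag, not an XML
    # element, and '<10 lines>' is not even a legal XML name.
    fragment = "<text><A JONES>Som tyme ther was a kyng<10 lines></text>"

    try:
        ET.fromstring(fragment)
    except ET.ParseError as err:
        # The parse aborts at the first Cocoa tag it meets.
        print("not well-formed XML:", err)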
Another problem with Cocoa markup is that it was never fully standardised, and tags are often created or used idiosyncratically (Lancashire, 1996). This complicates a number of potential technical solutions (e.g. the use of XML namespaces). Some content included unique tags such as “<Cynniges>”, which are not part of any acknowledged hybrid of the original standard. The nature of such a tag is unclear: it may be an original part of the text (words actually surrounded by ‘<’ and ‘>’), a Cocoa tag, or a TEI/SGML/XML tag. Distinguishing the forms known to a modern TEI/XML document is straightforward; distinguishing between Cocoa and SGML/XML is not possible in this context.
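To illustrate why this distinction resists automation, the sketch below (our own illustration, with invented tags) applies a purely formal heuristic. It can reject tokens that could never be well-formed XML, but a token such as ‘<Cynniges>’ is syntactically valid as an XML element, plausible as a Cocoa tag, and possible as literal text, so no formal rule can recover the encoder's intent.

    import re

    def classify_tag(tag: str) -> str:
        """Naive formal guess at what an angle-bracketed token is."""
        body = tag.strip("<>")
        if re.fullmatch(r"[A-Za-z_][\w.-]*", body):
            # A bare name: legal XML, plausible Cocoa, or even
            # literal text that happened to sit in brackets.
            return "ambiguous"
        if re.fullmatch(r"[A-Za-z]+ [^=]+", body):
            # 'NAME value' shape: typical of Cocoa, illegal in XML
            # because the value is not an attribute="..." pair.
            return "probably Cocoa"
        return "not well-formed XML"

    for tag in ["<Cynniges>", "<A JONES>", "<10 lines>"]:
        print(tag, "->", classify_tag(tag))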
Even parts of the same document used the same tag inconsistently. For example, distances (e.g. “ten lines of space”) may be rendered in numeric form (‘10 lines’) or textual form (‘ten lines’), and distance units may be given in full or abbreviated.
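Reconciling such variants mechanically requires a normalisation table of exactly the kind the files never documented. A sketch of the sort of mapping one is forced to maintain, with invented values:

    # Hypothetical normaliser for inconsistent distance notation.
    WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
             "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
    UNITS = {"line": "lines", "lines": "lines", "l": "lines"}

    def normalise(value: str, unit: str) -> str:
        n = WORDS.get(value.lower(), value)   # 'ten' -> 10; '10' stays
        return f"{n} {UNITS.get(unit.lower(), unit)}"

    print(normalise("ten", "lines"))   # -> '10 lines'
    print(normalise("10", "l"))        # -> '10 lines'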
One common character notation was ‘&&’ to represent the thorn character in upper case and ‘&’ to represent the same character in lower case. This was interpreted as an SGML/XML entity, but parsers were unable to interpret the original scheme successfully. Furthermore, as the SGML/XML entity format was used in other parts of the document, even a bespoke parser could not successfully disambiguate the intention of every occurrence of the ‘&’ character. Thus, content is effectively lost. Other characters remain in forms such as ‘%’ for ‘&’ or ‘and’, because of the original special use of ‘&’. Such characters thus remain unintelligible to an SGML or XML document reader. Given these complications, it is often impossible for a computer to determine the proper form of the document without human intervention, making automatic processing and indexing impossible.
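The ambiguity is easy to reproduce. In the sketch below (our reconstruction of the notation, not code from the archive), whichever order the substitutions are applied in, some reading of ‘&’ is lost:

    # Hypothetical line mixing legacy thorn notation ('&&' = capital
    # thorn, '&' = small thorn) with a genuine XML entity ('&amp;').
    line = "&&e king &at was, &amp; his men"

    # Naive decoding treats every '&' as thorn: the real entity
    # '&amp;' is mangled into 'þamp;' and that content is lost.
    print(line.replace("&&", "\u00DE").replace("&", "\u00FE"))

    # Decoding entities first instead turns '&amp;' into a bare '&',
    # which is then indistinguishable from a small thorn. Either
    # order of operations destroys a different part of the text.
    decoded = line.replace("&amp;", "&")
    print(decoded.replace("&&", "\u00DE").replace("&", "\u00FE"))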
As Giordano (1995) argues, ‘No text encoded for electronic interpretation is identifiable or usable unless it is accompanied by documentation’. Yet in none of these cases did we find that markup decisions had been documented, nor was the code commented. The OTA supplied each file with a TEI header, which provides some basic metadata about its creation. However, the header was intended to act as the kind of metadata that aids resource discovery, rather as code books were once used to find a specific social science dataset on a magnetic tape. The <encodingdesc> element is not mandatory, and was intended to explicate transcription practices rather than detailed markup decisions (Giordano, 1995). We certainly did not find any examples of attempts to elucidate markup schemes in the headers. Documentation was also not available for any of the files we looked at. Though the OTA strongly encourages depositors to document their work, it does not mention markup specifications as an element of basic documentation, so even documented files might not have provided the information we needed (Popham, 1998). We were therefore forced to attempt to reconstruct the markup decisions from visual examination of each file.
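What was missing is easily stated. The fragment below is a hypothetical illustration, not taken from any OTA file, of how a few lines in a header's encoding description could have recorded the conventions that defeated our parsers:

    <teiHeader>
      <fileDesc><!-- bibliographic description of the file --></fileDesc>
      <encodingDesc>
        <editorialDecl>
          <p>Capital thorn is encoded as a doubled ampersand and
             lower-case thorn as a single ampersand; a literal
             ampersand is encoded as the percent sign. Cocoa tags
             of the form 'letter plus value' are retained from an
             earlier transcription and are not TEI elements.</p>
        </editorialDecl>
      </encodingDesc>
    </teiHeader>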
Despite the help of the OTA with cleaning up the data, the task proved so large that we had to abandon the use of these files. We have therefore used commercially produced resources, with the permission of Chadwyck-Healey Limited. The advantage of using their material for our project was that the markup is consistent, has been documented and conforms to written specifications.
Conclusions
It is to be hoped that simply by drawing attention to some of the problems that may occur in reuse, our work will cause resource creators to take seriously the importance of documentation and consistency. We have reported a case study of one UK-based repository, but since the OTA is one of the most reputable sources of good quality electronic text in the world, our findings should be of interest to the creators and users of other electronic texts well beyond this particular example. Not all electronic texts are of such high quality, nor are they always collected by an archive, and so such considerations become even more important when texts are made available by single institutions such as libraries, university departments or even individual scholars.
One of the objectives of the Arts and Humanities Data Service (the organisation of which the OTA is a part) since its foundation has been to encourage the reuse of digital resources in humanities scholarship. Yet our experience has shown that a lack of consistency and documentation can make such reuse almost impossible. The advantage of markup schemes such as XML should be that data is easily portable and reusable irrespective of the platform within which it is used. Yet the idiosyncratic uses of markup that we found have almost negated this advantage.
The creators of the resources probably thought only of their own needs as researchers and were happy with markup that made sense to them. It is still common for projects that use TEI to create their own extensions, without necessarily documenting them. Unlike computer scientists, whose collaborative research practices make them aware of the importance of adhering to standards and conventions that make their code comprehensible, humanities scholars are rewarded for originality and tend to work alone. Research paradigms do not oblige scholars to think about how their work might be reused, their data tested, or their resource used to further research collaboration. One recommendation that follows from our work is therefore that humanities scholars should at least take advice from, and ideally collaborate with, computer scientists or other technical specialists who are accustomed to working within such standards and conventions.
This might not matter if the creators of a resource were its only users, but given the intellectual and monetary cost of resource creation, authors ought at least to be aware of the possible implications of applying idiosyncratic markup without comments or documentation. This paper provides evidence of just such consequences.
References
Giordano, R. (1995) ‘The TEI Header and the Documentation of Electronic Texts.’ Computers and the Humanities, 29 (1): 75-84.
Lancashire, I. (1996) Bilingual Dictionaries in an English Renaissance Knowledge Base, Section 3. Computers in the Humanities Working Papers. University of Toronto. Available at http://www.chass.utoronto.ca/epc/chwp/lancash1/lan1_3.htm
Morrison, A., Popham, M. and Wikander, K. (n.d.) Creating and Documenting Electronic Texts: Guide to Good Practice. AHDS Publications. Available at http://ota.ahds.ac.uk/documents/creating/
Popham, M. (1998) Oxford Text Archive Collections Policy - Version 1.1. AHDS Publications. Available at http://ota.ahds.ac.uk/publications/ID_AHDS-Publications-Collections-Policy.html
Raskin, J. (2005) ‘Comments are more important than code.’ Queue, 3 (2): 64-66.
Warwick, C., Blandford, A., Buchanan, G. and Rimmer, J. (2005) ‘User Centred Interactive Search in the Humanities.’ In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries. New York: ACM Press, p. 400.


Conference Info


ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tübingen, at the ALLC/ACH conference in 2002; a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC
