Integrating TEI and EAD to Create Usable and Re-usable Archival Resources

paper
Authorship
  1. Susan Hockey

    University College London

Work text
The LEADERS (Linking EAD to Electronically Retrievable Sources) Project follows the terminology of the archives and records community in using the term ‘archives’ to indicate a sub-set of records which are preserved because they have long-term value. Records are created by organisations or individuals in the course of professional or personal activities and they form evidence of the activities which gave rise to them. Archival records provide a means of understanding human history; they are a basis for corporate memory and a source of community and personal identity, and they offer pathways to learning for people of all ages and backgrounds (National Council on Archives, 2002). The work of the archivist is concerned with providing the means through which individuals can access archives. Users of archives need to be provided with access tools that will describe the contents of archive collections. Such access tools are often called ‘finding aids’ and they are produced through a process of ‘capturing, collating, analysing, and organising any information that serves to identify, locate and interpret the holdings of an archival institution and explain the contexts and record systems from which the holdings were selected’ (definition from the Society of American Archivists).
Encoded Archival Description (EAD) has brought many benefits to custodians and users of archives. It has provided a means for archivists to structure finding aids using technology that is independent of proprietary hardware and software platforms and it has enabled Internet delivery of those structured finding aids. However, as a stand-alone tool, it cannot give users access to the actual content of archival material. There is therefore real potential for a resource that integrates EAD with other tools that do allow for remote use of the contents of archival collections.
The Text Encoding Initiative (TEI) enables electronic texts of all kinds to be searched and presented to users in a variety of different ways. When EAD and TEI are brought together alongside digitised images of archival material the potential benefits for users are vast. Within a single environment the user can find items in archival collections; learn about their contexts; view representations of the items themselves; and read, study, analyze and manipulate their content. At present the two encoding systems, EAD and TEI, are often used independently of each other and no generalised environment exists for linking them. Where links do occur in online archival finding aids, they normally point to digitised images of transcriptions without provision for any analysis or manipulation of the source. The LEADERS Project is developing a generic computer-based toolset that can integrate EAD encoded finding aids with TEI encoded transcripts and digitised images of archival material. This toolset will enable users to link directly from finding aids to digital documents and to manipulate those documents in various ways. This session will concentrate on three aspects of the LEADERS Project’s research. The first paper will discuss how the needs of the user community have been taken into account by LEADERS. This discussion will be set in the context of user needs evaluations conducted at the Public Record Office (United Kingdom) with a focus on how the LEADERS work on users is relevant to the PRO and to the archival profession as a whole. The second paper will highlight the presence of overlaps between EAD and TEI. It will define the problems that these overlaps can cause in a system that seeks to bring the two frameworks together, and will explore the solutions that are being employed by the LEADERS team. 
The final paper will explore the technical choices that have been made in building a demonstrator application from the LEADERS toolkit; in particular, it will focus on how the TVS (Transport Validation and Services) Model has been used as the basis for the system’s architecture.
FOCUSING ON THE NEEDS OF THE END USER COMMUNITY
This paper shows how the theoretical and practical work of the LEADERS project is supporting service developments at the Public Record Office (PRO), the National Archives of England, Wales and the United Kingdom, and, with an annual 300,000 on-site visits and 80 million page downloads from its website (http://www.pro.gov.uk), the most heavily used archival resource in the world. Since 1994 the Public Record Office has re-invented itself as a customer focused organization, and since 1998 has held the prestigious Chartermark (recognition from the UK Government indicative of high quality services). Service enhancements introduced in response to the stated wishes of customers include extended opening hours, the provision of more expert services, faster retrieval times for documents, opening the research library to readers, self service scanning, reader orientation tours and a cybercafe. New online services which have evolved since 1995 include the flagship PROCat, a sophisticated EAD-based archive catalogue containing 9.3 million items; PRO Online, which delivers images of popular records; the 1901 Census Online website; searchable research guides; and online educational resources for adults and school students.
PRO users are consulted regularly through on-site and on-line surveys and through a number of focus groups and panels. Feedback from comments forms, complaints and new user surveys is regularly tracked and fed back into any immediate remedial action and into programmes of service development.
The system is well-established and effective for onsite surveys; because they were in the main cutting edge and designed to grow and cater for new markets, the PRO’s e-services were to a degree speculative in their design (although the award-winning Learning Curve for schools relied on feedback from teachers and school pupils from its inception). The PRO is now beginning the process of further consultation and will re-design many parts of its site to meet the information-seeking needs of its online users more closely. Here the findings of the LEADERS user survey, which it is currently running, will be invaluable. Major consultations on what users regard as priorities for service developments for the future have since 1998 produced consistent results: topping the wish list is access via high quality catalogues to the images of the documents and with them, searchable transcripts. UK government guidelines and the rules of funding bodies such as the New Opportunities Fund have in addition since 2001 mandated full transcriptions (and where appropriate translations) of documents to appear alongside the images in online learning resources. In large-scale imaging projects this approach would however add exponentially to the costs of creating the resource. Hence the PRO is not at present providing full transcripts in its PRO Online website, and nor is this currently linked directly to PROCat. The LEADERS work linking EAD and TEI will consequently be very valuable, since it offers a toolkit to link directly and enhance the functionality of these two online resources in line with stated customer priorities. The PRO has generally in the past followed a simple segmentation model categorizing users as personal interest and family history researchers; professional researchers; the educational sector; and others. Since 1999 it has worked with the Public Services Quality Group for archives and local studies in carrying out an annual survey of archive users across the UK. 
Here are the results for 1999 in terms of research segments; these results are typical.
Table 1: Research Interests of UK Archive Users in 1999
Since 2001 the PRO has evolved this segmentation model as the basis for tailored service development, the user segments being academics, family historians and personal interest researchers, people in formal education and leisure historians; many of this last category are non-users (PRO Keeper’s Report, 2002). The LEADERS work on the segmentation of the user market and evaluation of user needs will however add a new layer of sophistication which will help the PRO to fine tune its service developments more closely. The current LEADERS questionnaire segments researchers under education/training, personal leisure, professional/occupational, personal obligation and others. The professional and personal obligation data, correlated with research topics, will enable the PRO and other archives to tailor research leaflets specifically to the very precise needs of people in some microsegments, where appropriate. In addition LEADERS examines research interests under names, places and topics, and looks also at levels of familiarity with the research interest. Even the initial findings, which support the relative popularity of topic-based research, suggest that the archival community may need to rethink its approach to the front ends of knowledge management systems and resources.
How does the LEADERS research fit within the broader framework of academic work on information-seeking behaviour? A number of academic studies on how users of information access resources online have been undertaken on both sides of the Atlantic. A key issue in this debate is the changing expectations of archive and information users in a digital and web-enabled age (Cox, 1998).
Here there is much work available on information seeking behaviour by users largely of digital libraries, but encompassing obstacles to access to both library and archive collections (Seaman, 1997; Peterson Bishop, 1998; Mates, 2000). But although Margaret Hedstrom has published an important clarion call to make archival resources on-line more accessible to users (Hedstrom, 1998), this is unusual: most such work in the archive-specific domain is detached from the real world of the coal face and focuses on narrow issues. While some useful findings emerge from it, it must be said that at times these are obvious from the point of view of the practitioner. For example, one recent study finds that many online users display a ‘disappointing reluctance’ to use sophisticated search mechanisms (Large et al., 1999); a second, that many archive users are searching for a specific item (Bearman, 1989/1990); a third, that the vast majority of users of photographic collections want to be able to search and retrieve images by subject (Collins, 1998); and a fourth, that searchers in online catalogues want indexes, glossaries and help functions to help them navigate through hierarchically structured material (Duff & Stoyanova, 1998). Others establish that using search engines to try to locate archival finding aids usually produces unmanageably large sets of results (Feeney, 1999; Tibbo & Meho, 2001). All this supports the conclusion that many on-line archival finding aids do not take their users’ needs sufficiently into account (Rosenbusch, 2000).
This is where the LEADERS work in extrapolating from the results of research on user needs to the building of a practical application is of such significance, since it feeds back the fruits of academically rigorous methodology into the development of a demonstrator system and toolkit which will have a practical use and value to the archival community in the UK.
Once this demonstrator is built it will be further trialled in a real environment and modifications made to take account of changes both in technology and the ever-expanding expectations of users. The archive profession in the UK will also be able to use the results of research based on the customer segmentation model to help it tailor services more precisely to the needs and requirements of each subset. For Access to Archives (A2A), for example, an EAD-based catalogue of archival series drawn from numerous repositories across England and currently containing 4 million records from 199 record repositories, the findings will enable existing work with the user panel to enhance searchability focused on the needs of each user segment to be further developed and refined (http://www.A2A.pro.gov.uk). And the toolkit integrating TEI and EAD will support the enhancement of A2A by adding digitized searchable texts and images to some of the series. Similarly work to integrate and enhance the website of the new National Archives of the UK, bringing together the Public Record Office and Historical Manuscripts Commission from April 2003, will focus on user needs and search patterns and will be able to use the results of the LEADERS research alongside the views of user panels and online surveys.
INTEGRATING EAD AND TEI: THE RESOLUTION OF METADATA OVERLAPS
The two encoding systems, EAD and TEI, were designed to serve different purposes, yet there is a degree of overlap between them. In the context of archival material we can begin to explore the overlap by considering the TEI and EAD in relation to the original archival document, as both encoding systems seek to represent the original in different but convergent ways.
Original archive records can be described as ‘objects of study’. They provide evidence of and information about the functions and activities that gave rise to them and are therefore used by individuals to aid learning and research of varying kinds. When archival material is classified in this way, then the purpose of EAD is to describe those objects of study. It is a metadata standard designed to provide a structure for holding data about the original material.
TEI on the other hand is primarily a content encoding framework for the creation of new ‘objects of study’. The objects of study created by TEI are usually derived from one or more other (original) objects of study. When TEI is used to encode archival material, the archive document is the original object from which the new object is derived. This means that in order for the TEI object to be understood the encoder must not only transcribe and encode the contents of the original, but must also provide metadata that can put the text in context. This metadata, provided in the <teiHeader>, must describe the newly created object of study (e.g. provide a description of the electronic file and the encoding process) but must also describe the original object. Furthermore, the actual encoding that surrounds the data within the TEI object can also be viewed as a kind of metadata because the aim of the encoding is to delimit parts of the text and describe what those delimited parts represent. Therefore, although the TEI is primarily concerned with content encoding, it also must include metadata to put the object into the context of why and how it has been created, what it has been derived from, and what the data within the objects represents.
Therefore, the overlap between the two frameworks occurs in relation to metadata that:
  • Identifies, locates and gives details about the creation of the original object
  • Describes the physical characteristics of the original object
  • Provides contextual information about the creator and the participants within the original object
  • Interprets/describes the data in the object
The general solution to areas of overlap employed by LEADERS is the development of a Schema which uses namespaces to reference the relevant parts of the respective EAD and TEI DTDs. It makes sense for us to work on the general principle that metadata about the creation of the electronic transcript is best described using <teiHeader> elements, whereas information relating to the original object (which is the source of the TEI transcript) is best described using EAD elements.
However, one particularly contentious area for consideration in the integration process is where metadata for interpreting and categorising the data within the object is held. When the basis for a TEI encoded document is a ‘literal’ transcription of an original source then metadata that seeks to categorise and interpret the data within the text is equally applicable to the original document and the new electronic transcript. There is no clear dividing line between the two, and so the general principle being advocated by LEADERS becomes difficult to apply. Furthermore, both TEI and EAD offer the encoder a range of options over where and how such categorisations and interpretations are made. This flexibility means that as well as comparing EAD and TEI against each other, they must also be examined individually to assess the variety of encoding options contained within them. Conducting this assessment is vital because the decision over where to place this metadata will ultimately have an impact on search and retrieval possibilities across the EAD finding aid and the TEI transcript.
Within a TEI encoded transcript it is possible to mark up the content of a text within a range of different elements. The degree of content encoding and the range of elements used to do so vary according to the purpose behind the encoding process.
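The namespace-based principle stated earlier in this section (EAD elements for the original object, <teiHeader> elements for the electronic transcript) can be illustrated with a minimal Python sketch. It is not the LEADERS schema: the three namespace URIs, the wrapper element `item`, and the helper names `q`, `build_item` and `elements_in` are all assumptions made purely for illustration, and the resulting fragment is not validated against either DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this illustration only.
EAD_NS = "http://example.org/ns/ead"
TEI_NS = "http://example.org/ns/tei"
WRAP_NS = "http://example.org/ns/leaders-sketch"

ET.register_namespace("ead", EAD_NS)
ET.register_namespace("tei", TEI_NS)
ET.register_namespace("lx", WRAP_NS)


def q(ns, tag):
    """Return a namespace-qualified tag in ElementTree's {uri}tag form."""
    return f"{{{ns}}}{tag}"


def build_item(title, date, encoder, transcript):
    """Combine an EAD description of the original with a TEI transcript."""
    item = ET.Element(q(WRAP_NS, "item"))  # hypothetical wrapper element

    # Metadata about the ORIGINAL object is held in EAD elements.
    did = ET.SubElement(item, q(EAD_NS, "did"))
    ET.SubElement(did, q(EAD_NS, "unittitle")).text = title
    ET.SubElement(did, q(EAD_NS, "unitdate")).text = date

    # Metadata about the ELECTRONIC TRANSCRIPT is held in the TEI header.
    header = ET.SubElement(item, q(TEI_NS, "teiHeader"))
    file_desc = ET.SubElement(header, q(TEI_NS, "fileDesc"))
    title_stmt = ET.SubElement(file_desc, q(TEI_NS, "titleStmt"))
    ET.SubElement(title_stmt, q(TEI_NS, "title")).text = f"Transcript of {title}"
    resp = ET.SubElement(title_stmt, q(TEI_NS, "respStmt"))
    ET.SubElement(resp, q(TEI_NS, "resp")).text = "Transcribed and encoded by"
    ET.SubElement(resp, q(TEI_NS, "name")).text = encoder

    # The transcribed content itself.
    text = ET.SubElement(item, q(TEI_NS, "text"))
    body = ET.SubElement(text, q(TEI_NS, "body"))
    ET.SubElement(body, q(TEI_NS, "p")).text = transcript
    return item


def elements_in(root, ns):
    """List local names of all elements belonging to one namespace."""
    prefix = f"{{{ns}}}"
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]


if __name__ == "__main__":
    item = build_item("Letter to the Provost", "1827", "A. N. Encoder", "Dear Sir,")
    print(ET.tostring(item, encoding="unicode"))
    print("EAD part:", elements_in(item, EAD_NS))
    print("TEI part:", elements_in(item, TEI_NS))
```

Because every element carries its namespace, either part can in principle be pulled out and reused on its own, which is one reason a namespace-aware schema is attractive for this kind of integration.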
In the development of our demonstrator application we have taken a preliminary decision to explicitly encode all instances of names, dates, places, and publications as they appear in the documents we are using as test bed material. This explicit markup allows the content within the elements to be used as index terms for the transcribed documents, and it opens up possibilities for various types of user analysis to be conducted across the content marked up within the elements.
As well as being able to categorise and interpret data by enclosing it in specific elements within the flow of the text, TEI also allows the encoder to categorise and classify the content using elements provided by the <teiHeader>. Index terms can be created (preferably using a recognised classification scheme/taxonomy) and placed in the <keywords>, <classCode> or <catRef> elements (within <textClass> in <profileDesc>). Furthermore, if the encoder has included the additional tagset for language corpora in the DTD then other classification elements such as <channel> and <domain> also become available.
There is a degree of overlap between content encoding that can occur within the text and categorisations that can be placed within the <teiHeader>. This overlap leads to a consideration of whether it is acceptable or useful for the same information to be tagged up both in the header and the flow of the text. One argument is that the dual functionality offered by content encoding within the text makes it a more obvious choice for containing categorisations and interpretations of the data. If, for example, every instance of a name is marked up within the TEI <name> element, then the name element can act as an index term and can also be used in user analysis. If names are tagged up in this way, what advantage is there in having the name repeated again as an index term within the header?
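As a rough illustration of in-text <name> markup doing duty as index terms, the following Python sketch builds a name index over two invented transcripts. The sample data, the normalisation rule and the function names are assumptions for this example and do not describe the LEADERS demonstrator. Each name maps to a set of document identifiers, so a document is counted once however often the name occurs within it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two tiny, invented transcripts; real TEI files would be far richer.
TRANSCRIPTS = {
    "doc1": (
        "<text><body><p><name>Mary Ward</name> wrote to "
        "<name>John Hale</name>. <name>Mary Ward</name> signed it.</p></body></text>"
    ),
    "doc2": "<text><body><p>Baptism of <name>John Hale</name>.</p></body></text>",
}


def normalise(raw):
    """Collapse whitespace and ignore case so variant spellings of spacing match."""
    return " ".join(raw.split()).casefold()


def build_name_index(transcripts):
    """Map each normalised name to the SET of documents that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in transcripts.items():
        root = ET.fromstring(xml_text)
        for el in root.iter("name"):
            key = normalise("".join(el.itertext()))
            if key:
                index[key].add(doc_id)  # a set keeps one entry per document
    return index


def search(index, query):
    """Return matching document ids, each listed once, in sorted order."""
    return sorted(index.get(normalise(query), set()))


if __name__ == "__main__":
    idx = build_name_index(TRANSCRIPTS)
    print(search(idx, "Mary Ward"))
    print(search(idx, "john hale"))
```

Keeping the element positions as well (rather than only document identifiers) would support the kind of user analysis mentioned above, at the cost of a larger index.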
The case against repeating names in the header is particularly strong for documents like a parish register, where so many names occur within the flow of the text that transferring such information into the header would be an arduous task. However, using the content encoding for indexing is not altogether straightforward. Encoding index terms within the flow of the text becomes more complicated when you want to assign a term to a wide span of data that overlaps with other hierarchical divisions. For example, a particular subject classification may span several paragraphs. Although various encoding techniques can be used to overcome the problems of overlapping hierarchies, it may be easier simply to put the classification term in the <teiHeader> rather than trying to apply it to a particular chunk of data.
Furthermore, the dual functionality of content markup demands that every single instance of the data types that have been selected for special encoding will be marked up. For example, if the encoder has decided to mark up names within the text, every single instance of a name will be tagged. If a particular person has been referred to several times in a single document then each instance will be enclosed in a name element. If the content encoding on names is acting as the document’s index terms then the implementer must consider how a search engine should deal with documents with multiple instances of the same name within them. Care must be taken to avoid the situation where the same document appears several times on the user’s hit list because it contains multiple instances of the search term used.
EAD has a similar dual approach to metadata that interprets and categorises data within the original documents. Within EAD, index terms into the finding aid can be provided through a <controlaccess> element.
However, index terms such as names, dates, and geographical locations can also be marked up within the narrative of the finding aid descriptions when they occur in descriptive content holding elements such as <scopecontent> and <bioghist>. The argument for encoding index terms within the flow of an EAD finding aid is, however, unconvincing. This is because such content encoding in EAD does not have the dual function of also opening up possibilities for user analysis and manipulation of the data within the markup, as is the case with TEI. The reason for this goes back to the fact that an EAD finding aid describes objects of study and is not an object of study in itself, and therefore there is no demand for close analysis of the finding aid as a ‘text’. When the argument of dual functionality is non-existent then enclosing index terms within a specific element like <controlaccess> seems preferable.
However, this issue is complicated by the hierarchical nature of the EAD framework. An EAD encoded finding aid is organised into multiple levels of description where the collection is first described as a whole, and then in component parts which get more specific at each level of description. At the lowest level (item), it is the archive record that is being described. The levels of description within EAD follow the rule of inheritance where the lower levels can ‘inherit’ descriptive data placed above them in the finding aid tree. If ‘inheritance’ is applied to index terms within the <controlaccess> element then it follows that terms should be placed at the highest level of description where they can be said to apply to that level and all levels below them. The rule of inheritance then dictates that all lower levels of description have their own <controlaccess> terms plus all the terms placed above them in the descriptive hierarchy. In practice this means that a sophisticated search engine needs to be employed to traverse the hierarchy and map the relationships between index terms and levels of description. No real research has been conducted into the effect or usefulness of levels of description within EAD on index terms and their searchability and so it is an area that LEADERS needs to explore further. Our research is complicated further by adding another layer to the problem, which is the effective searchability across both EAD and TEI as an integrated unit.
Solutions to overlaps between EAD and TEI and our final integration method must avoid repetition of information. Archivists and other related professionals will not appreciate repeating information that has been recorded in one place a second time.
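The inheritance rule for <controlaccess> terms discussed above can be sketched in a few lines of Python. This is only one possible reading of the rule, applied to an invented and heavily simplified finding aid; the sample data and the function names are assumptions, unnumbered <c> components are assumed, and nested <controlaccess> groupings are not handled.

```python
import xml.etree.ElementTree as ET

# Invented, simplified finding aid using unnumbered <c> components.
FINDING_AID = """
<archdesc level="fonds">
  <did><unittitle>Ward family papers</unittitle></did>
  <controlaccess><persname>Ward family</persname></controlaccess>
  <dsc>
    <c level="series">
      <did><unittitle>Correspondence</unittitle></did>
      <controlaccess><subject>Letters</subject></controlaccess>
      <c level="item">
        <did><unittitle>Letter to John Hale</unittitle></did>
        <controlaccess><persname>Hale, John</persname></controlaccess>
      </c>
    </c>
    <c level="series">
      <did><unittitle>Accounts</unittitle></did>
    </c>
  </dsc>
</archdesc>
"""


def own_terms(component):
    """Index terms declared directly on this level, not on its descendants."""
    terms = []
    for group in component.findall("controlaccess"):  # direct children only
        for term in group:
            text = "".join(term.itertext()).strip()
            if text:
                terms.append(text)
    return terms


def title_of(component):
    """Title of a level of description, taken from its <did>."""
    return component.findtext("did/unittitle", default="(untitled)").strip()


def effective_terms(component, inherited=()):
    """Yield (title, terms) for this level and all levels below it.

    Each level receives the terms placed above it plus its own terms.
    """
    terms = list(inherited) + [t for t in own_terms(component) if t not in inherited]
    yield title_of(component), terms
    children = component.findall("c") + component.findall("dsc/c")
    for child in children:
        yield from effective_terms(child, tuple(terms))


if __name__ == "__main__":
    for title, terms in effective_terms(ET.fromstring(FINDING_AID)):
        print(f"{title}: {terms}")
```

A search layer built on this idea could either compute the inherited terms at index time, as here, or walk up the tree at query time; which of the two suits an integrated EAD and TEI search is exactly the kind of question the text identifies as needing further research.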
Holding the same data in different places according to different principles also increases the potential for confusion in a system that is trying to integrate the two encoding frameworks. It is also important that the EAD finding aids and the TEI transcripts are not integrated to the extent that one cannot be re-used independently of the other. Each should be able to be recovered and used in other systems/applications for other purposes as stand-alone digital objects, otherwise the potential for the re-use of data that comes when working with non-proprietary tools will not be fully exploitable. Finally, as a primary use of metadata is to facilitate ‘resource discovery’, our integration solution must support meaningful search and retrieval and presentation of results.
DEVELOPING A GENERIC TOOLKIT: ARCHITECTURE AND TECHNOLOGY ISSUES
The LEADERS project is charged with developing a generic XML-based toolkit for use on multiple projects and with a wide variety of archival source materials encoded using EAD and TEI. This paper discusses the issues involved in designing and implementing such a generic toolkit. The generic goal has been a key influence on the criteria applied to the technical choices made on the project. When considering a generic development as opposed to a one-off project, criteria such as availability, support, sustainability and issues of flexibility versus standardisation become relevant.
The relative immaturity and openness of the XML environment have led to a proliferation of tools and utilities developed and supported by individuals or ad hoc groupings. Whilst these may be both innovative and acceptable for use on a one-off project, the requirements for a generic toolset mean that we need to focus on products which are within the technical mainstream and have the backing of a stable organisation. With regard to standardisation, both EAD and, more fundamentally, the TEI have been designed for maximum flexibility.
The assumption behind the design of TEI and supporting technologies, and to some extent behind EAD as well, has been that these tools would be used on single projects with a particular aim or objective and a homogeneous set of source materials. This flexibility is desirable when viewed from the perspective of supporting the widest possible use on the widest range of individual projects. However, the consequence is that each project using the TEI needs to define its own DTD and make its own rules for interpreting the TEI when encoding. For a generic toolkit, the rules need to be tightened up so that the tools for transforming and exploiting the resulting encoded material can be standardised.
The project has also had to choose between the use of Schemas and DTDs, in particular in view of the need to combine TEI, EAD and the NISO MIX Schema for the visual images associated with the encoded materials. Unlike DTDs, Schemas offer the use of namespaces, central to the combination of schemes required by LEADERS. Schemas also support data type validation, a central requirement to support re-usability of the encoded materials and associated stylesheets and applications.
Having chosen to develop a schema, we assessed different schema languages such as RELAX NG, Schematron and W3C XML Schema with reference to previously published evaluations and applications of the tools in question. A basic feature comparison was undertaken, and previous experiences evaluated within their contexts. Most experiences relate to the use of tools on a specific project or series of projects. For LEADERS the generic aspect takes on a major importance. We have to be conscious of the fact that our toolkit is designed for others to use in a range of differing circumstances.
Therefore criteria such as sustainability, support tools available and compatibility with other areas of the XML family of standards play a more important role in selection than
Mates, B.T., Adaptive Technology for the Internet: Making Electronic Resources Available to All, American Library Association, Chicago and London, 1998
National Council on Archives, Changing the Future of Our Past, NCA, London, 2002
Peterson Bishop, Ann, ‘Measuring Access, Use and Success in Digital Libraries’, The Journal of Electronic Publishing, vol. 4, 1998, at <http://www.press.umich.edu/jep/04-2/bishop.html>
Public Record Office, Keeper’s Report 2001/02, PRO, London, 2002
Rosenbusch, A., ‘Are Our Users Being Served? A Report on Online Archival Databases’, Archives and Manuscripts, vol. 29, 2000
Seaman, D., ‘The User Community as Responsibility and Resource: Building a Sustainable Digital Library’, D-Lib Magazine, July/August 1997, at <http://www.dlib.org/dlib/july97/07seaman.html>
Shaw, Elizabeth J., ‘Rethinking EAD: balancing flexibility and interoperability’, New Review of Information Networking, vol. 7, 2001
Tibbo, H.R. and Meho, L.I., ‘Finding Finding Aids on the World Wide Web’, The American Archivist, vol. 64, 2001

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003
"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None