Cross-Collection Searching: A Pandora's Box or the Holy Grail?

  1. Susan Schreibman

    University Libraries - University of Maryland, College Park

  2. Gretchen Gueguen

    University Libraries - University of Maryland, College Park

  3. Jennifer O'Brien Roper

    University Libraries - University of Maryland, College Park

While many digital library initiatives and digital humanities centers still create collection-based projects, they are increasingly looking for ways of federating these collections, enhancing the possibilities of discovery across media and differently themed research. Facilitating access to these objects, which are frequently derived from different media and formats, belong to different genres, and have traditionally been described in very different ways, poses challenges that more coherently themed collections may not.
In the last few years it has become increasingly evident to those in the digital humanities and digital library communities, and to the agencies that fund their research, that providing federated searching for the immensely rich digital resources created over the past decade is a high priority. Several recent research grants speak to this issue, such as the Mellon-funded NINES: A Network Funded Initiative for Nineteenth Century Electronic Scholarship, or The Sheet Music Consortium.
While digital objects organized around a specific theme
or genre typically provide opportunities for rich metadata creation, providing access to diverse collections that seem
to have little in common (except that they are owned
by the same institution) often poses problems in the
compatibility of controlled vocabulary and metadata schema.
While this problem has been recognized on much larger scales before and addressed by initiatives such as Z39.50 and the Open Archives Initiative’s Protocol for Metadata Harvesting, addressing it within a library’s or center’s own digital collections is a vital part of making such initiatives successful: doing so leverages cross-collection discovery through the internal structure of the metadata scheme as well as a consistent approach to terminology. This presentation will explore the issues surrounding the creation of an archive of cross-searchable materials across a large spectrum of media, format, and genre at the University of Maryland Libraries. It will examine how some of these interoperability problems can be addressed through metadata schema, targeted searching, and controlled vocabulary.
Description of Research
This paper will be based on research done at the University of Maryland Libraries using two ongoing projects. The first project utilizes The Thomas MacGreevy Archive, a full-text digital repository (following the Text Encoding Initiative (TEI) Guidelines), to explore the development of metadata and descriptors that facilitate searching across individual collections which are described at different levels of granularity. The second project applies the knowledge gained from the research on the more cohesive MacGreevy Archive to the more diverse repository that the UM Libraries are developing, which uses Fedora as its underlying repository architecture.
The Thomas MacGreevy Archive is being explored as a microcosm from which to examine issues of searchability of content divided into collections that cannot be described using a single controlled vocabulary and that have different modes of display. The necessity for cross-collection searches has arisen because the Archive is expanding its content from digitized versions of books and articles to include two collections of correspondence (one relatively small collection of seven letters, the other quite large, c. 150 letters), and is making images that are currently available only via hyperlink from within texts individually discoverable. Preliminary findings from this research were shared at the joint 2005 ACH/ALLC Conference at the University of Victoria.
Another issue that the MacGreevy Archive can model is the problem of controlled vocabularies across different collections. The current controlled vocabulary descriptors use a faceted approach to describe articles and books written on such topics as art, music, and literature. Since both the correspondence and the images differ in form and function from the existing objects in the archive, the current descriptors are not granular enough to capture either the variety of themes common to letters or the additional descriptors needed to describe what the images are of and about.
The experience gained in exploring the more homogeneous MacGreevy Archive is being applied to the much more diverse collections and formats housed in the Fedora repository, in which rich collection-specific controlled vocabulary across multiple formats is being developed at the same time as a vocabulary that gives users the opportunity for discovery across all collections. While specific controlled vocabularies exist that would adequately describe each collection, they are generally too specific for materials outside that collection. On the other hand, Library of Congress Subject Headings (LCSH), while sufficiently broad in scope, are unwieldy in form, taking a post-faceted approach by combining several smaller descriptors into a predefined string. These long strings cause multiple problems in searching (including not being amenable to Boolean searching) but are ubiquitous within university libraries, forming the underlying basis for the vast majority of online catalogs. LCSH descriptors will be necessary, however, for those objects that will also appear in the library catalog.
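The contrast between predefined heading strings and individually searchable descriptors can be illustrated with a minimal Python sketch. It assumes that subdivisions in a heading are separated by a double hyphen, and it uses invented record identifiers and illustrative headings written in an LCSH-like style; none of the data or function names come from the MacGreevy Archive or the UM Libraries systems.

```python
# Hypothetical records: identifiers and headings are invented for illustration.
RECORDS = {
    "letter-001": ["Art, Irish--20th century--Exhibitions"],
    "article-014": ["Poetry, Modern--20th century--History and criticism"],
    "image-207": ["Painting, Irish--Exhibitions"],
}


def split_heading(heading):
    """Split a predefined heading string into its component descriptors."""
    return [part.strip() for part in heading.split("--") if part.strip()]


def build_index(records):
    """Map each lower-cased descriptor to the set of record ids that carry it."""
    index = {}
    for record_id, headings in records.items():
        for heading in headings:
            for term in split_heading(heading):
                index.setdefault(term.lower(), set()).add(record_id)
    return index


def search_all(index, *terms):
    """Boolean AND: records that carry every one of the given descriptors."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits) if hits else set()


def search_any(index, *terms):
    """Boolean OR: records that carry at least one of the given descriptors."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))
```

With the descriptors separated, a query such as `search_all(build_index(RECORDS), "20th century", "Exhibitions")` can combine terms that never occur as a single exact string in any other record, which is the kind of Boolean searching that whole heading strings make difficult.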
In exploring cross-collection search capabilities in this larger, more diverse environment, the use of controlled vocabulary for subject access is only one possible source of commonalities. Within the Fedora repository, the University of Maryland Libraries will create a metadata scheme that represents a hybrid of the elements and concepts chiefly found in qualified Dublin Core and the Visual Resources Association Core. This scheme and hybrid approach were first used by the University of Virginia, and the UM Libraries are refining that element set and list of required elements specifically with cross-collection searching in mind. By requiring elements that include information such as the century and geographic focus of individual objects, designers aim to define and render searchable the broader topics that objects from disparate collections of narrow focus may have in common. Designers must also use or define standard vocabularies to populate these broader elements to ensure successful cross-collection discovery.
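As a rough illustration of how required broad elements might support discovery across otherwise unrelated collections, the following Python sketch defines a record with a small set of required fields, checks one of them against a controlled list, and filters records from different collections on those shared elements. The field names, the controlled list of centuries, and the sample records are all assumptions made for this sketch; they are not the element set used by the University of Virginia or the UM Libraries.

```python
from dataclasses import dataclass, field

# Hypothetical required elements and controlled list, for illustration only.
REQUIRED = ("identifier", "collection", "title", "century", "geographic_focus")
CENTURIES = {"18th century", "19th century", "20th century"}


@dataclass
class Record:
    identifier: str
    collection: str
    title: str
    century: str
    geographic_focus: str
    # Collection-specific descriptors remain available for richer local searching.
    collection_terms: list = field(default_factory=list)


def validate(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [f"missing {name}" for name in REQUIRED if not getattr(record, name)]
    if record.century and record.century not in CENTURIES:
        problems.append(f"century not in controlled list: {record.century}")
    return problems


def cross_search(records, century=None, geographic_focus=None):
    """Filter records from any collection on the shared broad elements."""
    hits = []
    for record in records:
        if century and record.century != century:
            continue
        if geographic_focus and record.geographic_focus != geographic_focus:
            continue
        hits.append(record)
    return hits


# Invented sample records drawn from three unrelated hypothetical collections.
SAMPLE = [
    Record("a1", "Letters", "Letter to a friend", "20th century", "Ireland"),
    Record("b7", "Sheet music", "Parlor song", "19th century", "United States"),
    Record("c3", "Photographs", "Street scene", "20th century", "Ireland"),
]
```

Here `cross_search(SAMPLE, century="20th century", geographic_focus="Ireland")` returns the letter and the photograph even though they share no collection-specific descriptors, while `validate` rejects a record whose century value falls outside the controlled list.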
Previous Research
The integrated design of online information retrieval systems has been studied most prominently by Marcia Bates (2001). However, most research in this field takes a more atomized approach, focusing solely on one aspect of design: metadata schemes, for instance, or GUI design for search screens. Other research has examined the particular needs and searching habits of users, particularly in humanities disciplines, when faced with online search interfaces. Prominent among these studies have been the work of Deborah Shaw (1995) and the series of reports from the Getty Online Searching Project (Bates, 1994, 1996; Bates 1993, 1995; Siegfried 1993). The majority of this research was carried out in the 1990s, and follow-up has focused more on specific applications of digital library systems, such as Broughton (2001), than on user-oriented, integrated design.
Another area of inquiry relevant to this research involves the use of faceted classification systems for web-based discovery. The most recent white paper produced by the NINES project neatly summarizes current research (NINES 2005). KM’s ‘The Knowledge Management Connection’ discusses faceted classification within the context of information-intensive business environments.
Denton (2003) discusses how to develop a faceted classification scheme, while Bates (1988) surveys the various approaches to subject description for web-based
This paper will build on the previous research mentioned above in developing a controlled vocabulary, metadata schema, and faceted classification scheme that provide for both rich collection-specific discovery and federated searching across collections.
