A General Framework for Feature Identification

Neal Audenaert; Richard Furuta; Eduardo Urbina

Authorship

1. Neal Audenaert

Center for the Study of Digital Libraries - Texas A&M University
2. Richard Furuta

Center for the Study of Digital Libraries - Texas A&M University
3. Eduardo Urbina

Center for the Study of Digital Libraries - Texas A&M University

Work text


Large digital libraries typically contain collections of heterogeneous resources intended to be delivered
to a variety of user communities. A key challenge for these libraries is providing tight integration between
resources both within a single collection and across
multiple collections. Traditionally, efforts at developing digital collections for the humanities have tended toward two extremes [5]. On one side are huge collections such as the Making of America [14, 15], Project Gutenberg [18], and Christian Classics Ethereal Library [13] that have minimal tagging, annotation or commentary. On the other side are smaller projects that closely resemble
traditional approaches to editorial work in which editors carefully work with each page and line providing markup and metadata of extremely high quality and detail, mostly
by hand. Projects at this end of the spectrum include
the William Blake Archive [21], the Canterbury Tales Project [11], the Rossetti Archive [19], and the Cervantes
Project [12]. These extremes force library designers
to choose between large collections that provide an
impoverished set of services to the collection’s patrons on the one hand and relatively small, resource-intensive projects on the other. Often, neither option is feasible.
An alternative approach to digital humanities projects
recasts the role of the editor to focus on customizing and skillfully applying automated techniques, targeting
limited resources for hand coding to those areas of the collection that merit special attention [6]. The
Perseus Project [16] exemplifies this class of projects.
Elucidating the internal structure of the digital resources
by automatically identifying important features (e.g.,
names, places, dates, key phrases) is a key approach to aid in the development of these “middle ground” projects. Once identified, these features can be used to establish connections between the resources and
to inform visualizations. This task is complicated by the
heterogeneous nature of digital libraries and the diversity
of user community needs.
To address this challenge we have developed a
framework-based approach to developing feature identification
systems that allows decisions about details of document
representation and feature identification to be deferred
to domain-specific implementations of the framework.
These deferred decisions include details of the semantics
and syntax of markup, the types of metadata to be
attached to documents, the types of features to be identified,
the feature identification algorithms to be applied, and
the determination of which features are to be indexed.
To achieve this generality, we represent a feature
identification system as being composed of three layers,
as diagrammed in Figure 1. The core of the system is a “Feature
Identification Framework” (FIF). This framework
provides the major structural elements for working
with documents, identifying features within documents,
and building indices based on the identified features.
Implementations customize components of the framework
to interface with existing and new collections and to
achieve domain-specific functionality. Applications
then use this framework, along with the appropriate set
of customized modules, to implement visualizations,
navigational linking strategies, and searching and
filtering tools.
Figure 1: Three layered approach to designing a feature
identification system
The document module implements the functionality
needed to represent documents, manage storage and
retrieval, provide an interface to searching mechanisms
and facilitate automatic feature identification. It provides
the following features:
1. Multiple types of documents (e.g., XML, PDF, RTF,
HTML, etc.) can be supported without modifying the
APIs with which the rest of the system will interact.
2. Arbitrary syntactical constraints can be associated
with a document and documents tested to ensure
their validity. Notably, this helps to ensure that
the markup of identified features does not violate
syntactic or semantic constraints.
3. Metadata conforming to arbitrary metadata
standards can be attached to documents.
4. Storage and retrieval mechanisms are provided that
allow document persistence to be managed either
directly by the framework or by external systems.
5. Large documents can be broken into smaller
“chunks” for both indexing and linking. Work in this
area is ongoing.
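As an illustration of the five capabilities listed above, the following minimal Python sketch shows one way a format-independent document interface could be organized. All class and method names (Document, XmlDocument, PlainTextDocument, InMemoryStore, chunks) are hypothetical and are not taken from the FIF implementation; validity checking is reduced here to XML well-formedness, and chunking to fixed-size character windows, purely as simplifying assumptions.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Document(ABC):
    """Format-independent view of a document (hypothetical interface)."""

    def __init__(self, doc_id: str, raw: str) -> None:
        self.doc_id = doc_id
        self.raw = raw
        # Metadata is an open key/value map so any standard can be attached.
        self.metadata: dict[str, str] = {}

    @abstractmethod
    def text(self) -> str:
        """Return the plain text that feature identifiers will scan."""

    @abstractmethod
    def is_valid(self) -> bool:
        """Check the document against its syntactic constraints."""

    def chunks(self, size: int = 500) -> list[str]:
        """Split the text into pieces of at most `size` characters."""
        body = self.text()
        return [body[i:i + size] for i in range(0, len(body), size)]


class XmlDocument(Document):
    def text(self) -> str:
        # Concatenate all text nodes, ignoring the markup itself.
        return "".join(ET.fromstring(self.raw).itertext())

    def is_valid(self) -> bool:
        # Assumption: "valid" means well-formed; a real system might
        # also check a schema or DTD here.
        try:
            ET.fromstring(self.raw)
            return True
        except ET.ParseError:
            return False


class PlainTextDocument(Document):
    def text(self) -> str:
        return self.raw

    def is_valid(self) -> bool:
        return True


class InMemoryStore:
    """Stand-in for framework-managed persistence (assumption)."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def save(self, doc: Document) -> None:
        if not doc.is_valid():
            raise ValueError(f"document {doc.doc_id} failed validation")
        self._docs[doc.doc_id] = doc

    def load(self, doc_id: str) -> Document:
        return self._docs[doc_id]
```

Because the rest of the system would depend only on the abstract Document class, adding a new format in this sketch means adding one subclass, and an external storage system could be wrapped behind the same save and load methods.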
The feature module builds on this base to provide
the core toolset for identifying the internal structure of
documents. Our design of this component reflects the
highly contextualized nature of the feature identification
task. The relevant features of a document can take many
forms (e.g., a person or place, the greeting of a letter, a
philosophical concept, or an argument against an idea)
depending on both the type of document and the context
in which that document is expected to be read. Equally
contextualized are the algorithms used to identify
features. Dictionary and statistically based methods
are prevalent, though other techniques focusing on the
semi-structured nature of specific documents have also
yielded good results [3, 1, 4, 9, 2, 7]. Ultimately, which
algorithm is selected will depend heavily on the choice
of the corpus editor. Accordingly, our framework has
been designed so that the only necessary property of
a feature is that it can be identified within the text of a
document and described within the structure provided by
the document module.
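The single requirement stated above, that a feature can be located within the text of a document, can be sketched in Python as follows. This is an illustrative sketch only: the names Feature, FeatureIdentifier, DictionaryIdentifier, and identify_all are hypothetical, and the dictionary-based matcher stands in for whichever algorithm a corpus editor might actually choose.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Feature:
    kind: str   # e.g. "person" or "place"
    start: int  # character offset where the match begins
    end: int    # character offset just past the match
    value: str  # the matched surface text


class FeatureIdentifier(Protocol):
    """Anything with an identify() method can be plugged in."""

    def identify(self, text: str) -> list[Feature]: ...


class DictionaryIdentifier:
    """Finds every occurrence of the names in a fixed, non-empty word list."""

    def __init__(self, kind: str, names: list[str]) -> None:
        self.kind = kind
        # Try longer names first so a full name wins over its prefix.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        self.pattern = re.compile(r"\b(?:" + alternatives + r")\b")

    def identify(self, text: str) -> list[Feature]:
        return [
            Feature(self.kind, m.start(), m.end(), m.group())
            for m in self.pattern.finditer(text)
        ]


def identify_all(text: str, identifiers: list[FeatureIdentifier]) -> list[Feature]:
    """Run every identifier and return the features in document order."""
    found = [f for ident in identifiers for f in ident.identify(text)]
    return sorted(found, key=lambda f: f.start)
```

A statistical or structure-aware identifier would implement the same identify method, so the choice of algorithm stays with the implementation layer rather than the framework.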
For applications using the framework to effectively
access and present the informational content, an indexing
system is needed. Given the open ended nature of both
document representation and the features to be identified,
the indexing tools must inter-operate with the other
customized components of the framework. We accomplish
this by utilizing adapters that are implemented while
customizing the system. These adapters work with the other customized components to specify the elements of
each document to index.
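The adapter idea can be sketched in Python under some simplifying assumptions: a document is represented as a plain dictionary, an adapter is any function that yields the (key, value) terms to index for one document, and the index is an in-memory inverted map. The names IndexAdapter, people_adapter, year_adapter, and FeatureIndex are hypothetical and do not come from the framework itself.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable, Iterator

# An adapter decides which elements of a document become index terms.
IndexAdapter = Callable[[dict], Iterable[tuple[str, str]]]


def people_adapter(doc: dict) -> Iterator[tuple[str, str]]:
    """Index every person identified in the document."""
    for name in doc.get("people", []):
        yield ("person", name)


def year_adapter(doc: dict) -> Iterator[tuple[str, str]]:
    """Index the document's year, if it has one."""
    if "year" in doc:
        yield ("year", str(doc["year"]))


class FeatureIndex:
    """Inverted index from (key, value) terms to document ids."""

    def __init__(self, adapters: Iterable[IndexAdapter]) -> None:
        self.adapters = list(adapters)
        self.postings: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, doc: dict) -> None:
        for adapter in self.adapters:
            for term in adapter(doc):
                self.postings[term].add(doc["id"])

    def lookup(self, key: str, value: str) -> list[str]:
        # .get() avoids creating empty entries for unknown terms.
        return sorted(self.postings.get((key, value), set()))
```

In this sketch the index never inspects a document directly, so a collection with different features only needs a different set of adapters.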
To demonstrate and test this framework, we have
implemented a prototype for a collection of official
records pertaining to Miguel de Cervantes Saavedra (1547-1616),
originally assembled by Prof. Kris Sliwa [10].
This collection contains descriptions, summaries, and
transcriptions in Spanish of nearly 1700 documents
originally written from 1463 to 1681. These documents
bear witness to the life of both Cervantes and his family
and include inventory lists, birth and death certificates,
and court testimonies.
Our application provides two primary points of access
to the collection: a timeline navigator and a browsing
interface. Following Crane et al. [7], we have utilized
proper names (people and places) and time as the two
primary dimensions for structuring the documents in
this collection. The timeline navigator, shown in Figure
2, displays a bar chart showing the distribution of the
documents over time. Selecting a bar takes the reader to
a more detailed view of the time period. Once the chart
displays documents as single years, clicking on the bar
for a single year brings up a display listing all documents
from that year. The browsing interface, shown in Figure
3, allows readers to browse lists of both the people and
the places identified within the collection. Upon selecting
an item to view, a page presenting the resources available
for that person or place is displayed. Currently, this includes
a list of all documents in which the specified person has
appeared and a bar graph of all documents in which that
individual has been found as shown in Figure 4.
Figure 2: Timeline interface to the Sliwa collection
Figure 3: Browsing interface to the Sliwa collection
Figure 4: Browsing documents for Francisco de Palacios
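The drill-down behaviour of the timeline navigator described above can be approximated with a simple binning routine. The Python sketch below is an assumption about how such a view might be computed, not a description of the prototype: the function names histogram and drill_down are hypothetical, and each zoom step is assumed to divide the bin width by ten until single years are reached.

```python
from collections import Counter


def histogram(years, start, end, width):
    """Count documents per bin of `width` years within [start, end].

    Returns a dict mapping the first year of each non-empty bin to its count.
    """
    counts = Counter()
    for year in years:
        if start <= year <= end:
            bin_start = start + ((year - start) // width) * width
            counts[bin_start] += 1
    return dict(sorted(counts.items()))


def drill_down(years, bin_start, width):
    """Zoom into one bar by re-binning its span at a finer width."""
    finer = max(1, width // 10)  # never go below single years
    return histogram(years, bin_start, bin_start + width - 1, finer)
```

Once the width reaches one year, selecting a bar would no longer re-bin but instead list the documents dated to that year.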
Once the user has selected an individual document to
view, through either the timeline or browsing interface,
that document is presented with four types of features
identified and highlighted. Identified people and places
are used to automatically generate navigational links
between documents and the pages presenting the resources
for the people and places identified within a document.
Dates and monetary units are identified and highlighted
in the text.
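Highlighting dates and monetary units of this kind is often done with pattern matching. The Python sketch below is illustrative only: the two regular expressions (Spanish dates written as "day de month de year", and amounts followed by one of three coin names) and the angle-bracket tags used for highlighting are assumptions made for this example, not the patterns or markup used in the prototype.

```python
import re

MONTHS = (
    "enero|febrero|marzo|abril|mayo|junio|julio|agosto|"
    "septiembre|setiembre|octubre|noviembre|diciembre"
)
# Assumed date form: "9 de octubre de 1547".
DATE = re.compile(rf"\b\d{{1,2}} de (?:{MONTHS}) de \d{{4}}\b", re.IGNORECASE)
# Assumed money form: a number followed by a coin name, e.g. "300 ducados".
MONEY = re.compile(r"\b\d[\d.]*\s+(?:maravedís|ducados|reales)\b", re.IGNORECASE)


def highlight(text):
    """Wrap each recognized date or amount in a tag naming its kind."""
    spans = [(m.start(), m.end(), "date") for m in DATE.finditer(text)]
    spans += [(m.start(), m.end(), "money") for m in MONEY.finditer(text)]

    pieces, pos = [], 0
    for start, end, kind in sorted(spans):
        if start < pos:
            continue  # skip a match that overlaps one already emitted
        pieces.append(text[pos:start])
        pieces.append(f"<{kind}>{text[start:end]}</{kind}>")
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)
```

Historical documents vary widely in how dates and sums are written, so patterns like these would need to be tuned against the collection at hand.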
One challenge with any framework-based system is to
ensure that the framework is not so general that customizing
it requires more time and effort than writing an equivalent
application from scratch. Our experience developing the
Sliwa collection prototype suggests that our framework
offers significant benefits. With the framework in place,
we were able to develop and integrate new features in
days, and sometimes hours. Moreover, as sophisticated,
general-purpose features (e.g., pattern matching,
grammatical parsers, georeferenced locations) are
implemented, it becomes possible to customize and apply
these features in new collections via a web-based interface
with no additional coding involved. Custom document formats are more complex to implement, but can serve in a wide variety of applications. The current implementation is sufficient for most XML formats, and work is underway
to more fully support TEI-encoded documents. Our
approach provides strong support for the general
components of a feature identification system thereby
allowing individual projects to focus on details
specific to the needs of particular collections and user
communities.
We are currently working to apply this framework to a number of other projects, including diaries written
during early Spanish expeditions into southern Texas [8],
scholarly comments about the life and art of Picasso
from the Picasso Project [17], and the Stanford
Encyclopedia of Philosophy [20]. This will include
further enhancements to the framework itself including
support for feature identification that utilizes the structure
of the document (including other identified features)
in addition to the text, and better support for accessing “chunks” within a document in addition to the document as a whole. For the long term, we also plan to explore ways in which this framework can be used to assist and shape editorial practices.
References
[1] Bikel, D. M., R. Schwartz, and R. M. Weischedel,
1999. An Algorithm that Learns What’s in a Name.
Machine Learning, 34(1-3): p. 211-231.
[2] Callan, J., and T. Mitamura. 2002. Knowledge-based
extraction of named entities. In Proceedings of the eleventh international conference on Information and knowledge management. McLean, Virginia, USA: ACM Press.
[3] Chinchor, N. A. 1998. Overview of MUC-7/MET-2.
In Proceedings of the Seventh Message Understanding Conference (MUC-7). Fairfax, Virginia USA. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html.
[4] Cohen, W. W., and S. Sarawagi. 2004. Exploiting
dictionaries in named entity extraction: combining
semi-Markov extraction processes and data
integration methods. In Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining. Seattle, WA, USA: ACM Press.
[5] Crane, G. 2000. Designing documents to enhance the
performance of digital libraries: Time, space, people and a digital library of London. D-Lib Magazine,
6(7/8).
[6] Crane, G., and J. A. Rydberg-Cox. 2000. New
technology and new roles: the need for “corpus
editors.” In Proceedings of the fifth ACM conference on Digital libraries. San Antonio, TX USA: ACM Press.
[7] Crane, G., D. A. Smith, and C. E. Wulfman. 2001.
Building a hypertextual digital library in the humanities:
a case study on London. In Proceedings of the
first ACM/IEEE-CS joint conference on Digital
libraries. Roanoke, VA USA: ACM Press.
[8] Imhoff, B., ed. 2002. The diary of Juan Dominguez de Mendoza’s expedition into Texas (1683-1684): A critical edition of the Spanish text with facsimile
reproductions. Dallas, TX: William P. Clements
Center for Southwest Studies, Southern Methodist University.
[9] Mikheev, A., M. Moens, and C. Grover. 1999. Named entity recognition without gazetteers. In Proceedings
of the ninth conference on European chapter of the Association for Computational Linguistics. Bergen, Norway: Association for Computational Linguistics.
[10] Sliwa, K. 2000. Documentos Cervantinos: Nueva
recopilación; lista e índices. New York: Peter Lang.
[11] “The Canterbury Tales Project,” De Montfort
University, Leicester, England. http://www.cta.dmu.ac.uk/projects/ctp/index.html. Accessed on May 25, 2002.
[12] “The Cervantes Project.” E. Urbina, ed. Center
for the Study of Digital Libraries, Texas A&M
University. http://csdl.tamu.edu/cervantes. Accessed on Feb 7, 2005.
[13] “Christian Classics Ethereal Library”, Calvin
College, Grand Rapids, MI. http://www.ccel.org/. Accessed on Sept 8, 2005.
[14] “Making of America.” University of Michigan. http://www.hti.umich.edu/m/moagrp/. Accessed on Sept 8, 2005.
[15] “Making of America.” Cornell University. http://moa.cit.cornell.edu/moa/. Accessed on Sept 8, 2005.
[16] “Perseus Project” G. Crane, ed. Tufts University. http://www.perseus.tufts.edu/. Accessed on Sept 9, 2005.
[17] “The Picasso Project”, E. Mallen, ed. Hispanic Studies Department, Texas A&M University. http://www.tamu.edu/mocl/picasso/. Accessed on Feb 7, 2005.
[18] “Project Gutenberg.” Project Gutenberg Literary Archive Foundation. http://www.gutenberg.org/. Accessed on Sept 9, 2005.
[19] “The Rossetti Archive.” J. McGann, ed. The Institute
for Advanced Technology in the Humanities,
University of Virginia. http://www.rossettiarchive.org/. Accessed on Feb 7, 2005.
[20] “Stanford Encyclopedia of Philosophy.” Stanford University. http://plato.stanford.edu/. Accessed on Nov 14, 2005.
[21] “The William Blake Archive” M. Eaves, R. Essick, and J. Viscomi, eds. The Institute for Advanced
Technology in the Humanities. http://www.blakearchive.org/. Accessed on Sept 9, 2005.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC
