Rijksuniversiteit Groningen (University of Groningen)
University of Amsterdam
The Early American Imprints Series is a microfiche collection
of all known existing books, pamphlets, and periodical
publications printed in the United States from 1639 to 1800.
It is based on Charles Evans’ American Bibliography and gives
insight into many aspects of life in 17th- and 18th-century
America. Its metadata consists of 36,305 records, which are
elaborately described (title, author, publication date, etc.) with
numerous values. The records were compiled by librarians in the
MARC21 format, which we encoded in XML.
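As a rough illustration of what working with MARC21 encoded in XML involves, the following Python sketch reads a single MARCXML record and pulls out the three descriptive fields named above. The sample record is invented for illustration, and the choice of MARC tags (100$a for author, 245$a for title, 260$c for publication date) is an assumption about which fields would be read, not a description of the project's actual pipeline.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARCXML ("MARC21 slim") schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only (not taken from the dataset).
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane,</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example sermon</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Boston :</subfield>
    <subfield code="c">1750.</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def describe(record_xml):
    """Extract author, title, and publication date from one MARCXML record."""
    record = ET.fromstring(record_xml)
    return {
        "author": subfield(record, "100", "a"),  # assumed: main entry, personal name
        "title": subfield(record, "245", "a"),   # assumed: title statement
        "date": subfield(record, "260", "c"),    # assumed: date of publication
    }


description = describe(SAMPLE_RECORD)
```

The values come back exactly as catalogued, including trailing MARC punctuation such as the final period in the date, so a real pipeline would probably need a normalization step before aligning them with an ontology.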
The Semantic Web for History (SWHi) project at the Digital
Library department of the University Library Groningen has
worked with the MARC21 metadata of this dataset. Our
project aims to integrate, combine, and deduce information
from this dataset to assist general users or historians in
exploring American history by using new technology offered
by the Semantic Web. Concretely, we developed a semantic
search application for historical data, especially from a Digital
Library point of view.
Semantic Web technologies seem ideally suited to improve and
widen the services digital libraries offer to their users. Digital
libraries rely heavily on information retrieval technology. The
Semantic Web might be used to introduce meaningful and
explicit relations between documents, based on their content,
thereby allowing services that introduce forms of semantic
browsing that supplement, or possibly replace, keyword-based
searches. This is also a theme we address in the SWHi project.
Semantic Web and Information Retrieval
Digital libraries increase their amount of metadata each
day, but much of this metadata is not used as a finding aid.
We have used these metadata to create a Semantic Web
compatible “historical” ontology. Ontologies are the backbone
of the Semantic Web, and also play a pivotal role in the SWHi
application. One of the ideas of the Semantic Web is that
existing resources can be reused. This was also one of the key
ideas of our project.
Our ontology is constructed by using the existing ontology
PROTON as its skeleton, and is enriched with other schemas.
We reuse existing topical taxonomies created by librarians
and other experts. We also extracted topic hierarchies from
the metadata. To describe objects with more semantics and
create relationships between objects, we also enriched the
ontology with the vocabularies Dublin Core and Friend of a
Friend (FOAF). For example, we link instances with time and
space, but also topics and persons. This is useful for discovering
historical social networks, exploring gazetteers, and clustering
instances together by topic for faceted search, etc. We
aligned the library metadata with schemas and vocabularies.
Information is extracted from the metadata to populate the
ontology. All of this is eventually processed and encoded in
XML with RDF/OWL. In addition to PROTON’s basic
ontology modules, this mapping yields 152 new classes.
In total, we obtain 112,300 ontology instances from the
metadata.
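To make the mapping step more concrete, here is a minimal Python sketch of how one bibliographic description might be turned into RDF statements using Dublin Core and FOAF terms. The base URI `http://example.org/swhi/`, the function names, and the choice of properties (`dc:title`, `dc:creator`, `dc:date`, `foaf:name`) are assumptions made for illustration. The output is written as N-Triples rather than RDF/XML to keep the example short, and the PROTON classes used in the project are not modelled here.

```python
from urllib.parse import quote

DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/swhi/"  # hypothetical base URI, not the project's


def uri(value):
    """Wrap an absolute URI in N-Triples angle brackets."""
    return f"<{value}>"


def literal(text):
    """Serialize a plain string literal with basic N-Triples escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def record_to_triples(record_id, title, author, date):
    """Describe one document and its author as a list of N-Triples lines."""
    doc = uri(f"{BASE}document/{quote(record_id)}")
    person = uri(f"{BASE}person/{quote(author)}")
    return [
        f"{doc} {uri(DC + 'title')} {literal(title)} .",
        f"{doc} {uri(DC + 'creator')} {person} .",
        f"{doc} {uri(DC + 'date')} {literal(date)} .",
        f"{person} {uri(RDF_TYPE)} {uri(FOAF + 'Person')} .",
        f"{person} {uri(FOAF + 'name')} {literal(author)} .",
    ]


# Invented example values, for illustration only.
triples = record_to_triples("evans-1", 'An "example" sermon', "Doe, Jane", "1750")
```

Giving the author a URI of its own, instead of storing the name as a plain string on the document, is what lets several documents point to the same person and makes links between persons, places, and topics possible.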
We also combine Information Retrieval technology with
Semantic Web techniques. We use the open-source search
engine SOLR to index ontology instances, parse user input
queries, and eventually retrieve matching ontology instances
from the search results. SOLR supports faceted search and has
been designed for easy deployment.
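A faceted search request to SOLR can be sketched as follows. The Python code only builds the request URL and does not contact a server. The endpoint `http://localhost:8983/solr/swhi/select` and the facet field names `topic` and `location` are hypothetical; the parameters `q`, `rows`, `wt`, `facet`, and `facet.field` are standard SOLR query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real deployment would use its own host and path.
SOLR_SELECT_URL = "http://localhost:8983/solr/swhi/select"


def faceted_query_url(user_query, facet_fields, rows=10):
    """Build a SOLR select URL that asks for results plus facet counts."""
    params = [
        ("q", user_query),   # the user's keyword query
        ("rows", str(rows)), # number of results to return
        ("wt", "json"),      # response format
        ("facet", "true"),   # switch faceting on
    ]
    # One facet.field parameter per field to count values for.
    params.extend(("facet.field", field) for field in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Field names below are assumptions, not the project's index schema.
url = faceted_query_url("saratoga", ["topic", "location"])
```

The response to such a request would contain the matching documents together with value counts per facet field, which is the information a faceted result list needs for its category links.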
During the indexing stage, we apply inference over the
ontology to propagate the importance of the different
metadata records and fields. Using the ontology, instances
with higher relevance can be ranked higher in the result
order. For example, a person who is known by many people
and created many documents would get a higher score. We
use Sesame for storage and retrieve data with the RDF query
language SPARQL.
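The scoring idea in the example above can be illustrated with a small Python sketch. It counts, for each person, how many other people know them and how many documents they created, and turns the total into an index-time boost. The logarithmic formula and the shorthand predicate names `foaf:knows` and `dc:creator` are assumptions chosen for illustration; the abstract does not specify the formula actually used.

```python
import math
from collections import Counter

KNOWS = "foaf:knows"    # shorthand predicate: subject knows object
CREATOR = "dc:creator"  # shorthand predicate: document has creator object


def person_boosts(triples):
    """Map each person to a boost that grows with their connections."""
    known_by = Counter()
    created = Counter()
    for subject, predicate, obj in triples:
        if predicate == KNOWS:
            known_by[obj] += 1  # obj is known by subject
        elif predicate == CREATOR:
            created[obj] += 1   # obj created the document in subject
    people = set(known_by) | set(created)
    # Assumed formula: 1 + ln(1 + links), so the boost grows slowly.
    return {
        person: 1.0 + math.log1p(known_by[person] + created[person])
        for person in people
    }


# Invented toy graph, for illustration only.
toy_triples = [
    ("alice", KNOWS, "bob"),
    ("carol", KNOWS, "bob"),
    ("doc1", CREATOR, "bob"),
    ("doc2", CREATOR, "alice"),
]
boosts = person_boosts(toy_triples)
```

In this toy graph "bob" is known by two people and created one document, so he receives a larger boost than "alice", who has a single link. Such a value could then be attached to the corresponding instance when it is indexed.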
User Access
On top of our storage and retrieval components, we developed
novel techniques for our semantic search application to
visualize information and let users browse that
information in an interactive manner. We let users search for
information semantically, which means that information is
linked together with certain relations. Results are clustered
together based on such relations, which allows faceted search
by categories (see Illustration 1). We provide context to relevant
nuggets of information by enabling users to traverse related
RDF graphs. We visualize interconnected results using network
graphs with the TouchGraph tool.
Illustration 1: Result list for query “saratoga”.
We picked up this idea from the ‘berrypicking’ model: a user
searches, picks a berry (a result), stores it in a basket, views
the relations between the berries in the basket, and the search
iteration continues. The purpose is to find new information
from the collected berries (results).
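The basket step of this model can be sketched in a few lines of Python. The `Basket` class, its method names, and the rule that two results are related when they share a property value are all assumptions made for illustration; in the application itself the relations come from the RDF graph.

```python
from itertools import combinations


class Basket:
    """Holds the results a user has picked during a search session."""

    def __init__(self):
        self.berries = {}  # result id -> dict of property values

    def pick(self, result_id, properties):
        """Store one picked result together with its properties."""
        self.berries[result_id] = properties

    def relations(self):
        """List (id_a, id_b, property, value) for every shared value."""
        links = []
        for (id_a, props_a), (id_b, props_b) in combinations(self.berries.items(), 2):
            for key in sorted(props_a.keys() & props_b.keys()):
                if props_a[key] == props_b[key]:
                    links.append((id_a, id_b, key, props_a[key]))
        return links


# Invented results, for illustration only.
basket = Basket()
basket.pick("doc1", {"place": "Saratoga", "year": "1777"})
basket.pick("doc2", {"place": "Saratoga", "year": "1778"})
shared = basket.relations()
```

Here the two picked results are linked through their shared place, which is the kind of relation a user could follow to start the next search iteration.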
As we are dealing with historical data, chronological timeline
views are also presented using SIMILE’s Timeline, which lets
users browse for information by time. Besides time, we
also let users search by location using Google Maps.
Geographical entities, mostly locations in the US, are aligned
with Google Maps.
In summary, we have four modes of visualization, which give
users multiple views of the results: plain list, faceted search,
timeline, and map. We have found that translating such an
interface to an on-line environment offers interesting new ways
to allow for pattern discovery and serendipitous information
seeking. Adding information visualization tools like interactive
and descriptive maps and time-lines to the electronic finding
aid’s interface could further improve its potential to augment
cognition, and hence improve information access.
Conclusions
We presented the Semantic Web for History (SWHi) system,
which deals with historical data in the form of library finding
aids. We employed Semantic Web and Information Retrieval
technologies to achieve the goal of improving user access to
historical material.
Bibliography
Grigoris Antoniou, Frank van Harmelen. A Semantic Web
Primer. The MIT Press, 2004.
Berners-Lee, T., Hendler, J., and Lassila, O. The semantic web.
A new form of web content that is meaningful to computers
will unleash a revolution of new possibilities. The Scientific
American, 2001.
Ismail Fahmi, Junte Zhang, Henk Ellermann, and Gosse Bouma.
“SWHi System Description: A Case Study in Information
Retrieval, Inference, and Visualization in the Semantic Web.”
The Semantic Web: Research and Applications, volume 4519 of
Lecture Notes in Computer Science, pages 769-778. Springer,
2007.
Dieter Fensel, Wolfgang Wahlster, Henry Lieberman, James
Hendler. Spinning the Semantic Web: Bringing the World Wide
Web to Its Full Potential. MIT Press, 2002.
Lucene SOLR. http://lucene.apache.org/solr/
Semantic Web for History (SWHi). http://evans.ub.rug.nl/swhi
SIMILE Timeline. http://simile.mit.edu/timeline/
Junte Zhang, Ismail Fahmi, Henk Ellermann, and Gosse Bouma.
“Mapping Metadata for SWHi: Aligning Schemas with Library
Metadata for a Historical Ontology.” Web Information Systems
Engineering -- WISE 2007 Workshops, volume 4832 of Lecture
Notes in Computer Science, pages 102-114. Springer, 2007.
Hosted at University of Oulu
Oulu, Finland
June 25, 2008 - June 29, 2008
135 works by 231 authors indexed
Conference website: http://www.ekl.oulu.fi/dh2008/
Series: ADHO (3)
Organizers: ADHO