The Social Networks and Archival Context Project: Developing a Prototype Historical Resource and Access System

Daniel Pitti; Brian Tingle; Ray Larson; Krishna Janakiraman

Authorship

1. Daniel Pitti

Institute for Advanced Technology in the Humanities (IATH) - University of Virginia
2. Brian Tingle

California Digital Library - University of California
3. Ray Larson

School of Information - University of California Berkeley
4. Krishna Janakiraman

School of Information - University of California Berkeley

Work text


The Social Networks and Archival Context Project

Pitti, Daniel, Institute for Advanced Technology in the Humanities, University of Virginia, dpitti@Virginia.edu
Larson, Ray, University of California, Berkeley. School of Information, ray@ischool.berkeley.edu
Janakiraman, Krishna, University of California, Berkeley. School of Information, krishna.j@berkeley.edu
Tingle, Brian, California Digital Library, brian.tingle.cdlib.org@gmail.com

This session will present an interim report on the findings and a demonstration of the Social Networks and Archival Context (SNAC) project.

SNAC is exploring the feasibility of using existing archival descriptions to create a prototype socio-historical resource and resource discovery tool that will enhance access to and understanding of cultural resources in archives, libraries, and museums. Beginning in May 2010, with funding from the National Endowment for the Humanities, the two-year project is using advanced technology in three primary ways: to derive descriptions of people from descriptions of their records; to match and merge the derived records with library and museum authority records; and to build a prototype socio-historical resource and access system using the resulting matched and combined records.

Leveraging the new standard Encoded Archival Context–Corporate Bodies, Persons, and Families (EAC-CPF), the project is deriving EAC-CPF records from nearly 30,000 EAD-encoded finding aids made available to the project by the Library of Congress and three archival consortia: Online Archive of California; Northwest Digital Archive; and Virginia Heritage. Names of record creators and of people documented in the records and record descriptions are being used to create EAC-CPF records. Co-occurrence of names in finding aids is also being recorded in order to document social and professional relations among named entities and to interrelate (or link) the related records with one another. The resulting EAC-CPF records are matched against one another and against several million authority records represented in the Library of Congress Name Authority File (NACO/LCNAF), the Getty Vocabulary Program's Union List of Artist Names (ULAN), and the Virtual International Authority File (VIAF), a collaboration between several national libraries. Unique data in matching records is being merged or combined into a single EAC-CPF record, with the ultimate goal of having one record for each unique person, corporate body, or family. Finally, the project is developing a prototype public access and historical resource system based on the unique EAC-CPF records created from the processing.

SNAC is a collaboration between three institutions: the Institute for Advanced Technology in the Humanities (IATH), University of Virginia; the California Digital Library (CDL), University of California; and the School of Information, University of California, Berkeley (SI/UCB). IATH is the lead institution and is responsible for overall project management and for deriving EAC-CPF records from EAD-encoded finding aids. SI/UCB is responsible for the matching and merging of authority records. CDL is responsible for developing a prototype public access and historical resource system based on the data produced by the other two partners.

The three papers presented in the session will cover the following topics: a comprehensive overview of SNAC and a detailed description of the derivation processing; a description of the theoretical and application challenges of matching data from heterogeneous sources; and a description of the methods and technology being adapted or developed to create the prototype public system. Finally, the session will present a brief demonstration of the prototype system.

Overview and Methods

Pitti, Daniel, Institute for Advanced Technology in the Humanities, University of Virginia, dpitti@Virginia.edu

Introduction
Archivists have a long history of describing the people who—acting individually, in families, or in formally organized groups—create and are documented in archival records. They research and describe the artists, scholars, political leaders, scientists, government agencies, soldiers, universities, businesses, families, and others who create and are represented in the items that are now part of our shared cultural legacy. However, because archivists have traditionally described records and the people documented in them in a single apparatus, the finding aid, this biographical-historical information is tied to specific resources and institutions. Currently there is no system in place that aggregates these descriptions of individual persons, families, and corporate bodies, and interrelates them to reveal the professional and social relations that existed among the described entities.

Leveraging the new standard Encoded Archival Context–Corporate Bodies, Persons, and Families (EAC-CPF), the Social Networks and Archival Context (SNAC) project is exploring the feasibility of using existing archival descriptions in new ways in order to enhance access to and understanding of cultural resources in archives, libraries, and museums. Beginning in May 2010, with funding from the National Endowment for the Humanities, the two-year project is using advanced technology to derive descriptions of people from descriptions of their records, to match and merge the derived records with library and museum authority records, and to use the resulting records to build a prototype socio-historical resource and access system.

Data and Data Contributing Institutions
Over 28,000 EAD-encoded finding aids have been made available to the project by the Library of Congress (918+) and three archival consortia: the Online Archive of California (13,932+); Northwest Digital Archive (5,160+); and Virginia Heritage (8,390+). The three consortia represent finding aids from over 200 individual repositories.

A key criterion in selecting the finding aid sources was geographical proximity. Because archives commonly emphasize local history, it is surmised that the geographic proximity of the archives would yield a higher rate of co-referencing (for example, a correspondent in one finding aid is the creator in another), and thus provide corroborating evidence of social-professional relations. Another very important consideration was encoding consistency, and thus the project emphasized consortia with high quality control standards.

Three institutions are contributing authority records that will be matched and merged with the derived EAC-CPF records. The Library of Congress has made the NACO/Library of Congress Name Authority File (LCNAF) available for project use (3.8M personal and 900K corporate name records). The Getty Vocabulary Program has contributed the Union List of Artist Names (ULAN) (293K personal and corporate names). Finally, OCLC Research has made available to the project the subset of the Virtual International Authority File (VIAF) that intersects with the LCNAF. The VIAF, for now, only contains personal names.

Separating Description of People from Description of Records
There are several interrelated intellectual and practical rationales for separating the descriptions of people from the description of the records that document their lives. These rationales are based on archival processing efficiencies, the intellectual quality and depth of resource description, and enhanced access to primary humanities resources for all users.

Cooperative Authority Control. Authority control is labor-intensive, and sharing the creation, maintenance, and use of authority data improves catalogers’ productivity. Sharing descriptions of creators and people and organizations documented in archival records saves time and labor.

Integrated Access to and Context for Cultural Heritage. Integrated, union access to archival authority records can be used to locate and identify people, organizations, and families, and these records in turn can lead to cultural heritage resources through links to descriptions or dynamically generated searches of catalogs. An archival authority record can provide not just contextual information for understanding archival resources, but also access to and context for understanding all that constitutes the human record and our cultural heritage.

Biographical/Historical Resource. In addition to name control, archival authority records provide biographical/historical data about the creator, such as when and where the creator existed, significant activities and functions performed by the creator, and other significant dates, places, and events. This historical information can be used as an independent resource that can assist users in identifying and learning about the described entity.

Social/Historical Context. People live and work with other people, both as individuals and as members of families and organizations. These social and professional relations are reflected in records created by them and consequently in the descriptions of the records. Archival authority control records provide a potential means to systematically gather and document these social and professional relations in links that interrelate descriptions of people, organizations, and families. This documentation can provide convenient access to the broad social-historical contexts within which corporate bodies, persons, and families were active, and convenient, navigable access to related or complementary resources.

Methods and Processing Overview
There are three ways in which technology is being used in the SNAC Project. First, using Extensible Stylesheet Language Transformations (XSLT), the project is deriving EAC-CPF records from EAD-encoded finding aids for record creators and people referenced in the description. (The co-occurrence of names in the description of a single collection documents either a social-professional or intellectual relation between the named entities. Some occurrences can specifically be identified as correspondents, thus confirming a social-professional relation.) Initially the project is focusing on extracting names, biographical-historical data, occupations, dates of existence, and languages used that are clearly and specifically encoded in the finding aids. Later in the project, natural language processing techniques will be used to experiment with extracting names and other targeted descriptive data that appear in the description but that are not encoded specifically. Second, the extracted EAC-CPF records are being matched against one another and against NACO/Library of Congress Name Authority File (LCNAF) records, Union List of Artist Names (ULAN) authority records, and finally Virtual International Authority File (VIAF) aggregated authority records. Unique data in matching records is being merged or combined into a single EAC-CPF record. Finally, the project is developing a prototype public access and socio-historical resource system based on the collection of unique EAC-CPF records created and interrelated.

This paper will address the first step in the processing, deriving EAC-CPF records from archival finding aids. The matching and merging and development of the prototype system will be addressed in separate papers.

Deriving EAC-CPF Records
The principal technologies involved in the derivation process are XSLT 2.0 and XPath 2.0, with relatively heavy use of regular expressions and customized functions. Initially in the project, the focus is on identifying and deriving individual records from the following EAD tag components: <persname>, <corpname>, and <famname> that occur within <origination>, <controlaccess>, and <unittitle>. Personal, corporate, and family names derived from <origination> and <controlaccess> are generally formulated according to strict cataloging rules (AACR2, for American archives and libraries), though challenges are presented by names that are poorly formulated (for example, in direct rather than inverted order, and intermixed with non-name data, subject subdivisions or uniform titles). While many names occurring within <unittitle> are tagged as such, many occur without being tagged as names. The tagged names found in <unittitle>, like those found in <controlaccess>, are irregularly formatted. Regular customized functions and named templates, many incorporating the use of regular expressions, are used to isolate and normalize the name strings, and to create unique lists of names.
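
The extraction step described above can be illustrated with a minimal sketch. It is written in Python rather than the project's XSLT 2.0, and it is not the project's code: the function names, the normalization rules (collapsing whitespace, cutting at "--", trimming trailing separators), and the assumption that the EAD file uses no XML namespace are all illustrative choices.

```python
import re
import xml.etree.ElementTree as ET

NAME_TAGS = ("persname", "corpname", "famname")
CONTEXT_TAGS = ("origination", "controlaccess", "unittitle")


def normalize(name: str) -> str:
    """Collapse whitespace, drop subject subdivisions, trim trailing separators."""
    name = re.sub(r"\s+", " ", name).strip()
    name = name.split("--")[0]  # assumption: "--" introduces a subject subdivision
    return name.rstrip(" ,;:")


def extract_names(ead_xml: str) -> list[tuple[str, str, str]]:
    """Return unique (context, name type, name) triples in discovery order."""
    root = ET.fromstring(ead_xml)  # assumption: un-namespaced EAD
    seen: set[tuple[str, str]] = set()
    found = []
    for context in CONTEXT_TAGS:
        for parent in root.iter(context):
            for tag in NAME_TAGS:
                for elem in parent.iter(tag):
                    name = normalize("".join(elem.itertext()))
                    if name and (tag, name) not in seen:
                        seen.add((tag, name))
                        found.append((context, tag, name))
    return found
```

Each name is reported once, under the first context in which it was found, so a name tagged in both <controlaccess> and <unittitle> yields a single entry in the unique list.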

For each unique name string found, an EAC-CPF record is created. For records derived for creators, additional descriptive data for dates of existence, occupation, subject headings assigned to records, languages used, and biographical-historical information is extracted into the corresponding EAC-CPF records. Additionally, all unique referenced names are related to the EAC-CPF record for the creator, and for each an EAC-CPF record is also created, and related to the record for the creator. Because of co-referencing among finding aids (for example, the same person corresponded with two or more record creators), the resulting set of records derived from any set of finding aids contains more than one EAC-CPF record describing the same entity. Thus while duplicate entries are not created for named entities found in a particular finding aid, duplicate or, more accurately, matching records are created in the processing of all of the finding aids.
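
A small sketch can show why matching records arise across finding aids. The data layout here (a dictionary with a "creator" and a list of "referenced" names per finding aid) and both function names are hypothetical, chosen only to make the record-per-name and relate-to-creator logic concrete.

```python
from collections import defaultdict


def derive_records(finding_aids: dict[str, dict]) -> list[dict]:
    """Create one record per unique name per finding aid, related to the creator."""
    records = []
    for aid_id, aid in finding_aids.items():
        creator = aid["creator"]
        # Unique referenced names within this finding aid, excluding the creator.
        referenced = sorted(set(aid["referenced"]) - {creator})
        records.append({"name": creator, "source": aid_id,
                        "role": "creator", "relations": referenced})
        for name in referenced:
            records.append({"name": name, "source": aid_id,
                            "role": "referenced", "relations": [creator]})
    return records


def names_in_multiple_aids(records: list[dict]) -> dict[str, list[str]]:
    """Names with records derived from more than one finding aid."""
    sources = defaultdict(set)
    for record in records:
        sources[record["name"]].add(record["source"])
    return {name: sorted(ids) for name, ids in sources.items() if len(ids) > 1}
```

Within one finding aid no name is duplicated, but a name referenced in two finding aids produces two records, which is the input to the later matching and merging step.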

Conclusion
While SNAC has only been underway for five months, the Library of Congress, Online Archive of California, and Northwest Digital Archive finding aids have been processed to derive nearly 160,000 EAC-CPF records. The matching and merging processing at SI/UCB began in September 2010, after several weeks devoted to acquiring and indexing several million VIAF, ULAN, and LCNAF authority records in preparation for the matching and merging processing. The initial release of the prototype public historical resource and access system was in December 2010. Though many challenges remain, the early results suggest that the data and the methods and techniques being applied are highly effective. The derivation, matching, and merging processing will continue to be refined, as will the prototype public system, for which new features will be developed.

Matching and Merging EAC-CPF Records

Larson, Ray, University of California, Berkeley. School of Information, ray@ischool.berkeley.edu

Janakiraman, Krishna, University of California, Berkeley. School of Information, krishna.j@berkeley.edu

Introduction
Our interests in cultural heritage, history, and the social sciences are fundamentally about human activities. Understanding the circumstances of people’s actions (who, what, when, how, and why) illuminates their lives and the events that they experienced. While much information of interest to scholars is already available in the collections of cultural institutions such as archives and libraries, there is a significant gap in the information infrastructure for dealing with information about people. Standards for the computerized handling of people’s names have been developed in libraries (such as MARC Authorities). With the development of the EAC-CPF (Encoded Archival Context–Corporate Bodies, Persons, and Families) standard, a similar capability is just now becoming available to archival collections. The Social Networks and Archival Context (SNAC) project aims to start bridging that gap and to connect the information about corporate bodies, persons and families in the library world with those entities in the archival world.

This paper reports on current experiments in matching and merging entities in collections of EAC-CPF records with those in library authority files and other sources. EAC-CPF records represent the entities (which can be individuals or groups of individuals) mentioned in archival description records and can be derived from the EAD (Encoded Archival Description) records that encode description of archival collections. EAD records, created by archivists and librarians, serve as vital finding aids. Information on entities in these records is an invaluable reference for humanities scholars, particularly since entities may be referenced and represented in multiple archival descriptions.

EAC-CPF records encode extensive information about an entity, drawn from various parts of the source records. In addition to basic identifying information (name, type, occupation(s), and existence dates), they include an entity’s relationship(s) with other entities, resources, and works. Since EAC-CPF records are derived independently from each EAD record, there can be multiple records representing the same entity in multiple EAD collections. A key problem, then, is to identify multiple EAC-CPF records that represent the same entity and merge them together into a single record.

SNAC has been given an extensive collection of library Name Authority records, including the Library of Congress Name Authority File (LCNAF), the Virtual International Authority File (VIAF) from OCLC Research, and the Union List of Artist Names (ULAN) from the Getty Vocabulary Program. The VIAF database combines name authority files from a number of libraries worldwide, including the Library of Congress, La Bibliothèque nationale de France, Deutsche Nationalbibliothek, and the Vatican Library. SNAC’s current implementations use either an exact string match criterion or the alternate name information from the name authority files to match entities in the EAC-CPF collection.

Related Work
Our problem is similar to the well-studied entity name disambiguation problem, where the task is to identify the correct entity, under a given context, from a set of seemingly identical entities. Standard approaches use statistical learning techniques, either performing supervised learning to train classifiers that predict the relevance of an entity given a context, or performing unsupervised learning with clustering techniques that group similar entities together. As an example of the former, Bunescu and Pasca (2006) suggest a method that trains Support Vector Machine (SVM) classifiers to disambiguate entities using the Wikipedia corpus. The classifier was trained using features extracted from the title, hyperlinks linking other entities, categories assigned to the entity, and Wikipedia’s redirect and disambiguation pages. Bagga and Baldwin (1998) and Mann and Yarowsky (2003) are examples of the latter technique, where similar entities are clustered using features extracted from the entity’s biographical information, words from sentences surrounding the entity in texts, and the entity’s social network and relationships. Other techniques involve using gazetteers and name authority files as external references to aid the disambiguation process. Smith and Crane (2001), for example, use gazetteers to disambiguate geographic place names.

SNAC’s focus is more precisely a clustering problem rather than a classification problem, since we want to group, in an unsupervised way, EAC-CPF records belonging to the same entity. While some of the work mentioned above uses sophisticated techniques to discover entities in the text, the EAC-CPF standard provides direct access to the name and other information about the entity. This, combined with the availability of the name authority files, allows use of simpler algorithms based on exact string match and authority file look up for matching entities.

Implementation
We have 158,079 EAC-CPF records (114,639 persons, 41,177 corporate bodies, and 2,263 family names) derived from Library of Congress, Online Archive of California, and Northwest Digital Archive EAD records. The records were parsed according to the EAC-CPF specification to extract information on an entity’s name, type, and relations, which was stored in a relational database. Preferred and alternate names from the VIAF name authority files were indexed using the Cheshire II information retrieval engine (Larson et al., 1996), which uses a probabilistic information retrieval algorithm to find the top n VIAF records and their associated names given an entity name. We mapped each EAC-CPF entity to names from the top n VIAF records (currently the top five VIAF records). These mappings are also represented in a relational database.

The primary approach is based on the simple hypothesis that two EAC-CPF records belong to the same entity if the entity names exactly match. This simple technique reduced the total number of unique records to 129,915. However, EAC-CPF records that belong to the same entity but carry different forms of the name were not matched. This was often due to the presence of existence dates in the name fields (e.g., “Einstein, Albert, 1879-1955” will not match “Einstein, Albert”).

Our second technique uses the hypothesis that entities referring to the same name authority record must be the same. Our database and index setup made this easy. For a given pair of entities, we search the entity VIAF names mapping table for alternate names for the first entity. If the second entity’s name appears in the first entity’s list of possible alternate names, we consider the two entities to be the same. Match results are parameterized by how liberal the search is: including names from all of the top five ranked VIAF records would result in a higher number of matches but fewer accurate matches. Using names from only the higher ranked VIAF records would give a lower number of matches with better accuracy. However, evaluation of the technique showed that using the best matching or top ranked VIAF record reduced the number of unique EAC-CPF records to 124,657, or 5,248 fewer records than was achieved using the exact name match technique. This was contrary to our expectations, and further evaluation suggested that it was a result of either subtle differences in the way the names are spelled and punctuated or the use of names that are not present in the authority files. Using lower ranked VIAF records reduced the number of unique records but introduced serious matching errors and was not a viable option.
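
The authority-file lookup can be sketched as follows. This is a simplified illustration, not the project's implementation: the mapping table is modeled as an in-memory dictionary from an entity name to a ranked list of name sets (one set per candidate VIAF record), and the function name and `top_n` parameter are hypothetical.

```python
def same_entity(first: str, second: str,
                viaf_names: dict[str, list[set[str]]], top_n: int = 1) -> bool:
    """Treat two entities as the same if the second name appears among the
    names of the first entity's top_n ranked authority records."""
    if first == second:
        return True
    candidates: set[str] = set()
    for names in viaf_names.get(first, [])[:top_n]:
        candidates |= names
    return second in candidates
```

Raising `top_n` widens the pool of candidate names, which corresponds to the trade-off described above: more matches, but a greater chance that a lower ranked authority record belongs to a different entity.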

Conclusions
We have described our current implementations that match and merge EAC-CPF records belonging to the same entity. Our implementations use exact name match as the criterion or use the name authority files as an external reference to disambiguate and find matches. Our current technique finds mostly accurate matches; false matches occur only when there are two different entities with the same name or when the name authority file has inaccurate information. Although it is possible for different entities to have the same name, existence dates can differentiate such names, and given that VIAF records combine information from institutions with rigorous standards, it is unlikely that the records will contain inaccurate information.

However, our current approach still fails to identify many possible matches. The main reasons seem to be subtle variations in names and punctuation or use of names that are not present in the authority files. To handle spelling issues, we plan to experiment with string comparison algorithms (such as “edit distance” algorithms) and to use the comparison results as features for a clustering algorithm. We will also experiment with other information about the entities, such as their biographies, relations with other entities, works produced, etc., and external sources such as DBpedia. However, because entity description records are created for individual archives, it is possible that this additional information is non-redundant and therefore not useful for matching purposes.
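
As an illustration of the kind of string comparison mentioned above, the following sketch computes the Levenshtein edit distance with the standard dynamic programming recurrence, plus a length-normalized similarity that could serve as one clustering feature. The choice of Levenshtein distance and the normalization are assumptions; the text does not specify which algorithm or feature scaling will be used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical feature: 1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Two name strings that differ only by a trailing period have a distance of 1, so a small distance threshold would catch punctuation variants that exact matching misses.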

References:
Bagga, A., and Baldwin, B. (1998). “Entity-based cross-document coreferencing using the vector space model.” Proceedings of the 17th International Conference on Computational Linguistics, 1: 79-85.

Bunescu, R., and Pasca, M. (2006). “Using encyclopedic knowledge for named entity disambiguation.” Proceedings of EACL, 6.

Larson, R. R., McDonough, J., O’Leary, P., Kuntz, L., and Moon, R. (1996). “Cheshire II: Designing a next-generation online catalog.” Journal of the American Society for Information Science, 47(7): 555-567.

Mann, G. S., and Yarowsky, D. (2003). “Unsupervised personal name disambiguation.” Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 4: 33-40.

Smith, D., and Crane, G. (2001). “Disambiguating geographic names in a historical digital library.” Research and Advanced Technology for Digital Libraries, 127-136.

Paper Three: The Social Networks and Archival Context Project: Developing a Prototype Historical Resource and Access System

Brian Tingle, California Digital Library, brian.tingle.cdlib.org@gmail.com

Introduction
While authority file interfaces for library authority control are reasonably well understood, archival authority records provide more detailed description and possibly extensive entries for related persons, corporate bodies, and families, and related resources. This paper will explore the opportunities and challenges faced in designing and implementing a public interface to access the unique archival authority record aggregations created by the SNAC project.

The prototype access system strives to enable humanities researchers to make use of archival context records that describe individuals, families, and formally organized groups to find resources and identify social and professional relations that are hard to discover using existing research tools and techniques. Existing methods require searching and exploring dispersed archival finding aids and using the descriptions found to further search and explore dispersed finding aids. The SNAC access system does this work for the researcher, bringing this information together in a searchable database and exposing the discovered social graph as open linked data.

The prototype was made available for public testing on December 17, 2010. From December 17, 2010, through February 17, 2011, the site had 47,857 visits and 104,223 page views, with 91.37% of traffic coming from Google searches. An iterative development model is being utilized, and user feedback on the prototype will help us to identify and prioritize ongoing development activities. New features are released into the prototype site on an ongoing basis, as they are developed. Grant-funded interface development will cease in April 2012 with the completion of the project. Source code for the access interface is also available as open source software.

A copy of the social graph derived from the research data was released as a graphML file on February 17, 2011, under the Open Data Commons Attribution License. Analysis is ongoing to determine the best way to represent the social graph in a conventional RDF model (such as with the FOAF ontology).
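
To give a sense of what a GraphML serialization of name relations involves, here is a minimal sketch using only the Python standard library. It is not the project's export code, and the structure of the released file is not described in this text: the single "name" node attribute, the node id scheme, and the undirected edges are assumptions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(relations: list[tuple[str, str]]) -> str:
    """Serialize (name, related name) pairs as an undirected GraphML graph."""
    ET.register_namespace("", GRAPHML_NS)

    def tag(local: str) -> str:
        return f"{{{GRAPHML_NS}}}{local}"

    root = ET.Element(tag("graphml"))
    # Declare a string attribute "name" that nodes may carry.
    ET.SubElement(root, tag("key"), {"id": "name", "for": "node",
                                     "attr.name": "name", "attr.type": "string"})
    graph = ET.SubElement(root, tag("graph"), {"id": "G", "edgedefault": "undirected"})
    names = sorted({name for pair in relations for name in pair})
    ids = {name: f"n{index}" for index, name in enumerate(names)}
    for name in names:
        node = ET.SubElement(graph, tag("node"), {"id": ids[name]})
        ET.SubElement(node, tag("data"), {"key": "name"}).text = name
    for source, target in relations:
        ET.SubElement(graph, tag("edge"), {"source": ids[source], "target": ids[target]})
    return ET.tostring(root, encoding="unicode")
```

Because GraphML is a plain XML vocabulary, a file of this shape can be opened by general-purpose graph tools without any project-specific software.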

Technical Infrastructure
The semantics and structure of the records, based on the EAC-CPF schema, have been the starting point of the initial prototype. The prototype is being built as an application of the eXtensible Text Framework (XTF) and will support search, display, and navigation of EAC-CPF records. The TinkerPop Graph Processing Stack is also being used by the project to load the relationships recorded in the derived EAC-CPF records into a graph database and to expose them through a REST API to the interactive graph visualizations in the prototype interface. The TinkerPop stack is compatible with linked data technologies through the RDF Sail interface, and will be used as the platform to expose the project’s data as linked data when the semantic modeling is complete.

Current and Future Development
The access system being developed resembles library systems based on authority control, but the EAC-CPF archival authority records are far more complex than those created by the library community. In addition to entry control (authorized and alternative names), the archival authority records frequently have biographical-historical data, occupation, dates of existence, languages used, as well as links to related people and resources. An additional challenge is presented by the relative quantities of descriptive data found in each record. Some records have as many as 50 or more alternative names, scores of subject headings, more than 50 related persons, families, or corporate bodies, and many linked archival finding aids or titles. Other descriptions are quite brief, based on the name occurring in one finding aid and failing to match an authority record. Finding the right method for displaying and facilitating navigation of this data presents many challenges.

The initial prototype focuses on searching, browsing and displaying the EAC-CPF records as formatted web pages for researchers. Both full text (description less control data) and specific component searches are supported. Full text searches are weighted to give preference to matches in the <identity> section of the record, where all forms of the name of the entity discovered in the derivation, matching, and merging processes are listed. Limiting a search to the <identity> section restricts the retrieval to just the forms of name found in the section, and thus excludes matches in other parts of the description, such as in entries for related named entities. As users enter searches, authorized forms of names are suggested. Users can browse the top occupations and subjects in addition to an alphabetical index of all names in the database. The alphabetical index feature is likely to become less useful as the number of records increases. (At this date, there are over 123,920 named entities.) Search results can be narrowed down to names that have a particular occupation or subject term associated, or restricted to entity type, such as person, corporate body, or family.
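
The weighting and section-limiting behavior described above can be illustrated with a toy scorer. This is not how XTF ranks results; the record layout (a dictionary of section name to text), the term-count scoring, and the weight of 3.0 for the identity section are all assumptions made for the example.

```python
import re


def score(record: dict[str, str], query: str, identity_weight: float = 3.0) -> float:
    """Count query term hits per section, weighting the identity section higher."""
    terms = re.findall(r"\w+", query.lower())
    total = 0.0
    for section, text in record.items():
        words = re.findall(r"\w+", text.lower())
        hits = sum(words.count(term) for term in terms)
        total += hits * (identity_weight if section == "identity" else 1.0)
    return total


def search(records: dict[str, dict[str, str]], query: str,
           identity_only: bool = False) -> list[tuple[str, float]]:
    """Rank records by score; optionally restrict matching to the identity section."""
    results = []
    for record_id, record in records.items():
        scoped = {"identity": record.get("identity", "")} if identity_only else record
        value = score(scoped, query)
        if value > 0:
            results.append((record_id, value))
    return sorted(results, key=lambda item: (-item[1], item[0]))
```

In a full text search, a record for the queried entity outranks records that merely mention it as a related name; restricting the search to the identity section drops the latter entirely.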

In the initial prototype, a wide variety of data and links are displayed to the user. For each EAC-CPF record, the following descriptive components are displayed: authoritative name, alternative names, dates of existence, sex, affiliated countries, occupations, subject terms used in describing related archival records (for record creators), and biographical/historical description (either as prose or a chronological list). In addition to the above biographical/historical data, the following linked information will also be displayed: related persons, corporate bodies, and families; descriptions of related archival records (that is, finding aids within which the name was discovered); and published works by or about the described entity. Links are also provided to a matching Virtual International Authority File (VIAF) record, when one is identified.

Though the outbound links to finding aids and VIAF records are currently implemented, the internal "links," for now, are implemented as searches.

While the derivation, matching, and merging processing continues, the persistence of any given EAC-CPF record cannot be ensured: a record may be merged into another record in subsequent processing, so it is difficult to assign persistent identifiers or addresses in links to related entities. Once the deriving, matching, and merging is complete (late in the project), links to associated persons, corporate bodies, and families will directly retrieve the related records.

The list of titles for resources by and about the described entity that have been gathered in the record is currently inactive. We anticipate eventually using entries in this list to query WorldCat. When the project is complete, additional links will be made to descriptions of named entities in DBpedia and WorldCat Identities, where matching entries are found. Also under consideration is offering users the opportunity to use authoritative and alternative name entries to search a selection of archive, library, and museum access systems, and public resources such as Google, Bing, and Flickr.

Another objective of SNAC is to employ a display and navigation tool to graphically display and facilitate the navigation of the social and professional networks discovered and documented in the EAC-CPF records and their relations to one another. Visualization of abstract networks is a well-studied problem and there are many tools available that support the graphML file format that the project has used to represent the historical social graph released under an open license. The project’s objective is to develop a visualization and navigation interface that will make it possible for humanities researchers to explore and discover social-professional relations and related resources that would be difficult to explore and discover using simple lists. Experimentation with appropriate graph visualization and navigation interfaces is ongoing at

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ADHO - 2011

"Big Tent Digital Humanities"

Hosted at Stanford University

Stanford, California, United States

June 19, 2011 - June 22, 2011


Conference website: https://dh2011.stanford.edu/

Series: ADHO (6)

Organizers: ADHO
