Loughborough University
Loughborough University
University College London
University College London
The nineteenth-century newspaper was a messy object, filled with an ever-changing mix of material (literary, factual and the suspiciously plausible) in countless amorphous layouts. Working with digitised newspapers is no different. Each database contains a theoretically standardised collection of data, metadata, and images, but the precise nature and nuance of this data is often occluded by the automatic processes that encoded it. Moreover, no true universal standard has been implemented to facilitate cross-database analysis, which encourages digital research to remain within existing institutional or commercial silos. Where common standards have been asserted, such as the minimum standards for Europeana or Chronicling America, they have been standardised at only a very low resolution, with significant variance in the range and interpretation of the metadata within their direct collaborations as well as by independent programmes following their example. These irregularities make the data highly vulnerable to misinterpretation, both by end users and by those updating the collections in the future.
In order to better explore global exchanges (for example, scissors-and-paste journalism) in the nineteenth-century press, Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914 attempted to integrate and make interoperable the metadata used to store digitised newspapers in a variety of linguistic and institutional contexts. This paper will demonstrate how we excavated institutional decision-making from a variety of sources in order to understand the archaeology of digitised newspaper metadata, its vocabulary and structures, and how they related to the conceptions of the newspaper object held by both modern end-users and the original nineteenth-century producers.
It will explore how computational thinking and data management processes can be combined with discussions of the historical evolution of the newspaper to restore and integrate a narrative that is generally lost in the creation of digital archives: how the strategies and decision-making processes that shaped the composition and structure of the data have affected, and will continue to affect, user experience and the conclusions drawn from these materials.
The Ontologies work package of Oceanic Exchanges had a simple remit: to catalogue the metadata terminology used by newspaper databases and map it, both between databases and to an internal ontology, in order to support research into reprinting. Because complete documentation was not available for any of our collections, we reverse-engineered the implementation of these vocabularies, beginning with document type definitions (DTDs) and schema specifications and complementing them with internal and public documentation on the cataloguing standards used. Some cases also required us to rely upon grey literature (discussions by users about how to manipulate the data) and direct examination of records. Finally, building upon previous research by team members and new interviews, we were able to develop a longitudinal understanding of how the data has been augmented or repackaged by institutions over the past twenty years.
Although most of the databases used variants of the METS/ALTO standard, these were not implemented in a way that would allow for simple equivalencies. The variance in terminology, and in the interpretation of the correct range of inputs for a given field, arose from the use of a hodgepodge of different vocabularies, including variants of Dublin Core, METS/ALTO, MPEG-21 and PREMIS, as well as other bespoke or proprietary taxonomies.
Overlapping and ambiguous vocabularies were also structured inconsistently, with some combining data at the article, page or issue level and others separating the metadata and content for these elements into multiple files. Our initial attempts to account for both internal structures and field equivalencies across these databases made the level of irregularity strikingly clear.
Figure 1: Map of all metadata fields from our samples (each one represented by a different colour), with connecting lines showing the internal hierarchy of each, broken down by metadata of physical object, digital object, metadata pertaining to both, and text data. Unmapped blue boxes represent an overflow of repetitive administrative technical metadata.
Moreover, the interpretation and implementation of these fields were inconsistent within collections, owing to the turnover of staff during the digitisation process as well as the long history of metadata being drawn from existing library catalogues. Such layering is particularly evident in the metadata associated with Trove, the National Library of Australia’s collections, which includes end-user annotations, categorisations and text corrections: layers which are valuable to humanities researchers but which, for the other collections, remain in unintegrated grey literature and derived data. The level of publicly available documentation about how to interpret both authorised and user-generated fields varied widely, and interviews and internal documents made it clear that consistent implementation of guidelines was unlikely across time. Working with these collections, therefore, requires a creative and flexible interpretation of these standards and an understanding of the history and character of the specific digital files.
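As a rough illustration of what mapping field terminology to an internal ontology can involve, the following minimal Python sketch shows a crosswalk table and a lookup over it. It is not the project's implementation: the collection names, source field names and internal terms are all hypothetical placeholders chosen for the example.

```python
# Hypothetical crosswalk: (collection, source field) -> internal ontology term.
# Every name below is an illustrative assumption, not taken from a real collection.
CROSSWALK = {
    ("collection_a", "dc:title"): "newspaper_title",
    ("collection_a", "dc:date"): "issue_date",
    ("collection_b", "publication_name"): "newspaper_title",
    ("collection_b", "date_issued"): "issue_date",
}


def to_internal(collection, record):
    """Translate one source record into internal terms.

    Returns two dicts: the mapped values keyed by internal term, and the
    fields that have no crosswalk entry, so gaps stay visible instead of
    being silently dropped.
    """
    mapped, unmapped = {}, {}
    for field, value in record.items():
        term = CROSSWALK.get((collection, field))
        if term is None:
            unmapped[field] = value
        else:
            mapped[term] = value
    return mapped, unmapped


def equivalents(term):
    """List every (collection, field) pair mapped to the given internal term."""
    return sorted(key for key, mapped_term in CROSSWALK.items() if mapped_term == term)
```

Calling `to_internal("collection_b", record)` on a record would return the values keyed by internal term alongside any fields that still lack a mapping, and `equivalents("issue_date")` would list the source fields treated as equivalent. Keeping unmapped fields in a separate result is one possible design choice for surfacing the kind of irregularity described above.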
After working with such disparate source materials, we concluded that the narrative of creation, archiving and digitisation might be most robustly and sustainably documented through a decentralised and layered medium, namely Linked Open Data. This decision is not without controversy. First, although the scale of periodical material makes it particularly tempting for large-scale analysis, the majority of newspaper metadata is in XML format, which presents specific challenges for semantic data modelling. More philosophically, the possibilities and problematics of the semantic web have been theorised since the term was coined in 2001; in particular, the importance of making and sustaining connections to humanistic forms of knowledge representation has been regularly emphasised. Oldman, Doerr and Gradmann highlight the possibility of linked data combining "digital infrastructure, computer reasoning, interpretation, and digital collaboration", but warn of leaving a "mechanical meaningless shell" if the endeavour is seen as an end in itself, or as a purely scientific exercise. Indeed, linguistic and literary scholars have raised numerous concerns about tool-driven research questions and banal quantification for computation’s sake.
Likewise, Berry and Fagerjord have claimed that linked data involves a fragmentation that "privileges knowledge divided into non-narrative shards of information", seemingly putting it in direct opposition to the idea of reclaiming lost narratives of creation and use.
This paper will, therefore, explore the implications of competing claims surrounding the value of Linked Open Data within the specific domain of digitised periodicals, particularly when working with enriched metadata and data roundtripping (the process of integrating derived data back into the original collections). It will demonstrate how combining institutional histories, interviews, metadata and historical narrative detail in a decentralised and layered structure can restore lost narratives and data provenance in a sustainable way that is intelligible across disciplines and at multiple resolutions, whether focusing on the textual content of the issue, the technical details surrounding digitisation, or the computational representation of the physical layout and materials.
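The idea of a layered structure that preserves provenance can be sketched, under stated assumptions, as a set of subject-predicate-object statements that each record which layer asserted them. The plain-Python sketch below is illustrative only: the identifiers, predicates and layer names are hypothetical, and a real Linked Open Data implementation would use RDF vocabularies and persistent URIs rather than these placeholder strings.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    layer: str  # which layer asserted the statement (hypothetical labels)


# Illustrative statements about one invented issue; none describe a real record.
STATEMENTS = [
    Statement("issue:example-1865-04-01", "hasTitle", "The Example Gazette", "library_catalogue"),
    Statement("issue:example-1865-04-01", "digitisedFrom", "microfilm", "digitisation_vendor"),
    Statement("issue:example-1865-04-01", "hasPage", "page:1", "digitisation_vendor"),
    Statement("page:1", "hasCorrectedText", "corrected text", "user_annotation"),
]


def about(subject, layers=None):
    """Return statements about a subject, optionally restricted to chosen layers."""
    return [
        s for s in STATEMENTS
        if s.subject == subject and (layers is None or s.layer in layers)
    ]


def provenance(subject, predicate):
    """Return the sorted names of the layers that asserted a given predicate."""
    return sorted(
        {s.layer for s in STATEMENTS if s.subject == subject and s.predicate == predicate}
    )
```

Filtering by layer, as in `about("issue:example-1865-04-01", {"digitisation_vendor"})`, gives one view of the same object at a particular resolution, while `provenance(...)` recovers who asserted a given statement. This is one simple way to keep catalogue, digitisation and user-generated layers distinct yet queryable together.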
Hosted at Carleton University, Université d'Ottawa (University of Ottawa)
Ottawa, Ontario, Canada
July 20, 2020 - July 25, 2020
Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.
Conference website: https://dh2020.adho.org/
References: https://dh2020.adho.org/abstracts/
Series: ADHO (15)
Organizers: ADHO