Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning

paper, specified "long paper"
Authorship
  1. Benjamin Charles Germain Lee

    University of Washington



The millions of digitized historic newspaper pages within Chronicling America, a joint initiative between the Library of Congress and the National Endowment for the Humanities, represent an incredibly rich resource. Historians, journalists, genealogists, students, and members of the public explore the collection regularly via keyword search. But how do we navigate the abundant visual content in Chronicling America? This question is motivated by the fact that visual culture within newspapers has proven to be a capacious source for humanists. Within periodicals studies, scholars have used the visual content in newspapers to investigate topics ranging from the evolution of comedic sensibilities within comic strips to hidden editorial practices embedded within newspaper layout (Cole, 2020; Barnhurst and Nerone, 2002). This collective body of work is bolstered by new methodologies being employed within the digital humanities to extract and analyze visual content in historic newspapers (Piper, Wellmon, and Cheriet, 2020; Fyfe and Ge, 2018; Wevers and Smits, 2020). In this talk, I will present my project, Newspaper Navigator, created in collaboration with LC Labs, the National Digital Newspaper Program, and IT Design & Development at the Library of Congress, as well as Professor Daniel Weld at the University of Washington. In particular, I will discuss four distinct phases of Newspaper Navigator to extract and analyze the visual content within Chronicling America and beyond.

First, I will describe extracting visual content, including photographs, illustrations, comics, editorial cartoons, maps, headlines, and advertisements, from 16.3 million pages in Chronicling America, resulting in the Newspaper Navigator dataset. To accomplish this, I fine-tuned an object detection model on thousands of bounding box annotations of visual content from the Beyond Words crowdsourcing initiative launched by LC Labs in 2017. I then made a full pass over 100 TB of image and XML data in order to construct the dataset. The Library of Congress and I released the resulting Newspaper Navigator dataset to the American public in May 2020 as the largest dataset of its kind ever produced. In pursuit of the Library’s mission of improving access, we placed the dataset and all code into the public domain for unrestricted re-use. We published a paper describing the dataset and its construction at the 2020 ACM International Conference on Information & Knowledge Management (Lee et al., 2020).

Second, I will discuss the Newspaper Navigator public search application for 1.5 million photos from the dataset. While caption-based keyword search for images provides much utility, the approach also has fundamental limitations: for example, how do historians search for photographs with distinct visual motifs? This question is particularly relevant for cultural heritage collections, where OCR transcriptions are inevitably imperfect, further restricting the efficacy of keyword search. In the second phase of Newspaper Navigator, I created and deployed the search application for 1.5 million photographs in the dataset, based on the real needs that historians and other users had articulated to us surrounding these limitations. In addition to providing keyword search functionality, the search application enables users to iteratively train machine learning algorithms in order to retrieve visually similar photos according to topics or concepts of interest, such as baseball players. From an exploratory search perspective, I call this search functionality open faceted search because it empowers users to create their own facets dynamically, facilitated by interactive machine learning algorithms that can train and predict over all 1.5 million photos in under a second. Unlike standard faceted search, open faceted search provides a path forward even when metadata is impoverished, making it extensible to a wide range of digitized collections. I first presented open faceted search in a demo at the 2020 ACM Symposium on User Interface Software and Technology (Lee and Weld, 2020).
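The interactive loop behind open faceted search can be illustrated with a small sketch. It is an illustration under stated assumptions and not the application's actual code: it assumes each photo already has a precomputed embedding vector, it uses a plain logistic regression as a stand-in for whichever lightweight model the application trains, and the names train_facet, rank_photos, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def train_facet(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic regression (weights, bias) by batch gradient descent.

    embeddings: list of equal-length float vectors (assumed precomputed).
    labels: 1 for photos the user marked as relevant, 0 for irrelevant.
    """
    dim = len(embeddings[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(embeddings)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted relevance
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def rank_photos(collection, weights, bias, top_k=3):
    """Score every photo with the trained facet and return the top_k ids."""
    scores = {
        pid: sum(w * xi for w, xi in zip(weights, x)) + bias
        for pid, x in collection.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy collection: hypothetical 2-D embeddings standing in for real ones.
collection = {
    "photo_a": [0.9, 0.1],
    "photo_b": [0.8, 0.2],
    "photo_c": [0.1, 0.9],
    "photo_d": [0.2, 0.8],
    "photo_e": [0.7, 0.3],
}

# One round of feedback: the user marks photo_a as relevant, photo_c as not.
weights, bias = train_facet(
    [collection["photo_a"], collection["photo_c"]], [1, 0]
)
ranking = rank_photos(collection, weights, bias)
print(ranking)
```

In an iterative session, the user would review the returned photos, add further positive and negative labels, and retrain. A linear model over fixed embeddings is one plausible way to keep each round fast, since training touches only the labeled examples and scoring is a single dot product per photo.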

Third, I will discuss the Newspaper Navigator data archaeology, which I wrote to examine the ways in which a Chronicling America newspaper page is altered and decontextualized during its journey from a physical artifact to a series of probabilistic photographs in Newspaper Navigator. First released with the Newspaper Navigator search application in order to provide scholars and the general public alike with a resource surrounding the ethical considerations and implications of this project, the data archaeology has appeared in revised form as an article in Digital Humanities Quarterly (Lee, 2021). In this data archaeology, I studied the digitization journeys of four different pages in Black newspapers in Chronicling America that reproduce the same photograph of W.E.B. Du Bois. In tracing the pages’ journeys, I unpacked how each step, from microfilming to OCR to image embeddings, propagates bias, marginalization, and erasure via the machine learning algorithms employed.

Fourth, I will discuss Newspaper Navigator research collaborations with scholars and educators across universities and cultural heritage institutions. With Devin Naar, I conducted the first study of the Ladino press at a macroscopic scale. Ladino, also known as Judeo-Spanish, is the language of the Sephardic Jewish people, and the Ladino press represents an invaluable source for studying Sephardic Jewish experiences across the world. In this collaboration, I have used Newspaper Navigator to excavate the visual content from over 15,000 pages of Ladino newspapers. Many Ladino texts are not even keyword searchable due to the widespread failure of OCR engines to properly transcribe the language. My excavation of the visual content offers the first path forward to studying Ladino newspapers at scale and thus serves as a corrective to this algorithmic marginalization. My analysis of thousands of extracted photographs and advertisements reveals new contours to Sephardic Jewish experiences in modernity: in addition to uncovering photographs of individuals and communities, I have also identified an abundance of advertisements offering remedies for anxieties, whether medical, financial, or class-based. My findings are detailed in my chapter in Jewish Studies in the Digital Age, currently in press with De Gruyter Press for publication in 2022 (Lee, 2022).

Moreover, in an ongoing collaboration with periodicals scholars Jim Casey, Sarah Salter, and Joshua Ortiz Baco, I am studying the evolution of visual layouts of newspaper titles, with a particular focus on ethnic presses and how they served as vehicles for protest and community. Using the Newspaper Navigator dataset, it is possible to directly quantify the similarity of layouts across millions of newspaper pages, enabling us not only to trace the technological developments of printing presses but also to uncover the hidden editorial practices embedded within layouts themselves. For example, we have identified clusters of newspaper titles with similar visual layouts, such as networks of African-American titles that feature illustrations and photographs of members of their communities in portrait poses in the center of their front pages. The editors’ choice of a shared visual grammar speaks to the ways in which visual culture and layout figured prominently in editorial practices. We presented our first paper detailing this collaboration at the Computational Humanities Research 2021 conference (Lee et al., 2021). Lastly, I have collaborated with professors of education Ilene Berson and Michael Berson to investigate uses of Newspaper Navigator in the classroom, as detailed in our article in Social Education (Lee, Berson, and Berson, 2021).
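As a rough illustration of how layout similarity between pages might be quantified from bounding boxes of visual content, the sketch below rasterizes each page's boxes onto a coarse grid and compares pages by cosine similarity. This is a hypothetical simplification and not the method used in the collaboration: the grid representation, the function names layout_vector and cosine_similarity, and the three toy pages are all assumptions made for the example.

```python
import math


def layout_vector(boxes, grid=4):
    """Rasterize boxes into a grid of per-cell coverage fractions.

    boxes: list of (x0, y0, x1, y1) in page-normalized [0, 1] coordinates.
    Overlapping boxes are summed, so coverage is capped at 1.0 per cell.
    """
    cell = 1.0 / grid
    vec = []
    for row in range(grid):
        for col in range(grid):
            cx0, cy0 = col * cell, row * cell
            cx1, cy1 = cx0 + cell, cy0 + cell
            covered = 0.0
            for x0, y0, x1, y1 in boxes:
                w = max(0.0, min(x1, cx1) - max(x0, cx0))
                h = max(0.0, min(y1, cy1) - max(y0, cy0))
                covered += w * h
            vec.append(min(1.0, covered / (cell * cell)))
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy front pages (hypothetical): two with a centered portrait,
# one with a single advertisement in the bottom-left corner.
page_a = layout_vector([(0.30, 0.20, 0.70, 0.60)])
page_b = layout_vector([(0.35, 0.25, 0.75, 0.65)])
page_c = layout_vector([(0.00, 0.75, 0.25, 1.00)])

sim_ab = cosine_similarity(page_a, page_b)
sim_ac = cosine_similarity(page_a, page_c)
print(sim_ab, sim_ac)
```

Under this representation, the two pages with a centered portrait score as more similar to each other than either does to the page with a corner advertisement. Pairwise similarities of this kind could then feed a clustering step over many pages or titles.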

I will conclude my talk by reflecting on possibilities for research at the intersection of machine learning, the digital humanities, and libraries.

Newspaper Navigator Resources:

Newspaper Navigator dataset: https://news-navigator.labs.loc.gov/

Newspaper Navigator search application: https://news-navigator.labs.loc.gov/search

Newspaper Navigator data archaeology: https://hcommons.org/deposits/item/hc:32415

Newspaper Navigator project description & other links: https://bcglee.github.io/newspaper-navigator.html

Bibliography

Barnhurst, K. G. and Nerone, J. (2002). The Form of News: A History. Guilford Press.

Cole, J. L. (2020). How the Other Half Laughs: The Comic Sensibility in American Culture, 1895-1920. University Press of Mississippi.

Fyfe, P. and Ge, Q. (2018). Image Analytics and the Nineteenth-Century Illustrated Newspaper. Journal of Cultural Analytics, 3(1). DOI: 10.22148/16.026.

Lee, B. C. G. (2021). Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset. Digital Humanities Quarterly, 15(4). DOI: http://www.digitalhumanities.org/dhq/vol/15/4/000578/000578.html.

Lee, B. C. G. (2022). The Digital Humanities and the Ladino Press: Using Machine Learning to Extract and Analyze Visual Content in Historic Ladino Newspapers. Jewish Studies in the Digital Age. De Gruyter Press.

Lee, B. C. G., Baco, J. O., Salter, S. H. and Casey, J. (2021). Navigating the Mise-en-Page: Interpretive Machine Learning Approaches to the Visual Layouts of Multi-Ethnic Periodicals. Computational Humanities Research Conference 2021. DOI: http://arxiv.org/abs/2109.01732.

Lee, B. C. G., Berson, I. R. and Berson, M. J. (2021). Machine Learning and the Social Studies. Social Education, 85(2). pp. 88-92. DOI: https://www.socialstudies.org/social-education/85/2/machine-learning-and-social-studies.

Lee, B. C. G., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K. and Weld, D. S. (2020). The Newspaper Navigator Dataset: Extracting Headlines and Visual Content from 16 Million Historic Newspaper Pages in Chronicling America. Proceedings of the 29th ACM International Conference on Information & Knowledge Management. (CIKM ’20). New York, NY, USA: Association for Computing Machinery, pp. 3055–62. DOI: 10.1145/3340531.3412767.

Lee, B. C. G. and Weld, D. S. (2020). Newspaper Navigator: Open Faceted Search for 1.5 Million Images. Adjunct Publication of the 33rd Annual ACM Symposium on User Interface Software and Technology. (UIST ’20 Adjunct). New York, NY, USA: Association for Computing Machinery, pp. 120–22. DOI: 10.1145/3379350.3416143.

Piper, A., Wellmon, C. and Cheriet, M. (2020). The Page Image: Towards a Visual History of Digital Documents. Book History, 23(1). Johns Hopkins University Press: 365–97. DOI: 10.1353/bh.2020.0010.

Wevers, M. and Smits, T. (2020). The Visual Digital Turn: Using Neural Networks to Study Historical Images. Digital Scholarship in the Humanities, 35(1): 194–207. DOI: 10.1093/llc/fqy085.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO