Supporting "Distant Reading" for Web Archives

paper, specified "long paper"
  1. 1. Jimmy Lin

    University of Maryland, College Park

  2. 2. Kari Kraus

    University of Maryland, College Park

  3. 3. Ricardo L. Punzalan

    University of Maryland, College Park

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In a recent essay on the stock footage libraries amassed by Hollywood studios in the first half of the 20th century, Rick Prelinger—moving image archivist at the Internet Archive—laments that “archives often seem like a first-aid kit or a rusty tool, resources that we find reassuring but rarely use” (Prelinger 2012). Although he doesn’t single them out by name, web archives are particularly vulnerable to this charge. User studies, access statistics, page views, and other metrics have in recent years told a consistent story: web content that has been harvested and preserved by collecting institutions, universities, and other organizations often lies fallow, and like Prelinger’s rusty tool may be notable more for its latent potential than for having served any real purpose (Hockx-Yu 2013; Kamps 2013; Huurdeman et al 2013). While the reasons for neglect are myriad, this paper focuses on one: the lack of tools to support a wide range of interactions with the content. We describe initiatives underway at the University of Maryland to partially redress the problem and highlight the need for qualitative user studies.

The Internet Archive’s Wayback Machine is perhaps the best-known and most widely available tool to browse captured content. Both the Internet Archive’s main public site and Archive-It, its subscription-based web archiving service, replicate the experience of viewing web pages on the live web, thus reifying a “close-reading” experience. First developed in the mid-1990s, the software came of age at the same time digital humanities scholars were building the first generation of web collections aimed at providing high-resolution digital facsimiles of literary and artistic works by Blake, Rossetti, Dickinson, Whitman, and others. The emphasis on accurate rendering and display is thus a hallmark of both the Wayback Machine and many early DH projects, the latter of which likewise self-identify as “archives,” albeit archives on a dramatically smaller scale.

Although the capabilities offered by the Internet Archive and other commercial services are significant, we believe considerable technical advances are needed if web archives are to fulfill their promise as tools of analysis as well as preservation. Within the field of DH, the big data vistas offered by scholars such as Matt Jockers and Ted Underwood provide both inspiration and models on which to base these efforts (Jockers 2013). Unlike the boutique digitization initiatives that characterize the early wave of DH archives of the 1990s and early 2000s, which were often devoted to the works of a single author, the new macroanalytic approaches are premised on mass-digitization of print heritage. The paradigm they embody, moreover, is not digitization in the service of verisimilitude—reproductions that show exact fidelity to their originals—but rather digitization that produces terabytes’ worth of intermediary copies that can be cleaned, normalized, segmented, tokenized, mined, and visualized to yield new insights about the cultural record writ large. Such a paradigm disrupts the usual data-information-knowledge continuum by taking the unitary wholes of creative expression—the “cooked” novels or poems or historical documents in print—and temporarily degrading them to a “raw” data state so that they can be analyzed at scale to make higher-order knowledge claims.

We believe that the technical infrastructure to support macroanalytics or “distant reading” on web archives today is inadequate. Existing tools were built before the coming of age of “big data” technologies and provide wobbly foundations on which to build analytical tools that scale to petabytes of data. As an example, the open-source Wayback Machine is implemented as a monolithic stack primarily designed to scale “up” on more powerful servers and expensive network-attached storage. Its architecture captures the ethos of “state-of-the-art” software engineering practices of the late 1990s. Not surprisingly, the field has advanced by leaps and bounds in the last decade and a half. In the 2000s, Google published a series of seminal papers describing solutions to its data management woes, which involve analyzing, indexing, and searching untold billions of web pages. Instead of scaling “up” on more powerful individual servers, the strategy entailed scaling “out” on clusters of commodity machines (Barroso et al., 2013). Before long, open-source implementations of these Google technologies were created, bringing the same massive data analytic capabilities to “the rest of us.” These systems form the foundation of what we know as “big data” today, and provide the backbone of data analytics infrastructure at Facebook, Twitter, LinkedIn, and many other organizations. Three key systems are:

The Hadoop Distributed File System (HDFS), which is a horizontally-scalable file system designed to store data on clusters of commodity severs in a fault-tolerant manner (Ghemawat et al. 2003). The largest known HDFS instance (by Facebook) holds over 100 petabytes.
Hadoop MapReduce, which is a simple yet expressive programming model for distributed computations that works in concert with data stored in HDFS (Dean and Ghemawat, 2004). MapReduce models analytical tasks in two distinct phases: a “map” phase where computations applied in parallel, followed by a “reduce” phase that aggregates partial results.
HBase, which is a distributed store for semi-structured data built on top of HDFS that allows low-latency random access to billions of records. Google’s Bigtable (Chang et al., 2006), from which HBase descended, powers Gmail, Google Maps, as well as the company’s indexing pipeline.
Modern big data technologies provide a technical path forward and an accompanying research agenda that does for web archives what macroanalytics or so-called “distant reading” has begun to do for digitized corpora in DH. As a first step in this effort, we are developing Warcbase, an open-source platform for storing, managing, and analyzing web archives built on the three technologies discussed above. The platform provides a flexible data model for organizing web content as well as metadata and extracted knowledge. We have built a prototype application that provides functionality comparable to the Wayback Machine in allowing users to browse different versions of resources in a web archive (typically as WARC or ARC files). Since Warcbase takes advantage of proven open-source technologies, we are confident of the infrastructure’s ability to scale in a seamless and cost-effective manner.

Yet Warcbase is only the beginning. We believe that our prototype—and, more generally, the technologies described above—will provide new capabilities that support innovative uses of web archives. Responsive full-text search on massive collections of web pages, one of the first items on a scholarly wishlist, is within reach: the tools exist in various open-source projects, awaiting integration. Longitudinal analyses of web pages such as tracking the frequency of person or place names become possible if we integrate off-the-shelf natural language processing tools. Yet another possibility is topic modeling on a massive scale; a separate project at the University of Maryland has built Mr.LDA, an open-source Hadoop toolkit for scalable topic modeling (Zhai et al., 2012). To provide a hint of what’s possible, we have been working with Congressional archives from the Library of Congress to explore topic modeling and large-scale visualizations of archived content, the results of which we will share during the conference presentation.

Why are large web archives so underused? It is surely not due to a lack of culturally significant material. Valuable content, ripe for exploration, ranges across topics such as electronic literature, alternate reality games, digital tools for human rights awareness, the Arab Spring uprising, and Russian parliamentary elections, to name just a few. Restrictive access regimes are partially to blame, but that alone does not provide a sufficient explanation. We believe that the issue, to a large extent, is a technological form of circular reasoning: scholars do little because the right tools don’t exist, and tool builders are hesitant to build for non-existent needs and users. Progress is necessary to understand the essential activities, methods, and questions of researchers. Interviews with current web archive users are a start, but breakthroughs will require deep collaborations between scholars and technologists. The end goal is a comprehensive set of tools for researchers in the digital humanities and beyond to analyze and explore our digital cultural heritage.

Archive-It: Web Archiving Services for Libraries and Archives.

Barroso, Luiz Andre, Jimmy Clidaras, and Urs Holzle. (2013). The datacenter as a computer: an introduction to the design of warehouse-scale machines (second edition). Morgan & Claypool Publishers.

Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. (2006). “Bigtable: A distributed storage system for structured data.”Proceedings of the 7th USENIX Symposium on Operating System Design and Implementation (OSDI).

Dean, Jeffrey and Sanjay Ghemawat. (2004). “MapReduce: Simplified data processing on large clusters.”Proceedings of the 6th USENIX Symposium on Operating System Design and Implementation (OSDI).

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. (2003). “The Google File System.”Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP).

Hockx-Yu, Helen. (15 February 2013) “Scholarly use of web archives.”

Huurdeman, Hugo, et al. (2013). “Sprint methods for web archive research.”WebSci 2013 Proceedings of the 5th Annual ACM Web Science Conference:182-190.

Jockers, Matthew. (2013). Macroanalysis: digital methods and literary history. Urbana-Champaign: University of Illinois P.

Kamps, Jaap. (1 August 2013). “When search becomes research and research becomes search.”SIGIR'13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH). Dublin, Ireland.

Moretti, Franco. (2013). Distant reading. London: Verso.

Moretti, Franco. (2007). Graphs, maps, trees: Abstract models for literary history. London; New York: Verso.

Prelinger, Rick. (2012). “Driving through Bunker Hill.” In Kraus, K. and Levi, A. (Eds.). Rough Cuts: Media and Design in Process. MediaCommons: The New Everyday.

Zhai, Ke, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja. (2012). “Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce.” Proceedings of the 21th International World Wide Web Conference (WWW).

The term “distant reading” was coined by Franco Moretti in Graphs, Maps, Trees (2007) and has undergone further elaboration in his newest book (2013). See the “Works Cited” section for full bibliographic information.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from (needs to replace plaintext)

Conference website:

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO