Library of Congress, United States of America
Innovation is sparked by the absence of access to a service or resource, or, in the case of libraries, by the absence of access itself. As a core value of librarianship, access stands paramount to the mission of libraries, and so does the necessity to innovate when a gap in access appears. The Web Archiving Team at the Library of Congress faced such a gap when we stood at the edge of our digital cliff and peered into the three petabytes (PB) of web archives swimming in our data lake. In this talk we introduce the ways in which we bridged that gap by publicly providing basic but rich technical metadata about web resources in the form of crawl indexes (CDXs). First, we briefly describe the methodology behind the Library's harvesting practices and the difficulties it creates in presenting data in bulk to researchers. Then, we touch on the technical aspects: our approach, research, tools used, and results. Finally, we discuss the impact of this work on the digital humanities community and invite researchers to experiment for themselves.
The Library of Congress web archives are organized among 80 thematic and event-based collections, and contain websites representing a broad range of subjects, languages, file formats, and topic areas, with a mix of crawling and access permissions based on the country of publication and type of website. Areas such as government, non-profit and for-profit organizations, journalism and news, and creative sites are collected from the United States and throughout the world. Library subject specialists actively maintain and continually refine collections such as the Indian Political and Social Issues Web Archive, which has been collecting content in English, Hindi, Urdu, Marathi, and Bengali for two years, and the United States Elections Web Archive, which has been collecting campaign material in English and Spanish during every national election season since 2000.
Although the archive's contents are nominated, cataloged, and made available on loc.gov according to the event and thematic collections, the harvesting takes a different form. Crawls are performed weekly, monthly, and quarterly, and each crawl acts as a bucket for any seeds in the collections set to crawl at that particular frequency. Harvesting in this 'bucket' style allows the Web Archiving Team to streamline many aspects of crawling more than 14,000 seeds (or 'websites') at any given time, and allows harvesting to reflect each website's frequency of change.
This also means that the container files (Web ARChive, or 'WARC', files) holding the harvested web objects were collected 'bucket' style and may contain objects from 100 different websites, representing 20 different collections with varying access permissions, in a single container file. Herein lies the difficulty with providing bulk data to researchers. Even after the Web Archiving Team received access to cloud services in 2018 and could experiment with access at scale, WARCs still could not be presented to researchers as-received because of the mixed permissions. We looked to existing tools and organizations for inspiration to find a path that would work for us. We ultimately adapted work from Common Crawl's public GitHub repositories (https://github.com/commoncrawl/cc-pyspark, https://github.com/commoncrawl/cc-mrjob). We also took cues from Archives Unleashed (Ruest et al., 2021) and ArchiveSpark (Holzmann et al., 2016) by utilizing EMR (Elastic MapReduce) and Spark to process the CDXs.
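To make the mixed-permissions problem concrete, the sketch below uses the open-source warcio library to list the distinct hosts captured inside a single container file. The file name is hypothetical and the snippet is an illustration of the problem, not the Library's production tooling.

```python
# Sketch: enumerate the distinct hosts inside one WARC container file,
# illustrating how a single 'bucket' style WARC mixes many websites.
# Assumes the warcio library; the file name is hypothetical.
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

hosts = set()
with open('example-crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            uri = record.rec_headers.get_header('WARC-Target-URI')
            if uri:
                hosts.add(urlparse(uri).netloc)

# A single container may hold captures from 100 or more hosts, each
# potentially under different access permissions.
print(f'{len(hosts)} distinct hosts in this WARC')
```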
The high-level process is simple: transform the CDXs into a format better suited to EMR and data processing, then query to reduce down to the desired output. The idea is modeled after Common Crawl's approach, in which they transform their CDXs into Parquet format for better compression and easier consumption into the DataFrame objects that are part and parcel of Spark and similar big data frameworks.
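As a minimal sketch of that two-step flow, the PySpark snippet below converts space-delimited CDX lines to Parquet and then filters the result. The column layout assumes the common 11-field CDX format, and all paths and bucket names are hypothetical.

```python
# Sketch: convert space-delimited CDX lines to Parquet, then query with
# Spark. Column layout assumes the common 11-field CDX format; the S3
# paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cdx-to-parquet').getOrCreate()

columns = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode',
           'digest', 'redirect', 'metatags', 'length', 'offset', 'filename']

# Step 1: transform raw CDX text into a typed, compressed Parquet table.
cdx = spark.read.csv('s3://bucket/cdx/*.cdx.gz', sep=' ').toDF(*columns)
cdx.write.mode('overwrite').parquet('s3://bucket/cdx-parquet/')

# Step 2: reduce to the desired output, e.g. all PDF captures from
# .gov hosts (urlkey is in SURT form, so .gov domains start with 'gov,').
parquet = spark.read.parquet('s3://bucket/cdx-parquet/')
subset = parquet.filter((parquet.mimetype == 'application/pdf') &
                        parquet.urlkey.startswith('gov,'))
subset.write.csv('s3://bucket/datasets/gov-pdfs/', header=True)
```

Because Parquet is columnar, a filter like the one above reads only the columns it touches, which is what makes reducing billions of index rows down to a small dataset tractable on EMR.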
The ability to cleanly excise a part of the archive, represented by its metadata, is a huge step forward in providing bulk access to the archive. After making the first datasets publicly available (https://labs.loc.gov/work/experiments/webarchive-datasets/), we have seen use by students, information professionals, and researchers. Researchers use the metadata to load specific archive captures from the Library's Wayback Machine instance and 'rehydrate' the text by scraping the archived captures to create a text-based dataset of their own. While this method requires some technical skill and processing power, we see countless opportunities to create web archive-based datasets for any discipline, given the Library's broad collecting range, and look forward to fielding researcher requests.
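For readers who want to try the 'rehydration' pattern themselves, here is a hedged sketch using the requests and BeautifulSoup libraries. The replay URL layout is assumed to follow the usual Wayback Machine pattern, and the timestamp and URL values are placeholders that would normally come from rows in the downloaded dataset.

```python
# Sketch: 'rehydrate' text from an archived capture listed in a
# CDX-derived dataset. The replay endpoint pattern and the timestamp/URL
# values are illustrative placeholders, not confirmed specifics.
import requests
from bs4 import BeautifulSoup

WAYBACK = 'https://webarchive.loc.gov/all'   # assumed replay endpoint
timestamp = '20201103000000'                 # from a CDX row in the dataset
original = 'https://example.gov/page.html'   # from a CDX row in the dataset

resp = requests.get(f'{WAYBACK}/{timestamp}/{original}', timeout=30)
resp.raise_for_status()

# Strip markup to recover the page text for a text-based dataset.
text = BeautifulSoup(resp.text, 'html.parser').get_text(separator=' ',
                                                        strip=True)
print(text[:500])
```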
Bibliography
Holzmann, H., Goel, V. and Anand, A. (2016). ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL '16: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries. Newark, NJ, pp. 83-92.
Ruest, N., Fritz, S., Deschamps, R., et al. (2021). From archive to analysis: accessing web archives at scale through a cloud-based interface. International Journal of Digital Humanities, 2: 5-24.
Tokyo, Japan
July 25, 2022 - July 29, 2022
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Series: ADHO (16)
Organizers: ADHO