Curating Just-In-Time Datasets from the Web

paper, specified "long paper"
  1. 1. Todd Suomela

    University of Alberta

  2. 2. Geoffrey Rockwell

    University of Alberta

  3. 3. Ryan Chartier

    University of Alberta

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The Web is now deeply integrated into contemporary culture, and scholars interested in current phenomenon cannot afford to ignore it, however, collecting data from the web is not easy. The web is based on a mix of continually changing technical standards which make creating an archival copy of a web site for scholarly reference very difficult. Without such a copy there is no way for future scholars to question the interpretations we make today or reinterpret the phenomenon in light of new evidence tomorrow.
Researchers and organizations, such as the Internet Archive, are attempting to preserve portions of the web for future retrieval, but much of the web disappears quickly. A 2014 study of web links in scholarly papers found that 1 out 5 scholarly papers contains links to web URLs which no longer function or no longer exist (Klein et al., 2014). The need for humanists to recognize the value of web archives to historical research is especially acute. Researchers cannot engage recent cultural histories and ignore the culture of the web (Milligan, 2012).
The challenge of web archiving is especially acute for rapidly changing stories which track specific events. Discussions about controversial topics, such as GamerGate,

This paper is the result of a larger project investigating the discourse surrounding GamerGate, an internet controversy about feminism and gaming, which grew dramatically in 2014. The paper presents some of the methods used by our research group to study GamerGate. For a brief non-academic explanation of GamerGate see Hathaway, 2014
take place across multiple websites, use multiple forms of media, and occur in very different discourse communities. Underneath the different social worlds gathered around online communities there is an incredibly diverse set of technological platforms which require customized strategies for tracking and collecting data. This paper will:

Describe the key challenges for researchers wishing to collect just-in-time archives of web based cultural phenomenon.
Put current challenges in an historical context of differing goals between web developers and web archivists.
Propose some social and technical solutions to improve the situation, and
Introduce a set of tools to help researchers engaged in these areas.

Dynamic changes in online content present one of the unique challenges for gathering contemporary web discourse. Most internet users are familiar with the constantly updating nature of Twitter and other social media platforms. Social media platforms present a challenge for web archivists because of their technological structure and commercial ownership. The speed of updates on social media requires specialized tools to download, especially in large quantities. A researcher needs deep technological knowledge of these tools and the application programming interface (API) provided by the website in order to build a reliable and useful corpus. On the legal side, the terms of service affect the types of information that can be gathered by researchers and how that data can be analyzed or shared with other researchers.
Other commercial sites, such as news media web sites, often host comment threads where internet users can post their opinions on the topics covered in the main story. It is relatively trivial to download the main content of a news story posted on the web but collecting the comment stream may present a challenge because the comments may be hosted by another website service or may be displayed dynamically as a user scrolls further through a web page. In such cases the default web archiving tools may not be sufficient. Web discussion forums present yet another technical challenge.
Researchers often frame their questions about web phenomena by describing a topic that they wish to study. But the architecture of the web is built around the key idea of a web site, a particular set of files which may include many different types of media including text, images, and video, and is hosted by a particular business, institution, or individual. These web sites are identified by the Uniform Resource Locators people type into their browsers in order to navigate to a web page.
The tools used to archive the web are built on this technical background for dealing with URLs, APIs, REST, RSS, and other interfaces which human beings do not usually interact with. In the language of web archive software the unit of research is the seed, or base URL, from which data can be harvested. For the researcher the unit of work is the topic. Negotiating between these two conceptions of how online research should work is a major social challenge for any type of internet research. Researchers and web users just want to see the content, but automating the collection of that content means reproducing a complicated software experience which has gradually been built by web developers and web browsers over the past 25 years.
Humanities researchers have traditionally relied on stable or slowly changing content. Efforts by humanities scholars have been made to adapt to the changes in digital content represented by the web. Some universities have set up web labs for collecting and analyzing web data. One key task of these labs has been building subcollections from the overall web in order to further the study of particular topics (Arms et al., 2009). One of the key insights from our work is the need to continue building strong collaborations between multiple fields. Libraries and the Internet Archive need input from digital humanists in order to understand their research questions and digital humanists need to understand the technological challenges of web archiving in order to collectively design systems which will help future researchers. The web, however, is constantly changing at multiple levels, ranging from the technology used to deliver content, the processes of creating content, commenting on content, and the distribution of information. Archiving the web for humanities research calls for changing the conceptual image of stable sources, collaborating with new communities, and adopting new technologies.

The implication of the technological treadmill described above is that it becomes more and more difficult for a single researcher to adequately collect the web. There are two potential solutions to this problem: technological and social.
Computer scientists are working to build better web archive software which can integrate with social media in order to reduce the amount of administrative overhead needed to collect information on particular topics.

Some of these research groups are located at the
Center for the Study of Digital Libraries at Texas A and M;
Web Science and Digital Libraries Research Group at Old Dominion; and the
Digital Library Research Laboratory at Virginia Tech.

These tools will automate the selection of web sites to be archived, removing some of the human intervention needed to curate web materials. But simplifying the data gathering process today may make future explanations of the context of a collection more technical. For now researchers are dependent upon a mix of tools, often customized for specific uses, and mixing open source and commercial software.

In our research project we used a combination of open source tools, subscription services, and customized API calls. For gathering data from Twitter we used a program called twarc.

Github repository at
Customizations were made to improve the performance of the tool for our uses, which was tracking specific hashtags. The Twitter scraper was initially installed on a laptop belonging to a member of the research team, but when the number of tweets became too large for a laptop the program was moved to a cloud server provided by Compute Canada. The data from Twitter was stored in JSON and then transformed using standard libraries into files which could be analyzed for most frequent Twitter posters, most frequent URLs, and most frequent hashtags. Data from web pages was collected using the Archive-IT subscription service

Web site
provided by the Internet Archive and the wget

Web site
command line tool. Some specific websites, such as 4chan and 8chan, required the development of custom API interfaces to download material from relevant chat boards. Additional programs to download comments from YouTube are currently being tested. We plan to document our recipes for using these different tools on,

Web site
the methods commons for text analysis.

Archiving the web involves many different institutions and disciplines. The largest players are the Internet Archive and various national libraries; the Internet Archive operates as a non-profit and has the most comprehensive collection of digital materials from the web. Unfortunately, the Internet Archive collections are not primarily built for researcher access and can be especially difficult to work with if you are investigating topics which cover multiple URLs or lengthy time periods. Any research project using their collections requires significant human labor. Libraries and museums can step in to fill some of the gaps by using services such as Archive-IT, which provides more curatorial control over the collection development process and also has a more robust search interface. In order to improve these tools, humanists will need to build connections with other disciplines, such as information and library science, computer science, and archival studies. Only by working together and extending our disciplinary horizons can we build the collections which current and future digital humanists can use to study our current era.
One final social issue of importance are the legal and ethical implications of gathering large amounts of data from the web. We will not discuss these issues in great depth in this paper but they do need to be acknowledged because they constrain some of the actions which can be taken in web data gathering. In our project on GamerGate we have looked closely at the ethical implications of sharing data gathered from social media. The dataset we shared online includes an appendix on ethical issues related to data sharing.

Rockwell, G., Suomela, T., 2015, "Gamergate Reactions", V5 [Version]


Arms, W.Y., Calimlim, M. and Walle, L. (2009). EScience in Practice: Lessons from the Cornell Web Lab.
D-Lib Magazine, 15(5/6). doi:10.1045/may2009-arms

Hathaway, J. (2014). What is GamerGate, and Why? An Explainer for Non-Geeks.

Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K. and Tobin, R. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot.
PLoS ONE, 9(12): doi:10.1371/journal.pone.0115253

Milligan, I. (2012). Mining the “Internet Graveyard”: Rethinking the Historians’ Toolkit.
Journal of the Canadian Historical Association, 23(2): 21-64. doi:10.7202/1015788ar

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2016
"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

454 works by 1072 authors indexed

Conference website:

Series: ADHO (11)

Organizers: ADHO