Leveraging Web Archiving Tools for Digital Humanities Research and Digital Exhibition

workshop / tutorial
Authorship
  1. Scott Brian Reed

    Internet Archive

Work text

Summary:
Web archiving is an important part of the digital preservation field. While most people are familiar with the Wayback Machine available at archive.org, fewer are aware that a number of tools and services have been developed for organizations and individuals to create their own web archives, including the capability to search and analyze large data sets built around the WARC file format, an ISO standard for web archiving. In addition, web archives provide permanent URLs for citation and can show how a website has changed over time at a single URL, even if it is no longer available on the live web. In short, web archives can save a researcher’s life and give archivists essential preservation tools for managing content that is only posted on the web.
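
As a rough illustration of how a permanent, citable URL relates to a capture, the sketch below builds and parses Wayback Machine capture URLs of the form https://web.archive.org/web/20161227182033/https://dh2014.org/program/. It assumes the 14-digit segment is a YYYYMMDDhhmmss timestamp in UTC; the function names are hypothetical and are not part of any Internet Archive library.

```python
import re
from datetime import datetime, timezone

# Assumed URL layout: <prefix><14-digit UTC timestamp>/<original URL>
WAYBACK_PREFIX = "https://web.archive.org/web/"
_CAPTURE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def wayback_url(original_url, captured_at):
    """Build a capture URL from the original URL and a capture time."""
    stamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_url}"


def parse_wayback_url(capture_url):
    """Split a capture URL into (original URL, capture time in UTC)."""
    match = _CAPTURE.match(capture_url)
    if match is None:
        raise ValueError(f"not a timestamped capture URL: {capture_url}")
    captured_at = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return match.group(2), captured_at.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    cited = "https://web.archive.org/web/20161227182033/https://dh2014.org/program/"
    original, when = parse_wayback_url(cited)
    print(original, when.isoformat())
```

Keeping the timestamp in the citation is what lets a reader see the page as it was on that date, independent of later changes to the live site.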

This workshop will introduce participants (15-20) to basic web archiving concepts and challenges. Using the Archive-It (www.archive-it.org) web application, participants will have a hands-on opportunity to build a collection of content archived from the web, which can include their own organization’s web presence, social media, digital exhibitions, data sets, or topical content publicly available on the web. Following the workshop, participants will have a searchable archive available to them, including the option of downloading WARC files for long-term preservation or research.

The target audience for this workshop includes humanities scholars researching the web and professionals responsible for digital library services or digital archives. No prior knowledge of or experience with web archives is necessary, and the session does not require any programming or advanced technical knowledge of the web. The workshop will not be oriented toward those with deep knowledge of web archives or the WARC format. Time could, however, be allotted to a demonstration of another web archiving tool or project related to digital humanities and web archiving; any such interest should be specified in the CFP (see below).

CFP:
In order to make the most of the 3-hour workshop and ensure that the curriculum is tailored to participant interests, a CFP will be requested from all interested persons. It should include:

Description of participant research interest or professional projects.
Description of prior experience with using web archives or their own web archiving (if applicable).
5 to 10 websites to be archived as part of 1 or more collections of content, and links to their robots.txt files if applicable. More information is here: https://webarchive.jira.com/wiki/display/ARIH/Robots+Exclusion+Protocol#RobotsExclusionProtocol-Whatistherobotsexclusionprotocol?
With permission from participants, URLs will be crawled as a test (no data archived) prior to the workshop so that post-crawl reports can be analyzed as part of the workshop curriculum.
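
Participants who want to see ahead of time how a robots.txt file might affect a crawl can check it with a short script. The sketch below uses Python's standard urllib.robotparser module; the sample rules, the crawler name "examplebot", and the example.org URLs are made up for illustration, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("anybot", "https://example.org/index.html"),
        ("anybot", "https://example.org/private/page.html"),
        ("examplebot", "https://example.org/index.html"),
    ]:
        print(agent, url, allowed_by_robots(SAMPLE_ROBOTS_TXT, agent, url))
```

Seeds that are blocked for the crawler in question are worth flagging in the CFP so they can be discussed alongside the test crawl reports.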

The CFP process is not intended for competitive review but to ensure relevancy and preparedness of participants. It should be received at least 2 weeks before the workshop. CFPs will be reviewed by the instructor and Kristine Hannah, Director of Web Archiving Services at the Internet Archive.

Cost and Equipment Required:
There will be no additional costs associated with the workshop. It will require a meeting room with wireless internet and a projector with screen. Participants will bring their own Wi-Fi-enabled laptop computers, and there should be sufficient power outlets.

Agenda Outline:
15 minutes

Welcome and participant introductions, including an overview of content being archived

25 minutes

Overview of web archiving, including:

history of web archiving
common software tools (Heritrix, Wayback)
overview of the WARC file format
overview of Internet Archive and Archive-It

60 minutes

Hands-on Archive-It web application training

creating a collection
adding and scheduling seeds to be crawled
starting and monitoring crawls
modifying the scope of the crawl
understanding robots.txt
analyzing post-crawl reports
quality assurance, including addressing web archiving limitations and challenges

30 minutes

Other Web Archiving tools (including WAIL)

30 minutes

Understanding the web archiving life cycle, including:

open source tools for researching WARC files
future steps for web archiving
sharing and reporting on future research projects utilizing web archives (group activity/share)

20 minutes

Flex time, conversation, and breaks
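
For readers curious about what the WARC overview in the agenda covers, the following is a minimal sketch of reading records from an uncompressed WARC stream using only the Python standard library. It assumes the basic record layout (a "WARC/" version line, "Name: value" header lines, a blank line, then Content-Length bytes of content followed by blank lines). The helper names and the sample record are hypothetical. WARC files in practice are often gzip-compressed, and a dedicated library such as warcio is likely a better choice for real research use.

```python
import io


def read_warc_records(stream):
    """Yield (headers, content) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        content = stream.read(int(headers["Content-Length"]))
        yield headers, content


def build_sample_record(target_uri, body):
    """Build a made-up 'resource' record for demonstration only."""
    lines = [
        b"WARC/1.0",
        b"WARC-Type: resource",
        b"WARC-Target-URI: " + target_uri.encode("utf-8"),
        b"WARC-Date: 2014-07-07T00:00:00Z",
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>",
        b"Content-Type: text/plain",
        b"Content-Length: " + str(len(body)).encode("ascii"),
    ]
    return b"\r\n".join(lines) + b"\r\n\r\n" + body + b"\r\n\r\n"


if __name__ == "__main__":
    data = build_sample_record("https://example.org/", b"hello web archive")
    for headers, content in read_warc_records(io.BytesIO(data)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(content))
```

Because each record carries its own length and metadata, records can be read one after another without loading the whole file, which is part of what makes large collections practical to analyze.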

About the instructor:
Scott Reed has worked as a Partner Specialist with the Internet Archive since 2012, primarily supporting organizations and researchers using Archive-It to build collections of web content. In addition, he is a volunteer with the GLBT Historical Society in San Francisco, CA. Prior to his work with the Internet Archive, he worked in various positions as a digital literacy and media instructor and project assistant for non-profit organizations and academic departments in California, including the Feminist Studies department of the University of California, Santa Cruz.

Conference Info

Complete

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None