This workshop introduces Hyphe, a curation-oriented web crawler. Developed with and for Social Sciences and Humanities scholars, the software aims to provide both a method and a tool for building a research corpus from web content (web pages and hyperlinks). It wraps a web mining tool in a user interface and adds the curation features these scholars need: defining aggregates of web pages, filtering content, and expanding the corpus.
We will focus on using the web crawler and will not take the time to present web studies, digital sociology or digital methods in general. Participants should have basic knowledge of the web and already consider it a legitimate field for scientific investigation (Ackland, 2013). Participants are encouraged to come with ideas about which websites would be interesting to study for their own research agenda, ideally as a list of entry points.
Hyphe, a curation-oriented approach to web crawling for the social sciences
The web is a field of investigation for the social sciences, and platform-based studies have long proven their relevance. However, the generic web is rarely studied in and of itself, even though it contains crucial embodiments of social actors: personal blogs, institutional websites, hobby-specific media… We realized that some sociologists, though willing to study the broad web, see existing web crawlers as “black boxes” unsuitable for research. Hyphe is a crawler developed with and for social scientists. Its innovative “curation-oriented” approach is meant to address two of the main problems social scientists face when working with web mining: how to build a corpus and how to delineate an actor’s presence (Jacomy et al., 2016).
The workshop will first introduce Hyphe’s software and methodological principles through a guided case study. Participants will then be guided through their first use of Hyphe to build their own web corpus.
Part 1: Presentation of methodological approaches with a case study
We will start the workshop with a presentation of our software Hyphe and its methodological principles, applied to a case study. We propose to map the Digital Humanities communities through the many websites used to present and organise associations, conferences, research projects, research labs… We will build this corpus live during the first part to introduce participants to the main concepts and practical steps involved in building a web corpus. The instructors will have prepared the corpus before the workshop so that it covers a series of the most common use cases and issues. We chose digital humanities as the subject for two reasons: these communities make heavy use of web communication, and participants will engage more readily with a subject they are familiar with.
Part 2: Hyphe practice
After this extensive presentation of Hyphe, participants will be invited to practise themselves. Individually or in pairs, they will be given access to their own corpus on an online version of Hyphe and will be invited to map the web communities around their research subject, following Hyphe’s iterative curation process:
define the web “boundaries” of the first actors and start crawling them
observe the resulting network of actors (websites)
prospect the web for other potentially interesting actors by exploring the most linked actors, filtering out irrelevant ones such as Google or YouTube
crawl these actors’ websites as well
adjust the “boundaries” of the newly found actors to better represent their social reality
iterate until the corpus is reasonably complete
visualize the corpus as a network map and take a quick look at its structural properties (clusters, density…)
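The steps above can be sketched in a few lines of Python. This is only an illustration of the iterative logic, not Hyphe's actual implementation or API. The toy web, the function names (`entity_of`, `crawl`, `curate`, `density`) and the rule that an actor's boundary is a URL prefix are all assumptions made for the example. The human judgment involved in adjusting boundaries and deciding which actors are relevant is reduced here to a longest-prefix match, an exclusion list and an in-link threshold.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical toy "web": page URL -> URLs it links to (stand-in for a real crawl).
TOY_WEB = {
    "http://lab-a.example/": ["http://assoc-b.example/about", "http://video.example/w1"],
    "http://lab-a.example/team": ["http://conf-c.example/2017"],
    "http://assoc-b.example/about": ["http://conf-c.example/2017", "http://video.example/w2"],
    "http://conf-c.example/2017": ["http://lab-a.example/"],
}


def entity_of(url, boundaries):
    """Map a URL to an actor: the longest matching boundary prefix, else its host."""
    matches = [b for b in boundaries if url.startswith(b)]
    if matches:
        return max(matches, key=len)
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def crawl(boundary, web):
    """Return (source, target) page links for all pages inside one boundary."""
    return [(src, dst) for src, dsts in web.items()
            if src.startswith(boundary) for dst in dsts]


def curate(seeds, web, excluded, min_inlinks=1, max_rounds=10):
    """Grow a corpus of actors iteratively; return (corpus, actor-level links)."""
    corpus, crawled, edges = set(seeds), set(), set()
    for _ in range(max_rounds):
        # Crawl every actor in the corpus that has not been crawled yet.
        for boundary in sorted(corpus - crawled):
            for src, dst in crawl(boundary, web):
                a, b = entity_of(src, corpus), entity_of(dst, corpus)
                if a != b:
                    edges.add((a, b))
            crawled.add(boundary)
        # Prospect: rank discovered actors by how many corpus actors link to them,
        # skipping those judged irrelevant (the exclusion list).
        inlinks = Counter(b for _, b in edges
                          if b not in corpus and b not in excluded)
        new = {actor for actor, n in inlinks.items() if n >= min_inlinks}
        if not new:
            break  # nothing left to add: the corpus is considered complete
        corpus |= new
    kept = {(a, b) for a, b in edges if a in corpus and b in corpus}
    return corpus, kept


def density(nodes, edges):
    """Directed graph density: observed links over possible links."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


corpus, edges = curate({"http://lab-a.example/"}, TOY_WEB,
                       excluded={"http://video.example/"})
print(sorted(corpus))
print(round(density(corpus, edges), 2))
```

Starting from a single seed, the loop discovers the two other actors through their in-links, ignores the excluded video platform, and stops once a round brings no new candidates. In practice each of these decisions is made by the researcher, which is the point of a curation-oriented approach.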
We will conclude this part, and wrap up the workshop, with a discussion on methodologies.
This workshop is supported by DIME-WEB, part of the DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).
Ackland, R. (2013). Web Social Science: Concepts, Data and Tools for Social Scientists in the Digital Age. SAGE.
Jacomy, M., et al. (2016). Hyphe, a Curation-Oriented Approach to Web Crawling for the Social Sciences. Cologne, Germany: AAAI. https://spire.sciencespo.fr/hdl:/2441/6obemb2hsj9pboj9bbvc7sftne