No comment : Addressing comment sections in web analysis

Paul Guille-Escuret; Florian Cafiero; Jeremy Ward

Authorship

1. Paul Guille-Escuret

VITROME - Aix-Marseille University, CNRS (Centre national de la recherche scientifique), Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

2. Florian Cafiero

CNRS (Centre national de la recherche scientifique), Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

3. Jeremy Ward

VITROME - Aix-Marseille University, CNRS (Centre national de la recherche scientifique), Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Work text


We provide an R-based method for extracting the comment sections of a website. Their contents can bias the analysis of a corpus, but can also be interesting to study in their own right.

Studying corpora of websites, through methods such as topic modeling or hyperlink analysis, is an increasingly common approach in the humanities (e.g. Severo et al., 2018; Romele et al., 2016; Berthelot et al., 2016), information science (e.g. Bounegru et al., 2017) and social science (Marres, 2015; Froio, 2018). Yet one part of their content is very often neglected: the comments section.

The biases induced by leaving the comments section in

Comment sections can induce many biases in the analyses, especially when the corpus consists of websites focusing on controversial topics. Comments can express a point of view radically different from that of the page itself. Hyperlinks in the comments can point to content that the owner of the website does not endorse, which can distort any network analysis. The vocabulary used in the comments can also bias content analyses such as topic modeling. It is thus essential either to eliminate these comments or to keep them for a separate analysis. We illustrate this with a case study.

Separating the comments from the page: a tedious task

Removing or extracting the comment sections from a set of websites is a tedious task, and is therefore rarely performed. Many languages and technologies can be used to build a page: HTML 4.0 or 5, XHTML, Ajax, Ruby on Rails, etc. Some standards do exist, for instance on blog platforms, but they are not widely adopted. Unexpected ways of implementing a comment section (e.g. treating it as a subpart of a forum) also occur frequently.

Aiming at exhaustiveness: a necessity

Focusing only on the easily retrievable comment sections would induce significant biases. The way a comment section is encoded is itself a socially induced phenomenon, reflecting the user's literacy in web programming or their financial means. Excluding very poorly encoded pages, or virtuoso content written by expert programmers, could thus amount to excluding specific groups from any further analysis.

A method for extracting comments

The method we propose is not fully automated: it requires the analyst to identify directly, in the page code, the patterns that delimit comment sections and the comments themselves. Some patterns are relevant for many websites, while others need to be carefully designed for a single site. We then provide an R implementation that carries out the rest of the procedure: after automated quality checks and potential improvements, the links and content coming from comments are subtracted, and the comment-free pages can be analyzed. The comment sections themselves can be extracted for a separate analysis.
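The pattern-based separation described above can be illustrated with a minimal sketch. This is not the authors' R implementation, which is not reproduced in this abstract; it is an independent illustration written in Python using only the standard library. The names `CommentSplitter`, `split_comments`, `EXAMPLE_HTML` and `EXAMPLE_PATTERNS` are hypothetical, as is the example pattern (a `div` whose `id` contains "comment"). The sketch assumes well-nested markup and omits the quality checks mentioned above, so it would need adapting for poorly encoded pages.

```python
import re
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class CommentSplitter(HTMLParser):
    """Route a page's text and links to either 'page' or 'comments'."""

    def __init__(self, patterns):
        super().__init__(convert_charrefs=True)
        # patterns: hand-written (tag, attribute, compiled regex) triples;
        # a start tag matching any triple opens a comment container.
        self.patterns = patterns
        self.depth = 0  # > 0 while the parser is inside a comment container
        self.page = {"text": [], "links": []}
        self.comments = {"text": [], "links": []}

    def _opens_comment_section(self, tag, attrs):
        attr_map = dict(attrs)
        for p_tag, p_attr, p_regex in self.patterns:
            if tag == p_tag and p_regex.search(attr_map.get(p_attr) or ""):
                return True
        return False

    def handle_starttag(self, tag, attrs):
        target = self.comments if self.depth else self.page
        if self.depth == 0 and self._opens_comment_section(tag, attrs):
            self.depth = 1  # entering the comment container
            return
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                target["links"].append(href)
        if self.depth and tag not in VOID_TAGS:
            self.depth += 1  # nested element inside the comment container

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1  # reaches 0 when the container itself closes

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.comments if self.depth else self.page
            target["text"].append(text)


def split_comments(html, patterns):
    """Return (page, comments), each a dict with 'text' and 'links' lists."""
    parser = CommentSplitter(patterns)
    parser.feed(html)
    parser.close()
    return parser.page, parser.comments


EXAMPLE_HTML = """
<html><body>
<div class="post"><p>Main article text.</p>
<a href="https://example.org/source">source</a></div>
<div id="comments-area">
  <div class="comment"><p>I disagree entirely!<br>See
  <a href="https://example.net/rebuttal">this</a></p></div>
</div>
<p>Footer text</p>
</body></html>
"""

# Hypothetical pattern for one site: a <div> whose id contains "comment".
EXAMPLE_PATTERNS = [("div", "id", re.compile(r"comment"))]

page, comments = split_comments(EXAMPLE_HTML, EXAMPLE_PATTERNS)
print(page["links"])      # links to keep for the network analysis
print(comments["links"])  # links set aside with the comments
```

Keeping the comment text and links in a separate structure, rather than discarding them, mirrors the two uses named in the abstract: the comment-free page feeds the main analysis, and the comments remain available for a separate one.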

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ADHO - 2020

"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO
