Using digitized newspaper archives to investigate identity formation in long-term public discourse

  1. Hieke Huistra

    Utrecht University

  2. Toine Pieters

    Utrecht University

This paper analyzes how digitized newspaper databases can be used in historical research on identity formation in public discourse. It discusses a new semantic text mining tool, Texcavator, which is currently being developed in the Dutch research program Translantis: Digital Humanities Approaches to Reference Cultures.1 The paper presents a case study that combines the Texcavator tool with the publicly available Delpher2 and with traditional historical methods in order to analyze identity formation of health risk groups in Dutch public discourse in the twentieth century. In particular, it focuses on the construction of the identity of people with excess body weight. Although the case is built around the Texcavator and Delpher mining tools and the newspaper database of the Dutch national library, the paper aims to investigate techniques to combine close and distant reading that can be transferred to other tools and repositories as well.
Newspapers are valuable sources in historical research. Until recently, however, investigating them was cumbersome and time-intensive. The repositories of digitized newspapers now available in many countries solve many practical problems and offer wonderful opportunities, but they also introduce methodological problems of their own. (Bingham 2010; Nicholson 2013) Bob Nicholson has recently shown how digitization enables us to approach newspapers bottom-up instead of top-down, but he stresses how difficult it is to create useful keyword searches for this purpose. (Nicholson 2013, 66–67) Adrian Bingham has also pointed this out, and has furthermore highlighted the danger that keyword searches (as well as other text mining techniques) pluck individual articles out of their original context, ignoring their position on the page, surrounding articles, and illustrations. (Bingham 2010, 230) In addition, Johanna Drucker has indicated that digital humanities scholars often aim to reduce complexity and remove ambiguity, whereas complexity and ambiguity are two values that humanities research has to cherish, not avoid. (Drucker 2009, 5–7; Collini 2012, 65–84)
This paper takes such warnings into account and shows how these problems are being addressed by researchers working with the digitized newspaper database of the Dutch national library, thereby offering more concrete versions of the rather general solutions (e.g., ‘we should not forget the article’s context’) that are often suggested. At present, this database contains over 10 million pages from more than 200 newspapers and periodicals published between 1618 and 1995.3 It can be approached in two ways: through Texcavator (in development, not yet publicly available) and through the national library’s Delpher tool (publicly available).
The paper discusses a specific use case in which both tools are combined and used alongside traditional historical methods: researching identity formation in public discourse. It focuses on the identities of (health) risk groups: groups of people who are classified as ‘at risk’ with the help of (health) risk factor classifications such as the body mass index (BMI). For example, nowadays, people with a BMI above 25 are classified as ‘at risk’ because of their high body weight. This classification and the construction of this group are not a necessary outcome of biomedical research on the human body; instead they are historically contingent, strongly rooted in culture and practice. (Hacking 2007a, 2007b) The construction of these risk groups and the formation of their identity take place to a significant extent in public discourse. Digitized newspapers are valuable sources for studying this identity formation: they provide a good entry into public discourse and typically span long time periods, enabling researchers to analyze fluctuations in the identity of these groups, for example in whether or not their members are seen as, and see themselves as, ‘ill’.
The paper presents the first results of the investigation of the identity construction of the risk group ‘overweight people’ between 1890 and 1990. It focuses in particular on newspaper advertisements in the first part of this period, a choice based on distant reading of the corpus with the help of Texcavator. The paper discusses how Texcavator and Delpher have been used, focusing in particular on the interaction between close and distant reading that this type of research requires. It shows how the direct connection between Texcavator and Delpher keeps the researcher only one or two mouse clicks away from viewing single articles in their original context: on the page, including illustrations, within the full issue of the periodical, as if going through newspapers on microfilm (or, depending on the size of the computer screen, leafing through them on broadsheet). Furthermore, it shows how Texcavator’s built-in visualization tools (timelines showing the number of articles, word clouds, named entity recognition) can be used to go back and forth between distant and close reading in order to build sophisticated queries that can easily be refined and modified within the tool.
In this way, the paper shows the challenges but also the new heuristic possibilities of doing historical research in digital repositories of newspapers.
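To make the idea of a timeline of article counts concrete, the following minimal Python sketch counts, per year, the articles that match at least one search term. It is an illustration only and does not represent Texcavator's or Delpher's actual implementation or API: the function name `articles_per_year`, the record fields (`year`, `kind`, `text`), the sample records, and the Dutch search terms are all assumptions invented for this example.

```python
import re
from collections import Counter


def articles_per_year(articles, keywords):
    """Count, per year, the articles whose text contains at least one keyword.

    `articles` is a list of dicts with hypothetical fields "year" and "text";
    this record layout is an assumption, not a real export format.
    """
    # One case-insensitive pattern matching any keyword as a whole word.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for article in articles:
        # An article is counted once, even if several keywords occur in it.
        if pattern.search(article["text"]):
            counts[article["year"]] += 1
    return dict(sorted(counts.items()))


# Invented sample records, for illustration only.
sample_articles = [
    {"year": 1895, "kind": "advertisement",
     "text": "Middel tegen corpulentie, zonder dieet."},
    {"year": 1895, "kind": "article",
     "text": "Verslag van de gemeenteraad."},
    {"year": 1910, "kind": "advertisement",
     "text": "Vermageren zonder gevaar! Tegen Corpulentie."},
    {"year": 1910, "kind": "advertisement",
     "text": "Corpulentie verdwijnt met deze thee."},
]

timeline = articles_per_year(sample_articles, ["corpulentie", "vermageren"])
print(timeline)
```

In a workflow like the one described, a peak in such a timeline would be the cue to switch to close reading of the matching articles in their page context, and the keyword list would then be refined on the basis of what that reading reveals.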
