This paper discusses how automatic information extraction and linked data can facilitate long-term socio-spatial analyses in urban history, using as a case study a long-term spatial analysis of the development of the urban nightlife and entertainment industry in Amsterdam. We demonstrate how automatic information extraction from civil and trade registries can corroborate and validate information found in these sources, using duplicate information from the sources themselves as well as other published corpora that contain overlapping information. We present a technique that greatly reduces the need for manual revision of the data. Such a technique is sorely needed: research projects often resort to hiring people to go through all the data manually, or to crowdsourcing, which is expensive and slow and requires several non-experts to look at the same data in order to find ‘likely correct’ data (see e.g. Zheng et al. 2017). Previous efforts in this direction have focused on cleaning OCR results by using contextual probabilities or word frequency probabilities (e.g. Reynaert 2014; Mei et al. 2018; Jatowt and Nguyen 2018; Doush, Alkhateeb and Gharaibeh 2018), at times combined with semantic data (e.g. Woodfield et al. 2018). This approach works on texts that are representative of written vocabulary but is poorly suited to lists of names and places. Our technique combines and compares several archival sources that include similar information on, for instance, person and place names. A probability estimation then disambiguates between one or multiple occurrences of references to what is most likely the same entity. To keep track of and document this process, we model and publish the data in the new roar ontology, which allows us to model archival data while keeping the provenance chain of decision making and entity disambiguation.
Informed by recent developments in the field of digital history as well as sociological literature on urban nightlife, this paper applies these techniques to address the underrepresentation of small, short-lived, and informal cultural venues in quantitative studies on cultural life. Even though the nineteenth and twentieth centuries witnessed the rise of nightlife industries as we know them today (Baldwin 2012; Schlör 1998; Nasaw 1999; Erenberg 1981), the organization of small entertainment venues and their effects on the production and consumption of urban cultural life have received little systematic analysis beyond case studies on individuals and single venues, especially in the Dutch context. This is partly due to the large number of as yet undigitized archival periodicals such as address books, almanacs and program listings. However, the advent of new methods for digitizing and analyzing documents allowed us to develop a new and less labour-intensive technique for uncovering and structuring these valuable sources.
Figure 1 - Address book example (1824). Usual structure is: [name] [(initials/prefix)] [street name] [house number] [occupation] [either telephone number or neighbourhood number].
In this paper we help to redress this issue of underrepresentation by using automatic information extraction and entity disambiguation to systematically trace and analyse the spatial and diachronic distribution of clubs, restaurants and bars in relation to more formal cultural venues such as theatres and concert halls, as well as the rise of new entertainment venues such as cinemas. To reconstruct this urban pleasurescape, we contrast established historiographical narratives with a diverse set of historical sources. We reconstruct the public sphere of food and drink consumption by digitizing address books, the ‘yellow pages’ of those times (see Figure 1 for an excerpt of such a book), that listed most businesses in Amsterdam, their locations, and proprietors’ names.
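The entry structure given in the Figure 1 caption suggests how a single address book line might be parsed. The following is a minimal sketch only, not the extraction code used in the project: it assumes comma-separated fields, and the sample lines, the `Entry` type, and the function names are all hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    initials: Optional[str]
    street: str
    house_number: str
    occupation: str
    number: Optional[str]  # telephone or neighbourhood number, if present


# Assumed layout: "Name (initials), Street 12, occupation, 45".
# Real address books will deviate from this; the pattern is illustrative.
ENTRY_RE = re.compile(
    r"^(?P<name>[^(,]+?)\s*"
    r"(?:\((?P<initials>[^)]*)\))?\s*,\s*"
    r"(?P<street>.+?)\s+(?P<house_number>\d+\w*)\s*,\s*"
    r"(?P<occupation>[^,]+?)\s*"
    r"(?:,\s*(?P<number>\d+))?\s*$"
)

# Occupation stems mentioned in the text; prefix matching covers plurals.
TARGET_PREFIXES = ("koffijhuishouder", "tapper")


def parse_entry(line: str) -> Optional[Entry]:
    """Return a structured entry, or None if the line does not fit the pattern."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return Entry(**m.groupdict())


def is_food_or_drink(entry: Entry) -> bool:
    """True if the occupation starts with one of the target stems."""
    return entry.occupation.lower().startswith(TARGET_PREFIXES)
```

Lines that return `None` are candidates for manual inspection rather than silent omission, since OCR damage often breaks exactly the separators such a pattern relies on.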
The data in the address books is semi-structured and rather consistent over time, which made it possible to extract information programmatically. We extracted all persons who had occupations in the food and drink service industry, such as ‘koffijhuishouders’ [coffee shop holders] and ‘tappers’ [tavern keepers], but also theatres, cinemas and music venues. Once the books are digitized, an automated comparison between them significantly reduces the need for manual correction of OCR errors, spelling variation, and other artifacts and errors. Other sources, such as the Amsterdam Citizen Registry, are used to validate the automatic extraction from the address books. We validate the method by gathering as many sources with overlapping information as possible, normalizing their contents slightly (e.g. orthography of occupations and street names) and scoring each piece of information to deduce which data is likely correct and which is not, taking into account source reliability, temporal and geographic distance, and the number of sources giving the same and dissenting information. This scoring also ranks the data by how likely it is to need manual revision, which makes it possible to address the problems efficiently: the data is most quickly improved by correcting the certainly wrong information first and leaving the possibly wrong for later. For address books of subsequent years alone, the match rate is around 50%, which means that half the data does not need checking. For larger gaps between subsequent address books the match rate drops, since the information in the books differs increasingly over time. Adding a second address book, from the year preceding the earliest one, increases the match rate to 60%. After linking the street names in the books with georeferenced data on Adamlink, we could plot their locations and see the changing patterns over time, demonstrating how patterns of urban expansion affected the organisation of urban nightlife in the city of Amsterdam between 1820 and 1940.
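The scoring step described above can be illustrated with a small sketch. This is not the scoring function used in the project: the reliability weights, the temporal decay formula, and all names are assumptions made for illustration, and geographic distance is left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    value: str   # e.g. a street name as read from one source
    source: str  # which kind of source reported it
    year: int    # year of that source


# Hypothetical reliability weights per source type.
RELIABILITY = {"address_book": 0.6, "citizen_registry": 0.9}


def weight(obs: Observation, target_year: int) -> float:
    """Source reliability, discounted by temporal distance to the target year."""
    reliability = RELIABILITY.get(obs.source, 0.5)
    return reliability / (1 + abs(obs.year - target_year))


def score(candidate: str, observations: Iterable[Observation], target_year: int) -> float:
    """Agreeing observations add their weight; dissenting ones subtract it."""
    total = 0.0
    for obs in observations:
        w = weight(obs, target_year)
        total += w if obs.value == candidate else -w
    return total


def review_order(candidates: Iterable[str], observations: List[Observation],
                 target_year: int) -> List[str]:
    """Lowest-scoring (most likely wrong) candidates first."""
    return sorted(candidates, key=lambda c: score(c, observations, target_year))


observations = [
    Observation("Kalverstraat", "address_book", 1850),
    Observation("Kalverstraat", "citizen_registry", 1851),
    Observation("Kalverstraet", "address_book", 1850),
]
ranking = review_order(["Kalverstraat", "Kalverstraet"], observations, 1850)
```

In this toy example the variant reported by a single source ends up with a negative score and is ranked first for manual review, while the variant corroborated by two sources scores positively and can be left for later.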
Finally, we fit and publish the data in the roar model, thereby creating transparent linked data reconstructions of historical persons and businesses for others to use. This systematic digital historical approach has the added value of making source bias explicit, as the comparisons show which information present in one source is absent from another. The bottom-up approach reveals significant gaps in our current knowledge of urban nightlife. It also shows the value of using multiple sources with overlapping information to reduce labour and increase accuracy, and introduces roar as a data model for storing this type of information.
Hosted at Carleton University, Université d'Ottawa (University of Ottawa)
Ottawa, Ontario, Canada
July 20, 2020 - July 25, 2020
Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.
Conference website: https://dh2020.adho.org/
Series: ADHO (15)