A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950

paper, specified "long paper"
  1. 1. Simon Hengchen

    University of Helsinki

  2. 2. Ruben Ros

    Utrecht University

  3. 3. Jani Marjanen

    University of Helsinki

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

This project aims to mine two centuries worth of digitised newspapers in four languages, and to propose a methodologically sound, reusable approach to carry out quality historical research on the changing vocabulary relating to nationhood. The newspapers stem from different sources and countries, and are available in different formats. Massive digitized newspaper collections are increasingly used to address historical questions through mining textual data.

For recent examples and further discussion see for instance Bos & Gifford (2016); Brandzæg, Goring & Watson (2018);

Buntinx, Bornet and Kaplan (2017

). For a discussion of the role of digitization of newspapers for historical research see Cordell (2016); Milligan (2013).

They are more seldom used for comparative projects cross linguistic and national boundaries. In this paper, we address the methodological challenges the use of newspapers from different political contexts, languages and datasets poses, and lay out our approach to tackle a comparative study for the Netherlands, Finland, Sweden, and the UK.

Working with historical newspapers from different countries to look for the evolution of a concept poses several methodological challenges. A first problem is actually getting the data, and shaping it in a way that makes its use possible. For the UK, we use the Burney and Nichols collections and the British Library 19th Century Newspapers, both provided by Gale, and accessible through an API, OCTAVO.

et al


The full texts of the Dutch newspapers, as well as their corresponding metadata, are retrieved through the Delpher API.

accessed August 2018
. We would like to thank Dr. Steven Claeyssens at the Royal Library of the Netherlands. The original script used to query the API has been written by Juliette Lonij.

Finally, Finnish (including newspapers in Swedish from Finland) and Swedish newspapers are first queried through the KORP interfaces made available by the language banks of those two countries, respectively the Kielipankki


accessed August 2018.

and the Språkbanken,

accessed August 2018.

and then fetched via their API. The full datasets were used for the UK, Sweden, and Finland. The colonial newspapers were filtered
out of the Dutch dataset, so as not to bias the comparative analyses.

The justification is twofold: first, only the Dutch dataset has an extensive coverage of colonial newspapers. Second, Dutch colonial newspapers “showed a great uniformity” because “their news supply was unique and controlled by the official news agency, ANETA”. (Our translation and paraphrasing of Witte 1998:18)

In addition, Finnish newspapers published outside the historical borders of Finland were also disregarded from our analyses.

After getting access to the different data and shaping it in a way that a single pipeline can be reused for all languages and historical realities comes the trade-off between the computational, distant reading of the text, and the actual research question. We focus on the process of nation-building in Europe, and to achieve that goal we utilise several methods. Whilst historical processes or concepts do not appear as such in texts, and thus cannot be the object of a mere tallying across time, it is obvious that words do. We thus use words as a proxy to study the process of of nation-building, and carry that out in several ways. In doing this we also limit the study of nation-building to the development in which the nation became a self-evident frame for social and political affairs.

As a first
step in exploring
this idea, we look at how bigrams

We borrow this methodology from Hill
et al
(2018), who studied the public sphere in 18
century Britain.

starting with the adjective “national”

, and
. For the sake of clarity, the remainder of this abstract will use English terms as examples. In the case of bigrams containing “meaningless” words such as conjunctions (eg:
nationell och
nationale en
, “national and”), we expand the query until we arrive at the noun modified by the adjective. For English and Finnish, languages for which a surface form of the substantive shares the same spelling as the adjective, those occurrences of nouns are filtered out. Frequent compounds were decompounded and where needed harmonised, eg:

national biblioteket

kansallinen kirjasto
, etc.

behave in our datasets, in terms of absolute and relative frequencies. This paints a picture of how common the idea of something “national” is mentioned in newspapers in different countries at different periods. We complement this picture with an analysis of the creativity and productivity

The definitions of “productivity” and “creativity” are fluid within subfields of linguistics, as already discussed in Lyons (1977: 77). In this paper, we use “productivity” in its corpus linguistics sense, i.e. the proclivity of a linguistic unit to be (re-)used. “Creativity”, on the other hand, will characterise this unit’s

forms: in the case of a bigram, any new bigram following the construction “
national + _


of the
national + noun

bigram: by looking at how “creative” writers are with the linguistic unit, and by looking at how its use evolves across time, we have a glimpse at the vocabulary of the nation, and can identify key junctures in the transformation of this vocabulary. We notice that the French revolution, the political ruptures of 1848 and the Franco-German War of 1870 were particularly important for the diversification in the vocabulary of “national”, in all of our cases, but can also show how local political and publishing conditions produced local reactions. The differences also point out how events abroad affected domestic vocabulary, making the development a transnational one (cf. Bos and Gifford 2016). By focussing on bigrams, we can trace the domains in which the word “national” was used. In doing so we do not trace the theoretical development of the concept of nation or even the intentional processes of shaping Dutchness, Britishness, Swedishness,
or Finnishness, but rather focus on a much more implicit process in which the nation became a natural frame for conceptualizing the societal issues or -- to speak with Benedict Anderson (2006) -- an imagined community that became inescapable for citizens of any state.

Approaching bigrams is limited to mere counts, and whilst it hints at change, it does not qualify it. To remedy this weakness, we cluster bigrams by themes, in two different ways: on the one hand, domain experts assign a theme to the top-300 bigrams. Those themes, viewed in a diachronic way, add more colour to the simple tallying of bigrams, and of the creativity and productivity of the construction. Analyses based on our manual annotation point toward a hypothesis that, on a general level, the vocabulary of “national” tended to be focused on economic discourse in the late eighteenth century, but soon gained a stronger presence in relation to political issues and ultimately also entered the domains of culture and social affairs during the course of the nineteenth century.

The other approach, which we believe will be a useful one for researchers wanting to reproduce our methodology, is data-driven, and should help reduce a researcher’s bias. Clustering words semantically in a data-driven way is a challenge. Indeed, most current approaches rely on topic models to assign some “sense” to documents, but not directly on words. On the other hand, relying on external knowledge, such as Wordnet -- a database that groups lexical items into “synsets”, i.e. synonym sets -- proves itself to be difficult, due to the varying quality of the OCR. Additionally, Wordnet does not allow for a more fine-grained relationship between words than a dichotomous answer to the question “are they part of the same synset or not?”. To circumvent those problems, we train word embeddings on the full texts of our corpora, and then calculate the “semantic” distances between each of the top 1000 nouns that appear next to “national”. As such, we believe this approach is similar to the one proposed by Wevers
et al

(2015). We then use k-means clustering on the 0.5 million different distances

Prior to the distance calculation, the vectors have been unit-normalised, which allows us to use k-means clustering. Embeddings were trained on tokens with a frequency threshold of 300, a CBOW architecture, 100 dimensions, and with a window of 5. Furthermore, mimicking Kim
et al

(2014), the embeddings were trained on different time slices, where embeddings for slice

are initialised with the embeddings for slice
, hence bypassing the need for a “temporal” alignment of the vector space.

to generate semantic clusters of words. Each word is subsequently assigned a label -- its “centroid sense”, which allows us to look into the thematic distribution of “national” across time. An advantage of using word embeddings is that words with OCR errors also get distributional similarity. The weakness of such an approach is that only the primary sense of a word is captured, making the technique sensitive to frequency (Dubossarsky
et al

2018; Iacobacci
et al

2015). To make sure that the data-driven clustering is not mere chance, we attempt to replicate the results in two ways: first, we use SCAN  (Frermann and Lapata 2016), a dynamic topic model that infers, for a specific target word, word senses across time. Second, we calculate word mover’s distance (WMD, Kusner
et al

2015) between 11-grams (“KWICs”) in which “national” appears -- the same input as for
SCAN --, and cluster them using the same method as for the top words, in a move to go slightly beyond the “primary sense” limitation of the word embeddings. Data-driven clustering further confirms our hypothesis of the broadening domains of national, but, more importantly, also paints a clearer picture of how the politicization and culturalization of “national” took place. Further, it shows how the word national moved from being an abstraction of terminology such as “French”, “German”, “Dutch” to becoming indicative of a political community, and thus more often used similar to adjectives such as “public”, “common”, or “international” in referring to state institutions.

A further way of better qualifying our findings is using available metadata to zoom in on periods of instability and ruptures in creativity and productivity curves and tie these empirical findings to theories of semantic change (for a discussion see, Hengchen 2017: 11-24). Another way of making our analyses more precise includes a more linguistically- and culturally-aware preprocessing of the texts so as to go beyond the
national + noun

bigram: different cultures refer to a comparable reality differently (eg: Swedish

(“archives of the kingdom”) vs English “national archives”; the same word is used in Swedish from Finland, but not in Finnish, despite Finland

being a monarchy). To successfully implement more advanced preprocessing, future work will rely on the comparative findings of the present study.


Anderson, B.
(2006 [1983]).
Imagined Communities: Reflections on the Origin and Spread of Nationalism
, Verso, London.

Bos, M. van den
Gifford, H.
(2016). Mining Public Discourse for Emerging Dutch Nationalism,
Digital Humanities Quarterly


Brandzæg, S., Goring, P.
Watson, C.
Travelling Chronicles: News and Newspapers from the Early Modern Period to the Eighteenth Century
, Brill, Leiden.

Buntinx, V., Bornet, C. and Kaplan, F.
(2017). Studying Linguistic Changes over 200 Years of Newspapers through Resilient Words Analysis. Frontiers in Digital Humanities, 4 doi:10.3389/fdigh.2017.00002.

Cordell, R.
(2016). “What Has the Digital Meant to American Periodicals Scholarship?”
American Periodicals

26:1 (Spring).

Dubossarsky, H., Grossman, E.,
Weinshall, D.
(2018). Coming to Your Senses: on Controls and Evaluation Sets in Polysemy Research.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,

Frermann, L.,
Lapata, M.
(2016). A Bayesian model of diachronic meaning change. Transactions of the Association for Computational Linguistics, 4, 31-45.

Hengchen, S. (2017). When does it mean? Detecting semantic change in historical texts. Ph.D. thesis, Université libre de Bruxelles.

Hill, Mark J., Kanner, A., Marjanen, J., Vaara, V., Mäkeläa, E., Lathi, L., and Tolonen, M. (2018). Spheres of “public” in eighteenth-century Britain.
Proceedings of the 2019 Digital Humanities in the Nordic Countries Conference, Helsinki.

Iacobacci, I., Pilehvar, M. T.,
Navigli, R.
(2015). Sensembed: Learning sense embeddings for word and relational similarity. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

(Vol. 1, pp. 95-105)

Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S.
(2014). Temporal Analysis of Language through Neural Language Models.
ACL 2014
, p.61.

Kusner, M., Sun, Y., Kolkin, N. and Weinberger, K.
(2015). From word embeddings to document distances. In
International Conference on Machine Learning

(pp. 957-966).

Lyons, J
. (1977). Semantics (vols I & II).
Cambridge CUP

Milligan, I.
(2013). Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010.
The Canadian Historical Review

94(4), 540-569.

Tolonen, M., Marjanen, J., Kanner, A., Mäkelä, E., Lahti, L., Vaara, V., Roivainen, H., Tarkka-Robinson, L., Lähteenoja, V.
(2017). OCTAVO – Analysing Early Modern Public Communication. Poster presented in
Digital Humanities at Oxford Summer School 2017

Wevers, Melvin, Kenter, Tom, and Huijnen, Pim.
(2015). ‘Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse (1890-1990) Using Word Embeddings’, In:
Digital Humanities 2015, Sydney

Witte, R.
(1998). De Indische radio-omroep: overheidsbeleid en ontwikkeling, 1923-1942. Uitgeverij Verloren.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO