Cultural text mining: using text mining to map the emergence of transnational reference cultures in large public media repositories

Short paper

  1. Jaap Verheul, Utrecht University

  2. Toine Pieters, Utrecht University



This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is based on the scalable open source text analysis service xTAS (developed by the Intelligent Systems Lab Amsterdam). xTAS has been used successfully in computational humanities projects such as Political Mashup, WAHSP, BILAND, and DutchSemCor. Within the context of the Translantis project, xTAS, coupled to Elasticsearch, will be developed further; future versions will include concept clustering and sentiment mining of issues in public debates. Translantis researchers are using Texcavator to detect and track cultural references in large textual corpora.
Use case: mining transnational references in public discourse

To test the potential of cultural text mining, Texcavator will be used to analyze the role of reference cultures in debates about social issues and collective identities. The central use case of this project is the emergence of the United States in public discourse in the Netherlands from the end of the nineteenth century to the end of the Cold War. The concept of reference culture is used to discuss long-term asymmetrical processes of cultural exchange involving dimensions of power and hegemony. It recognizes that some cultures assume a dominant role in the international circulation of knowledge and practices, offering or imposing a model that others imitate, adapt, or resist.
Reference cultures are mental constructs that do not necessarily represent a geopolitical reality with an internal hierarchy and recognizable borders. These culturally conditioned images of transnational models are typically established and negotiated in public discourses over a long period of time. However, the specific historical dynamics of reference cultures have never been systematically analyzed and hence are not fully understood. To explore these dynamics, this project asks three interrelated questions.
How can e-tools be used to map trends and changes in relation to the economic power, cultural acceptance, and scientific and technological impact of the United States as reference culture?
How does public discourse reflect and influence the emergence and impact of reference cultures?
How were ideas, products and practices associated with the United States valued in Dutch public discourse between 1890 and 1990?
We propose that the key to understanding the emergence and dominance of reference cultures is to chart the public discourse in which these collective frames of reference are established. Text mining methodologies allow us to trace changes in “big data” repositories of public media, such as newspapers, journals, and other periodicals. Central to this project is the large digital data collection of the National Library of the Netherlands (KB), which contains 9 million newspaper pages and over 1.5 million journal pages.2 This large collection of serialized historical texts, which have been OCR-ed and provided with meta-tags, allows us for the first time to study long-term developments and transformations in national discourses in a systematic, longitudinal, and quantifiable way, by using innovative text-mining tools.
Methodological innovations and challenges

The semantic text mining tool Texcavator has direct access to historical textual repositories, handles queries on the fly, and produces visualizations such as timelines and word clouds based on integrated topic modeling and named entity recognition (NER) modules. This allows us to test the value of qualitative heuristic models and to pair them in a meaningful fashion with quantitative methodology. The methodological challenges include calibrating between close and distant reading, normalizing search results from unevenly distributed historical media, and adjusting for lexicological changes that affect the accuracy of sentiment mining and concept mining.
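One of the challenges named above, normalizing search results from unevenly distributed media, can be illustrated with a minimal Python sketch. It assumes that raw hit counts and total article counts per year are already available. The function name `relative_frequency`, the per-10,000 scale, and all figures are hypothetical illustrations, not Texcavator's actual method or real corpus counts.

```python
def relative_frequency(hits_per_year, articles_per_year, per=10_000):
    """Return hits per `per` articles for each year that has corpus data.

    Raw hit counts mostly mirror how much material was digitized for a
    given year; dividing by the yearly corpus size makes years comparable.
    """
    normalized = {}
    for year, total in articles_per_year.items():
        if total <= 0:
            continue  # no digitized material for this year, so no rate
        normalized[year] = hits_per_year.get(year, 0) * per / total
    return normalized


# Invented figures, for illustration only.
hits = {1900: 12, 1950: 300}
totals = {1900: 4_000, 1950: 200_000}
rates = relative_frequency(hits, totals)
```

With these invented figures the later year has 25 times as many raw hits, yet its relative frequency is half that of the earlier year, which is the kind of distortion normalization is meant to expose.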
First results indicate that “hidden debates” in public media can be mined in a bottom-up (inductive) manner, based on the traces that the terms in use leave behind. More importantly, the tool is innovative in that it pinpoints continuities and discontinuities in public discourse, for instance by showing variations in the context in which key terms are used, and changes in the sentiment values of words over time. We argue that this marks a promising transition from text mining to “concept mining” and to new forms of cultural text mining that go beyond already established mining features.
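The idea of showing variations in the context in which a key term is used can be sketched as a comparison of co-occurring words in two periods. This is a simplified stand-in under stated assumptions, not the project's topic modeling or NER pipeline: the helper names `context_terms` and `context_shift`, the window size, and the two sample sentences are all hypothetical, and no stop-word filtering or OCR correction is applied.

```python
import re
from collections import Counter


def context_terms(texts, keyword, window=3):
    """Count the words found within `window` tokens of `keyword`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                start = max(0, i - window)
                # Neighbours before and after the keyword, keyword excluded.
                counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


def context_shift(early, late):
    """Return (terms only in the late context, terms only in the early one)."""
    return set(late) - set(early), set(early) - set(late)


# Invented sentences standing in for two periods of newspaper text.
early = ["America ships grain to Europe"]
late = ["America exports jazz and film"]

early_ctx = context_terms(early, "america", window=2)
late_ctx = context_terms(late, "america", window=2)
gained, lost = context_shift(early_ctx, late_ctx)
```

On a real corpus the raw counts would first need the kind of normalization discussed earlier, since a longer or better-digitized period yields more context words regardless of any change in discourse.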

We will demonstrate that semantic mining of big data opens new vistas in historical research because it (a) provides a robust framework for new perspectives on macro history; and (b) can be complemented with numerical data sets provided by other researchers, for example on economic and social trends. This, ultimately, is the transformative promise of digital humanities as a multi-dimensional window on political, economic, and social change.


1. Translantis: Digital Humanities Approaches to Reference Cultures; The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990 (funded by the Netherlands Organization for Scientific Research),


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014


Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO