University College Dublin
Abstract
This paper explores the cultural representation of migration and the biopolitics of contagion and disease represented in a digital corpus of literary fiction from the British Library. This work is part of a project examining the shifting representation of migration, ethnicity and contagion in cultural memory. A curated subset of the British Library Digital Corpus was examined using techniques from artificial intelligence and text mining. Concept modelling with neural word embedding revealed complex relational dynamics between societal views of migration, ethnic identity and contagion that question prevailing theories. Thematic lexicons were generated with word embedding to mine the corpus for excerpts of text that capture these conceptual relationships and enable critical analysis. This bridging of digital analysis and close reading sets out a methodology whereby patterns identified in corpora with artificial intelligence techniques may be critically evaluated through close reading of the text.
Keywords: migration, contagion, biopolitics, word embedding, text mining, literary fiction
Introduction
The complex relationship between societal views of migration, ethnicity and concepts of contagion and disease is explored in this paper through neural word embedding and text mining. This research is part of a project examining the representation of migration, ethnicity and contagion through the analysis of a collection of 45,000 digital texts from the British Library, primarily dating from the late 19th century. In order to explore the cultural representation of migrants, this paper focuses on their representation within literary fiction, which comprises 16,426 texts of the digital collection. Given that the largest communities of migrants to Britain during the late 19th century were Irish and Jewish, this paper focuses on their portrayal in relation to prevailing concepts of contagion, disease and migration.
Lexicons of terms associated with migration and the biopolitics of contagion were generated with neural word embedding. Thematic lexicons are learned associations between terms in the corpus and a set of seed terms corresponding to a concept (Lavelli et al., 2002). The dynamics of the relationship between concepts in the corpus were modelled and explored with t-SNE visualisation and measures of semantic distance. Modelling how the concepts of migration, disease and contagion, real or imagined, were related within the corpus revealed a complex conceptualisation of contagion and of the nature of its association with Irish and Jewish migrants. Excerpts of texts capturing the interaction between concepts of migration, ethnicity and contagion were extracted using text mining based on the thematic lexicons developed with word embedding. This bridging of neural word embedding methods with text mining demonstrates how complex conceptual relationships may be identified in text and critically examined through close reading.
Related Research
Migration and the Biopolitics of Contagion
Cultural attitudes towards migration have traditionally been associated with fear of contagious disease (Nelkin and Gilman, 1988; Kinealy, 2006). Poverty-induced migration from Ireland to Britain during the famine has been cited as generating a fear of transmission of contagious disease (Morash, 2009). The conflation of issues of migration, ethnicity and contagion is evidenced by the fact that tuberculosis was identified as the “Jewish Disease” even though mortality rates from the disease in London were lower for Jewish immigrants than for their counterparts. Recent research, notably Samuel Kline Cohn's work on the history of epidemics, has produced a more complex and nuanced understanding of the relationship between fear of contagion and suspicion of migrants, based on a much broader historical and cultural archive than heretofore (Cohn Jr, 2018). This paper builds on that work by applying digital methods to support the systematic study of the relationship between concepts of migration, disease and contagion.
Concept Modelling and Text Mining
Machine learning has been used in digital humanities research to generate thematic lexicons for a range of purposes, including detecting language change over time (Hamilton et al., 2016), extracting social networks from literary texts (Wohlgenannt et al., 2016), sentiment analysis (Tang et al., 2014), and semantic annotation (Leavy et al., 2018). In developing domain-specific vocabularies, neural word embedding can be particularly effective (Chanen, 2016). The exploration of topics in text through visualisation of neural word embedding models has been applied in automated text analysis systems (Park et al., 2018). However, challenges have been identified in bridging patterns uncovered through visualisation and semantic similarity analysis with close reading of texts in digital humanities research (Jänicke et al., 2015). This paper addresses this issue by using thematic lexicons developed through neural networks to explore the relationships between concepts in text and also as a basis for mining excerpts of text.
Methods
In this work, thematic lexicons are developed using neural word embeddings, and then visualised using the t-SNE algorithm (Maaten and Hinton, 2008). The word embedding algorithm used here is the popular word2vec approach (Mikolov et al., 2013), which generates real-valued, low-dimensional representations of words based on lexical co-occurrences, as identified by sliding a window over documents in a corpus. Lexicons representing key thematic strands in the dynamics of biopolitics and migration were developed by using word embeddings to uncover terms that are semantically related to an initial set of seed terms (Table 1). The resulting expanded thematic lexicons were used to model concepts within the corpus, and also to uncover excerpts that capture relationships between key concepts in the text.
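The seed-term expansion described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the three-dimensional vectors in `TOY_VECTORS` are invented for illustration (a trained word2vec model would supply vectors of a few hundred dimensions), and the function name `expand_lexicon`, the use of a seed centroid, and the `top_n` and `min_sim` parameters are all assumptions.

```python
import math

# Hypothetical toy embedding: invented 3-d vectors, NOT values from the corpus.
TOY_VECTORS = {
    "contagion": [0.9, 0.1, 0.0],
    "infection": [0.8, 0.2, 0.1],
    "plague": [0.85, 0.05, 0.1],
    "fever": [0.7, 0.3, 0.0],
    "harvest": [0.0, 0.9, 0.2],
    "voyage": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_lexicon(seeds, vectors, top_n=2, min_sim=0.5):
    """Return the seeds followed by up to top_n terms nearest the seed centroid.

    Assumption: candidates are ranked by cosine similarity to the mean of the
    seed vectors; other ranking schemes are equally plausible.
    """
    known = [s for s in seeds if s in vectors]
    if not known:
        return list(seeds)
    dim = len(vectors[known[0]])
    centroid = [sum(vectors[s][i] for s in known) / len(known) for i in range(dim)]
    scored = [
        (cosine(centroid, vec), term)
        for term, vec in vectors.items()
        if term not in seeds
    ]
    # Keep only sufficiently similar terms, most similar first.
    ranked = sorted((pair for pair in scored if pair[0] >= min_sim), reverse=True)
    return list(seeds) + [term for _, term in ranked[:top_n]]


lexicon = expand_lexicon(["contagion", "infection"], TOY_VECTORS)
print(lexicon)
```

With these toy vectors the two seed terms attract "plague" and "fever", while "harvest" and "voyage" fall below the similarity threshold. On a real corpus, the expanded list would still need manual review before being treated as a thematic lexicon.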
The dynamics of relationships between thematic lexicons and their positioning within the entire corpus were modelled using a t-SNE visualisation approach. The t-SNE method allows the visualisation of high-dimensional data, such as word embedding models. The conceptual structure was explored through the interactive embedding projector on the TensorFlow platform (Abadi et al., 2015). The observed patterns suggested the nature of the relationship between ethnicity, migration and concepts relating to contagion. These patterns were then evaluated by analysing the cosine similarity of word vectors in the embedding, which quantifies the semantic distance between concepts (see Fig. 3).
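As a hedged illustration of how cosine similarity can quantify the semantic distance between concepts, the sketch below compares lexicons by the cosine of their centroid vectors and returns a matrix of the kind that could be drawn as a heatmap. Aggregating each lexicon into a centroid is an assumption made here for brevity; the paper does not specify how similarities were aggregated. The vectors, lexicon names and helper names are all hypothetical, and the resulting numbers say nothing about the corpus.

```python
import math

# Hypothetical toy embedding and lexicons, invented for illustration only.
TOY_VECTORS = {
    "fever": [0.9, 0.1, 0.1],
    "plague": [0.8, 0.2, 0.0],
    "creed": [0.1, 0.9, 0.1],
    "doctrine": [0.2, 0.8, 0.0],
    "voyage": [0.1, 0.1, 0.9],
    "emigrant": [0.0, 0.2, 0.8],
}
TOY_LEXICONS = {
    "disease": ["fever", "plague"],
    "religion": ["creed", "doctrine"],
    "migration": ["voyage", "emigrant"],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(terms, vectors):
    """Mean vector of the lexicon terms that are present in the embedding."""
    present = [vectors[t] for t in terms if t in vectors]
    if not present:
        raise ValueError("no lexicon term found in the embedding")
    dim = len(present[0])
    return [sum(vec[i] for vec in present) / len(present) for i in range(dim)]


def lexicon_similarity(lexicons, vectors):
    """Nested dict of pairwise cosine similarities between lexicon centroids."""
    centroids = {name: centroid(terms, vectors) for name, terms in lexicons.items()}
    return {
        a: {b: cosine(centroids[a], centroids[b]) for b in centroids}
        for a in centroids
    }


matrix = lexicon_similarity(TOY_LEXICONS, TOY_VECTORS)
for name, row in matrix.items():
    print(name, {other: round(score, 2) for other, score in row.items()})
```

Each row of the printed matrix corresponds to one row of a heatmap. A centroid hides variation within a lexicon, so a dispersed concept may be better examined term by term.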
Table 1: Seed terms for thematic lexicons
Texts relating to the key themes listed in Table 1 were uncovered based on the use of the lexicons described above. Specifically, excerpts of texts were extracted if they contained one or more words from a given lexicon. A sample of top words from the lexicon representing the concept of contagion is provided in Table 2.
Table 2: Sample of top terms from thematic lexicon related to contagion
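The lexicon-based extraction step can be sketched as follows. The paper states only that excerpts were kept if they contained one or more lexicon words; treating a sentence as the unit of an excerpt, the naive sentence splitter, the optional second lexicon (`also_require`) and the sample passage are all assumptions introduced here for illustration. The passage is invented and is not drawn from the British Library corpus.

```python
import re

# Invented sample passage, NOT a quotation from the corpus.
SAMPLE_TEXT = (
    "The fever spread through the harbour town. "
    "The travellers arrived at dawn. "
    "Some feared the new doctrine was a contagion. "
    "The harvest was good."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-cased set of alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def extract_excerpts(text, lexicon, also_require=None):
    """Return sentences containing at least one term from `lexicon`.

    If `also_require` is given, a sentence must additionally contain at least
    one term from that second lexicon (a co-occurrence filter).
    """
    primary = {term.lower() for term in lexicon}
    secondary = {term.lower() for term in also_require} if also_require else None
    hits = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if tokens & primary and (secondary is None or tokens & secondary):
            hits.append(sentence)
    return hits


contagion_hits = extract_excerpts(SAMPLE_TEXT, {"fever", "contagion"})
joint_hits = extract_excerpts(SAMPLE_TEXT, {"fever", "contagion"}, also_require={"doctrine"})
print(contagion_hits)
print(joint_hits)
```

The co-occurrence filter reflects the way excerpts are later selected when a term appears alongside elements of another thematic lexicon. Exact token matching is deliberately simple and would miss inflected forms unless they are listed in the lexicon.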
Findings and Conclusions
This research uncovered a dynamic between concepts of race and migration that challenges prevailing theories about the attribution of threats of contagion to Jewish and Irish immigrants. Contrary to expectations, analysis of the corpus with neural word embedding did not support a link between race and concepts pertaining to contagion. While Irish and, to a lesser extent, Jewish communities were themselves described as disease, a fear of transmission of disease to British people was not systematically evident in the corpus. Religion and new political ideologies, rather than ethnicity itself, show a stronger association with a threat of contagion and ultimately disease.
Figure 1: Visualisations of conceptual model of migration and biopolitics
A striking pattern evident within the clustering of concepts in the t-SNE visualisation was an absence of proximity between the lexicon of disease and those representing either Irish or Jewish identity (Fig. 1). However, the term “exterminate”, a term used in relation to the extermination of disease, was aligned with elements of the Irish lexicon. Cosine similarity analysis demonstrated a stronger association of “extermination” with ethnic identity, and particularly the religious aspect of that identity, than with disease itself (Fig. 2a). Excerpts identified as containing this term alongside elements from the Irish or Jewish thematic lexicons suggested a conceptualisation of some migrants as a disease to be exterminated, rather than as presenting a threat of contagion (Table 4).
Figure 2: Semantic distance (similarity) of concepts in text
While most lexicons appeared clustered within the model, the concept of contagion was more dispersed within the t-SNE visualisation. Contrary to expectations, an overall close relation between concepts relating to contagion and Jewish or Irish identities was not evident (Fig. 3). However, the aspect of Irish identity pertaining to religion was aligned with elements of the concept of contagion. Cosine similarity analysis of contagion and Catholicism, the nearest neighbours from the Jewish identity, along with the key themes of poverty, migration and immorality, was used to evaluate the semantic distance between concepts. This indicated a stronger association between Catholicism and the concept of contagion than between contagion and either the Irish as an ethnic community or migrants in general (Fig. 2b). Similarly, the aspects of Jewish identity that pertained most strongly to religion were associated more closely with concepts of contagion, suggesting that religion, rather than ethnic identity itself, may have had a stronger association with concepts of contagion.
Table: Excerpts capturing conceptual relationships between ethnic identity, religion, and contagion
Figure 3: Heatmaps indicating cosine similarities between thematic lexicons and ethnic identities
Excerpts of text capturing concepts associated with contagion along with Irish and Jewish identity were extracted to critically evaluate the patterns identified in the word embedding model. Close reading of these revealed a fear of contagion of religions and political ideology. Tracking back to these excerpts also demonstrates a belief that foreign religion could induce disease (see Table 4). Future narrative analysis will examine the extent to which opposition to intermarriage with these groups used the imagery of infection.
Conclusion
This paper investigated key themes pertaining to migration and the biopolitics of contagion. It uncovered conceptual relationships and excerpts of texts from a collection of literary fiction from the British Library that question prevailing theories about the historical association of perceptions of migrants with fear of contagion. Insights regarding the association between the religious aspects of the ethnic identity of immigrants and concepts of contagion in Britain, particularly in the 19th century, were uncovered using neural word embedding. Critical analysis of these complex conceptual patterns was enabled through mining the corpus based on thematic lexicons derived from the embedding models. In bridging artificial intelligence and text mining approaches in this way, this research merges both digital and traditional forms of humanities research.
Bibliography
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y. and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Chanen, A. (2016). Deep learning for extracting word-level meaning from safety report narratives, Integrated Communications Navigation and Surveillance (ICNS), 2016, IEEE, pp. 5D2–1.
Cohn Jr, S. K. (2018). Epidemics: Hate and Compassion from the Plague of Athens to AIDS, Oxford University Press.
Hamilton, W. L., Clark, K., Leskovec, J. and Jurafsky, D. (2016). Inducing domain-specific sentiment lexicons from unlabeled corpora, Proc. EMNLP 2016, Vol. 2016, NIH Public Access, p. 595.
Jänicke, S., Franzini, G., Cheema, M. F. and Scheuermann, G. (2015). On close and distant reading in digital humanities: A survey and future challenges, Eurographics Conference on Visualization (EuroVis)-STARs. The Eurographics Association.
Kinealy, C. (2006). This Great Calamity: The Great Irish Famine: The Irish Famine 1845-52, Gill & Macmillan Ltd.
Lavelli, A., Magnini, B. and Sebastiani, F. (2002). Building thematic lexical resources by term categorization, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp. 415–416.
Leavy, S., Pine, E. and Keane, M. T. (2018). Industrial memories: Exploring the findings of government inquiries with neural word embedding and machine learning, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018. to appear.
Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE, Journal of Machine Learning Research 9(Nov): 2579–2605.
Mikolov, T., Corrado, G., Chen, K. and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, Proc. ICLR 2013 pp. 1–12.
Morash, C. (2009). The Hungry Voice: The Poetry of the Irish Famine, Irish Academic Press.
Nelkin, D. and Gilman, S. L. (1988). Placing blame for devastating disease, Social Research pp. 361–378.
Park, D., Kim, S., Lee, J., Choo, J., Diakopoulos, N. and Elmqvist, N. (2018). Conceptvector: text visual analytics via interactive lexicon building using word embedding, IEEE Transactions on Visualization & Computer Graphics (1): 361–370.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T. and Qin, B. (2014). Learning sentiment-specific word embedding for twitter sentiment classification, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1555–1565.
Wohlgenannt, G., Chernyak, E. and Ilvovsky, D. (2016). Extracting social networks from literary text with word embedding tools, Proc. Workshop on Language Technology Resources and Tools for Digital Humanities, pp. 18–25.
In review
Hosted at Utrecht University
Utrecht, Netherlands
July 9, 2019 - July 12, 2019
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html
Series: ADHO (14)
Organizers: ADHO