Big Data and the Literary Archive: Topic Modeling the Watson-McLuhan Correspondence

Harvey Quamen; Paul Hjartarson; Matt Bouchard; Nicholas van Orden

Authorship

1. Harvey Quamen, University of Alberta
2. Paul Hjartarson, University of Alberta
3. Matt Bouchard, University of Alberta
4. Nicholas van Orden, University of Alberta

Introduction

The world of Big Data has introduced humanist scholars to new and relatively unfamiliar data-handling techniques such as data mining, graph visualizations, document clustering, and topic modeling. Topic modeling techniques “are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.” [1] This paper examines how one digital humanities project, the digitization of a literary archive, is using topic modeling to help users browse and discover the contents of the archive.

Background and Methodology

Topic models have commonly been used to classify documents [2, 3] or to cluster tagged artifacts [4]; however, researchers are increasingly using topic modeling as a way to allow users to browse and search large corpora, often through data visualizations [5, 6, 7]. Traditional search techniques often fail with large corpora because, as one researcher puts it, "users may not be familiar with the vocabulary that defines the topics of their interest, or simply they may wish to get a broad summary of the collection in order to guide their searches." [8] If data visualization in the sciences typically happens after the research and number crunching are finished, data visualization in the humanities often serves as a preliminary tool of exploration and discovery. Literary archives are a perfect litmus test: online finding aids often lack detail, and there is a storied tradition of having to pay one’s scholarly dues by enduring long, physically exhausting archival sessions whose consequences range from bad posture and poor eyesight to contagion and even meningitis [9]. Digitization and new search techniques can help make archival research more accessible for all scholars.
Our research team is experimenting with a wide variety of big data techniques as we continue a project funded by the Social Sciences and Humanities Research Council of Canada to digitize the archives of Wilfred and Sheila Watson, two 20th-century Canadian writers. The Watson Archive (a lifetime of journals, manuscript drafts, notes, sketches, artwork, newspaper clippings, reviews, and correspondence distributed across two universities 1500 km apart) clearly exceeds the scope of any one publication or scholarly project. In archival terms, Wilfred's papers occupy 10.6 metres of shelf space (producing over 101,000 digital images), while Sheila's occupy a further 8.4 metres, none of which has yet been digitized. Topic modeling is one promising way for scholars to understand the content and the contours of these two separate but related Watson archives in ways that move significantly beyond online finding aids or the serendipity of sifting through the materials in person.
Big data techniques like topic modeling have found a mixed reception in the humanities, however. Twenty-five years ago, Carlo Ginzburg argued that the humanities were different from the sciences because while the humanities privileged “the study of individual cases, situations, and documents precisely because they are individual,” the sciences investigated only phenomena that were quantifiable and repeatable [10]. Big Data has begun to challenge that neat division: perhaps most famously, Franco Moretti has used quantitative techniques in order to “distant read” literary history [11], while, more recently, Johanna Drucker has indicted those same methods as “pernicious” because they “violat[e] the very premises of humanistic inquiry.” [12]

Research

This paper engages those debates in light of one particular test case, a corpus of 413 letters that Wilfred Watson, Sheila Watson, and Marshall McLuhan wrote to each other over a period of more than twenty years. Because Sheila Watson studied for her PhD under McLuhan’s supervision and Wilfred Watson collaborated with McLuhan on the 1970 monograph From Cliché to Archetype [13], the archival letters range across a wide spectrum of topics from the personal to the professional, from the microdata of timelines, draft revisions, and gossip to the macrodata of history, culture, and civilization. Topic modeling offers scholars one browsable entrance into such an unwieldy corpus, a corpus that is nonetheless just a tiny fraction of the entire archive. Our team has built a prototype interface that allows scholars to choose the number of topics into which the algorithm should cluster the letters. In the resulting force-directed graph, users can click on nodes to reveal the significant words that form each cluster and to browse each cluster’s individual letters. Data visualization merges with user interface design to provide scholars a new means of engaging the Watson archive.
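The workflow described above (choose a number of topics, cluster the letters, then expose each cluster's significant words and member letters as a graph) can be illustrated with a minimal sketch. The abstract does not name the algorithm or the software behind the prototype, so everything here is an assumption made for illustration: the sketch uses latent Dirichlet allocation fitted by collapsed Gibbs sampling, the names `fit_lda` and `build_graph` are hypothetical, the hyperparameters are arbitrary, and the "letters" are invented placeholder strings rather than archive text. The node-link dictionary it produces is the general kind of input a force-directed layout consumes.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # all tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def build_graph(doc_topic, topic_word, top_n=4):
    """Node-link data: one node per topic and per letter, one link per letter."""
    nodes, links = [], []
    for k, counts in enumerate(topic_word):
        words = [w for w, c in counts.most_common(top_n) if c > 0]
        nodes.append({"id": f"topic-{k}", "kind": "topic", "words": words})
    for d, row in enumerate(doc_topic):
        best = max(range(len(row)), key=row.__getitem__)  # dominant topic
        nodes.append({"id": f"letter-{d}", "kind": "letter"})
        links.append({"source": f"letter-{d}", "target": f"topic-{best}"})
    return {"nodes": nodes, "links": links}


# Invented placeholder "letters", not text from the Watson archive.
letters = [
    "draft chapter revision deadline publisher draft",
    "theatre rehearsal play script actors play",
    "chapter revision publisher proofs deadline",
    "play actors stage rehearsal script",
]
docs = [text.split() for text in letters]
num_topics = 2  # the value a user would choose in the interface

doc_topic, topic_word = fit_lda(docs, num_topics)
graph = build_graph(doc_topic, topic_word)
for node in graph["nodes"]:
    if node["kind"] == "topic":
        print(node["id"], node["words"])
```

Linking each letter only to its dominant topic keeps the graph simple; an interface could equally draw weighted links to every topic a letter touches, since LDA treats each document as a mixture rather than a member of a single cluster.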
Our interface prototype provides scholars a hybrid between a “distant reading” of the archive and a full text search. Broad, long-term patterns and topic shifts become immediately visible. For example, clustering the letters into as few as three or four groups reveals that, as McLuhan and co-author Wilfred Watson collaborated on From Cliché to Archetype, their paradigmatic literary figure shifted from Wyndham Lewis to James Joyce. The conversations about why Joyce proved more satisfactory than Lewis, of course, appear only in the letters and not in the finished monograph. Through data visualization and topic modeling, however, scholars can explore how ideas shift and change over time and can “zoom in” to important moments in the corpus of letters.
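One simple way to surface the kind of topic shift described above is to tabulate, for each period, the share of letters assigned to each cluster. The sketch below is only an illustration under stated assumptions: it presumes each letter already carries a date and a cluster label, the function name `topic_share_by_period` is hypothetical, and the records are invented placeholders (the years and the labels "A" and "B" do not come from the archive).

```python
from collections import Counter, defaultdict


def topic_share_by_period(records):
    """Return {period: {topic: share of that period's letters}}."""
    counts = defaultdict(Counter)
    for period, topic in records:
        counts[period][topic] += 1
    shares = {}
    for period in sorted(counts):
        total = sum(counts[period].values())
        shares[period] = {t: n / total for t, n in counts[period].items()}
    return shares


# Invented (year, cluster label) pairs, one per letter; not archive data.
records = [
    (1965, "A"), (1965, "A"), (1965, "B"),
    (1968, "A"), (1968, "B"), (1968, "B"), (1968, "B"),
]
shares = topic_share_by_period(records)
for period, by_topic in shares.items():
    print(period, {t: round(s, 2) for t, s in sorted(by_topic.items())})
```

A table like this only points to where a shift may have occurred; the letters themselves still have to be read to learn why it happened.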
Our argument, then, is that topic modeling and other big data techniques are increasingly important to humanists, especially as scholars are confronted by the overwhelming data available in even the most modest-sized archives. Topic modeling provides humanist scholars a valuable new way to explore large collections of texts and artifacts: not necessarily to determine definitively or algorithmically how texts should be classified or clustered, but as an experimental, dynamic means of seeing patterns of similarity and difference in a wide range of materials. The result is that humanists are now increasingly able to move beyond the individual, idiosyncratic cases that Ginzburg described to see how people and ideas and discourses change over time.

References

1. Blei, David M. (2011). Introduction to Probabilistic Topic Models. www.cs.princeton.edu/~blei/papers/Blei2011.pdf. p. 2.
2. Zhou, Shibin, Kan Li and Yushu Liu (2009). Text Categorization Based on Topic Model. International Journal of Computational Intelligence Systems 2.4 (December 2009): 398-409.
3. Song, Min, and Su Yeon Kim (2013). Detecting the Knowledge Structure of Bioinformatics by Mining Full-Text Collections. Scientometrics 96: 182-201.
4. García-Plaza, Alberto Pérez, Arkaitz Zubiaga, Víctor Fresno, and Raquel Martínez (2012). Reorganizing Clouds: A Study on Tag Clustering and Evaluation. Expert Systems with Applications 39: 9483–9493.
5. Shao, Jian, Shuai Ma, Weiming Lu, and Yueting Zhuang (2012). A Unified Framework for Web Video Topic Discovery and Visualization. Pattern Recognition Letters 33: 410–419.
6. Anaya-Sánchez, Henry, Aurora Pons-Porrata, and Rafael Berlanga-Llavori (2010). A Document Clustering Algorithm for Discovering and Describing Topics. Pattern Recognition Letters 31: 502–510.
7. Gretarsson, Brynjar, John O'Donovan, Svetlin Bostandjiev, Tobias Hollerer, Arthur Asuncion, David Newman, and Padhraic Smyth (2012). TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling. ACM Transactions on Intelligent Systems and Technology 3.2 (February 2012): Article 23. 26pp.
8. Anaya-Sánchez, Pons-Porrata, and Berlanga-Llavori (2010), p. 502.
9. O'Driscoll, Michael and Edward Bishop (2004). Archiving 'Archiving.' English Studies in Canada 30.1 (March 2004): 1-16.
10. Ginzburg, Carlo (1989). Clues: Roots of an Evidential Paradigm. In Clues, Myths, and the Historical Method. Trans. John and Anne C. Tedeschi. Baltimore: Johns Hopkins UP. 96-125.
11. Moretti, Franco (2005). Graphs, Maps, Trees: Abstract Models for Literary History. London: Verso.
12. Drucker, Johanna (2011). Humanities Approaches to Graphical Display. Digital Humanities Quarterly 5.1. www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html
13. McLuhan, Marshall and Wilfred Watson (1970). From Cliché to Archetype. NY: Viking.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ADHO - 2014

"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO
