Small-Scale Big Data: Experimental Literature and Distributed Computing

Short paper
Aaron Mauro

    Electronic Textual Cultures Lab - University of Victoria


This short paper describes how a small-scale implementation of big data text analysis can be used for reading single texts and testing algorithmic processes. By using a small Hadoop cluster, distributed computing methods can be used to parse typographically experimental texts and delineate even minute units of meaning. This experiment takes Ron Silliman’s 26-volume collection entitled The Alphabet (1979-2008) as an object of study because it anticipates and resists quantitative analysis techniques. Silliman describes his brand of “language writing” as a kind of “composition as investigation” that requires the active participation of the reader to make meaning. 1 The process of running resistant texts through an algorithmic system also exposes the interpretive biases of computational methods of reading poetry. Large-scale text analysis can reveal promising texts for further inquiry, but these systems can also be used on a small scale to complement qualitative human readings with quantitative results. Matthew Jockers has recently warned in Macroanalysis (2013) that “from thirty thousand feet, something important will inevitably be missed. The two scales of analysis, therefore, should and need to coexist.” 2 This paper shows how a proximate use of “distant reading” can render a text productively unfamiliar, and how distributed computing systems can be used to structure highly experimental literary texts.

The core of my method is animated by two observations. First, Google has become a constant touchstone for DH scholarship. The proceedings of the 2013 meeting of DH in Lincoln, Nebraska include 156 references to Google Search or other Google products, including Anna Jobin and Frédéric Kaplan’s attempt to reverse engineer Google’s autocomplete algorithms. 3 However, the scale and sophistication of Google’s systems make them impenetrable to all but the most experienced Google employees responsible for building these proprietary systems. Second, there has been a recent boom in avant-garde or experimental writing that is fuelled by a growing awareness of large-scale text analysis. I argue, therefore, that experimental literature has begun to function as a resistant dataset that can test and even react to algorithmic methods. This new experimental corpus thereby engages in a reciprocal critique of both literary and technical systems. My experiment is grounded in contemporary cultural and technological contexts, while seeking to explore the relationship between corporate analysis tools and literary production. The technologies that support this “Big Data Revolution” were developed by many of the most important technology companies in the US, Google, Yahoo!, and Facebook among them. 4 Hadoop was derived from Google’s MapReduce and Google File System white papers; Doug Cutting (Yahoo!) and Mike Cafarella (Google) spun off an open source implementation in 2005 that is now released by Apache. 5, 6 Hive was written by Facebook to streamline the process of writing MapReduce jobs in Java by allowing queries to be written in SQL. A secondary purpose of this research is to explore the capabilities and weaknesses of proprietary systems through their open source derivatives.
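The division of labour that Hadoop inherits from the MapReduce white paper can be sketched in a few lines of plain Python. This is a minimal local illustration of the map and reduce phases that a cluster would distribute across nodes; the toy corpus and function names are illustrative assumptions, not part of the project’s actual code.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (token, 1) pair for every token in a line of text."""
    for token in line.lower().split():
        yield (token, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each token across all mapper output."""
    counts = defaultdict(int)
    for token, n in pairs:
        counts[token] += n
    return dict(counts)

# A toy corpus standing in for lines of a distributed text file.
corpus = ["the sentence is the unit", "the word is the unit"]
pairs = [p for line in corpus for p in map_phase(line)]
print(reduce_phase(pairs))  # e.g. {'the': 4, 'sentence': 1, ...}
```

On a real cluster, Hadoop shuffles the mapper output so that all pairs sharing a key arrive at the same reducer; the logic of each phase is unchanged by the scale.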

A small-scale implementation of Hadoop and Hive on Amazon Web Services’ Elastic MapReduce (EMR) platform is a highly reliable and cost-effective means of working with distributed computing methods. As both a pedagogical and a rapid experimental tool, EMR automates the networking between data nodes throughout the system and opens the scalability of Hadoop to an extremely broad user base. Hive allows queries to be written in HiveQL, which does not follow the full SQL-92 standard but retains many of the features of MySQL. 7 While the distributed nodes are often virtualized on EMR and the user has no way to determine the full composition of the cluster, the costs remain extremely low and do not require complex negotiations with system administrators in a restrictive institutional environment. 8 Hadoop’s speed and flexibility are the result of its very simple order of operations and a physical architecture that allows for scaling from just three to potentially thousands of nodes. 9 As the corpus size expands, this system can scale rapidly with only a modest additional investment of research funds.
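The kind of HiveQL aggregation described above can be approximated locally with standard SQL against SQLite, which is a useful way to prototype a query before paying for an EMR cluster. The table and column names here are illustrative assumptions, and HiveQL syntax differs from SQLite’s in some details.

```python
import sqlite3

# Local stand-in for a Hive token-frequency query: load tokens into a
# table, then aggregate with GROUP BY, as one would in HiveQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")
conn.executemany("INSERT INTO tokens VALUES (?)",
                 [(w,) for w in "not this not that".split()])
rows = conn.execute(
    "SELECT word, COUNT(*) AS freq FROM tokens "
    "GROUP BY word ORDER BY freq DESC").fetchall()
print(rows)  # e.g. [('not', 2), ('that', 1), ('this', 1)]
```

The same SELECT … GROUP BY statement, pointed at a table backed by files in HDFS, is what Hive compiles into a MapReduce job behind the scenes.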

The term “distant reading,” first articulated by Franco Moretti, has come to describe the process of quantitative analysis. 10 It arose out of a very human inability to read all the texts that compose world literature. A remarkable number of issues emerge from this very simple state of affairs. First, we now have a kind of writing that acts like reading. In other words, the SQL queries used to parse and structure vast strings of data represent a readable trace of reading. Coding commands for this system represents a “composition as investigation.” 11 Second, the scalability of these systems allows machine reading systems to perform human lifetimes’ worth of reading tasks in mere minutes or hours. The computer’s ability to perform repetitive tasks at great speed also means that the interpretive experience of texts now lies outside the domain of human perception. “The Hadoop Distributed File System” white paper describes the pace of operations through the “heartbeats” that guide the operations of the distributed machines, explaining how the TaskTracker “can process thousands of heartbeats per second.” 12 Third, this technological moment represents the collapse of the false dichotomy between philosophical, subjective, and speculative analysis on the one hand and scientific, objective, empirical analysis on the other. Quantitative methods have the potential to erase the long-held doctrine of the “two cultures” that divides the sciences and humanities and presumes that “Intellectuals, in particular literary intellectuals, are natural Luddites.” 13

The use of the word “experimental” carries a strategic significance in this context. It is a word that simultaneously accesses literary, scientific, and technological discourses. My experiment is primarily animated by the “aesthetic provocation” of avant-garde writing. 14 Because computation relies upon symbolically stable inputs, making computational sense of non-sense characters becomes a central challenge in the study of contemporary experimental literature with computational methods; in order to read these non-sense characters, the “stop list” for this topic model may need to comprise the entire corpus of proper words. Rather than treating literature as strictly quantifiable data from which algorithmic analysis can simply glean information, I propose a methodology that assumes that literary information is profoundly resistant, reactive, and unpredictable. Kenneth Goldsmith claims, in Uncreative Writing (2011), that “digital media has set the stage for a literary revolution.” 15 While Goldsmith is thinking here about distribution methods on the Web, there is little doubt that literature is responding to the technological context into which it is published. It is now time to include the algorithm in this history of the avant-garde.
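The inverted “stop list” proposed above can be sketched in a few lines: instead of discarding common function words, the filter discards every dictionary word, keeping only the non-sense strings that experimental texts trade in. The tiny lexicon and function name here are hypothetical stand-ins for a full dictionary of proper words.

```python
import re

# A miniature lexicon standing in for an exhaustive word list;
# in the inverted scheme, this entire list becomes the stop list.
lexicon = {"the", "letter", "a", "is", "not"}

def nonsense_tokens(text):
    """Keep only the tokens that are absent from the lexicon."""
    tokens = re.findall(r"\S+", text.lower())
    return [t for t in tokens if t not in lexicon]

print(nonsense_tokens("the letter a is not zyxt grr=umph"))
# ['zyxt', 'grr=umph']
```

Run over a resistant text, such a filter surfaces exactly the typographic material that a conventional pipeline would silently discard.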

I would also like to acknowledge the generous support of the Electronic Textual Cultures Lab at the University of Victoria and the Implementing New Knowledge Environments group. This work is supported by the Social Sciences and Humanities Research Council of Canada.

1. Silliman, Ron (1984). For L=A=N=G=U=A=G=E. The Language Book. Eds. Bruce Andrews and Charles Bernstein. Southern Illinois UP. 14. Print.

2. Jockers, Matthew (2013). Macroanalysis: Digital Methods and Literary History. Urbana: U of Illinois P. 9. Print.

3. Broeckmann, Andreas. Digital Culture, Art, and Technology. IEEE Multimedia 12.4 (Oct.-Dec. 2005): 9-11. Web. 8 April 2013.

4. Mayer-Schönberger, Viktor and Kenneth Cukier (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Houghton Mifflin Harcourt. Print.

5. Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, Robert Chansler (2010). The Hadoop Distributed File System. Symposium on Massive Storage Systems and Technologies and Co-located Events. IEEE: Computer Society. Web. 22nd Oct. 2013.

6. Dean, Jeffrey and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. USENIX Association OSDI ‘04: 6th Symposium on Operating Systems Design and Implementation. Web. 22nd Oct. 2013.

7. Apache Hive.

8. Amazon Elastic MapReduce (Amazon EMR).

9. Apache Hadoop.

10. Moretti, Franco. Conjectures on World Literature. New Left Review 1 (Jan./Feb. 2000): 54-68. Web. 30 July 2013.

11. Silliman, Ron. (2008) The Alphabet. Tuscaloosa: U Alabama P. 1057. Print.

12. Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, Robert Chansler (2010). The Hadoop Distributed File System. Symposium on Massive Storage Systems and Technologies and Co-located Events. IEEE: Computer Society. Web. 22nd Oct. 2013.

13. Snow, C.P. (1961) The Two Cultures and the Scientific Revolution. New York: Cambridge UP. 23. Print.

14. Drucker, Johanna (2009). SpecLab: Digital Aesthetics and Projects in Speculative Computing. Chicago: U of Chicago P. xi. Print.

15. Goldsmith, Kenneth (2011). Uncreative Writing. New York: Columbia UP. 15. Print.


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO