Towards an Archaeology of Text Analysis Tools

paper, specified "long paper"
  1. Stéfan Sinclair

    McGill University

  2. Geoffrey Rockwell

    University of Alberta

How have text analysis tools in the humanities been imagined in the past? What did humanities computing developers think they were addressing with now-dated technologies like punch cards, printed concordances and verbose command languages? Whether the analytic functionality is at the surface, as with Voyant Tools, or embedded at deeper levels, as with the Lucene-powered searching and browsing capabilities of the Old Bailey, the web-based text analysis tools that we use today are very different from the first tentative technologies developed by computing humanists. Following Siegfried Zielinski's exploration of forgotten media technologies, this paper will look at three forgotten text analysis technologies and how they were introduced by their developers at the time. Specifically, we will:

Discuss why it is important to recover forgotten tools and the discourse around these instruments,
Look at how punch cards were used in Roberto Busa’s Index Thomisticus project as a way of understanding data entry,
Look at Glickman’s ideas about custom card output from PRORA, as a way of recovering the importance of output,
Discuss the command language developed by John Smith for interacting with ARRAS, and
Conclude with a more general call for digital humanities archaeology.
Zielinski and Media Archaeology
Siegfried Zielinski, in Deep Time of the Media, argues that technology does not evolve smoothly and that we therefore need to look at periods of intense development and then look at the dead ends that get overlooked to understand the history of media technology. In particular he shows how important it is to look at technologies that are not in canonical histories as precursors to “successful” technologies, because they provide insight into the thinking at the time. A study of forgotten technologies can help us understand opportunities and challenges as they were perceived at the time and on their own terms rather than imposing our prejudices. From the 1950s until the early 1990s there was just such a period of technology development around mainframe and personal computer text analysis tools. The tools developed, the challenges they addressed, and the debates around these technologies have largely been forgotten in an age of web-mediated digital humanities. For this reason we recover three important mainframe projects that can help us understand how differently data entry, output and interaction were thought through before born-digital content, output to wall-sized screens, and interaction on a touchscreen.

Busa and Tasman on Literary Data Processing
The first case study we will present is about the methods that Father Busa and his collaborator Paul Tasman developed for the Index Thomisticus. (Busa could hardly be considered a forgotten figure, but he is often referred to metonymically as a founder of the field, with relatively little attention paid to the specifics of his work and his collaborations.) Busa, when reflecting back on the project, justified his technical approach as supporting a philological method of research aimed at recapturing the way a past author used words, much as we want to recapture past development. He argued in 1980 that, “The reader should not simply attach to the words he reads the significance they have in his mind, but should try to find out what significance they had in the writer’s mind.” (Busa 1980, p. 83) Concordances could help redirect readers towards the “verbal system of an author”, that is, how the author used words in their time, and away from the temptation to interpret the text at hand using contemporary conceptual categories. Concording creates a new text that shows the verbal system, not the doctrine.

Busa’s collaborator Paul Tasman, however, presents a much more prosaic picture of their methodology that focuses on data entry using punch cards so you can actually get concordances of words. He published a paper in 1957 on “Literary Data Processing” in the IBM Journal of Research and Development that focuses on how they prepared their texts accounting for human error and other problems. Tasman writes, “It is evident, of course, that the transcription of the documents in these other fields necessitates special sets of ground rules and codes in order to provide for information retrieval, and the results will depend entirely upon the degree and refinement of coding and the variety of cross referencing desired.” (p. 256) This case study takes us back to a forgotten set of problems (representing text using punch cards) which led to more mature issues in text encoding. In the full presentation we will look closely at the data entry challenges faced by Busa’s team and how they were resolved with the card technology of the time.
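
For readers unfamiliar with the form, the keyword-in-context concordance discussed in this section can be sketched in a few lines of modern code. This is only an illustration of what a concordance produces, not a reconstruction of the punch-card procedure Busa and Tasman used. The function name `kwic` and its `width` parameter are hypothetical, and the sample text is the Busa quotation given above.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every occurrence of `keyword`.

    Hypothetical helper for illustration only; `width` is the number of
    characters of context kept on each side of the keyword.
    """
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


sample = (
    "The reader should not simply attach to the words he reads the "
    "significance they have in his mind, but should try to find out what "
    "significance they had in the writer's mind."
)

for line in kwic(sample, "significance"):
    print(line)
```

Each output line places one occurrence of the word in a fixed column with its surrounding context, which is the sense in which concording creates a new text out of an existing one.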

Glickman and Staalman on Printed Interfaces
The second case study we will look at is the development of the PRORA programs at the University of Toronto in the 1960s. PRORA was reviewed in the first issue of CHUM and, with the publication of the Manual for the Printing of Literary Texts and Concordances by Computer by the University of Toronto Press in 1966, is one of the first academic analytical tools to be formally published in some fashion. What is particularly interesting, for our purposes, is the discussion in the Manual of how concordances might be printed. Glickman had idiosyncratic ideas about how concordances could be printed as cards for 2-ring binders so that they could be taken out and arranged on a table by users. He was combining binder technology with computing to reimagine the concordance text. Today we no longer think about output to paper as important to tools, and yet that is what the early tools were designed to do, as they were not interactive. We will use this case study to recover what at the time was one of the most important features of a concording tool – how it could output something that could be published for others to use.

Fig. 1: Example of PRORA output from the Manual

Smith and Interaction
One of the first text analysis tools designed to support interactive research was John Smith’s ARRAS. In ARRAS Smith developed a number of ideas about analysis that we now take for granted. ARRAS was interactive in the sense that it was not a batch program that you ran for output. It could generate visualizations and it was explicitly designed to be part of a multi-tasking research environment where you might be switching back and forth between analysis and word processing. Many of these ideas influenced the interactive PC concordancing tools that followed, like TACT. In this paper, however, we are not going to focus on all the prescient features of ARRAS, but look at the now rather dated command language of which Smith was so proud. Almost no one uses a command language for text analysis any more; we expect our tools to have graphical user interfaces that provide affordances for direct manipulation. If you need to do something more than what Voyant, Tableau, Lucene, Gephi or Weka let you do, then you learn to program in a language like R or Python. John Smith, by contrast, spent a lot of time trying to design a natural command language for ARRAS that humanists would find easy to use, and this comes through in his publications on the tool (1984 & 1985). Command languages were, for a while, the way you interacted with such systems, and attention to their design could make a difference. Smith tried to develop a command language that was conversational so humanists could learn to use it to explore “vast continents of literature or history or other realms of information, much as our ancestors explored new lands.” (Smith 1984, p. 31) Close commanding for distant reading.

In the 2013 Busa Award lecture Willard McCarty called us to look to our history and specifically to look at the “incunabular” years before the web when humanists and artists were imagining what could be done. One challenge we face in reanimating this history is that so much of the story is in tools, standards and web sites – instruments difficult to interrogate the way we do texts. This paper looks back at one major thread of development – text analysis tools – not for the entertainment of outdated technology, but to recover a way of thinking about technology. We will conclude by discussing other ways back, including the need for better documentation about past tools, along the lines of what TAPoR 2.0 is supporting, and the need to preserve tools or at least a record of their usage.

Busa, R. (1980). "The Annals of Humanities Computing: The Index Thomisticus." Computers and the Humanities. 14(2): 83-90.

Glickman, Robert Jay, and Gerrit Joseph Staalman. Manual for the Printing of Literary Texts and Concordances by Computer. Toronto: University of Toronto Press, 1966.

Liu, Alan. (2012) “Where is Cultural Criticism in the Digital Humanities?” In Debates in the Digital Humanities. Ed. Matthew K. Gold. University of Minnesota Press. Liu’s essay is online at <>.

Smith, J. B. (1978). "Computer Criticism." STYLE XII(4): 326-356.

Smith, J. B. (1984). "A New Environment For Literary Analysis." Perspectives in Computing 4(2/3): 20-31.

Smith, J. B. (1985). Arras User's Manual: TR85-036. Chapel Hill, NC, The University of North Carolina at Chapel Hill.

Tasman, P. (1957). "Literary Data Processing." IBM Journal of Research and Development 1(3): 249-256.

Zielinski, Siegfried. (2008) Deep Time of the Media: Toward an Archaeology of Hearing and Seeing by Technical Means. Cambridge, Massachusetts: The MIT Press.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO