Framework for Testing Text Analysis and Mining Tools

poster / demo / art installation
Authorship
  1. John Edward Simpson

    University of Alberta

  2. Geoffrey Rockwell

    University of Alberta

  3. Stéfan Sinclair

    McGill University

  4. Kirsten C. Uszkalo

    University of Alberta

  5. Susan Brown

    University of Alberta, University of Guelph

  6. Amy Dyrbye

    University of Alberta

  7. Ryan Chartier

    University of Alberta

Work text
The most extensive compendium of text mining tools to date included 71 tools and summarized each based on ten criteria (van Gemert 2007). While extensive, this listing of tools and their properties is general in its review criteria and does not offer any testing-based observations to help users assess actual usability. Humanists looking to try text analysis, visualization and mining tools for research need better information that is relevant to their needs and reviews of tools that help them make choices. This poster presents the testing framework developed for the TAPoR 2.0 portal reviews. The poster will cover:

1. The need for tool reviews
2. The information gathered about tools
3. The testing and reviewing process
4. Conclusions about the state of text tools
The poster will be accompanied by a demonstration of TAPoR 2.0 so that users can see the reviews in context.

1. The Need for Tool Reviews
A humanities researcher new to computing methods who looks online for reviews of text tools written by peers is going to be disappointed. There is nothing like the New York Review of Books, though in the early days of humanities computing you could find short announcements about tools in journals like Computing in the Humanities. We, however, believe that certain text tools are intellectual contributions to the field (Ramsay and Rockwell 2012) that should be reviewed not just to help people choose what tools to use, but also as a way of engaging these tools in a dialogue around computer-assisted interpretation. While there are individual blog entries about tools scattered across the web, each is written from the perspective of a single user with an entirely different dataset, making comparison difficult. If we want to make computing methods accessible and encourage colleagues to use tools, we need a more systematic approach. This is especially true of text mining tools, which can’t simply be tried with a text at hand.

2. Information Gathered About Tools
TAPoR 2.0 (www.tapor.ca) is a portal for the discovery and review of text analysis, visualization, and mining tools. It is a complete redevelopment of the original TAPoR portal (Rockwell 2009) that refocuses the portal on discovery and review instead of trying to provide access only to web services. As part of the redevelopment we used a persona/scenario usability design approach (Cooper 2004) to identify the attributes that users might want to use to discover tools. Further, we built TAPoR 2.0 so that editors can add new attributes without the database having to be reprogrammed. The attributes we currently record for tools include the author(s), ease of use, type of analysis, type of license and so on. We also have links to related tools and to tools people also used. Our poster will be accompanied by a demonstration of TAPoR 2.0 so that visitors can explore what we have and how we represent it.
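
One common way to let editors add attributes without changing a database schema is to store attribute definitions as data and to record one (tool, attribute, value) row per fact. The sketch below illustrates that general idea in Python. It is not a description of how TAPoR 2.0 is actually implemented: the class `ToolCatalogue`, its methods, the tool names "Tool A" and "Tool B", and the attribute values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCatalogue:
    """Hypothetical in-memory stand-in for a tool database with open-ended attributes."""

    # Attribute definitions are data: name -> allowed values (None means free text).
    attribute_types: dict[str, Optional[list[str]]] = field(default_factory=dict)
    # One row per (tool, attribute, value) fact.
    rows: list[tuple[str, str, str]] = field(default_factory=list)

    def define_attribute(self, name: str, allowed: Optional[list[str]] = None) -> None:
        """Register a new attribute; no schema change is involved."""
        self.attribute_types[name] = allowed

    def set_value(self, tool: str, attribute: str, value: str) -> None:
        """Record a value for a tool, checking it against the attribute definition."""
        if attribute not in self.attribute_types:
            raise KeyError(f"unknown attribute: {attribute}")
        allowed = self.attribute_types[attribute]
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} is not allowed for {attribute!r}")
        self.rows.append((tool, attribute, value))

    def find_tools(self, attribute: str, value: str) -> list[str]:
        """Return the sorted names of tools that have the given attribute value."""
        return sorted({t for t, a, v in self.rows if a == attribute and v == value})


# Illustrative data only: "Tool A" and "Tool B" are placeholders, not real tools.
catalogue = ToolCatalogue()
catalogue.define_attribute("type of analysis")
catalogue.define_attribute("ease of use", allowed=["easy", "moderate", "difficult"])
catalogue.set_value("Tool A", "type of analysis", "text mining")
catalogue.set_value("Tool B", "type of analysis", "visualization")
catalogue.set_value("Tool A", "ease of use", "difficult")

# Later, an editor introduces a new attribute without touching the code above.
catalogue.define_attribute("type of license")
catalogue.set_value("Tool B", "type of license", "open source")

print(catalogue.find_tools("type of analysis", "text mining"))
print(catalogue.find_tools("type of license", "open source"))
```

The trade-off of this kind of design is that validation moves out of the schema and into the attribute definitions, which is why the sketch checks values against an optional list of allowed terms.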

Figure 1:
TAPoR 2.0 Home Screen

3. The Testing and Reviewing Process
Recording basic information about tools is, alas, not enough, especially with sophisticated text mining tools like Mallet (mallet.cs.umass.edu/) that take time to learn and that can be used in different ways. With text mining tools, users need longer narrative reviews. For this reason we developed processes for testing and reviewing tools. For simpler text analysis and visualization tools, this involved developing a set of different texts with which to test tools so that we could compare their use. For text mining we had to go further: we are working with the CWRC project (Canadian Writing Research Collaboratory, www.cwrc.ca) to develop a number of literary corpora for which we can draw on experts to help assess the value of results. As of writing we have three corpora drawn from the Orlando project (www.ualberta.ca/ORLANDO) and one of Victorian children’s literature. We expect to have two more by the time of our presentation. The poster will discuss the criteria used to develop these open test corpora.
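
The idea of testing every tool against the same set of texts can be pictured as filling in a tool-by-text grid. The Python sketch below is only an illustration of that idea under simplifying assumptions: the two "tools" are toy functions, the test texts are invented, and `run_matrix` is a hypothetical helper. The reviewing described here is carried out by people, not by a script like this.

```python
from collections import Counter

# Invented stand-ins for a shared set of test texts (not the project's corpora).
TEST_TEXTS = {
    "short_prose": "The cat sat on the mat. The cat slept.",
    "empty": "",
}


def most_frequent_word(text):
    """Toy 'tool': most frequent lowercased word; raises IndexError on empty input."""
    words = text.lower().replace(".", " ").split()
    return Counter(words).most_common(1)[0][0]


def word_count(text):
    """Toy 'tool': number of whitespace-separated tokens."""
    return len(text.split())


TOOLS = {
    "most_frequent_word": most_frequent_word,
    "word_count": word_count,
}


def run_matrix(tools, texts):
    """Run every tool on every text and record either the result or the failure."""
    results = {}
    for tool_name, tool in tools.items():
        for text_name, text in texts.items():
            try:
                results[(tool_name, text_name)] = ("ok", tool(text))
            except Exception as exc:  # a failure is itself a useful observation
                results[(tool_name, text_name)] = ("error", type(exc).__name__)
    return results


results = run_matrix(TOOLS, TEST_TEXTS)
for (tool_name, text_name), outcome in sorted(results.items()):
    print(tool_name, text_name, outcome)
```

Because every tool sees the same inputs, differences in the grid reflect the tools rather than the data, and edge cases such as an empty text show how each tool fails.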

The reviews take the form of comments that have been pinned to the top of the list of comments available. This allows others to leave comments, though we haven’t seen much activity by people not connected to the project (with the exception of spammers, who seem to feel there is a connection between text analysis tools and various stimulants). We have developed guidelines for reviews so as to make them accessible and comparable. The poster will outline our guidelines.

4. Conclusions from Testing and Reviewing
Having tested and reviewed a variety of tools and text mining systems, we see some common barriers to access. Most of these tools have been developed for use by their developers and are poorly documented for people not involved in the development. Further, many tools, including those we are involved in, are in continuous development, so what documentation exists is out of date. We will therefore end this poster with lessons learned while testing and reviewing text mining tools, with particular attention to removing usability barriers for novice users.

References
Cooper, A. (2004). The Inmates Are Running the Asylum. Indianapolis, Indiana: SAMS.
Ramsay, S. and G. Rockwell (2012). Developing Things: Notes toward an Epistemology of Building in the Digital Humanities. In Gold, M. K. (ed.), Debates in the Digital Humanities. Minneapolis, Minnesota: University of Minnesota Press, 75-84.
Rockwell, G. (2006). TAPoR: Building a Portal for Text Analysis. In Siemens, R. and Moorman, D. (eds.), Mind Technologies: Humanities Computing and the Canadian Academic Community. Calgary: University of Calgary Press, 285-299.
van Gemert, J. (2000). Text Mining Tools on the Internet. ISIS Technical Report Series. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4312886.

Conference Info

Complete

ADHO - 2013
"Freedom to Explore"

Hosted at University of Nebraska–Lincoln

Lincoln, Nebraska, United States

July 16, 2013 - July 19, 2013

Conference website: http://dh2013.unl.edu/

Series: ADHO (8)

Organizers: ADHO