The Value of Tag Cloud Visualizations for Textual Analysis

Stefan Jänicke; Markus John; Annette Geßner

Authorship

1. Stefan Jänicke

Leipzig University of Applied Sciences
2. Markus John

Universität Stuttgart
3. Annette Geßner

Leipzig University of Applied Sciences

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction

In digital humanities applications, tag clouds are often used as a means of distant reading (e.g., Beaven 2011, Koch et al. 2014, Hinrichs et al. 2015, Montague et al. 2015, John et al. 2016). By dissolving the structure of texts, thus, splitting it into words, the frequencies of different words, in the following denoted as
tags,

can be determined. Typically, tag clouds take the most frequent N tags of a text corpus, and by mapping frequency to font size and arranging the tags in a random manner on the screen, the observer gets a quick and intuitive summary of the textual content of the corpus. At least since Wordle (Viégas et al. 2009) has been offered to the public to generate tag clouds on demand, they enjoy great popularity and are widespreadly used.

Nevertheless, there are crucial theoretical problems in the design of tag clouds (Viégas et al. 2008) that question their benefit for text analysis tasks. Finding a single word without assistance is hard, long words receive more visual attraction than short ones as they cover more space, and font sizes, thus the frequencies of words, are difficult to compare. Furthermore, a tag cloud usually does not display all words in a corpus, thus, neglecting less frequent words can lead to misinterpretations. In this paper, we evaluate the value of several tag cloud visualization techniques that have been designed to support research tasks in various digital humanities scenarios. Continuing prior works of the co-authors (Jänicke et al. 2014, Jänicke and Geßner 2015), we base our analysis on the Bible known as the most often read and researched books, thus, well-suited for evaluation purposes. We chose the King James Bible being the most influential English translation.

TagVenn Diagrams

TagPies have been designed to compare the contexts of different words (Jänicke et al. 2018). For a set of M different terms, TagPies generate M+1 different tag sets for shared (1 set) and non-shared vocabulary (M sets). TagVenn diagrams are an extension of TagPies aiming to compare co-occurrences more precisely as all combinations of shared vocabulary are considered. Taking three terms a, b and c and their respective sets of co-occurring tags A, B and C as an example, TagVenn diagrams visualize the following tag sets: A\(B∪C), B\(A∪C), C\(A∪B), (A∩B)\C, (A∩C)\B, (B∩C)\A and A∩B∩C. The tags are arranged in a Venn diagram style with a set of colors reflecting the cut sections. As the human ability to distinguish colors is limited, a maximum of four texts (generating 15 sets) can be analyzed with Tag Venn diagrams.
A similarity in topics and writing style can be seen looking at the King James Bible: It is known that the books of the four Evangelists have a lot in common with John differing the most from the others. This can clearly be seen by comparing John with Marcus and Matthew with the minimum number for occurrences set to four (Figure 1). John (JOH) and Marcus (MAR) have significantly less words in common than Mark and Matthew (MAT). It is interesting to see here that Matthew has the biggest number of words only used in this book instead of John as might be expected. Especially worthy for further investigation in the close reading were words like “truth” (27x) and “true” (13x) are frequently used by John, and less than 4x used by the other two Evangelists.
In the current version the scholar has to set a minimum number of occurrences, which may not always give accurate results. Then, the diagram might show a word as only occurring in one book of the Bible although other books may also include that word with a too small amount of occurrences. Setting a high number of occurrences will not always be interesting for a researcher since very frequently used words are not necessarily relevant for determining the content of a text. A workaround to avoid unwanted results could be extending the number of stopwords on demand. Also, adding the option to set a “maximum number of occurrences” would give more opportunities to research different questions. Maybe even considering classes of frequency could bring interesting results and open up an interesting field of research.

TagVenn Diagram showing the most frequent words in Mark, Matthew and Johannes

MultiCloud

MultiCloud has been developed as a flexible approach to show different documents in a merged word cloud visualization (John et al. 2018). It depicts each document with a colored circle around the word cloud shape (Figure 2) and it applies a force-based layout in combination with a collision detection algorithm to use the available space as optimally as possible. This way, the positions of the tags represent the occurrences in the different documents. Additionally to the position in the layout, the font size represents the overall occurrences in the documents and color saturation indicates how relevant each tag is in the respective text documents. Furthermore, MultiCloud offers the possibility to control the number of displayed tags: Analysts can set both the number of displayed tags that occur in more than one document and the number of displayed words that occur only in an individual document. Both options can be combined and trigger an immediate update of the visualization as depicted in Figure 2. Since MultiCloud uses the largest qualitative color scheme of ColorBrewer (Harrower & Brewer 2003), which comprises twelve distinct colors, it supports the analysis of twelve different documents.

By scanning the first tag cloud (Figure 2A), users get a quick and intuitive overview of the textual content that all the chosen books have in common, like “lord” and “jesus” as well as differences in topics. Looking at words only appearing in one of the chosen books shows that the word “esaus” (Figure 2B, orange) might be a spelling mistake in book Genesis or a problem of normalization, if the apostrophe is filtered out, for the Biblical name “Esau”, which is used more often in different books of the Bible. Also interesting is the name “abram”, which is only used in one of the chosen books (Genesis, orange), for this is the rarely used former name of the very frequently mentioned “abraham”. Determining whether a word is relevant for a specific research question depends strongly on the books chosen to compare. The third tag cloud visualization (Figure 2C) offers an interesting overview of tags that describe the seven subparts and the individual documents. Document-related tags, for example, like the city of “babylon” only being mentioned in book
Jeremiah (green
) as well as tags that characterize all loaded parts such as “glory” or “prophet” can be easily identified.

Three different word cloud variations of seven subparts of the King James Bible using: only tags that occur in multiple documents, tags that only appear in the individual documents, and a combination of both.

TagSpheres

TagSpheres have been designed within a digital humanities project to support the analysis and classification of a term's co-occurrences according to their clause functions (Jänicke et al. 2017). The analyzed term T is placed in the center, and the co-occurrences are placed on concentric discs around it. The depth of the analysis, thus the number of levels, is user-configurable. The farther away from the co-occurring tag in the text, the farther away it is also placed in TagSpheres. A divergent cold-hot color map is implemented to transmit the notion of distance. T is colored red, the tags of the outer disc are colored blue, and a color on the gradient between red and blue is assigned to intermediate levels. Clicking a tag X lists text passages in which T and X have the associated word distance.

Choosing the word “righteousness” with a maximum distance of 6 in all Bible books shows an interesting amount of times this word itself and variations of it are reoccurring in the near proximity: The adjective “righteous” appears once two words apart, four times three words apart, once four words apart, twice five words apart and twice six words apart. Also, the word itself is repeated several times (i.e. six times at four words apart) as well as its opposite “unrighteousness” etc. This case shows that the chosen word is frequently being used in a repetitive, sermon-like style of writing, e.g. in Psalms and especially in the Sermon on the Mount in Matthew.

Increasing the minimum number of occurrences can massively change the result. Currently, stopwords are omitted, but researchers might want stopwords taken into consideration when trying to detect interesting chains of words like sayings, proverbs etc. It could also be interesting to look for a single word re-occurring in the span of a work and visualizing this or looking for a more flexible span of words between two re-occurring words indicating (e.g. rhetoric or stylistic) habits of the author.

TagSpheres showing the co-occurrences for the word “righteousness” dependent on word distances from 1 to 6

Summary
The three case studies outline that, despite the above mentioned theoretical problems, tag clouds can be–if they are carefully designed–valuable tools to support different research inquiries. As opposed to Wordle, all presented visualizations use tag color and position to express a tag’s set relations. It was important for the literary scholar in all scenarios to interactively get access to the underlying texts to examine upcoming hypotheses. Furthermore, different parameter sets shall be provided to generate multifarious views on the text in question.

Bibliography

Beaven D.
(2011).

ComPair: Compare and Visualise the Usage of Language. In Proceedings of the Digital Humanities 2011.

Harrower, M., & Brewer, C. A.
(2003).

ColorBrewer. org: an online tool for selecting colour schemes for maps. The Cartographic Journal, 40(1), 27-37.

Hinrichs, U., Alex, B., Clifford, J., Watson, A., Quigley, A., Klein, E., & Coates, C. M.
(2015).

Trading consequences: A case study of combining text mining and visualization to facilitate document exploration. Digital Scholarship in the Humanities, 30(suppl_1), i50-i75.

Koch, S., John, M., Wörner, M., Müller, A., & Ertl, T.
(2014).

VarifocalReader—in-depth visual analysis of large text documents. IEEE transactions on visualization and computer graphics, 20(12), 1723-1732.

Jänicke, S., Geßner, A., Büchler, M., & Scheuermann, G.
(2014).

5 Design Rules for Visualizing Text Variant Graphs. In: Conference Abstracts of the Digital Humanities 2014.

Jänicke, S., & Geßner, A.
(2015).

A Distant Reading Visualization for Variant Graphs. In: Conference Abstracts of the Digital Humanities 2015.

Jänicke, S., & Scheuermann, G.
(2017).

On the Visualization of Hierarchical Relations and Tree Structures with TagSpheres. In Computer Vision, Imaging and Computer Graphics Theory and Applications: 11th International Joint Conference, VISIGRAPP 2016, Rome, Italy, February 27 – 29, 2016, Revised Selected Papers, pages 199–219. Springer International Publishing, Cham.

Jänicke, S., Blumenstein, J., Rücker, M., Zeckzer, D., & Scheuermann, G.
(2018).

TagPies: Comparative Visualization of Textual Data,” in Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: IVAPP. SciTePress, 2018, pp. 40–51.

John, M., Lohmann, S., Koch, S., Wörner, M., & Ertl, T.
(2016).

Visual analysis of character and plot information extracted from narrative text. In International Joint Conference on Computer Vision, Imaging and Computer Graphics (pp. 220-241). Springer, Cham.

John, M., Marbach, E., Lohmann, S., Heimerl, F., & Ertl, T.
(2018).

MultiCloud: Interactive Word Cloud Visualization for the Analysis of Multiple Texts. In: Proceedings of Graphics Interface 2018. Canadian Human-Computer Communications Society / Société canadienne du dialogue humain-machine; 2018, p. 34 – 41.

Montague J., Simpson J., Rockwell G., Ruecker S., & Brown S.
(2015).

Exploring Large Datasets with Topic Model Visualizations. In Proceedings of the Digital Humanities 2015.

Viégas, F., & Wattenberg, M.
(2008).

TIMELINES: Tag Clouds and the Case for Vernacular Visualization. interactions, 15(4):49–52.

Viégas, F., Wattenberg, M., & Feinberg, J.
(2009).

Participatory Visualization with Wordle. IEEE Transactions on Visualization and Computer Graphics,15(6), 1137–1144 (Nov 2009).

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO

The Value of Tag Cloud Visualizations for Textual Analysis

1. Stefan Jänicke

2. Markus John

3. Annette Geßner

ADHO - 2019

"Complexities"