Julius-Maximilians Universität Würzburg (Julius Maximilian University of Wurzburg)
Julius-Maximilians Universität Würzburg (Julius Maximilian University of Wurzburg)
Julius-Maximilians Universität Würzburg (Julius Maximilian University of Wurzburg)
One of the algorithms that have been established in digital literary studies during the recent years is LDA (Latent Dirichlet Allocation) topic modeling (Blei 2012; Seyvers and Griffith 2006). This method allows researchers to analyze the distributions of semantic groups in a corpus of texts, which can be useful for both the exploration of contents and automated text classification tasks. With regard to the increasing interest on topic modeling in the DH community, the infrastructure project DARIAH-DE has started in 2017 to develop the TopicsExplorer. Based on the python libraries "DARIAHTopics" (Jannidis et al. 2017) and "LDA" by Allan Riddell, the TopicsExplorer is a locally run standalone software for topic modeling that allows researchers to generate and explore topic models on their own computers, with their own text files, all within the comfort of a single GUI (graphical user interface) tool that supports the entire process from preprocessing to the visualization of results (https://dariah-de.github.io/TopicsExplorer/).
Though lacking the performance and flexibility of popular command line solutions, such as MALLET (McCallum 2002), or programing libraries, like Gensim (Rehurek and Sojka 2010), the advantage of the TopicsExplorer is its simplicity and usability. It can be used productively - within its limitations - by researchers without any programing skills. It does not even require users to use the command line. It thereby covers a significant gap: curious researchers lacking the technical skills required to use conventional tools are now able to try topic modeling and learn about the potentials and limitations of the method. On the one hand, this enlarges the spectrum of researchers able to participate in an informed discourse about DH-related research that relies in topic modeling as a method. On the other hand, researchers who think about using topic modeling in their own investigations can either directly use the TopicsExplorer for simpler problem, or at least learn beforehand enough about the method to make an informed decision before investing the effort to acquire the technical skills necessary to work with more advanced tools.
The first version of the TopicsExplorer has been presented to various groups of researchers and students in a number of workshops in 2017 and 2018. Experiences and user feedback collected in these workshops have shaped the development work in the past year, and the changes implemented go far beyond simple debugging and securing sustainable functionality on as many platforms as possible. The tool began as a so-called "GUI Demonstrator" for the DARIAHTopics Python library that required installation of a Python environment, ran as a local server and was displayed in a browser (Simmler et al. 2018). With version 1.0, it became a fully functional standalone software that can be downloaded and run on common Windows, MacOS and Linux systems without further preparation. It features interactive visualizations and csv export of results. A number of smaller enhancements proposed by test users, like a progress bar and abort button (Fig. 1), have incrementally improved the usability of the 1.x versions.
With the recently published version 2.0 architecture and interface have undergone a major redesign that addresses the more complicated feature demands derived from the feedback from our test users. On the technical level, the former solution for interactive visualization that was based on Python's "Bokeh" library has been replaced by a Javascript-based solution. This allows for more flexibility in the implementation of additional interactivity features. To improve user experience especially for users who want to explore a corpus visually through a topic model, a new visualization concept based on the ideas of Chaney and Blei (2012) was implemented in this version. The concept allows, for example, to display a document from the corpus with all its related information from the topic model, and at the same time to list other documents with similar or related content (Fig. 2).
Topic modeling already is a research method often encountered in the digital humanities, though one exclusively used and critically discussed by researchers with advanced technical skills. It is our hope that the TopicsExplorer, with all the ongoing improvements, will help to move the method out of that particular niche.
Figure 1: Version 1.x progress bar.
Figure 2: Overview for a single document in version 2.0.
Bibliography
Blei, David M.
(2012): „Probabilistic Topic Models“, in
Communication of the ACM
55, Nr. 4 (2012): 77–84. doi:10.1145/2133806.2133826.
Chaney, Allison J.B./ Blei, David M. (2012): "The Visualizing Topic Models", in:
Proceedings of the S
ixth International AAAI
Conference on
Weblogs and Social Media 419-422.
Jannidis, Fotis/ Pielström, Steffen / Schöch, Christof / Vitt, Thorsten (2017): "Making topic modeling easy: a programming library in Python", in:
Proceedings of the
Digital Humanities 2017 Conference.
McCallum, Andrew K.
(2002):
MALLET : A Machine Learning for Language Toolkit
.
http://mallet.cs.umass.edu
.
Rehurek, Radim/ Sojka, Petr
(2010): "Software framework for topic modelling with large corpora."
In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks
.
Simmler, Severin/ Vitt, Thorsten / Pielström. Steffen (2018): "LDA Topic Modeling über eine graphisches Interface", in:
Konferenzabstracts
der 5. Tagung des Verbands Digital Humanities im deutschsprachigen Raum e.V. 428-429.
Steyvers, Mark/ Griffiths, Tom
(2006): „Probabilistic Topic Models“, in
Latent Semantic Analysis: A Road to Meaning
, herausgegeben von T. Landauer, D. McNamara, S. Dennis, und W. Kintsch. Laurence Erlbaum.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
In review
Hosted at Utrecht University
Utrecht, Netherlands
July 9, 2019 - July 12, 2019
436 works by 1162 authors indexed
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html
Series: ADHO (14)
Organizers: ADHO