King's College London
A journey from Hell to Heaven, investigating the computational opportunities of automating text analysis and producing data visualisations.

This poster presents the results of exploratory work on a reusable tool that generates data visualisations from automatic text analysis. Its main non-functional requirements are flexibility (accepting different text inputs) and optimisation (producing rich visualisations with minimal setup). The visual outputs produced by the application have an exploratory function, in that they aim to: offer a different perspective on the text under study; highlight patterns and/or outliers (Meirelles 2013); drive research in formulating new hypotheses; and provide support to, or disprove, existing theses.

The current version comprises modules (i.e. software components) designed around one selected test case, namely Dante Alighieri’s Divine Comedy, but serves as a blueprint for further modules to be plugged in. The Italian version of the Commedia (Petrocchi 1966-67) is used to perform structural text analysis and to work on the rhyme scheme, while the English translation (Mandelbaum 1980-84) is used for sentiment analysis. The unique way in which Dante wrote his masterpiece makes the text an interesting dataset to explore computationally. Its structural (spatial and temporal) components lend themselves to graphical representation and offer insights into its linguistic content. The visual outputs allow users to interact with both the content and the metadata.

The application performs computational text analysis to produce data visualisations representing the following structural, stylistic and semantic features of the text: a schematic representation of the poem’s structure and rhythm; the distribution of keywords; and a visual representation of the sentiment analysis (fig. 3).

Figure 1. An example of the schematic representation of the poem’s structure: rhythm imposed by tercets and rhyme prediction.

Figure 2. Words like Cristo (Christ) and stelle (stars) are distributed unevenly across the three cantiche: the word “Christ” never appears in the Inferno, while it is widely used in the Paradiso. One square per line.

Figure 3. Sentiment analysis visualisation of the three cantiche. Red is negative, blue is positive, and the opacity indicates how close the sentiment is to either end of the polarity range (-1, 1). One square per line.

The application has been developed modularly (Martin and Martin 2006), following the separation of concerns design principle (Dijkstra 1982) to allow for flexibility and scalability. The computational side of the project is implemented in Python, a flexible programming language that supports both object-oriented and functional paradigms. The visualisations are produced with the D3.js library, “a JavaScript library for manipulating documents based on data” (Bostock D3.js <https://d3js.org/>). The application exploits the HTML5 and SVG specifications to allow for greater interaction and portability.

Natural language processing (NLP) and machine learning techniques have been applied to process and transform the data. The Naive Bayes classifier (Perkins 2010) was chosen for its performance and simple implementation. A training dataset was created manually by collecting random subsets of text from other authors close to Dante in language and time, and from a further work by Dante himself: Ludovico Ariosto, Orlando furioso (Wikisource contributors 2012); Dante Alighieri, Convivio (Gallarino <http://www.italica.it/dante/convivio.html>); Giovanni Boccaccio, Decamerone (Wikisource contributors 2017).

The poster illustrates the workflow from input to output with a diagram of the process. It also presents the achievements of this proof of concept and development ideas for the future.
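The per-line encodings described for Figures 2 and 3 (one square per line, a keyword flag, and red or blue colour with opacity tied to polarity) can be illustrated with a minimal Python sketch. The function names, the regular-expression matching and the CSS rgba() output format are assumptions made for illustration and are not taken from the application itself; in practice the polarity values would come from the classifier.

```python
import re
from typing import List


def keyword_flags(lines: List[str], keyword: str) -> List[bool]:
    """Return one flag per verse line: True where the keyword occurs as a whole word."""
    # Assumption: case-insensitive, whole-word matching is adequate for a sketch.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [bool(pattern.search(line)) for line in lines]


def polarity_to_rgba(polarity: float) -> str:
    """Map a sentiment polarity in [-1, 1] to a CSS rgba() colour string."""
    # Clamp so that an out-of-range score cannot yield an invalid opacity.
    p = max(-1.0, min(1.0, float(polarity)))
    # Red for negative, blue for positive; a score of 0.0 is fully transparent.
    red, green, blue = (255, 0, 0) if p < 0 else (0, 0, 255)
    # Opacity grows as the score approaches either pole (-1 or 1).
    opacity = round(abs(p), 2)
    return f"rgba({red}, {green}, {blue}, {opacity})"


# Usage: one value per line, ready to be bound to one square each.
lines = [
    "Nel mezzo del cammin di nostra vita",
    "E quindi uscimmo a riveder le stelle.",
]
flags = keyword_flags(lines, "stelle")   # [False, True]
colour = polarity_to_rgba(-0.8)          # "rgba(255, 0, 0, 0.8)"
```

Each returned list has the same length as the input, so its values could be serialised (for example as JSON) and handed to the visualisation layer in reading order.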
The main success of the proof of concept lies in its modular design (fig. 4), which makes it amenable to further development (algorithm refinements, visualisation workflows, stylometric analysis). More languages and different text structures will be integrated and a wider range of output visualisations offered, while making use of the same core functionalities for ingesting and processing data.

Figure 4. The data model of the application, illustrating the separation of concerns and the potential for extensibility.
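One way to picture modules that plug into shared core functionality is a small registry, sketched below in Python. This is only an illustration under stated assumptions: the names MODULES, register, ingest and run, and the two example modules, are hypothetical and do not describe the application's actual data model.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a module name to an analysis function.
MODULES: Dict[str, Callable[[List[str]], list]] = {}


def register(name: str):
    """Decorator that plugs an analysis module into the shared registry."""
    def decorator(func):
        MODULES[name] = func
        return func
    return decorator


def ingest(raw_text: str) -> List[str]:
    """Shared core step: split raw text into stripped, non-empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


@register("tercets")
def tercets(lines: List[str]) -> list:
    """Group lines in threes; a final shorter group keeps any leftover lines."""
    return [lines[i:i + 3] for i in range(0, len(lines), 3)]


@register("words_per_line")
def words_per_line(lines: List[str]) -> list:
    """Count whitespace-separated words in each line."""
    return [len(line.split()) for line in lines]


def run(raw_text: str, names: List[str]) -> dict:
    """Ingest once, then apply each requested module to the same lines."""
    lines = ingest(raw_text)
    return {name: MODULES[name](lines) for name in names}
```

In this sketch, adding a new analysis means writing one decorated function; the ingestion step and the run function stay unchanged, which is the kind of separation of concerns the abstract describes.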
In review
Hosted at Carleton University, Université d'Ottawa (University of Ottawa)
Ottawa, Ontario, Canada
July 20, 2020 - July 25, 2020
475 works by 1078 authors indexed
Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.
Conference website: https://dh2020.adho.org/
References: https://dh2020.adho.org/abstracts/
Series: ADHO (15)
Organizers: ADHO