National Research University Higher School of Economics
The introduction of computational methods into philological analysis is complicated by the richness, complexity and multidimensional nature of literary texts. As Ju. M. Lotman put it, the arts should be considered a secondary modelling system, whilst natural language is the primary one (Lotman, 1967). Computational models make it possible to extract and analyse linguistic features of a text: part-of-speech tags, syntactic structures, word frequencies, and semantic domains of word classes. Meanwhile, the most important elements of textual poetics remain beyond the reach of these tools. As a result, computational methods for literary study have developed most rapidly in computational stylistics and stylometry (Burrows, 2002; Rybicki, 2006; Eder, 2015; Franzini et al., 2018) and in thematic modelling (Jockers, 2015; Schöch, 2017), which can be carried out with a bag-of-words approach alone, without any specific textual mark-up.
It seems that one of the main reasons why distant reading methods are still not regarded as mainstream in literary scholarship, and are often viewed with suspicion by traditional philologists, is that they have great difficulty accessing textual complexity, that is, the layers of the secondary modelling system. In fact, even when complicated phenomena are modelled, such as social networks of characters or the extraction of plot elements, studies tend to focus on engineering rather than on research questions; in other words, they elaborate computational methods rather than methods of literary analysis.
This paper introduces a new approach to the task of capturing textual complexity. We use five methods to model the character system of a novel, each aimed at uncovering one layer of that system. Combined, these layers give a complex view of the novel's composition, enriched by computationally obtained data, quantitative and statistical metrics, and graphical schemes and networks. We apply stylometric and alternative non-lexical analysis to the characters' direct speech, two alternative methods of network analysis to model character interactions, and a clustering method to compare portrait descriptions in Leo Tolstoy's "War and Peace". Tolstoy's great novel, which counts hundreds of characters, several dozen of whom may be viewed as prominent, is perfect material for such a study. We claim that with the help of complex layer analysis we can reveal new structural constituents of the novel's composition that could not be captured by traditional (close reading) interpretations of Tolstoy's poetics.
Preliminary preparations
The complex layer analysis of the character system requires thorough and precise mark-up. All automatic and semi-automatic mark-up was checked and, where needed, corrected manually. First, all the characters were encoded with TEI labels. This step was also important because a character may be referred to by several different names (cf. Pierre, Bezuchov, Petr Kirillovich) and by anaphoric pronouns. Second, all the dialogues in the novel were identified and linked to their speakers (characters) and their TEI labels. Finally, all the portrait descriptions were extracted with the help of semantic mark-up borrowed from the National Corpus of the Russian Language (Toldova et al., 2008). The next stage of the portrait mark-up involved assigning each description to one of the four types by which Tolstoy renders his characters' appearance: metaphorical, emotional, portrait, and value expression. Each sentence of a description was also given an overall sentiment assessment (positive, negative, oxymoron or neutral).
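The name-normalization step can be illustrated with a minimal sketch. It assumes a hand-made alias table; the label "#pierre" and the function name tag_mentions are hypothetical, since the actual TEI labels are not given here. Anaphoric pronouns cannot be resolved by such a lookup and would still need the manual checking described above.

```python
import re

# Hypothetical label: the actual TEI identifiers are not given in the text.
ALIASES = {
    "pierre": "#pierre",
    "bezuchov": "#pierre",
    "petr kirillovich": "#pierre",
}

def tag_mentions(text, aliases):
    """Return (matched name, label) pairs for every alias found in text."""
    # Try longer aliases first so that multi-word names are matched whole.
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, names)) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), aliases[m.group(0).lower()])
            for m in pattern.finditer(text)]

# Toy sentence, invented for illustration only.
mentions = tag_mentions("Pierre smiled. Petr Kirillovich bowed.", ALIASES)
```

Both surface forms in the toy sentence are mapped to the same label, which is what allows later steps to count speech and mentions per character rather than per name.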
Methodology
Three major traits of literary characters have been studied with the help of computational means: speech, portrait and social interactions.
Two methods were used to analyse character speech. The first relies on basic stylometric analysis (Burrows's Delta). It measures the distance between the direct speech of different characters by comparing the distribution of the top keywords across the speech sentences of every prominent character. The stylometric method captures the layer of topical connections between the characters in the novel. The main oppositions it establishes are those between men and women, and between the Moscow and Saint Petersburg circles.
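A minimal sketch of a Delta-style distance, under stated assumptions, may make this step concrete. It follows the classic formulation (relative frequencies of the most frequent words, z-scored across characters, then the mean absolute difference); the function names, the population standard deviation, the value of top_n and the toy speech are all illustrative assumptions, not the settings used in this study.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one character's speech."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def burrows_delta(speech_by_character, top_n=5):
    """Pairwise Delta: mean absolute difference of z-scored word frequencies."""
    pooled = Counter()
    for tokens in speech_by_character.values():
        pooled.update(tokens)
    vocab = [w for w, _ in pooled.most_common(top_n)]
    names = sorted(speech_by_character)
    freqs = {n: relative_freqs(speech_by_character[n], vocab) for n in names}
    z = {n: [] for n in names}
    for j in range(len(vocab)):
        column = [freqs[n][j] for n in names]
        mu, sigma = mean(column), pstdev(column)
        for n in names:
            # A word used identically by everyone carries no signal.
            z[n].append((freqs[n][j] - mu) / sigma if sigma > 0 else 0.0)
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy data with placeholder speakers, invented for illustration only.
toy_speech = {
    "A": "i do not know what i think".split(),
    "B": "i do not know what i think".split(),
    "C": "the army and the emperor and the war".split(),
}
deltas = burrows_delta(toy_speech, top_n=5)
```

In this toy case the two speakers with identical speech are at distance zero, while the third is at a positive distance from both; with real data the choice of top_n noticeably affects the result.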
The second method offers an alternative to the stylometric approach. It compares non-lexical characteristics of direct speech sentences, such as the ratio of words to punctuation marks in a speech sentence, the use of exclamation and question marks, the frequency of discourse words, and a readability score. These parameters differentiate the characters by their manner of speech, with the opposition between "oral" and "written" types of utterance as the principal one. Each character considered is represented by a vector holding the mean values of all the parameters. The clustering model groups together the vectors that lie closest to each other. PCA shows that this layer is sensitive to family similarity, as it opposes Natasha and Nikolay Rostov to Andrey Bolkonsky and his sister Princess Mary.
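The construction of such a vector can be sketched as follows. This is an illustration under assumptions: the discourse-word list is hypothetical, the readability score is omitted, the ratios are taken per word, and the example sentences are invented English stand-ins for the novel's speech. In practice the features would probably be standardized before clustering or PCA.

```python
import re
from math import dist
from statistics import mean

# Hypothetical discourse-word list; the list used in the study is not given.
DISCOURSE_WORDS = {"oh", "ah", "well", "so"}

def sentence_features(sentence):
    """Non-lexical features of one speech sentence, each normalized per word."""
    words = re.findall(r"\w+", sentence.lower())
    punctuation = re.findall(r"[^\w\s]", sentence)
    n = max(len(words), 1)  # guard against sentences without words
    return [
        len(punctuation) / n,                          # punctuation per word
        sentence.count("!") / n,                       # exclamation marks
        sentence.count("?") / n,                       # question marks
        sum(w in DISCOURSE_WORDS for w in words) / n,  # discourse words
    ]

def character_vector(sentences):
    """Mean of every feature over all speech sentences of one character."""
    rows = [sentence_features(s) for s in sentences]
    return [mean(column) for column in zip(*rows)]

# Invented toy sentences contrasting an "oral" and a "written" manner.
oral = character_vector(["Oh, well! Why?", "Ah! So soon?"])
written = character_vector(["The regiment marched at dawn.",
                            "I have considered the matter carefully."])
distance = dist(oral, written)  # Euclidean distance between the two vectors
```

Characters whose vectors lie close together under such a distance would fall into the same cluster.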
The analysis of portraits was also based on vector clustering. A vector is built for each prominent character from two groups of metrics: the first combines the normalized frequencies of each type of portrait description, and the second the frequencies of the four types of sentiment assessment. The hierarchical tree for this layer brought together the vectors of the main couples of the novel: Pierre and Natasha, and Nikolay and Princess Mary. Surprisingly, the analysis also revealed an intrinsic similarity between the two "bad guys" of the novel, Napoleon and Dolokhov. This layer reflects a concealed parallelism in the way Tolstoy conceives and describes his characters.
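The portrait vectors and the hierarchical tree can be sketched in a few lines. The sketch assumes that each annotated sentence is a (description type, sentiment) pair and uses naive average-linkage agglomeration with Euclidean distance; the linkage and distance actually used are not stated, and the characters X, Y, Z and their annotations are placeholders, not data from the novel.

```python
from collections import Counter
from itertools import combinations
from math import dist

TYPES = ("metaphorical", "emotional", "portrait", "value")
SENTIMENTS = ("positive", "negative", "oxymoron", "neutral")

def portrait_vector(annotations):
    """Normalized frequencies of description types, then of sentiments."""
    n = len(annotations)
    type_counts = Counter(t for t, _ in annotations)
    sentiment_counts = Counter(s for _, s in annotations)
    return ([type_counts[t] / n for t in TYPES]
            + [sentiment_counts[s] / n for s in SENTIMENTS])

def agglomerate(vectors):
    """Average-linkage clustering; returns the merges in the order made."""
    def linkage(a, b):
        total = sum(dist(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

# Placeholder characters and annotations, invented for illustration only.
toy = {
    "X": portrait_vector([("emotional", "positive"), ("portrait", "positive")]),
    "Y": portrait_vector([("emotional", "positive"), ("portrait", "neutral")]),
    "Z": portrait_vector([("metaphorical", "negative"), ("value", "negative")]),
}
merges = agglomerate(toy)
```

The order of merges is what a dendrogram draws: the pair joined first is the most similar in the way it is described.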
Social interaction was measured by two semantic networks built on different bases. The first sets a connection between characters who talk to each other. This network reveals the main communities of the novel: characters who communicate intensively cluster together. One may suppose that the layer captured by this network reflects the family drama strand of the plot.
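A minimal sketch of the dialogue network, assuming that the mark-up yields one (speaker, addressee) pair per exchange; the pair format, the function names and the labels A, B, C are hypothetical. Edges are undirected and weighted by the number of exchanges, and community detection would then be run on the weighted graph.

```python
from collections import Counter

def dialogue_network(exchanges):
    """Undirected weighted edges: how often two characters talk to each other."""
    edges = Counter()
    for speaker, addressee in exchanges:
        if speaker != addressee:  # ignore soliloquies
            edges[frozenset((speaker, addressee))] += 1
    return edges

def weighted_degree(edges):
    """Total number of exchanges each character takes part in."""
    degree = Counter()
    for pair, weight in edges.items():
        for name in pair:
            degree[name] += weight
    return degree

# Placeholder exchanges, invented for illustration only.
network = dialogue_network([("A", "B"), ("B", "A"), ("A", "C")])
degrees = weighted_degree(network)
```

Using frozenset keys makes an exchange from A to B and one from B to A count towards the same edge.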
The second method treats the joint mention of two characters as a connection. The two networks differ in crucial ways, particularly in the war parts, where dialogue occurs less often than in the peace parts. This layer stands for the epic part of the novel: Napoleon and Kutuzov form one cluster here, although they never communicate and lie at the greatest distance from each other in the previous network.
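The co-mention network and its comparison with the dialogue network can be sketched in the same spirit. The unit within which two characters count as mentioned together (paragraph, scene, chapter) is not specified here, so the sketch simply takes a list of segments, each given as the labels mentioned in it; all names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def comention_network(segments):
    """Undirected weighted edges: in how many segments two characters co-occur."""
    edges = Counter()
    for mentioned in segments:
        for a, b in combinations(sorted(set(mentioned)), 2):
            edges[frozenset((a, b))] += 1
    return edges

def only_in_comention(comention_edges, dialogue_edges):
    """Pairs that are mentioned together but never talk to each other."""
    return {pair for pair in comention_edges if pair not in dialogue_edges}

# Placeholder segments and dialogue edges, invented for illustration only.
segments = [["N", "K"], ["N", "K", "A"], ["A", "B"]]
comentions = comention_network(segments)
silent_pairs = only_in_comention(comentions, {frozenset(("A", "B"))})
```

Pairs returned by only_in_comention are those for which the two layers diverge, which is the pattern described for Napoleon and Kutuzov.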
Discussion
The five layers of the character system in "War and Peace" are defined by five different methods of computational analysis of literary text. None of the methods is new to digital humanities: stylometry, clustering, PCA, corpus and network analysis. An important conclusion needs to be emphasized: when applied to clean, well marked-up data, the methods all give different results and lead the researcher to different interpretations. At the same time, the masterpiece contains all of these interpretations within itself. Moreover, the inner architecture of the text supports and interconnects all the interpretative layers in a very sophisticated way. This point may be illustrated by the following example. If we consider the whole novel excluding the epilogue, we see that the speech and network layers group the five main characters of the novel (Natasha, Pierre, Andrey, Nikolay and Princess Mary) differently, but no grouping reflects the romantic relationships of the novel.
Figure 1. Connections/proximity of the five main characters revealed by the four layers of speech and interaction analysis of the novel without the epilogue: no romantic relations are reflected in these layers.
Figure 2. Connections added by the analysis of the epilogue (dashed lines) reveal the romantic connections of the main characters.
It is in the epilogue that the whole construction gains stability and the romantic attractions of the characters are openly shown to the reader. Prince Andrey, though no longer alive in the epilogue, is still mentioned often in the conversations of Natasha and Pierre; thus the romantic triangle, which is never defined explicitly in the course of the novel, is resolved in the epilogue. Remarkably, the fifth layer, which is not included in the example above and reflects the way Tolstoy depicts and presents his characters, connects the members of the two happy families of the epilogue from the very beginning.
Bibliography
Burrows, J. (2002). ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, 3: 267–287.
Eder, M. (2017). Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities, 1: 50–64.
Franzini, G., Kestemont, M., Rotari, G., Jander, M., Ochab, J.K., Franzini, E., Byszuk, J., Rybicki, J. (2018). Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm. Frontiers in Digital Humanities, published 05 April 2018. https://doi.org/10.3389/fdigh.2018.00004
Jockers, M. (2015). Revealing Sentiment and Plot Arcs with the Syuzhet Package. http://web.archive.org/web/20181116084544/http://www.matthewjockers.net/2015/02/02/syuzhes/ (accessed 16 November 2018).
Lotman, Ju.M. (1967). Statji po semiotike kultury i iskusstva (Papers on semiotics of culture and art). Uchenye Zapiski Tartuskogo Universiteta, 198: 130–145.
Rybicki, J. (2006). Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz’s Trilogy and its Two English Translations. Literary and Linguistic Computing, 1: 91–103.
Schöch, C. (2017). Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama. Digital Humanities Quarterly, 2: 1–53.
Toldova, S.Ju., Kustova, G.I., Lyashevskaya, O.N. (2008). Semanticheskie filtry dlia razreshenia mnogoznachnosti v nacionalnom korpuse russkogo jazyka (Semantic filters for word sense disambiguation in the Russian National Corpus). Computational linguistics and intellectual technologies, Moscow, RGGU, pp. 522–29.
Hosted at Utrecht University
Utrecht, Netherlands
July 9, 2019 - July 12, 2019
436 works by 1162 authors indexed
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html
Series: ADHO (14)
Organizers: ADHO