School of Humanities and Social Science - University of Newcastle
Pedagogical University of Krakow, Polish Academy of Sciences
Institut für Deutsche Philologie (Institute for German Philology) - Julius-Maximilians Universität Würzburg (Julius Maximilian University of Wurzburg)
Universität Antwerpen (University of Antwerp)
Julius-Maximilians Universität Würzburg (Julius Maximilian University of Wurzburg)
Quantitative analysis of literary texts is now well established in authorship attribution. There are continuing lively discussions of method, and the understanding of how classification works best with language continues to evolve, but there are some successes to point to, and most literary scholars accept that when experts disagree on an attribution, a statistical approach can be helpful. The great advantage for quantitative work in this area is that methods can be tested with texts of known origin, so that calibration and validation can be done. Practitioners and traditional scholars can share confidence in rigorous studies with good sampling, controls and validation of various kinds.
There is also the potential for computational stylistics to make a contribution in stylometry beyond authorship attribution and in the wider area of literary interpretation. However, here the problem of validation is acute. If a surprising finding emerges from a quantitative study, how can we tell a chance result, or an artefact of method, from a well-founded finding? How do we judge the robustness of the results and the degree to which conclusions may be generalized? How do we relate analyses based on thousands of long texts to the established understanding of areas of culture based on the intensive study of a few works? The best first options in moving beyond authorship may be other areas of classification, where validation is still possible, like chronology and genre study, but the bigger challenge and greatest rewards will be in interpretation in the wider sense.
This panel approaches the question of validation in computational stylistics beyond authorship attribution through case studies in a variety of languages and literary traditions. Each of the case studies concerns computational stylistics beyond authorship attribution, discusses issues of validation, robustness, and/or interpretation, and offers some considerations of the wider questions of method which arise. The panel will combine brief presentations of the use cases exemplifying the larger issues with ample time for discussion among the panelists and with the audience.
1. Fotis Jannidis & Hugh Craig: Statistical complexity in the language of lowbrow and highbrow novels in German and English
In this study we apply Shannon Entropy and Jensen-Shannon Divergence (Lin, 1991; Rosso et al., 2009) to the language of the novel, using one German and one English corpus from the nineteenth century, and one English corpus from the late twentieth century. We focus on the way low-brow and high-brow novels score on the two measures. We rely on classifications from standard literary histories for the novels and explore the statistical results with close readings of selected passages.
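For readers unfamiliar with the two measures, they can be sketched in a few lines of Python. This is a minimal illustration over toy samples with whitespace tokenization, not the study's pipeline (the abstract does not specify tokenization or the reference distribution used for Divergence over full novels):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a word-frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def jensen_shannon(counts_p, counts_q):
    """Jensen-Shannon divergence (Lin, 1991) between two word distributions,
    i.e. the mean Kullback-Leibler divergence of each from their midpoint."""
    vocab = set(counts_p) | set(counts_q)
    tp, tq = sum(counts_p.values()), sum(counts_q.values())
    p = {w: counts_p.get(w, 0) / tp for w in vocab}
    q = {w: counts_q.get(w, 0) / tq for w in vocab}
    m = {w: (p[w] + q[w]) / 2 for w in vocab}

    def kl(x, y):
        return sum(x[w] * math.log2(x[w] / y[w]) for w in vocab if x[w] > 0)

    return (kl(p, m) + kl(q, m)) / 2

# Toy usage (the study itself scores full novels):
sample_a = Counter("the quick brown fox jumps over the lazy dog".split())
sample_b = Counter("the slow brown dog sleeps under the busy fox".split())
# entropy of a uniform four-word distribution is exactly 2.0 bits
uniform = Counter("a b c d".split())
```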
Figure 1 is a graph of Jensen-Shannon Divergence and Shannon Entropy in the more modern English corpus. The novels from the Booker Prize shortlist have generally higher scores on both measures than the other identified groups of low-brow novels (with a p-value of 0.0007 for the t-test on JSD).
Fig. 1: Jensen-Shannon Divergence and Shannon Entropy scores for 376 novels in the BNC
The pattern for the nineteenth-century corpora is different. In the early- to mid-Victorian English novels, works by Charles Dickens have low scores on both Entropy and Divergence, despite his canonical status. In the 1860-90 German corpus, some popular novels, for example those by Johanna Spyri, author of the famous Heidi books, also score low on both measures, while works by popular authors like Marlitt span the highest to the lowest range of divergence, revealing a hitherto unnoticed variety of styles (here the t-test did not show a significant difference between the groups). Under some circumstances, then, Entropy (judged not to be a useful measure for authorship-attribution studies; Hoover, 2003) and Jensen-Shannon Divergence do distinguish between lowbrow and highbrow novels, but it is as yet unclear under which circumstances. In the paper we will discuss the overall relationship between these information-theoretic metrics as applied to language and classifications of the novels in terms of market sectors, relying on standard measures of statistical significance to validate our claims.
2. Maciej Eder: Bootstrap consensus network: towards a robust visualization in stylometry
Stylometric methodology, developed to solve authorship problems, can easily be extended and generalized to address different questions in literary history. Exploratory multidimensional methods, relying on distance measures and supported with visualization techniques, are particularly attractive for this purpose. However, they are very sensitive to the number of features (usually frequent words) analyzed. Worse, they are either unable to fit dozens of texts on a single scatterplot (e.g. Multidimensional Scaling) or highly dependent on the choice of a linkage algorithm (e.g. Cluster Analysis). The technique introduced in this study combines the concept of a network as a way to map large-scale literary similarities (Jockers, 2013), the concept of consensus (Lancichinetti and Fortunato, 2012), and the assumption that textual relations usually go beyond mere nearest-neighbor relations.
Fig. 2: Two algorithms of mapping textual relations
Particular texts can be represented as nodes of a network, and their explicit relations as links between these nodes. The linking procedure is twofold. The first algorithm (Fig. 2, top) computes the distances between the analyzed texts and establishes, for every node, a strong connection to its nearest neighbor (i.e. the most similar text) and two weaker connections to the first and second runners-up (i.e. the two texts ranked immediately after the nearest neighbor). The second algorithm (Fig. 2, bottom) performs a large number of tests for similarity with different numbers of features (e.g. 100, 200, 300, …, 1,000 MFWs). Finally, all the connections produced in these particular “snapshots” are added together, resulting in a consensus network. The weights of the final connections differ considerably: the strongest links mark robust nearest neighbors, while weak links stand for secondary and/or accidental similarities. Validation of the results (or rather self-validation) is provided by the fact that the consensus of many single approaches to the same corpus reinforces robust textual similarities and filters out spurious clusterings. The idea discussed in this paper can be applied to map large literary corpora (see the contribution by Jan Rybicki, below).
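The two algorithms can be sketched together in a few lines of Python. The Manhattan distance and the 3/2/1 link weights below are illustrative assumptions for the sketch, not the exact distance measure or weighting scheme of the stylo implementation:

```python
from collections import Counter, defaultdict

def consensus_network(texts, mfw_settings=(100, 200, 300)):
    """Sketch of a bootstrap consensus network. For each most-frequent-word
    setting, link every text to its three nearest neighbours (weights 3/2/1,
    an illustrative choice), then sum the link weights across all settings.
    `texts` maps text names to token lists."""
    edges = defaultdict(float)
    all_words = Counter()
    freqs = {}
    for name, tokens in texts.items():
        freqs[name] = Counter(tokens)
        all_words.update(tokens)
    for mfw in mfw_settings:
        # one "snapshot": relative frequencies of the top `mfw` words
        features = [w for w, _ in all_words.most_common(mfw)]
        vectors = {name: [freqs[name][w] / len(texts[name]) for w in features]
                   for name in texts}
        for a in texts:
            # Manhattan distance as a simple stand-in for a stylometric
            # distance measure; rank the other texts by similarity to `a`
            ranked = sorted(
                (sum(abs(x - y) for x, y in zip(vectors[a], vectors[b])), b)
                for b in texts if b != a)
            for weight, (_, b) in zip((3, 2, 1), ranked[:3]):
                edges[frozenset((a, b))] += weight
    return dict(edges)
```

Edges accumulated in many snapshots come out heavy (robust nearest neighbors), while links that appear in only one setting stay light, which is exactly the self-validating consensus described above.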
3. Jan Rybicki: Validating a large bootstrap consensus network in literary history
Over five hundred English novels from Swift to Rowling were used to produce a bootstrap consensus network of most-frequent-word frequencies, using a pseudo-bootstrapped cluster analysis from 100 to 1,000 most frequent words with the stylo package for R (Eder et al., 2013), visualized with Gephi's Force Atlas 2 layout algorithm (Bastian et al., 2009). The resulting graph yielded the usual strong authorship signal, but its overall shape exhibited a number of features that make sense in the context of traditional literary history.
Fig. 3: Bootstrap consensus network of over 500 English novels
The overall shape of the resulting network (Fig. 3) is reminiscent of the earlier-observed “stylistic drift” phenomenon (Burrows, 1994) in its general chronological order, and it follows certain topographic rules. The texts are roughly ordered from top (early) to bottom (late), with three avenues of transition from the top-most 18th-century writings to the late Victorians and modernists: the Americans Melville and Hawthorne; Dickens; and the mid-Victorian female writers, with some notable outliers. Most of the latest texts in this set gravitate towards the bottom of the graph.
With the impact of spelling variation minimized by 100% culling (a procedure by which all words that do not appear in each of the studied texts are rejected from the analysis), this is a clear indication that distant reading by most-frequent-words frequencies can mirror the evolution of literary style over hundreds of texts and hundreds of years and open new perspectives for close reading. After all, the application of statistics to literature is remarkable in that its results can be validated not by statistical means alone, but also, and perhaps above all, by traditional literary history, classification and interpretation.
4. Christof Schöch: Validating and interpreting Principal Component Analysis: A Case-Study from the Analysis of French Enlightenment Plays
This case study investigates issues of validation and interpretation of stylometric results with regard to authorship, genre, date and form, based on a collection of 120 plays from the French Enlightenment period. Preliminary analyses using Cluster Analysis have suggested that, besides authorship, categories like genre (tragedy or comedy) and form (verse or prose) are important stylistic signals in this collection.
To verify this observation, PCA (Jackson, 2003; Diana & Tommasi, 2002) was performed on different subsets of plays. In a subset of 19 comedies in either verse or prose written by five authors between 1712 and 1760, PCA shows strong effects for form (verse or prose) but results appear to vary for different settings (particularly, number of frequent words).
To test the salience of the effects for form and the robustness of the results, the contribution of several variables to the first three principal components was calculated for different settings. More precisely, F-statistics from ANOVA tests were calculated for author, form and date in relation to PC1 to PC3, for PCAs based on the 5-200 most frequent words.
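This validation step can be sketched as follows; the sketch is not the authors' actual code, and the synthetic example below stands in for the real texts-by-words frequency matrix:

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Project a texts-by-features frequency matrix onto its first
    principal components, via SVD of the column-centred data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def anova_f(component_scores, labels):
    """One-way ANOVA F-statistic: between-group over within-group variance
    of one component's scores, grouped by a categorical variable such as
    form (verse vs. prose)."""
    groups = [component_scores[labels == g] for g in np.unique(labels)]
    grand = component_scores.mean()
    k, n = len(groups), len(component_scores)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Repeating this for each feature setting and each of author, form and date yields curves of F-scores per principal component, so that the salience of one variable on one component can be compared across settings.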
Fig. 4: F-scores for author, form and date on PC1 to PC3
Figure 4 shows how PC1 is dominated by “form”, with an extremely high F-score, while “author” and “date” hardly contribute. In PC2, form does not play any role, but “author” and also “date” do. Considering its reduced scale, it appears that PC3 does not show any clear trends.
That the effect for “form” is so concentrated in one major component comes as a surprise, especially because similar effects have not been observed in other domains, such as Early Modern English Drama (Hugh Craig, paper in preparation). However, given the striking robustness of the results, the interpretation of the PCA can proceed with confidence. Further application domains of this method are stylometric investigations of variables such as genre, theme or literary period.
5. Mike Kestemont: Learning Deep Representations of Characters in Literary History
One of the most exciting movements in current Machine Learning is “Deep Learning” (Bengio, 2009). Researchers in this field attempt to move beyond “hand-crafted” features. Older, so-called “shallow” learning techniques, commonly used in stylometry, depend heavily on a researcher's representation of a problem, which is typically strongly biased, and make no attempt to optimize or even correct this representation. In “Deep Learning”, the idea is that one should learn not only how to solve a problem given some input information, but also how that input information is best represented in order to solve the problem. To achieve this, researchers send data through a layered structure of units (a “neural network”). At each subsequent layer in this architecture, the representation of the original input grows increasingly abstract.
Fig. 5: Country and Capital Vectors Projected by PCA. (Copyright: Mikolov et al. for Google Inc.)
Recent research has demonstrated that Deep Learning yields extremely valuable problem representations. In distributional semantics, for instance, it yields an extremely powerful vector-space model of words. This contribution will survey how such vector spaces have recently been used for advanced analogical reasoning. In a series of breakthrough papers, Mikolov et al. (e.g. 2013) have shown that these models (see Fig. 5) can be used to answer complex questions like: “What is to king, like woman is to man?” (answer: “queen”), or “What is to Warsaw, like France is to Paris?” (answer: “Poland”). I will discuss how this kind of representational learning could be applied to modeling characters from literary history. The main idea is that we should be able to easily answer questions about the archetypical relationships between characters: “Who is to Romeo, like Isolde is to Tristan?”. I will argue that this approach offers an exciting new framework to study the “meaning” of literary personas (cf. Bamman et al., 2013) and their cross-novel interrelations.
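The analogical-reasoning step over such a vector space can be sketched as follows. The vectors here are hand-set toy values purely for illustration; real models (Mikolov et al., 2013) learn them from large corpora:

```python
import numpy as np

# Hand-set toy embedding space; a trained model would learn these
# representations from text, so this is purely illustrative.
vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.8, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.8]),
    "person": np.array([0.1, 0.45, 0.45]),
}

def analogy(a, b, c, vectors):
    """Answer 'what is to c, like b is to a?' by finding the nearest cosine
    neighbour of vec(b) - vec(a) + vec(c), excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    return max((w for w in vectors if w not in (a, b, c)),
               key=lambda w: cos(vectors[w], target))

print(analogy("man", "woman", "king", vectors))  # prints "queen" in this toy space
```

Applied to vectors for literary characters rather than words, the same offset arithmetic would answer questions like the Romeo/Tristan/Isolde one above.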
Bamman, D., O’Connor, B. and Smith, N. (2013). Learning Latent Personas of Film Characters. Proceedings of ACL 2013, Sofia, Bulgaria, pp. 352-361.
Bastian, M., Heymann, S. and Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.
Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1): 1-127.
Burrows, J. F. (1994). Tiptoeing into the infinite: testing for evidence of national differences in the language of English narrative. Research in Humanities Computing, 2: 1-33.
Diana, G., and Tommasi, Ch. (2002). Cross-validation Methods in Principal Component Analysis: A Comparison. Statistical Methods and Applications 11/1: 71-82.
Eder, M., Kestemont, M. and Rybicki, J. (2013). Stylometry with R: a suite of tools. Digital Humanities 2013: Conference Abstracts. Lincoln: University of Nebraska-Lincoln, pp. 487-89.
Hoover, D. (2003). Another perspective on vocabulary richness. Computers and the Humanities, 37: 151-78.
Jackson, J. E. (2003). A User’s Guide to Principal Components. Hoboken: Wiley.
Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press.
Lancichinetti, A. and Fortunato, S. (2012). Consensus clustering in complex networks. Scientific Reports, 2(336): 1-7.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1): 145-51.
Mikolov, T., Yih W. and Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT 2013, Atlanta, Georgia, 746-751.
Rosso, O., Craig, H., and Moscato, P. (2009). Shakespeare and other English Renaissance authors as characterized by Information Theory complexity quantifiers. Physica A, 388: 916-26.
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
July 7, 2014 - July 12, 2014
377 works by 898 authors indexed
XML available from https://github.com/elliewix/DHAnalysis
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)