Challenging Stylometry: The Authorship of the Baroque Play La Segunda Celestina

paper, specified "long paper"
  1. 1. Laura Hernández Lorenzo

    University of Seville

  2. 2. Joanna Byszuk

    Institute of Polish Language - Polish Academy of Sciences

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In November 1675, Spanish writer Agustín de Salazar died leaving unfinished the play
La Segunda Celestina (SC), which he was writing on commission for the birthday celebrations of the Spanish Queen. The play was probably finished by an anonymous writer and performed the following year (Sabat de Rivers, 1992). In 1989, Guillermo Schmidhuber published a newly discovered ‘suelta’ of SC with the anonymous ending he claimed had been written by Sor Juana Inés de la Cruz (SJ), a prominent Hispanoamerican writer of the time, whom he also thought to have made significant changes to the original.

Literary hypothesis
In 1700, the editor of the works by SJ, Castorena y Úrsua, mentioned that she finished and improved a literary text by Salazar. When her entire works were published in 1957, one of the editors, Salceda, connected Castorena’s mention with SJ referencing a comedy about Celestina in her own play
Los empeños de una casa, and concluded that the said play could be SC. Later, Schmidhuber tried to prove it making historical, linguistic and even primitive stylometric arguments (Schmidhuber de la Mora, 1991). After a comprehensive research, Georgina Sabat de Rivers concluded that although there was not enough evidence to make it a fact, it was highly probable that SJ indeed wrote the ending, but not edited the whole text (Sabat de Rivers, 1992).

In the course of our study, we found that availability of digitized Spanish texts, especially historic ones, poses a great problem, due to few resources or repositories, and poor state of digitization – for Spanish works it mostly means scanning images of early editions (especially manuscripts or old prints), and their typographic variance makes OCR results not very useful.
As a result, our corpus was composed based on various sources. The text of SC was extracted from a digital edition (Schmidhuber de la Mora, 2016), and converted into plain text. Other dramatic works by SJ were extracted from the Cervantes Virtual Library. Salazar’s texts were much more difficult to find, as there are no digital editions. We had to use the image digitization of his texts offered by the ‘Biblioteca Digital Hispánica’. The OCR provided by the library and our software (ABBYY-FineReader 12) produced so many irregular errors that we decided to transcribe
El amor más desgraciado and
Más triunfa el amor rendido.

To place the problem in a broader perspective, we used the Canon-60 corpus (Oleza Simó, 2014), a collection of digitized Spanish Golden Age plays that includes canonical baroque works. However, this corpus is imbalanced – some authors are overrepresented, whereas for others – less famous or relevant for the literary history – there is only one play.
The final corpus combining the Canon-60, SJ’s and Salazar’s texts lacked balance in terms of genre (a mix of secular and religious plays, as well as comedies and tragedies), gender (only two women in our corpus: SJ and María de Zayas, showing that the relevance of other Spanish baroque female writers needs further studies), and, finally, nationality – all the authors are Spanish-born, except for SJ, born in ‘Nueva España’ (i.e. Mexico). We therefore limited our corpus to only one genre: ‘la comedia de capa y espada’.


Works from New Spain in the Spanish Baroque perspective
We approached the issue of verifying SJ’s authorship in a multi-step study, starting with a distant look on the literary surrounding of SC. With the primary network analysis conducted on the large corpus (Canon + SJ + Salazar) with Bootstrap Consensus algorithm for 100-1000 most frequent words, as implemented in the stylo package (Eder et al., 2016), we determined optimal settings granting stable results – deciding against using culling which completely distorted any authorial signal and relying on Cosine Delta distance measure. Importantly, network analysis confirmed some existing inspiration links between authors, e.g. SJ’s work bearing similarities to Calderón and Salazar, whom Paz cites as ones she mimicked in her youth; and unveiled new connections, e.g. to Antonio de Solís and Juan de Vera Tassis. What also supports claims of SJ extraordinary talent surpassing her times is the fact that while works of other authors cluster mostly together, her works span across the whole corpus.

Fig. 1 Network of all works in the corpus (each work divided into 2000 word samples). SJ is marked with black edges.

Strength of authorial signal and determining authorship
Preliminary authorship attribution and verification tests showed very unstable classification results. For cross-validation with SVM, Delta and NSC, and verification with the so-called “Imposters method” (Kestemont et al., 2016; Koppel and Winter, 2014) varying on settings a number of candidates were recognized as the author of anonymous part – from Calderón and Moreto to SJ, Solís and de Vera Tassis.
We decided to examine the strength of authorial signals in our corpus, which led us to excluding those who could not author the anonymous part for objective reasons such as the time of its creating (e.g. Lope) or being hub authors – strongly connected to every text in the corpus (Moreto). Inspired by Eder’s evaluation of authorial signal in short samples (2017) and thanks to his courtesy in making the script from the study available to us, we conducted a series of evaluation tests on our corpus until we were left with two authors beside Salazar: SJ and Solís. Of the three considered authors SJ had the most stable signal (see Fig. 2-4).

Fig. 2-4 Accuracy of recognition of particular authors by classification algorithm.
In the final part of our examination we once again performed cross-validated classification and verification on the small corpus consisting of one-genre works by the mentioned 3 authors against anonymous part of the text. As the anonymous part is only 4863 words, we used sequential sampling of 1000 words. In this case, SJ was attributed as the author in almost all settings (some of the results pointed to Solís influence in the last two thousand words), with the most reliable results produced by SVM and 100-500 MFWs scope (from 54.8% to 81.2% accuracy, with the average of 72.75%). Interestingly, parts of works by other authors were consistently misclassified as SJ, which might indicate either/both her domineering style or her taking inspiration from either of the authors, of whose works she must have been aware.

Editorial influence in the non-anonymous part and the ending
This problem of authorship requires detecting multiple authorial voices as we know for sure Salazar wrote significant part of the play before other person finished it, which is why we apply Rolling Classify (Eder, 2016) to detect authorial takeovers. This allows to discover both who authored the ending, and if this author made significant changes to the rest of the play. We marked two important points in the whole play: mark ‘b’ represents the place at which, according to Vera Tassis, Salazar left the play unfinished. As it can be observed in the figures 5, 6 and 7, the ending is attributed to SJ in SVM and NSC, and to Solís in Delta. The most surprising thing is that Salazar’s signal is not detected at all, which may be related to the weakness of his signal detected in previous analyses, and Sor Juana seems to dominate the rest of the play. Could it be that, if it was SJ who finished the play, she altered the rest of the text to the extent that we are not able to see Salazar anymore? This would confirm the claims of some of the scholars who defend her authorship (Schmidhuber de la Mora, 2016; Paz, 1990).
We also marked the beginning which portrays the first encounter between protagonists: doña Beatriz and don Juan, with the mark ‘a’. It is retold and alluded to several times in the play, something unusual in Golden Age theatre. The actual first encounter is a very feminist confrontation, later in texts belittled in don Juan’s retelling. It seems to betray a female writer which is why we examined it. As it can be observed in the figures, in all tests this fragment is attributed to SJ.

Figures 5-7. Rolling SVM, NSC and Delta on SC. 500 MFW and 5000 words per slice.


Our experience emphasizes the need for and usefulness of taking corpus evaluation steps in all analyses, and especially in the case of historic works, for which it is impossible to create a truly balanced corpus. Various authors seem important for the text and the situation is quite blurry. Second best candidate, well above chance level, Solís, is a new discovery, and his possible relation to the SC and SJ, especially in terms of influence or themes, may also be of interest to future studies. However, quantitative analysis and literary evidence history show that the influence of SJ is definitely the strongest, supporting the theory of her being the author of anonymous part and editor of the whole text.

Further Materials
See our corpus, evaluation of the OCR difficulties which led us to transcribe the texts, and list of plays in our GitHub:


Alatorre, A. (1990).
«La Segunda Celestina» de Agustín de Salazar y Torres: Ejercicio de crítica.
Vuelta, 46-52.

Bia, A.,
Pedreño, A. (2001). The Miguel de Cervantes Digital Library: the Hispanic Voice on the Web.
Digital Scholarship in the Humanities,
16(1), 161-177.

Eder, M. (2016). Rolling stylometry.
Digital Scholarship in the Humanities,
31(3): 457–469.

Eder, M. (2017). Short samples in authorship attribution : a new approach.
Digital Humanities 2017. Conference Abstracts. McGill University & Université de Montréal. August 8-11, 2017.

Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R: A Package for Computational Text Analysis.
The R Journal,
8(1): 107–21.

Evert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C. and Vitt, T. (2017). Understanding and explaining Delta measures for authorship attribution.
Digital Scholarship in the Humanities,
32(June 2017): ii4–ii16 doi:10.1093/llc/fqx023.

Jannidis, F., Pielström, S., Schöch, C. and Vitt, T. (2015). Improving Burrows’ Delta – An empirical evaluation of text distance measures.
DH 2015 - Global Digital Humanities.

Kestemont, M., Stover, J., Koppel, M., Karsdorp, F. and Daelemans, W. (2016). Authenticating the writings of Julius Caesar.
Expert Systems with Applications,
63: 86–96.

Koppel, M. and Winter, Y. (2014). Determining if two Documents are written by the same Author.
Journal of the Association for Information Science and Technology,
65(1): 178–187 doi:10.1002/asi.

Oleza Simó, J. (2014).
Canon 60. Valencia: Universitat de València.

Paz, O. (1982).
Sor Juana Inés de La Cruz o Las Trampas de La Fe. México: Fondo de Cultura Económica.

Paz, O. (1990). ¿Azar o justicia?.

Sabat de Rivers, G. (1992). Los problemas de la Segunda Celestina.
Nueva Revista de Filología Hispánica,
XL(1): 493–512.

Schmidhuber de la Mora, G. (1991). ‘La Segunda Celestina’: Sor Juana y la estilometría.
Vuelta: 54–60.

Schmidhuber de la Mora, G. (2016).
Hallazgo de Una Comedia Perdida de Sor Juana: La Gran Comedia de La Segunda Celestina. Alicante: Biblioteca Virtual Miguel de Cervantes
hallazgo-de-una-obra-perdida-de-sor-juana-la-gran-comedia-de-la-segunda-celestina/ (accessed 11 November 2018).

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO