The Moderniſa Project: Orthographic Modernization of Spanish Golden Age Dramas with Language Models

Long paper

Authorship

1. Javier de la Rosa, National Library of Norway, Norway
2. Álvaro Cuéllar, College of Arts & Sciences, Hispanic Studies, University of Kentucky, USA
3. Jörg Lehmann, Romanistic Seminar, Eberhard Karls Universität Tübingen, Germany



Introduction

The application of computational analysis to Spanish literature, and to the Golden Age period (16th-17th centuries) in particular, has attracted growing interest in recent years (De la Rosa and Suárez, 2016; Cerezo Soler and Calvo Tello, 2019; Demattè, 2019; Fiore, 2020; García-Reidy, 2019; Vega García-Luengos, 2021). For most of this research (e.g., stylometry, sentiment analysis), a modern and homogenized orthography is usually preferred (Cuéllar and Vega García-Luengos, 2017-2021a-b). In addition, there is a genuine interest in modernization among historians and literary editors, who would benefit greatly from automatic modernization. Unfortunately, we failed to find such systems for Spanish; normalization alternatives exist as part of multilingual toolkits that deal with OCR post-correction (e.g., Reynaert et al., 2015).

Most digitization pipelines apply optical character recognition (OCR) to identify the characters of a text as printed, and traditional philologists transcribe texts as faithfully as possible to the original. While new approaches try to improve existing OCR systems so that they produce modernized text directly (Cuéllar, 2021a-b), the vast amount of readily available digitized material in digital libraries and archives cannot easily be re-processed. In this work, we demonstrate how techniques from natural language processing (NLP) can be employed to transform Spanish texts in historical orthography (circa 1590–1680) into modern normalized Spanish (RAE, 2021).

Methodology
The development of the transformer architecture (Vaswani et al., 2017) caused a paradigm shift in NLP. Transformer-based language models excel at many tasks, from coherent narrative generation to question answering, and from all sorts of classification tasks to translation (Brown et al., 2020; He et al., 2021; Liu et al., 2020; Xue et al., 2021). Unfortunately, creating these models requires billions of words, thousands of hours of computation, and considerable carbon emissions (Strubell et al., 2019). The bright side is that once a pre-trained language model (PLM) exists, it can be adjusted (fine-tuned) to a specific downstream task with limited data in a fraction of the time and resources.

In this work, we approach orthographic modernization as a translation task and fine-tune existing language models on a parallel corpus of Spanish Golden Age dramas. Most PLMs work with vocabularies that may split words into smaller sub-word units called tokens (Devlin et al., 2019); the more frequently a word appears in the pre-training corpus, the higher the probability of keeping the word intact. Since orthographic modernization is largely a character-level process, we tested both token-free and token-based PLMs. In particular, we fine-tuned the multilingual versions of the text-to-text transformers T5 and ByT5 (Xue et al., 2021; 2022) for translation from 17th-century Spanish to modern Spanish and evaluated the results using the BLEU metric (Papineni et al., 2002). To avoid misinterpretations of this translation metric caused by the similarity between 17th-century and modern Spanish (Post, 2018), we complemented it with the average character error rate (CER), and calculated both metrics on the raw corpus pairs as our baseline.
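As an illustration, the baseline computation just described can be sketched in a few lines of Python. This is a minimal sketch rather than our exact evaluation code: it assumes two aligned lists of verse lines, uses the sacrebleu package (Post, 2018) for corpus-level BLEU, and implements CER as a plain Levenshtein distance normalized by reference length; the example pair is invented purely for illustration.

# Minimal sketch of the BLEU/CER baseline over aligned line pairs.
# The example strings are invented; real data comes from the corpus.
import sacrebleu

sources = ["Que es eſto? quien à vozes llama?"]       # 17th-century spelling
references = ["¿Qué es esto? ¿Quién a voces llama?"]  # modern spelling

def cer(hyp: str, ref: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Baseline: score the unmodified old-spelling lines against the modern ones.
bleu = sacrebleu.corpus_bleu(sources, [references])
avg_cer = sum(cer(s, r) for s, r in zip(sources, references)) / len(sources)
print(f"BLEU: {bleu.score:.2f}  CER: {100 * avg_cer:.2f}%")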

Corpus Construction

We built a parallel corpus of Spanish Golden Age theater texts with pairs of Golden Age orthography and current orthography. For the old orthography, we used the Teatro Español del Siglo de Oro (TESO) corpus (https://quod.lib.umich.edu/t/teso/), since it presents each text “copied exactly as it is written, with all peculiarities captured –accents, abbreviations, etc.” (TESO Editorial Policy, online). For the current orthography, we used the Corpus de Estilometría aplicada al Teatro del Siglo de Oro (CETSO), a collection of modern editions of the same and many more texts. We chose 44 dramas by the Golden Age dramatists Juan Ruiz de Alarcón, Pedro Calderón de la Barca, Félix Lope de Vega Carpio, and Juan Pérez de Montalbán. All dramas were first published in Madrid and Barcelona between 1614 and 1691 and were written in verses of similar metrical characteristics. The two corpora were aligned line by line to establish a ground truth for the translation between the different historical varieties of Spanish.
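As an illustration, a line-by-line pairing of this kind can be sketched as follows. The sketch assumes, hypothetically, one plain-text file per play in each corpus with matching file names and one verse per line; the directory layout and field names are our own assumptions, not part of TESO or CETSO.

# Minimal sketch of line-by-line corpus alignment; paths are hypothetical.
from pathlib import Path

def load_lines(path: Path) -> list[str]:
    """Read a play, one verse per line, dropping empty lines."""
    return [ln.strip() for ln in path.read_text(encoding="utf-8").splitlines()
            if ln.strip()]

pairs = []
for old_file in sorted(Path("teso").glob("*.txt")):  # old-spelling editions
    new_file = Path("cetso") / old_file.name         # matching modern edition
    old, new = load_lines(old_file), load_lines(new_file)
    if len(old) != len(new):
        raise ValueError(f"Misaligned play: {old_file.name}")
    pairs.extend({"play": old_file.stem, "old": o, "new": n}
                 for o, n in zip(old, new))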

Results
After randomizing all 141,023 lines in the corpus, we split it into training (80%), validation (10%), and test (10%) sets, stratifying by play. We then fine-tuned T5 and ByT5 base models with a maximum sequence length of 256, performing a grid search over 3 and 5 epochs, weight decays of 0 and 0.01, learning rates of 0.001 and 0.0001, and the presence or absence of a “translate” prompt (a sketch of this setup follows Table 1). Table 1 shows test-set scores for the model of each type that performed best on the validation set.

          BLEU     CER
Baseline  48.04    8.95%
T5        79.22    4.48%
ByT5      80.66    4.20%

Table 1. Scores for the baseline and the best models on the test set.
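The following minimal sketch shows how such a fine-tuning grid search could look with the Hugging Face transformers library. It is illustrative rather than our exact pipeline: google/byt5-base is the public ByT5 base checkpoint (google/mt5-base for mT5), train_ds and val_ds stand for the splits described above (stratification by play omitted for brevity), and the batch size and any hyperparameter not listed above are assumptions.

# Minimal fine-tuning sketch; unstated hyperparameters are assumptions.
import itertools
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

CHECKPOINT = "google/byt5-base"  # token-free; "google/mt5-base" for mT5
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def preprocess(batch):
    """Tokenize old-spelling inputs and modern-spelling targets (length 256).
    The optional "translate" prompt variant would prepend a prefix here."""
    inputs = tokenizer(batch["old"], max_length=256, truncation=True)
    labels = tokenizer(text_target=batch["new"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# Grid search over the hyperparameters described above.
for epochs, decay, lr in itertools.product([3, 5], [0.0, 0.01], [1e-3, 1e-4]):
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)  # fresh start
    args = Seq2SeqTrainingArguments(
        output_dir=f"byt5-e{epochs}-wd{decay}-lr{lr}",
        num_train_epochs=epochs,
        weight_decay=decay,
        learning_rate=lr,
        per_device_train_batch_size=8,  # assumption, not stated in the text
        evaluation_strategy="epoch",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(preprocess, batched=True),  # splits assumed
        eval_dataset=val_ds.map(preprocess, batched=True),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()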

While both models perform modernization reasonably well, ByT5 seems to outperform both the baseline and T5. We applied our best model to an unseen play (Castelvines y Monteses by Lope de Vega, 1647) and analyzed the errors produced. We discovered that the model is capable of solving some difficult corner cases in typographical marks (e.g., adding initial exclamation marks) and some otherwise tricky words (cómo vs. como, qué vs. que) by leveraging contextual information. However, it struggles with proper nouns that would normally be capitalized (e.g., ‘Castelvines’, ‘Monteses’).

We also discovered some strange artifacts in our ground-truth corpus regarding archaisms and homogeneity of spelling that might have impacted the learning of the models (e.g., ‘efeto’ should appear as ‘efecto’, effect, and ‘agora’ as ‘ahora’, now).

Discussion
While the overall error rate of 4.20% can be regarded as satisfactory, the results were only evaluated on dramas written in verse form in 17th-century Spanish. However, there is a broad range of orthographic variation in the period (Sebastián Mediavilla, 2007), and it may differ from one publishing house or region to another. Thus, modernizing historical texts that were not produced under the same conditions as our corpus may lead to poorer results. Finally, we found slight differences in punctuation and spelling within our own corpus, even though the aim of these editions was to use modern normalized Spanish. While some of these undesired effects might be addressed by training at the stanza level or a higher hierarchical level to capture longer-range contextual information, doing so would also imply significantly greater computing resources, training times, and manual revision.

Conclusion
In this work, we have built a parallel corpus of 44 Spanish Golden Age dramas with text in both 17th-century and modern Spanish. We have fine-tuned language models on the task of orthographic modernization and shown a significant improvement of token-free models over both token-based models and the baseline. We closely analyzed the errors produced and assessed possible causes and mitigation strategies. We are also releasing our best model, hoping to foster research on the Spanish Golden Age period and to establish an alternative to the current cumbersome approach of transcribing Golden Age texts solely by hand.

Availability
A demo of our system can be found at

Bibliography

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., pp. 1877–901.

Cerezo Soler, J. and Calvo Tello, J. (2019). Autoría y estilo. Una atribución cervantina desde las humanidades digitales. El caso de La conquista de Jerusalén. Anales Cervantinos, 51: 231–50. doi: 10.3989/anacervantinos.2019.011.

Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S. and Smith, N. A. (2021). All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 7282–96. doi: 10.18653/v1/2021.acl-long.565.

Cuéllar, Á. (2021a). “Spanish Golden Age Theatre Prints (Spelling Modernization) 1.0”. Transkribus.

Cuéllar, Á. (2021b). “Spanish Golden Age Theatre Manuscripts (Spelling Modernization) 1.0”. Transkribus.

Cuéllar, Á. and Vega García-Luengos, G. (2017-2021a). CETSO. Corpus de Estilometría aplicada al Teatro del Siglo de Oro.

Cuéllar, Á. and Vega García-Luengos, G. (2017-2021b). ETSO. Estilometría aplicada al Teatro del Siglo de Oro. http://etso.es/.

De la Rosa, J. and Suárez, J. L. (2016). The Life of Lazarillo de Tormes and of His Machine Learning Adversities: Non-traditional authorship attribution techniques in the context of the Lazarillo. Lemir: Revista de Literatura Española Medieval y del Renacimiento, (20). Universitat de València: 373–438.

Demattè, C. (2019). Una nueva comedia en colaboración entre ¿Calderón?, Rojas Zorrilla y Montalbán: ‘Empezar a ser amigos’ a la luz del análisis estilométrico. Universidad de Navarra.

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–86. doi: 10.18653/v1/N19-1423.

Fiore, A. (2020). Questioni di autorialità a proposito di tre commedie seicentesche: Pedro de Urdemalas tra Cervantes, Lope, Montalbán, Diamante e la scuola di Calderón. Artifara.

García-Reidy, A. (2019). Deconstructing the Authorship of Siempre ayuda la verdad: A Play by Lope de Vega?. Neophilologus, 103(4): 493–510. doi: 10.1007/s11061-019-09607-8.

He, P., Liu, X., Gao, J. and Chen, W. (2021). DeBERTa: Decoding-Enhanced BERT with Disentangled Attention.

Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M. and Zettlemoyer, L. (2020). Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 8. Cambridge, MA: MIT Press: 726–42. doi: 10.1162/tacl_a_00343.

Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, pp. 311–18. doi: 10.3115/1073083.1073135.

Post, M. (2018). A Call for Clarity in Reporting BLEU Scores. Proceedings of the Third Conference on Machine Translation: Research Papers. Brussels, Belgium: Association for Computational Linguistics, pp. 186–91. doi: 10.18653/v1/W18-6319.

Reynaert, M., van Gompel, M., van der Sloot, K. and van den Bosch, A. (2015). PICCL: Philosophical Integrator of Computational and Corpus Libraries. In De Smedt, K. (ed), Proceedings of CLARIN Annual Conference 2015. Wrocław, Poland: CLARIN ERIC, pp. 75–79.

Sebastián Mediavilla, F. (2007). Puntuación, Humanismo e Imprenta en el Siglo de Oro (Publicaciones Académicas 9). Vigo, Pontevedra, Spain: Academia del Hispanismo.

Strubell, E., Ganesh, A. and McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3645–50. doi: 10.18653/v1/P19-1355.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc.

Vega García-Luengos, G. (2021). Las comedias de Lope de Vega: confirmaciones de autoría y nuevas atribuciones desde la estilometría (I). Talía. Revista de estudios teatrales, 3: 91–108. doi: 10.5209/tret.74625.

Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A. and Raffel, C. (2022). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10: 291–306.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A. and Raffel, C. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 483–98. doi: 10.18653/v1/2021.naacl-main.41.

Teatro Español del Siglo de Oro (TESO) Editorial Policy (accessed 26 April 2022).


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO