PoetryLab. An Open Source Toolkit for the Analysis of Spanish Poetry Corpora

paper, specified "long paper"
  1. Javier de la Rosa

    Universidad Nacional de Educación a Distancia (UNED) (National Distance Education University)

  2. Álvaro Pérez

    Universidad Nacional de Educación a Distancia (UNED) (National Distance Education University)

  3. Salvador Ros

    Universidad Nacional de Educación a Distancia (UNED) (National Distance Education University)

  4. Elena González-Blanco

    School of Human Sciences and Technology - IE University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction

The transmission of text in poetic form is a quasi-universal aspect of the oral tradition of every culture. The study of the poetic features of a text, especially the rhythmic structure of its verses, has developed within different traditions, whose scholars established the rules that might govern poetry. Within this context, the POSTDATA Project formalized a network of ontologies capable of representing any poetic expression and its analysis at the European level, enabling scholars all over Europe to exchange their data using Linked Open Data. However, because research interests vary, the resulting corpora might not share the same facets of analysis. To alleviate this concern and foster the completeness of the exchanged corpora, our team set out to build a software toolkit to assist in the analysis of poetry. This paper introduces PoetryLab, an extensible open source toolkit for syllabification, scansion (extraction of stress patterns), enjambment detection (syntactical units split across two lines), rhyme detection, and historical named entity recognition for Spanish poetry. Our toolkit achieves state-of-the-art performance in the tasks for which reproducible alternatives exist.

Design Principles

Manuals for the metrical analysis of Spanish poetry have existed since at least the 18th century, although the foundational work and the subsequent refined guides for modern analysis would take another century to appear. Despite such a long and rich tradition, few computational tools have been created to assist scholars in the annotation and analysis of Spanish poetry. With ever-increasing corpus sizes and the popularization of distant reading techniques, the possibility of automating part of the analysis has become very appealing. Although solutions exist, they are either incomplete (i.e., limited to fixed-metre poetry, mostly hendecasyllables), not applicable to Spanish, or neither open nor reproducible. These limitations guided the design of PoetryLab.
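As a concrete illustration of what scansion involves, the following is a minimal sketch that derives a stress pattern and a metrical length from a verse that has already been syllabified and stress-annotated by hand. It is not Rantanplan's implementation: the function name `scan`, the input format, and the `+`/`-` notation are assumptions made for this example. It applies two well-known conventions of Spanish metrics: a sinalefa merges a word-final vowel with the following word-initial vowel into a single metrical syllable, and the count is adjusted according to the position of the last stress. Diphthong handling, the conjunction "y", and hiatus are deliberately ignored.

```python
VOWELS = set("aeiouáéíóúü")


def starts_with_vowel(syllable):
    # A silent "h" does not block a sinalefa.
    stripped = syllable.lower().lstrip("h")
    return bool(stripped) and stripped[0] in VOWELS


def ends_with_vowel(syllable):
    return syllable[-1].lower() in VOWELS


def scan(words):
    """Return (stress_pattern, metrical_length) for one verse.

    `words` is a list of (syllables, stressed_index) tuples, where
    stressed_index is None for unstressed words. This input format is
    hypothetical and hand-annotated, not the output of any real tool.
    """
    units = []  # each unit is [text, is_stressed]
    for syllables, stressed in words:
        for i, syllable in enumerate(syllables):
            is_stressed = i == stressed
            # Sinalefa: only across word boundaries (first syllable of a word).
            if (i == 0 and units
                    and ends_with_vowel(units[-1][0])
                    and starts_with_vowel(syllable)):
                units[-1][0] += "‿" + syllable
                units[-1][1] = units[-1][1] or is_stressed
            else:
                units.append([syllable, is_stressed])
    pattern = "".join("+" if stressed else "-" for _, stressed in units)
    if "+" not in pattern:
        raise ValueError("a verse needs at least one stressed syllable")
    # Final-stress adjustment: +1 if the last stress falls on the last
    # syllable, 0 if on the penultimate, -1 if on the antepenultimate.
    syllables_after_last_stress = len(pattern) - 1 - pattern.rfind("+")
    return pattern, len(pattern) + 1 - syllables_after_last_stress


# "Cuando me paro a contemplar mi estado" (Garcilaso de la Vega),
# with stresses annotated by hand for illustration.
VERSE = [
    (["Cuan", "do"], None), (["me"], None), (["pa", "ro"], 0),
    (["a"], None), (["con", "tem", "plar"], 2), (["mi"], None),
    (["es", "ta", "do"], 1),
]
print(scan(VERSE))
```

For this verse the sketch merges "ro‿a" and "mi‿es" and returns the pattern `---+---+-+-` with a metrical length of 11, that is, a hendecasyllable.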
At its core (see Figure 1), PoetryLab provides an OpenAPI-compliant API that connects independent packages together. Built on top of the natural language processing framework spaCy, two Python packages, Rantanplan and JollyJumper, perform scansion and enjambment detection, respectively. In Spanish, some words are stressed depending on their function in the sentence, hence the need for a proper part of speech (PoS) tagger. AnCora, the corpus spaCy is trained on for PoS tagging of Spanish texts, splits most affixes, which causes some errors in the assigned tags. To circumvent this limitation and to ensure clitics were handled properly, we integrated Freeling's affix rules via a custom-built pipeline for spaCy. The resulting package, spacy-affixes, splits words with affixes before assigning PoS tags, and can be plugged into a regular spaCy pipeline that loads one of the statistical models for Spanish. This pipeline is the foundation for Rantanplan and JollyJumper, which are rule-based algorithms inspired by Ríos Mestre, Caparrós and Navarro Tomás, and Quilis and Spang, respectively.

Figure 1. General architecture of PoetryLab.

Following the OpenAPI specification, we defined a REST API that unifies the internal interfaces of the different packages and provides a common endpoint for analysis. For external packages developed in languages other than Python, PoetryLab provides a pluggable architecture that allows their integration. This is the case for our named entity recognition system, HisMeTag, developed in Java and connected to the PoetryLab API through an internal REST API. The only requirement for third-party integrations is to consume text and produce both JSON and RDF triples.

The PoetryLab API was then used to power a React-based web interface that non-technical scholars can use to interact with the packages graphically (see Figure 2). The frontend also allows downloading the generated data.

Figure 2. PoetryLab showing stressed syllables (blue), sinalefas (‿) and enjambments (↵).

Results

One notably difficult aspect of benchmarking the automated analysis of Spanish poetry is the lack of a gold standard reference corpus. For the evaluation of the syllabification algorithm in PoetryLab we built a 100k-word corpus using a combination of online resources, which we named EDFU and are releasing under a Creative Commons license. For metrical analysis we used Navarro-Colorado's corpus. For mixed-metre poetry we are using our own corpus obtained from Carvajal's annotated anthology. Unfortunately, we have not yet found a public corpus for rhyme and stanza identification, and although an enjambment corpus seems to exist, it is not publicly available.

Table 1 shows the success ratio for extracting the list of syllables of the words in EDFU, and for producing the correct metrical analysis on the different corpora with each tool. Notably, PoetryLab achieves state-of-the-art performance for syllabification and per-line metrical analysis. We were unable to reproduce Gervás' approach and are reporting their own ratios.

Table 1. Success ratios by task and tool.

  Syllabification (EDFU)
    PoetryLab (rantanplan): 99.98
    Navarro-Colorado: 98.74
    Agirrezabal: 98.06

  Metrical patterns (fixed-metre)
    PoetryLab (rantanplan): 96.22
    Navarro-Colorado: 94.44
    Gervás: 88.73
    Agirrezabal: 90.84

  Metrical patterns (mixed-metre)
    PoetryLab (rantanplan): 65.02
    Navarro-Colorado: 49.38

Although at an early stage, PoetryLab has proven useful in that it highlights some issues with the existing corpora and techniques developed to this day. First, there was no alternative system for analyzing poetry composed of verses other than hendecasyllables, for which we are using a corpus of mixed-metre poetry based on Carvajal's original annotations. Moreover, we are contributing a new corpus to evaluate syllabification procedures, and enriching the ecosystem of Python tools for Spanish by providing a spaCy pipeline that deals with clitics.
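To illustrate why clitics need to be split before PoS tagging, the following minimal sketch separates enclitic pronouns from a Spanish verb form using only the standard library. It is a toy under stated assumptions, not the spacy-affixes implementation and not Freeling's rule set: the function `split_clitics`, the tiny `VERB_FORMS` lexicon, and the accent-stripping shortcut are all hypothetical and chosen for this example.

```python
# Enclitic pronouns, with "nos" listed before "os" so the longer one is tried first.
CLITICS = ("nos", "los", "las", "les", "me", "te", "se", "lo", "la", "le", "os")

# Hypothetical mini-lexicon of verb forms; a real system would rely on a
# full dictionary and on proper affix rules.
VERB_FORMS = {"da", "di", "comer", "diciendo"}

# Attaching clitics often adds a written accent ("da" + "me" + "lo" gives
# "dámelo"), so the sketch strips acute accents before the lexicon lookup.
ACCENTS = str.maketrans("áéíóú", "aeiou")


def split_clitics(word, max_clitics=2):
    """Split `word` into [verb, clitic, ...], or return [word] unchanged."""

    def search(stem, found):
        plain = stem.translate(ACCENTS)
        # Accept only if at least one clitic was removed and the stem is known.
        if found and plain in VERB_FORMS:
            return [plain] + found
        if len(found) == max_clitics:
            return None
        for clitic in CLITICS:
            if stem.endswith(clitic) and len(stem) > len(clitic):
                result = search(stem[: -len(clitic)], [clitic] + found)
                if result:
                    return result
        return None

    return search(word.lower(), []) or [word]


for example in ("dámelo", "diciéndole", "comerlo", "casas"):
    print(example, split_clitics(example))
```

With this lexicon, "dámelo" is split into "da", "me", "lo", while "casas" is left untouched because no known verb form remains after removing a candidate clitic. Once the verb and its pronouns are separate tokens, a tagger can label each one on its own, which is the kind of input a stress-assignment rule depends on.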
Finally, we make the data produced by PoetryLab machine-readable, interoperable, and ready to be ingested into a triple store compliant with the POSTDATA Project network of ontologies.

Eventually, PoetryLab will be integrated into the larger POSTDATA Project public website, making work with European repositories of poetry a more pleasant task and assisting, whenever possible, with the metrical and rhetorical side of the analysis.

Funding Source

Research for this paper was made possible by the Starting Grant research project Poetry Standardization and Linked Open Data: POSTDATA (ERC-2015-STG-679528), obtained by Elena González-Blanco. This project is funded by the European Research Council (ERC) (https://erc.europa.eu) under the Horizon 2020 research and innovation programme of the European Union.


Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO