The lemmatization and grammatical categorization of the Latin and Vernacular works of Dante

  1. Mirko Tavoni

    Università di Pisa

  2. Elena Pierazzo

    King's College London, Università di Pisa, Université Grenoble Alpes

  3. Letizia Leoncini

    Università di Pisa

  4. Paolo Ferrargina

    Scuola Normale Superiore di Pisa

  5. Ivan Boscaino

    Università di Pisa

  6. Mirko Tavosanis

    Università di Pisa

The language used by Dante Alighieri, in both the Vernacular and Latin, is of great importance for Italian culture. Dante's contemporaries already recognized the importance of his literary work, and his linguistic choices deeply influenced the whole Italian literary tradition and the Italian language itself. Knowing Dante's usus scribendi as thoroughly as possible therefore deepens our knowledge not only of an outstanding historical and critical subject, but also of the evolution of the Italian language. These are the motivations behind the project presented here.

The Latin and Vernacular works of Dante have been processed by tracing every form in the texts to:

the relevant lemma, with additional indications;
the grammatical category to which it belongs.

The work was carried out over many years by many young scholars. The project has attained good results, but it cannot be considered closed. New pathways for research have been opened by the expressive capabilities of XML, recently adopted as the encoding base in place of the system used in the first release of the corpus. All the texts are now undergoing systematic editing and more detailed encoding, which should be finished in the summer of 2004. At the same time, the encoding of Dante's Inferno with lexical categorization is under way as the prototype of a new phase of work.

The language used by Dante is remarkable for its freedom and its openness to new words, which required an in-depth lexical study. Particular attention was paid to the construction of the lemmas in the case of actual or presumed neologisms. Every unattested term was lemmatized only after extensive research in the lexicons and repertories of Vernacular and Latin texts of the Middle Ages. This search aimed first at identifying the neologisms probably coined by Dante himself; at the same time, we tried to find attestations for the other terms in medieval usage. Unattested Vernacular and Latin words were traced to a lemma and given a specific encoding.

The grammatical analysis followed similar but distinct criteria for the Vernacular and for Latin, reflecting the different nature of the two languages.

Vernacular Works
The lemmatization of the Vernacular works took as its model the guidelines of the OVI (Opera del Vocabolario Italiano, Institute for the Italian Lexicon) for the lemmatization and categorization of early Italian.

However, our encoding is more detailed than the OVI encoding: the latter prescribes only the identification of the grammatical category of the lemmas, whereas ours also identifies the grammatical category of each single form. In other words, the verbal form andando is not simply traced to the verb andare; it is also described as "present gerund of an active intransitive verb of the first conjugation".
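
The difference between lemma-level and form-level categorization can be illustrated with a minimal Python sketch. The element name `w`, the attribute names and the feature labels below are assumptions made for illustration only; they are not the markup actually used in the project.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class FormAnnotation:
    """One word form with its lemma and form-level features (hypothetical schema)."""
    form: str
    lemma: str
    category: str                                  # grammatical category of the lemma
    features: dict = field(default_factory=dict)   # features of this single form

    def to_xml(self) -> str:
        # Attribute names are illustrative, not the project's real markup.
        attrs = {"lemma": self.lemma, "cat": self.category, **self.features}
        element = ET.Element("w", attrs)
        element.text = self.form
        return ET.tostring(element, encoding="unicode")


# Lemma-level information alone would stop at lemma="andare", cat="verb".
# The form-level features record what distinguishes this particular form.
andando = FormAnnotation(
    form="andando",
    lemma="andare",
    category="verb",
    features={
        "mood": "gerund",
        "tense": "present",
        "voice": "active",
        "transitivity": "intransitive",
        "conjugation": "1",
    },
)
print(andando.to_xml())
```

Keeping the form-level features in a separate mapping makes it possible to query the corpus either by lemma or by any single morphological feature.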

The grammatical categories used for encoding the Vernacular works are: verb, substantive, adjective, pronoun, article, adverb, preposition, conjunction, interjection, proper noun, enclitic particle, quotation from a foreign language. Each of these categories was subdivided into features that describe the morphology and, in some cases, the syntactic function of the single forms.

Moreover, the syntactic functions of the conjunctions and prepositions of the Commedia were systematically analyzed. All the conjunctions in the text were categorized as introducing coordinate or subordinate clauses, while prepositions were given a tag indicating whether they introduce a clause (and, if so, which kind of clause) or a complement.
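
The constraints just described (a conjunction introduces a coordinate or subordinate clause; a preposition introduces either a complement or a clause of a given kind) can be sketched as a small consistency check. The class, the function and all label strings are hypothetical and serve only to make the rule explicit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FunctionTag:
    """Syntactic-function tag for a conjunction or preposition (hypothetical labels)."""
    category: str                       # "conjunction" or "preposition"
    introduces: str                     # what the word introduces
    clause_type: Optional[str] = None   # only for prepositions that introduce a clause


def problems(tag: FunctionTag) -> List[str]:
    """Return a list of inconsistencies; an empty list means the tag is well formed."""
    found = []
    if tag.category == "conjunction":
        if tag.introduces not in ("coordinate-clause", "subordinate-clause"):
            found.append("a conjunction must introduce a coordinate or subordinate clause")
    elif tag.category == "preposition":
        if tag.introduces not in ("clause", "complement"):
            found.append("a preposition must introduce a clause or a complement")
        if tag.introduces == "clause" and tag.clause_type is None:
            found.append("a preposition introducing a clause needs a clause type")
        if tag.introduces == "complement" and tag.clause_type is not None:
            found.append("a complement takes no clause type")
    else:
        found.append("unknown category: " + tag.category)
    return found


print(problems(FunctionTag("conjunction", "coordinate-clause")))   # []
print(problems(FunctionTag("preposition", "clause", "final")))     # []
print(problems(FunctionTag("preposition", "clause")))              # one problem reported
```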

The XML-based encoding system overcomes some of the old system's limits. In particular, the old system forced all enclitic particles into a single category, with no further classification. The conversion of the corpus to XML and the subsequent revision allowed a more exact classification of the enclitic particles, in which pronouns are separated from adverbs.
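
As a rough illustration of this refinement, the following Python sketch adds a subtype to a record that the old system could only label as an enclitic particle. The record layout, the key names and the example particles are assumptions for illustration; the subtype has to be supplied by the annotator, since the same particle (for instance vi) can be a pronoun or an adverb depending on context.

```python
ENCLITIC_SUBTYPES = {"pronoun", "adverb"}


def refine_enclitic(old_record: dict, subtype: str) -> dict:
    """Return a copy of an old-style enclitic record with an explicit subtype.

    The old record carries only the undifferentiated category "enclitic";
    the subtype is an editorial decision, not something derived automatically.
    """
    if old_record.get("cat") != "enclitic":
        raise ValueError("record is not an enclitic particle")
    if subtype not in ENCLITIC_SUBTYPES:
        raise ValueError("unknown enclitic subtype: " + subtype)
    return {**old_record, "subtype": subtype}


# Illustrative records, not drawn from the corpus.
old_mi = {"form": "mi", "cat": "enclitic"}
old_vi = {"form": "vi", "cat": "enclitic"}

print(refine_enclitic(old_mi, "pronoun"))
print(refine_enclitic(old_vi, "adverb"))
```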

Latin Works
The lemmatization of the Latin works took into account that Dante's Latin is Medieval Latin, with features that distinguish it from Classical Latin. The work started from a grid of tags created for Classical Latin and reconfigured it according to the phonetics, morphology and syntax of Medieval Latin.

For example, when the verb sum was used as an auxiliary in the periphrastic forms of the historic tenses of the active and deponent verbs, medieval syntax allowed it to appear in the perfect or imperfect tense (laudatus fui or laudatus eram), as in De Vulgari Eloquentia I 1 4 (ed. Mengaldo, 1968): "(scil. locutio) prima fuit humano generi usitata". In Classical Latin, by contrast, the auxiliary could be used only in the present tense. We therefore created a new tag, "auxiliary in the perfect tense", for this use of sum. The old encoding system offered no way to classify forms of the verb sum as auxiliaries. Moreover, its grammatical grid, created for Classical Latin, did not allow marking features not permitted by that standard. Verb forms like "fuit... usitata" were therefore lemmatized with "usitata" as "passive perfect" of the verb usito, while the auxiliary "fuit" was lemmatized as third person singular of the perfect of sum.
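
A minimal Python sketch of how the auxiliary could be told apart in such periphrastic forms is given below. The tag strings, the function names and the record layout are hypothetical; only the perfect and present indicative forms of sum are covered, and the sketch assumes that the periphrasis has already been identified.

```python
# Indicative forms of sum, limited to the two tenses discussed here.
SUM_PRESENT = {"sum", "es", "est", "sumus", "estis", "sunt"}
SUM_PERFECT = {"fui", "fuisti", "fuit", "fuimus", "fuistis", "fuerunt", "fuere"}


def tag_auxiliary(sum_form: str) -> str:
    """Return a (hypothetical) tag for a form of sum used as an auxiliary."""
    form = sum_form.lower()
    if form in SUM_PRESENT:
        return "auxiliary-present"    # the classical construction
    if form in SUM_PERFECT:
        return "auxiliary-perfect"    # the medieval construction
    raise ValueError("not a present or perfect indicative form of sum: " + sum_form)


def annotate_periphrasis(auxiliary: str, participle: str, participle_lemma: str) -> list:
    """Annotate the two members of a periphrastic perfect as separate records."""
    return [
        {"form": auxiliary, "lemma": "sum", "tag": tag_auxiliary(auxiliary)},
        {"form": participle, "lemma": participle_lemma, "tag": "perfect-participle"},
    ]


for record in annotate_periphrasis("fuit", "usitata", "usito"):
    print(record)
```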

A last example is drawn from syntax: the old system had no tag to signal the adjectival function of the interrogative pronoun quis. It was therefore necessary to lemmatize "quo" as a pronoun even in phrases like "Nunc autem quo modo ea coartare debemus que tanto sunt digna vulgari, sollicite investigare conemur".

The lemmatization work is based on a wide range of authoritative grammatical and linguistic tools, which served as the reference for constructing the grammatical tags and the lemmas. We cite only a few of them as examples.

Reference grammars for the Vernacular:

P. Esperti, Grammatichetta della lingua italiana ad uso del calcolatore. In S. D'Arco Avalle, Al servizio del vocabolario della lingua italiana. Firenze, 1979.
G. Rohlfs, Grammatica storica della lingua italiana e dei suoi dialetti. Torino, 1966.
L. Serianni, Grammatica italiana. Italiano comune e lingua letteraria. Torino, 1998.

Reference grammars for Latin:

M. Geymonat - L. Fort, Dialogare con il passato. Corso di lingua latina. Bologna, 2002.
A. Traina - T. Bertotti, Sintassi normativa della lingua latina. Bologna, 1993.

Lexicons for the Vernacular:

S. Battaglia, Grande dizionario della lingua italiana. Torino, 1960-2002.
Opera del Vocabolario Italiano, Tesoro della lingua italiana delle origini (Tlio):

Lexicons for Classical Latin:

Thesaurus linguae Latinae. Leipzig, 1900 ff.
A. Forcellini, Lexicon totius Latinitatis. Padova, 1940.

Lexicons for Late Antique and Medieval Latin:

C. Du Fresne Du Cange, Glossarium mediae et infimae Latinitatis. Graz, 1954.
A. Blaise, Lexicon Latinitatis Medii Aevi, praesertim ad res ecclesiasticas investigandas pertinens. Turnhout, 1986.

Main database for Christian and Medieval Latin:

Cetedoc Library of Christian Latin Texts (CLCLT-3), Turnhout, 1996.

