The Lemmatized Dante's works encoding

Mirko Tavoni; Elena Pierazzo; Letizia Leoncini; Paolo Ferrargina; Ivan Boscaino; Mirko Tavosanis

Authorship

1. Mirko Tavoni

Università di Pisa
2. Elena Pierazzo

King's College London, Università di Pisa, Université Grenoble Alpes
3. Letizia Leoncini

Università di Pisa
4. Paolo Ferrargina

Scuola Normale Superiore di Pisa
5. Ivan Boscaino

Università di Pisa
6. Mirko Tavosanis

Università di Pisa

Parent session

An on-line Laboratory for Linguistic Research - Complete works of Dante lemmatized, Mirko Tavoni, Elena Pierazzo, Letizia Leoncini, Paolo Ferrargina, Ivan Boscaino, Mirko Tavosanis

Original URL

http://web.archive.org/web/20040903094216/http://www.hum.gu.se/allcach2004/AP/html/prop39.html

Work text

The lemmatized works of Dante originated as an initiative of CiBit (Centro Interuniversitario Biblioteca Italiana Telematica, the Inter-University Centre for the Italian Telematic Library). All of Dante's works, both vernacular and Latin, were lemmatized, and every word was given a label specifying its grammatical function.

The lemmatization was first carried out with the DBT-Lemmat program, which allowed semi-automatic lemmatization and categorization of all in-text forms; the lemmatized texts were then fully converted to XML-TEI format, together with the whole CiBit text collection. For example, the lemmatized texts were processed with a Perl script that converted:

&{Lcum"cs"&}Cum
to:

<LM lemma="cum" catg="cs">Cum</LM>
We decided not to use the TEI <w> element because the planned development of the project called for distinguishing the encoding of the lemma's lexical and grammatical characteristics (encoded with <LM>) from the characteristics of the form (encoded with <OCC>; see below). TEI compliance was nevertheless achieved through an extension of the TEI DTD.
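
The conversion step described above can be illustrated with a short sketch. It is written in Python rather than Perl and is not the project's script: the pattern is inferred from the single example shown (the marker &{L, the lemma, the quoted category, the marker &}, then the word form), and the name to_lm is made up for illustration.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed shape of a DBT-Lemmat tag, inferred from one example only:
#   &{L<lemma>"<catg>"&}<form>
# The form is assumed to be a run of word characters.
TAG = re.compile(r'&\{L([^"]+)"([^"]*)"&\}(\w+)')

def to_lm(text: str) -> str:
    """Rewrite every DBT-Lemmat tag in text as an <LM> element."""
    def repl(match: re.Match) -> str:
        lemma, catg, form = match.groups()
        # quoteattr adds the surrounding quotes and escapes XML specials.
        return f"<LM lemma={quoteattr(lemma)} catg={quoteattr(catg)}>{escape(form)}</LM>"
    return TAG.sub(repl, text)

print(to_lm('&{Lcum"cs"&}Cum'))  # <LM lemma="cum" catg="cs">Cum</LM>
```

Text outside the recognized tags is passed through unchanged, so a real converter would still need to handle punctuation, structural markup and any tag variants not covered by this one example.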

A particular problem is the encoding of compound forms that can be assigned to more than one lemma. An example is the Latin form virumque, which can be split into the lemmas vir and que. At first we considered nesting an <LM> element inside another <LM> element, but this solution did not let us clearly distinguish the hierarchical order of the two lemmas, so we finally chose the following encoding:

<LM1 lemma="vir" catg="...">
<LM lemma="que" catg="9">virumque</LM>
</LM1>
This solution always makes the different weight of the two lemmas explicit: although virumque can be split into two lemmas, vir is the predominant one. Moreover, when the system is queried for occurrences of the lemma vir, it is easy to tell when vir is the dominant part of a compound form.
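
A query of this kind might be sketched as follows. This is only an illustration, assuming that an outer <LM1> element carries the dominant lemma and wraps the <LM> of the secondary one; the catg value "X" for vir, the <l> wrapper, the second occurrence of vir and the function name occurrences are all invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: catg="X" is a placeholder, not a real category code.
SAMPLE = (
    '<l>'
    '<LM1 lemma="vir" catg="X"><LM lemma="que" catg="9">virumque</LM></LM1> '
    '<LM lemma="vir" catg="X">vir</LM>'
    '</l>'
)

def occurrences(root, lemma):
    """Return (form, role) pairs for every element tagged with the given lemma."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: par for par in root.iter() for child in par}
    hits = []
    for el in root.iter():
        if el.tag not in ("LM", "LM1") or el.get("lemma") != lemma:
            continue
        if el.tag == "LM1":
            role = "dominant"      # outer element of a compound form
        elif parent.get(el) is not None and parent[el].tag == "LM1":
            role = "secondary"     # inner element of a compound form
        else:
            role = "simple"        # ordinary, non-compound form
        hits.append(("".join(el.itertext()), role))
    return hits

root = ET.fromstring(SAMPLE)
print(occurrences(root, "vir"))  # [('virumque', 'dominant'), ('vir', 'simple')]
print(occurrences(root, "que"))  # [('virumque', 'secondary')]
```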

After the XML conversion, the grammatical encoding was examined in depth, and most of the limitations imposed by the previous system were corrected.

The grammatical labels are composed using special libraries of codes; in this system the same code can take on different meanings depending on its position within the label. An example is the Italian definite article gli; to encode articles, the library offers the following categories:

r   Article

Type
    d   definite
    i   indefinite
Gender
    m   masculine
    f   feminine
Number
    s   singular
    p   plural
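
The positional scheme in the table above can be sketched as a small decoder. This is a minimal illustration covering articles only; the names ARTICLE_FIELDS and decode_article are invented here, and the project's actual libraries are not described in enough detail to reproduce.

```python
# Positional library for articles, transcribed from the table above.
# Position 0 holds the category code r; positions 1-3 hold type, gender, number.
ARTICLE_FIELDS = [
    ("type", {"d": "definite", "i": "indefinite"}),
    ("gender", {"m": "masculine", "f": "feminine"}),
    ("number", {"s": "singular", "p": "plural"}),
]

def decode_article(label: str) -> dict:
    """Expand a four-character article label into named features."""
    if len(label) != 1 + len(ARTICLE_FIELDS) or label[0] != "r":
        raise ValueError(f"not an article label: {label!r}")
    features = {"category": "article"}
    for code, (name, values) in zip(label[1:], ARTICLE_FIELDS):
        features[name] = values[code]  # raises KeyError on an unknown code
    return features

print(decode_article("rdmp"))
# {'category': 'article', 'type': 'definite', 'gender': 'masculine', 'number': 'plural'}
```

Because each position has its own lookup table, the same letter can be reused with a different meaning in another position or another category.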
Since gli is a definite, masculine, plural article, the resulting label is composed as follows: rdmp. To query the text base, we chose to exploit one of the possibilities offered by the XCDE search engine, namely regular expression search. For example, to search for definite articles, regardless of gender or number, in all the vernacular works of Dante, the value of the catg attribute will be:

rd(m|f)(s|p)
where r = article and d = definite, while the rest of the label is left open: m or f (masculine or feminine), s or p (singular or plural).
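
The effect of this expression can be checked with any regular expression engine. The sketch below uses Python's re module rather than XCDE, whose matching semantics are not described here; matching the whole attribute value with fullmatch is an assumption, and the sample labels are built from the article table plus the cs label seen earlier.

```python
import re

# The expression from the text: article, definite, any gender, any number.
DEFINITE_ARTICLE = re.compile(r"rd(m|f)(s|p)")

# Sample catg values: three definite articles, one indefinite, one non-article.
labels = ["rdmp", "rdfs", "rims", "rdms", "cs"]

# fullmatch requires the whole label to match, not just a substring.
matches = [label for label in labels if DEFINITE_ARTICLE.fullmatch(label)]
print(matches)  # ['rdmp', 'rdfs', 'rdms']
```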

The original encoding system could investigate only one aspect of the text: it only allowed forms and lemmas to be analyzed, since it worked on a list of words without context, extracted from the complex structure of the text. Once the corpus had been converted to XML, it became possible to introduce specific markup for the structural divisions of the text (books, chapters, paragraphs, stanzas, verses, headings...) and for quotations and dialectal words (quite numerous, especially in the De Vulgari Eloquentia), while re-collating the texts against the copy texts. The present encoding interprets the texts at different levels:

grammatical-lexical level, with the detailed analysis of every single form and lemma (<LM> and <OCC> elements);
logical-structural level, with the characterization of the constituent textual elements (<div*>, <head>, <p>, <lg>, <l> elements);
diatopic-interpretative level, with the geographical characterization of the forms (<foreign>, <distinct>).
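
Combining the first two levels means that lemmas can be retrieved in their structural context rather than as a flat word list. The following is a rough sketch under stated assumptions: the element names come from the list above, but the fragment itself, its lemma values, the omission of catg and the function name lemmas_by_line are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented fragment: a stanza (<lg>) of two verse lines (<l>), each holding
# lemmatized words (<LM>). It is not taken from the project's files.
SAMPLE = """
<lg>
  <l><LM lemma="mezzo">mezzo</LM> <LM lemma="cammino">cammin</LM></l>
  <l><LM lemma="selva">selva</LM> <LM lemma="oscuro">oscura</LM></l>
</lg>
"""

def lemmas_by_line(xml_text: str) -> list:
    """Return one list of lemmas per <l> element, in document order."""
    root = ET.fromstring(xml_text)
    return [[lm.get("lemma") for lm in line.iter("LM")] for line in root.iter("l")]

print(lemmas_by_line(SAMPLE))  # [['mezzo', 'cammino'], ['selva', 'oscuro']]
```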
Future developments
The flexibility of the XML encoding and the possibility of further extending the TEI DTD allow further developments of the project to be planned. In fact, we have decided to characterize the lemmas and the forms with markup able to describe:

etymology
frequency index
jargon
neologism
rhyme word
register
diatopic characterization
As a pilot scheme, we are at present re-encoding the Inferno. For this purpose we created the <OCC> element and a range of new attributes for both <OCC> and <LM>. The new element makes it possible to distinguish the characteristics of the lemma from those of the form; for example, a lemma may be diatopically characterized as southern while the form is northern because of particular phonetic features.
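
The separation between lemma-level and form-level characteristics can be modelled, very roughly, as two linked records. The sketch below uses plain Python data classes instead of XML because the attribute names of <OCC> and <LM> are not given here; every field name, the helper diverges and the placeholder values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Lemma:
    """Lemma-level information, corresponding to <LM>."""
    headword: str
    catg: str
    diatopic: Optional[str] = None  # hypothetical field, e.g. "southern"

@dataclass(frozen=True)
class Occurrence:
    """Form-level information, corresponding to <OCC>."""
    form: str
    lemma: Lemma
    diatopic: Optional[str] = None  # hypothetical field, e.g. "northern"

def diverges(occ: Occurrence) -> bool:
    """True when form and lemma both carry a diatopic label and the two differ."""
    return (
        occ.diatopic is not None
        and occ.lemma.diatopic is not None
        and occ.diatopic != occ.lemma.diatopic
    )

# Placeholder values only: a southern lemma attested in a northern form.
lemma = Lemma(headword="lemma-x", catg="X", diatopic="southern")
occ = Occurrence(form="form-x", lemma=lemma, diatopic="northern")
print(diverges(occ))  # True
```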

1. The OVI is charged with creating the TLIO (Tesoro della lingua italiana delle origini, Treasury of the Early Italian Language) as the first section of the historical dictionary of the Italian language up to the present day. To this end the OVI has collected and grammatically analyzed all the Italian texts written before 1375. Its database now holds 1,780 texts and more than 20 million words.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
