An on-line Laboratory for Linguistic Research - Complete works of Dante lemmatized

multipaper session
  1. Mirko Tavoni

    Università di Pisa

  2. Elena Pierazzo

    King's College London, Università di Pisa, Université Grenoble Alpes

  3. Letizia Leoncini

    Università di Pisa

  4. Paolo Ferragina

    Scuola Normale Superiore di Pisa

  5. Ivan Boscaino

    Università di Pisa

  6. Mirko Tavosanis

    Università di Pisa

Child sessions
  1. The lemmatization and grammatical categorization of the Latin and Vernacular works of Dante, Mirko Tavoni, Elena Pierazzo, Letizia Leoncini, Paolo Ferragina, Ivan Boscaino, Mirko Tavosanis
  2. The Lemmatized Dante's works encoding, Mirko Tavoni, Elena Pierazzo, Letizia Leoncini, Paolo Ferragina, Ivan Boscaino, Mirko Tavosanis
  3. The Search Engine and the User Interface, Mirko Tavoni, Elena Pierazzo, Letizia Leoncini, Paolo Ferragina, Ivan Boscaino, Mirko Tavosanis
Work text

The Humanities Computing research group co-ordinated by Professor Mirko Tavoni at Pisa University has decided to publish on the web both the results of its research projects and the tools used to produce them.

For that reason, a web site has been created. It collects the results and the research tools of several different research projects. The main project was the production of the lemmatized and grammatically annotated corpus of Dante Alighieri's complete Vernacular and Latin works, which is discussed in detail below. Other important projects are:

Correspondence encoding: starting from a pilot project on Giacomo Puccini's correspondence (carried out in co-operation with the Centro Studi Giacomo Puccini), the project has been extended to include the correspondence of Vittorio Alfieri and Ugo Foscolo.
Digital editions of librettos: at present only the first act of Giacomo Puccini's Tosca is available. This project was also carried out in co-operation with the Centro Studi Giacomo Puccini.
Texts from Pisa and Ferrara: two distinct text collections on the history, culture, art history and literature of Pisa and Ferrara.
Pinocchio Game: an experiment by a group of PhD students at Pisa University who revisit and adapt the principles of Jerome McGann's Ivanhoe Game.

All texts are XML-TEI encoded and are available both for reading (many as hypertext) and for linguistic querying. The user interface for managing and querying the texts is optimized for the same encoding. Queries are performed by the XCDE Search Engine, a tool developed at Pisa University by Professor Paolo Ferragina.
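
As a rough illustration of what querying a lemmatized XML-TEI text involves, the following Python sketch retrieves the word forms attached to a given lemma in a small fragment. The markup is an assumption made for the example: word elements `<w>` carrying `lemma` and `type` attributes. The project's actual encoding model and the way the search engine resolves queries are not described here, and the lemma and category values in the fragment are illustrative simplifications.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: the <w> elements and their "lemma"/"type" attributes
# are assumptions for illustration, not the project's actual markup.
SAMPLE = """<l n="1">
  <w lemma="in" type="prep">Nel</w>
  <w lemma="mezzo" type="noun">mezzo</w>
  <w lemma="di" type="prep">del</w>
  <w lemma="cammino" type="noun">cammin</w>
  <w lemma="di" type="prep">di</w>
  <w lemma="nostro" type="adj">nostra</w>
  <w lemma="vita" type="noun">vita</w>
</l>"""


def forms_for_lemma(xml_text, lemma):
    """Return, in document order, the word forms whose lemma equals `lemma`."""
    root = ET.fromstring(xml_text)
    return [w.text for w in root.iter("w") if w.get("lemma") == lemma]


def lemma_counts(xml_text, category=None):
    """Count lemma occurrences, optionally restricted to one category."""
    root = ET.fromstring(xml_text)
    return Counter(
        w.get("lemma")
        for w in root.iter("w")
        if category is None or w.get("type") == category
    )


if __name__ == "__main__":
    print(forms_for_lemma(SAMPLE, "di"))
    print(lemma_counts(SAMPLE, category="noun"))
```

A query by lemma returns every attested form regardless of spelling, which is the practical benefit of lemmatization for a corpus with variable orthography.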

Most of the texts available on the web site are the result of a semi-automatic transformation from the DBT encoding language. In particular, both the lemmatized works of Dante and the Ferrara and Pisa collections were created as part of the CiBit project (Centro Interuniversitario Biblioteca Italiana Telematica, Inter-university Centre for the Italian Telematic Library), and were later fully converted to XML-TEI. Both the texts and the tools (search engine and interface) that make up the web site are open source and are freely available for scientific and non-commercial purposes.

The site is also offered as a public resource and a laboratory for linguistic research. Scholars interested in linguistic research can send their XML-TEI encoded texts to be processed by the research group's search engine. They are also free to restrict access to their texts to specific groups of users if they so wish.

The web site also gives access to a collection of tools for Natural Language Processing (NLP) of Italian. These tools have been developed by ILC-CNR (Istituto di Linguistica Computazionale, Pisa) in collaboration with the Dept. of Linguistics (Computational Linguistics Section) of Pisa University. They allow users to perform various levels of text processing, such as tokenization, lemmatization and morphological analysis, shallow parsing (chunking) and dependency parsing.
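
To make the first two processing levels concrete, here is a toy Python sketch of tokenization followed by lexicon-based lemmatization. It is not the ILC-CNR toolchain, whose interface is not described here: the regular expression, the `LEXICON` table and the function names are all assumptions introduced for illustration.

```python
import re

# Toy lexicon mapping lowercase word forms to (lemma, category).
# The entries are illustrative assumptions, not output of the ILC-CNR tools.
LEXICON = {
    "amor": ("amore", "noun"),
    "amato": ("amare", "verb"),
    "nullo": ("nullo", "adj"),
}


def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(tokens, lexicon):
    """Attach (lemma, category) to each token; unknown forms get (None, None)."""
    annotated = []
    for token in tokens:
        lemma, category = lexicon.get(token.lower(), (None, None))
        annotated.append((token, lemma, category))
    return annotated


if __name__ == "__main__":
    for row in annotate(tokenize("Amor, ch'a nullo amato"), LEXICON):
        print(row)
```

A plain lookup table cannot resolve ambiguous forms, which is one reason the higher levels named above (morphological analysis, chunking, dependency parsing) rely on context.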

The session will focus on the lemmatized works of Dante.

The first paper will present the project from a linguistic point of view and explain the scientific criteria behind the linguistic analysis. The second paper will give an overview of the encoding history of the lemmatized texts and illustrate the encoding model. The third paper will describe how the search engine and the user interface work and how they are used.


Conference Info



Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

  • Keywords: None
  • Language: English
  • Topics: None