Modelling Lexical Entries: a Database for the Analysis of the Language of Young People

  1. Fabrizio Franceschini

    Italian Studies - Università di Pisa

  2. Elena Pierazzo

    Italian Studies - Università di Pisa


At the Humanities Faculty of the University of Pisa, we started a large project devoted to the study of the language and culture of young people. The enquiries were initially held in the area from which students of the University of Pisa come, including the districts of Massa-Carrara, Lucca, Pistoia, Pisa, Livorno and Grosseto, and the district of La Spezia in the Liguria region. We distributed a questionnaire among students in the last two years of secondary school (around 18 years old). The enquiries took place in towns that host secondary schools, i.e. towns and large villages of the urbanized countryside.
The questionnaire includes:
1. a sociolinguistic section enquiring about social condition, cultural preferences and lifestyle, use of dialect within the family, and use of forms exclusive to young people;
2. a lexical section comprising 36 onomasiological questions (“How do you say for…”) referring to the different spheres of young people’s lives (family, school, external world, interpersonal relations, judgements, etc.);
3. open fields for spontaneous linguistic insertions.
The enquiries involved 2,500 informants and produced over 70,000 forms. Because of this huge mass of data, we looked for a storage solution that would let us query the data easily. First of all, we wanted to query the forms themselves in a quick and simple way. Secondly, we wanted to group the results according to different parameters such as place of enquiry, sex of the informants, and socio-cultural divisions. Other goals were to measure the influence of dialect or of mass-media language and to bring into focus any lines of development of the Italian language.
It was immediately clear that the results of the enquiries could not be transferred directly into a digital format because they lacked structured information. Furthermore, we felt the need to analyze and classify the forms produced from a linguistic point of view, for example by tracing every form to its relevant lemma and identifying its grammatical category, in order to enable sophisticated queries.
At first we tried to mark up the data in XML, using the TEI Terminological Databases tag set, but this attempt showed some limitations from the very beginning.
XML works best with semi-structured data such as texts; our lexical entries, on the contrary, are strongly structured data and are connected with both linguistic information and personal data supplied by informants. Establishing such links through XPath or the ID/IDREFS mechanism proved quite cumbersome. Furthermore, the TEI Terminological Databases tag set offers only generic elements, which describe lexical entries poorly; the TEI Consortium itself has judged this tag set unsatisfactory, and it will be heavily revised in the forthcoming P5 version
(see In
addition, when we transferred into XML the data from just two points of enquiry, we obtained files so large that most common XML applications had considerable difficulty managing them. For all these reasons, we decided to move the data into a MySQL relational database, which would let us bypass such limitations. A relational database is certainly a more suitable and better-performing solution for strongly structured data such as our entries. We called our database BaDaLì, an acronym for Banca Dati Linguagiovanile, but also an expression that means ‘look over there’ or ‘mind you’ or even, in an antiphrastic sense very common in the language of young people, ‘never mind’. Figure 1 shows a simplified diagram of BaDaLì.
BaDaLì can be ideally subdivided into four main modules: the informants module (green area), the questionnaires module (yellow area), the lemmas module (blue area) and the forms module (orange area). The table lingue (‘languages’) is a lookup table shared by the lemmas and forms modules, while the table inchieste (‘enquiries’) connects the informants module with the questionnaires module.
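The modular layout described above can be sketched as a handful of tables linked by foreign keys. This is only a minimal illustration, not the actual BaDaLì schema: it uses SQLite (from the Python standard library) in place of MySQL, the table names parlanti, inchieste, lemma, forme and lingue come from the text, while the table domande and every column name are assumptions.

```python
import sqlite3

# Hypothetical, heavily reduced schema. Table names follow the text where it
# gives them (parlanti, inchieste, lemma, forme, lingue); "domande" and all
# column names are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE lingue    (id INTEGER PRIMARY KEY, nome TEXT NOT NULL);
CREATE TABLE inchieste (id INTEGER PRIMARY KEY, luogo TEXT NOT NULL);
CREATE TABLE parlanti  (id INTEGER PRIMARY KEY,
                        inchiesta_id INTEGER NOT NULL REFERENCES inchieste(id),
                        sesso TEXT);
CREATE TABLE domande   (id INTEGER PRIMARY KEY,
                        inchiesta_id INTEGER NOT NULL REFERENCES inchieste(id),
                        testo TEXT NOT NULL);
CREATE TABLE lemma     (id INTEGER PRIMARY KEY,
                        lemma TEXT NOT NULL,
                        lingua_id INTEGER REFERENCES lingue(id));
CREATE TABLE forme     (id INTEGER PRIMARY KEY,
                        forma TEXT NOT NULL,
                        lemma_id INTEGER NOT NULL REFERENCES lemma(id),
                        parlante_id INTEGER NOT NULL REFERENCES parlanti(id),
                        domanda_id INTEGER NOT NULL REFERENCES domande(id),
                        lingua_id INTEGER REFERENCES lingue(id));
"""


def create_sketch_db():
    """Create an in-memory database holding the reduced schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def referenced_tables(conn, table):
    """Return the sorted names of the tables that `table` points to."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return sorted({row[2] for row in rows})  # column 2 = referenced table


conn = create_sketch_db()
print(referenced_tables(conn, "forme"))
```

In this sketch lingue is referenced by both lemma and forme, and inchieste is referenced by both parlanti and domande, which mirrors the roles the text assigns to those two tables.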
Forms module
The central point of the database is the table forme (‘forms’), which contains the forms produced by informants. Every form is traced to its grammatical category (categorie grammaticali table). In some cases the form is also related to a specific dialect (dialetti table), or to a language (lingue table) in the case of a foreignism. In this way any distance between form and lemma (which we call gradiente, ‘gradient’) is measured in terms of dialectal influence, foreign features or innovative traits at the graphical level.
The forms module is connected to all the other modules: every form is traced to its relevant lemma and is produced by an informant in response to a question.
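One way to picture how a form record could carry the information behind the gradient is sketched below. The text does not say how the gradient is stored or computed, so the field names, the boolean flag for graphical innovation and the function gradiente are all hypothetical, and the sample forms are invented for illustration rather than taken from the survey.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Forma:
    """A form as produced by an informant (all field names are assumptions)."""
    forma: str                        # the form as written by the informant
    lemma: str                        # the lemma the form is traced to
    categoria: str                    # grammatical category
    dialetto: Optional[str] = None    # set only for dialectal forms
    lingua: Optional[str] = None      # set only for foreignisms
    grafia_innovativa: bool = False   # innovative spelling


def gradiente(f: Forma) -> list[str]:
    """List the traits that separate a form from its lemma."""
    traits = []
    if f.dialetto is not None:
        traits.append(f"dialectal ({f.dialetto})")
    if f.lingua is not None:
        traits.append(f"foreign ({f.lingua})")
    if f.grafia_innovativa:
        traits.append("graphical innovation")
    return traits


# Invented examples, not survey data.
print(gradiente(Forma("skuola", "scuola", "noun", grafia_innovativa=True)))
print(gradiente(Forma("money", "money", "noun", lingua="English")))
```

A form with no dialect, no foreign language and standard spelling yields an empty list, i.e. no distance from its lemma under this simplified model.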
Lemmas module
Tracing forms to their relevant lemmas is a crucial point, especially for innovative forms not recorded by dictionaries. For this reason, we selected a reference dictionary and established a number of criteria for creating a suitable lemma for unattested occurrences. We decided to adopt the most complete dictionary of modern Italian, the Grande Dizionario Italiano dell’Uso (Gradit), edited by Tullio De Mauro. When a new lemma is inserted in the database, specific codes are added to mark its absence from Gradit or a semantic innovation with respect to it.
The relevant lemmas (lemma table) have been categorized too, following an updated version of the classification of the lexical components of the language of young people (componenti lessico table) proposed by Cortelazzo 1994.
Informant module
The table parlanti (‘informants’) collects all the information pertaining to the informants. A number of questions highlighted the need to typify the data collected by the enquiry. For example, the questionnaire asked informants to declare their birthplace and residence. The result was a list of towns and villages that had little relevance in a large-scale enquiry. As the question about birthplace was included to retrieve information about the origin of the informant’s family, in order to determine whether the influence of a non-local dialect can be assumed, we decided to group the answers into macro-regional categories. The question on residence was introduced to retrieve information about commuting and to investigate any differences between the language of towns and that of small villages. As the secondary schools where the enquiries were held are located in towns or large villages, we decided to record only whether the informant lives in the town where the school is located or elsewhere. In such cases a relative loss of information is compensated by the opportunity to compare the results of different points of enquiry.
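The typification of birthplace and residence described above might look roughly like the following. The macro-regional categories actually used in BaDaLì are not listed in the text, so the mapping, the category labels and both function names are placeholders chosen for this sketch.

```python
# Placeholder mapping from declared birthplace to a macro-regional category.
# The real categories are not given in the text; these labels are assumptions.
MACRO_REGION = {
    "pisa": "local area",
    "livorno": "local area",
    "la spezia": "local area",
    "milano": "northern Italy",
    "napoli": "southern Italy",
}


def typify_birthplace(place: str) -> str:
    """Reduce a free-text birthplace to a macro-regional category."""
    return MACRO_REGION.get(place.strip().casefold(), "other")


def typify_residence(residence: str, school_town: str) -> str:
    """Keep only whether the informant lives where the school is located."""
    same = residence.strip().casefold() == school_town.strip().casefold()
    return "same town" if same else "elsewhere"


print(typify_birthplace(" Napoli "))
print(typify_residence("Pontedera", "Pisa"))
```

Collapsing the answers this way discards the individual place names but leaves two small, comparable sets of values across all points of enquiry.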
Questionnaires module
The questionnaires module collects data about questionnaires and questions. We took into account the possibility of inserting lexical entries produced by different questionnaires or by updated versions of our questionnaire. Therefore the term enquired about by each question (voce indagata table) is isolated. For example, in the question “How do you say for money?”, the word “money” is the enquired term; isolating the enquired term from the body of the question makes it easier to compare forms produced by different questionnaires enquiring about the same term.
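A small sketch can show why isolating the enquired term helps. It assumes three simplified tables (voce_indagata, domande, forme) with invented column names, two hypothetical questionnaires labelled Q-A and Q-B, and invented sample forms; SQLite stands in for MySQL.

```python
import sqlite3


def demo_db():
    """Build a tiny in-memory database with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE voce_indagata (id INTEGER PRIMARY KEY, voce TEXT UNIQUE);
        CREATE TABLE domande (id INTEGER PRIMARY KEY,
                              questionario TEXT,
                              testo TEXT,
                              voce_id INTEGER REFERENCES voce_indagata(id));
        CREATE TABLE forme (id INTEGER PRIMARY KEY,
                            forma TEXT,
                            domanda_id INTEGER REFERENCES domande(id));
        INSERT INTO voce_indagata VALUES (1, 'money');
        -- Two differently worded questions share the same enquired term.
        INSERT INTO domande VALUES
            (1, 'Q-A', 'How do you say for money?', 1),
            (2, 'Q-B', 'What do you call money?', 1);
        INSERT INTO forme (forma, domanda_id) VALUES
            ('grana', 1), ('soldi', 1), ('grana', 2);
    """)
    return conn


def forms_by_questionnaire(conn, voce):
    """Group the forms elicited for one enquired term by questionnaire."""
    rows = conn.execute(
        """
        SELECT d.questionario, f.forma
        FROM forme AS f
        JOIN domande AS d ON d.id = f.domanda_id
        JOIN voce_indagata AS v ON v.id = d.voce_id
        WHERE v.voce = ?
        ORDER BY d.questionario, f.forma
        """,
        (voce,),
    )
    grouped = {}
    for questionario, forma in rows:
        grouped.setdefault(questionario, []).append(forma)
    return grouped


print(forms_by_questionnaire(demo_db(), "money"))
```

Because the comparison goes through the enquired term rather than the question text, the two questionnaires can word the question differently and still be compared.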
Our database includes onomasiological questions (‘How do you say for…’) and thematic questions (‘Which words do you know about…’), so a classification of the different types of questions (tipi domanda table) was needed.
BaDaLì public interface
The database is currently freely available on the Web at the address
[FIG. 2] BaDaLì home page
The interface provided allows a number of possible queries:
1. starting from a lemma (Tipi lessicali), it is possible to retrieve all the forms traced to such a lemma; a further step allows grouping the results according to three parameters: place of the enquiry, sex, and kind of school.
2. starting from the enquired term (Voci Indagate), it is possible to retrieve all the forms produced under the stimulation of such a term. A further step allows
grouping the results according to the same three
parameters as in 1.
3. starting from a form (Forme), it is possible to retrieve forms grouped according to the three parameters as in 1.
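The first kind of query listed above, retrieving the forms traced to a lemma and grouping them by one of the three parameters, could be expressed along these lines. This is a sketch under assumptions, not the code behind the public interface: the column names (luogo, sesso, scuola), the function name and the sample rows are invented, and SQLite stands in for MySQL.

```python
import sqlite3

# Allowed grouping parameters mapped to (assumed) column names.
GROUPINGS = {"place": "p.luogo", "sex": "p.sesso", "school": "p.scuola"}


def forms_by_lemma(conn, lemma, group_by):
    """Count the forms traced to `lemma`, grouped by one parameter."""
    column = GROUPINGS[group_by]  # whitelist lookup, never raw user input
    sql = f"""
        SELECT {column} AS gruppo, f.forma, COUNT(*) AS n
        FROM forme AS f
        JOIN parlanti AS p ON p.id = f.parlante_id
        WHERE f.lemma = ?
        GROUP BY {column}, f.forma
        ORDER BY gruppo, n DESC, f.forma
    """
    return conn.execute(sql, (lemma,)).fetchall()


def demo_db():
    """Build a tiny in-memory database with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE parlanti (id INTEGER PRIMARY KEY,
                               luogo TEXT, sesso TEXT, scuola TEXT);
        CREATE TABLE forme (id INTEGER PRIMARY KEY, forma TEXT, lemma TEXT,
                            parlante_id INTEGER REFERENCES parlanti(id));
        INSERT INTO parlanti VALUES
            (1, 'Pisa', 'F', 'liceo'),
            (2, 'Pisa', 'M', 'istituto tecnico'),
            (3, 'Lucca', 'F', 'liceo');
        INSERT INTO forme (forma, lemma, parlante_id) VALUES
            ('grana', 'grana', 1),
            ('grana', 'grana', 2),
            ('la grana', 'grana', 3);
    """)
    return conn


conn = demo_db()
for row in forms_by_lemma(conn, "grana", "place"):
    print(row)
```

The other two starting points (enquired term and form) would differ only in the WHERE clause, so the same grouping step can be reused for all three.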
Modelling is certainly a crucial point in designing new projects, since the design determines from the very beginning which requests a tool will be able to satisfy. Against this background we present our experience, hoping to stimulate reflection on the topic.
Cortelazzo, M. (1994). Il parlato giovanile. In: Storia della lingua italiana, vol. 2, Scritto e parlato, ed. by Luca Serianni and Pietro Trifone, Torino, pp. 291-317.
De Mauro, T. (1999). Grande Dizionario Italiano dell’Uso. (Gradit). Torino.
Sperberg-McQueen, C.M. and Burnard, L. (2002). TEI P4 Guidelines for Electronic Text Encoding and Interchange: XML-compatible Edition. (P4).
Oxford, Providence, Charlottesville, & Bergen. Available also at


Conference Info



Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006


The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.


Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC

  • Keywords: None
  • Language: English
  • Topics: None