Istituto di Elaborazione della Informazione - IEI - CNR
Istituto di Linguistica Computazionale (ILC) (Institute for Computational Linguistics) - Consiglio Nazionale delle Ricerche (CNR)
1. Introduction
The aim of EuroWordNet, an EC-funded project in the Language Engineering programme (LE4003), is to construct a multilingual semantic database in which a number of monolingual wordnets for different European languages - in the first phase Dutch, (British) English, Spanish and Italian (see Note 1) - are linked through an Inter-Lingual-Index which is essentially a modified version of Princeton WordNet 1.5 [1]. For full details on the design, implementation and future development of EuroWordNet, see [2], [3]. In this communication, we describe the approach being adopted to map between Italian and English lexical data by creating links between equivalent items in the two languages. We discuss the methodology being employed, some of the problems that it has been necessary to address, and the way in which this cross-language mapping has affected the construction of the Italian monolingual database.
2. The Italian WordNet
The Italian WordNet is being constructed using existing tools, methodologies and lexical resources available at the "Istituto di Linguistica Computazionale" (ILC), Pisa, and developed in other projects; an underlying principle has been to respect and preserve language-specific features. Moreover, it was decided to extract our lexical entries from a number of different resources in order to obtain as objective a perspective on the data as possible. This integration of various archives has highlighted the differences and inconsistencies found in dictionary data; e.g. word senses, synonyms and genus terms can vary widely from source to source. We had four main sources:
The Italian Lexical Database - the subset used for Italian WordNet contains about 30,000 entries (5,500 verbs and 24,500 nouns) totalling about 60,000 word-senses;
An Electronic Synonym Dictionary;
An Italian/English Bilingual Lexical Database;
The Italian Reference Corpus.
A first decision of the EuroWordNet project was to agree on a common vocabulary subset consisting of the most basic word-meanings for each language (the main criterion being "those most frequently used as genus terms in dictionary definitions") [4], [5]. The intention was to ensure that the most important lexical/semantic areas were represented, the highest taxonomic levels of the lexicon were covered, and a high degree of compatibility between the separate wordnets was guaranteed. The selection of this initial vocabulary was followed by a phase of cross-language comparison and manual intervention which made it possible to build a common set of "base concepts" for all the languages.
In the initial stage, a core set of word-meanings, consisting of about 300 nouns and 100 verbs, was derived and analysed to form the first set of base concepts for Italian. This preliminary subset was neither homogeneous nor consistent; it strongly reflected the defects and inconsistencies of the lexicographic metalanguage on which it was based and could only be considered as a starting point for the construction of a coherent semantic network. Many concepts were missing and had to be introduced by manual interventions on the data; further integrations were made on the basis of: a) other sources (the Italian Reference Corpus for example); b) base concept senses chosen by the other partners which had not emerged from our preliminary analysis.
In EuroWordNet the central lexical/semantic relation is synonymy, perhaps better denoted as semantic similarity. In fact, the project adopted a weak definition of synonymy which entails the interchangeability of two words in a particular context. After the selection of the base concepts, we thus had to create the "synsets" for them, i.e. sets of one or more similar word senses grouped together to represent a given concept or word meaning. These synsets are the core elements of our semantic database. We will illustrate the semi-automatic procedure for synset creation in the presentation. However, it is important to note here the utility of the data extracted from the bilingual dictionary: both translation equivalent and semantic indicator data were helpful when deciding whether an item should be included in the synset under construction, at times revealing where a meaning shift occurs in a chain of automatically linked synonyms. Thus feedback from the bilingual data assisted us in tightening up our Italian synsets. The next stage was to map these word meanings to their equivalent entries in WordNet 1.5. Given its importance, this operation on the set of base concepts was carried out manually. As will be shown in the next section, the consistency and coherency of the taxonomies constructed downwards from the base concepts depend not only on the accuracy and precision of the base concept synset construction but also on this first phase of the cross-language mapping operation.
3. The Cross-Language Mapping Procedures
In EuroWordNet, all the language-specific wordnets will be stored in a central lexical database system. Equivalence relations between synsets in different languages will be made explicit through the Inter-Lingual-Index (ILI). This is an unstructured version of WN1.5 in which original senses will be modified and new senses added when necessary. Each synset in the monolingual wordnets will have at least one equivalence relation with an ILI record, which will enable cross-language mapping and comparison; the relations include Has_equivalent_synonym, Has_equivalent_near_synonym, Has_equivalent_hyperonym and Has_equivalent_hyponym. Superimposed on the ILI is a language-independent Top Ontology and a set of domain labels. We thus had to study and implement procedures which would help us to map our base concepts to the ILI via WordNet 1.5.
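As a rough illustration of how such equivalence relations might be recorded, the following minimal sketch stores each link as a (synset, relation, ILI record) triple. The class and field names are hypothetical and are not taken from the EuroWordNet database; the sample links use word pairs mentioned elsewhere in this paper, and the relation assigned to "sentimento" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class EqRelation(Enum):
    """The four equivalence relations named in the text."""
    SYNONYM = "Has_equivalent_synonym"
    NEAR_SYNONYM = "Has_equivalent_near_synonym"
    HYPERONYM = "Has_equivalent_hyperonym"
    HYPONYM = "Has_equivalent_hyponym"


@dataclass(frozen=True)
class IliLink:
    """Hypothetical record linking one Italian synset to one ILI record."""
    synset: Tuple[str, ...]   # Italian word forms in the synset
    relation: EqRelation
    ili_record: str           # label of the WN1.5-derived ILI record


links = [
    # Relation type assumed here; the paper only says the pair is mapped.
    IliLink(("sentimento",), EqRelation.SYNONYM, "feeling 1"),
    IliLink(("cassata",), EqRelation.HYPERONYM, "ice cream"),
    IliLink(("avvocatessa",), EqRelation.NEAR_SYNONYM, "lawyer"),
]

# Every synset must have at least one equivalence relation.
linked_synsets = {link.synset for link in links}
```

Keeping the relation type explicit on every link makes it possible to distinguish exact matches from the weaker hyperonym and near-synonym links when wordnets are compared.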
It is not possible to establish equivalences between lexical items in different languages without some indication of the meaning of the item to be mapped and of the candidate target items. Such information can be provided by definitions (when working with dictionaries), or by surrounding context (when working with written or spoken discourse). In our case, we use the lexical/semantic taxonomies that we have constructed for the Italian WordNet data and map them against equivalent taxonomies in WordNet 1.5; it is the semantic context provided by the taxonomy that allows us to recognise the right sense in the target language of the word we are examining. Thus, although the ILI itself will be unstructured, we exploit the structure of WN1.5 in order to make the right connections.
For nouns (see Note 2), our mapping procedure operates in the following way: starting with our set of base concepts, we first identify the main taxonomic groups, e.g. activity, substance, communication, place, food, instrument, condition, life form, etc. (cf. the semantic domains for nouns of WordNet [6]). For each of these main groups, the relevant base concepts had already been structured hierarchically. For instance, for the life form taxonomy, the first level in the hierarchy is represented by Italian synsets for person, animal, plant; at the next level, under plant, we find synsets for tree, bush, grass, fungus, flower, vegetable, and so on.
As stated in the previous section, these base concepts had been mapped manually to our ILI through WN1.5, and thus we already have a set of accurate links between the Italian WordNet and WN1.5 which can be used by our procedure for automatic mapping. Working top-down, taxonomy by taxonomy, we take all the first level hyponyms for each Italian base concept and input them to our bilingual lexical database system; e.g. under the Italian synset "sentimento", mapped to WN1.5 "feeling 1", we have Italian synsets equivalent to desire, emotion, happiness, mood, and so on. For each word, all possible translations are read; we then search in the equivalent semantic hierarchy in WN1.5 - using the base concept links - in order to find a matching form; the assumption is that matching word-forms in equivalent semantic hierarchies will refer to equivalent senses. For example, in the taxonomy for Insects, the bilingual LDB assigns three possible translations to the Italian form "mosca": fly, goatee (beard), Moscow. In the WN1.5 taxonomy for {animal, animate being, beast, brute, creature, fauna} under {insect}, only one of these translation candidates was found (fly, of course) and a link was thus created. Clearly, however, this is an ideal case; mapping is not always so straightforward, and a number of problems have to be addressed. Below, we mention some of the most common:
There is no entry in the bilingual LDB for the word searched. When the Italian entry is not listed in the bilingual dictionary, the procedure attempts to find an equivalent entry directly in the relevant WN1.5 taxonomy, as in cases such as "zorilla" and "dugongo" linked, respectively, to "zoril" and "dugong" in the "mammal" taxonomy. If no equivalent is found, then a Has_equivalent_hyperonym relation is created automatically. For instance, the well-known Sicilian ice cream dish "cassata" is linked to its cross-language equivalent hyperonym "ice cream" in WN1.5 as no English translation could be found. Cases of this type are quite frequent, mainly for two reasons: many culture-specific words have no direct equivalent; our Italian WordNet data is richer in terminology than the bilingual lexical database.
There is no direct equivalence between the Italian and the WN1.5 taxonomies being compared. For this reason, the WN1.5 taxonomy is scanned top-down in two stages; the first pass starts from the equivalent link and goes downwards; if no link is found, in the second pass, the procedure scans from the top of the entire taxonomy. For instance, in the case of the taxonomy found below the Italian synset for "abitante, nativo" linked to WN1.5 "inhabitant" we have a list of nationalities, e.g. American, Bulgarian, Chinese, etc. However, if we search for these entries in WN1.5 under "inhabitant", we only find American; Bulgarian is found under "European", and Chinese is found under "Asian, Asiatic", although all three entries are glossed similarly as "native or inhabitant of ...". Thus the taxonomies do not match completely and a complete mapping is only possible in the second pass when we scan from the WN1.5 entry for {person, individual, someone, mortal, human} downward. Conversely, at times it is the position of an entry in the hierarchy that determines the right mapping. For example, the two Italian entries "tragediografo" (author of tragedies) and "tragico" (player of tragic roles) are both translated by "tragedian" and both senses of "tragedian" are found in the WN1.5 "person" taxonomy. The right link is made between "tragediografo" and "tragedian 1" precisely because in the first pass we searched downwards from the immediate hyperonym: "writer, author".
The equivalent WN1.5 entry is in a different taxonomy. If the procedure finds no entry for the translation candidate in the equivalent taxonomy, the whole of WN1.5 is searched; matching entries are output so that they can be evaluated in a later phase. The results may determine a restructuring of our Italian taxonomy. For example "tarlo" (woodworm) is in the Insect taxonomy in the Italian database as it was classified as such by our sources, but is found as an Invertebrate in WN1.5; we must decide if it is correct for Italian to keep "tarlo" as an insect or not.
There is no entry in WN1.5 for the translation candidate(s). For example, our taxonomy for Birds includes "fenice" translated by "phoenix" but there is no such entry in any taxonomy in WN1.5. This means that we will have to include this concept in our ILI. This example also reveals the kind of errors that emerge from the automatic creation of taxonomic chains from source dictionary data. The definition of "fenice" in our source began "uccello favoloso di Arabia ..."; the procedure that extracted the genus had recognised "uccello" (bird) as the semantic head of the definition. However, in this case the adjective "favoloso" (fabulous) indicates that we should reconsider its position in the Italian Bird taxonomy, as it negates the condition imposed at the top of this semantic hierarchy, {essere vivente} (living being); the entry should probably be relocated under the Italian equivalent of WN1.5 {mythical being}.
Differences in lexicalisation. Different languages do not lexicalise concepts in the same way. Thus, for example, no WN1.5 equivalent can be found for words which cannot be translated by a single lexical item but only by a phrase (e.g. a "guappo" is an arrogant man). In this case, we create a "Has_equivalent_hyperonym" link between "guappo" and WN1.5 "man"; the information that this kind of man is "arrogant" is thus lost. Similarly, Italian marks gender far more than English. Thus, for example, in our taxonomy for "donna" (woman), we have many entries such as "ladra" (woman thief) or "avvocatessa" (female lawyer). Our decision in these cases is to map them with a "Has_equivalent_near_synonym" relation to the WN1.5 gender-neutral equivalents, thief and lawyer, again losing information. We would like to find a way to pass this information across languages, perhaps by encoding additional features in our entries. Another major difference is that Italian has far fewer multiwords or compounds than English; thus an English entry such as "toenail" will not be matched to an Italian equivalent, as the concept is rendered in Italian by a phrase, "unghia del piede", which is not a lexical entry in our database.
Over- and under-differentiation of senses. WN1.5 tends towards a very fine-grained sense differentiation. For example, consider the following fragment from the food taxonomy:
choice morsel, tidbit, titbit
   => dainty, delicacy, goody, kickshaw, treat -- (something considered choice to eat)
       => nutriment, nourishment, sustenance, aliment, victuals -- (a source of nourishment)
           => food, nutrient -- (any substance that can be metabolized by an organism to give energy and build tissue)
It is easy to imagine that very similar word senses in, for example, Italian and Dutch are mapped differently, to choice morsel and dainty respectively. In such cases, it will not be possible to establish a direct relationship between the equivalent senses in the two languages. Situations of this kind will be resolved in the ILI by grouping similar senses in order to permit cross-language matching.
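The core of the noun-mapping procedure described in this section, including the two-pass scan and the fall-back to a hyperonym link, can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the dictionaries are toy stand-ins for WN1.5 and the bilingual LDB, the function names are hypothetical, matching is done on word forms only (sense numbers are ignored), and the step that looks up an untranslated Italian form directly in WN1.5 is omitted.

```python
from typing import Dict, List, Tuple

# Toy stand-in for WN1.5 hyponymy: node -> direct hyponyms.
WN_HYPONYMS: Dict[str, List[str]] = {
    "animal": ["insect"],
    "insect": ["fly"],
    "person": ["inhabitant", "European"],
    "inhabitant": ["American"],
    "European": ["Bulgarian"],
    "food": ["ice cream"],
}

# Toy stand-in for the bilingual lexical database.
BILINGUAL: Dict[str, List[str]] = {
    "mosca": ["fly", "goatee", "Moscow"],
    "bulgaro": ["Bulgarian"],
}


def descendants(root: str) -> List[str]:
    """Return all nodes strictly below `root`, breadth-first."""
    found: List[str] = []
    queue = list(WN_HYPONYMS.get(root, []))
    while queue:
        node = queue.pop(0)
        found.append(node)
        queue.extend(WN_HYPONYMS.get(node, []))
    return found


def map_word(italian: str, anchor: str, top: str) -> Tuple[str, str]:
    """Link an Italian word whose hyperonym is already linked to `anchor`."""
    candidates = BILINGUAL.get(italian, [])
    # First pass: search below the node linked to the Italian hyperonym.
    # Second pass: search from the top of the whole taxonomy.
    for scope in (anchor, top):
        below = descendants(scope)
        for candidate in candidates:
            if candidate in below:
                return ("Has_equivalent_synonym", candidate)
    # No match in either pass: link to the equivalent of the hyperonym.
    return ("Has_equivalent_hyperonym", anchor)


fly_link = map_word("mosca", "insect", "animal")
bulgarian_link = map_word("bulgaro", "inhabitant", "person")
cassata_link = map_word("cassata", "ice cream", "food")
```

In this toy run "mosca" is resolved in the first pass, "bulgaro" only in the second pass, and "cassata", which has no translation, receives a hyperonym link. Searching the narrower scope first is what lets the position in the hierarchy select between competing senses, as in the "tragedian" example.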
4. Results
The mapping procedure is run automatically and repeated recursively; when the first level hyponyms of an Italian taxonomy have been linked to the EWN Inter-Lingual-Index, the next level is operated on. Although the results must be checked and integrated manually, the benefits obtained by applying the mapping procedure are considerable. It would clearly not be feasible to attempt to link this quantity of data to the ILI manually: a procedure of this type is not only much faster but can treat the data more exhaustively and with more overall accuracy. Problematic cases are clearly evidenced. In addition, this procedure provides important data for the verification and improvement of the Italian WordNet through a comparison of our semantic hierarchies with corresponding taxonomies in WN1.5.
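The level-by-level repetition described above can be pictured as a breadth-first walk over an Italian taxonomy, in which each newly created link becomes the starting point for the next level. The sketch below is only an illustration: the function names, the matcher passed in as `link_one`, and the small Italian fragment are hypothetical and are not drawn from the project data.

```python
from collections import deque
from typing import Callable, Dict, List


def map_level_by_level(
    root: str,
    root_link: str,
    hyponyms: Dict[str, List[str]],
    link_one: Callable[[str, str], str],
) -> Dict[str, str]:
    """Link every node below `root`, one taxonomy level at a time."""
    links = {root: root_link}  # the base concept link is made manually
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in hyponyms.get(parent, []):
            # The parent's link tells the matcher where to start searching.
            links[child] = link_one(child, links[parent])
            queue.append(child)
    return links


# Illustrative fragment of a plant taxonomy (not project data).
ITALIAN_HYPONYMS = {"pianta": ["albero", "fiore"], "albero": ["quercia"]}
TOY_TRANSLATIONS = {"albero": "tree", "fiore": "flower", "quercia": "oak"}


def toy_link(word: str, parent_link: str) -> str:
    """Hypothetical matcher: reuse the parent's link if nothing is found."""
    return TOY_TRANSLATIONS.get(word, parent_link)


result = map_level_by_level("pianta", "plant", ITALIAN_HYPONYMS, toy_link)
```

Because a level is only processed once its parent has been linked, an error in a high-level link propagates downwards, which is one reason the base concept links were made manually.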
In fact, one major problem when building the Italian WordNet was the tendency of our taxonomies - derived from automatic analyses of dictionary definitions - to be too shallow. In many cases, it is necessary to restructure them, adding intermediate levels for large sets of hyponyms in which many very specific terms were directly linked to hyperonyms that were too high and too generic. A rapid comparison of our results with similar data in WN1.5 evidences obvious gaps in our lexical entries and also shows where some necessary structure can be added to our too-shallow hierarchies, e.g. following the example of WN1.5 but still respecting our data and on the basis of information provided in our definitions, we have subdivided our taxonomy for "mammals" into ruminants, rodents, aquatic mammals, etc. Similarly, in the "instruments" taxonomy we introduced multiwords, which do not appear as lexical entries in the Italian monolingual LDB, but are lexicalised expressions (in keeping with the decision to build EuroWordNet as a lexical rather than a conceptual net) such as "strumenti musicali" (musical instruments), "strumenti di misura" (measuring instruments), "strumenti di bordo" (navigation instruments), to create a new level in the taxonomy and, at the same time, to identify more homogeneous lexical subsets. Another important benefit of the mapping procedure is that it allows us to group similar Italian word-senses automatically as the mapping procedure moves downwards through our taxonomies. When two or more Italian entries have been linked to the same WN1.5 word sense, an Italian synset is created with no need for manual intervention.
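The automatic grouping of Italian entries that share a WN1.5 target might look like the following minimal sketch. The function name is hypothetical, and the sample links, including the sense numbers, are illustrative assumptions based on word pairs mentioned in Section 3.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_into_synsets(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group Italian entries linked to the same WN1.5 word sense.

    `links` holds (italian_entry, wn_sense) pairs; only targets shared by
    two or more Italian entries yield a synset.
    """
    by_target: Dict[str, List[str]] = defaultdict(list)
    for italian, wn_sense in links:
        by_target[wn_sense].append(italian)
    return {
        sense: sorted(words)
        for sense, words in by_target.items()
        if len(words) >= 2
    }


sample_links = [
    ("nativo", "inhabitant 1"),
    ("abitante", "inhabitant 1"),
    ("mosca", "fly 1"),
]
synsets = group_into_synsets(sample_links)
```

Here "abitante" and "nativo" are grouped because both point to the same target, while "mosca" stays on its own. In practice the output would still need the manual checking mentioned above, since a wrong link would also produce a wrong grouping.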
The cross-language mapping procedure is thus proving effective not only in linking our Italian WordNet to the EWN Inter-Lingual-Index but also by providing valuable feedback, evidencing interesting lexical gaps and helping us to improve the coherency and consistency of our monolingual database. These points and related issues will be discussed in the presentation.
NOTES
1. The initial project partners are: University of Amsterdam (coordinator), Fundacion Universidad Empresa (a cooperation of UNED Madrid, Politecnica de Catalunya, Barcelona, and University of Barcelona), University of Sheffield, Istituto di Linguistica Computazionale, CNR, Pisa and Novell Linguistic Development (Antwerp). In a second stage, the EWN database will be extended with Czech, Estonian, French and German.
2. Cross-language mapping for verbs is rather more complex as the word class is organised according to other kinds of relations, such as causative/inchoative, entailment, or cause, rather than hyperonym/hyponym chains.
References
1. Miller G.A., "Nouns in WordNet: a Lexical Inheritance System", in: International Journal of Lexicography, Vol 3, No.4 (1990), pp. 245-263.
2. EWN96-98: EuroWordNet documentation and papers, Http://www.let.uva.nl/~ewn/docs.htm.
3. Computers and the Humanities (forthcoming). Special Number on EuroWordNet.
4. Alonge A. (1996) "Definition of the links and subsets for verbs", EuroWordNet Project LE4003, Deliverable D006. Http://www.let.uva.nl/~ewn.
5. Climent S., Rodriguez H., Gonzalo J. (1996) "Definition of the links and subsets for nouns of the EuroWordNet project", EuroWordNet Project LE4003, Deliverable D005. Http://www.let.uva.nl/~ewn.
6. Miller G.A., Beckwith R., Fellbaum C., Gross D., and Miller K.J. (1990) "Introduction to WordNet: An On-line Lexical Database", in: International Journal of Lexicography, Vol 3, No.4 (1990), pp. 235-244.
Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)
Debrecen, Hungary
July 5, 1998 - July 10, 1998
Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/
References: http://web.archive.org/web/19990225164509/http://lingua.arts.klte.hu/allcach98/abst/jegyzek.htm
Attendance: ~60 (https://web.archive.org/web/19990128030244/http://lingua.arts.klte.hu/allcach98/listpar3.htm)