Towards standards for lexicons and the linguistic annotation of texts.

  1. Nicoletta Calzolari

    Istituto di Linguistica Computazionale (ILC) (Institute for Computational Linguistics) - Consiglio Nazionale delle Ricerche (CNR)

  2. Antonio Zampolli

    Laboratorio di Linguistica Computazionale

  3. Ulrich Heid

    IMS-CL - Universität Stuttgart

Keywords: linguistic annotation of texts, standardization, guidelines

As more and more machine-readable text becomes available, the linguistic annotation of this material is steadily growing in importance. This is true not only in Natural Language Processing (NLP) and Language Engineering, but also in Humanities Computing: for example, linguistically informed free-text search and text retrieval, especially for texts written in morphologically richer languages, is more precise (less noise) than search in texts that have not been linguistically pre-analyzed. Linguistic annotation includes:

- the identification and tagging of word, sentence and paragraph boundaries;
- the identification and tagging of the category (POS, word class) of word forms in running text;
- the identification and tagging of morphological features (tense, number, person, etc.);
- the identification and tagging of syntactic properties of predicates (syntactic subcategorization);
- and many more.

Many Linguistic Engineering companies and corpus projects have designed their own proprietary annotation schemes; broadly available common schemes would have a number of advantages (easy availability, documentation, exchangeability, etc.). The workshop will discuss the need for standards for the above levels of linguistic description.
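
To make the annotation levels listed above more concrete, the following minimal Python sketch shows one way a single sentence could carry all four layers at once. The class names, the tag labels (`PRON`, `VERB`, `NOUN`), the feature names and the helper `forms_with` are illustrative assumptions; they are not EAGLES tags or part of any published scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                                      # word form as it appears in the text
    pos: str                                       # word class (hypothetical label)
    features: dict = field(default_factory=dict)   # morphological features
    subcat: list = field(default_factory=list)     # subcategorization frame, predicates only


@dataclass
class Sentence:
    tokens: list                                   # token boundaries are implicit in the list


# One annotated sentence: "She gives him books"
sentence = Sentence(tokens=[
    Token("She", "PRON", {"person": "3", "number": "sg"}),
    Token("gives", "VERB", {"tense": "pres", "person": "3", "number": "sg"},
          ["subj", "obj", "iobj"]),
    Token("him", "PRON", {"person": "3", "number": "sg"}),
    Token("books", "NOUN", {"number": "pl"}),
])


def forms_with(sentence, pos, **features):
    """Return the word forms that match a word class and all given features."""
    return [
        t.form
        for t in sentence.tokens
        if t.pos == pos and all(t.features.get(k) == v for k, v in features.items())
    ]


print(forms_with(sentence, "VERB", tense="pres"))
print(forms_with(sentence, "NOUN", number="pl"))
```

A query such as `forms_with(sentence, "VERB", tense="pres")` is the kind of linguistically informed search that plain string matching cannot express; with a shared scheme, the same query would work across corpora from different sources.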

For the types of annotation listed above, the EAGLES project has attempted to prepare annotation schemes and operational tagging guidelines, to encode these as formal (or formally representable) specifications, and to validate them in a number of application experiments. EAGLES (Expert Advisory Groups on Linguistic Engineering Standards) is an expert group with contributors from industry and academia across the EU; it aims to design consensual standards for key areas of Linguistic Engineering.

Workshop objectives
The workshop aims to present and discuss recent and ongoing work towards standards for the linguistic classification and annotation of word forms in texts and lexicons; its second main objective is to gather feedback from the Humanities Computing community on this standardization work.

Specific objectives include the following:
- Identify and discuss the need for, and the problems related to, standards in the field of linguistic resources (in particular lexicons and corpora);

- Discuss the interaction between lexicon and corpus: a common underlying classification of linguistic material, at the levels indicated above, opens up interesting new possibilities for 'compound' resources, such as dynamic links from the lexicon to the corpus, corpus-based lexicon validation, new possibilities for linguistic acquisition, etc.;

- Describe the EAGLES approach to the definition of standards proposals, the representations used, and the mechanisms available for validation, consistency checking, etc.;

- Describe the existing proposals for syntactic (and possibly semantic) annotation in texts and lexicons, based on efforts in EAGLES and in the COMLEX project at NYU;

- Discuss the EAGLES proposals from the point of view of 'users': if a lexicon design project or a corpus analysis project is set up, does the use of annotation standards contribute to the efficiency of the project?
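
The lexicon-corpus interaction mentioned in the objectives above can be sketched as follows, assuming that lexicon and corpus share one word-class inventory. The toy lexicon, the toy corpus, the tag labels and the function `link_and_validate` are all hypothetical and serve only to illustrate the idea of dynamic links and corpus-based validation.

```python
from collections import defaultdict

# Toy lexicon: lemma -> word classes the lexicon records for it (hypothetical data)
lexicon = {
    "book": {"NOUN", "VERB"},
    "give": {"VERB"},
    "light": {"NOUN", "ADJ"},
}

# Toy annotated corpus: (lemma, word class) per token, using the same labels
corpus = [
    ("book", "NOUN"),
    ("give", "VERB"),
    ("light", "VERB"),
    ("book", "VERB"),
    ("light", "ADJ"),
]


def link_and_validate(lexicon, corpus):
    """Link lexicon readings to corpus positions and collect uncovered usages."""
    links = defaultdict(list)   # (lemma, pos) -> corpus positions attesting it
    missing = []                # corpus usages the lexicon does not record
    for position, (lemma, pos) in enumerate(corpus):
        if pos in lexicon.get(lemma, set()):
            links[(lemma, pos)].append(position)
        else:
            missing.append((position, lemma, pos))
    return dict(links), missing


links, missing = link_and_validate(lexicon, corpus)
print(links)
print(missing)
```

Because both resources use the same classification, the comparison is a direct lookup: `links` gives each lexicon reading its corpus attestations, and `missing` lists usages (here the verbal use of "light") that are candidates for lexicon revision or for checking the corpus annotation.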

Confirmed workshop participants and their topics
The following participants have agreed to contribute:
Antonio ZAMPOLLI (Pisa): Linguistic Engineering Standards -- the domain of linguistic resources

Nicoletta CALZOLARI (Pisa): Standards for lexicons and corpora -- Areas, interaction between lexicon and corpus, current state of EAGLES

Ulrich HEID (Stuttgart): From specifications to tagsets and coding guidelines: EAGLES morphosyntax annotations in lexicons and texts

Antonio SANFILIPPO (Oxford): Standardizing word knowledge for NLP lexicons

Ralph GRISHMAN/Catherine McLEOD (New York): The Comlex Syntax Lexicon and the Eagles Subcategorization Standard

