From specifications to tagsets and coding guidelines: EAGLES morphosyntax annotations in lexicons and texts

Ulrich Heid

IMS-CL - Universität Stuttgart

EAGLES first established a core set of commonly agreed annotations for morphosyntax, essentially by collecting, comparing and filtering existing annotation proposals from lexicons and tagsets. Once this synthesis is available, how can it be put to use in both lexicon building and text annotation? Our contribution addresses this question.

We want the EAGLES morphosyntax annotation to be applicable in different usage contexts, in particular both in lexicons for NLP and in text corpora. Moreover, unlike in many tagsets for corpus annotation, the classifications used must be strict: the classes form a hierarchy, and any item to be described has to fall into one branch of the hierarchy. Alongside this structure, however, there is a need for support for manual tagging: how can we make sure that different people will classify the same facts in the same way?

In EAGLES, we have tried to come close to solutions for some of the requirements stated above. We have defined a typed class hierarchy to specify the classifications underlying our language-specific morphosyntactic coding systems. These hierarchies can be mapped onto lexicon codes as well as onto corpus tagsets; the latter mapping is even automatic. For a subset of languages, we have written guidelines for manual annotation, which contain discussions of borderline cases, tests and a large collection of examples to support manual coders. We have applied the tagset and guidelines to the manual coding of 60,000-word reference corpora for German and Italian.
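
To illustrate what a strict class hierarchy with an automatically derived tagset might look like, here is a minimal sketch in Python. It is not the EAGLES formalism or tagset: the class names, the tag fragments (`N`, `C`, `P`, `V`, `F`, `I`) and the helpers `Node`, `leaf_paths`, `derive_tagset` and `tag_for` are all invented for illustration. The sketch assumes that a corpus tag can be obtained by concatenating one code fragment per class along the path from the root to a leaf.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One class in a (hypothetical) morphosyntactic hierarchy."""
    name: str
    code: str  # invented tag fragment, not an actual EAGLES code
    children: list["Node"] = field(default_factory=list)


def leaf_paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


def derive_tagset(root):
    """Derive a tagset: one tag per leaf, built from the codes on its path."""
    tagset = {}
    for path in leaf_paths(root):
        label = ".".join(n.name for n in path[1:])  # skip the root name
        tagset[label] = "".join(n.code for n in path)
    return tagset


def tag_for(root, labels):
    """Return the tag for a description, enforcing strict classification.

    The description must follow exactly one branch down to a leaf class.
    """
    node, codes = root, [root.code]
    for label in labels:
        matches = [c for c in node.children if c.name == label]
        if len(matches) != 1:
            raise ValueError(f"{label!r} is not a unique subclass of {node.name!r}")
        node = matches[0]
        codes.append(node.code)
    if node.children:
        raise ValueError(f"{node.name!r} is not a leaf class; description is underspecified")
    return "".join(codes)


# Toy hierarchy, invented for this example.
HIERARCHY = Node("word", "", [
    Node("noun", "N", [Node("common", "C"), Node("proper", "P")]),
    Node("verb", "V", [Node("finite", "F"), Node("infinitive", "I")]),
])

if __name__ == "__main__":
    print(derive_tagset(HIERARCHY))
    print(tag_for(HIERARCHY, ["noun", "proper"]))
```

In this sketch the tagset is never written by hand; it falls out of the hierarchy, so a change to the classification propagates to the tags. A description that stops at an inner class such as `verb`, or that names a class outside the current branch, is rejected rather than silently assigned a tag.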

The contribution will summarize the experience gained in this exercise and point to the resources and tools it produced.

