Renovating a worldclass tagset: from WOTAN to WOTAN-2

poster / demo / art installation
Authorship
  1. Hans van Halteren

    Department of Language and Speech - University of Nijmegen

Work text

Hans van Halteren
Dept. of Language and Speech, University of Nijmegen
hvh@let.kun.nl

ACH/ALLC 1999
University of Virginia, Charlottesville, VA

editor

encoder

Sara A. Schmidt

In 1994, a new wordclass tagset for Dutch was designed (WOTAN; Berghmans, 1994)
for use in the upgrade of a tagged corpus of more than a million words
(including the Eindhoven corpus; uit den Boogaart, 1975) and in the subsequent
derivation of an automatic tagger. WOTAN was based on the most popular
descriptive grammar of Dutch (ANS; Geerts et al., 1984), from which the encoded
distinctions were selected using two criteria: a) importance to potential users,
as estimated from interviews, and b) feasibility of (semi-)automatic derivation
from the existing tagging, given the lack of time for extensive manual changes.
WOTAN was judged to be a good compromise and has since been used in several
tagging projects and experiments in the Netherlands and Belgium.
Yet WOTAN had its shortcomings, which led to the creation of a successor. WOTAN-2
adds some important distinctions originally left out because they needed manual
intervention, and aims for compatibility with the EAGLES guidelines, the
(extensively) revised version of the ANS (Haeseryn et al., 1997), the CELEX
database and the AMAZON syntactic parser. Another, more uncertain, influence is
the tagset to be used for the Spoken Dutch Corpus, which is presently under
construction.
The poster will present:
  • the differences between WOTAN and WOTAN-2
  • the influence of the (sometimes contradictory) compatibility issues on the tagset
  • additions to (or deviations from) the EAGLES proposal necessitated by decisions for WOTAN-2
  • the upgrade of the WOTAN-tagged Eindhoven corpus to a WOTAN-2 version

References

Berghmans, J. (1994). WOTAN, een automatische grammatikale tagger voor het Nederlands. Dept. of Language and Speech, University of Nijmegen.

Uit den Boogaart (1975). Woordfrequenties in geschreven en gesproken Nederlands. Utrecht: Oosthoek, Scheltema & Holkema.

Geerts, G., Haeseryn, W., de Rooij, J., van der Toorn, M. (1984). Algemene Nederlandse Spraakkunst (ANS). Groningen: Wolters-Noordhoff; Leuven: Wolters.

Haeseryn, W., Romijn, K., Geerts, G., de Rooij, J., van der Toorn, M. (1997). Algemene Nederlandse Spraakkunst (ANS). Groningen: Martinus Nijhoff; Deurne: Wolters Plantyn.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1999

Hosted at University of Virginia

Charlottesville, Virginia, United States

June 9, 1999 - June 13, 1999

Organizers: ACH, ALLC
