The Nomen Nescio Project - Scandinavian Named Entity Recognition

paper
Authorship
  1. Janne Bondi Johannessen

    University of Oslo

  2. Eckhard Bick

    University of Southern Denmark

  3. Kristin Hagen

    University of Oslo

  4. Dorte Haltrup

    Centre for Language Technology (CST)

  5. Åsne Haaland

    University of Oslo

  6. Andra Björk Jónsdottir

    University of Oslo

Work text

Authors (note: more than 6, the maximum number in the interface):
Janne Bondi Johannessen, University of Oslo, jannebj@mail.hf.uio.no
Eckhard Bick, University of Southern Denmark, lineb@hum.au.dk
Kristin Hagen, University of Oslo, kristin.hagen@ilf.uio.no
Dorte Haltrup, Centre for Language Technology, Denmark, dorte@cst.dk
Åsne Haaland, University of Oslo, a.t.haaland@ilf.uio.no
Andra Björk Jónsdottir, University of Oslo, a.b.jonsdottir@ilf.uio.no
Dimitrios Kokkinakis, University of Gothenburg, dimitrios.kokkinakis@svenska.gu.se
Paul Meurer, University of Bergen, paul.meurer@hit.uib.no
Anders Nøklestad, University of Oslo, anders.noklestad@ilf.uio.no

1. Background

This paper presents results from the Nomen Nescio project (NN), a joint project on Named Entity Recognition (NER) for three Scandinavian languages: Norwegian, Swedish and Danish. The project includes research groups at the Universities of Oslo and Bergen (Norway), Gothenburg University (Sweden), and the Centre for Language Technology and the University of Southern Denmark (Denmark). It has run for three years, starting in 2001, and is funded by the Nordic Council of Ministers (the Language Technology Program, administered by NorFA). Before Nomen Nescio started, only a couple of small NER projects had appeared in Scandinavia, and only for Swedish (Kokkinakis 2001, Dalianis & Åström 2001). NN initiated joint activities in this important supporting technology, and a variety of people (students, engineers, researchers and professors) have been actively involved from the initial stages of the project.

2. Methods

The aim of the Nomen Nescio project has been to develop NE recognisers for all three Mainland Scandinavian languages: Norwegian, Swedish and Danish. Many methods for NER have been applied to various languages over the years, and given the diverse backgrounds of the participants at the different NN sites, many of them have been used in the project (rule-based methods, shallow parsing, statistical methods). While each of these methods has well-known advantages and disadvantages, what has proved really challenging is the lack of resources for the different languages. For example, training a statistical tagger requires an annotated training corpus to be available in advance. In addition, it turned out that lexicons containing semantic information would be necessary, and these too were hard to come by.

However, none of the sites has used one method only. They have all developed hybrid methods, using gazetteers in addition to the main method, and often also other kinds of pattern matching and various means of pre- and post-processing, inspired by the work of Mikheev et al. (1999). The diversity of methods has made it very difficult to calculate and compare the degree of success across methods, languages and sites. Even where the same methods have been employed, the results are difficult to compare, because the basic resources, such as lexicons, differ. Even so, evaluation results will be given at the presentation.

3. Name Categories

The well-known named entity sets used in the limited target applications of the “Message Understanding Conference” (MUC) exercises recognized no more than seven types of named entities (Grishman and Sundheim, 1996). In the “Information Retrieval and Extraction Exercise” project (IREX) (Sekine and Isahara, 2000) and in the Concerto project (Black et al., 2000), another kind of named entity, ‘artifact’, was added. In the “Automatic Content Extraction” program (ACE, [EDT, 2000]), two further entities, ‘geo-political entity’ and ‘facility’, were added to pursue the generalization of the technology.

The main idea behind the NN project’s choice of categories was the assumption that our NER tools could be used for search on the World Wide Web, and that it would be useful to separate names that are often ambiguous, such as song, book or film titles, from actual geographical or person names (“Paris, Texas”: place or film? “Angie”: person or song? “Harry Potter”: person or book?). Moreover, in general language, as found in newspaper texts, many more kinds of “names” or “named entities” are likely to be encountered. In NN we have therefore elaborated a finer taxonomy for names, with six main categories (person, organization, location, artifact/object, work/art and event) and a large number of subtypes.
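As an illustration only, the six main categories and the ambiguous examples above could be represented as follows. This is a minimal sketch, not the project's actual data model: the names `NameCategory`, `AMBIGUOUS_NAMES`, `candidate_categories` and `is_ambiguous` are hypothetical, and the subtypes are omitted because they are not listed here.

```python
from enum import Enum


class NameCategory(Enum):
    """The six main name categories listed in the text (subtypes omitted)."""
    PERSON = "person"
    ORGANIZATION = "organization"
    LOCATION = "location"
    OBJECT = "artifact/object"
    WORK = "work/art"
    EVENT = "event"


# Candidate readings for the ambiguous names cited in the text.
# The table itself is a hypothetical illustration, not a project resource.
AMBIGUOUS_NAMES = {
    "Paris, Texas": {NameCategory.LOCATION, NameCategory.WORK},
    "Angie": {NameCategory.PERSON, NameCategory.WORK},
    "Harry Potter": {NameCategory.PERSON, NameCategory.WORK},
}


def candidate_categories(name):
    """Return the set of categories a name may belong to (empty if unknown)."""
    return AMBIGUOUS_NAMES.get(name, set())


def is_ambiguous(name):
    """A name is ambiguous when more than one category is possible."""
    return len(candidate_categories(name)) > 1
```

A recogniser built on such a scheme would still need context to choose among the candidate categories; the sketch only records that a choice exists.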

4. Challenges for the Scandinavian languages

Some of the Scandinavian languages have spelling conventions and linguistic features that present serious challenges, both for recognising something as a name in the first place and for separating one name from another. One challenging spelling convention concerns names of organisations. Names of public institutions in Norway and Sweden have at most one initial capital letter, no matter how many words the name consists of (Den norske kirke ‘The Norwegian Church’, Svenska kyrkan ‘The Swedish Church’). This causes problems in recognising such names at the beginning of sentences, and also in determining which words belong to a name at all. The methods we have used to solve this problem are gazetteers, regular expressions and the document centred approach (Mikheev 2000). Moreover, the Scandinavian languages are Verb Second languages, which means that if some constituent other than the subject fills the first position in a sentence, then the subject is demoted to the position after the verb, next to the object or indirect object (Idag sendte Anne Larsen et brev ‘Today Anne sent Larsen a letter’). This leads to ambiguity with respect to finding the border between these two constituents. In order to solve this, the Norwegian NE CG recogniser has relied on syntactically tagged input.
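The gazetteer part of this can be sketched as a longest-match lookup over token sequences. This is a minimal illustration under stated assumptions, not any NN site's implementation: the two-entry `ORG_GAZETTEER`, the simple regular-expression tokenizer and the function names are hypothetical, and the document centred approach is not covered.

```python
import re

# Hypothetical mini-gazetteer of organisation names, stored as lower-cased
# token tuples. A real gazetteer would be far larger.
ORG_GAZETTEER = {
    ("den", "norske", "kirke"),
    ("svenska", "kyrkan"),
}
MAX_LEN = max(len(entry) for entry in ORG_GAZETTEER)


def tokenize(text):
    """Split text into word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text)


def find_organisations(text):
    """Return gazetteer matches, preferring the longest match at each position.

    Only the first word of a candidate has to be capitalised, which mirrors
    the spelling convention described above: the remaining words of the name
    may be lower case.
    """
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        if tokens[i][0].isupper():
            # Try the longest possible span first, then shorter ones.
            for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                if tuple(lowered[i:i + n]) in ORG_GAZETTEER:
                    match_len = n
                    break
        if match_len:
            found.append(" ".join(tokens[i:i + match_len]))
            i += match_len
        else:
            i += 1
    return found
```

The lookup settles which lower-case words belong to a listed name, but it cannot help with names missing from the gazetteer, which is one reason to combine it with other methods.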

5. Evaluation and Summary

At the moment, we are in the process of evaluating the different systems; see the relevant sections on the NN project in Holmboe (2002, 2003). In the presentation, we plan to give the full results from all the sites and demonstrate the performance of the systems. We will also present some aspects of the work on the identification, classification and annotation of named entities in the three languages, and a detailed account of the named entity classification scheme. The project is described at http://g3.spraakdata.gu.se/nn/, where an online demo for the different languages can also be found.

Acknowledgements

We are grateful to the Nordic Council of Ministers for funding the project through their Language Technology Program. We are also grateful to Torgny Rasmark for developing a large part of the NN web pages, and to Botolv Helleland for expert help on philological name research.

References

1. ACE [EDT]. 2000. Entity detection and tracking – phase 1. http://www.itl.nist.gov/iaui/894.01/tests/ace/phase1/doc/
2. Black W.J., McNaught J., Zarri G.P., Persidis A., Brasher A., Gilardoni L., Bertino E., Semeraro G. and Leo P. 2000. A Semi-automatic System for Conceptual Annotation, its Application to Resource Construction and Evaluation. In Proceedings of the Second Language Resources and Evaluation Conference (LREC). Athens, Greece.
3. Dalianis, H. and E. Åström. 2001. SweNam-A Swedish Named Entity recognizer. Its construction, training and evaluation, Technical report, TRITA-NA-P0113, IPLab-189, NADA, KTH.
4. Grishman, R. and B. Sundheim. 1996. Message understanding conference 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics. Copenhagen.
5. Holmboe, H. 2002. Nordisk Sprogteknologi 2001- Nordic Language Technology. Museum Tusculanums Forlag, University of Copenhagen.
6. Holmboe, H. 2003. Nordisk Sprogteknologi 2002- Nordic Language Technology. Museum Tusculanums Forlag, University of Copenhagen.
7. Kokkinakis, D. 2001. Design, Implementation and Evaluation of Named-Entity Recognizer for Swedish. Research Report from the Department of Swedish, GU-ISS-01-2, Språkdata, University of Gothenburg.
8. Mikheev, A., M. Moens and C. Grover. 1999. Named Entity Recognition without gazetteers. In Proceedings of EACL 99, Ninth Conference of the European Chapter of the Association for Computational Linguistics, pp. 1-8.
9. Mikheev, A. 2000. Document Centered Approach to Text Normalization. In SIGIR'2000. Athens. pp. 136-143.
10. Sekine, S. and H. Isahara. 2000. IREX: IR and IE evaluation project in Japanese. In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, pp. 1475-1480.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004


Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None