A Norwegian tagger and a corpus investigation

Janne Bondi Johannessen

    University of Oslo

The University of Oslo is planning to develop a
tagger for Norwegian (both writing systems) as
part of an overall national project which aims to
increase the electronic resources for linguistic and
literary research in Norwegian. Two other universities – those of Trondheim and Bergen – take part
in the project; their aims are to build one or more
national corpora and a lexicon.
The tagger will be of the same kind as ENGCG
(English Constraint Grammar) developed at the
University of Helsinki, i.e., a disambiguating tagger of constraint grammar type (see for example
Karlsson et al. 1995). This kind of tagger does more
than label each running word in a text with an appropriate tag (for part of speech and, more
generally, morpho-syntactic information). It
also ensures that homonymous words are disambiguated, so that a word like “kind” gets either the
tag ADJECTIVE or the tag NOUN, but not both.
In order to do this, the tagger has to look at the
context of the word in the actual text.
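The idea of contextual disambiguation can be illustrated with a minimal sketch. It is not ENGCG's rule formalism: the mini-lexicon, the tag names other than ADJECTIVE and NOUN, and the two constraints are assumptions made up for illustration. Each word starts with all its possible readings, and constraints discard readings that do not fit the neighbouring words, never removing the last remaining reading.

```python
# Toy constraint-style disambiguation (illustrative; not ENGCG's rule syntax).
# Hypothetical mini-lexicon: every reading a word form can have.
LEXICON = {
    "a": {"DET"},
    "the": {"DET"},
    "kind": {"ADJECTIVE", "NOUN"},
    "teacher": {"NOUN"},
    "of": {"PREP"},
}


def drop_noun_before_noun(readings, i):
    """Assumed constraint: discard NOUN if the next word is unambiguously a noun."""
    nxt = readings[i + 1] if i + 1 < len(readings) else set()
    return readings[i] - {"NOUN"} if nxt == {"NOUN"} else readings[i]


def drop_adjective_without_head(readings, i):
    """Assumed constraint: after a determiner, discard ADJECTIVE if no nominal can follow."""
    prev = readings[i - 1] if i > 0 else set()
    nxt = readings[i + 1] if i + 1 < len(readings) else set()
    if prev == {"DET"} and not (nxt & {"NOUN", "ADJECTIVE"}):
        return readings[i] - {"ADJECTIVE"}
    return readings[i]


CONSTRAINTS = [drop_noun_before_noun, drop_adjective_without_head]


def disambiguate(tokens):
    """Return (token, remaining readings) pairs after applying all constraints."""
    readings = [set(LEXICON[t.lower()]) for t in tokens]
    for i in range(len(readings)):
        for constraint in CONSTRAINTS:
            remaining = constraint(readings, i)
            if remaining:  # never discard the last reading
                readings[i] = remaining
    return list(zip(tokens, readings))


print(disambiguate(["the", "kind", "teacher"]))
print(disambiguate(["a", "kind", "of"]))
```

In the first phrase "kind" keeps only the ADJECTIVE reading, in the second only the NOUN reading. A real constraint grammar has far more rules and a far richer tag set.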
Like the Helsinki-developed ENGCG tagger, the Norwegian
tagger aims to include more than just strict morpho-syntactic categories. In particular, we want to
include syntactic functions where this is possible.
Not only will this help disambiguate certain
words, but it will also make it possible for linguists
to search for these categories in texts. For a linguist
who, for example, is interested in the thematic
roles which can be realised as syntactic subject, it
will be of great help to search for the label SUBJECT rather than the much more general NOUN
PHRASE (which includes not only subjects, but
also direct and indirect objects, predicate nominals, prepositional complements etc.).
An interesting aspect of developing a practical tool
like a tagger is that the linguist who writes the
rules and principles on which the tagger bases its
decisions has to think differently from a theoretical linguist. In
particular, while any linguistic construction is interesting and important for a theoretical linguist,
the linguist involved in a tagging project might
have to ignore marginal constructions in favour of
more common ones, if any generalisations are to
be made at all. I shall claim that corpus studies can
be of great importance prior to the actual development of the constraint grammar, in order to assess,
in critical cases, which constructions are common
and which are not.
In particular, I will look into the tagging of subjects and objects. Like English, Norwegian has no
case morphology in lexical noun phrases. Also
like English, Norwegian does have case distinctions in pronouns. The ENGCG tagger has
used this information indiscriminately, and simply
labelled each nominative pronoun, such as “I” or
“she”, as SUBJECT (Karlsson 1990, Voutilainen,
Heikkilä and Anttila 1992, Anttila 1995), with
apparently remarkably good results. This is despite the fact that there are some kinds of constructions in which pronouns have a morphological
case different from what one would expect. For
example, when pronouns take part in coordinated
structures, they often have deviant case (see Johannessen 1993). Consider “This is between him
and I” or “Me and him went to the cinema”, both
of which occur regularly in English, although they
are slightly substandard.
Norwegian is different from English, however, in
that nominative pronouns can be used in a variety
of different constructions, in several of which they
cannot function as syntactic subject. One could, then,
abandon the attempt to mark anything as subject.
However, it is at this point that the applied and
the theoretical linguist think differently. The applied
linguist wants the tagger to work for most texts,
even at the cost of making mistakes in marginal
cases.
In Norwegian, nominative pronouns can be used
in at least the following constructions:
(1)
a. Hun fru Andersen er jammen rar
(Article, part of subject)
she Mrs Andersen is really strange
b. Jeg hørte hun fru Andersen i dag
(Article, part of object)
I heard she Mrs Andersen today
c. Hun som bor her nå er jammen rar
(Head with rel.clause, part of subject.)
she who lives here now is really strange
d. Jeg liker hun som bor her nå
(Head with rel.clause, part of object)
I like she who lives here now
e. Hun som jeg så i går er jammen rar
(Head with rel.clause, part of subject)
she who I saw yesterday is really strange
f. Jeg liker hun jeg så i går
(Head with rel. clause, part of object)
I like she I saw yesterday
g. Hun i første etasje er jammen rar
(Head with PP, subject)
she on the.first floor is really strange
h. Jeg liker hun i første etasje
(Head with PP, object)
I like she on the.first floor
i. Det er hun
(Copular complement)
it is she
It is self-evident that marking every nominative
pronoun as a subject may give very wrong results.
However, if we investigate the corpora we have at
hand, we will find that several of the constructions
are not common, and that the extent to which they
occur varies from one pronoun to another. These
results will be shown in more detail in the talk.
The results are very useful as far as the tagger is
concerned. The non-subject constructions in (1)
can be ignored unless there is corpus evidence for
their existence in texts. I will show that constraint
rules can be formulated which safely categorise
nominative pronouns as subjects when there is no
corpus evidence to the contrary. I will furthermore
show that the contexts in which nominative pronouns are actually found in non-subject constructions can often be clearly defined. This means that
we will be able to write constraint rules which look
at the context of a given pronoun and assign it
the label SUBJECT unless it occurs in some specified non-subject context.
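A rule of this "SUBJECT unless" kind might be sketched as follows. The word lists and the three non-subject contexts are assumptions chosen to match some of the examples in (1); they are not the contexts established by the corpus investigation, and they are deliberately incomplete (the relative clause without "som" in (1f), for instance, is not covered).

```python
# Illustrative sketch only: the word lists and contexts below are assumptions,
# not the rules of the planned tagger.
NOMINATIVE = {"jeg", "du", "han", "hun", "vi"}
TITLES = {"fru", "herr"}      # pronoun used as article before a title, cf. (1b)
PREPOSITIONS = {"i", "på"}    # pronoun as head of a PP-modified phrase, cf. (1h)
COPULAS = {"er", "var"}       # pronoun as copular complement, cf. (1i)


def in_non_subject_context(tokens, i):
    """Return True if the pronoun at position i is in an assumed non-subject context."""
    prev_word = tokens[i - 1] if i > 0 else None
    next_word = tokens[i + 1] if i + 1 < len(tokens) else None
    if prev_word is None:
        return False  # sentence-initial pronouns are treated as subjects here
    if prev_word in COPULAS and next_word is None:
        return True
    return next_word in TITLES or next_word in PREPOSITIONS or next_word == "som"


def tag_subjects(sentence):
    """Label each nominative pronoun SUBJECT unless its context blocks the label."""
    words = sentence.split()
    tokens = [w.lower() for w in words]
    tagged = []
    for i, (word, token) in enumerate(zip(words, tokens)):
        is_subject = token in NOMINATIVE and not in_non_subject_context(tokens, i)
        tagged.append((word, "SUBJECT" if is_subject else None))
    return tagged


print(tag_subjects("Jeg hørte hun fru Andersen i dag"))
print(tag_subjects("Det er hun"))
```

The default label is assigned first and only withheld in listed contexts, which mirrors the strategy described above: frequent cases are handled by a general rule, and exceptions are added only where the corpus shows they occur.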
The fact that no lexical noun phrases and only
some pronominal phrases can be marked as subjects may seem very disappointing. However, although pronouns are few in number as lexical
items, they are of course frequent in running text,
and we will hopefully be able to mark many subjects this way. Further, the same kind of investigation that I have presented for subjects has also
been carried out for accusative pronouns as various kinds of objects. The results here are even more
promising. Altogether, then, we can be optimistic.
In actual sentences, the likelihood that at least one
of the verb’s arguments is a pronoun is rather
good. Given a clause with a transitive verb, if it
contains a pronoun which is marked as an object,
we can safely conclude that the other nominal
phrase is a subject, as in (2a), and if a pronoun is marked
as a subject, we can conclude that the other phrase
is an object, as in (2b).
(2)
a. Den snille læreren så henne straks
the kind teacher saw her immediately
b. Hun så den snille læreren straks
she saw the kind teacher immediately
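The inference in (2) can be sketched in a few lines. The chunk representation, the class name Chunk and the function infer_roles are hypothetical, and the sketch assumes that the clause has already been divided into phrases and that the pronoun has already received its label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """A phrase in a clause; `kind` is "NP", "VERB" or "ADV" (assumed categories)."""
    text: str
    kind: str
    role: Optional[str] = None


# If one nominal phrase has one of these roles, the other gets the counterpart.
COUNTERPART = {"SUBJECT": "OBJECT", "OBJECT": "SUBJECT"}


def infer_roles(clause: List[Chunk]) -> List[Chunk]:
    """In a transitive clause with two nominals, label the unlabelled one."""
    nominals = [c for c in clause if c.kind == "NP"]
    if len(nominals) != 2:
        return clause  # the inference only applies to two-argument clauses
    labelled = [c for c in nominals if c.role in COUNTERPART]
    if len(labelled) == 1:
        other = next(c for c in nominals if c is not labelled[0])
        other.role = COUNTERPART[labelled[0].role]
    return clause


# (2a): the pronoun "henne" is already marked OBJECT
clause_a = infer_roles([
    Chunk("Den snille læreren", "NP"),
    Chunk("så", "VERB"),
    Chunk("henne", "NP", "OBJECT"),
    Chunk("straks", "ADV"),
])
# (2b): the pronoun "Hun" is already marked SUBJECT
clause_b = infer_roles([
    Chunk("Hun", "NP", "SUBJECT"),
    Chunk("så", "VERB"),
    Chunk("den snille læreren", "NP"),
    Chunk("straks", "ADV"),
])
print([(c.text, c.role) for c in clause_a])
print([(c.text, c.role) for c in clause_b])
```

The function leaves the clause unchanged when there are not exactly two nominal phrases or when neither (or both) is already labelled, since the conclusion is only safe in the situation described above.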
The results of this corpus investigation are interesting from both a theoretical and an applied point
of view. The latter is obvious, of course: the results make
it possible to identify subjects, which in turn will
help us identify objects and other syntactic
functions for the tagger. Theoretically, the findings are interesting because they show that some
constructions which sound perfectly normal, and
not at all like “linguistese”, are almost non-existent in actual written texts. This points to greater
differences between spoken and written language
than have previously been recognised.
Anttila, A. 1995. How to recognise subjects in
English. In Karlsson et al. (eds.) 1995.
Johannessen, J.B. 1993. Coordination. A minimalist approach. Doctoral dissertation, University of Oslo.
Karlsson, F. 1990. Constraint Grammar as a framework for parsing running text. In Karlgren,
H. (ed.) Papers presented to the 13th International Conference on Computational Linguistics, Vol. 3, Helsinki, pp. 168-73.
Karlsson, F., A. Voutilainen, J. Heikkilä and A.
Anttila (eds.). 1995. Constraint Grammar.
Mouton de Gruyter, Berlin.
Voutilainen, A., J. Heikkilä and A. Anttila. 1992.
Constraint grammar of English – A performance-oriented introduction. Department of
General Linguistics, University of Helsinki,
Publications No. 21.

