Modeling the Lexicon with Ontologies

  1. 1. Kip Canfield

    University of Maryland, Baltimore

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The use of markup languages (typically XML) for
modeling the natural language lexicon is currently
widespread. It allows a standard representation format
that is friendly to the web, interoperable, and leverages
horizontal standards in that all general XML tools can
be used (Simons 2004). This paper makes an argument
for elaborating this practice to use ontologies for modeling
lexicons. These would be commonly serialized in
OWL/XML, but could use the standard XML transformation
capability to use other serializations without any
problem. The Lexical Markup Framework (LMF) is a
popular ISO standard candidate for lexicons and shows
the benefits described above. Since LMF is XML and
transformations can allow one schema to be changed
into another, using ontologies in this context has been
suggested such as using an LMF to OWL transformation
(Francopoulo 2007) to incorporate ontologies into a
service-oriented language infrastructure (Hayashi 2008).
This paper proposes using ontologies in a much more
local way to support development and testing of natural
language lexicons, but at the same time supporting the
goals of interoperability and standards use. The argument
for this novel use of the semantic web technology
is advanced using a scenario from the development of a
computational lexicon of the Navajo language of North
In the context of a long running project to create a computational
resource for the Navajo lexicon that can be
used for reading and annotating a corpus of Navajo texts,
an XML-based model of the Navajo lexicon was developed.
This lexicon has a native format of an XML schema,
but leverages transformations to customize the format
for any particular use. For example, the web based
applications that display texts from the corpus and allow
access to the lexicon, use a JSON transform since that
is most convenient for web applications. Similarly, the
lexicon can be transformed to OWL for creating an ontology.
The ontology can be used in much the same way
as LMF documents or databases have been used (Beck
2007), but with additional useful properties.
The class structure from the original lexicon model is a template structure with complex morphology that
can be seen as a slot/filler structure were only certain
kinds of affixes can be put into any slot (Young 1992;
Faltz 1998). So the Navajo verb class has an ordered
sequence of properties that correspond to the template
(affix order=outer, distributed plural, inner, object, subject,
classifier, stem). Some of the slot content can be
predicted from context and some must be defined in the
lexicon. For an OWL ontology, this would be the information
in the individual. For example, the individual for
the class verb “cut it out” would be:
(1) individual V2 is Verb and OuterPx=ha, Transitive=
1, Conjugation=S, Classifier=ł, Sid=2, Gloss=”cut
it out”;
These examples use the non-XML serialization from the
Bossam reasoner (Jang 2004) for ease of human reading.
Two other classes are defined for the stem set and the
subject prefixes. A stem set is a set of bound stems that
change shape depending on the verb mode and aspect. A
sample individual stem from the class stem set (StemSet)
(2) individual S3 is StemSet and Sid=2, Root=gizh,
Mode=I, Aspect=cont, Stem=géésh;
The stem in (2) is only for verbs with imperfective mode
and continuative aspect and there would be 5 stems in
the set. The subject prefixes vary widely depending on
the conjugation type and other parameters. A sample individual
from the class subject prefix (SubPxSet) is:
(3) individual Sj2 is SubPxSet and Mode=I,
Conjugation=S, Pnum=3, SubjPx=Ø;
This subject prefix is for the ‘Simple’ conjugation and
the third person. The main point being that many of the
parts of an underlying form for the verb (1) are predictable
and so do not appear in the lexical entry. In order to
supply these predictable parts, the semantic web extension
for ontologies - the Semantic Web Rule Language
(SWRL) is used. This allows rules to apply reasoning to
the individuals in the ontology. For example, since the
distributive plural is only for 3 or more, a simple rule
using unification can add that prefix of ‘da.’ (Pnum=8 is
the person number for the 8th member of a listing of the
conjugation. Navajo has both a 3rd and 4th person, so
the 9th would be the first member after the dual plurals.)
(4) rule Rule1 is if Verb(?v) and Pnum(?x) and [?x>8]
then DistPlPx(?v, da);
This also uses the Bossam rule serialization for ease of
reading, but it can also read the XML serialization of
SWRL. Similarly all the predictable parts of the underlying
verb form are generated. For example, the verb for
“cut it out” in the third person distributive plural is generated
by the rule base:
(5) OuterPx= nav:ha, DistPlPx= nav:da, ObjPx= nav:y,
InnerPx= nav:Ø, SubjPx= nav:Ø, Classifier= nav:ł,
Stem= nav:géésh
The output in (5) shows that each slot in the template has
been correctly filled using the information in the lexical
entry and using rules (nav: is the namespace). The underlying
form for that verb is ha-da-y-Ø-Ø-ł-géésh. This
can in turn be transformed to the surface form using a set
of morphophonemic rules that have been previously developed
for this project. It uses the finite state morphology
tool xfst (Beesley 2003) and produces hadeiłgéésh
which is the distributive third person of the surface form.
This is a command line tool and so the above can be
done with a single unix pipeline. This is an effective tool
for checking lexical entries. For example, a complete
paradigm can be cosnstructed using the lexical ontology
rules to produce the underlying form (left-side of (6))
and xfst to produce the surface form (right-side of (6)) :
(6) Complete imperfective paradigm for: individual P5
is Verb and OuterPx=Ø, Transitive=0, Conjugation=S,
Classifier=Ø, Sid=5, Gloss=”cry”;
1sing: Ø-Ø-Ø-Ø-sh-Ø-cha -> yishcha
2sing: Ø-Ø-Ø-Ø-ni-Ø-cha -> nicha
3sing: Ø-Ø-Ø-Ø-Ø-Ø-cha -> yicha
4sing: Ø-Ø-Ø-Ø-j-Ø-cha -> jicha
1dual: Ø-Ø-Ø-Ø-iid-Ø-cha -> yiicha
2dual: Ø-Ø-Ø-Ø-oh-Ø-cha -> wohcha
1distpl: Ø-Ø-Ø-da-iid-Ø-cha -> deiicha
2distpl:Ø-Ø-Ø -da-oh-Ø-cha -> daohcha
3distpl: Ø-Ø-Ø-da-Ø-cha -> daacha
4distpl: Ø-Ø-Ø-da-j-Ø-cha -> dajicha
Analysis and Conclusions
Using ontologies for modeling the lexicon has benefits
at both the local, lexicographer level and the global level
for searching and interoperation. The approach has particular
benefit for languages with complex morphology.
As seen in this example from the Navajo language, the
process for setting up an underlying form has many parts
that can be modeled in the lexicon using the Semantic
Web technology. Typically, lexical entries would be isolated
for grammatical exposition, but with this approach,
the lexicon becomes an active resource that incorporates
those grammatical facts. Furthermore, the technology is horizontal in nature, meaning that the tools used are not
particular to this application or even this class of applications.
That makes it is easier for everyone to understand
it and share it based on familiarity with the general
technology of the Semantic Web. This helps to avoid the
short ‘half-life’ common to computational lexicography
projects (Maxwell 2008).
More globally, this approach allows the lexical model
to participate in the more common applications of the
Semantic Web such as annotation, search, and linking
resources on the Internet. The resource can also participate
in global semantic relations as with WordNet. For
the Navajo lexicon described here, connecting the model
to other ontologies was made simple. For example, each
class is connected to the GOLD general OWL ontology
for linguistic concepts (Farrar 2003) dynamically over
the Internet using the built-in OWL property owl:sameAs
which links an individual to an individual. Since much
of the terminology in any particular language’s linguistic
literature can be divergent, this helps interoperation
(Chiarcos 2008).
Beck, H. (2007). Contextual Archiving with Linguistic
Analysis: An Ontology-Based Approach to Developing a
Linguistic Database. Workshop Proceedings of “Toward
the Interoperability of Language Resources”, July 13-15
at Stanford University in conjunction with the 2007 LSA
Summer Institute,
cfm, (accessed 11/11/2008).
Beesley, K. and Karttunen, L. (2003). Finite State Morphology.
Stanford: CSLI Publications.
Chiarcos, C. (2008). An ontology of linguistic annotations.
LDV-Forum, 23 (1) , 1-16, http://www.ldv-forum.
org/?language=en, (accessed 11/11/2008).
Faltz, L. (1998). The Navajo Verb. Albuquerque: University
of New Mexico Press.
Farrar, S. and Langendoen, T. (2003). A Linguistic Ontology
for the Semantic Web. GLOT International 7(3),
GLOT-LinguisticOntology.pdf, (accessed 11/11/2008).
Francopoulo, G. (2007). Strategy for an OWL specification
of LMF.
(accessed 11/11/2008).
Hayashi, Y., Declerck, T., Buitelaar, P. and Monachini,
M. (2008). Ontologies for a Global Language Infrastructure.
The First International Conference on Global
Interoperability for Language Resources (ICGL2008),, (accessed
Jang, M. and Sohn, J. (2004). Bossam: An Extended
Rule Engine for OWL Inferencing. Antoniou, G. and
Boley, H.(Eds.): RuleML 2004, LNCS 3323, 128–138,, (accessed
Simons, G., Lewis, W., Farrar, S., Langendoen, D.,
Fitzsimons, B. & Gonzalez, H. (2004). The Semantics of
Markup: Mapping Legacy Markup Schemas to a Common
Semantics. Proceedings of the XMLNLP Workshop,
Association for Computational Linguistics, Barcelona,
pdf, (accessed 11/11/2008).
Maxwell, M. and David, A. (2008). Interoperable Grammars.
The First International Conference on Global Interoperability
for Language Resources, Hong Kong, January,, (accessed 11/11/2008).
Young, R. and Morgan, W. (1992). Analytical Lexicon of
Navajo. Albuquerque: University of New Mexico Press.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None