Corpus Methods for Interlingual Machine Translation

poster / demo / art installation
  1. Michelle Vanni

    Georgetown University, United States Department of Defense

Corpus Methods for Interlingual Machine Translation (MT)
It is generally recognized that corpus analysis has become a fundamental element in the design of natural language processing (NLP) systems:

Efforts in the development of NLP and [information technology] are converging on the recognition of the importance of some sort of corpus-based research as part of the infrastructure for the development of advanced language processing applications (Atkins, Clear and Ostler 1992:1).

In order to be effective, NLP systems must handle not only those linguistic structures occurring in text which are predictable from explanatory models but also those which are idiosyncratic, occur less frequently, and whose meaning is derived from convention rather than composition. Corpus studies provide evidence of such usages. In the close examination of categories of linguistic phenomena, they also offer insight into new generalities not considered by rational theorists.

There is a noteworthy congruence between findings in corpus analysis and those in MT research regarding the actual coverage of theoretical models that treat syntax and semantics independently. In support of the suggestion that these two levels are instead interdependent, Sinclair (1991) states that a certain structure may be appropriate only for a particular sense of a word and that, conversely, one word sense may have associated with it only a finite set of common syntactic patterns. Lexical studies in support of interlingual MT make a similar point. It has been recognized (Levin and Nirenburg 1991, 1993, 1994a, 1994b) that an interlingual MT lexicon must contain two levels of representation: one that indicates semantic properties from which syntactic behavior can be predicted (B. Levin 1993), and one that expresses meaning as a set of relationships to concepts defined in a structured model of a particular semantic domain (Goodman and Nirenburg 1992). Both are needed to account adequately for the meaning of the conventional linguistic expressions that have come to be known as constructions (Fillmore, Kay and O'Connor 1988; Goldberg 1994).

While the MT research uses cross-linguistic data to argue that neither level alone provides sufficient representation, monolingual data from on-line corpora can be shown to support a similar conclusion: models of processing developed from rational theories account for only a small percentage of what actually occurs in language. Further research on patterns of actual language use is therefore required in order to derive effective grammars that handle the majority of linguistic phenomena occurring in text.

In this paper, we use corpus methods to explore approaches to the analysis of Italian verbs in related semantic fields and of the lexical variation associated with three of a particular verb's morphological forms. Hypotheses regarding the complementary argument structure of frequently occurring verbs in the domains of sensation, cognition and emotion will be tested, and variation among the structures in which present, imperfect and preterit forms appear will be examined for the changes in semantic interpretation with which it may be associated. Based on preliminary findings, an interlingual structure will be proposed to account for these domains and forms.
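
As a rough illustration of the kind of observation described above, the following minimal sketch tallies the word that immediately follows each of three forms of one verb. Everything in it is an assumption made for illustration: the verb (sapere), the form table FORMS, the invented sentences in CORPUS, and the helper right_contexts are hypothetical and do not come from the study's data or procedure.

```python
from collections import Counter, defaultdict

# Hypothetical form-to-tense table for the verb "sapere"; an assumption for
# illustration, not the verb inventory used in the paper.
FORMS = {
    "sa": "present",
    "sapeva": "imperfect",
    "seppe": "preterit",
}

# Invented toy sentences standing in for an on-line corpus.
CORPUS = [
    "Maria sa che Luca parte domani",
    "Maria sa nuotare bene",
    "Luca sapeva che era tardi",
    "Luca sapeva la risposta",
    "Maria seppe la notizia ieri",
    "Luca seppe che Maria era partita",
]


def right_contexts(sentences, forms):
    """Count the token immediately following each verb form, grouped by tense."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Stop one token early so that tokens[i + 1] always exists.
        for i, token in enumerate(tokens[:-1]):
            if token in forms:
                counts[forms[token]][tokens[i + 1]] += 1
    return counts


contexts = right_contexts(CORPUS, FORMS)
for tense in ("present", "imperfect", "preterit"):
    print(tense, dict(contexts[tense]))
```

On a real corpus, the one-token window would likely be replaced by a wider concordance window or parsed complement patterns; the single following token is used here only to keep the sketch short.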

Atkins, B.T.S., J. Clear, and N. Ostler. 1992. Corpus design criteria. Literary and Linguistic Computing 7/1.1-16.

Fillmore, C., P. Kay, and M.C. O'Connor. 1988. Regularity and idiomaticity in grammatical constructions: the case of let alone. Language 64.501-38.

Goldberg, A. 1994. Constructions: A Construction Grammar Approach to Argument Structure. Chicago: University of Chicago Press.

Goodman, K. and S. Nirenburg, eds. 1992. KBMT-89: A Case Study in Knowledge-Based Machine Translation. San Mateo: Morgan Kaufmann.

Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago: The University of Chicago Press.

Levin, L. and S. Nirenburg. 1991. Semantics-driven and ontology-driven lexical semantics. In Lexical Semantics and Knowledge Representation: Proceedings of the First SIGLEX Workshop, University of California at Berkeley, June 1991.

Levin, L. and S. Nirenburg. 1993. Principles and idiosyncrasies in MT lexicons. In Working Notes of AAAI-93 Spring Symposium Series: Building Lexicons for MT, Stanford University.

Levin, L. and S. Nirenburg. 1994a. The correct place of lexical semantics in interlingual machine translation. In Proceedings of COLING-94.

Levin, L. and S. Nirenburg. 1994b. Construction-based MT lexicons. In A. Zampolli, N. Calzolari, and M. Palmer, eds., Current Issues in Computational Linguistics: Studies in Honor of Don Walker. Norwell, MA: Kluwer.

Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Conference Info

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

147 works by 190 authors indexed

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC