An American National Corpus: a Large Balanced Text Corpus for American English

Authorship
  1. Catherine Macleod

    New York University

  2. Nancy Ide

    Vassar College

Introduction:

The importance of corpora as resources has become increasingly accepted over the years. Many types of corpora have been used for different purposes, but if one is searching for examples of "general" application rather than restricting oneself to a particular sub-language, the development of a balanced corpus is of primary importance. Of equal importance is the adoption of a uniform annotation standard. The main areas of application of a text corpus are lexicography (including computational lexicography) and natural language processing (NLP), specifically including adaptation to different domains and genres. For these purposes the corpus must be large (at least 100 million words), contemporary, heterogeneous, uniformly annotated and, for use in the United States, must contain American English. Size ensures the adequate representation of infrequent words. The selection of contemporary texts is important for both lexicography and NLP, particularly in view of the significant changes in common text genres brought about over the last few years by electronic communication. Heterogeneity ensures that the range of language usage needed for the creation of "general language resources" is represented, and that one can explore a wide spectrum of language genres for NLP. Uniform annotation is paramount in any corpus, and the collection of American texts ensures that the grammatical and lexical features peculiar to British English will not interfere with the classification of American English.
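
The effect of size on infrequent words can be illustrated with simple arithmetic. The following is a minimal sketch in Python; the function name and the rate of 0.5 occurrences per million words are hypothetical values chosen for illustration, not figures taken from any corpus.

```python
def expected_occurrences(rate_per_million: float, corpus_words: int) -> float:
    """Expected count of a word, assuming it occurs at a constant rate."""
    return rate_per_million * corpus_words / 1_000_000


# Hypothetical word that occurs 0.5 times per million words of running text.
rate = 0.5
for size in (1_000_000, 100_000_000):
    print(size, expected_occurrences(rate, size))
```

Under this assumption the word is expected less than once in a one-million-word corpus, but about 50 times in a corpus of 100 million words.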

Background:

The first American text corpus that strove for this balance was the Brown Corpus, developed by Kucera and Francis at Brown University in the 1960s [4]. It was the model for many corpora that followed and is still in use today. However, it is a small corpus (one million words) and somewhat dated (the texts are at least 30 years old). It is true that the grammar of a written language changes rather slowly over time, but there are changes in structure, and new lexical items are added quite frequently.

Recently, the British National Corpus (BNC) was released. It is a carefully balanced and very large corpus (one hundred million words). It also has the advantage of covering the period from where the Brown Corpus left off until 1993. There are, nonetheless, two distinct disadvantages for natural language researchers and dictionary producers in the United States: (1) the corpus is, as yet, unavailable for use outside Europe, and (2) the corpus contains texts of British, not American, English.

Differences between American and British English:

The grammar of American English (A.E.) differs from that of British English (B.E.) quite significantly. For example, British English often uses a to-infinitive complement where American English does not. In the following examples from the BNC, "assay", "engage", "omit" and "endure" appear with a to-infinitive complement; no examples of this construction were found in our corpus, although the verbs themselves did appear.

Examples:

B.E. "Jerome crept to the foot of the steps, and there halted, baulked, rather, like a startled horse, drew hard breath and ASSAYED TO MOUNT, and then suddenly threw up his arms to cover his face, fell on his knees with a lamentable, choking cry, and bowed himself against the stone of the steps."

B.E. "A magnate would ENGAGE TO SERVE with a specified number of men for a particular time in return for wages which were agreed in advance and paid by the Exchequer."

B.E. "'What did you OMIT TO TELL your priest?'"

A.E. "'What did you OMIT TELLING your priest?'"

B.E. "But Carteret's wife, who frequented health spas, could not ENDURE TO LIVE with him or he with her: there were no children."

A.E. "But Carteret's wife, who frequented health spas, could not ENDURE LIVING with him or he with her: there were no children."

For the first two verbs, one can argue that there is no equivalent verbal meaning in A.E., but for the last two the meaning can be paraphrased in A.E. with the gerund.
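
A comparison of this kind can be run mechanically over a tagged corpus. The following Python sketch is one possible way to do so, not the procedure used to find the examples above; the (word, lemma, tag) token format, the Penn Treebank style tag names and the function name are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Assumed token format: (word form, lemma, POS tag).
Token = Tuple[str, str, str]


def find_to_infinitive(tokens: List[Token], verb_lemmas: Set[str]) -> List[Tuple[str, str]]:
    """Return (verb, infinitive) pairs where a listed verb is followed by 'to' + base verb."""
    hits = []
    for i in range(len(tokens) - 2):
        word, lemma, pos = tokens[i]
        next_word = tokens[i + 1][0]
        inf_word, _, inf_pos = tokens[i + 2]
        if (lemma in verb_lemmas and pos.startswith("V")
                and next_word.lower() == "to" and inf_pos == "VB"):
            hits.append((word, inf_word))
    return hits


verbs = {"assay", "engage", "omit", "endure"}

# Hand-tagged fragment of the B.E. example with "endure".
british = [
    ("could", "could", "MD"), ("not", "not", "RB"),
    ("endure", "endure", "VB"), ("to", "to", "TO"), ("live", "live", "VB"),
    ("with", "with", "IN"), ("him", "he", "PRP"),
]
# Hand-tagged fragment of the A.E. paraphrase with the gerund.
american = [
    ("could", "could", "MD"), ("not", "not", "RB"),
    ("endure", "endure", "VB"), ("living", "live", "VBG"),
    ("with", "with", "IN"), ("him", "he", "PRP"),
]

print(find_to_infinitive(british, verbs))
print(find_to_infinitive(american, verbs))
```

The pattern only matches a verb immediately followed by "to" and a base-form verb, so constructions with intervening material would need a looser search.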

Adverbial usage is also different. The B.E. use of sentence-initial "immediately" as a conjunction is not allowed in A.E. For example, B.E. "Immediately I get home, I will attend to that." is incorrect in A.E., where we would say "As soon as I get home, I will attend to that."

Other syntactic differences include the formation of questions with the main verb "have". In B.E., one can say "Have you a pen?", whereas A.E. speakers must use "do" ("Do you have a pen?"). Support verbs for nominalizations also differ: note the B.E. "take a decision" vs. the A.E. "make a decision".

Given these considerable differences, and the fact that lexical items may be over-represented, under-represented or not present at all, it is clear that a corpus of American English is needed.

The proposed American National Corpus:

As seen above, the corpora we have been working with are inadequate, and the BNC, although it meets our standards of size and balance, does not deal with our language. In 1998, at the first LREC conference, a proposal was made to create an American National Corpus (ANC) very much along the lines of the BNC (Fillmore et al., 1998 [1]).

The corpus should be, as far as possible, contemporary (1990s). It should be both static (like the BNC) and dynamic (like COBUILD). We will add regular increments but retain the ability to return to the initial corpus as well as to the static stages between increments.
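
The combination of a static and a dynamic corpus can be pictured as an append-only collection that remembers where each increment ended. The following Python sketch shows one way this could be arranged; the class and method names are hypothetical and do not describe an actual ANC design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncrementalCorpus:
    """Append-only corpus that records the end of every increment."""
    documents: List[str] = field(default_factory=list)
    stage_ends: List[int] = field(default_factory=list)

    def add_increment(self, new_docs: List[str]) -> int:
        """Append documents and return the number of the new stage (0 = initial corpus)."""
        self.documents.extend(new_docs)
        self.stage_ends.append(len(self.documents))
        return len(self.stage_ends) - 1

    def snapshot(self, stage: int) -> List[str]:
        """Return the corpus exactly as it stood after the given stage."""
        return self.documents[: self.stage_ends[stage]]


corpus = IncrementalCorpus()
initial = corpus.add_increment(["doc-a", "doc-b"])  # initial static corpus
later = corpus.add_increment(["doc-c"])             # first regular increment

print(corpus.snapshot(initial))
print(corpus.snapshot(later))
```

Because documents are only ever appended, every earlier stage remains recoverable as a prefix of the current corpus.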

The corpus will be both balanced and heterogeneous; the collection of more than 100 million words will make this possible. 100 million words of the ANC should be comparable in balance to the BNC, to enable cross-linguistic studies of British and American English. There is no set definition of what it means for a corpus to be balanced. The BNC made a principled effort to balance its corpus (see the BNC User's Reference Guide [2] for a breakdown), and the ANC will use this as a model. However, since it is also desirable to provide significant components from a wide range of styles, the remaining text will be varied rather than balanced (i.e. we will not aim for percentages of texts that reflect their representative importance in the language, but for smaller samples of a greater variety of texts).

The corpus will be annotated at two levels, which serve two different user groups. The Base Level will be annotated fully automatically for document, paragraph, sentence and token, with POS marking. Level 1 will be heavily manual and will add text structure (titles, headers, footnotes, tables, captions, lists, etc.) following the CES standard (Ide et al. [3]).
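
The nesting of units in the Base Level annotation can be sketched as a small data structure. The Python below is an illustration only: the class names, field names and tag labels are assumptions, and it does not reproduce the CES encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech label; the tag set used here is assumed


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    paragraphs: List[Paragraph] = field(default_factory=list)

    def token_count(self) -> int:
        """Count tokens across all paragraphs and sentences."""
        return sum(len(s.tokens) for p in self.paragraphs for s in p.sentences)


# One-sentence document built from the A.E. example "Do you have a pen?".
sentence = Sentence([
    Token("Do", "VBP"), Token("you", "PRP"), Token("have", "VB"),
    Token("a", "DT"), Token("pen", "NN"), Token("?", "."),
])
doc = Document("example-001", [Paragraph([sentence])])

print(doc.token_count())
```

Level 1 structure such as titles or footnotes could be added to a design like this as further element types, without changing the base units.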

Progress towards the creation of the ANC:

The ANC has progressed since its genesis at LREC 98. In May of 1999 the first ANC meeting preceded the Dictionary Society of North America (DSNA) meeting at the University of California at Berkeley. It was attended by a number of representatives of publishing houses. The idea of an American National Corpus was well received and plans for a second meeting were agreed upon.

The second meeting took place at New York University. Invitees to this meeting included not only those present at the May meeting but also publishers from Japan and representatives of various software companies from the U.S. and Europe. More substantial issues were discussed, including the structure of the consortium, questions of balance in the corpus, funding, time schedules and licensing agreements. Some questions were decided; others, such as balance and licensing, were referred to committees for further discussion.

The shape of the consortium and future plans:

Licensing and base level annotation are to be done through the LDC (UPenn). UPenn will obtain licenses from text providers and provide licenses to users. With regard to data rights, there will be multiple classes. The expectation is that some subset of the data can be made available under a form of general public license, and hence can be freely redistributed under that license.

The membership agreement provides for paid memberships from commercial organizations. These members will receive the data as soon as it is processed and have exclusive rights to this data for a period of three years. They are expected to make monetary as well as data contributions. The data will be freely available to non-profit educational and research organizations (aside from a nominal fee for licensing and distribution).

Our plan is for the base level to be paid for with consortium fees. We have a 3-year time-frame starting Jan. 2000, with 10% of the corpus deliverable by summer 2000. Level 1 annotation, which will require external funding, will proceed as that funding allows, and may therefore lag as much as a year behind the base level corpus. Our goal is a fully annotated Level 1 corpus compliant with the CES standard.

References

[1] Fillmore, C., Ide, N., Jurafsky, D. and Macleod, C. (1998). "An American National Corpus: A Proposal". The Proceedings of LREC, Granada, Spain, May 28-30, pp. 965-969.
[2] Burnard, L. (ed) (1995). "British National Corpus: User's Reference Guide for the British National Corpus", Oxford University Computing Services, May, 1995, pp. 13-19.
[3] Ide, N., Romary, L. and Bonhomme, P. (submitted). "CES/XML: An XML-based Standard for Linguistic Corpora". Submitted to the Second International Language Resources and Evaluation Conference.
[4] Francis, W.N. and Kucera, H. (1964). "Manual of Information to Accompany 'A Standard Sample of Present-Day Edited American English, for Use with Digital Computers'". Department of Linguistics, Brown University, Providence, RI (revised 1979).

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2000

Hosted at University of Glasgow

Glasgow, Scotland, United Kingdom

July 21, 2000 - July 25, 2000

Conference website: https://web.archive.org/web/20190421230852/https://www.arts.gla.ac.uk/allcach2k/

Series: ALLC/EADH (27), ACH/ICCH (20), ACH/ALLC (12)

Organizers: ACH, ALLC
