The International Corpus of English (ICE)-Canada
Keywords: Corpus, Canadian English
Thirty years ago, Nelson Francis of Brown University recognized that the computer makes it possible to process statistically adequate samples of authentic language automatically. He and his colleagues created the Brown Corpus of American English, a one-million-word collection of texts representing 15 genres of written English. The Lancaster-Oslo/Bergen (LOB) corpus, an exactly parallel corpus of British English, soon followed. The next important corpus, the London-Lund corpus of spoken British English, was a half-million-word computer-readable version of the spoken English component of the Survey of English Usage (SEU) corpus, initiated by Sir Randolph Quirk at a time when a computer-readable corpus still seemed visionary. Hundreds of studies of American and British English have been based on data from these pioneering corpora. Nevertheless, it is only in this decade that language scientists generally have begun to acknowledge the theoretical and practical significance of computer corpora in the description and analysis of language. Thus, as late as 1987, Geoffrey Sampson observed that specialists in natural language processing based the computer grammars they wrote on highly artificial toy subsets of English. He contrasted the authentic language of the LOB corpus with the invented sentences which preoccupied specialists in natural language processing.
Sampson concluded that the use of corpora, and the probabilistic theories they engender, would ultimately overtake the then-dominant view. This has in fact happened, and much sooner than Sampson anticipated. Corpus linguistics is now in vogue. A significant measure of this recent vogue is the many millions of dollars that U.S. funding agencies have poured into archiving large collections of computer-readable texts. In July 1991 the Defense Advanced Research Projects Agency (DARPA) announced major funding for the formation of the Linguistic Data Consortium, now housed at the University of Pennsylvania. More recently (June 1995), the National Science Foundation announced significant funding for improvements in basic speech and text data resources and for new approaches to data collection, distribution, access, and analysis. In the interim, the availability of electronic language data from all sources has accelerated continually.
2. The International Corpus of English (ICE)
It was within this new context that the late Prof. Sidney Greenbaum, Director of the Survey of English Usage at University College London and one of the authors of the Comprehensive Grammar of the English Language, initiated the innovative, interdisciplinary and international research program designated ICE (International Corpus of English). Following in the tradition of the Brown, LOB and London-Lund corpora, the purpose of ICE is the compilation of one-million-word corpora of contemporary English, each sampling the same (or, where this is not possible, closely similar) text categories. In initiating ICE, Prof. Greenbaum acted on the increasing recognition of the importance of world Englishes and, within each such variety, the significance of unplanned discourse, whether spoken or written. A half-million words in each corpus will sample 15 categories of spoken English, ranging from impromptu, face-to-face conversation to carefully rehearsed speeches. Similarly, the 15 categories in the written half of each corpus will sample manuscripts, including personal letters, as well as a range of printed documents.
At the present time, 21 countries and 16 research teams are participating in ICE. These include countries like Canada, where a majority of the population speaks English as a native language, and three East African countries where it has official or quasi-official status. Each research team is preparing: 1) a transcribed version of the spoken corpus; 2) a tagged version, in which each word is assigned a part-of-speech label; and 3) a parsed version, in which the grammatical constituents of each sentence are identified and labelled. The analyzed corpora will be distributed in both printed and electronic formats.
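The three versions can be sketched in miniature as follows; the tag labels, the bracketed parse notation, and the example sentence are illustrative inventions, not the actual ICE/TOSCA annotation scheme.

```python
# Illustrative sketch of the three corpus versions described above.
# The tag names (PRON, VERB, ...) and the (label, children) tree notation
# are placeholders, not the actual ICE/TOSCA annotation scheme.

transcribed = "she sells sea shells"

# Tagged version: each word paired with a part-of-speech label.
tagged = [("she", "PRON"), ("sells", "VERB"), ("sea", "NOUN"), ("shells", "NOUN")]

# Parsed version: grammatical constituents identified and labelled,
# here as a simple nested (label, children) tree.
parsed = ("S",
          [("NP", [("PRON", "she")]),
           ("VP", [("VERB", "sells"),
                   ("NP", [("NOUN", "sea"), ("NOUN", "shells")])])])

def leaves(tree):
    """Recover the word sequence from a parse tree."""
    label, children = tree
    if isinstance(children, str):          # pre-terminal: (POS, word)
        return [children]
    return [w for child in children for w in leaves(child)]

# All three versions describe the same word sequence.
assert leaves(parsed) == [w for w, _ in tagged] == transcribed.split()
```

The point of keeping all three versions is that each layer adds information without discarding the one beneath it, so a researcher can consult exactly the level of analysis a study requires.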
ICE-Canada began at the Strathy Language Unit at Queen's University under the direction of M. Fee, its then Director. Much of the written corpus and some of the spoken corpus was collected there between 1990 and 1993. Since 1993, ICE-Canada has been directed by N. Belmore and S. Bergler at Concordia University. The purpose of this poster presentation is to describe the development and current status of the ICE-Canada corpus and to detail some of the challenges in preparing a fully-annotated corpus.
The presentation will include posters that briefly describe the corpus, illustrate the techniques we have used for data collection, now almost complete, and the techniques we have been developing, in cooperation with other researchers, for corpus annotation. These include: 1) digitization and transcription of the spoken data so that each element in the transcription is linked to the audio segment it represents; 2) evaluation of the transcriptions; 3) preparation of the written data, much of which is scanned and requires proof-reading, though some, like the personal letters, has been keyed in; 4) tagging (assigning a part-of-speech label to each word); 5) tagger evaluation; 6) parsing; and 7) parser evaluation. The presentation will also describe some of the formats in which the data will ultimately be made available to researchers both in Canada and worldwide.
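Step 1, linking each transcription element to the audio segment it represents, might be modelled along the following lines; the field names and time values here are invented for illustration and do not reflect our actual file format.

```python
# Minimal sketch of linking transcription elements to audio segments.
# Field names and times are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class AlignedToken:
    text: str
    start: float  # seconds into the recording
    end: float

utterance = [
    AlignedToken("well", 0.00, 0.31),
    AlignedToken("I", 0.31, 0.42),
    AlignedToken("think", 0.42, 0.78),
]

def audio_span(tokens):
    """Audio region covering a transcribed stretch, e.g. for playback."""
    return (tokens[0].start, tokens[-1].end)

def tokens_at(tokens, t):
    """All transcription elements overlapping time t. An evaluation step
    might use this to compare two transcribers' alignments at one instant."""
    return [tok.text for tok in tokens if tok.start <= t < tok.end]

assert audio_span(utterance) == (0.00, 0.78)
assert tokens_at(utterance, 0.5) == ["think"]
```

Once every element carries its own time span, the later evaluation, tagging and parsing stages can always be traced back to the original signal.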
In a recent on-line corpus linguistics discussion group, one participant claimed that it is now possible to achieve 'a Brown Corpus an hour---a million words per hour', thus revealing a common confusion between an opportunistically collected language archive and a true corpus. A corpus, in Nelson Francis's own words, is "...a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis. This takes account of the fact that a corpus may be purposely skewed---toward legal or scientific language, for example---and that it may be used for phonological, graphemic, lexical, or semantic, as well as grammatical analysis." Assembling and analyzing a corpus defined in this way is a major undertaking, and our presentation will emphasize the problems involved. We will give particular attention to problems in the collection, time-aligned transcription and grammatical analysis of spoken language.
Although spoken language, in particular the unplanned discourse of face-to-face conversation, dominates all other manifestations of language and is, for many practical applications, far more important than written language, previous scholarship has focused overwhelmingly on written language. Thus, the part of the Birmingham corpus used as the basis for the original COBUILD dictionary consisted of 18.5 million words of written text but only 1.5 million words of spoken text.
One of the most serious roadblocks to an adequate and accurate description of spoken language has been the fact that transcriptions, where they exist, have not been aligned to the particular speech segments they represent. ICE-Canada has been investigating ways to achieve this. We have been testing Signalyze, special speech analysis software developed by Eric Keller at the University of Lausanne (Switzerland), which permits hierarchical, time-aligned transcriptions at up to nine different levels. As one researcher has observed, "...the facility of visually superimposing transcription---and different levels of transcription---...would bring about a near revolution in the process and aims of transcription." We will illustrate Signalyze transcriptions aligned with the corresponding raw signal, narrow- and wide-band spectrograms, and pitch extractions.
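A hierarchical, multi-level alignment of the kind just described can be sketched as below; the tier names, labels, and times are invented for illustration, and this is in no way Signalyze's own file format or API.

```python
# Hedged sketch of a hierarchical, time-aligned transcription: several
# transcription levels ("tiers"), each a list of (label, start, end) spans
# over the same recording. All names and times are invented.

tiers = {
    "orthographic": [("hello", 0.00, 0.55)],
    "phonetic":     [("h", 0.00, 0.08), ("e", 0.08, 0.18),
                     ("l", 0.18, 0.30), ("ou", 0.30, 0.55)],
    "prosodic":     [("H*", 0.30, 0.55)],   # pitch accent on second syllable
}

def slice_at(tiers, t):
    """Labels from every transcription level that overlap time t,
    i.e. a vertical 'cut' through the aligned tiers."""
    return {name: [lab for lab, s, e in segs if s <= t < e]
            for name, segs in tiers.items()}

assert slice_at(tiers, 0.40) == {
    "orthographic": ["hello"], "phonetic": ["ou"], "prosodic": ["H*"]}
```

The payoff of such superimposed levels is precisely the one the quoted researcher anticipates: any instant in the signal can be read simultaneously at every level of transcription.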
The posters will also illustrate the output of programs we have used to do some sample tagging of the written corpus. The programs were developed by the TOSCA (TOols for Syntactic Corpus Analysis) Research Group at the University of Nijmegen (the Netherlands), and the TOSCA group has adapted them for the ICE corpora. Because an important aspect of tagging is evaluation of tagger output, we will illustrate some of the results of programs we have developed for making objective comparisons of the output of different taggers run over the same samples, and for comparing the output of a particular tagger with an analyst's preferred tags. Finally, we will exemplify the special problems posed by attempts to tag and parse unedited texts drawn from the highly diversified ICE categories, with emphasis, once again, on the particular problems posed by unplanned spoken language.
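The simplest objective comparison of this kind is token-level agreement, computed the same way whether the second sequence comes from another tagger or from the analyst's preferred tags; the tags below are illustrative and not the TOSCA tagset.

```python
# Sketch of token-level tagger comparison: agreement between two taggers,
# and accuracy against an analyst's preferred tags. Tags are illustrative.

def agreement(tags_a, tags_b):
    """Proportion of tokens on which two tag sequences agree."""
    assert len(tags_a) == len(tags_b), "taggers must cover the same tokens"
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

gold     = ["PRON", "VERB", "NOUN", "NOUN"]   # analyst's preferred tags
tagger_1 = ["PRON", "VERB", "ADJ",  "NOUN"]
tagger_2 = ["PRON", "NOUN", "NOUN", "NOUN"]

print(agreement(tagger_1, tagger_2))  # inter-tagger agreement: 0.5
print(agreement(tagger_1, gold))      # tagger 1 accuracy: 0.75
print(agreement(tagger_2, gold))      # tagger 2 accuracy: 0.75
```

A comparison program of this sort also makes it easy to list the tokens on which taggers disagree, which is where analyst attention is most profitably spent.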
We will conclude with a discussion of the various formats in which it would be most useful to distribute the ICE-Canada corpus and the potential for comparing the ICE-Canada corpus with the other ICE corpora. There are now wide-ranging commercial applications of automated techniques for language analysis, each of which depends crucially on well-annotated corpora of both written and spoken language. A well-annotated ICE-Canada corpus should therefore have a direct impact on the development of robust, usable systems for electronically processing Canadian English while its comparison with the other ICE corpora should contribute to more accurate and complete grammars of English worldwide.
Hosted at Queen's University
Kingston, Ontario, Canada
June 3, 1997 - June 7, 1997
Conference website: https://web.archive.org/web/20010105065100/http://www.cs.queensu.ca/achallc97/