Humanities Computing Unit - Oxford University
BNC: the World Edition, and the BNC Sampler
Humanities Computing Unit Oxford
University of Virginia
What's the plural of "corpus"? In what social situations is "wicked" a term of
approval? Why does it "sound wrong" to say "The good weather set in on Thursday"
although "The bad weather set in on Thursday" is perfectly acceptable? If I can
say "I live a stone's throw away from here", can I also say "I'm going a stone's
throw away from here"?
Large language corpora can help provide answers for these kinds of questions --
if only because they encourage linguists, lexicographers, and all who work with
language to ask them. The purpose of a language corpus is to provide language
workers with evidence of how language is really used, evidence that can then be
used to inform and substantiate individual theories about what words might or
should mean. Traditional grammars and dictionaries tell us what a word "ought to
mean", but only experience can tell us what a word is used to mean. This is why
dictionary publishers, grammar writers, language teachers, and developers of
natural language processing software alike have been turning to corpus evidence
as a means of extending and organizing that experience.
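A minimal sketch of what such corpus evidence can look like in practice is a keyword-in-context (KWIC) listing, which shows every occurrence of a phrase with its surrounding text. The sentences below are invented for illustration and are not taken from the BNC; the function name `kwic` and its `width` parameter are likewise assumptions of this sketch, not part of any tool described here.

```python
import re

# Invented example sentences standing in for corpus text (NOT BNC data).
SAMPLE_TEXT = (
    "The bad weather set in on Thursday. "
    "Winter set in early that year. "
    "By evening a steady drizzle had set in. "
    "She set the table before the guests arrived."
)


def kwic(text, phrase, width=30):
    """Return one keyword-in-context line per occurrence of `phrase`."""
    # Match the phrase as whole words, ignoring case.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


for line in kwic(SAMPLE_TEXT, "set in"):
    print(line)
```

Reading down the left-hand context of such a listing is one way to see what kinds of subject a phrase like "set in" tends to take, which is the sort of question raised at the start of this abstract.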
The British National Corpus (BNC) is a collection of over 4000 different text
samples, of all kinds, both written and spoken, containing in all six and a
quarter million sentences, and over 100 million words of current British
English. Work on building the corpus began in 1991, and was completed in 1994.
In 1997, work on a major revision of the corpus was completed, and in 1998 the
British Government agreed to allow distribution of this revised version.
The BNC World Edition is now available for sale as a set of CD-ROMs
containing the full SGML text of the corpus, together with the software and indexes
needed to search it. It can also be accessed via the BNC Online service provided
by the British Library and managed by the OUCS. In addition, and perhaps of most
general interest, a special purpose "sampler" is now available, containing two
million words selected from the whole corpus, half from spoken and half from
written texts.
In addition to SARA (the SGML Aware Retrieval Application, developed at Oxford to
work with the BNC), the Sampler includes the following state-of-the-art
corpus analysis software tools:
WordSmith, the tool of choice for many corpus linguists: developed at
Liverpool University by Mike Scott, and distributed by OUP (Windows 95
XKwic, the tool of choice for corpus linguists in the Unix environment:
developed at the University of Stuttgart by Uli Heid;
CUE, a new XML-based corpus utility developed at Birmingham University
by Oliver Mason.
The Sampler also comes complete with detailed documentation in HTML format and some
additional data files (notably a selection of treebanked data prepared at
Lancaster University, and some sample digitized audio files).
This presentation will introduce the Sampler and its contents, demonstrating how
the tools provided can be used to cater for a variety of needs, whether of
language teachers and learners or researchers in general. It will focus on uses
made of the SARA system by advanced language learners, and the pedagogic
implications of the learning styles this system encourages.
Hosted at University of Virginia
Charlottesville, Virginia, United States
June 9, 1999 - June 13, 1999
Conference website: http://www2.iath.virginia.edu/ach-allc.99/schedule.html