Linguistic Corpus Construction and Analysis Before and After the IT Revolution: The Newcastle Electronic Corpus of Tyneside English in the 1960s and Now

Hermann Moisl

Newcastle University


URL: http://www.ncl.ac.uk/necte
The theme of this conference is the impact of IT generally, and of the Web in particular, on
humanities research. The Newcastle Electronic Corpus of Tyneside English (NECTE) project is an ideal case
study on this theme. It is, in large part, based on the Tyneside Linguistic Survey (TLS), which, in the decade
1965-1975, attempted to construct an electronic corpus of the distinctive ‘Geordie’ dialect spoken in
north-eastern England, and to analyze it computationally. That attempt failed, but, because of its
importance—both historically and to the cultural identity of the region—we have received research council
funding to salvage the original TLS materials, amalgamate them with a more recent dialect survey, and
produce a state of the art web-based electronic resource of Tyneside English. We are, therefore, in an
excellent position to assess the impact of the IT revolution on corpus construction and analysis.
HISTORY
The NECTE project amalgamates two separate corpora of recorded speech, one of them collected in the late
1960s as part of the TLS project (cf. Strang 1968), and the other in 1994 (cf. Milroy et al. 1997). The TLS
material is the object of interest here. It originally consisted of audiotaped interviews with 100 informants drawn from a
stratified random sample of Gateshead in North-East England. Many, but not all, of the interviews were
orthographically and phonetically transcribed. These transcriptions were then electronically encoded, and in
that form were the basis for subsequent computational analysis. Although several publications emerged from
the corpus in the interim, there was no further work on the corpus until 1995, when it was properly archived.
In 2001 we were awarded a substantial research grant to produce an enhanced electronic corpus resource from
a combination of the TLS and the 1994 collections. This resource will eventually be made available to the
research community in a variety of formats: digitized sound, phonetic transcription, and standard orthographic
transcription, all aligned and available on the Web. This process is now well advanced.
ANALYSIS
In the late 1960s the TLS research team pioneered a methodological approach that is still radical today. In
contrast to the theory-driven methodology, which was universal in sociolinguistic accounts of the
1960s and 1970s and which still predominates, the TLS proposed a fundamentally empirical approach in
which salient factors are extracted from the data itself, and then serve as the basis for model construction.
Unlike the Labovian paradigm, therefore, social and linguistic factors were never selected by the analyst on
the basis of a predefined model of either language or society.
To this end, the project created an electronic corpus from a subset of the data and applied cluster
analysis to it. Interesting preliminary results were published, but nothing was done thereafter: the TLS never
completed its research program, and consequently failed to make a significant impact on the research
community. This failure is, we feel, primarily due to implementation issues in general and more specifically
to the inadequate computational / IT resources available at the time. After 30 or so years the issues which
proved so intractable to the TLS research group have been resolved, and it is now possible both to bring the
TLS agenda to completion, and also to augment it in important ways.
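As a rough illustration of the kind of cluster analysis referred to above, the following Python sketch applies average-linkage agglomerative clustering to a handful of speakers. It is not the TLS procedure, whose details are not given here; the speaker labels, the three-variant frequency vectors, and the function names are all assumptions made for the example.

```python
from itertools import combinations
from math import dist

# Hypothetical data: each speaker is a vector of relative frequencies
# of three phonetic variants. The values are invented for illustration
# and are not drawn from the TLS or NECTE materials.
speakers = {
    "spk_A": (0.80, 0.15, 0.05),
    "spk_B": (0.75, 0.20, 0.05),
    "spk_C": (0.10, 0.30, 0.60),
    "spk_D": (0.20, 0.25, 0.55),
    "spk_E": (0.45, 0.45, 0.10),
}

def average_linkage(a, b, data):
    """Mean Euclidean distance over all cross-cluster speaker pairs."""
    return sum(dist(data[x], data[y]) for x in a for y in b) / (len(a) * len(b))

def agglomerate(data):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in sorted(data)]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], data),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        history.append((a, b, average_linkage(a, b, data)))
    return history

merges = agglomerate(speakers)
for left, right, height in merges:
    print(f"{height:.3f}  {left} + {right}")
```

The merge heights printed at each step are the values a dendrogram would plot on its vertical axis. Because the groups emerge from the distances alone, no social or linguistic category has to be chosen in advance, which is the point of the empirical approach described above.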
In what follows, we first identify the computational / IT factors that confronted the TLS project, then
describe how these have since been resolved, and finally draw some general conclusions as to how the IT
revolution has impacted upon the construction and analysis of linguistic corpora.
PROBLEMS
Hardware
The rudimentary computational hardware of the late 1960s and early 1970s seriously hampered what could be
achieved:
• Data input using punched cards was slow and error-prone, and error correction via the same
medium was unreliable. Creation of electronic files from what is, by contemporary standards,
a moderate-sized corpus was a major undertaking, and only slightly over half the material
was ever digitised as a result.
• Limited memory and low processing speed prevented some tasks being done within a
reasonable time, or indeed at all. Cluster analysis of the TLS electronic files had to be done in
several stages because the entire data set could not be held in memory, and this had
significant consequences for the usability of the results.
• Output was entirely text-oriented; there were no graphics facilities. Direct visualization of
analytical results, such as cluster dendrograms, was consequently difficult.
Software
• Operating systems of the period handled basic file I/O, but very little application software
was available. The TLS project team had to write most of its own processing software.
• The available character set was restricted to the standard upper and lower case letters,
numerical digits, and punctuation marks. Fonts for IPA symbols and the elaborate system of
diacritics which TLS used for fine phonetic distinctions were unavailable and so these could
not be directly visualized; the project had to work instead with numeric codes for these
symbols.
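To make the numeric-code workaround concrete, the following Python sketch maps a few codes to the Unicode characters that would now be used directly. The code numbers are hypothetical, since the actual TLS coding scheme is not described here; only the Unicode code points and character names are standard.

```python
import unicodedata

# Hypothetical numeric codes: the real TLS coding scheme is not given
# in this text, so these numbers are invented for illustration.
CODE_TO_IPA = {
    101: "\u0259",  # LATIN SMALL LETTER SCHWA
    102: "\u028a",  # LATIN SMALL LETTER UPSILON
    103: "\u0254",  # LATIN SMALL LETTER OPEN O
    201: "\u0303",  # COMBINING TILDE (nasalization diacritic)
    202: "\u02d0",  # MODIFIER LETTER TRIANGULAR COLON (length mark)
}

def decode(codes):
    """Turn a sequence of numeric codes into a normalized Unicode string."""
    text = "".join(CODE_TO_IPA[c] for c in codes)
    # NFC normalization gives a canonical form for base + diacritic sequences.
    return unicodedata.normalize("NFC", text)

segment = decode([103, 201, 202])
print(segment)
for ch in segment:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

A table of this kind is one plausible way to convert legacy numeric encodings into displayable phonetic text, with diacritics handled as combining characters that attach to the preceding base symbol.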
Publication
The above hardware and software factors were inconveniences that could be, and to some extent were,
overcome by TLS. Publication of the corpus itself and of analytical results was, on the other hand, a genuine
impediment:
• Lack of portability: The absence of any generally accepted standards for electronic corpus
construction and data encoding combined with the fact that programs for processing the data
were project-specific meant that access to the material was not easy for other researchers.
• Lack of a convenient delivery medium: The only way to transfer data from one site to another
was physically to carry digital tape from place to place. There is no indication in the project
papers of any awareness that it might be possible and, indeed, beneficial from a preservation
perspective, for the electronic corpus to be provided to other researchers.
RESOLUTIONS
Thirty or so years on, none of the above constraints apply:
• Hardware: Memory size and processing power have progressed enormously since the TLS
era, and neither is now an issue for corpus linguists. There is no longer a significant limit on
corpus size or on the amount of data that can be simultaneously processed in numerical
analysis. Hardware improvements have also made it possible to develop and implement
computationally more demanding cluster analysis algorithms, such as self-organizing maps,
which could not have run in a reasonable time earlier on. High resolution graphics now make
visual display of phonetic symbols and cluster diagrams straightforward.
• Software: There is now a very wide range of ready-made software for statistical and cluster
analysis as well as for ancillary purposes, so it is rarely necessary to write bespoke software,
and there are specialty fonts for the display of phonetic symbols.
• Publication: There are now standards for character sets (Unicode) and document structuring
(XML) which make corpora that adopt these standards portable and directly usable by other
researchers. The connectivity of the Internet, together with the pervasiveness and accessibility
of the Web, has opened up new possibilities for corpus publication.
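The portability gained from Unicode and XML can be sketched with Python's standard library. The element names, attribute names, time offsets, and transcription below are assumptions for the example; they do not reproduce the actual NECTE markup, which is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one utterance with time offsets into the audio,
# an orthographic tier, and a phonetic tier. None of these names or
# values are taken from the NECTE encoding scheme.
utterance = ET.Element(
    "u", {"who": "informant01", "start": "12.40", "end": "13.05"}
)
ET.SubElement(utterance, "orth").text = "going home"
# Illustrative IPA string, not a transcription from the corpus.
ET.SubElement(utterance, "phon").text = "\u0261an hj\u025bm"

# Serialize to UTF-8 bytes, the form in which the file would be shared.
xml_bytes = ET.tostring(utterance, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Any XML-aware tool can read the bytes back without project-specific code.
roundtrip = ET.fromstring(xml_bytes)
print(roundtrip.get("who"), roundtrip.find("phon").text)
```

Because both the character encoding and the document structure follow public standards, another researcher needs only a generic XML parser to recover the aligned tiers, in contrast to the project-specific programs on which the TLS data depended.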
CONCLUSIONS
It is clear that, in attempting to implement its radical research agenda, the TLS was well ahead of its time and
bit off more than it could technologically chew. Substantially more powerful hardware and software were a
precondition for success, and this is what the past two decades or so have provided: it is now possible to
construct and analyze electronic corpora in the manner in which the TLS project intended, and to publish them as
a resource for the research community in a way that its members did not, and could not, conceive.
This conclusion does not, however, exhaust the impact of the IT revolution on corpus linguistics, as
we have found in our revision of the TLS. There is a second main factor, and it is sociological rather than
technological. The TLS worked largely in isolation from the rest of its research community, and feedback on
output was determined by publishers’ schedules. By contrast, we are part of a global research community that
is, in principle, in constant and instantaneous communication via email and the Web. In this community,
information can be shared, draft research output informally peer reviewed, and projects monitored as they
proceed, all effectively online. Any research group that participates in this community can bring the current
state of discipline knowledge to bear on its work, with consequent benefits. For us, this is a major advantage
over the originators of the TLS, and one which, we hope, will allow us to do justice to their efforts.
REFERENCES
Jones, V. (1985) ‘Tyneside syntax: A presentation of some data from the Tyneside Linguistic Survey’, in
Viereck, W. (ed.) Focus on England and Wales, pp.163–177. Amsterdam: John Benjamins.
Labov, W. (1972) Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press.
Local, J.K., Kelly, J. and Wells, W.H.G. (1986) ‘Towards a phonology of conversation turn-taking in
Tyneside’, Journal of Linguistics, 22: 411–437.
Milroy, L., Milroy, J. and Docherty, G. (1997) Phonological Variation and Change in Contemporary Spoken
British English, Unpublished Final Report to the UK ESRC, grant no. R00234892.
Pellowe, J. et al. (1972) ‘A dynamic modelling of linguistic variation: the urban (Tyneside) linguistic survey’,
Lingua, 30: 1–30.
Pellowe, J., and Jones, V. (1978) ‘On intonational variety in Tyneside speech’, in Trudgill, P. (ed.)
Sociolinguistic Patterns of British English, pp.101–121. London: Arnold.
Strang, B.M.H. (1968) ‘The Tyneside Linguistic Survey’, Zeitschrift für Mundartforschung, NF 4
(Verhandlungen des Zweiten Internationalen Dialektologenkongresses), pp.788–794. Wiesbaden:
Franz Steiner Verlag.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info


ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003

"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003


Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC
