Phonetic Access in OED2 on CD-ROM

Wlodzimierz Sobkowiak

    Adam Mickiewicz University


1. Introduction
It is widely conceded that "The Oxford English Dictionary (OED) is a unique reference work" [3]. To many, its second electronic edition (OED2 on CD-ROM) is the ultimate in machine-readable lexicography, which has now produced a plethora of machine-readable dictionaries (MRDs) of all imaginable types and sizes, widely differing in structure and function. As a universal dictionary of English of mammoth proportions (290,500 head entries, 616,500 word forms, two-and-a-half million quotations) and enormous influence, OED2 on CD-ROM is of acute interest to researchers in many related fields: lexicography and lexicology (naturally), English philology (both historical and synchronic) and cultural studies, English as a Foreign Language (EFL) methodology, information science (DBMS, language engineering, expert systems), and others. In this paper I briefly touch upon a tiny segment of this huge area, that of lexicographic phonetics or, to be a little more specific, the issue of access to the representation of English pronunciation in the form of phonetic transcription fields in OED2 on CD-ROM. Most of my findings and suggestions are readily applicable to other MRDs containing phonetic information.
The literature on the subject is rather limited: little serious research has been done into phonetic representation in traditional dictionaries ([1], [4], [5], [7], [9], [11], [12], [14], [15], [25], [26]), even less into phonetic aspects of electronic lexicography ([2], [6], [8], [10], [16], [27], [28]), and none at all into OED2 on CD-ROM phonetics. And yet, as I have been trying to demonstrate in a series of papers (see Bibliography), this is an extremely interesting and worthy topic for (English) philologists, linguists, phoneticians, language teachers and MRD enthusiasts at all levels of expertise and sophistication. In this brief review-like treatment I will be able to look only at some selected essential issues, all of which warrant a much more extended discussion. More technical problems having some effect on OED2 on CD-ROM's phonetic access (e.g. font installation or IPA character alphabetization) will be ignored.
2. Phonetic access
OED2 on CD-ROM is one of the very few widely available MRDs allowing phonetic access, i.e. searching the contents of the dictionary through phonetic representations of its entries. For a number of years now I have been advocating the introduction of this function into EFL MRDs especially, pointing to its enormous potential in this glottodidactic context (see [17], [18], [19], [20], [21]). While OED is seldom used in an EFL class (mostly due to its cost and superfluity), phonetic access is an indispensable MRD feature to practicing phoneticians, both theoretical and applied. The former can empirically check some of their models against the entirety of English (surface) lexical phonetics; the latter can generate phonetically streamlined lists of vocabulary for materials development, remedial phonetic training, EFL teaching, etc. Let us look at some problems of phonetic access in OED2 on CD-ROM.
I will skip the perplexing issue of phonological consistency in MRDs in general, and in OED2 on CD-ROM in particular, which I analysed in some detail in my 1997 paper [22]. Briefly, such consistency will normally result from conformity of dictionary phonetic representations with the established phonological rules of the language, so that in equivalent phonological contexts the same (sub)strings of phonemes are transcribed identically. Compared to, for example, Wells's Longman Pronunciation Dictionary ([29]), OED has, in terms of consistency, the indelible birth effect in that it "is the result of the cumulative work of six different editors and many staff over a period of more than a hundred years. Some irregularities of style [...] are inevitable in such a work, and search strategies used with the electronic version should always take this into account" ([13] p. 8).
However, there are also other problems of phonetic access which are readily apparent to a less casual phonetically-minded user of OED2 on CD-ROM. In my experience, among the most annoying are: (a) length- and stress-marks, (b) brackets, (c) digraphs, (d) ASCII IPA substitutes.
2.1. Length- and stress-marks
For OED2 on CD-ROM to function as a respectable phonetic resource it must observe certain traditional constraints on phonetic transcription as well as some commonsensical guidelines of user-friendly interface. The use which is made of the colon (:), the quotation mark (") and the percent character (%) (see note 1) in the function of the length-mark, primary and secondary stress, respectively, sins against both the constraints and the guidelines. Against the former, because transcriptional bi-uniqueness is compromised; against the latter, mostly because it leads to confusion.
OED2 on CD-ROM ignores the three characters for purposes of phonetic searches (both in extended and exact mode; see note 2). This means that the yield of such a query is identical whether or not these characters were typed in. In single-word phonetic look-up the consequences are that the user may retrieve (much) more than s/he asked for: long-vowel equivalents of the word entered with a short vowel, or vice versa. As the colon-contrasted pairs of vowels always come from two different languages, as seen in Table 1, the superfluous yield will be doubly confusing. For example, looking up /si:/ will retrieve comme ci, comme ça, of all entries! Looking up /si/ (with the proper French /i/) will give the same result.
Table 1. Selected vowel transcriptions in OED2 on CD-ROM
(based on Appendix A1.2.2 of the Manual; pp. 100-103)

[Table body not recoverable from this copy; the only surviving column heading is "long (with :)".]

Ignoring stress-marks may be advisable, of course, as (with the exception of word-initial stress) it is often unclear where exactly in the string of phonemes the stress-mark should appear, so the hit rate would drop dramatically if software insisted on absolute precision here. But the side-effect is that looking up /ˈrəʊmæns/, for example, will retrieve /rəʊˈmæns/.
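The effect of ignoring the three marks can be illustrated with a minimal sketch. It is not the OED2 software's actual algorithm, which is not documented here; it only assumes that the colon, quotation mark and percent sign are stripped before comparison. The function names and the two-entry index are hypothetical, chosen to mirror the /si:/ versus /si/ example.

```python
# Characters reported in the text as ignored by phonetic searches:
# colon (length), double quote (primary stress), percent (secondary stress).
IGNORED = str.maketrans("", "", ':"%')


def normalize(transcription):
    """Strip the ignored marks, mimicking the behaviour described above."""
    return transcription.translate(IGNORED)


def lookup(query, entries, strict=False):
    """Return headwords whose transcription equals the query.

    strict=False mimics the reported behaviour (marks ignored);
    strict=True treats the marks as ordinary characters.
    """
    key = (lambda s: s) if strict else normalize
    return [headword for headword, transcription in entries
            if key(transcription) == key(query)]


# Hypothetical two-entry index; headwords and codes are illustrative only.
ENTRIES = [("sea", "si:"), ("ci", "si")]

print(lookup("si:", ENTRIES))               # both entries: the colon is ignored
print(lookup("si:", ENTRIES, strict=True))  # only the long-vowel entry
```

With the marks ignored, the long and the short query are indistinguishable; a strict mode of the kind the Manual promises for exact searches would keep them apart, at the cost of the lower hit rate for stress-marks noted above.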
In list generation, of particular use in advanced phonetic searches, the treatment of the three diacritics in OED2 on CD-ROM may invalidate the phonetic access function completely. No searches are practical for words containing a particular vowel if it belongs to the colon-contrasted set (see note 3); long vowels as such cannot be searched directly; nor can, for example, fore-stressed words. All these problems extend to the built-in query language as well, where potentially more complex Boolean queries could be formulated than in the interactive mode. This treatment of length- and stress-marks is all the more surprising in view of the fact that the tilde, which symbolizes nasality (and as such would seem to be of less importance in a dictionary of English), behaves as a proper character for purposes of phonetic searching, so that, for example, /*E ~*/ will retrieve only French loanwords with no admixture of English words with this (oral) vowel. Clearly, this aspect of phonetic access in OED2 on CD-ROM must be reconsidered.
2.2. Brackets
Brackets are used in the phonetics list in their customary function, i.e. to enclose optional elements, mostly pre-sonorant schwa and word-final ('linking') (r). Like the three preceding characters, however, they are also ignored in searches. This has a number of disadvantageous consequences. Typing both /bUtWn/ and /bUt(W)n/ retrieves button, for example, but typing /bUtn/ results in a miss. This is highly phonetically counterintuitive, as it is the last form which is the most common in natural English (see [22] for more discussion of sonorant syllabicity in MRDs). The linking-r problem in simple look-up is less acute because by the time we reach /-(r)/ (which can be substituted with a single /-R/ symbol for convenience) the software dynamically retrieves the unique string.
Ignoring brackets in wildcard queries, however, means that no searches are possible of words with syllabic sonorants, linking-r or other optional elements. Entering /*(r)/ or /*(W)?/ will not generate expected lists because the former string is reduced to /*/, which means: any (single) word, and the latter retrieves both /-(W)?/ and /-W?/ entries. Notice, incidentally, that (r) is treated as a single character by the software, but (W) counts for three characters (and, e.g. backspaces accordingly).
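One way in which brackets could be made meaningful for searching is sketched below, as an illustration only and not as a description of how OED2 on CD-ROM works. The sketch assumes that a bracketed element marks an optional segment and expands each transcription into all its bracket-free variants, so that a query without the optional element also scores a hit. The function names are hypothetical; the strings are the button examples quoted above.

```python
import re
from itertools import product


def expand_optional(transcription):
    """Return the set of variants with each bracketed element kept or dropped."""
    # re.split with one capture group alternates fixed text (even indices)
    # and the contents of each bracket pair (odd indices).
    parts = re.split(r"\(([^()]*)\)", transcription)
    choices = [[part] if i % 2 == 0 else ["", part]
               for i, part in enumerate(parts)]
    return {"".join(combo) for combo in product(*choices)}


def matches(query, transcription):
    """True if the bracket-free query equals any variant of the entry."""
    return query in expand_optional(transcription)


def has_optional(transcription, element):
    """True if the entry marks the given element as optional, e.g. (r)."""
    return "(" + element + ")" in transcription


print(expand_optional("bUt(W)n"))
print(matches("bUtn", "bUt(W)n"))
print(has_optional("bUt(W)n", "W"))
```

Under this assumption both /bUtWn/ and /bUtn/ would retrieve the entry, and a list of all entries with a given optional element could be generated by testing for the bracketed string itself.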
2.3. Digraphs
Another problem in OED2 on CD-ROM phonetic access is the treatment of digraphs, i.e. all diphthongs and both palato-alveolar affricates (/tS/ and /dJ/). These are coded as two independent characters, which is again less problematic in the single-word look-up mode than in interactive wildcard searching. In the latter, the bi-phonemic treatment of these strings means, for example, that no searches are feasible for words containing any single vowel which is part of a diphthong: /e, ç, a, O, W, ï, E/. Indeed, combined with the colon indeterminacy mentioned above, this means that a search for /*O*/ words, for example, will generate words with /Oç/, /O:/, French /O/ and French /O$/. And looking for words containing three /S/'s, for example, we will find both sheshbesh and cha-cha-cha. Excluding the superfluous yield is only possible through OED2 on CD-ROM's query language, which allows Boolean NOT, but presents problems of its own, which I will ignore here.
Quite apart from the question of the strictly procedural search-expediency of OED2's bi-phonemic approach, there is the issue of its theoretical validity, which is a difficult problem I will not discuss here (see note 4). Suffice it to say that phoneticians are divided in their opinions.
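The difference between character-level and unit-level counting can be shown with a short sketch. It assumes a greedy left-to-right tokenizer over a small, hypothetical inventory of two-character units (only the two affricates named above; diphthongs could be added in the same way). The transcriptions of sheshbesh and cha-cha-cha are simplified illustrations, not the dictionary's actual codes.

```python
# Hypothetical inventory of two-character units treated as single segments.
DIGRAPHS = ("tS", "dJ")


def tokenize(transcription):
    """Split a transcription into segments, preferring listed digraphs."""
    tokens, i = [], 0
    while i < len(transcription):
        pair = transcription[i:i + 2]
        if pair in DIGRAPHS:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(transcription[i])
            i += 1
    return tokens


def count_unit(transcription, unit):
    """Count occurrences of a segment after digraph-aware tokenization."""
    return tokenize(transcription).count(unit)


# Illustrative, simplified transcriptions (assumptions, not OED2 data).
SHESHBESH = "SESbES"
CHA_CHA_CHA = "tSAtSAtSA"

print(SHESHBESH.count("S"), CHA_CHA_CHA.count("S"))            # character level
print(count_unit(SHESHBESH, "S"), count_unit(CHA_CHA_CHA, "S"))  # segment level
```

At character level both words contain three S's; at segment level only the first does. A greedy tokenizer of this kind would still mis-segment a /t/ followed by /S/ across a morpheme boundary, so it side-steps rather than settles the theoretical question.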
2.4. ASCII IPA substitutes
Finally, as I noted in [21], the choice of ASCII substitutes for some of the IPA vowel characters (which the user types in directly from the keyboard) is far from mnemonic or intuitive. Two of the worst excesses here are listed in the bottom section of Table 1. These choices are wildly idiosyncratic to OED2 on CD-ROM, and doubtless follow from the questionable tactics of coding some French and/or German vowels with ASCII signs customarily reserved for English: i, e, a, o, u, A, O. This may be a problem both in keyboard entry and in reading query results converted to ASCII files. Needless to say, OED2's ASCII substitutes do not conform to the by-now de facto standard of SAMPA (see [27], [28] and [30]; also [31]), which crystallized after this edition of OED was published.
3. Conclusion
As I argued elsewhere [22], [24], electronic lexicography not only opens new horizons in humanities computing, but also creates new problems. Most of the issues which I raised in this paper, for example, simply did not exist in the era of traditional paper-based lexicography, but will tend to grow in importance in proportion to the status of MRDs on the market of lingware. OED2 on CD-ROM will have to follow the user-interface standards of its much younger interactive hypermedia cousins if it wants to sustain its by now proverbial status.
Notes
1 The latter two are not even listed in the Pronunciation and phonetics appendix of the Manual.
2 This runs counter to the Manual's claim (p. 24) that "With exact mode selected, the character string you type will be treated as an exact match, i.e. case sensitive and with hyphens, accents, and special characters exactly as you type them".
3 Most entries generated by the search for words with German /E:/, for example, are English /E/-full words.
4 For example, if diphthongs are bi-phonemic, what is the composition of /açW/: /a+çW/, /aç+W/ or /a+ç+W/?
Bibliography
1. Abercrombie,D. 1978. "The indication of pronunciation in reference books". In P.Strevens (ed.). 1978. In honour of A.S.Hornby. London: Oxford University Press. pp. 119-126.
2. Alshawi,H., H.B.Boguraev & D.Carter. 1989. "Placing the dictionary on-line". In H.B.Boguraev & T.Briscoe (eds). 1989. Computational lexicography for natural language processing. London: Longman. pp. 41-63.
3. Berg,D.L. 1991. A user's guide to the Oxford English Dictionary. Oxford: Oxford University Press.
4. Brazil, D. 1987. "Representing pronunciation", in J.M.Sinclair (ed.). 1987. Looking up. London: Collins. pp. 160-166.
5. Chevillet,F. 1993. "Étude de transcriptions phonetiques des editions de l'Oxford English Dictionary". Études Anglaises 46.3 pp. 313-327.
6. Esling,J. 1990. "Computer coding of the IPA: supplementary report". Journal of the IPA 20.1 pp. 22-26.
7. Gimson,A.C. 1981. "Pronunciation in EFL dictionaries". Applied Linguistics 2.3 pp. 250-262.
8. Jassem,W. & P.Lobacz. 1989. "IPA phonemic transcription using an IBM PC and compatibles". Journal of the IPA 19.1 pp. 16-23.
9. Kretzschmar,W.A. Jr. 1994. "The Oxford Concise Dictionary of Pronunciation". Rask 1 pp. 83-93.
10. Kröger,P.R. 1981. "A phonetic orthography for computer applications". Notes on Linguistics 18. pp. 18-23.
11. MacMahon,M.K.C. 1985. "James Murray and the phonetic notation in the New English Dictionary". Transactions of the Philological Society 1985. pp. 72-112.
12. Magay,T. 1979. "Problems of indicating pronunciation in bilingual dictionaries with English as a source language". ITL, Review of Applied Linguistics 45/46. pp. 98-103.
13. Manual for the OED2 on CD-ROM. 1992. Oxford: Oxford University Press.
14. Paikeday,T.M. 1993. "Who needs IPA?" English Today 9.1 pp. 38-42.
15. Piotrowski,T. 1987. "Indication of English pronunciation in bilingual dictionaries". Applied Linguistics 8.1 pp. 40-47.
16. Shelden,H. 1982. "Comments on a phonetic orthography for computers". Notes on Linguistics 22. pp. 17-18.
17. Sobkowiak,W. 1994a. "Phonetic Access Dictionaries in EFL: from vision to project". Nordlyd 21. pp. 33-41.
18. Sobkowiak,W. 1994b. "Phonetic-access dictionaries with L1-based simplified transcription". Poster presented at the 6th EURALEX Congress, Amsterdam, 30 August - 3 September 1994.
19. Sobkowiak,W. 1994c. "Beyond the year 2000: phonetic access dictionaries (with word-frequency information) in EFL". System 22.4 pp. 509-523. [one-page abstract in Cambridge Language Reference News].
20. Sobkowiak,W. 1996a. "EFL Wordstation". In A.Lindebjerg, E.S.Ore & Ø.Reigem (eds). 1996. ALLC-ACH '96 Conference Abstracts. Bergen: Norwegian Computing Centre for the Humanities. pp. 243-246. Also in W.Skrzypczak (ed.). 1996. New technologies in language education. Toruń: Department of English, Nicholas Copernicus University.
21. Sobkowiak,W. 1996b. "Phonetic transcription in machine-readable dictionaries". In M.Gellerstam et al. (eds). 1996. EURALEX '96 Proceedings. Göteborg: Göteborg University, Department of Swedish. pp. 181-188.
22. Sobkowiak,W. 1997. "Consistency in EFL dictionary phonetics". In E.Waniek-Klimczak (ed.). 1997. Teaching English phonetics and phonology II. Accents '97. Łódź: Wydawnictwo Uniwersytetu Łódzkiego. pp. 95-102.
23. Sobkowiak,W. (forthcoming a). "Can EFL MRDs teach pronunciation?". Paper submitted to the 8th EURALEX Congress, Liège, Belgium, 4-8 August 1998.
24. Sobkowiak,W. (forthcoming b). "When dictionaries talk: pronunciation in EFL MM MRDs". Paper submitted to the World Conference on Educational Multimedia and Hypermedia, Freiburg, Germany, 20-25 June 1998.
25. Tench,P. 1992. "Phonetic symbols in the dictionary and in the classroom". In A.Brown (ed.). 1992. Approaches to pronunciation teaching. London: Macmillan. pp. 90-102.
26. Wells,J.C. 1985. "English pronunciation and its dictionary representation". In R.Ilson (ed.) Dictionaries, lexicography and language learning. Oxford: Pergamon Press. pp. 45-51.
27. Wells,J.C. 1987. "Computer-coded phonetic transcription". Journal of the IPA 17.2 pp. 94-114.
28. Wells,J.C. 1989. "Computer-coded phonemic notation of individual languages of the European Community". Journal of the IPA 19.1 pp. 31-54.
29. Wells,J.C. 1990. Longman Pronunciation dictionary. London: Longman.
30. Wells, J.C. et al. 1992. "Standardized Computer-Compatible Transcription", Esprit Project 2589 (SAM), Doc. no. SAM-UCL-037. London, Department of Phonetics & Linguistics, UCL.

Conference Info

In review

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC