The Respective Roles of Objective Encoding and Critical Tools as Structural Interpreters: the Case of Early Dictionaries

  1. Russon Wooldridge

    University of Toronto

  2. Isabelle Leroy-Turcan

    Université de Lyon III

The term "early dictionaries" is taken here to designate dictionaries whose text is respected, and is not open to revision, when converted to electronic form; modern dictionaries, by contrast, may be revised when put in electronic form (cf. Oxford English Dictionary). The structure of a print dictionary, whether early or modern, is never perfectly realized: the text contains ambiguities and instances of dysfunction. In this regard the difference between an early and a modern dictionary is one of degree. Objective encoding is here taken to mean encoding of the typographical form of the text.
From its beginnings the monolingual lexicography of modern languages has been characterized by an alphabetical (sometimes derivational) macrostructure and a microstructure containing headword, part of speech, usage labels, definitions and examples (other types of information, such as etymology, are less common). There have always been typographical, formal and positional conventions to distinguish the various informational fields of the dictionary microstructure: headword at the beginning in capitals and/or bolded, part of speech immediately after the headword in abbreviated form, field labels abbreviated and in a distinctive typeface, definitions in roman typeface after part of speech (and field label), examples in italics after a definition. These conventions are, however, imperfectly realized. A good example of this situation in French lexicography can be observed, over a span of three centuries (17th-20th), in the various editions of the pivotal Dictionnaire de l'Académie française.
A question that has been addressed in the international project to computerize the eight complete editions of the Dictionnaire de l'Académie [6, 9], and the principal one that will be discussed at the ALLC-ACH conference, is the degree to which the typographical tagging of the text, done at the time of text capture, can be used to predict and retrieve the various information fields.
The typographical encoding of the text (SGML tags) covers the following: large capitals, small capitals, page, column and paragraph, italics (and thus by default roman). The information fields of the dictionary text use typographical, formal and positional conventions, which are, broadly speaking, those described in the second paragraph above. The degree of systematicity and of univocity varies. Headwords are always in large capitals at the beginning of a paragraph (systematicity) and all paragraph-initial large capital sequences are headwords (univocity). Part-of-speech indications and usage labels, on the other hand, are sometimes not abbreviated, and both, as well as definitions, are printed in roman typeface, so that typography alone does not separate these three fields [7].
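The headword rule stated above (a large capital sequence at the beginning of a paragraph) can be illustrated with a minimal Python sketch. The tag names `<p>`, `<lc>` and `<i>` are hypothetical stand-ins, not the project's actual SGML tag set, and the two sample entries are invented for illustration, not quoted from any edition of the dictionary.

```python
import re

# Invented sample text with HYPOTHETICAL tag names (assumptions, not the
# Académie Project's real markup): <p> paragraph, <lc> large capitals,
# <i> italics. Untagged text is roman by default.
SAMPLE = (
    "<p><lc>DANSE</lc>. s. f. Mouvement du corps. <i>Danse noble.</i></p>\n"
    "<p><lc>DANSER</lc>. v. n. Mouvoir le corps en cadence. "
    "<i>Il danse bien.</i> Voyez <lc>BAL</lc>.</p>\n"
)

# A headword is a large capital sequence at the very start of a paragraph.
HEADWORD = re.compile(r"<p>\s*<lc>(.*?)</lc>")


def headwords(tagged_text):
    """Return every paragraph-initial large capital sequence, in text order."""
    return HEADWORD.findall(tagged_text)


if __name__ == "__main__":
    print(headwords(SAMPLE))
```

In this sketch the large capital sequence inside the second paragraph (`BAL`) is not reported, since only the paragraph-initial position is treated as marking a headword.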
Two complementary considerations condition the analysis of the question of tagging: cost and user-friendliness. These lead to the important distinction between pre-tagging and post-tagging. Content tagging (of dictionary information fields in the present instance) is far more costly than formal tagging (typography), since the former involves subjective textual interpretation, whereas the latter is completely objective and thus mechanical. If content tagging forms part of an imposed database (pre-tagging), the database user has to accept, or at least tolerate, the imposed interpretation of the text. On the other hand, "post-tagging" of content by the user, effected by means of search filters, allows for different readings of the text [2]. The user-reader nevertheless still needs some sort of critical guidance to help with an intelligent and pertinent reading/consultation of the text.
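As a rough illustration of post-tagging, the following Python sketch applies two user-side search filters to a formally tagged text, each producing a different reading of the same entry. It is a sketch under stated assumptions, not the project's software: the tag names `<p>`, `<lc>` and `<i>` are hypothetical, the sample entries are invented, and the function names are chosen here for the example.

```python
import re

# Invented sample text with HYPOTHETICAL tag names (assumptions only):
# <p> paragraph, <lc> large capitals, <i> italics; untagged text is roman.
SAMPLE = (
    "<p><lc>DANSE</lc>. s. f. Mouvement du corps. <i>Danse noble.</i></p>\n"
    "<p><lc>DANSER</lc>. v. n. Mouvoir le corps en cadence. "
    "<i>Il danse bien.</i> Voyez <lc>BAL</lc>.</p>\n"
)

ENTRY = re.compile(r"<p>\s*<lc>(.*?)</lc>(.*?)</p>", re.DOTALL)
ITALIC = re.compile(r"<i>(.*?)</i>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def entries(tagged_text):
    """Map each paragraph-initial headword to the rest of its paragraph."""
    return {m.group(1): m.group(2) for m in ENTRY.finditer(tagged_text)}


def candidate_examples(body):
    """Filter 1: read every italic span as a candidate example."""
    return [span.strip() for span in ITALIC.findall(body)]


def non_italic_text(body):
    """Filter 2: keep only the non-italic text, in which part of speech,
    labels and definitions remain undifferentiated."""
    without_italics = ITALIC.sub(" ", body)
    return " ".join(TAG.sub("", without_italics).split())


if __name__ == "__main__":
    for headword, body in entries(SAMPLE).items():
        print(headword, candidate_examples(body), non_italic_text(body))
```

Neither filter alters the stored text. The results are candidates only, which is where critical guidance remains necessary: an italic span need not be an example, and the non-italic text still has to be divided into its fields by the reader.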
The Académie Project has adopted, as a solution to the need for a balanced mixture of objective text and critical apparatus, the principle of analytical tools and a critical database linked to the dictionary database (examples can be seen in [6]). One of the main analytical tools is the index of metalinguistic keywords [8],[10]. Besides giving a scholarly commentary on the genesis, reception and text of the dictionary, the critical database proposes models of the structuring of the dictionary entries as a guide to data retrieval (cf. [7]).
The decision to limit tagging to a representation of the formal text is based on philological and pragmatic criteria. The "tradition", if one can use such a word for a new field of endeavour, of marking up the electronic text of early dictionaries of English is to include the tagging of information fields. The best-known example is Johnson's Dictionary of the English Language on CD-ROM and the WWW; the director of the project, A. McDermott, acknowledges the arbitrary nature of some of the tagging [3],[4]. The situation is the same for I. Lancashire's Early Modern English Dictionaries Corpus [1]. The sole precedent for the Académie Project in French lexicography is the electronic version of J. Nicot's Thresor de la langue françoyse, 1606 [5], which uses critical tools in the place of content tagging. The Académie Project's philological reluctance to tamper with the text is strengthened by the fact that limited funding -- funding is typically harder to obtain than in the English-speaking world -- makes content tagging impractical. The combination of formal tagging, automatic recognition of headwords, metalinguistic keyword index and critical models makes it possible nevertheless to do lookup, full-text and, non-exhaustively, field searches.
The paper, which will be delivered partly in English and partly in French, will address the problems of retrieving, from the database of the first edition of the Dictionnaire de l'Académie française (1694), items belonging to two classes of information: technical information on grammar, etymology and dialectology -- where the actual contents of the dictionary are often at variance with the avowed methodological choices stated in the preface -- and thematic information on the notional fields of dance, Brittany and nautical terms.
