TOOLS FOR LEXICOGRAPHY, RETRIEVAL, MIDDLE HIGH GERMAN

Frank Queens

Authorship

1. Frank Queens

Universität Trier

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

STARTING POSITION

A new Middle High German Dictionary is being compiled by two teams of lexicographers: one team in Göttingen (funded by the Academy of Science in Göttingen) and a second team in Trier (funded by the Academy of Science and Literature in Mainz). The aim of the project is to develop a new historical citation dictionary of the German language for the period from 1050 to 1350, consisting primarily of a printed version in four volumes of approximately 1000 pages each. In addition, an electronic version is planned that will include comprehensive tools for accessing and searching the dictionary and its sources.

The new dictionary is based on a ‘core corpus’ of 150 source texts, an open ‘extended corpus’ of about 500 sources (chosen so that the composition of the corpus is well balanced with regard to text type, date and location) and a so-called dictionary corpus, which contains a set of dictionaries offering citations that are not covered by the core or extended corpus. The texts belonging to the core corpus have been digitized and lemmatized, and an electronic archive of 1.2 million citations, each linked electronically to the full text of its source, has been built. The extended corpus exists either as non-lemmatized electronic full texts or as printed texts. The Digital Middle High German Text Archive, the subject of the preceding paper, is an essential part of the extended corpus.

Because this is one of the most recent projects in the field of historical citation lexicography in Germany, it was possible from the very beginning to base all work on the new Middle High German Dictionary on electronic data processing. Needless to say, this is a great advantage in the making of a dictionary. But what kind of working environment is in fact needed at a lexicographer’s desk? Which tools meet the requirements of two teams working together on the same material in places that are far apart?
Which essential features serve a print edition as well as an electronic edition without distracting the lexicographer from the actual lexicographical work through the need for technical encoding? How can the longevity of data and data structures be ensured in long-term dictionary projects? To answer these questions, the Trier team of the new Middle High German Dictionary, together with the Competence Center for Electronic Retrieval and Publishing Techniques in the Humanities, applied for a research grant from the Deutsche Forschungsgemeinschaft (DFG). The project Internet-based Working Environment for the Production and Publication of Dictionaries at Distributed Places was approved, and work started in March 2002. The first goal of the project is to set up a working environment for the two teams working in different places. The second goal is to adapt the system to the requirements of other dictionary projects. In this paper the authors present the technical concept and the realization of the system in its present state, and give a brief outlook on future work.

TECHNICAL CONCEPT

The concept is a client-server architecture. At its center is an Internet-compatible relational database system in which all dictionary data are stored (register of headwords, source texts, bibliography of the source texts, dictionary entries, etc.). The working environment itself is installed on the lexicographer’s computer as client software, which contains all the necessary tools and features and also manages the data exchange with the database via the Internet. Through this central database, all dictionary data are available simultaneously at both places where the teams of lexicographers work. An XML export facility forms the interface to different output channels such as typesetting and electronic publication on the WWW. In addition, the XML capability ensures the longevity of the data processed in the project.
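The division of labour described above (a central relational store, with XML export as the interface to the output channels) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the database system or its schema, so an in-memory SQLite database stands in for the central server, and the table, column and element names (`headword`, `entry`, `dictionary`, `lemma`, `definition`) are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: the project's real tables are not described in the text.
SCHEMA = """
CREATE TABLE headword (
    id    INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL UNIQUE
);
CREATE TABLE entry (
    id          INTEGER PRIMARY KEY,
    headword_id INTEGER NOT NULL REFERENCES headword(id),
    definition  TEXT NOT NULL
);
"""


def open_store():
    """Create an in-memory database standing in for the central server."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn, lemma, definition):
    """Register the headword if necessary and attach one entry to it."""
    conn.execute("INSERT OR IGNORE INTO headword (lemma) VALUES (?)", (lemma,))
    (headword_id,) = conn.execute(
        "SELECT id FROM headword WHERE lemma = ?", (lemma,)
    ).fetchone()
    conn.execute(
        "INSERT INTO entry (headword_id, definition) VALUES (?, ?)",
        (headword_id, definition),
    )


def export_xml(conn):
    """Serialize all entries, ordered by lemma, as one XML document string."""
    root = ET.Element("dictionary")
    rows = conn.execute(
        "SELECT h.lemma, e.definition FROM entry e "
        "JOIN headword h ON h.id = e.headword_id "
        "ORDER BY h.lemma, e.id"
    )
    for lemma, definition in rows:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "lemma").text = lemma
        ET.SubElement(entry, "definition").text = definition
    return ET.tostring(root, encoding="unicode")
```

In such a design the typesetting routine and the Web publication would both consume the string returned by `export_xml`, so neither output channel depends on the database layout.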
REALIZATION

The lexicographical desktop environment consists essentially of four components: the citation corpus, the bibliography of the source texts, the storage and management of entries, and the entry editor.

The citation corpus makes it possible to search the database for quotations that illustrate the meaning of a certain lemma. Truncated lemma input is allowed, and the search can be restricted to a user-defined selection of source texts. The result is given as a KWIC concordance that can be sorted by various criteria such as source text, word form or date of the source text. In the concordance the lexicographer can freely select the length of the quotation, and it is also possible to go directly from the concordance to a full-text view of the source. This feature is useful for reviewing a quotation in its full original context. Furthermore, quotations can be copied from the citation corpus into the core component of the environment, the entry editor.
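The search behaviour just described (truncated lemma input, restriction to selected sources, adjustable context, sortable KWIC output) can be sketched as follows. This is only an illustration under stated assumptions: the data model (`Token`, `Source`, `Hit`), the function `kwic` and its parameters are hypothetical, truncation is modelled with a shell-style `*` wildcard, and the two tiny "sources" are invented toy data, not real quotations from the corpus.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Token:
    form: str   # word form as it appears in the text
    lemma: str  # lemma assigned during lemmatization


@dataclass(frozen=True)
class Source:
    sigle: str     # short label of the source text
    year: int      # approximate date of the source
    tokens: tuple  # lemmatized running text


@dataclass(frozen=True)
class Hit:
    sigle: str
    year: int
    left: str
    keyword: str
    right: str


def kwic(sources, lemma_pattern, width=3, sigles=None, sort_by="sigle"):
    """Return keyword-in-context hits for lemmata matching lemma_pattern.

    lemma_pattern may be truncated with '*', e.g. 'minn*'.
    width is the number of context tokens on each side.
    sigles optionally restricts the search to selected sources.
    """
    hits = []
    for src in sources:
        if sigles is not None and src.sigle not in sigles:
            continue
        for i, tok in enumerate(src.tokens):
            if fnmatchcase(tok.lemma, lemma_pattern):
                left = " ".join(t.form for t in src.tokens[max(0, i - width):i])
                right = " ".join(t.form for t in src.tokens[i + 1:i + 1 + width])
                hits.append(Hit(src.sigle, src.year, left, tok.form, right))
    sort_keys = {
        "sigle": lambda h: h.sigle,
        "form": lambda h: h.keyword,
        "year": lambda h: h.year,
    }
    return sorted(hits, key=sort_keys[sort_by])


# Invented toy data for demonstration only.
SOURCES = (
    Source("A", 1200, (Token("diu", "der"), Token("minne", "minne"),
                       Token("ist", "sîn"), Token("guot", "guot"))),
    Source("B", 1150, (Token("von", "von"), Token("minnen", "minne"),
                       Token("sanc", "sanc"))),
)
```

Calling `kwic(SOURCES, "minn*", width=1, sort_by="year")` lists the hit from source B before the hit from source A, because B carries the earlier date; passing `sigles={"A"}` limits the search to source A.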
Figure 1: Citation Corpus

The bibliography of the source texts is an informational tool for the lexicographer. It helps to maintain and retrieve master data about the quoted edition, the text type, date and localization of the source texts, and information about their transmission and earlier editions. Furthermore, the entire management of sigla is handled in this component of the system.

The third component is used for managing and storing the entries. It consists of an alphabetical list of lemmata (headwords) and the corresponding entries. The lexicographer can check the status of an entry: whether an entry for a headword already exists, whether it is ready for publication, or whether a headword serves only as a link to another entry. As soon as a certain sequence of entries is ready for publication in an instalment, the whole instalment or a number of selected entries can be typeset by a TUSTEP typesetting routine. The result is a PostScript file that represents the printed page of the dictionary in two columns. Since the lexicographers work on XML-tagged data, it is important for them to have a typesetting routine at hand to check that an entry stays within the length permitted by the space available in a printed dictionary.
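The status check on headwords described above can be pictured with a small sketch. The status names, the `Headword` record and the two helper functions are hypothetical and only mirror the cases listed in the text (no entry yet, entry not yet ready, entry ready for publication, headword that merely links to another entry); the actual system may model this differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    NO_ENTRY = "no entry yet"
    IN_PROGRESS = "entry exists, not ready"
    READY = "ready for publication"
    LINK_ONLY = "link to another entry"


@dataclass
class Headword:
    lemma: str
    status: Status = Status.NO_ENTRY
    link_target: Optional[str] = None  # only used for LINK_ONLY headwords


# Statuses that do not hold up an instalment.
PUBLISHABLE = (Status.READY, Status.LINK_ONLY)


def blocking(headwords):
    """Return the lemmata, alphabetically, that still hold up an instalment."""
    return sorted(h.lemma for h in headwords if h.status not in PUBLISHABLE)


def instalment_ready(headwords):
    """True if every headword in the range can go to typesetting."""
    return not blocking(headwords)
```

With such a check, a sequence of entries would be handed to the typesetting routine only once `instalment_ready` returns `True` for it; `blocking` tells the lexicographer which headwords still need work.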
Figure 2: Entry Editor

The main component of the lexicographical environment is the entry editor. This tool allows quotations to be manipulated and commented on, grouped according to definitions, ordered chronologically, and so on. The wording of a quotation is always protected against inadvertent changes; it is therefore not necessary to re-check the quotations before an instalment is published. It is possible, however, to resize the context of a quotation or to go back to a full-text view of the source text. The entry editor is tag-based, which means that all required information is marked up using XML elements. The entry editor inserts most of these tags automatically, and they are invisible to the lexicographer, which helps to maintain a clear overview of the structure of an entry. Certain elements can be inserted manually by the lexicographer, either via the keyboard or via macros. The XML-tagged entries are the basis for the different output formats, both printed and electronic publication.

PERSPECTIVES

During recent months, the work of the project has concentrated more on refining the entry editor than on the publication routines. The teams of the new Middle High German Dictionary are working with the system at their different locations. It has been tested and improved continually, but it is still under construction. Experience shows that more functions are needed, for example a function that ensures a consistent cross-reference system and handling of links, or a workflow component that keeps track of an entry from its first draft to its final publication. The typesetting routines work effectively enough and will certainly satisfy the committees of the Academies, who expect convincing results on the printed page. However, future work will also focus on electronic publishing on the Internet and, of course, on adapting the system to the requirements of other dictionary projects.
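As an illustration of the cross-reference function named among the open requirements, the following sketch checks XML-tagged entries for links that point to no existing entry. It is not part of the system described here: the element and attribute names (`dictionary`, `entry`, `lemma`, `xref`, `target`) are assumptions, since the project's actual XML vocabulary is not given in the text, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample with a hypothetical tag set.
SAMPLE = """
<dictionary>
  <entry lemma="minne"><xref target="minnen"/></entry>
  <entry lemma="minnen"/>
  <entry lemma="ritter"><xref target="riter"/></entry>
</dictionary>
"""


def dangling_xrefs(xml_text):
    """Return (source lemma, missing target) pairs for unresolved links."""
    root = ET.fromstring(xml_text.strip())
    lemmas = {entry.get("lemma") for entry in root.iter("entry")}
    problems = []
    for entry in root.iter("entry"):
        for xref in entry.iter("xref"):
            target = xref.get("target")
            if target not in lemmas:
                problems.append((entry.get("lemma"), target))
    return problems
```

For the sample above, `dangling_xrefs(SAMPLE)` reports the pair `("ritter", "riter")`, because no entry with the lemma `riter` exists, while the link from `minne` to `minnen` resolves.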

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003

"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

83 works by 132 authors indexed

Affiliations need to be double-checked.

Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC
