Det Danske Sprog- og Litteraturselskab (Society for Danish Language and Literature)
INTRODUCTION
Five years ago the directors of the Carlsberg Foundation, the main sponsor of a multi-volume printed edition of medieval source texts for Danish history, suggested that it might now be appropriate to give up the "quill pen" and start using modern technology. Eventually, this broad hint led to the development of a customized editorial system covering all phases of the ongoing work. It also opens the way to an electronic edition which may eventually include already published volumes as well. A main challenge was that development and implementation had to proceed in a manner and at a pace adapted to the experience of the colleagues responsible for the project. They are highly skilled in history and philology, and in the borderland between them; but with respect to computers and computational methods they are, or rather were, illiterate, except that a few of them had occasionally used a PC as a simple typewriter. Unless they could feel confident that the new technology would help them rather than spoil their project, why should they be interested?
BACKGROUND
"Diplomatarium Danicum" is a chronologically arranged scholarly, text critical edition of all known medieval legal documents, so-called diplomas, concerning the then Denmark which included parts of modern Sweden and Germany. In parallel, translations of the documents are published in "Danmarks Riges Breve" ("Letters of the Danish Kingdom"). Since 1938 a total of 32 volumes of each, covering the period 789 AD to 1392, have been published. The rest, up to 1412, is in preparation. The books are edited and published by The Society for Danish Language and Literature.
From the late thirties to the early fifties, most of the medieval documents in question (in fact up to the year 1450) were identified in archives all over Europe (the Baltic region, the Vatican, England, etc.) and acquired in the form of photostat or microfilm copies. During the same period, typewritten transcriptions of the documents up to 1409 were made and corrected by hand. The main task of the scholars presently working on the edition is thus to review the transcriptions, their dating, origin, etc., and to add several kinds of supplementary information, including an abstract (the so-called regest), text-critical notes, and facts about the original document or documents on which the published version builds. In addition, they have to compile, for each volume, detailed indexes of person and place names, including both the original forms (in Latin, Old Danish, Middle Low and High German, etc.) and modernized / translated forms.
Most of the editors' corrections and additions to the manuscripts were made by hand or, to some extent, by typewriter, and their marking-up of the text was purely typographic. After editing, the manuscripts were sent to a printing house which had years of experience interpreting them. After several rounds of proof-reading, the typographers made up the pages with a runner (i.e. line numbers in the margin); the separately typeset notes were placed at the bottom of each page with references to the relevant line numbers. From a computational point of view, referential links of this kind must be regarded as rather fuzzy.
A FEASIBILITY STUDY
In another department of the Society for Danish Language and Literature we are making a corpus-based dictionary of contemporary Danish using state-of-the-art systems for computer-aided lexicography. I was therefore asked to analyze the existing procedures and give an opinion on possible modernizations. In my survey I had to conclude that if the ultimate goal of the project were just printed volumes, the introduction of computers was unlikely to produce any gain with respect to speed, quality or finances, except perhaps for the compilation of the indexes. However, another view might be taken, namely that of reusability. If the texts could also be made available to researchers and other potential users in electronic form, their use and usefulness were likely to increase.
The aspect of reusability gained further weight when we discovered that the printing house had kept standing matter, in the form of floppy disks, for earlier volumes back to c. 1376. It might thus be possible to produce an electronic edition based on existing and future digital data, covering a reasonably long period from about 1376 onwards. Furthermore, if it were later decided to digitize even earlier volumes, which exist only in printed form, the experience gained from converting the typesetter's files to SGML would prove useful for processing scanned text as well.
A USER DRIVEN APPROACH
To make a long story short, it was decided to enter the digital age. And it had to be realized that more drastic changes would be needed than just substituting word processors for any surviving quill pens. When marking up the manuscripts, the editors would have to think in terms of data types and generic encoding rather than presentational and typographical features. Based on my deep respect for the work already done and for procedures that had been used and refined over decades, and on my own experience from a time when I was the ignorant user who did not get what he thought he would, I realized that the success of the project would depend heavily on whether the experts who were to work with the system would not just accept, but also appreciate, the assets of the new procedures. To achieve this, a stepwise approach was taken. Each step involved an in-depth analysis of the procedures currently in use, discussions with the editors to clarify any obscurities, and hands-on experiments.
IMPLEMENTATION
The work plan that was eventually adopted involved a seemingly minor change of procedure, namely that the printing house should key in the existing transcripts (which already contained some of the supplementary information mentioned above) before, instead of after, the editorial handling of them. They were to use an agreed set of codes which was essentially typographic rather than generic. As the data arrived from the printer, it was transformed to SGML using the parser DIPA 1 along with ad hoc Pascal programs for pre- and post-processing; one of the post-processing steps involved the automatic linking of most critical notes to the correct reference points in the document text. A couple of years later, the data for eleven years (1399-1409) had been processed and was available in SGML-encoded form for post-editing. The DTD is tailored to this specific task and is not TEI conformant.
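The conversion programs themselves are not reproduced here, but the flavor of the note-linking step can be suggested by a minimal sketch - in Python rather than the Pascal actually used, and with invented element names and note data: each critical note carries the word(s) it applies to, and the post-processor anchors it at the first matching position in the document text.

import re

# Hypothetical sketch of the automatic note-linking step; element names,
# attribute names and the note data are invented for illustration.
def link_notes(text, notes):
    """Anchor each critical note at the first occurrence of its lemma,
    i.e. the word(s) in the document text that the note applies to."""
    for i, note in enumerate(notes, start=1):
        anchor = f'<noteref id="n{i}"/>'
        text, hits = re.subn(re.escape(note["lemma"]),
                             note["lemma"] + anchor, text, count=1)
        note["target"] = f"n{i}" if hits else None  # unresolved notes need manual review
    return text, notes

sample = "dilecto filio Petro Laurencii cantori ecclesie Arusiensis"
notes = [{"lemma": "Laurencii", "note": "(invented apparatus text)"}]
print(link_notes(sample, notes))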
The original idea was that the editors and their student assistants should use our dictionary editing system GestorLex 2. However, a closer examination of the program's functionality against the needs of the editors of the "Diplomatarium" showed that it would be far from ideal for text types other than dictionaries, even though many of its features were highly appropriate and not available in other commercial software. Consequently, I started developing - in regular dialogue with the users - a customized software system under Windows which would eventually cover all phases of the project. The programming tool is Borland's Delphi, and without access to Dave Baldwin's HTML-aware ThtmlViewer Component 3 I probably could not have done it.
The first part of the system was the Viewer, a program which in most details shows and prints a typographic version of the documents, taking the SGML-encoded version as input. The few deviations from "real" printed text fall into two classes. First, some characters do not exist in the available fonts, typically combinations like "u" with a small "e" or "o" over it; the missing characters are represented by mnemonic codes in curly brackets, e.g. "{ue}", "{uo}". Secondly, a few features which depend on details of the page make-up of the printed book would be meaningless to imitate. This is especially true for the critical (foot)notes. If, in the book, a document runs over two pages, so do the notes. Normally every note has an entry of its own, starting with the number of the line it refers to and a textual reference to the word(s) the note applies to; but when two (or more) notes refer to items in the same printed line, the notes run on, separated only by a dash. On the screen and on laser prints, however, every individual note is separate, and the notes follow immediately after each text. On the other hand, the screen image is enriched by reciprocal hyperlinks between a note and the position it refers to in the text; and words that are marked up as potential index entries may, as an option, be shown in different colors.
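In a present-day setting the mnemonic codes could be expanded to Unicode combining marks for display. The following sketch is purely illustrative; the Viewer's actual code table and rendering are not shown here, and the sample word is invented.

# Illustrative expansion of mnemonic character codes into Unicode combining marks.
MNEMONICS = {
    "{ue}": "u\u0364",   # u with a small e above (COMBINING LATIN SMALL LETTER E)
    "{uo}": "u\u0366",   # u with a small o above (COMBINING LATIN SMALL LETTER O)
}

def expand_mnemonics(s):
    for code, rendering in MNEMONICS.items():
        s = s.replace(code, rendering)
    return s

print(expand_mnemonics("g{uo}ths"))   # invented sample word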
As a gentle introduction to SGML, the earliest editing of the converted, SGML-encoded text was done with WordPerfect 5.1 for DOS, the only computer program of which the editors had at least a little knowledge. A set of macros was made which supported quick and correct encoding. If saved as "DOS text", the corrected document could then be viewed and printed with the Viewer. However, the keying was eventually left to student assistants whose computational frame of reference was MS-Windows rather than DOS, and it was time to integrate a customized editor into the Viewer. The next step would be to add a database component for the compilation of indexes.
THE INDEXES
At the end of each printed volume there are two indexes, one of person names and one of place names. The connection between text and indexes may be illustrated by the following example, a papal letter of appointment, taken from item 184 in the printed volume for the period 1389-92. The date and place of the document are S. Pietro in Rome, 29 January 1390, and its opening salutation reads:
"Bonifacius (episcopus seruus seruuorum dei)/dilecto filio Petro Laurencii cantori ecclesie Arusiensis salutem (et apostolicam benedictionem)."
This gives rise to the following "atomic" entries in the indexes:
Person names:
(I)"Bonifacius 9., pope. 1390: 184"
(II"Petrus see Peder"
(III)"Peder, Petrus: P. Larsen, precentor in Aarhus. 1390: 184"
Place names:
(IV)"Arusiensis see Aarhus"
(V)"Aarhus, Arusiensis - Precentor: 184"
In the printed indexes such "atoms" are combined with other similar references to form compound entries like:
"Petrus (see also Peder and Peter) 1) P. patriarch of .." "Aarhus, Aresiensis, Arhusius, Arusiensis, Aarhusz, Aars, Arhusen, Arusen, Arws: 555, - Bishop: 18 .. Precentor: 18, 136, 184, ..."
From the "Petrus" entry it would seem that the normalized name forms chosen depend on nationality, and this is in fact the case as can be demonstrated by "Iohannes" which will be rendered as Jan (Dutch), Johan (German), John (English) or Jens (Danish), unless the person lives outside the Germanic countries and belongs to the Roman Catholic clergy, in which case the latinized form is preserved. The choice of name-form for the index is at the same time the name used in the parallel translation.
Automated index generation is a major purpose of the present project. In order to apply it successfully, the editors had to be made aware of the concept of data type, and they had to help me identify and distinguish those data types in the texts which are relevant for index generation and which therefore have to be marked up so that the system can extract and process them. The following five types were identified:
<PN> Person Name (Bonifacius, Petro)
<TN> Byname (Laurencii)
<SB> Job Description (seruus seruorum dei, cantor)
<SN> Place Name (Arusiensis)
<AN> Anonymous (e.g. "the <AN>aldermen</AN> of <SN>Hamburg</SN>")
If several elements belong together, they should be enclosed in a record delimiter <PST>..</PST>; the example above should thus be marked up like this:
"..<PST><PN>Petro</PN> <TN>Laurencii</TN> <SB>cantori</SB>
ecclesie <SN>Arusiensis</SN></PST>.."
The record element <PST> (but none of the others) may be nested as in the following constructed example:
"<PST>
<PST><PN>Peter</PN> <TN>Johnson</TN></PST> and
<PST><PN>John</PN> <TN>Peterson</TN></PST>
<SB>aldermen</SB> of <SN>Hamburg</SN>
</PST>"
During the processing of the index information, a first step is to break such nested records up into equivalent sets of two or more non-nested ones (a minimal sketch of this step follows the examples), in this case:
"<PST><PN>Peter</PN> <TN>Johnson</TN> <SB>aldermen</SB>
<SN>Hamburg</SN></PST>"
and
"<PST><PN>John</PN> <TN>Peterson</TN> <SB>aldermen</SB>
<SN>Hamburg</SN></PST>"
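The un-nesting itself is straightforward once the records have been read in. The following sketch assumes, purely for illustration, that each <PST> record has already been parsed into a simple in-memory structure; the key names and the parsing step are not those of the actual system.

# Sketch of the un-nesting step on already-parsed records (illustrative names).
def flatten(record, inherited=None):
    """Break a nested record into non-nested ones; inner records inherit
    the fields (here job description and place name) of the outer record."""
    fields = dict(inherited or {})
    fields.update({k: v for k, v in record.items() if k != "children"})
    children = record.get("children", [])
    if not children:
        return [fields]
    result = []
    for child in children:
        result.extend(flatten(child, fields))
    return result

nested = {"SB": "aldermen", "SN": "Hamburg",
          "children": [{"PN": "Peter", "TN": "Johnson"},
                       {"PN": "John", "TN": "Peterson"}]}
for rec in flatten(nested):
    print(rec)
# {'SB': 'aldermen', 'SN': 'Hamburg', 'PN': 'Peter', 'TN': 'Johnson'}
# {'SB': 'aldermen', 'SN': 'Hamburg', 'PN': 'John', 'TN': 'Peterson'}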
Marking up the index words is the first, and a necessary, step on the way to the indexes. Although manual, the process is fast and easy: you simply mark the element to be tagged as a block and press Alt+P for person name, Alt+S for place name (Danish: Stednavn), Alt+R for record, etc. As the next step, the words have to be modified in at least two ways before they can go into an index. First, inflected forms have to be reduced to a canonical form, usually the nominative singular (e.g. Petro to Petrus); secondly, they have to be translated into a standard modern form (Petrus to Peder etc.; Iohannes to Jan, Jens, John or Johannes). This, too, has to be done manually, but as more and more names are translated and normalized, the process is gradually becoming at least semi-automatic. This is because all reduced and translated forms are stored in a lexical database, which the system uses to present the editor with a list of one or more proposed forms, from which he can choose by simply pressing a key or clicking with the mouse. The output of the process is a set of records of 16 fields each, namely a unique document identifier and the three versions (text form, canonical form, translated form) of each of the five information types mentioned. In many cases, however (e.g. a place name, or a person name without a job description), most of the fields are empty.
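As an illustration only (the actual database layout is not reproduced here), the record structure and the look-up of proposed forms might be sketched as follows:

# Illustrative record layout and lexical look-up; field names are invented.
FIELDS = ["doc_id"] + [f"{t}_{v}"
                       for t in ("PN", "TN", "SB", "SN", "AN")
                       for v in ("text", "canonical", "translated")]
assert len(FIELDS) == 16   # one document identifier + 3 versions x 5 types

# lexical database built up from forms the editors have already handled
LEXICON = {
    "Petro": [("Petrus", "Peder")],
    "Iohannes": [("Iohannes", "Jens"), ("Iohannes", "Johan"),
                 ("Iohannes", "John"), ("Iohannes", "Jan")],
}

def propose(text_form):
    """Return the (canonical, translated) candidates offered to the editor,
    or an empty list if the form has not been seen before."""
    return LEXICON.get(text_form, [])

print(propose("Petro"))   # [('Petrus', 'Peder')]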
The third step is automatic: according to rules which simulate the principles used by a human indexer, index entries like the ones marked (I) to (V) above are generated and displayed for human control; if an entry is not accepted, the editor must go back and correct the database record.
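A much simplified sketch of such a rule, again with invented field names and covering only person-name entries like (II) and (III), might look like this; the actual rules are considerably richer:

# Simplified sketch of index-entry generation for person names.
def person_atoms(rec, doc_no, year):
    atoms = []
    # cross-reference from the canonical (Latin) form to the modernized form
    if rec["PN_canonical"] != rec["PN_translated"]:
        atoms.append(f'{rec["PN_canonical"]} see {rec["PN_translated"]}')
    entry = (f'{rec["PN_translated"]}, {rec["PN_canonical"]}: '
             f'{rec["PN_translated"][0]}. {rec["TN_translated"]}')
    if rec.get("SB_translated") and rec.get("SN_translated"):
        entry += f', {rec["SB_translated"]} in {rec["SN_translated"]}'
    atoms.append(f'{entry}. {year}: {doc_no}')
    return atoms

rec = {"PN_canonical": "Petrus", "PN_translated": "Peder",
       "TN_translated": "Larsen", "SB_translated": "precentor",
       "SN_translated": "Aarhus"}
for atom in person_atoms(rec, 184, 1390):
    print(atom)
# Petrus see Peder
# Peder, Petrus: P. Larsen, precentor in Aarhus. 1390: 184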
NOTES
1 DIPA, DIctionary PArser, is a context-free, character-aware chart parser programmed by Peter Molbaek Hansen, associate professor in computational linguistics, University of Copenhagen, in collaboration with the present writer. Among other things it has been used for transforming several printed dictionaries into SGML-encoded format.
2 GestorLex is an SGML-based dictionary editing system, programmed by TEXTware A/S for Gyldendal Publishers, Copenhagen, according to specifications made by Longman Publishers and myself. It is used by several dictionary projects in Denmark and elsewhere. Its core is an SGML editor with a vertically split window. To the left, the elements are shown, each on its own line and indented according to its level in the tree structure defined by the DTD; this is where the editing is done. To the right, an aligned and continuously updated typographic version of the entries is shown.
3 ThtmlViewer, programmed and regularly updated by L. David Baldwin, USA, is a Delphi component for displaying HTML documents. Information and demo version: http://www.pbear.com/