Austrian Academy Corpus - Doing literary markup by means of XML

Karlheinz Moerth; Hanno Biber

Authorship

1. Karlheinz Moerth

OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

2. Hanno Biber

OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

Work text

This poster presents details of the pilot phase of the newly founded Austrian Academy Corpus (AAC). The AAC was started with the aim of building a digital corpus tailored to the particular needs of scholars in literary studies. So far, the building of electronic text collections has been the domain of linguists. Although literary texts are becoming available in ever-increasing numbers, many of these products originated in linguistic efforts to gather data for lexicographic or learners' purposes. Literary scholars have rather neglected the whole issue. Most existing digital literary resources are poorly edited and serve for little more than simple word searches. Because the demands of work on literary texts differ considerably from those of linguistic studies, the AAC has been working towards solutions that meet the needs of the literary domain.
The AAC builds on digital data collected at the Austrian Academy of Sciences over the last few years. During that period the department focused on literary sources classified as functional literary text types, i.e. magazines, diaries, sermons, speeches, letters, obituaries and the like. The materials in the corpus therefore vary considerably in content and structure. The data were accumulated primarily for text-lexicographic purposes, in preparation for a phraseological dictionary published in 1999. This text-dictionary was based on a literary satirical magazine that first appeared in 1899 in fin-de-siècle Vienna and was published until 1936. This magazine, 'Die Fackel', was very popular among intellectuals in its time and remains of the utmost interest to scholars of German literature trying to understand this crucial period of history.
The electronic text of 'Die Fackel' was first generated by means of rather unsophisticated OCR, a process completed several years ago. During the compilation of the above-mentioned text-dictionary, the quality of this electronic text was improved through several rounds of proofreading and the addition of basic markup developed specifically for this purpose. Only very few tags were applied, for example to indicate the position of images or to identify special characters that were not supported by the standard code pages of the user interface.
The texts currently being digitised (primarily historical magazines) undergo several stages of processing: they are scanned, made machine-readable by means of OCR, and then corrected several times. The application of markup is the last step in this process. Formatting information of the digitised texts is initially preserved in file formats such as RTF or by means of standardised style sheets, which yield quite reasonable results.
With the increasing availability of more advanced text encoding techniques, the AAC working group began to consider a more general approach to the manifold problems of integrating text and text-related secondary data. We had to look for a markup scheme that allowed different types of data to be labelled within one coherent markup system. In its efforts to find appropriate ways of describing the structure of texts, linking data and facilitating structured searches, the working group began experimenting with applying XML (Extensible Markup Language) to its data. The issue of XML, especially the question of XML versus SGML, has been raised repeatedly in Humanities computing. XML, the youngest descendant of the SGML/HTML family, is a subset of SGML and can be viewed as its logical further development.
As a general rule, SGML-encoded data can easily be transformed into XML data and, to a certain extent, the other way round. The XML-related experiments at our department started in 1998, in the early stages of XML's appearance on the scene. As yet only parts of the AAC corpora have been provided with XML-conforming markup, as we are still exploring the potential benefits of XML for our work.
Basic markup (start and end of pages, line breaks), character encoding and the concatenation of hyphenated words were carried out by means of a special conversion engine. There are many different ways to perform such transformations. Many would prefer scripting languages such as Perl for such a task, which can of course be a practicable and efficient approach. We tried to use graphical user interfaces from the very beginning of our work in order to ensure constant monitoring and control of the transformation processes. We therefore developed our tools in Delphi, a RAD (rapid application development) tool that helped to cut development time.
The absence of any XML-compliant browser software in the early phase forced us to seek a solution of our own and to develop a graphical interface that brings our data on screen and displays them in readable form. These experiments showed that the fundamentally strict structure of XML documents makes parsing fairly straightforward. Even in the absence of a definitive set of tags, XML files can still be processed without parsing an attached DTD (Document Type Definition). As the texts we are working on contain passages in various languages and scripts, XML's compliance with the Unicode Standard proved extremely helpful. Unicode has unified a huge number of diverging standards and will in future help ensure the exchangeability and interoperability of all sorts of textual data.
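
The basic conversion step described above (page boundaries, line breaks, rejoining hyphenated words) can be illustrated with a short sketch. This is not the AAC's Delphi engine, whose internals are not described here; it is a minimal Python illustration under stated assumptions. The element names `page` and `lb`, the function names, and the rule "a line-final hyphen followed by a lowercase letter marks a split word" are all hypothetical choices made for the example.

```python
from xml.sax.saxutils import escape


def join_hyphenated(lines):
    """Rejoin words that were split by a hyphen at the end of a line.

    Assumption (not from the source): a line-final hyphen followed by a
    lowercase letter on the next line is a split word. Real compounds
    that happen to break at their hyphen would need a lexicon check.
    """
    lines = [line.strip() for line in lines]
    for i in range(len(lines) - 1):
        following = lines[i + 1]
        if lines[i].endswith("-") and following[:1].islower():
            # Move the second half of the word up to the current line.
            head, _, rest = following.partition(" ")
            lines[i] = lines[i][:-1] + head
            lines[i + 1] = rest
    return lines


def page_to_xml(page_number, lines):
    """Wrap one page of OCR text in hypothetical <page>/<lb/> markup."""
    parts = [f'<page n="{page_number}">']
    for line in join_hyphenated(lines):
        # escape() protects &, < and > so the output stays well-formed.
        parts.append(escape(line) + "<lb/>")
    parts.append("</page>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Invented sample lines, used only to exercise the functions.
    print(page_to_xml(1, ["Die Fackel er-", "schien in Wien & Umgebung."]))
```

A graphical tool of the kind the text describes would presumably show each proposed join to an operator before applying it; the sketch applies the rule unconditionally.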
XML not only allows generic, highly specialised encoding but also enables the flawless exchange of data among different systems.
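
The observation that well-formed XML can be processed without a DTD, and that it carries Unicode text directly, can be illustrated with a generic parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, not the AAC's own software. The sample document, its element names (`issue`, `page`, `title`, `p`, `image`) and the `lang` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: no DTD is attached and the tag set is ad hoc.
SAMPLE = """<issue n="1">
  <page n="1">
    <title lang="de">Die Fackel</title>
    <p lang="de">Österreichische Zeitschrift</p>
    <p lang="el">λόγος</p>
    <image/>
  </page>
</issue>"""


def summarize(xml_text):
    """Count element names and collect text by language, using no DTD."""
    root = ET.fromstring(xml_text)
    tag_counts = {}
    text_by_lang = {}
    for element in root.iter():
        tag_counts[element.tag] = tag_counts.get(element.tag, 0) + 1
        lang = element.get("lang")
        if lang and element.text:
            text_by_lang.setdefault(lang, []).append(element.text.strip())
    return tag_counts, text_by_lang


def is_well_formed(xml_text):
    """Return True if the text parses; strict nesting makes this check simple."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    counts, texts = summarize(SAMPLE)
    print(counts)
    print(texts)
    print(is_well_formed("<p><b>unclosed</p>"))
```

Because the parser relies only on well-formedness, the same code works for any tag set; a DTD or schema would be needed only to check that a document uses the intended tags.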

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2000

Hosted at University of Glasgow

Glasgow, Scotland, United Kingdom

July 21, 2000 - July 25, 2000

Conference website: https://web.archive.org/web/20190421230852/https://www.arts.gla.ac.uk/allcach2k/

Series: ALLC/EADH (27), ACH/ICCH (20), ACH/ALLC (12)

Organizers: ACH, ALLC
