The Austrian Academy Corpus - Digital Resources and Textual Studies

paper
Authorship
  1. Hanno Biber

    OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

  2. Evelyn Breiteneder

    OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

  3. Karlheinz Moerth

    OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

Work text



Hanno Biber, Austrian Academy Corpus, hanno.biber@oeaw.ac.at
Evelyn Breiteneder, Austrian Academy Corpus, evelyn.breiteneder@oeaw.ac.at
Karlheinz Moerth, Austrian Academy Corpus, karlheinz.moerth@oeaw.ac.at

ALLC/ACH 2002, University of Tübingen, Tübingen, 2002

Editor: Harald Fuchs
Encoder: Sara A. Schmidt

In this paper we describe the aims, the main research objectives and the
crucial computational aspects of establishing a large electronic text corpus. In
the first part, the concept of our corpus approach is presented and its
background and consequences are discussed. In the second part, notable features
of the digitisation process, based upon the application of XML Schemas, are
discussed.
The Austrian Academy Corpus (AAC) is a newly founded institution based within the
Austrian Academy of Sciences in Vienna. Its aim is to set up a text corpus and
also to conduct research in the field of electronic text corpora. Electronic
text collections to date have generally focused on linguistic studies and
lexicography, and have been designed and set up for language-orientated research. Recently,
the perspective has changed towards providing resources for scholars from
various fields within the humanities. The AAC is attempting to establish a
corpus that meets the needs of textual studies and conveys essential information
about language and history. The AAC functions as an example of an experimental
corpus that is predominantly designed for textual studies. It will be a
complexly structured text collection in which sources from a variety of fields
will be included. The AAC also aims to include a wide range of texts from
various cultural domains, carefully selected for their historical and cultural
significance. The AAC will create,
structure, provide and analyse selected text sources from the past two
centuries, taking advantage of the latest standards and techniques in electronic
text processing. The AAC intends to digitally store a wide selection of
different sources of scholarly, journalistic and political texts which were of
considerable influence in the period between 1848 and 1989. It has started the
digitisation and structured integration of texts, including, for example,
several influential literary and political journals, such as “Die
Weltbühne” or “Die Aktion”, published in Berlin in the first decades of the last
century, and the Austrian journal “Der Brenner”, published in Innsbruck, as well
as many other sources. The famous satirical magazine “Die Fackel”, published by
Karl Kraus in Vienna, will constitute the core of the AAC and will be a starting
point for future selections of texts. Images and manuscripts will be included in
the corpus, where necessary, because the original graphical and typographical
information is important for the meaning and interpretation of digitised texts.
This is particularly the case with complex text structures such as newspapers or
literary journals which comprise a whole variety of functionally different text
types within their structure.
Digital resources in the form of electronic text corpora should be regarded as
structures for representing complex information. The electronic text collections
established at the AAC so far and its future projects will focus on electronic
representations not only of literary texts, literary magazines, journals and
newspapers but also of a carefully considered selection of texts from several
other cultural and social domains. Special emphasis will be placed on areas that
have been rather neglected in humanities computing to date. Journals and
newspapers pose an especially difficult task when it comes to their
representation in digital form. An equally difficult task is the analysis and
description of the media’s decisive historical influences and contexts. The
study and detailed investigation of texts has always been crucial for our
understanding of historical processes. The knowledge of texts and the
accessibility of textual knowledge can be furthered by means of large text
corpora like the AAC.
The text selection for the AAC, which will take place at the same time as the
corpus work, will be guided by thematic and empirical criteria, as well as
factors specifically related to the type of text. The specificity of text type
is therefore, amongst others, a decisive factor not only for the selection of
texts but also for their categorisation in a corpus: letters by Oskar Kokoschka,
anecdotes by Max Liebermann, writings of Adolf Loos, narrations by Adalbert
Stifter, feature articles by Daniel Spitzer, funeral sermons, electoral
speeches, propaganda and advertising slogans, pop song lyrics, political
speeches, comic books, instructions, travel guides, TV programmes, mailing
catalogues, and so on. In recent years, the establishment of large
German-language corpora has been restricted to the field of linguistic and
lexicographic studies. So far, there have not been any large-scale initiatives
in the area of text-centred studies. Although more and more literary texts are
becoming available, many of these came into existence as by-products of efforts
to amass data for lexicographic research. Generally speaking, the historical
period on which the Austrian Academy Corpus is working is poorly documented in
terms of digital literary texts. This applies even more when it comes to
collective text ‘carriers’ such as magazines, papers, yearbooks, commemorative
volumes and similar materials.
Among the sources being digitised for the AAC are a considerable number of
historical literary magazines of major importance. One example is the journal
“Der Brenner”, which was published by Ludwig Ficker in Innsbruck from 1910 until
1954. Among its contributors are figures as renowned as Carl Dallago, Theodor
Haecker, Else Lasker-Schüler, Adolf Loos, and Georg Trakl. Other sources on
which the AAC is working were published in Berlin, for instance, the journal
“Die Aktion” edited by Franz Pfemfert between 1911 and 1932. Among its
contributors were Peter Altenberg, Hermann Bahr, Walter Benjamin, Max Brod,
Richard Dehmel, Salomo Friedlaender, Georg Heym, Kurt Hiller, Max Oppenheimer,
Egon Schiele and August Strindberg. Another journal to be mentioned here and
perhaps the most important one in the pipeline is the weekly Berlin journal “Die
Schaubühne” (1905-18), later renamed “Die Weltbühne” (1918-33), which was
edited by Siegfried Jacobsohn, Kurt Tucholsky and Carl von Ossietzky. Among the
writers who contributed to “Die Weltbühne” were Henri Barbusse, Bertolt Brecht,
Alfred Döblin, Lion Feuchtwanger, Arthur Koestler, Heinrich Mann, Alfred Polgar,
Romain Rolland, and Leon Trotsky.
To produce a digital version of, for example, the literary journals “Der Brenner”
or “Die Weltbühne”, the original text has to undergo the usual stages of
electronic processing. After being scanned, the text is made readable by means
of OCR. The structure of the text (pages, paragraphs and lines) is identified by
automatic routines. Application of markup is the last step in this process. Tags
encoding content are carefully inserted by literary scholars especially trained
for this task. This process, which takes several runs, is accompanied by
proofreading against the original and constant checking and validating of the
achieved results. Literary encoding projects in the past have employed SGML,
very often following the TEI Guidelines (P3). The AAC makes extensive use of
XML’s modular system of specifications. Aside from the basic XML specification,
several other specifications exist, all of them having a more or less defined
place within the overall framework provided by XML. The exact nature of some of
these specifications is not yet clear (XLink, XML Query), as development work
still continues apace. Those that are classified as recommendations are XSLT
(XSL Transformations) and XPath (a language for addressing parts of an XML
document). The implications of others such as XLink (XML Linking Language),
XPointer (an abstract language that specifies locations), and XQL (XML Query
Language) for literary computing will have to be
considered in due course. As XML comes of age, the issue of a standard way of
defining the structure of documents becomes more and more important. Both
traditional DTDs (document type definitions) and XML Schemas are formats which
model document structures. Whereas DTDs in the traditional sense have been
around for some time and are widely accepted in the field of SGML-based
text-encoding, XML Schemas must be regarded as a fledgling technology that still
has to win its spurs.
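As a rough illustration of the structural step described above (pages, paragraphs and lines identified automatically, with markup applied afterwards), the following Python sketch wraps OCR output in XML and then addresses parts of the result with an XPath expression. The element names (`text`, `page`, `p`, `l`), the form-feed and blank-line conventions, and the sample text are assumptions made for this example; they are not the AAC's actual encoding scheme or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output: pages separated by form feeds, paragraphs by
# blank lines, one printed line per text line. (Assumed conventions.)
OCR_TEXT = (
    "First line of page one\nsecond line\n\n"
    "A new paragraph\fFirst line of page two"
)


def ocr_to_xml(text):
    """Wrap raw OCR text in <page>, <p> and <l> elements."""
    root = ET.Element("text")
    for page_no, page in enumerate(text.split("\f"), start=1):
        page_el = ET.SubElement(root, "page", n=str(page_no))
        for para in page.split("\n\n"):
            if not para.strip():
                continue  # skip empty paragraphs
            p_el = ET.SubElement(page_el, "p")
            for line in para.splitlines():
                if line.strip():
                    ET.SubElement(p_el, "l").text = line.strip()
    return root


root = ocr_to_xml(OCR_TEXT)

# ElementTree supports a subset of XPath: select all lines on page 2.
page_two_lines = [l.text for l in root.findall("./page[@n='2']/p/l")]
print(page_two_lines)
```

Content markup of the kind inserted by trained scholars would be added on top of such a skeleton; the sketch covers only the automatic structural layer.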
XML Schemas are commonly regarded as an attempt at an XML answer to the problem
of defining the structure, content and semantics of documents. There are several
arguments in favour of XML Schemas, among which are XML syntax, object
orientation, inheritance, polymorphism and datatyping. Firstly, XML Schemas
follow XML syntax rules, which makes it possible to parse them with XML tools.
Nowadays, authors of XML documents often regard traditional DTDs as unwieldy and
inconsistent with the structure of XML. With XML Schemas, validating parsers can
be built on the basis of XML syntax. Secondly, XML Schemas may include explicit
restrictions on the data types an element may hold. They allow the text
programmer to attribute data types such as strings, numbers (integer, floating
point), date and time formats, boolean and others to elements constituting an
XML document. In addition, XML Schemas are also intended to allow the definition
of new data types to further refine the markup scheme being used. For the corpus
holdings of the AAC, such applications are being investigated and implemented for
the benefit of various corpus-based linguistic and textual studies.
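The first two arguments above can be made concrete with a small sketch. Because a schema is itself an XML document, an ordinary XML parser can read it; the Python example below parses a toy schema with the standard library and reads out the datatype declared for each element. The schema, the element names (`issue`, `published`, `title`) and the simplified type checks are assumptions for illustration only. This is not a conforming XML Schema validator, and it is not the AAC's schema.

```python
import xml.etree.ElementTree as ET
from datetime import date

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Toy schema (hypothetical): three elements with built-in datatypes.
SCHEMA = f"""
<xs:schema xmlns:xs="{XSD_NS}">
  <xs:element name="issue" type="xs:integer"/>
  <xs:element name="published" type="xs:date"/>
  <xs:element name="title" type="xs:string"/>
</xs:schema>
"""

# Simplified checks for a few built-in types; real validation is stricter.
CHECKS = {
    "xs:integer": int,
    "xs:date": date.fromisoformat,
    "xs:string": str,
}


def declared_types(schema_text):
    """Parse the schema as ordinary XML and map element names to types."""
    schema = ET.fromstring(schema_text)
    return {
        el.get("name"): el.get("type")
        for el in schema.findall(f"{{{XSD_NS}}}element")
    }


def conforms(name, value, types):
    """Return True if value can be read as the type declared for name."""
    try:
        CHECKS[types[name]](value)
        return True
    except (KeyError, ValueError):
        return False


types = declared_types(SCHEMA)
print(types)
print(conforms("issue", "404", types))          # a whole number
print(conforms("published", "1899-04", types))  # incomplete date
```

A DTD could declare the same three elements but could not express that `issue` must be a number or `published` a date, which is the datatyping argument made above.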


Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2002
"New Directions in Humanities Computing"

Hosted at Universität Tübingen (University of Tubingen / Tuebingen)

Tübingen, Germany

July 23, 2002 - July 28, 2002

Conference website: http://web.archive.org/web/20041117094331/http://www.uni-tuebingen.de/allcach2002/

Series: ALLC/EADH (29), ACH/ICCH (22), ACH/ALLC (14)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None