AAC - Digital Resources in Textual Studies

poster / demo / art installation
Authorship
  1. Hanno Biber

    AAC-Austrian Academy Corpus, OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

  2. Evelyn Breiteneder

    AAC-Austrian Academy Corpus, OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

  3. Karlheinz Moerth

    AAC-Austrian Academy Corpus, OEAW Österreichische Akademie der Wissenschaften / Austrian Academy of Sciences

Work text
Hanno Biber
AAC (Austrian Academy Corpus), Austrian Academy of Sciences
hanno.biber@oeaw.ac.at

Evelyn Breiteneder
AAC (Austrian Academy Corpus), Austrian Academy of Sciences
evelyn.breiteneder@oeaw.ac.at

Karlheinz Moerth
AAC (Austrian Academy Corpus), Austrian Academy of Sciences
karlheinz.moerth@oeaw.ac.at

2001

New York University

New York, NY

editor

encoder

Sara A. Schmidt

Keywords: corpus, text encoding, XML, literature

The Austrian Academy Corpus (AAC) is a newly founded institution based at the
Austrian Academy of Sciences in Vienna. It was designed to set up a text
corpus and to conduct research in the field of electronic text corpora. The
AAC working group has had expertise in digital text studies and in
lexicography for more than ten years. For a long time electronic text
collections were primarily focused on linguistic studies and on
lexicography. Only recently has the perspective changed towards providing
material for scholars interested in texts from various fields of the
humanities. The AAC has been trying to find solutions that meet the needs of
textual studies and convey essential information about language and history.
The aim of the proposed poster is to investigate the potential of digital
resources for textual studies in various fields of the humanities. The
electronic text collections established at the AAC so far, and its future
projects, focus mainly on electronic representations not only of
literary texts, literary magazines, journals and newspapers but also of a
carefully considered selection of texts from many cultural and social
domains.
The poster will consider the advance of new systems of digital representation
and its implications for the study of language, literature and cultural
history. Digital resources in the form of electronic text corpora should be
regarded as structures for representing complex information. Journals and
newspapers pose an especially difficult task when it comes to representation
in digital form. An equally difficult task is the analysis and description
of the media's decisive historical influences and contexts. The poster,
which will be a digital projection, will comprise three parts that will show
the range of interests pursued in the AAC research group. The first part
will deal with the general organisational structures of the AAC. The second
part will be concerned with the specific selection criteria for the great
variety of texts which will form the AAC. Finally, the third part will
examine practical issues in digitising the magazine “Die Weltbühne”, giving
special attention to the applicability of XML Schemas in literary
computing.

1) AAC Structures
Research projects in the field of humanities computing rely heavily on
cooperation, collaboration and the constant exchange of knowledge and
expertise. The AAC will eventually be accessible on the internet, where
innovative and collaborative presentation techniques and current graphic
design developments will be utilised. However, being a research unit within
the Austrian Academy of Sciences, the AAC is also set within the wider
framework of a trilateral research scheme organised and planned at the
Austrian Academy of Sciences in Vienna, the Berlin-Brandenburg Academy of
Sciences and Humanities in Berlin and the Swiss Academy of Humanities and
Social Sciences in Berne. In such a wider perspective, the common and
individual settings and conditions of the German language will have to be
taken into consideration, as will their historical and contemporary
literatures of various kinds. The constant cultural exchange between the
three countries opens particular research areas and fields of study for the
linguists and scholars engaged in the establishment of digital resources and
in computing activities in the humanities in the three countries. Whereas
the efforts undertaken in Berlin and Berne are predominantly concerned with
providing selected data and texts of the 20th century mainly for
lexicographic purposes, the Austrian Academy of Sciences intends to tackle
the problem of digital representation of scholarly, journalistic and
political texts which were of considerable influence between 1848 and 1989.
The Berlin-Brandenburg Academy of Sciences and Humanities has started its
project of a Digital Dictionary of the 20th Century German Language, the
main task being to develop a dictionary system and a prime source for
linguistic and lexicographic information. The Swiss Academy of Humanities
and Social Sciences in Berne has recently joined the efforts of this
trilateral cooperation.

2) AAC Selection
Setting up a text corpus requires several conditions to be met and a number
of considerations to be weighed.
In the past twenty years, electronic text corpora have been built up in
academic institutions of many European countries, such as France, Norway,
Sweden, Slovenia, Spain, the Czech Republic, and the UK. The setting up of
these corpora is motivated by the state's will to document the national
language in a comprehensive manner and to make the corpora available for
scientific, especially linguistic, application. The AAC has a different
starting point. For the construction of an Austrian corpus one must consider
complicated issues relating to the history of the past two hundred years on
the one hand and to our own specific interests on the other. The text
selection for the AAC, which will take place at the same time as the corpus
work, will be guided by thematic and empirical criteria, as well as factors
specifically related to the type of text. The specificity of text type is
therefore a factor for the choice of texts, but also for their
categorisation in a corpus: letters by Oskar Kokoschka, anecdotes by Max
Liebermann, writings of Adolf Loos, narrations by Adalbert Stifter, feature
articles by Daniel Spitzer, funeral sermons and electoral speeches,
propaganda slogans and advertising slogans, pop song lyrics and political
speeches, comic books, instructions, travel guides, TV programmes, mailing
catalogues. These and other text types, as well as the various kinds of text
‘carrier’, are important for the choice of text.
In recent years, the establishment of large German language corpora has been
restricted to the field of linguistic and lexicographic studies. So far,
there have not been any large-scale endeavours in the area of text-centred
studies. Although more and more literary texts are becoming available, many
of these came into existence as by-products of efforts to amass data for
lexicographic research. Generally speaking, the historical period on which
the AAC is working is poorly documented in terms of digital literary texts.
This applies even more when it comes to collective text ‘carriers’ such as
magazines, papers, year-books, commemorative volumes and similar materials.
To our knowledge, no large collections of digitised historical magazines or
papers in the German language exist. The sources being
digitised for the AAC at the moment are historical literary magazines of
major importance. In the first instance there is “Der Brenner”, which was
published by Ludwig Ficker in Innsbruck from 1910 until 1934. Among the
contributors to the “Brenner” are figures as renowned as Carl Dallago,
Theodor Haecker, Else Lasker-Schüler, Adolf Loos and Georg Trakl. The other
two magazines on which the AAC is working were both published in Berlin. The
journal “Die Aktion” (1911-1932) was edited by Franz Pfemfert. Among its
contributors were Peter Altenberg, Hermann Bahr, Walter Benjamin, Max Brod,
Richard Dehmel, Salomo Friedlaender, Georg Heym, Kurt Hiller, Max
Oppenheimer, Egon Schiele and August Strindberg. The last journal to be
mentioned here and perhaps the most important one of those being worked on
at the moment is the weekly Berlin journal ‘Die Schaubühne’ (1905-1918),
later renamed ‘Die Weltbühne’ (1918-1933), which was edited by Siegfried
Jacobsohn, Kurt Tucholsky and Carl von Ossietzky. Among the writers who
contributed to the ‘Weltbühne’ were Henri Barbusse, Bertolt Brecht, Alfred
Döblin, Lion Feuchtwanger, Arthur Koestler, Heinrich Mann, Alfred Polgar,
Romain Rolland and Leon Trotsky.

3) AAC XML-Schemas
To produce a digital version of the magazine “Die Weltbühne”, the original
text has to undergo the usual stages of electronic processing: After being
scanned, the text is made readable by means of up-to-date OCR. Then pages,
paragraphs and lines are identified by automatic routines. The application
of markup is the last step in this process. Tags describing contents are
carefully inserted by literary scholars especially trained for this job.
This process, which takes several runs, is accompanied by proofreading
against the original and constant checking and validating of the achieved
results. Literary projects in the past used to employ SGML, very often in
connection with the TEI guidelines. The AAC also makes extensive use of
XML's modular system of specifications. Aside from the basic XML
specification, several other specifications exist, all of them having their
more or less well-defined place within the overall framework. The exact
nature of some of these sub-specifications (XLink, XML Query) is not yet
clear, as everything is very much in a state of flux at the moment. Those
that already have the status of W3C Recommendations are XSLT (XSL
Transformations) and XPath (a language for addressing parts of an XML
document). The implications of others, such as XLink (XML Linking
Language), XPointer (an abstract language that specifies locations) and XQL
(XML Query Language), for literary computing will have to be
considered in due course. As XML comes of age, the issue of a standard way
of defining the structure of documents becomes more and more important. Both
traditional DTDs (document type definitions) and XML Schemas are
technologies that provide such descriptions of document structures. Whereas
DTDs in the traditional sense have been around for some time and are widely
accepted in the field of SGML-based text-encoding, XML Schemas must be
regarded as a fledgling technology that still has to win its spurs.
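
Two of the steps described in this section, the automatic identification of pages, paragraphs and lines and the use of XPath to address parts of an XML document, can be illustrated with a minimal Python sketch. It is not the AAC's tooling. The element names (text, page, p, l), the attribute n, the sample text, and the assumption that OCR output separates pages with a form feed and paragraphs with a blank line are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: OCR output is plain text with a form feed ("\f") between pages
# and a blank line between paragraphs. Real OCR output will differ.
# The sample text is invented, not taken from any magazine.
OCR_TEXT = (
    "First line of page one\n"
    "second line\n"
    "\n"
    "New paragraph\f"
    "Page two, line one\n"
)


def segment(ocr_text):
    """Wrap OCR text in hypothetical <page>/<p>/<l> elements."""
    root = ET.Element("text")
    for page_no, page in enumerate(ocr_text.split("\f"), start=1):
        page_el = ET.SubElement(root, "page", n=str(page_no))
        for para in page.split("\n\n"):
            # Drop empty lines and surrounding whitespace.
            lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
            if not lines:
                continue
            p_el = ET.SubElement(page_el, "p")
            for line in lines:
                ET.SubElement(p_el, "l").text = line
    return root


root = segment(OCR_TEXT)

# XPath-style addressing. ElementTree implements only a limited subset
# of XPath 1.0, which is sufficient for these two expressions.
first_lines = [l.text for l in root.findall("./page/p/l[1]")]
page_two = [l.text for l in root.findall("./page[@n='2']/p/l")]

print(first_lines)
print(page_two)
```

The first expression selects the opening line of every paragraph, the second every line on the page numbered 2. Content markup of the kind inserted by trained scholars would be added on top of such a structural skeleton.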
XML Schemas are commonly regarded as an attempt at an XML answer to the
problem of defining the structure, content and semantics of documents. There
are several arguments in favour of XML Schemas, among which are XML syntax,
object orientation, inheritance, polymorphism and datatyping. Firstly, XML
Schemas follow XML syntax rules, which makes it possible to parse them with
XML tools. Nowadays, authors of XML documents often regard traditional DTDs
as unwieldy and inconsistent with the structure of the overall XML system.
With XML Schemas, validating parsers can be built on the basis of XML
syntax. Secondly, XML Schemas may include explicit restrictions on the data
types an element may hold. They let the text programmer attribute data types
such as strings, numbers (integer, floating point), date and time formats,
boolean and others to elements constituting an XML document. In addition,
XML Schemas are also supposed to allow the text worker to define new data
types to refine the markup system being used. The AAC's experiences in
applying this new technology, focusing on the issue of DTDs and XML Schemas
in processing text corpora, will be described, and some details will be given
of the pilot phase of the AAC's project of establishing a corpus.
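
Two of the arguments above, that XML Schemas can be parsed with ordinary XML tools and that they attach data types to elements, can be made concrete with a short Python sketch. The schema fragment, the element names (issueNumber, published, title) and the sample values are invented for illustration and are not the AAC's schema. The namespace used is that of the final W3C Recommendation, which postdates the Candidate Recommendation listed in the references. The type checks cover only three built-in types in a rough way and assume the conventional xs: prefix; this is not a conforming schema validator.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment: it is itself well-formed XML.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="issueNumber" type="xs:integer"/>
  <xs:element name="published" type="xs:date"/>
  <xs:element name="title" type="xs:string"/>
</xs:schema>
"""

# Hypothetical instance with two deliberate type errors.
INSTANCE = """
<issue>
  <issueNumber>twelve</issueNumber>
  <published>1929-02-30</published>
  <title>Placeholder title</title>
</issue>
"""


def is_integer(text):
    """Rough check for xs:integer: optional sign followed by ASCII digits."""
    return re.fullmatch(r"[+-]?[0-9]+", text.strip()) is not None


def is_date(text):
    """Rough check for xs:date: YYYY-MM-DD naming a real calendar day."""
    m = re.fullmatch(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", text.strip())
    if m is None:
        return False
    try:
        date(*map(int, m.groups()))
    except ValueError:
        return False
    return True


# Assumption: type names are matched literally, including the xs: prefix.
CHECKS = {"xs:integer": is_integer, "xs:date": is_date, "xs:string": lambda t: True}


def declared_types(schema_text):
    """Read top-level element declarations with a generic XML parser."""
    root = ET.fromstring(schema_text)
    return {el.get("name"): el.get("type") for el in root.findall(f"{XS}element")}


def type_errors(instance_text, types):
    """List (element, value, expected type) for values failing their check."""
    errors = []
    for el in ET.fromstring(instance_text).iter():
        check = CHECKS.get(types.get(el.tag))
        if check is not None and not check(el.text or ""):
            errors.append((el.tag, el.text, types[el.tag]))
    return errors


types = declared_types(SCHEMA)
errors = type_errors(INSTANCE, types)
print(types)
print(errors)
```

A DTD could declare the same three elements but only as character data; the distinction between an integer, a date and a free string is what the schema adds.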

Selected references

Deak, Istvan. Weimar Germany's Left-Wing Intellectuals: A Political History of the Weltbühne and its Circle. Berkeley and Los Angeles, 1968.

Dietzel, Thomas, and Hans-Otto Hügel. Deutsche literarische Zeitschriften, 1880-1945: ein Repertorium. München, New York, London, Paris, 1988.

Document Object Model (DOM) Level 3 Core Specification. Version 1.0. W3C Working Draft 01 September, 2000.

Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation 6 October 2000.

Extensible Stylesheet Language (XSL) Version 1.0. W3C Working Draft 27 March 2000.

XML Path Language (XPath) Version 1.0. W3C Recommendation 16 November 1999.

XML Schema Part 0: Primer. W3C Candidate Recommendation 24 October 2000.

XML Schema Part 1: Structures. W3C Candidate Recommendation 24 October 2000.

XML Schema Part 2: Datatypes. W3C Candidate Recommendation 24 October 2000.

XSL Transformations (XSLT) Version 1.0. W3C Recommendation 16 November 1999.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
