Single Source Processing Of Historic Corpora For Diverse Uses

poster / demo / art installation
  1. Thorsten Trippel

    Department of Linguistics - Universität Tübingen (University of Tubingen / Tuebingen)

Work text

The objective of this paper is the restructuring of an existing corpus
of historical Portuguese for multiple purposes.
The Tycho Brahe Parsed Corpus of Historical Portuguese (*) includes 40
texts (a total of 1,851,619 words) written by Portuguese authors
born from 1496 to 1845. The initial goal of this corpus was to provide
annotated texts for the research of language change in European
Portuguese. First, a morphological coding system to facilitate
automatic search of lexical items was developed. This morphological
coding system provided a mapping between linguistic description needs
and computational demands for an automatic morphological tagger
(cf. BRITTO et al., 1999; GALVES & BRITTO 2003; FINGER, 1998). Another
objective of this first system was to provide input for the following
stage, syntactic annotation (BRITTO, 2002).
The texts in the corpus are made available in three formats:
transcribed, annotated for parts of speech, and annotated for
syntactic structures.
The transcript texts were prepared according to the requirements of
part of speech tagging and syntactic annotation, that is, they were
made machine processable. The processability depended on a format
normalization of the original material, which included correcting
typographical errors, resolving character encoding problems and
converting historically varying orthography into standard (modern)
orthography. The original forms were preserved and marked up in this
process; the modifications, unfortunately, were included without
special markup. This method was adopted because the purpose of this first
preparation was, crucially, to make the texts adequate for automated
annotation, which would be applicable only to the modified material.
However, as the corpus became available to a wider range of
researchers, other uses of the presented data became apparent. Many of
the included texts are only available with restricted access or at a
limited number of libraries (some sources are originals, manuscripts
or single preserved copies), which made them highly attractive for
historians and other researchers in the humanities. These purposes
impose different requirements, such as readability, preservation of
the original data and text design structures, and easy access; on the
other hand, linguistic markup such as POS tags turns out to be
irrelevant. Another field that opened up was lexicography: the
rich corpus material was used for the creation of a lexicon of
historical Portuguese, mining the information on the modifications for
automatic syntactic tagging.
All this indicated that the ideal structure of the corpus would be one
that integrated different versions of the same original
material. Remodelling the original corpus into such a structure is the
aim of the present work.
Using the sources for all these different and possibly more purposes
(typological analysis, philological, philosophical and philanthropic
research, lexicography) requires the inclusion of all available
information on the source in one single source document. The single
source approach was taken because the resource is then easy to
maintain (some of the resources were OCRed, with all the limitations
and possible errors this entails) and easy to distribute in different
output formats. What is more, the problem of property rights and
archiving can be solved once, and the original source can be better
preserved if researchers find the necessary information already in an
electronic form.
Therefore, all available information needs to be made explicit and
well structured in a machine readable format, enabling automatic
consistency checks of the resource to discover annotation errors or
systematic problems, and enabling machine processing for
transformation into the data formats required for the intended
purpose, such as output formatting for readability but also an output
format for automatic tagging.
All this had to be done with a minimum of information loss --- if any
--- and with the preservation of the original data. Additionally,
inconsistencies needed to be editable in order to correct them for
later purposes.
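A consistency check of this kind can be sketched in a few lines of
Python. The sketch is only an illustration under stated assumptions: it
presumes an XML encoding in which each spelling variation is grouped in
a hypothetical `var` element with exactly one `orig` and one `reg`
child (these element names are not given in this abstract), and it
reports every group that breaks this pattern so that it can be
corrected by hand.

```python
import xml.etree.ElementTree as ET


def check_variations(xml_text):
    """Return a list of human-readable problems found in <var> groups.

    Assumption: <var>, <orig> and <reg> are hypothetical element names
    chosen for this sketch, not the names used in the actual corpus.
    """
    problems = []
    root = ET.fromstring(xml_text)
    for number, var in enumerate(root.iter("var"), start=1):
        for tag in ("orig", "reg"):
            found = var.findall(tag)
            if len(found) != 1:
                problems.append(
                    f"var {number}: expected one <{tag}>, found {len(found)}"
                )
            elif not (found[0].text or "").strip():
                problems.append(f"var {number}: empty <{tag}>")
    return problems


if __name__ == "__main__":
    broken = "<p>A <var><orig>phrase</orig></var> curta.</p>"
    for problem in check_variations(broken):
        print(problem)
```

An empty result list means that every variation group carries both
forms; any other result points to the groups that need editing.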
A simple strategy was taken for this purpose:
1. Linebreaks and pagebreaks were to be marked up and, if available,
typographic variations such as italics or boldface characters as well.
2. Editions had to be marked up for the altered typography, where the
original was properly marked up and the editions were identifiable.
3. Metadata for these sources was to be added to the files, including
the title and location of the original, format information and
editorial and technical data.
The method of choice for single source representation is, of course,
using XML markup (BRAY, 2000), creating a document grammar such as a
DTD or Schema and transforming the document into an XML document which
can be validated against this document grammar. Only the original
spelling variation was marked up in the source documents, but not the
modernized versions. To preserve all available information and to
mark up the information on the modernization of orthography, a simple
parser was implemented in Perl.
Linebreaks and page-breaks, as well as other typographic variations,
were marked up in a way similar to the TEI P4 standard.
The strategy for marking up editions was adopted from the TEI P4 (see
Section 6.5.2 Regularization and Normalization) with the exception
that the original and regularized version both were put into elements
instead of attributes, grouped in another element for the
variation. The reason for this was to express the comparable status of
the regularized and original versions and to ease
implementation. Nevertheless, a TEI-conformant markup could be produced
with a simple transformation.
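The grouping described here, and the step back to an attribute-based
form, might look roughly as follows. This is a minimal sketch, not the
project's actual document grammar: the element name `var` for the
grouping element is hypothetical, and Python is used only for
illustration. The target form, `orig` with a `reg` attribute, follows
the TEI P4 convention for regularization.

```python
import xml.etree.ElementTree as ET

# Assumption: <var> is a hypothetical name for the grouping element;
# it holds the original and the regularized form as sibling elements.
SAMPLE = "<p>Esta <var><orig>phrase</orig><reg>frase</reg></var> do texto.</p>"


def to_tei_p4_style(root):
    """Rewrite each <var> group in place as <orig reg="...">original</orig>."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "var":
                continue
            tei = ET.Element("orig", {"reg": child.findtext("reg", default="")})
            tei.text = child.findtext("orig", default="")
            tei.tail = child.tail  # keep the running text after the group
            parent.remove(child)
            parent.insert(index, tei)
    return root


if __name__ == "__main__":
    converted = to_tei_p4_style(ET.fromstring(SAMPLE))
    print(ET.tostring(converted, encoding="unicode"))
```

Keeping both forms as elements treats them symmetrically, while the
conversion shows that no information is lost when an attribute-based
encoding is required.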
The metadata that was already available in the source documents was
easy to categorize using the Dublin Core metadata element set (see
DUBLINCORE 2003); therefore this simple structure was used instead of
a standard TEI header. TRIPPEL & BAUMANN (2003) describe a mapping
between the TEI header and various other metadata standards.
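As an illustration of how lightweight such a record is, the following
sketch builds a Dublin Core description with Python's standard
library. The wrapper element `metadata` and all field values are
placeholders invented for this example; only the fifteen element names
and the namespace come from the Dublin Core Metadata Element Set,
Version 1.1.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set 1.1.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build a <metadata> element (hypothetical wrapper) with dc:* children."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


if __name__ == "__main__":
    # Placeholder values only; they do not describe a real corpus text.
    example = dublin_core_record({
        "title": "Example text title",
        "creator": "Example author",
        "format": "text/xml",
        "language": "pt",
    })
    print(ET.tostring(example, encoding="unicode"))
```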
Once the data was available in this standard normalized XML format,
XSL Transformations (CLARK, 1999), for example into HTML, made it
possible to create:
1. Historic text presentation (research in the humanities) for web interfaces
2. A normalized text presentation for further tagging and syntactic parsing
3. A lexicon of historic orthographic variations
4. A word frequency lexicon
5. A catalogue of texts, as a library access function to the corpus.
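How several of these outputs can be derived from one source document
is sketched below. The work described here uses XSL Transformations;
the sketch uses Python instead, purely for illustration, and assumes
hypothetical element names: `var` grouping one `orig` and one `reg`
form, and an empty `lb` element for a line break. It derives the
historic presentation, the normalized text, a lexicon of orthographic
variations and a word frequency list from the same input.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: <var>, <orig>, <reg> and <lb/> are illustrative names.
SAMPLE = (
    "<text><p>A <var><orig>phrase</orig><reg>frase</reg></var> termina aqui;"
    "<lb/>outra <var><orig>phrase</orig><reg>frase</reg></var> segue.</p></text>"
)


def render(element, view):
    """Flatten to plain text, choosing 'orig' or 'reg' inside each <var>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "var":
            parts.append(child.findtext(view, default=""))
        elif child.tag == "lb":
            # Keep original line breaks only in the historic presentation.
            parts.append("\n" if view == "orig" else " ")
        else:
            parts.append(render(child, view))
        parts.append(child.tail or "")
    return "".join(parts)


def variation_lexicon(root):
    """Map each regularized form to the set of original spellings found."""
    lexicon = {}
    for var in root.iter("var"):
        reg = var.findtext("reg", default="")
        lexicon.setdefault(reg, set()).add(var.findtext("orig", default=""))
    return lexicon


def word_frequencies(root):
    """Count lower-cased word forms in the normalized text."""
    return Counter(re.findall(r"\w+", render(root, "reg").lower()))


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(render(root, "orig"))
    print(render(root, "reg"))
    print(variation_lexicon(root))
    print(word_frequencies(root).most_common(3))
```

Because every view is computed from the same document, a correction
made once in the source is reflected in all derived outputs.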
The re-structured corpus will soon be available via a web interface, as well as a concordance for interactive search within the data.
This work was made possible through funding by:
DAAD within the PROBRAL project,
German Research Foundation (DFG) in the project "Texttechnological
modelling of information: Theory and Design of multimodal lexica"
The Tycho Brahe Parsed Corpus of Historical Portuguese is available
to scholars without fee for educational and research purposes at
The annotation scheme in the Tycho Brahe Parsed Corpus of Historical
Portuguese was designed by a team led by Helena Britto and
Charlotte Galves, at the University of Campinas (Instituto dos
Estudos da Linguagem, Universidade de Campinas IEL, UNICAMP); it is
strongly inspired by the scheme designed by Anthony Kroch and Ann
Taylor for the Penn-Helsinki Parsed Corpus of Middle English.
Philological consulting was provided by Ivo Castro and Ana Maria
Martins from the Classical University of Lisbon. The Institute of
Mathematics and Statistics of the University of São Paulo (IME-USP)
provides support and computational resources; the part-of-speech
tagger used for the automatic annotation of the corpus was
implemented by Marcelo Finger. The construction of the corpus is part
of the project "Rhythmic patterns, parameter setting and language
change", coordinated by Charlotte Galves.
BRAY, Tim, PAOLI, Jean, SPERBERG-McQUEEN, C. M., MALER, Eve (2000) - Extensible Markup Language (XML) 1.0 (Second Edition) - W3C Recommendation.
BRITTO, Helena (2002) - "The Tycho Brahe Corpus and the basis for an
automated parser for Portuguese data";
BRITTO et al. (1999) - Morphological annotation system for automatic tagging of electronic textual corpora: from English to Romance languages; in: Centro de Lingüística Aplicada (ed) Proceedings of the 6th International Symposium of Social Communication, Santiago de Cuba: Editorial Oriente, 582-589.
CLARK, James (1999) - XSL Transformations (XSLT) Version 1.0 - W3C Recommendation.
DUBLINCORE (2003) - Dublin Core Metadata Element Set, Version 1.1: Reference Description.
FINGER, Marcelo (1998) - Tagging a morphologically rich language; in:
P. Sojka, V. Matousek, K. Pala, and I. Kopecek (eds)
Proceedings of the 1st Workshop on Text, Speech and Dialogue
TSD 98. Brno: Masaryk University, 39-44.
GALVES, Charlotte & BRITTO, Helena (2003) - A Construção do Corpus Anotado do Português Histórico Tycho Brahe: o sistema de anotação morfológica.
SPERBERG-McQUEEN, C. M. & BURNARD, Lou (2001) - Guidelines for
Electronic Text Encoding and Interchange - TEI P4.
TRIPPEL, Thorsten & BAUMANN, Tanja (2003) - Metadaten für
Multimodale Korpora: Verwendung im ModeLex-Projekt- ModeLex Technical
Report No 4. Bielefeld: Bielefeld University.


Conference Info



Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

  • Keywords: None
  • Language: English
  • Topics: None