The linguistic annotation of texts in non-standardized languages: Medieval Romance documents analyzed with Phoenix

  1. Matthias Kopp

    Universität Tübingen (University of Tubingen / Tuebingen)

  2. Martin Gleßgen

    Universität Zürich (University of Zurich)

Phoenix is a tool for historical and comparative linguistics that supports lexicological, graphematic and morphological annotation of complex texts written in non-standardized language varieties. It manages an interactive relationship between a textual database and a lexicographical (as well as onomastic, grammatical or morphological) database. It allows the export of such annotated and enriched data, e.g., in the form of a glossary or a dictionary. Special features are provided for identifying linguistic changes and developments in terms of time, space and textual genre.

Phoenix provides features that are partially covered by other tools for simultaneously processing textual and interpretative databases, for performing lemmatization, and for identifying and quantifying graphematic or morphological items. Its strength lies in its management of complex textual data (i.e. highly segmented data with critical apparatus and a great variety of hierarchical structures) and in the modelling of the consecutive steps required for a philological and linguistic analysis of ancient texts. By emulating the traditional methodological procedure, it helps ensure the quality of work in diachronic linguistics. A comparison with other tools developed for reference corpora (gatto/OVI-TLIO, stella/TLF-ATILF, IDS/DUDEN, SIL) will highlight the distinctiveness of Phoenix.

The annotation of untagged textual data without any comparative terms (lexica of forms) follows three steps:

1. Graphical variants with identical functions in a corpus (e.g. y = i) are automatically merged; this step simultaneously allows for developing and verifying hypotheses about the characteristics of graphematic variation.
2. The occurrences identified in step one are then tagged with lemmas, word classes or graphematic groups.
3. The annotated occurrences are transformed into an interpretative file. This lexicological file is structured by lemma entry; the occurrences of a lemma are registered according to their morphological and semantic qualities (inflectional paradigm, polysemy). These elements can be exported as a glossary, complemented by lexicographical means or interpreted in terms of linguistic change.

The morphological and lexematic annotations can also be based on comparative resources. As soon as a lexicon of the analyzed language variety exists, word classes and lemmata can be identified with the help of a complementary tool (TreeTagger, developed by Stein and Schmidt); Phoenix is able to disambiguate ambiguous results.


Phoenix is based on parameterized modules of TUSTEP (TUebinger System von TextverarbeitungsProgrammen) and on procedures written in the TUSTEP Scripting Language. Analyzing text data with Phoenix involves the following steps. First, a list of word forms is generated from the source text, which has to be available as valid XML, i.e. conformant with a "Phoenix" DTD or schema. This list is linked to the source text by means of pointers. When the ordering keys used for sorting the entries are generated, user-defined equivalents are taken into account: forms that differ in graphical form, spelling or notation may therefore be treated as initially identical, i.e. as belonging together. Equivalents are defined with respect to the diasystematic qualities of the text data (era, region, type of text) and can be applied selectively to the corresponding parts of a complete corpus.
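
The generation of ordering keys with scoped equivalents can be illustrated by a minimal Python sketch. It is not the TUSTEP implementation used by Phoenix; the rule format, the names `EQUIVALENTS`, `ordering_key` and `build_wordform_list`, the metadata key `region`, and the z = s rule are all assumptions made for illustration. Only the y = i equivalence is taken from the text above.

```python
import re
from collections import defaultdict

# Hypothetical equivalence rules: (variant, normal form, scope).
# A scope of None applies the rule to the whole corpus; otherwise the
# rule applies only to documents whose metadata match the scope.
EQUIVALENTS = [
    ("y", "i", None),                      # y = i, as in the text
    ("z", "s", {"region": "Lorraine"}),    # invented rule, for illustration only
]

def ordering_key(form, meta, rules=EQUIVALENTS):
    """Build the sort key of a word form under the applicable equivalents."""
    key = form.lower()
    for variant, normal, scope in rules:
        if scope is None or all(meta.get(k) == v for k, v in scope.items()):
            key = key.replace(variant, normal)
    return key

def build_wordform_list(documents):
    """Group word forms by ordering key, keeping pointers into the source.

    `documents` is an iterable of (doc_id, metadata, text) triples.
    Each pointer is a (doc_id, token position, original form) triple.
    """
    index = defaultdict(list)
    for doc_id, meta, text in documents:
        for position, match in enumerate(re.finditer(r"\w+", text)):
            form = match.group(0)
            index[ordering_key(form, meta)].append((doc_id, position, form))
    return dict(sorted(index.items()))
```

With this sketch, "Roy" and "roi" fall under the same key everywhere, while "amiz" is grouped with "amis" only in documents whose metadata match the scope of the second rule.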

In the next step, the possibly different occurrences collected under one lemma are reworked interactively: attributes are assigned to a single occurrence (or to several occurrences at once). They describe the occurrence(s) in detail, providing for example the lemma, word class, grammatical aspects or grapho-phonetic characteristics. Decisions can be made with reference to the context from which an occurrence is taken, which remains permanently available.

The source text is enriched by means of tags that connect every qualified occurrence in the source text with an entry in an index file. The index file may be integrated into the source data at any time if needed, e.g., for export purposes. In the course of work on a corpus, the index file serves as a reference (cf. 'Attributliste', 'Liste Lexem'). Finally, the information collected in the index file is transformed into a lexicological repository.
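
The relation between tagged occurrences and the index file can be sketched as follows. This is a simplified Python model under stated assumptions, not the Phoenix data format: the classes `IndexEntry` and `Occurrence`, the functions `annotate` and `integrate`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IndexEntry:
    """One entry of the (hypothetical) index file."""
    lemma: str
    word_class: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Occurrence:
    """One word form in the source text; entry_id is the tag into the index."""
    doc_id: str
    position: int
    form: str
    entry_id: Optional[str] = None

def annotate(index, occurrences, entry_id, entry):
    """Register an index entry and tag one or several occurrences with it."""
    index[entry_id] = entry
    for occ in occurrences:
        occ.entry_id = entry_id

def integrate(index, occurrences):
    """Merge the index information back into the occurrences, e.g. for export."""
    enriched = []
    for occ in occurrences:
        record = {"doc": occ.doc_id, "pos": occ.position, "form": occ.form}
        entry = index.get(occ.entry_id) if occ.entry_id is not None else None
        if entry is not None:
            record.update(entry.attributes)
            record["lemma"] = entry.lemma
            record["word_class"] = entry.word_class
        enriched.append(record)
    return enriched
```

Because the occurrences carry only a reference, correcting an index entry changes the description of every occurrence tagged with it; `integrate` is needed only when a self-contained export is required.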

The internal format, based on XML standards, also guarantees that data can be exported to other applications. TUSTEP, the programming language of Phoenix, allows the processing of large corpora. The textual database, encoded in an XML schema, can be transformed into a digital as well as a printed edition that meets the high standards of the typographical tradition for scholarly editions.

Combining the diasystematic qualities of a text (time, space, social prestige, textual genre) with the attributed linguistic data allows cross searches. The identification and quantification of linguistic changes is still to be developed by integrating existing tools that offer the corresponding functionality.
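
What such a cross search might look like can be sketched in a few lines of Python. The record layout (dictionaries with the keys `lemma`, `form` and one key per diasystematic dimension) and the function `cross_search` are assumptions for illustration and do not describe an existing Phoenix feature.

```python
from collections import Counter

def cross_search(records, lemma, dimension):
    """Count the graphical forms of a lemma per value of one diasystematic
    dimension (e.g. "region", "period", "genre").

    `records` is an iterable of dicts, each describing one annotated
    occurrence together with the metadata of its source document.
    """
    counts = Counter()
    for rec in records:
        if rec.get("lemma") == lemma:
            counts[(rec.get(dimension), rec["form"])] += 1
    return counts
```

Counts of this kind, compared across regions or periods, would be a first and very simple basis for quantifying the distribution of a graphematic variant.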

Research on non-standardized languages requires that the analysis of a text be integrated with its edition; these two steps overlap here, whereas they are separate when texts in contemporary languages are studied. The ability to integrate these steps (and, for example, to change qualifications and the resulting glossaries in the course of an edition) is reflected in the close relation between source data and lexicographic database. This is one of the characteristics by which Phoenix differs from other tools providing comparable features.

