Building and processing a multilingual corpus of parallel texts

poster / demo / art installation
  1. Peter Stahl

    Julius-Maximilians Universität Würzburg (Julius Maximilian University of Wurzburg)

Work text



Universität Würzburg


University of Tübingen







The need for parallel texts has been discussed in numerous articles in recent
years. This paper focuses on how to set up a corpus of parallel texts, how
to do research on such texts and how to export parallel texts to different
file types.
Parallel texts should be tagged in order to mark structural characteristics
and other additional information.

Setting up parallel texts
The example below is taken from a joint research project between the German
departments of the Universities of Würzburg (Germany) and Jyväskylä
(Finland). The aim of the project was to describe and analyse problems of
contrastive word formation and to explore possibilities of text analysis.
For that purpose a corpus was set up consisting of complete and contemporary
Finnish and German literary and documentary texts and their translations
into the other language. With some 800,000 word forms, the corpus is far too
large for commercial word processing programs to handle. We therefore chose
the 'Tuebingen System of Text Processing Programs' (Tustep) as our software
tool.
One decisive factor in building a parallel corpus of two or more texts is
that, unless you work with semantic features, all basic files that relate to
each other must contain the same number of structural tags. For our purposes
it was sufficient to restrict the alignment to paragraphs.
The following example demonstrates how a German text and its Finnish
translation are aligned so that both can be read on the PC screen.
The extract consists of three short paragraphs which are taken from the
novel Der Tangospieler by Christoph Hein (1989):

5.7 |<p>"Was ist mit Ihrer Hand?" fragte der Beamte, der vor
5.8 |ihm saß und ihm zusah.
5.23 |<p rend=missing>Dann betrachtete er sein Werk, es
6.1 |sah aus wie die Unterschrift eines Achtjährigen. Er
6.2 |nickte zufrieden.
6.29 |<p>Am Bahnhof ging er zum Schalter und verlangte eine
6.30 |Fahrkarte nach Leipzig.

The Finnish version (Sästäjä) reads:

5.7 |<p>"Mikä kättä vaivaa?" kysyi virkailija, joka istui
5.8 |häntä katsellen.
5.25 |<p>Sitten hän tiiraili aikaansaannostaan. Se näytti
5.26 |kahdeksanvuotiaan nimikirjoitukselta. Hän nyökkäsi
6.24 |<p>Asemalla hän meni luukulle ja pyysi lipun Leipzigiin.

Both texts show their original structure as they were published. In order to
do research easily on both texts at the same time, it is necessary to combine
them into one single file and to align all related units. The files are
reformatted; the marked units are given a running number and merged into
one. The newly created file can be edited by using the Tustep text editor.
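
The merging step can also be illustrated outside Tustep. The following Python sketch is not the project's Tustep procedure but a minimal approximation under stated assumptions: each text is a list of lines in which a paragraph starts with a `<p ...>` tag (as in the extracts above), the helper names `split_paragraphs` and `merge_parallel` are hypothetical, and the column width of 45 characters is an assumption chosen so that the source text occupies positions 1 to 45 of every merged line.

```python
import re
import textwrap
from itertools import zip_longest

P_TAG = re.compile(r"<p\b[^>]*>")  # matches <p> as well as <p rend=missing>


def split_paragraphs(lines):
    """Group lines into paragraphs; a new paragraph starts at each <p ...> tag."""
    paragraphs = []
    for line in lines:
        if P_TAG.match(line):
            paragraphs.append(P_TAG.sub("", line, count=1).strip())
        elif paragraphs:
            paragraphs[-1] += " " + line.strip()
    return paragraphs


def merge_parallel(source_lines, target_lines, width=45):
    """Return (running number, text line) pairs with both texts side by side."""
    source = split_paragraphs(source_lines)
    target = split_paragraphs(target_lines)
    # Structural alignment only works if both files hold the same number of tags.
    if len(source) != len(target):
        raise ValueError(
            f"{len(source)} source paragraphs vs. {len(target)} target paragraphs"
        )
    merged = []
    for number, (src, tgt) in enumerate(zip(source, target), start=1):
        left = textwrap.wrap(src, width)
        right = textwrap.wrap(tgt, width)
        # Pad the shorter column with empty lines so the unit stays aligned.
        for left_line, right_line in zip_longest(left, right, fillvalue=""):
            merged.append((number, f"{left_line:<{width}}{right_line}"))
    return merged
```

Calling `merge_parallel(german_lines, finnish_lines)` on the two extracts above would yield three numbered units; a mismatch in the number of `<p>` tags raises an error instead of producing a silently misaligned file.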

Searching in parallel texts
The Tustep text editor provides all the basic functions common to other word
processing programs. In addition, it can handle complex instructions for
pattern matching. You can search for
  • several specific strings such as character strings, words or parts of
    words, at the same time excluding other character strings which you do
    not want to see (an instruction such as so,,,-pf-kl-st-fr--ist- shows
    only (so) the clusters 'pf', 'kl', 'st', 'fr' with the exception of
    'ist');
  • abstract strings such as capital or small letters, any standard or
    extended ASCII character, digits, identical characters, or elements which
    depend upon characters on their left or right border; exclusions can
    also be made (the instruction so,,,/>*>*>=02>=01/ shows any combination
    of two small letters (>*>*), followed by the second one (>=02) and then
    the first one again (>=01); thus, patterns like 'assa', 'elle', and
    'niin' are displayed);
  • any character in combination with a frequency declaration (><3 meaning
    'minimum of 3', ><0 'may be missing', <>9 'maximum of 9', <>0 'any
    number of');
  • any combination of the three above.
And there are many other possibilities.
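
Readers more familiar with regular expressions than with Tustep may find a rough translation helpful. The Python sketch below approximates the two instructions quoted in the list with the standard `re` module. It is an analogy only, not a description of how Tustep evaluates its patterns; the function names and the letter class `SMALL` are assumptions made for this illustration.

```python
import re

SMALL = "a-zäöüß"  # assumed set of small letters, including German umlauts

# Approximates so,,,-pf-kl-st-fr--ist-: the clusters pf, kl, st, fr,
# but not an 'st' that is part of 'ist'.
CLUSTERS = re.compile(r"pf|kl|fr|(?<!i)st")

# Approximates so,,,/>*>*>=02>=01/: two small letters, then the second
# one again, then the first one again (e.g. 'assa', 'elle', 'niin').
MIRROR = re.compile(rf"([{SMALL}])([{SMALL}])\2\1")


def show_only(lines, pattern):
    """Return only those lines in which the pattern occurs (like 'so')."""
    return [line for line in lines if pattern.search(line)]


def mirrored(text):
    """Return every four-letter mirror sequence found in the text."""
    return [match.group(0) for match in MIRROR.finditer(text)]
```

The exclusion of 'ist' is expressed here with a negative lookbehind; Tustep states the exception as a separate string in the instruction itself.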

If you want to find all German nouns in our text above, for example, which
are derived by means of the feminine suffix '-in', you could use an
instruction which searches all lines of the current file from the first to the
forty-fifth position, i.e. only in the German column, to determine whether
there is a capital letter (<*), followed by any number of (<>0) small letters
(>*), a member of character group >1 and a member of string group >2, which
has on its right side (>|) an optional (><0) member of character group >3.
Before entering this search instruction, the three groups just mentioned have
to be defined. The group >1 contains all small letters with the exception of
the vowels, which cannot occur before the suffix '-in'; >2 holds the strings
'in' and 'innen'; and >3 contains all characters that could possibly follow
the noun, such as a blank or a punctuation mark (.,;:!?"). Detailed
information on the syntax of instructions and other topics is given by Stahl
(1996).
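
The search just described can be approximated with a regular expression. The following Python sketch is an assumption-laden analogy, not the Tustep instruction itself: `CONSONANTS` stands in for group >1, the alternation `innen|in` for group >2, and `FOLLOWERS` for group >3; the name `find_in_nouns` is hypothetical, and treating the end of the column as a valid right border is an interpretation of the "optional" member of group >3.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxyzß"  # group >1: small letters minus the vowels
FOLLOWERS = ' .,;:!?"'                  # group >3: blank or punctuation mark

IN_NOUN = re.compile(
    rf"\b[A-ZÄÖÜ][a-zäöüß]*[{CONSONANTS}](?:innen|in)"
    rf"(?=[{re.escape(FOLLOWERS)}]|$)"
)


def find_in_nouns(line, columns=45):
    """Search only the first `columns` characters, i.e. the German column."""
    return IN_NOUN.findall(line[:columns])
```

Like any purely formal pattern, this one returns candidates rather than certainties: a place name such as 'Berlin' matches just as 'Lehrerin' does, so the hits still have to be checked by the researcher.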

As a result, this so-instruction displays the first set of occurrences in
Christoph Hein's novel that fulfil the pattern matching requirements. By
changing so in the instruction above to sa (show around), the context of each
pattern is displayed as well.
The user is able to find these forms by means of pattern matching alone,
without the help of semantic tags. He or she is completely independent and
not restricted to any predefined information.
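
The difference between showing only the hits and showing them in context can be sketched in a few lines of Python. How much context Tustep actually displays is not stated here, so the `radius` parameter (number of lines before and after each hit) and the function name `show_around` are assumptions made purely for illustration.

```python
import re


def show_around(lines, pattern, radius=1):
    """Return each matching line together with `radius` lines of context."""
    hits = []
    for index, line in enumerate(lines):
        if re.search(pattern, line):
            start = max(0, index - radius)
            hits.append(lines[start:index + radius + 1])
    return hits
```

With `radius=0` the function degenerates to a plain list of matching lines, which corresponds to the behaviour described for so.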

Exporting parallel texts
The next example is taken from the material of an international intensive
course on "Multilingual text processing" which was given in Galway in 1997
with support from the EC Erasmus program. It is based upon Die Nachtwachen
von Bonaventura by August Klingemann together with its translations into
English by Gillespie (1972), into Italian by Collini (1990), and into Finnish
by Kolehmainen, Oikarinen and Rahikainen (1997). Among other tasks, one aim of
the course was to align the four texts horizontally sentence by sentence, and
to export them to a PostScript file for printing and to an HTML file for web
publishing. For economic and safety reasons the four text files are kept
separate from each other. The beginning of the German original text file
reads:
[1-9] Erste Nachtwache
[2-9] Die Nachtstunde schlug; ich hüllte mich in
meine abenteuerliche Vermummung, nahm die Pike und
das Horn zur Hand, ging in die Finsterniß hinaus und
rief die Stunde ab, nachdem ich mich durch ein Kreuz
gegen die bösen Geister geschützt hatte.

The three translations have the same structure. The original and the
translations are exported to PostScript by typesetting them with the Tustep
typesetting program (#Satz). Only a few of its typesetting options are needed
to process the four texts into an output of aligned sentences. First, they
are typeset in narrow columns one after the other to determine the
linebreaks. Four new destination files are generated which show the final
layout of all lines. The text blocks holding the grammatical sentences,
however, are still of different length. These units of all text files are
compared with each other to find out which is the longest among the four
languages; if necessary, empty space is inserted into the shorter ones. When
the text is then typeset again with the inserted empty lines, all units are
of equal length.
D. Lewis and P. Stahl describe the source code, which you can download from
(> Programme > Nützliches), as well as possibilities of evaluating a literary
translation.
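
The two-pass procedure described above (determine the linebreaks first, then pad the shorter units with empty lines) can be sketched in Python. This is not the #Satz program and makes no claim about its internals; the function names `break_lines` and `pad_units`, the use of `textwrap` in place of real typesetting, and the column width of 30 characters are all assumptions for the sake of the example.

```python
import textwrap


def break_lines(units, width=30):
    """First pass: break every unit (sentence) into lines of a narrow column."""
    return [textwrap.wrap(unit, width) for unit in units]


def pad_units(texts):
    """Second pass: give the n-th unit the same number of lines in every text.

    `texts` holds one list per language; each list holds the units as lists
    of lines, as returned by break_lines().
    """
    if len({len(units) for units in texts}) != 1:
        raise ValueError("all texts must contain the same number of units")
    padded = [[] for _ in texts]
    for parallel_units in zip(*texts):
        longest = max(len(unit) for unit in parallel_units)
        for column, unit in zip(padded, parallel_units):
            # Append empty lines to every unit shorter than the longest one.
            column.append(unit + [""] * (longest - len(unit)))
    return padded
```

After `pad_units`, the columns can be printed next to each other line by line, because every unit starts on the same line in all languages.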

Output to other formats
The same four basic text files that were used above can also be combined into
one single HTML file. To do this task you need about 40 lines of Tustep code:
3 |
6 |XX .[>/-.[0>=02!1-.[>/>/-.[>=02>=03!1-.
7 |*EOF*
8 |*EOF
9 |
11 |AA .[.
12 |AS1 .[.
13 |ES1 .-.
14 |AES 11
15 |A1 DEIF
16 |SSL 3
17 |*EOF
18 |
20 |
22 |<HTML>
23 |<HEAD><TITLE>Bonaventura</TITLE></HEAD>
24 |<BODY><TABLE WIDTH="100%">
25 |*EOF
26 |
27 |#COPY,-STD-,BV,+,-,*
28 |X .[<>0>/D.<<TR>>[>=02D.
29 |XX .[<>0>/</-<>0>/].
30 |XX <<TD WIDTH="25%" VALIGN="TOP">>[>=02-<=02].
31 |XX .#F+.<<B>>.#F-.<</B>>.#/+.<<I>>.#/-.<</I>>.
32 |XX .<Ä.Ä.<Ö.Ö.<Ü.Ü.\..
33 |XX .>ä.ä.>ö.ö.>ü.ü.ß.ß.
34 |XX .&`/.&<=01grave;.&´/.&<=01acute;...`
35 |XXX -#.<<-"-#.>>-"-·-.-___- -[0-[-
36 |*EOF
37 |
38 |#CONVERT,*,BV,0,-
39 |</TABLE></BODY>
40 |</HTML>
41 |*EOF
42 |
This program file contains all the commands necessary to achieve our goal. In
the first line a temporary file is created and given the file name 'BV'. The
four basic text files are opened for reading (l. 2) and copied to the new
destination file 'BV' (l. 5-8), which is then sorted (l. 10-19). An HTML
header is written to the file 'BV' (l. 21-25); the standard file containing
the sorted units is added to the same destination file, and while the
copying takes place, several source strings are exchanged for new
destination strings to produce HTML tags and entities (l. 27-36). Finally,
three closing tags are added (l. 38-41). A permanent non-Tustep file
('BV.HTM') is created, to which 'BV' is exported (l. 43-44). A web browser
can then display the result.
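
For readers without Tustep, the same procedure (combine the four files unit by unit, wrap each group of parallel units in a table row, and replace special characters by HTML entities) can be approximated in Python. The sketch below is not a translation of the Tustep listing: the function names `encode` and `to_html_table` are hypothetical, the input is assumed to be one list of units per language, and numeric character references are used where the listing exchanges strings for entities.

```python
import html


def encode(text):
    """Escape &, < and >, then turn non-ASCII characters into numeric entities."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_html_table(texts, title="Bonaventura"):
    """Build one HTML page with a table row for each group of parallel units."""
    if len({len(units) for units in texts}) != 1:
        raise ValueError("all texts must contain the same number of units")
    width = 100 // len(texts)  # four texts give 25% per column
    rows = []
    for parallel_units in zip(*texts):
        cells = "".join(
            f'<TD WIDTH="{width}%" VALIGN="TOP">{encode(unit)}</TD>'
            for unit in parallel_units
        )
        rows.append(f"<TR>{cells}</TR>")
    return "\n".join([
        "<HTML>",
        f"<HEAD><TITLE>{html.escape(title)}</TITLE></HEAD>",
        '<BODY><TABLE WIDTH="100%">',
        *rows,
        "</TABLE></BODY>",
        "</HTML>",
    ])
```

The returned string contains ASCII characters only, so it can be written to a file such as 'BV.HTM' without any further encoding decisions.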
All examples show that text files provide the basis for all other tasks. They
can be aligned in such a way that a user is able to search the text for
words and patterns he or she is interested in, or they can be exported to
other coding schemes.






References
Gillespie (1972). The night watches of Bonaventura. University of Edinburgh
Press.
Klingemann, A. Nachtwachen von Bonaventura. Reclam Verlag.
Kolehmainen, Oikarinen and Rahikainen (1997). Ensimmäinen Yövartio.
Available at: > Programme > Nützliches.
Lewis, D. and Stahl, P. Zugriff auf multilinguale Texte: Das Evaluieren einer
literarischen Übersetzung unter Anwendung von Tustep. In: Maschinelle
Verarbeitung altdeutscher Texte V. Beiträge zum Fünften Internationalen
Symposion Würzburg 4. bis 6. März 1997. Max Niemeyer Verlag.
Stahl (1996). Tustep für Einsteiger. Eine Einführung in das "Tübinger System
von Textverarbeitungs-Programmen". Verlag Königshausen & Neumann.


Conference Info


"New Directions in Humanities Computing"

Hosted at Universität Tübingen (University of Tubingen / Tuebingen)

Tübingen, Germany

July 23, 2002 - July 28, 2002


Series: ALLC/EADH (29), ACH/ICCH (22), ACH/ALLC (14)

Organizers: ACH, ALLC

  • Language: English