Library - Slovenian Academy of Sciences and Arts
Jožef Stefan Institute
1. Introduction
This paper presents the digitization of the
Slovenian Biographical Lexicon (SBL). We first
describe the up-conversion of the source OCR text to
richly structured XML, encoded according to the Text
Encoding Initiative Guidelines TEI P5 (TEI, 2008), using
the module on biographical and prosopographical
data. We then discuss some of the more challenging aspects
of the conversion process, in particular the extraction
of metadata, the expansion of abbreviations into their
fully inflected forms, the diachronic nature of the text,
and the effect of language technology tools developed
for Slovenian on the efficiency of
information retrieval for Slovenian texts.
2. The SBL
The Slovenian Biographical Lexicon summarizes the
lives and work of notable figures from Slovenia’s cultural
history. It gives a picture of Slovenia’s cultural life
from its beginnings up to the present by including
those who participated in its cultural development,
were of Slovenian nationality or born in Slovenia,
and were active in the homeland or abroad, as well as
persons of foreign origin whose work among the
Slovenians influenced Slovenian cultural life. It comprises
15 volumes plus an index, with over 3,000 pages or
16 million characters, and covers 5,031 biographical entries
or, as some of the entries are family names, over 5,100
persons. Its publication spanned almost 70 years (1925-1991).
The SBL aims not only to cover a person’s biography
but also to give information about the important literature
depicting their life and work, or to direct the user to the
whereabouts of their unpublished work, photographs,
letters etc. The data in the SBL articles were always
checked against the primary material source. As such,
the SBL is a reliable reference for any relevant scientific
research in the fields of humanities, social sciences and
the history of natural sciences.
To widen the availability of the SBL, the Slovenian
Academy of Sciences and Arts (SASA)¹ and the Scientific
Research Centre of the SASA (SRC SASA)² are undertaking
its digitization in order to make it
freely available online, also enabling kinds of information
retrieval that the printed text does not allow.
3. Up-conversion and TEI P5 encoding
The SBL was first scanned and the text OCRed. The
OCR output was then semi-manually corrected and,
via a series of automatic steps, converted into a rich
TEI encoding. The first step, using the OpenOffice text editor
and the associated XSLT TEI stylesheet,³ converted
the source text into a basic TEI structure (Erjavec and
Ogrin, 2005). Since some metadata were already available
in the form of an Access database, these were exported
to XML, following the TEI P5 module on
biographical and prosopographical data. Additional
metadata, such as the information for the <floruit> element
defining the exact period of each occupation when an entry
lists more than one for a person, are then added to
the metadata structure manually. From here, metadata
are added to the entries, and, in a related process, the text
of the entries is further annotated.
Figure 1: TEI-XML document excerpt with the
<listPerson> structure
3.1. The TEI structure
The encoding scheme of the SBL follows the TEI P5
Guidelines, in particular the module on biographical
and prosopographical data. P5 introduces elements for
structured biographical entries: the information
contained in the text is extracted and encoded separately
in the <person> element, which holds information
on the name(s), sex, nationality, faith, dates of birth
and death, residence, occupation, important
life events, etc. In addition to the text, each entry of the
lexicon thus also contains metadata, useful for searching
and organising the SBL.
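To make the <person> structure concrete, the following minimal Python sketch builds such an element from a flat metadata record. The element names (<persName>, <birth>, <death>, <occupation>, <floruit>) come from the TEI P5 module; the record layout, the identifier, the sample person and the way each <floruit> is paired with an occupation are invented for illustration and are not taken from the SBL encoding.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample record; the field names are assumptions, not the SBL schema.
SAMPLE_RECORD = {
    "id": "sbl-example",
    "forename": "Janez",
    "surname": "Novak",
    "born": "1900-01-01",
    "died": "1980-12-31",
    "occupations": [
        {"label": "učitelj", "from": "1925", "to": "1940"},
        {"label": "urednik", "from": "1940", "to": "1965"},
    ],
}


def build_person(record):
    """Build a TEI <person> element from a flat metadata record (a dict)."""
    person = ET.Element("person", {XML_ID: record["id"]})

    name = ET.SubElement(person, "persName")
    ET.SubElement(name, "forename").text = record["forename"]
    ET.SubElement(name, "surname").text = record["surname"]

    # Dates go into the @when attribute in ISO format.
    ET.SubElement(person, "birth", {"when": record["born"]})
    ET.SubElement(person, "death", {"when": record["died"]})

    # One <occupation> per occupation, plus a <floruit> giving its period.
    for occ in record["occupations"]:
        ET.SubElement(person, "occupation").text = occ["label"]
        period = {"from": occ["from"], "to": occ["to"]}
        ET.SubElement(person, "floruit", period).text = occ["label"]

    return person


if __name__ == "__main__":
    print(ET.tostring(build_person(SAMPLE_RECORD), encoding="unicode"))
```

In a real conversion the records would come from the exported database rather than from a literal dictionary, and the resulting elements would be collected under a <listPerson> element.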
3.2. Extracting metadata
Some metadata not strictly connected with the person
who is the topic of the entry, such as other persons
named in the articles, have so far been extracted manually
from the SBL text, so we are exploring the possibilities
of automatic encoding to speed up the
process. In part, this can be done on the basis of the existing
indexes of the SBL and of external resources, such
as gazetteers. These, coupled with the power of regular
expressions in Perl, can extract the relevant terms, such
as dates, with fairly high accuracy. A more principled
solution would be to implement a general Named Entity
Recognition (NER) system for Slovenian, which would
recognize and categorise names of persons and places,
dates and other numeric expressions. Such a system does
not yet exist for Slovenian, and we are exploring the possibility
of writing NER rules for one of the widely used
human language technology toolsets, such as GATE⁴ or
NooJ⁵ (Bekavac, 2002).
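The regular-expression approach to dates can be sketched as follows. The sketch uses Python rather than Perl, and it assumes that dates appear in the numeric form "day. month. year" (for example "3. 2. 1850"); the actual date formats in the SBL and the patterns used in the project are not described here, so both the pattern and the function name are illustrative only.

```python
import re

# Assumed date shape: "3. 2. 1850" or "3.2.1850" (day, month, four-digit year).
DATE_RE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")


def tag_dates(text):
    """Wrap recognised dates in a TEI <date> element with an ISO @when value."""

    def replace(match):
        day, month, year = (int(group) for group in match.groups())
        # Leave implausible matches untouched instead of guessing.
        if not (1 <= day <= 31 and 1 <= month <= 12):
            return match.group(0)
        iso = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<date when="{iso}">{match.group(0)}</date>'

    return DATE_RE.sub(replace, text)


if __name__ == "__main__":
    print(tag_dates("Rojen 3. 2. 1850 v Lj."))
```

The function works on plain strings, so it would have to run before other markup is added or be adapted to operate on text nodes only. Dates with month names or abbreviations would need further patterns or a lexicon.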
3.3. Abbreviations
SBL is written in encyclopaedic style, which means
dense language and many abbreviations. There are
several types of abbreviations: a) bibliographic abbreviations
(e.g. RDHV for Razprave Znanstvenega društva
za humanistične vede); b) abbreviations to denote the
authors of SBL articles (e.g. Rš for Fran Ramovš); c)
abbreviations to refer to a biographical entry within an
article (e.g. A. for Abraham); d) abbreviations for certain
geographic names (e.g. Lj. for Ljubljana or Clvc
for Celovec); e) general abbreviations (e.g. the names of
months).
While the use of abbreviations was appropriate for the
printed books, in the digital edition it would only impair
readability and make searching more difficult. The abbreviations
are therefore expanded into their full forms and
encoded with the TEI <choice>, <abbr> and <expan>
elements. The basic resources for the expansion
are the lists of bibliographic abbreviations and of abbreviations
denoting the authors of articles from the printed
edition, and an additional abbreviation lexicon, semi-automatically
compiled from the SBL.
A problem that occurs in the expansion of abbreviations
stems from the fact that Slovenian, like other
Slavic languages, is highly inflected. This means that
abbreviations have to be expanded into their
full forms in the correct inflection, which depends
on the context of the abbreviation. So, for example,
“Rojen v Lj.” (Born in Ljubljana) should be expanded
to “Rojen v Ljubljani”, with “Ljubljana” in the locative
case. This problem is dealt with by automatic morphosyntactic
tagging of the text, and then, on the basis of the
tag assigned to an abbreviation, generating the appropriate
inflected form of the lemma. This work uses tagging
and lemmatisation models automatically obtained from
annotated corpora and morphological lexicons, which
have the advantage of generalising to out-of-vocabulary
words (Erjavec and Džeroski, 2004).
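The expansion step can be illustrated with a small Python sketch that picks the inflected form from a lexicon and wraps it in <choice>, <abbr> and <expan>. The sketch does not use the project's tagger: as a stand-in it guesses the case from the preceding preposition, which is a deliberate simplification (for instance, "v" can also govern the accusative). The lexicon layout and all function and variable names are assumptions made for this example.

```python
from xml.sax.saxutils import escape

# Toy abbreviation lexicon: abbreviation -> grammatical case -> full form.
LEXICON = {
    "Lj.": {
        "nominative": "Ljubljana",
        "genitive": "Ljubljane",
        "accusative": "Ljubljano",
        "locative": "Ljubljani",
    },
}

# Stand-in for a morphosyntactic tagger: case guessed from the preposition.
PREPOSITION_CASE = {"v": "locative", "iz": "genitive"}


def expand_abbreviations(tokens):
    """Return the tokens as a string with known abbreviations expanded in TEI."""
    output = []
    for index, token in enumerate(tokens):
        if token in LEXICON:
            previous = tokens[index - 1].lower() if index > 0 else ""
            case = PREPOSITION_CASE.get(previous, "nominative")
            full_form = LEXICON[token][case]
            output.append(
                f"<choice><abbr>{escape(token)}</abbr>"
                f"<expan>{escape(full_form)}</expan></choice>"
            )
        else:
            output.append(escape(token))
    return " ".join(output)


if __name__ == "__main__":
    print(expand_abbreviations(["Rojen", "v", "Lj."]))
```

With a real tagger, the lookup key would be the case (and number) taken from the tag assigned to the abbreviation, and forms missing from the lexicon would be generated from the lemma rather than looked up.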
3.4. Language change
Another challenge regarding the language of SBL is due
to its long publication period of almost 70 years. In this
period the language, particularly its lexical aspects, has
changed significantly. Changes affect particularly geographic
names (e.g. Curih instead of Zürich) and terms
denoting occupations and activities, from spelling variants
to complete substitution of words. These changes
will have to be taken into account to ensure adequate
information retrieval. The plan is to add normalised
(contemporary) forms to the forms in the text, by encoding
them with the appropriate <choice>, <reg> and
<orig> elements. The annotation will be automatic, but
based on a lexicon of variants, semi-automatically compiled
from the SBL. This lexicon, in itself an interesting
diachronic language resource, is being built on the basis
of a textual analysis of SBL, taking into account external
language resources, such as dictionaries and gazetteers,
e.g. of old and new names of places.
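A lexicon-driven normalisation of this kind could look like the Python sketch below, which wraps each known variant in <choice> with the form from the text in <orig> and the contemporary form in <reg>. The single lexicon entry follows the Curih/Zürich example above, read here as the older form being replaced by the contemporary one; the function name and the matching strategy are assumptions, and the sketch matches whole words only, so inflected variants would need their own entries or a lemmatiser.

```python
import re
from xml.sax.saxutils import escape

# Toy lexicon of variants: form found in the text -> contemporary form.
VARIANTS = {"Curih": "Zürich"}

# Longest alternatives first, so that longer entries win over their prefixes.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(form) for form in sorted(VARIANTS, key=len, reverse=True))
    + r")\b"
)


def normalise(text):
    """Return text with known variants encoded as <choice><orig/><reg/></choice>."""
    parts = []
    last_end = 0
    for match in _PATTERN.finditer(text):
        original = match.group(1)
        # Escape the untouched text between matches so the result stays valid XML.
        parts.append(escape(text[last_end:match.start()]))
        parts.append(
            f"<choice><orig>{escape(original)}</orig>"
            f"<reg>{escape(VARIANTS[original])}</reg></choice>"
        )
        last_end = match.end()
    parts.append(escape(text[last_end:]))
    return "".join(parts)


if __name__ == "__main__":
    print(normalise("Curih, 1890"))
```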
4. Access
The SBL is to be made freely accessible for full-text and
structured searching. We are exploring the possibility
of using the open source Apache Solr⁶ search platform,
based on the Lucene Java search library, which supports
such queries, accepts TEI/XML documents and
allows plugins and extra features, such as a
lemmatization tool for Slovenian. No research has yet been
done, however, on the effect of lemmatization on the efficiency of
information retrieval for Slovenian.
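As an illustration of combined full-text and structured searching, the Python sketch below only assembles the URL of a Solr select request; it does not contact a server. The parameters q, fq, rows and wt are standard Solr query parameters, while the base URL and the field name used in the example are hypothetical, since the index schema for the SBL is not specified.

```python
from urllib.parse import urlencode


def build_query_url(base_url, text, filters=None, rows=10):
    """Assemble a Solr select URL: free-text query plus optional field filters."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    # Each structured constraint becomes its own filter query (fq).
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint and field name, for illustration only.
    print(
        build_query_url(
            "http://localhost:8983/solr/select",
            "Ljubljana",
            filters={"occupation": "pesnik"},
        )
    )
```

Whether a query for "Ljubljana" also finds inflected forms such as "Ljubljani" would depend on how lemmatization is applied at indexing and query time.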
References
Bekavac, Božo (2002): Strojno obilježavanje hrvatskih tekstova
– stanje i perspektive (Computer annotation of
Croatian texts – current state and perspectives). Suvremena
lingvistika 53-54, p. 173-182.
Cankar, Izidor et al. (eds) (1925-1991): Slovenski biografski
leksikon. Ljubljana: Slovenska akademija znanosti
in umetnosti.
Erjavec, Tomaž and Džeroski, Sašo (2004): Machine
Learning of Morphosyntactic Structure: Lemmatising
Unknown Slovene Words. Applied Artificial Intelligence
18 (1), p. 17-40.
Erjavec, Tomaž and Ogrin, Matija (2005): Digitalisation
of Literary Heritage Using Open Standards. In: Paul
Cunningham, Miriam Cunningham (eds.). Innovation
and knowledge economy: issues, applications, case studies,
(Information and communication technologies and
the knowledge economy). Amsterdam [etc.]: IOS Press,
2005, p. 999-1006.
TEI Consortium, eds. (2008): TEI P5: Guidelines for
Electronic Text Encoding and Interchange. TEI Consortium.
http://www.tei-c.org/Guidelines/P5/.
Notes
1 http://www.sazu.si/
2 http://www.zrc-sazu.si/
3 http://www.tei-c.org/wiki/index.php/TEI_OpenOffice_Package
4 http://gate.ac.uk/
5 http://www.nooj4nlp.net/
6 http://lucene.apache.org/solr/
Hosted at University of Maryland, College Park
College Park, Maryland, United States
June 20, 2009 - June 25, 2009
176 works by 303 authors indexed
Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/
Series: ADHO (4)
Organizers: ADHO