A Computerized Corpus of Karelian Dialects: Design and Implementation

poster / demo / art installation
  1. Dmitri Evmenov

    St. Petersburg State University

In my presentation, I intend to cover in detail the main
project I am currently working on, namely designing and
implementing a computerized corpus of the Karelian language.
Over the decades of scientific study of the Karelian language,
initiated in the mid-19th century by Finnish scholars and largely
expanded by Russian/Soviet linguists later on, a large volume
of material, most notably dialectal speech samples, has been
amassed. These data, however, remain largely inaccessible to
research because there are no representative and accessible
tools for presenting this rich material. In my research project
I therefore aim to design and build an annotated computerized
corpus of Karelian dialects and to develop recommendations for
the corpus’ further expansion.
During the first stage of implementation a moderately sized
“pilot corpus” is to be built so that different strategies and
tools for annotation can be developed and tested with its
help. The pilot corpus is to be expanded later on by feeding in
other available source materials.
The pilot corpus shall contain dialectal speech samples
belonging to one dialectal group, Olonets Karelian (Livvi),
mostly because more dialectal speech samples have been
recorded for Olonets Karelian than for the other groups, namely
the Karelian Proper and Ludik dialects. Also, although certainly
endangered for numerous reasons, Olonets Karelian
still shows fewer signs of attrition and of interference from
neighbouring languages (Veps, Finnish, and Russian) than the
other two dialectal groups.
The representativeness of the pilot corpus is to be achieved,
above all, by proportional inclusion of dialectal speech samples
from all language varieties found in the area where the Karelian
language is spoken. In order to better account for dialectal
variation in Karelian, it appears reasonable to
include in the corpus dialectal material from each administrative
division unit (volost), with a volume of 100 000 characters per
unit.
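The per-volost volume target can be illustrated with a minimal sketch. It assumes that samples arrive as (volost, text) pairs and that a sample is accepted only while it still fits within the quota; the function name, the data layout and the acceptance rule are illustrative assumptions, not part of the project design.

```python
from collections import defaultdict

QUOTA = 100_000  # characters per volost, the figure given in the text


def select_samples(samples, quota=QUOTA):
    """Accept samples per volost for as long as they fit within the quota.

    `samples` is an iterable of (volost, text) pairs (hypothetical layout).
    Length is measured with len(), i.e. in Unicode code points, so
    combining diacritics count as separate characters.
    """
    selected = defaultdict(list)
    totals = defaultdict(int)
    for volost, text in samples:
        if totals[volost] + len(text) <= quota:
            selected[volost].append(text)
            totals[volost] += len(text)
    return dict(selected), dict(totals)
```

The returned totals make it easy to see which units are still short of the target volume and need further material.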
Demographic criteria are to be employed alongside
geographic ones during material selection. In grouping the
informants by age, it appears reasonable to
follow the division into “elder” (born in the 1910s-1920s), “middle”
(born in the 1930s-1940s) and “younger” (born in the 1950s-1960s)
groups. As for gender representativeness, equal representation
of male and female informants’ speech in the corpus appears
impossible, at least for the elder groups. The informant’s education
level, place of studies and career biography are all to be taken
into consideration as well.
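The age grouping above could be encoded along the following lines. This is a sketch only: the exact boundary years (for example, reading "1910-1920s" as 1910 to 1929 inclusive) and the function name are assumptions.

```python
from typing import Optional

# Birth-year ranges follow the text; the inclusive end years are an assumption.
AGE_GROUPS = {
    "elder": (1910, 1929),
    "middle": (1930, 1949),
    "younger": (1950, 1969),
}


def age_group(birth_year: int) -> Optional[str]:
    """Return the age-group label for a birth year, or None if out of range."""
    for label, (first, last) in AGE_GROUPS.items():
        if first <= birth_year <= last:
            return label
    return None
```

Returning None for birth years outside the three ranges keeps such informants visible for manual review rather than forcing them into a group.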
It is also necessary to include the information that the
informant provides about her linguistic competence (which
language she considers her native one, how many languages she
knows and can use, and to what extent) and performance (the
domains where she uses the Karelian language, in terms of Joshua
Fishman’s domain theory).
At the beginning of the pilot corpus implementation it is intended
to use already published samples of Karelian dialectal
speech, while at later stages other published and specially
transcribed materials are to be added, mostly those now
stored in the archives of the Institute of Language, Literature
and History of the Karelian Research Center of the Russian Academy
of Sciences (Petrozavodsk, Russia).
Every block of dialectal material included in the corpus is to
be accompanied by metadata, including the following:
- informant data (gender, age, place of birth, duration of stay
in the locality where the recording was made, duration and
circumstances of stays away from the Karelian-speaking area,
native language according to the informant’s own judgment,
the informant’s judgment regarding her mastery of Karelian,
Russian and other languages, language choice habits for
various usage domains);
- data on the situation in which the speech sample was recorded
(dialogue between researcher and informant, recording of the
informant’s spontaneous monological speech, recording of a
dialogue in which two or more informants participate);
it appears reasonable to develop a taxonomy of situations in
order to encode it later on in the corpus;
- the theme of the conversation recorded and transcribed;
in this case it also appears reasonable to develop and
refine an appropriate taxonomy to be employed for data
encoding at a later stage of corpus expansion.
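The metadata listed above might be held in a record structure such as the one below. This is a minimal sketch: all class names, field names and types are hypothetical, and the three situation values simply mirror the recording situations named in the list, pending the taxonomy that is still to be developed.

```python
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Dict, Optional


class Situation(Enum):
    """Recording situations named in the text (taxonomy still to be developed)."""
    INTERVIEW = "researcher-informant dialogue"
    MONOLOGUE = "spontaneous monologue"
    MULTI_PARTY = "dialogue between informants"


@dataclass
class Informant:
    gender: str
    age: int
    place_of_birth: str
    years_in_locality: Optional[int] = None
    time_away: str = ""                    # duration and circumstances, free text
    native_language: str = ""              # by the informant's own judgment
    self_rated_mastery: Dict[str, str] = field(default_factory=dict)
    domain_language_choice: Dict[str, str] = field(default_factory=dict)


@dataclass
class SampleMetadata:
    informant: Informant
    situation: Situation
    theme: str                             # free text until a theme taxonomy exists


# Illustrative values only, not real informant data.
meta = SampleMetadata(
    informant=Informant(gender="f", age=70, place_of_birth="Olonets",
                        native_language="Karelian",
                        self_rated_mastery={"Karelian": "fluent"}),
    situation=Situation.MONOLOGUE,
    theme="childhood",
)
record = asdict(meta)  # nested dict, convenient for later serialization
```

Keeping the self-assessed mastery and the domain-specific language choices as dictionaries leaves room for the list of languages and domains to grow as the corpus expands.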
The detailed way of representing the data in the corpus
(“the orthography”) is to follow the universally accepted
Standard Fenno-Ugric Transcription (Suomalais-Ugrilainen
Tarkekirjoitus), although there are certain challenges,
mainly stemming from combinations of diacritic signs that are
not easily encodable, which require their own solutions to be
developed; implementation details will depend on the
technology and/or software platform chosen
for use. It should be mentioned, though, that the boundaries of
prosodic phrases normally marked in transcribed speech
samples will be preserved in the corpus as well and used later on
for purposes of syntactic annotation.
For morphological annotation, a unified intra-dialectal
morphological tag set is to be developed; the pilot corpus will be
annotated manually, while later stages might be annotated with
the help of existing parsing and tagging software, insofar as
it is applicable for our purposes.
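One possible way to approach the diacritic-combination challenge mentioned above, assuming the corpus is stored as Unicode text, is to normalize every transcription to a single canonical form and to treat a base letter together with its combining marks as one unit. The sketch below uses Python's standard unicodedata module; the choice of NFD and the function name are assumptions, not a decision made in the project.

```python
import unicodedata


def to_clusters(text):
    """Split text into base characters with their combining marks attached.

    The text is first normalized to NFD, so precomposed and decomposed
    spellings of the same letter yield identical clusters. Marks whose
    canonical combining class is 0 are treated as base characters here.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters
```

With this approach a letter carrying several stacked diacritics stays a single searchable unit regardless of whether a precomposed code point exists for it.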
Other design and implementation details, now being actively
worked on, are also to be included in the presentation.

  • Language: English