How to convert paper archives into a digital data base? Problems and solutions in the case of the Morphology Archives of Finnish Dialects

  1. Mari Siiroinen

    University of Helsinki

  2. Mikko Virtanen

    University of Helsinki

  3. Tatiana Stepanova

    University of Helsinki

The Morphology Archives of Finnish Dialects
The paper archives referred to are the Morphology Archives of
Finnish Dialects, which contain about 500,000 paper file cards
classified and arranged on the basis of almost a thousand
linguistically based catalogue codes. The data the file cards
contain are derived from spontaneous dialectal speech by
linguistically trained field-workers. The data has been analysed
and encoded to determine, for example, types of inflection and
word formation and their use in the sentence, sound changes
in word stems, and particles and their related uses.
The data gathering was accomplished during a 30-year period
(from the 1960s to the 1990s).
The paper archives are located in the Department of Finnish
Language and Literature at the University of Helsinki. Six
copies of the data are held at six different universities
and research institutions in Finland, Sweden and Norway. The
Morphology Archives of Finnish Dialects are closely related to two
other archives of Finnish dialects, namely the Lexical Archive of
Finnish Dialects (Research Institute for the Languages of Finland)
and Syntax Archives of Finnish Dialects (University of Turku).
The purpose of the Morphology Archives of Finnish Dialects has
been to facilitate research on the rich morphology of Finnish
and to provide researchers with well-organised data on the
dialects of different parishes. The Archives cover all the Finnish
dialects quite well, since they consist of 159 parish collections
evenly distributed among Finnish dialects. The archive
collections have served as data sources for over 300 printed
publications or theses.
For additional information about the Archives, see www. /hum/skl/english/research/ma.htm.
The Digital Morphology Archives of Finnish Dialects
Plans to digitize the data in the Archives were first made in the
1990s. The digitization project finally received funding in 2001, and a
project to create a digital database of Finnish dialects (Digital
Morphology Archives, DMA) was then launched. The project was
funded by the Academy of Finland, and it ended in 2005.
The digitization did not simply copy the paper archives.
Instead, the objective has been to create an
independent digital archive, which also contains data not
included in the paper archives, in particular to ensure sufficient
regional representation.
The Digital Morphology Archives currently contain 138,000
clauses in context (around one million words) from 145 parish
dialects of Finnish. So far a total of 497,000 morphological
codes have been added to the dialectal clauses (approx. 4
codes for each clause). In the parish collections, which are
coded thoroughly, each example has been assigned from 5 to
10 codes. This increase in the number of codes will improve
the possibilities of using the DMA for research purposes. The
Digital Morphology Archives are unique in that all the data is
derived from spoken language.
The database was implemented using MySQL, while the search
system is built on HTML. The data are stored at the Finnish
IT Centre for Science (CSC) and have been
accessible in their current form via the internet to licensed users
since 2005. Licences are granted to students and researchers
upon request.
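As a rough illustration of how clauses, parishes and morphological codes could be held in a relational database, the sketch below stores each clause once and links it to any number of codes. The table names, column names, code labels and sample rows are all hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a server. It is not the DMA's actual schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical clause/code layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clause (
            id     INTEGER PRIMARY KEY,
            parish TEXT NOT NULL,
            text   TEXT NOT NULL
        );
        -- One row per (clause, code) pair, so a clause can carry many codes.
        CREATE TABLE clause_code (
            clause_id INTEGER NOT NULL REFERENCES clause(id),
            code      TEXT NOT NULL,
            PRIMARY KEY (clause_id, code)
        );
        """
    )
    # Placeholder rows only; these are not archive data.
    conn.executemany(
        "INSERT INTO clause VALUES (?, ?, ?)",
        [
            (1, "Parish A", "placeholder clause one"),
            (2, "Parish A", "placeholder clause two"),
            (3, "Parish B", "placeholder clause three"),
        ],
    )
    conn.executemany(
        "INSERT INTO clause_code VALUES (?, ?)",
        [(1, "C001"), (1, "C002"), (2, "C002"), (3, "C001")],
    )
    return conn


def clauses_with_code(conn, code, parish=None):
    """Return (id, parish, text) rows tagged with `code`, optionally in one parish."""
    sql = (
        "SELECT c.id, c.parish, c.text FROM clause AS c "
        "JOIN clause_code AS cc ON cc.clause_id = c.id "
        "WHERE cc.code = ?"
    )
    params = [code]
    if parish is not None:
        sql += " AND c.parish = ?"
        params.append(parish)
    sql += " ORDER BY c.id"
    return conn.execute(sql, params).fetchall()
```

Keeping the codes in a separate link table means that adding further codes to an already digitized clause only inserts rows and never changes the clause itself.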
An internet search facility developed jointly with the Language
Bank of Finland (CSC) allows quick and straightforward searches
of the entire material as well as of individual parishes or
dialect areas. Searches can also be made directly in the
dialect texts, and these word and phrase searches can be
targeted at dialect texts without diacritic marks. Searches can
further be refined by limiting them to certain linguistic categories
according to a morphological classification containing 897 codes.
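Searching dialect texts without diacritic marks can be pictured as comparing a stripped form of the query with a stripped form of each clause. The following is a minimal sketch under stated assumptions, not the DMA's implementation. It assumes the transcription is available as Unicode text, and it assumes that the Finnish letters ä, ö and å should be kept intact because they are separate letters rather than diacritic variants. The function names are hypothetical.

```python
import unicodedata

# Assumption: these base-letter/mark pairs form ordinary Finnish letters
# (ä, ö, å) and are therefore kept; every other combining mark is dropped.
KEPT_MARKS = {("a", "\u0308"), ("o", "\u0308"), ("a", "\u030a")}


def strip_diacritics(text: str) -> str:
    """Remove combining marks, except those that make up ä, ö and å."""
    out = []
    base = ""
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            base = ch          # a new base character starts here
            out.append(ch)
        elif (base.lower(), ch) in KEPT_MARKS:
            out.append(ch)     # keep the mark that belongs to ä, ö or å
    return unicodedata.normalize("NFC", "".join(out))


def search_plain(query: str, clauses: list) -> list:
    """Return the clauses whose stripped form contains the stripped query."""
    needle = strip_diacritics(query).casefold()
    return [c for c in clauses if needle in strip_diacritics(c).casefold()]
```

With this approach a query typed on an ordinary keyboard, such as "talo", would also match a transcribed form carrying a length mark over the vowel, while "kylä" and "kyla" remain distinct.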
For additional information about the Digital Morphology
Archives, see /english/research/software/dma
Licence application form: /english/customers/university/useraccounts/languagebank/?searchterm=language%20bank
Problems and Solutions
One of the problems encountered during the process has been
that digitizing the data manually is very slow. In fact, the data
in the digital database still cover only about 5.5% of the paper
archives. Scanning of the paper file cards has been proposed as
a solution. The new challenge would then be to find a sufficiently
powerful OCR program, as the paper cards have mostly been
written by hand.
Another problem has been the presentation of Finno-Ugric
phonetic transcription, which includes symbols and diacritics
that are not part of the ISO-Latin-1 character set. As Unicode
is not yet supported by all programs, characters in the
ISO-Latin-1 character set were chosen to replace some of the
Finno-Ugric phonetic symbols.
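The replacement strategy described above can be sketched as a lookup table that maps each symbol outside ISO-Latin-1 to a stand-in inside it, followed by a check that nothing unencodable remains. The specific substitutions below are invented for illustration, since the text does not list the ones used in the DMA, and the function names are hypothetical.

```python
# Hypothetical substitutions; the actual DMA replacements are not given here.
REPLACEMENTS = {
    "\u014b": "N",  # eng (velar nasal), assumed stand-in
    "\u0161": "S",  # s with caron, assumed stand-in
}


def is_latin1(text: str) -> bool:
    """Return True if every character can be encoded in ISO-8859-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


def to_latin1_safe(text: str, table=REPLACEMENTS) -> str:
    """Apply the replacement table and refuse text that is still unencodable."""
    converted = "".join(table.get(ch, ch) for ch in text)
    leftover = sorted({ch for ch in converted if not is_latin1(ch)})
    if leftover:
        raise ValueError(f"no Latin-1 replacement defined for: {leftover}")
    return converted
```

Raising an error for unmapped symbols, rather than silently dropping them, makes gaps in the table visible before any data is stored. Note that ä and ö are already part of ISO-Latin-1 and pass through unchanged.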
The second phase of digitization was launched in September
2007, with new funding. By the end of 2009, it is estimated that
at least 250,000 new dialectal clauses with linguistic coding will
have been added to the digital archive.

