The LInguistic and Cultural Heritage Electronic Network (LICHEN): A New Electronic Framework for the Collection, Management, Online Display, and Exploitation of Multimodal Corpora

  1. Lisa Lena Opas-Hänninen

    University of Oulu

  2. Matti Hosio

    University of Oulu

  3. Ilkka Juuso

    University of Oulu

  4. Tapio Seppänen

    University of Oulu


The international, interdisciplinary and multilingual
LICHEN project, initiated by the Department of English
and the MediaTeam research group (Dept. of Electrical and
Information Engineering, MediaTeam 2006) at the University
of Oulu and the SCOTS corpus project at the University of
Glasgow (Scottish Corpus of Texts and Speech, 2006), focuses
on the languages and cultures of the northern circumpolar
region. Its underlying assumption is that language and culture
are as important to the survival and well-being of populations
as more obvious ecological, social and health issues and thus
it is also a member of the Circumpolar Health and Wellbeing
research programme run by the Centre for Arctic Medicine,
University of Oulu (Thule Institute 2006).
The aim of the project is two-fold. Firstly, it aims to
collect, preserve and disseminate information about the
languages spoken in the circumpolar region, thus also enabling
research on them. This will also help to promote the linguistic
confidence and self-image of the speakers of these languages,
strengthening their cultural awareness and facilitating
cross-cultural communication between these peoples in an age
of rapid global change (Winsa 1998).
Secondly, and more importantly, the project aims to create an
electronic framework for the collection, management, online
display, and exploitation of existing corpora of the languages
of the circumpolar regions, which is also applicable to other
corpora that represent regional, social and other varieties of
languages. Humanities computing researchers, in particular,
have long recognized the need for new, more sophisticated
tools to aid scholarly research of textual data, not to mention
tools that would be able to handle multimodal data. Although
a number of tools have been developed, they suffer from various
restrictions, e.g. they are only applicable to the data they were
developed for, importing data is laborious, user interfaces and
encoding standards are outdated, considerable expertise in
programming is assumed, no support for multilinguality is
included, or they promise more than they offer. While there
have been some very promising advances made in this direction
(e.g. TAPOR Tools 2006), it is clear that more tools are needed.
The framework being developed in this project is intended to
be the equivalent of an extendable toolbox for corpus linguists.
It will attempt to offer much-needed functionality in an
easy-to-use package, which is shaped and extended according
to real user needs. Initially, emphasis will be given to the
implementation of the text capabilities of the system, but other
modalities (such as audio and video) are also taken into account.
The idea is to facilitate queries into a multimodal database using
both proven and novel ways of finding and displaying
information (Seppänen 2006). Metadata and metadata
visualisation, particularly in conjunction with the new
modalities, will be essential in achieving this. While we support
the use of best practices for the collection, preservation and
presentation of corpus data, we also recognize that some data,
particularly legacy data, may not conform to such practices, and
the shell must also support such data (Kretzschmar et al. 2006).
The system will also make migration to and from other tools
straightforward by offering import and export features for
commonly used programs. It will enable users to bring in their
own data, which they can keep private or make public using
the built-in web functionality. The database will also be capable
of handling several different versions of any document (for
example, revisions, interpretations or translations); these
versions are linked, and the links can be exploited in queries.
Queries can be made using regular expressions, which may combine
free-form text (words, phrases) and part-of-speech tags, for
example.
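As an illustration only, a query that combines free-form words with part-of-speech tags could be sketched in Java as follows. The word/TAG token format, the Penn Treebank-style tags (DT, JJ, NN, VBD), the invented sample sentence, and the class and method names are all assumptions made for this sketch; none of them is taken from the LICHEN implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regular-expression queries over text in which every
// token is stored as word/TAG and tokens are separated by single spaces.
public final class TaggedQueryDemo {

    /** Returns every span of the tagged text that the query regex matches. */
    public static List<String> search(String taggedText, String queryRegex) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(queryRegex).matcher(taggedText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample sentence with assumed Penn Treebank-style tags.
        String tagged = "the/DT wee/JJ lass/NN sang/VBD a/DT bonnie/JJ tune/NN";

        // Tag-only query: any adjective followed by any singular noun.
        // The trailing \b stops /NN from also matching a longer tag such as /NNS.
        System.out.println(search(tagged, "\\S+/JJ \\S+/NN\\b"));

        // Mixed query: the literal word "wee" followed by any singular noun.
        System.out.println(search(tagged, "\\bwee/JJ \\S+/NN\\b"));
    }
}
```

Keeping the tag attached to each token lets a single regular expression constrain the word, the tag, or both, at the cost of requiring the user to know the storage format.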
The system is implemented in Java, making it
platform-independent and taking advantage of the many
technological components developed for that language. The ultimate goal of
the development of the computing tools is a shell which can be
adapted to any language. Therefore, support for multiple
languages and a variety of character encoding schemes are
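To give a rough idea of what handling several character encoding schemes can involve in Java, the sketch below decodes bytes stored in a named legacy encoding and re-encodes them as UTF-8. The class and method names, and the choice of ISO-8859-1 as the legacy encoding, are assumptions made for illustration; the encodings actually handled by the project are not specified here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: bringing text from a legacy encoding into UTF-8.
public final class EncodingDemo {

    /** Decodes raw bytes into a Java (Unicode) string using the named charset. */
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Re-encodes bytes stored in the named charset as UTF-8 bytes. */
    public static byte[] toUtf8(byte[] raw, String charsetName) {
        return decode(raw, charsetName).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Meänkieli", written with a Unicode escape so that the source file
        // itself does not depend on any particular encoding.
        String word = "Me\u00e4nkieli";

        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");

        // The letter ä takes one byte in ISO-8859-1 and two bytes in UTF-8.
        System.out.println(latin1.length + " bytes in ISO-8859-1, "
                + utf8.length + " bytes in UTF-8");

        // Decoding the UTF-8 bytes gives back the original word.
        System.out.println(decode(utf8, "UTF-8").equals(word));
    }
}
```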
The main focus of the project is on Meänkieli and Kven, two
Finnic minority languages spoken in Sweden and Norway,
respectively, and on Scots and Scottish English. At present we
have about 150 hours of tapes in Meänkieli and 100 hours of
tapes in Kven. More Kven material is currently being collected.
We also have access to both the structure and contents of the
Scottish Corpus of Texts and Speech (SCOTS) at the University
of Glasgow, currently totalling 3.5 million words of spoken
and written Scots and Scottish English.
The project began in 2004 and a prototype of the shell has now
been constructed. We will demonstrate this shell, showing some
of the basic functionality of the system while looking into
concrete research questions focusing on the image of
Scottishness as presented in Irvine Welsh’s Trainspotting
(1996). Since the story is found in novel, play and movie
versions, it affords an excellent testbed for a toolbox that can
handle multimodal data. Using a few key scenes as examples,
our research focuses on the questioning of national stereotypes
in terms of landscape, language and culture. We are thus
interested in the comparison of the images presented in the
three versions, which will also demonstrate the ability of the
toolbox to support versioning. While we concentrate on
linguistic features of Scottish English, we also demonstrate
how easy access to sound and images linked to the
transcription of the movie, together with the links between the
three versions of the text, greatly facilitates research such as
ours, which must take into consideration images created both at
all levels of a text and across texts. Finally, we demonstrate the possibility
of making use of online dictionaries as an added tool in the
analysis of data from within the toolbox.
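One way to picture the linked versions used in such a comparison is a store in which all versions of a work share one identifier, so that a single query can report which versions contain a match. The Java sketch below is a minimal illustration under that assumption: the class, its methods, the identifier and the short placeholder texts are all invented for the example and are neither the prototype's data model nor quotations from any version of Trainspotting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: versions of one work are linked by a shared work id.
public final class LinkedVersionsDemo {

    // work id -> (version label -> text), in insertion order
    private final Map<String, Map<String, String>> works = new LinkedHashMap<>();

    /** Stores one version (e.g. "novel", "play", "film") of a work. */
    public void addVersion(String workId, String label, String text) {
        works.computeIfAbsent(workId, k -> new LinkedHashMap<>()).put(label, text);
    }

    /** Labels of the versions of one work whose text contains a regex match. */
    public List<String> versionsMatching(String workId, String regex) {
        Pattern p = Pattern.compile(regex);
        Map<String, String> versions =
                works.getOrDefault(workId, Collections.emptyMap());
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, String> e : versions.entrySet()) {
            if (p.matcher(e.getValue()).find()) {
                labels.add(e.getKey());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        LinkedVersionsDemo db = new LinkedVersionsDemo();
        // Placeholder texts invented for this sketch.
        db.addVersion("scene-1", "novel", "the narrator describes the hills outside the city");
        db.addVersion("scene-1", "play", "two characters argue on a bare stage");
        db.addVersion("scene-1", "film", "wide shot of the hills, then dialogue");

        // Which versions of the scene mention the landscape?
        System.out.println(db.versionsMatching("scene-1", "\\bhills\\b"));
    }
}
```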
Since the shell is still in prototype form, we would welcome
this opportunity to discuss our needs and goals at DH2007, thus
drawing on the considerable expertise of the conference
participants in order to ensure that our tools benefit as wide a
range of users as possible.
TAPOR Tools. Accessed 2006. <http://tapor.human>
Trainspotting. Dir. Danny Boyle.
4 Play. Vintage, 2001.
Hodge, John. Trainspotting. DVD. Vintage, 1996.
Kretzschmar, [...] Opas-Hänninen, and B. Plichta. "Collaboration
on Corpora for Regional and Social Analysis." Journal of English
Linguistics 34.3 (2006): 175-205.
MediaTeam. Oulu Research Group. Accessed 2006. < ht
Scottish Corpus of Texts and Speech. Accessed 2006. <htt
Seppänen, Tapio. "Multimedia Information Retrieval." Plenary
talk at Digital Humanities 2006, Paris, 5-8 July 2006.
Thule Institute. Accessed 2006. < http://thule.oulu
Winsa, B. "Language Attitudes and Social Identity. Oppression
and Revival of a Minority Language in Sweden." Applied
Linguistics Association of Australia 17 (1998).

Conference Info


ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

Series: ADHO (2)

Organizers: ADHO

  • Language: English