Oxford University
Max Planck Institute for Psycholinguistics - University of Nijmegen
Utrecht University
Hungarian Academy of Sciences
University of Helsinki
This paper argues for an infrastructure that makes
language resources and technology (LRT) available and readily
usable to scholars of all disciplines, in particular the humanities
and social sciences (HSS), and gives an overview of how the
CLARIN project aims to build such an infrastructure.
Why we need a research infrastructure for language resources
Problems of standards for textual representation, interoperability
of tools, and problems with licensing, access and sustainability
have dogged the Humanities since the invention of the digital
computer. Language resources such as text corpora exhibit a
variety of forms of textual representation, metadata, annotation,
and access arrangements. Tools are usually developed for ad
hoc use within a particular project, or for use by one group of
researchers, or for use with only one text or set of data, and
are not developed sufficiently to be deployed as widely-used
and sustainable services. As a result, a large amount of effort
has been wasted over many years developing applications with
similar functionality. Veterans of ACH and ALLC will know
that the problems which are addressed by this paper are not
new ones. What a persistent and sustainable infrastructure,
as part of the e-Science and Cyberinfrastructure agenda, can
offer is perhaps the first realistic opportunity to address these
problems in a systematic, sustainable and global fashion.
The Summit on Digital Tools in the Humanities in
Charlottesville, Virginia in 2006 estimated that only 6% of
scholars in the Humanities go beyond general purpose
information technologies (email, web browsing, word
processing, spreadsheets and presentation slide software), and
suggested that revolutionary change in humanistic research
is possible thanks to computational methods, but that this
revolution has not yet occurred. This is an exciting time in
humanities research, as the introduction of new instruments
makes possible new types of research, but it is clear that new
institutional structures are needed for the potential to be
realised.
CLARIN is committed to boosting humanities research in a
multicultural and multilingual Europe by giving researchers
and scholars across a wide spectrum of domains in the
Humanities and Social Sciences easy access to, and use of,
language resources and technology. To reach this goal, CLARIN is
dedicated to establishing an active interaction with the research
communities in the Humanities and Social Sciences (HSS) and
to helping overcome the traditional gap between the
Humanities and the Language Technology communities.
The CLARIN proposal
The proposed CLARIN infrastructure is based on the belief
that the days of pencil-and-paper research are numbered,
even in the humanities. Computer-aided language processing
is already used by a wide variety of sub-disciplines in the
humanities and social sciences, addressing one or more of the
multiple roles language plays, as carrier of cultural content
and knowledge, instrument of communication, component of
identity and object of study. Current methods and objectives
in these disparate fields have a lot in common with each other.
However, it is evident that reaching the higher levels of analysis
of texts that non-linguist scholars are typically interested in,
such as their semantic and pragmatic dimensions, requires an
effort on a scale that no single scholar could, or indeed should,
afford.
The cost of collecting, digitising and annotating large text or
speech corpora, dictionaries or language descriptions is huge
in terms of time and money, and the creation of tools to
manipulate these language data is very demanding in terms
of skills and expertise, especially if one wants to make them
accessible to professionals who are not experts in linguistics
or language technology. The benefits of computer-enhanced
language processing become available only when a critical
mass of coordinated effort is invested in building an enabling
infrastructure, which can then offer services giving access
to all the existing tools and resources, as well
as training and advice, across a wide span of domains. Making
resources and tools easily accessible and usable is the mission
of the CLARIN infrastructure initiative.
The purpose of the infrastructure is to offer persistent services
that are secure and provide easy access to language processing
resources. Our vision is to make available in usable formats
both the resources for processing language and the data to
be processed, in such a way that the tasks can be run over
a distributed network from the user’s desktop. The CLARIN
objective is to make this vision a reality: repositories of data
with standardized descriptions, language processing tools
which can operate on standardized data, with a framework for
the resolution of legal and access issues, and all of this available
on the internet using Grid architecture.
The nature of the project is therefore primarily to turn
existing, fragmented technology and resources into accessible
and stable services that any user can share or customize for
their own applications. This will be a new underpinning for
advanced research in the humanities and social sciences - a
research infrastructure.
Objectives of the current phase
CLARIN is currently in the preparatory phase, which
has the aim of bringing the project to the level of legal,
organisational and financial maturity required to implement
the infrastructure. As the ultimate goal is the construction
and operation of a shared distributed infrastructure to make
language resources and technology available to the humanities
and social sciences research communities at large, an approach
along various dimensions is required in order to pave the way
for implementation. The five main dimensions along which
CLARIN will progress are the following:
• Funding and governance, bringing together the funding
agencies in all participating countries and working out
a ready-to-sign draft agreement between them about
governance, financing, construction and operation of the
infrastructure.
• Technical infrastructure, defining the novel concept of a
language resources and technology infrastructure, based
on existing and emerging technologies (Grid, web services),
and providing a detailed specification of the infrastructure,
agreement on data and interoperability standards to be
adopted, as well as a validated running prototype based on
these specifications.
• Languages and multilinguality, populating the prototype
with a selection of language resources and technologies for
all participating languages, via the adaptation and integration
of existing resources to the CLARIN requirements, and in a
number of cases the creation of specific essential resources.
• Legal and ethical issues relating to language resources will
have to be examined and thoroughly understood, and the
necessary legal and administrative agreements proposed to
overcome the barriers to full exploitation of the resources.
• Focus on users, the intended users being the humanities
and social sciences research communities.
This final dimension is in many ways the most important, and
will be explored in the most detail in this paper. In order to
fully exploit the potential of what language technology has to
offer, a number of actions have to be undertaken: (i) an analysis
of current practice in the use of language technology in the
humanities will help to ensure that the specifications take into
account the needs of the humanities, (ii) the execution of a
number of exemplary humanities projects will help to validate
the prototype and its specifications, (iii) the humanities and
social sciences communities have to be made aware of the
potential of the use of language resources and technology to
enable innovation in their research, and (iv) the humanities and
language technology communities have to be brought together
in networks in order to ensure lasting collaborations between
the communities. The objective of this cluster of activities is
to ensure that the infrastructure has been demonstrated to
serve the humanities and social sciences users, and that we
create an informed community which is capable of exploiting
and further developing the infrastructure.
Concluding remarks
CLARIN still has a long way to go, but it offers an exciting
opportunity to exploit the achievements of language and
speech technology over the last decades to the benefit
of communities that traditionally do not maintain a close
relationship with the latest technologies. In contrast to many
European programmes, the main beneficiaries of this project
are not expected to be the big ICT-oriented industries or the
bigger language communities in Europe. CLARIN addresses
the whole humanities and social sciences research community,
and it very explicitly addresses all the languages of the EU
and associated states, both majority and minority languages,
including languages spoken and languages studied in the
participating countries.
Hosted at University of Oulu
Oulu, Finland
June 25, 2008 - June 29, 2008
135 works by 231 authors indexed
Conference website: http://www.ekl.oulu.fi/dh2008/
Series: ADHO (3)
Organizers: ADHO