CLARIN for DH Scholars

lightning talk
  1. 1. Leon Wessels

    CLARIN ERIC, Utrecht University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The increased application of data-driven approaches has been a game changer in the Humanities. By using large quantities of research data and various tools to process and analyse these data, Digital Humanities scholars can address questions that were previously considered too complicated or time-consuming to answer. But the development of the DH field has also reshaped the needs of researchers. DH scholars desire increasingly larger, sufficiently annotated sets of research data and advanced tools to process them. This abstract introduces a number of services offered by CLARIN, the Common Language Resources and Technology Infrastructure, that are particularly interesting for the DH community.As a European Research Infrastructure Consortium (ERIC) established by the European Strategy Forum on Research Infrastructures (ESFRI), CLARIN is a non-commercial Research Infrastructure providing single sign-on access to natural language resources and tools free of charge for all academic researchers. Countries or regions can join CLARIN as member, observer, or third party. As of June 2020, the CLARIN consortium consists of 21 full members, three observers, and a third party, covering a total of 25 participating countries (figure 1). Within each national consortium a number of institutes (universities, academies, research institutes, museums, archives, etc.) contribute resources, tools, and knowledge to the CLARIN infrastructure. Figure 1: This map shows countries and regions participating in CLARIN as member, observer, or third party, and various types of CLARIN centres.CLARIN aims to cater the needs of the entire DH community. Clearly, these needs differ widely. Some DH researchers, for example, like to collect, curate, and deposit their own data sets and to develop their own tools, while others prefer to focus on solving a piece of the puzzle of the various aspects of human society and culture, leaving the technical development to others. Surely, there is no such thing a one-size-fits-all solution to cater to the needs of all different types of DH researchers. However, a number of developments would benefit the DH community at large. For instance, making resources and tools as sustainable, openly available, interoperable, and easily findable as possible. From the outset, it has been CLARIN’s strategy to make its resources and tools available according to the later defined FAIR data principles: Findable, Accessible, Interoperable, and Reusable. CLARIN is a distributed infrastructure. Researchers can deposit their resources in one of the certified CLARIN centres offering open repositories, each devoted to a specific research field. The metadata of each deposit gets automatically harvested by the Virtual Language Observatory (VLO,, a faceted search engine allowing everyone to find resources deposited in all of the CLARIN repositories. Currently, the VLO gives access to over 900,000 records. Once a relevant resource has been found it can be immediately processed by a number of analytical tools for part-of-speech tagging, distant reading, topic modelling, named entity recognition, machine translation, and various other tasks using the CLARIN Language Resource Switchboard ( The Switchboard takes into account features like modality, format, and language to match resources to tools. In addition to a full inventory of all resources made available through the various CLARIN repositories, CLARIN also provides user-friendly overviews of key resources. These overviews of CLARIN Resource Families ( provide information on the availability, language, size, annotation, and license of the resources. Currently, ten corpora families, five families of lexical resources, and three tool families are offered. The corpora families include parliamentary corpora, newspaper corpora, literary corpora, spoken corpora, and computer-mediated communication corpora (e.g. social media posts). The overviews are initiated based on the input of domain experts across the world and continue to be manually curated. But CLARIN offers more than just data and tools. Knowledge is a key component of the CLARIN infrastructure. A coordinated system of Knowledge centres provides knowledge and expertise to researchers. Each Knowledge centre has its own specific area of expertise, such as data management, language learning analysis, speech analysis, treebanking, and several languages. Several funding instruments allow researchers, teachers, and developers to collaborate and to teach each other, stimulating cross-country and cross-disciplinary collaboration. Dozens of recorded presentations, tutorials, discussions, and other videos have been made available on VideoLectures.NET (, teachers, developers, citizen-scientists, policy makers, politicians, journalists, and other people interested in getting to know more about CLARIN are invited to have a look at the CLARIN Value Proposition ( Those who would like to know what is going on in their national CLARIN consortium can have a look at the list of participating consortia ( or at the Tour de CLARIN webpage (, an ongoing initiative to highlight the activities in national consortia and Knowledge centres.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at Data for this conference were initially prepared and cleaned by May Ning.

Conference website:


Series: ADHO (15)

Organizers: ADHO