Data mining digital libraries

Lars Gunnarsønn Johnsen; Magnus Breder Birkenes; Arne Martinus Lindstad

Authorship

1. Lars Gunnarsønn Johnsen

University of Bergen

2. Magnus Breder Birkenes

National Library, Norway

3. Arne Martinus Lindstad

National Library, Norway

Work text


The central theme for this workshop is data mining and the connection between metadata and data in the context of digital libraries. Digital resources and search engines raise several questions about the relationship between metadata and the data they describe. For example, what is the relationship between metadata keywords and classification categories (e.g. Dewey)? How should topics found by topic modeling algorithms be labelled? With readily available search engine technology, using document relevance based on content words, is there a need for library classification systems at all, like Dewey or UDC?
While the content of metadata may overlap with that of the texts they describe, metadata typically contain information not found within the text, such as author, geolocation and time data. In addition, subject or topic words are typically drawn from carefully constructed language models in the form of thesauri, built specifically for specialized literary collections within different fields. The question is then how search engines may benefit from metadata that carry such a language model, and for what kind of library user.
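One simple way to make the overlap between metadata and text concrete is to compare a record's subject keywords with the most frequent content words of the text it describes. The following is a minimal sketch under stated assumptions, not a method used by the workshop organizers: the stopword list, the sample text, the subject keywords and the function names `top_content_words` and `jaccard` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Illustrative stopword list only; a real study would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def top_content_words(text, k):
    """Return the set of the k most frequent non-stopword tokens in text."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return {word for word, _ in counts.most_common(k)}

def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy record: the text and the subject keywords are invented examples.
text = "Whales and whaling in the Arctic. Whaling ships hunted whales."
subject_keywords = {"whaling", "arctic", "history"}

content_words = top_content_words(text, 2)
overlap = jaccard(content_words, subject_keywords)
```

A low score would suggest that the subject keywords add information not recoverable from frequent content words alone, although a realistic comparison would also need lemmatization and a mapping between thesaurus terms and surface forms.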
In this workshop, we invite colleagues to discuss the application of various methods related to digital library resources, including the structure of the metadata itself, as well as digital book collections. Many resources are available to libraries in digital form, like journals and new book titles, while some libraries have also launched digitization programs to create digital libraries, using scanners and OCR technology.
Both the text data and the metadata of digital libraries can be scrutinized with data mining techniques, opening up the material for large-scale, quantitative analysis. This makes such collections highly relevant for Digital Humanities studies.

Background
The ongoing trend towards increased digitization in society in general poses numerous challenges at many levels, but also opens up vast opportunities within many fields, including the library sector.
At the National Library of Norway, a mass digitization project was initiated in 2006, with the goal of digitizing the entire collection of books, newspapers, movies, radio and television broadcasts, music etc., in sum everything published in the public domain in Norway of all media types, i.e. the entire cultural heritage of Norway. For books, the goal is to have the entire stock digitized by 2017. Thus far, some 435,000 of 450,000-500,000 books have been digitized. When all books and newspapers have been digitized, we estimate that our Norwegian text corpus will consist of some 80-100 billion tokens, which is large for a rather small language like Norwegian with approximately 5 million speakers. In comparison, the Google Books corpus contains approximately 500 billion tokens for English.
The National Library cooperates with scholars of literary studies and linguistics in developing and applying methods of data mining to the digital collection. We develop services that make the content available for quantitative research, without challenging intellectual property rights. One such service is NB N-gram for Norwegian (see http://www.nb.no/sp_tjenester/beta/ngram_1/), comparable to Google Ngram Viewer for English and other languages.
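
To illustrate the kind of aggregate an n-gram service exposes, the sketch below counts how often a phrase occurs per publication year, relative to all n-grams of the same length in that year. It is a toy illustration only and not the implementation behind NB N-gram: the function names `ngrams` and `relative_frequency_by_year`, the whitespace tokenization and the two sample documents are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency_by_year(docs, phrase):
    """Relative frequency of phrase per year.

    docs: iterable of (year, text) pairs (toy data, not the NB corpus).
    phrase: whitespace-separated n-gram, e.g. "the library".
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        grams = ngrams(text.lower().split(), n)
        hits[year] += sum(1 for g in grams if g == target)
        totals[year] += len(grams)
    # Skip years that have no n-grams of this length to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

toy_docs = [
    (1990, "the library holds books and the library lends books"),
    (2010, "the digital library holds scanned books"),
]
freqs = relative_frequency_by_year(toy_docs, "the library")
```

Because only counts per year are returned, no running text from the underlying documents is exposed, which is one way such a service can support quantitative research on material that is still under copyright.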

Workshop leaders
Lars G. Johnsen: Research librarian at the National Library of Norway, PhD in linguistics. Fields of interest: semantics, grammar, philosophy of language, probability theory and applications. Email: Lars.Johnsen@nb.no, Phone: +47 23 27 61 84

Arne Martinus Lindstad: Research librarian at the National Library of Norway, PhD in linguistics. Fields of interest: corpus linguistics, language change, comparative syntax, negation. Email: arne.lindstad@nb.no, Phone: +47 23 27 62 11

Magnus Breder Birkenes: Research librarian at the National Library of Norway, PhD in linguistics. Fields of interest: corpus linguistics, history and dialectology of the Germanic languages. Email: magnus.birkenes@nb.no, Phone: +47 23 27 60 54

Target audience
Librarians, research librarians, scholars of literary studies, corpus and computational linguists

Length and Format
Half day:

09.00 - 09.30
Introduction and opening discussion

09.30 - 10.30
Slot 1: Structure of metadata, data mining and library classification systems

10.30 - 11.00
Coffee break

11.00 - 12.00
Slot 2: Metadata and modeling

12.00 - 12.30
Wrap-up and final discussion

Budget
Coffee and snacks for the coffee break (max. 50€)

Technical requirements
A projector for presentations. Internet connection. We will bring our own computers.

Call for papers (cfp)
Are you interested in automatic classification of documents and what implications this has for libraries? How may search engines (like ElasticSearch) benefit from library metadata? Do you have any experience with developing public/academic web services on top of large amounts of library data? If these questions appeal to you, this workshop may be of interest. The central theme for this workshop is data mining and the connection between metadata and data in the context of digital libraries.

We invite papers on topics such as:
The structure of subject headings and descriptors, used in book classification (e.g. in building thesauri)
The relationship between topic words and library classification systems
The relationship between content words and topic words (of existing metadata, or as output from topic modeling algorithms)
Automatic classification of digital documents
Authorship attribution
Development of computational services for research and the general public
Legal issues arising with different data mining practices

Please send us an abstract of at most 500 words that fits within the above context.

Program Committee
Oddrun Ohren (National Library of Norway)
Koenraad De Smedt (University of Bergen)
Anders Nøklestad (University of Oslo)
Elise Conradi (National Library of Norway)

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

ADHO - 2016

"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

Conference website: https://dh2016.adho.org/

Series: ADHO (11)

Organizers: ADHO
