What remains to be done - Exposing invisible collections in the other 6500 languages and why it is a DH enterprise.

paper, specified "short paper"
Authorship
  1. Nick Thieberger

    University of Melbourne

Work text

Introduction

In a recent overview of issues in the digital humanities, Manfred Thaller notes the importance of: “(1) access to the information needed to tackle a research question, (2) the analysis of that information by tools reflecting the methodological requirements of the specific discipline and research problem and (3) the publication of the new information gained by the analytical process.” (Thaller 2012:11) [1]
For most of the world’s 7000 languages, few records are available via the internet. Efforts to increase the documentation of these small languages have led to the development of tools and repositories over the past decade. I suggest that Thaller’s desiderata are reflected in language documentation activities: the creation of archives and metadata systems, and the ability to locate, store, retrieve and re-use language records. The network of language archives represented by the Open Language Archives Community (OLAC) has adopted a common metadata system that each archive serves for OLAC’s aggregation, allowing more specific searches than a general search engine such as Google can provide. However, not all digital language archives currently provide metadata to OLAC, rendering their collections invisible to the aggregated search. While their webpages may be accessible to web searches, they do not allow the targeted search by language that is the focus of OLAC’s aggregator. Other repositories (including many institutional repositories, such as national libraries and archives, mission archives and so on) have language content that is not noted in the collection’s catalog, and the catalog may not be available for web-harvesting. Finally, there are collections still held by their creators and not in a repository at all.
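To illustrate what serving metadata for aggregation involves, the following minimal Python sketch builds the kind of OAI-PMH request a harvester sends to an archive, asking for all records in the "olac" metadata format. The endpoint URL and the function name are assumptions made for illustration only; they do not describe any particular archive's service.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="olac", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    # "verb" and "metadataPrefix" are standard OAI-PMH request arguments.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set, if the repository defines sets.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, used only for illustration.
url = list_records_url("https://archive.example.org/oai")
print(url)
```

An archive that does not answer requests of this kind cannot be harvested, which is why its collections stay outside the aggregated search.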
This paper discusses two approaches to making collections of primary language material locatable and accessible. While the methods are generalisable to any discipline, this paper describes an index of records of language material for collections that have no such metadata and for which no other mechanism is foreseeable. The first approach builds a traceable index of a researcher’s discoveries in existing repositories, for example, a state library or archive, using established aggregation services. The second is a survey that aims to locate and digitise smaller collections that are currently outside established institutions, typically still in the care of the researcher.
The language index

The language index provides metadata in an Open Archives Initiative-compliant form, allowing records [2] to be found in generic language searches [3]. Not all repositories can provide metadata using ISO 639-3 language codes, so it is useful to provide a mechanism whereby researchers can build this resource as they discover new material. In general, repositories are unaware of standards rather than reluctant to share data, hence the need either for them to change their metadata system (which is unlikely) or for an index of the kind described here, which points to their collections.
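A minimal Python sketch of what one such index entry might look like follows. It builds an OLAC-style record whose subject language is given as an ISO 639-3 code and whose identifier points to the holding repository. The function name, the choice of fields, the example title, the example URL and the language code "erk" are all illustrative assumptions; this is not the record format actually used in the PARADISEC catalog.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OLAC metadata records.
OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"
DC_NS = "http://purl.org/dc/elements/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

for prefix, uri in (("olac", OLAC_NS), ("dc", DC_NS), ("xsi", XSI_NS)):
    ET.register_namespace(prefix, uri)


def build_index_record(title, language_code, location, description):
    """Return an OLAC-style record pointing to material held elsewhere.

    Hypothetical helper for illustration; the fields are assumptions.
    """
    code_ok = (
        len(language_code) == 3
        and language_code.isascii()
        and language_code.isalpha()
        and language_code.islower()
    )
    if not code_ok:
        raise ValueError("expected a three-letter ISO 639-3 code")

    record = ET.Element(f"{{{OLAC_NS}}}olac")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    # The subject language is what makes a targeted search by language possible.
    ET.SubElement(
        record,
        f"{{{DC_NS}}}subject",
        {
            f"{{{XSI_NS}}}type": "olac:language",
            f"{{{OLAC_NS}}}code": language_code,
        },
    )
    # The identifier points to the holding repository, not to a local copy.
    ET.SubElement(record, f"{{{DC_NS}}}identifier").text = location
    ET.SubElement(record, f"{{{DC_NS}}}description").text = description
    return record


# Illustrative values only.
example = build_index_record(
    title="Field recordings held in a state library",
    language_code="erk",
    location="https://library.example.org/item/123",
    description="Located by a researcher; not described in the library catalog.",
)
print(ET.tostring(example, encoding="unicode"))
```

Because the entry only points to the material, its usefulness depends on the link to the holding repository remaining valid.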
The paper will demonstrate the index as it is currently implemented in the catalog of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), and will discuss the persistence of the links provided and possible future alternatives for exposing collections that are otherwise invisible to aggregation.
The survey

In an effort to locate what I have called invisible collections, I launched a survey [4] in 2012 asking respondents to identify language collections that needed to be digitised, described using OLAC’s service, or both. I tried to keep the questions as simple and easy to answer as possible. In this paper I will report on the findings of this survey, in particular noting that they point to the need for training; for simple metadata entry tools; for standards-compliant metadata repositories; and for recognition of collections of primary material as a form of scholarly output.
The survey questions were as follows:
1. Do you know of recordings of small or endangered languages that are not yet digitised? These could be in personal collections or in established repositories that do not plan to digitise their collections. If so, please provide as much detail as you can about the number and type of recordings (reel to reel, cassette, DAT etc), the content, and the state of their current storage. Can you provide information about who to contact about these collections?
2. Do you know of collections whose catalogs are not available through federated searches (that is, they are only available if you visit their website and not anywhere else on the web) and for which we could provide a reference to make it easier to find them?
3. Do you know of repositories of manuscripts that have received little attention from linguists but which are likely, in your opinion, to have linguistic records in them? These may include, for example, missionary archives or State administrative archives.
4. Please include your name and contact email so we can follow up with you if necessary (email addresses will not be added to any lists). (Please indicate if you allow us to publish an anonymised version of your response).
The survey form was publicised among linguistic networks. It is now nominated as a future activity of the international network of language archives, DELAMAN [5], which should ensure wider coverage. As a first step, it has revealed an interesting variety of collections, each with characteristics that are significant for the effort of making such collections available. At a time when funding for digitisation is difficult to obtain, it is important to recognise that unique cultural heritage recordings such as these are at risk of being lost. A summary of some responses, with an observation about the broader significance of each, is given below.
(1) 22 tapes of a Sudanese language held in Washington DC by a retired linguist: how can they be digitised, and where should they then be stored? Twenty-two tapes is a fairly manageable number, but there are also a large number of notes that need to be scanned. A retired researcher may not have easy access to the equipment needed for this work.
(2) Several hundred cassettes in a Solomon Islands language, particularly valuable because they were recorded by a speaker and so capture a great deal of natural speech. Digitising such a collection is a serious undertaking needing significant funds.
(3) Tapes stored in a box in Stockholm, while the person who recorded them is based in Chicago and is still an active academic. This is a basic problem of collectors’ access to their own material.
(4) A dozen reel-to-reel and two dozen cassette tapes in Colorado, USA, held by a senior linguist who is concerned to make the collection safe but is not sure what to do.
(5) Tapes deposited with a national Cultural Centre in a small Pacific country that may or may not have the resources to look after them. It does not publish its catalog (if it actually has one), so it is not clear whether these tapes need to be digitised, or what conditions may be placed on access to them.
(6) Tapes from a recent MA in Linguistics at one of the PARADISEC consortium universities, stored in boxes. Paper transcripts may have been thrown out. This shows a lack of communication even within our own departments.
(7) A collection of [language] tapes stored in a Harvard University repository which may not prioritise digitising it (but could if funding were made available).
(8) Researchers who have small collections, fewer than twenty tapes, and digitise them themselves by connecting a tape player to a digital recorder. The problem here is the digitisation method, which may damage the tape and may not produce the best digital file.
One reason that these collections are not digitised is clearly the lack of importance placed by academia on the re-use of primary research materials. If creating archived and accessible collections of primary data were an acknowledged research output, counting towards promotion and tenure, then cases like those listed above would be far less likely to occur. It is clear that much remains to be done to extend the reach of digital language archives: assisting in locating legacy collections, describing and digitising them, connecting with source communities and individuals, creating a means for online annotation (crowdsourcing), and valuing the collections (whether monetarily or academically). I conclude by discussing an online service for providing small metadata snippets pointing to these otherwise invisible collections. This paper presents these efforts, which are based around the digital archive Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC).
References

[1] Thaller, Manfred (Ed.). (2012). Controversies around the Digital Humanities. Historical Social Research Vol. 37 (2012), No. 3.
[2] www.language-archives.org/item/oai:paradisec.org.au:JL1-link
[3] www.language-archives.org/item/oai:paradisec.org.au:JL1-link
[4] www.paradisec.org.au/PDSCSurvey.html
[5] www.delaman.org


Conference Info

Complete

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO