Digital Dutch East India Company (VOC) Linguistic Archive: Modeling a community-sourcing platform for historical linguistics research.

paper, specified "short paper"
  1. Anna Pytlowany

    University of Amsterdam

1. Overview
The primary aim of the project “Digital Linguistic Archive of the Dutch East India Company (VOC)” is to bring together in one online platform all Dutch documents from 1600-1825 that were written either by VOC employees or under the auspices of the Company, and that relate to the languages of the “newly discovered” territories, mainly South East Asia and Africa.

The secondary, yet more ambitious, goal is to create an online platform enabling researchers to contribute and exchange knowledge about the source documents, possibly including social editions of some texts. Such collaboration would bring together researchers from different countries and language backgrounds, and would open access to a vast range of unpublished linguistic material.

2. Source material
The archives of the Dutch East India Company (VOC) are spread between The Hague, Jakarta, Cape Town, Colombo, Madras and London. Their historical and cultural significance was recognized by UNESCO in 2003, when they were added to UNESCO's Memory of the World Register, a 'World Heritage List' for the preservation of valuable archives and library collections. Considering the size of the VOC archives – comprising 34 million pages in total, including 1.4 km of shelf material in The Hague and 2.5 km in Jakarta – one can only be disappointed at the scarcity of available documents relating to the languages of the Dutch maritime empire.

However, a closer inspection of various independent library records yields surprising finds: some of these Dutch manuscripts and printed books are scattered across private and public collections in Paris, London, Venice, and Sydney. Up to this point, only a few isolated studies of these manuscripts have been available. An online database detailing all available bibliographical information would help unravel the documents’ provenance, itineraries, and interconnections.

So far, these documents have never been assembled and arranged into one comprehensive collection. Compiling such an inventory would be highly desirable because it would allow a better evaluation of the scope and content of Dutch colonial linguistic heritage. Last but not least, online publishing of digital editions would open access to a vast range of previously unpublished linguistic material.

3. The need for collaboration
One of the most significant challenges in modeling this project is the potential multitude of languages and scripts involved. Although the main body of the texts is predominantly in Dutch, the grammars and vocabularies, by their very nature, each contain at least one other “exotic” language, often written in its native, non-Latin script. Even where one layer of such a multi-layered text is accessible to a particular researcher, the remaining layers are likely to require further work before they become readily understandable.

This is perhaps the most compelling characteristic of archival material of this type: it lends itself perfectly to a collaborative, crowdsourced (or rather: community-sourced) online undertaking. The tasks may include: adding new documents, transcribing or translating existing documents, proofreading and validating transcriptions, making annotations, and adding references.

Consider a real-life example: a newly discovered and digitised 17th-century treatise on Tamil letters, recently made available online by the Utrecht University Library (Ms. 1479):

Fig. 1: A newly digitized manuscript: user story

A well-known researcher of Tamil would be very interested in studying it, but she does not read Dutch, or perhaps is not familiar with 17th-century Dutch paleography. However, a paleography student may find it motivating and worthwhile to practice his skills on a real-life manuscript; a retired English teacher from the Netherlands may then contribute an English translation.

Once the text is rendered into English, it opens new possibilities for a wide community of non-Western scholars who may be interested in early descriptions of their native languages. They, in turn, can contribute their transcription of parts written in non-Latin scripts, as well as annotations regarding the linguistic content of the documents.
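The user story above can be read as a set of contribution layers stacked on one manuscript. The following is a minimal sketch of that idea, not a design from the project: the class names, the four layers, and the `validated` flag are all assumptions made for illustration, and the text values are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """Hypothetical contribution layers, in the order of the user story."""
    TRANSCRIPTION = "transcription"
    TRANSLATION = "translation"
    NATIVE_SCRIPT = "native-script transcription"
    ANNOTATION = "annotation"


@dataclass
class Contribution:
    layer: Layer
    contributor: str
    text: str
    validated: bool = False  # assumed to be set once another user has proofread it


@dataclass
class Manuscript:
    shelfmark: str
    contributions: list = field(default_factory=list)

    def add(self, contribution):
        self.contributions.append(contribution)

    def missing_layers(self):
        """Return the layers that have no validated contribution yet."""
        done = {c.layer for c in self.contributions if c.validated}
        return [layer for layer in Layer if layer not in done]


ms = Manuscript("Ms. 1479")
ms.add(Contribution(Layer.TRANSCRIPTION, "paleography student",
                    "(placeholder Dutch transcription)", validated=True))
ms.add(Contribution(Layer.TRANSLATION, "retired teacher",
                    "(placeholder English translation)"))

# Layers still open for the community to work on.
print([layer.value for layer in ms.missing_layers()])
```

Tracking validation per layer would let the platform show each visitor the tasks that match their skills, for example listing manuscripts that still lack a proofread translation.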

4. Content organisation
The sources will be indexed by title, author, language, printer / publisher (where applicable), and the relevant linguistic category. Users will be able to select any two of these criteria and visualize their interrelations dynamically as a relationship graph, which should help reveal networks and collaboration patterns.
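One way to picture the index and the two-criteria relationship graph is as a weighted edge list computed from catalogue records. The sketch below assumes a flat record per source; the field names, the `relationship_edges` helper, and the sample records are hypothetical placeholders, not data or code from the project.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """One catalogued document; field names are illustrative assumptions."""
    title: str
    author: str
    language: str  # the language described, not the Dutch metalanguage
    category: str  # linguistic category, e.g. "grammar" or "vocabulary"
    publisher: Optional[str] = None  # printer / publisher, where applicable


def relationship_edges(sources, first, second):
    """Count how often each pair of values of two index fields co-occurs.

    Returns a Counter mapping (value_of_first, value_of_second) to the
    number of sources sharing that pair: a weighted edge list for a graph.
    """
    edges = Counter()
    for source in sources:
        a, b = getattr(source, first), getattr(source, second)
        if a is not None and b is not None:  # skip sources lacking a field
            edges[(a, b)] += 1
    return edges


# Placeholder records, invented purely for illustration.
catalogue = [
    Source("Title 1", "Author A", "Language X", "grammar"),
    Source("Title 2", "Author A", "Language Y", "vocabulary", "Printer P"),
    Source("Title 3", "Author B", "Language X", "vocabulary"),
    Source("Title 4", "Author A", "Language X", "vocabulary"),
]

author_language = relationship_edges(catalogue, "author", "language")
for (author, language), weight in sorted(author_language.items()):
    print(f"{author} -- {language} (sources: {weight})")
```

The same function works for any pair of fields, such as `"publisher"` and `"category"`, and the resulting edge list could be handed to whichever graph-drawing tool the platform eventually adopts.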

Additional tools, such as geo-referencing and annotation, could also be developed. For the purposes of paleographic research and comparison, a sample of the handwriting from each text would also be provided.

5. Challenges
However, before any IT solutions can be developed, other key issues will have to be addressed, notably in relation to copyright and intellectual property rights. How can existing digitized objects from different libraries be brought together? How can access to documents behind a paywall be ensured? Could the added value of knowledge and online traffic act as a trade-off for libraries?

Other methodological questions concern how to organise and manage the crowdsourcing community, drawing on previous experience from comparable projects. The issues of authority, access and quality control will have to be addressed in order to ensure adherence to rigorous standards of scholarship.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO