The Corpus of Historical Mapudungun: Digital Tools for New-World Language Change

paper, specified "short paper"
  1. 1. Benjamin Joseph Molineaux

    University of Edinburgh

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

1 Historical linguistics and the Americas
Minority, non-European languages — such as indigenous American ones — are critically under- represented in the literature on language change. This not only narrows our view of the historical interaction of peoples and languages predating European expansion, but also limits our under- standing of language change as a whole. In the absence of the hundreds of years of philological study available for Old World languages, digital methods emerge as ideal means for systematically compiling and exploring the available data for historical languages in the New World. This said, carefully tagged, historically-oriented corpora for Native American languages are still few and far between, despite the general growth in Digital Humanities scholarship throughout the region.

2 Mapudungun, a prime candidate
Mapudungun (arn, ISO:639-3), the ancestral language of the Mapuche people of present-day Chile and Argentina, has been fairly well documented for over 400 years. While the types of available historical material for the language are fairly typical — including missionary, military and ethnographic works (see Villena 2017: for an overview) — the historical depth, as well as the number of texts available for Mapudungun is better-than-average for the region. Current Mapudungun varieties are mostly well described and remain in use, nevertheless, there is very little explicit work on their history. This availability of historical and contemporary data, coupled with a gap in historical research, makes Mapudungun an excellent candidate for a first corpus-based historical account of a language of the Americas. The building of such a resource, The Corpus of Historical Mapudungun (CHM), is the focus of an ongoing research project at the Angus McIntosh Centre for Historical Linguistics, Edinburgh, and the topic of this talk.
Beyond the empirical and methodological advantages to working with historical data for Mapudungun, there are also theoretically interesting reasons to do so. Features such as nominal incorporation, verb serialisation, agglutination and polysynthesys raise interesting questions about the diachrony of units of sound and meaning, which cannot be probed by better-studied, yet typologically distinct languages.
Overall, the rich word-internal morphology of Mapudungun challenges traditional domains for sound change. In particular, the fact that morphological transparency is paramount to the language’s system, means that there is little indication of word-level reduction and neutralisation processes, which are key to the study of sound change in Indo-European languages (Molineaux 2017, 2018). These facts of Mapudungun will be key both to our analysis of change in the language and to the practicalities of corpus-design.

3 Building the corpus
The majority of texts included in the CHM are printed material dated between 1606 and 1930, making up some 400k words (see Using both archival images and newly digitised materials from the
Biblioteca Nacional de Chile and the
Archivo Rodolfo Lenz in Santiago, Chile, the relevant texts have undergone optical character recognition. This was done via the
Digital Humanities Dashboard (Tarpley 2018) and followed up by hand-checking. The result is a collection of digital text covering the first 324 years of written Mapudungun. While the texts are gathered from all major language areas, they inevitably remain imbalanced in nature — both temporally and spatially —, as tends to be the case for historical corpora.

As the objective of the corpus is to provide a view into the synchrony and diachrony of lexical, morphological and phonological features, texts are being parsed at all three of these levels. The first stage of this process — lemmatisation — identifies the key root-elements, as well as the part-of-speech (POS) category for each word, providing a single identifiable label and reducing both morphological and spelling heterogeneity (see 1).





‘to play’




‘to play’


<w xml:lang=“arn” lemma=“kuden” pos=“V” corresp=“play”>kudekefuingu</w>
<w xml:lang=“arn” lemma=“kuden” pos=“V” corresp=“play ”>kuthekalape</w>

The second stage is morphological parsing, which identifies individual morphemes beyond the root and labels them according to function (as in 2). The result of both these processes is a TEI-standard XML text with the relevant tags embedded. A full 10% of the total word-types in the corpus texts is soon to be completed, after which an AI algorithm, developed at the University of Edinburgh’s NLP Group, will be trained to tag the remainder of the material both at the level of the lemma and the morpheme. Additional hand corrections will be necessary in order to complete the process.









<w><m baseForm=“kude” type=“root” corresp=“play”>kude</m><m baseForm=“ke” type=“habit”>ke</m><m baseForm=“fu” type=“BI”>fu</m><m baseForm=“ingu” type=“ind.3.d”>ingu</m></w>
<w><m baseForm=“kude” type=“root” corresp=“play”>kude</m><m baseForm=“ka” type=“cont”>ke</m><m baseForm=“la” type=“neg”>la</m><m baseForm=“pe” type=“”>pe</m></w>

The final stage of the tagging will be grapho-phonological parsing (Kopaczyk et al. 2018), which entails providing sound values for each word (as in 3), following a list of spelling-based rules for each text. The results should effectively reconstruct the phonic structure of each text, such that it can be compared with others from different periods and locations, helping to map phonological change from the bottom up.
The front end of the corpus — soon to be available in beta form — will provide search options (in both English and Spanish) at all three levels of tagging (word, morpheme and sound), as well as allowing users to correlate these features across texts and with relevant non-linguistic metadata such as date, location, author, genre, etc. A simpler browser version will also be available for non-linguists, allowing for texts to be browsed and downloaded with parallel translations.




Valdivia 1606



Augusta 1916


<m><c ipa=“v”>v</c><c ipa=“ɨ”>ú</c><c ipa=“t”>t</c><c ipa=“a”>a</c></m>
<m><c ipa=“f ”>f</c><c ipa=“ɨ”>ü</c><c ipa=“ʧ”>ch</c><c ipa=“a”>a</c></m>

4 Applications
In this talk I will give example applications of the CHM to language change, looking into (a) the spread of Quechua lexical borrowings and (b) the evolution of morpheme-boundary epenthesis. More generally, however, the CHM paves the way for the application of digital methods to the history of minority, non-standard languages, creating transferable tools, and foregrounding under-studied typological features. Such outcomes will broaden our understanding of language change overall, by allowing a detailed view of the interaction of the lexicon, morphological structure, and sound systems over time and space.
Locally, the project will provide teachers, learners and advocates of Mapudungun with a repository of words in their historical usage and forms, which may be used to enhance word-building strategies, revitalise dialectal vocabulary and more generally improve transmission, both through written and spoken medium.

Augusta, F. J. (1916).
Diccionario araucano-español y español-araucano. Santiago: Imprenta Universitaria.

Kopaczyk, J., Molineaux, B., Los, B., Karaiskos, V., Alcorn R. and Maguire, W. (2018). Towards a grapho-phonologically parsed corpus of medieval scots: database design and technical solutions.
13(2). 255–269.

Molineaux, B. (2017). Native and non-native perception of stress in Mapudungun: Assessing structural maintenance in the phonology of an endangered language.
Language and Speech
60. 48–64.

Molineaux, B. (2018). Pertinacity and change in Mapudungun stress assignment.
International Journal of American Linguistics
84(4). 513–558.

Tarpley, B. (2018). Digital humanities dashboard.
Center of Digital Humanities Research, Texas A&M University.

Valdivia, L. de (1606).
Arte y gramatica general de la lengua que corre en todo el Reyno de Chile, con un vocabulario y confessionario. Seville: Thomás López de Haro.

Villena, B. (2017). Fuentes para el estudio del mapudungún: propuesta de periodización.
Lenguas y Literaturas Indoamericanas
1(19). 141–167.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO