Training NLP Models for the Analysis of 16th Century Latin American Historical Documents: Tagtog and the Geographic Reports of New Spain

poster / demo / art installation
Authorship
  1. 1. Patricia Murrieta-Flores

    Digital Humanities Hub-Department of History - University of Lancaster

  2. 2. Raquel Liceras-Garrido

    Digital Humanities Hub-Department of History - University of Lancaster

  3. 3. Mariana Favila-Vázquez

    Templo Mayor Museum - Instituto Nacional de Antropología e Historia (National Institute of Anthropology and History)

  4. 4. Katherine Bellamy

    Digital Humanities Hub-Department of History - University of Lancaster

  5. 5. Jorge Campos

    @tagtog.net

  6. 6. Juan Miguel Cejuela

    @tagtog.net

  7. 7. Bruno Martins

    Computer Science and Engineering Department of IST - University of Lisbon

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


The latest advances in Text Mining, Natural Language Processing (NLP) and Machine Learning (ML) have opened a world of possibilities in processing massive amounts of texts that would otherwise take a lifetime to read and investigate, enabling researchers to revisit old sources from a new perspective. The purpose of this poster is to present a) the annotation model created for investigating topics related to economy and society in 16
th century New Spain; and b) the use of
tagtog (

https://www.tagtog.net/
), an online tool for (semi-)automated annotation, to create/curate the resources required for developing NLP tools capable of supporting historical research, through statistical learning.

Figure
. Ameca Report (Nueva Galicia) annotated with tagtog

1. Gathering information from New Spain

The project “
Digging into Early Colonial Mexico: A large-scale computational analysis of sixteenth-century historical sources”
(

http://www.lancaster.ac.uk/digging-ecm

) explores advanced computational techniques in conjunction with spatial analysis methods. Using a Big Data approach, the project focuses on analysing a 16
th century corpus known as “Las Relaciones Geográficas de la Nueva España” (hereafter RGs), consisting of documents related to New Spain, an area which encompasses present-day Mexico and Guatemala. These accounts include textual and pictorial information that portray the colonial situation of the domains ruled by the Spanish Crown, describing the life of their inhabitants and the state of the territories five decades after Spanish arrival.

This corpus contains around 2,800,000 words compiled from answers to the RGs questionnaires which were conducted between 1577 and 1585, covering topics including economy, resources, environment, traditions, geography, government, military organisation and language, among many others.

2. Annotating the RGs WITH Tagtog

Analysing the multiple facets contained in the RGs requires a combination of techniques from NLP, ML and Text Mining. In order to identify, extract and analyse such information, our project partnered with tagtog, a collaborative text annotation tool for adding metadata to text, improving processing, querying and classification. In this particular case, (semi-)automatic annotations were produced by teaching tagtog the linguistic nuances of our domain. This tool supports two learning methods: built-in ML and dictionaries. Patterns were defined using dictionaries and a domain-specific ML model was trained to annotate the RG corpus automatically. Tagtog’s interactive interface was used to perform initial annotations, and for curating the results of the ML model.

Figure
. Entity types. Please note that the colours correspond to those used in Figure 1

We first created an annotation model with a set of 40 tailored entity labels (information types) based on data relevant to our research questions with the aim of: a) training the machine to automatically recognise topics we are interested in analysing; b) improving versatility when exploring the text; and c) expanding the scope of techniques such as Named Entity Recognition that are only able to identify proper names. Training the machine in this way will make it possible to extract information and answer questions, such as:

What were the main health issues and remedies in New Spain?
What were the most common resources and how were these distributed?
How did Spanish and indigenous populations view and interact with the physical environment?

The RGs present many challenges for NLP and ML. Firstly, there are complexities associated with a corpus written in 16
th century Spanish, especially considering that many techniques used to analyse large corpora have been developed and tested using modern English texts. Secondly, the corpus is multilingual. The reports are peppered with words in different native languages (e.g. Nahuatl) alongside Spanish. Finally, as it is a questionnaire, the topics discussed are very diverse.

Tagtog has allowed us to overcome some of these problems by designing our own annotation model, taking into account that annotations must be accurate and plentiful in order to represent the various contexts in which entity references can appear. As it is also a multi-user tool with an easy-to-use interface, it guarantees anyone could participate and annotate full documents. Beyond entity annotations, tagtog also supports different annotations that can be used to enrich existing annotations and increase their accuracy: at entity level (spans, labels, relations) and/or document level (labels). The tool’s flexibility also minimized the time required to teach tagtog to annotate relevant information automatically. However, there is still much to be done. We need to carry out more experiments with this tool, comparing, for instance, its performance between different documents. The challenges we face in this endeavour offer us the opportunity to test existing techniques from NLP, ML and Corpus Linguistics, with the aim of improving and developing these tools for use in the humanities.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.