Evaluation of Multilingual BERT in a Diachronic, Multilingual, and Multi-Genre Corpus of Bibles

paper, specified "long paper"
  1. José Calvo Tello

    Göttingen State and University Library

  2. Javier de la Rosa

    The National Library of Norway


BERT and its Application to Digital Humanities

Transformers (GPT, BERT) have become a central piece in NLP (Vaswani et al., 2017; Alammar, 2018; Tunstall et al., preprint). These language models bring new possibilities for pre-training algorithms with no labelled data, which can then be fine-tuned to new tasks (transfer learning) with fewer labels (few-shot learning). Their linguistic prowess has spurred discussions about their limitations, biases and societal and environmental impact (Bender et al., 2021; Carlini et al., 2021; Underwood, 2021).

These algorithms have also sparked the interest of the DH community. Specifically, BERT models (Devlin et al., 2019) have been explored mainly for the study of English (Han and Eisenstein, 2019; Sims et al., 2019; Fonteyn, 2020; Underwood, 2021) and German literature (Jannidis and Konle, 2020; Pagel et al., 2021; Ehrmanntraut et al., 2021), or in multilingual settings (de la Rosa et al., 2021).

However, their applicability to other Humanities domains remains questionable. Because of the vast amounts of data required for pre-training, these models are usually fed with contemporary text types (web, journalistic, or administrative documents) from resource-rich modern languages. In contrast, the Humanities deal with diverse and heterogeneous datasets, non-standard orthography, historical genres, multilingual material, and often low-resource languages. To assess the performance of multilingual models in this context, we analyze the abilities of a multilingual BERT model (mBERT) pre-trained on Wikipedia for 102 languages (Devlin, 2018) on a multi-genre, diachronic Bible dataset.

Corpus of Bible Translations

Building multilingual corpora usually involves collecting texts produced originally in each analyzed language and period (Odebrecht et al., 2019; Novakova and Siepmann, 2020; Burnard et al., 2021). However, cultural and historical differences hinder the comparability of the results (Schöch et al., preprint). To circumvent this limitation while accounting for low-resource languages, we choose a corpus of translations of Bibles. Bibles as research objects have a long tradition in Corpus Linguistics and Digital Humanities (Radday, 1973; Neumann, 1990; Holmes, 1991; Resnik et al., 1999; Christodouloupoulos and Steedman, 2015; Walczak, 2015; Lee and Yeung, 2016; Calvo Tello, 2020).

The corpus comes from Zefania-XML-bible and Bible Gateway. It contains 221 translations (11,455 books, e.g. Genesis and Psalms) in 54 languages from all continents, including historical ones such as Latin, Gothic and Syriac, and artificial languages (e.g., Esperanto).

Figures 1a-b show the number of translations per language and the historical distribution for languages with translations before 1900.

For the analysis, we apply five metrics on two fronts: two tokenization metrics and three classification metrics.

We apply the default mBERT tokenizer and a simple typographic tokenizer to the Bible dataset. The tokens identified by the mBERT tokenizer can be characters, sub-word units (part-tokens), or whole-word units resembling typographic tokens. For languages with white-space-delimited writing systems, the more whole-word units the tokenizer produces, the easier the output is for humans to interpret, because the tokenized text closely resembles the original. To formalize this notion, we calculate the ratio of mBERT part-tokens to typographic tokens (Figure 2). We also count the number of unknown tokens, i.e. tokens that are not part of the tokenizer vocabulary and cannot be split into constituents that are. Lower scores on both metrics indicate better results.
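The two tokenization metrics can be illustrated with a minimal sketch. It assumes a regex-based typographic tokenizer, a toy vocabulary, and a simplified greedy longest-match WordPiece split standing in for the real mBERT tokenizer. It also assumes that "part-tokens" are the pieces of any word that is split into more than one unit. The abstract specifies none of these details, and all function names are hypothetical.

```python
import re

UNK = "[UNK]"

def typographic_tokens(text):
    # Assumption: a typographic token is a run of word characters
    # or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def wordpiece(word, vocab):
    # Simplified greedy longest-match-first split in the style of WordPiece.
    # Non-initial pieces carry the "##" prefix; a word that cannot be
    # fully covered by the vocabulary becomes a single unknown token.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [UNK]
        pieces.append(piece)
        start = end
    return pieces

def tokenization_metrics(text, vocab):
    words = typographic_tokens(text)
    part_tokens = 0
    unknown_tokens = 0
    for word in words:
        pieces = wordpiece(word, vocab)
        if pieces == [UNK]:
            unknown_tokens += 1
        elif len(pieces) > 1:
            # Assumption: every piece of a split word counts as a part-token.
            part_tokens += len(pieces)
    ratio = part_tokens / len(words) if words else 0.0
    return {"part_token_ratio": ratio, "unknown_tokens": unknown_tokens}

# Toy vocabulary (hypothetical); the Coptic letter is deliberately missing.
toy_vocab = {"in", "the", "begin", "##ning", "."}
metrics = tokenization_metrics("in the beginning \u2c81 .", toy_vocab)
```

With a real model, the `wordpiece` function would be replaced by the tokenizer shipped with mBERT; the counting logic would stay the same under the assumptions above.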

Classification tasks

We apply classification to three annotated categories for each book: genre (e.g., historical, law, letter; Zimmermann, 2010), historical group (e.g., Gospel, Pentateuch), and Testament (Old and New). We create balanced datasets and guarantee each language has at least the same number of Bibles per category. We then split the data into training (80%), validation (10%), and test (10%) sets and build models for each language, as well as combined models for each century (except for translations from before 1500). We assess the performance of each model using the F1-macro metric (the higher, the better).
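The data preparation and scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: balancing is done by downsampling each class to the size of the smallest one, the split is a plain shuffled 80/10/10 cut, and macro-F1 is the unweighted mean of per-class F1 scores. All function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def balance(samples, seed=0):
    # samples: list of (text, label) pairs.
    # Assumption: balance by downsampling every class to the smallest class.
    if not samples:
        return []
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced

def split_80_10_10(samples, seed=0):
    # Shuffle once, then cut into training, validation and test sets.
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1, so every class counts equally.
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): 30 Old Testament books and 10 New Testament books.
books = [(f"old {i}", "Old") for i in range(30)] + [(f"new {i}", "New") for i in range(10)]
train, val, test = split_80_10_10(balance(books))
score = macro_f1(["Old", "Old", "New", "New"], ["Old", "New", "New", "New"])
```

Macro averaging is a natural fit for balanced datasets, because no class dominates the score.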

Hypothesis Testing
We use the five tokenization and classification metrics to test several hypotheses on the capabilities of mBERT. First, we hypothesize that languages in the mBERT pre-training dataset should obtain better results in the five metrics. Thus, we expect languages in the pre-training dataset to obtain low values in the tokenization metrics and high F1 classification scores. To test this hypothesis, we group Bibles by whether their languages were part of the pre-training of mBERT and run Welch’s t-tests on the tokenization (Figures 3a-b) and classification (Figures 3c-e) metrics. This hypothesis is supported by three of the five criteria: the number of unknown-tokens (Figure 3a) and the classification results on genre and Testament (Figures 3c and 3e).
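For readers unfamiliar with Welch's t-test, the following sketch computes its statistic and the Welch-Satterthwaite degrees of freedom for two groups with possibly unequal variances. The function name and the sample values are hypothetical. Obtaining a p-value additionally requires the t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    # Each group needs at least two observations for a sample variance.
    # Per-group variance of the mean (sample variance divided by group size).
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical metric values for languages inside and outside the pre-training data.
in_pretraining = [1, 2, 3, 4, 5]
not_in_pretraining = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(in_pretraining, not_in_pretraining)
```

Unlike Student's t-test, this variant does not assume that both groups share the same variance, which suits groups of different sizes.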

Second, we hypothesize that the Wikipedia sizes correlate with the metrics. While this is rejected for tokenization (Figures 4a-b), the classification results show moderate statistical correlations (Figures 4c-e). Despite having a small Wikipedia or none at all, some languages obtain good overall results, such as Haitian, Jamaican, Gothic or Esperanto.
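The abstract does not name the correlation coefficient used. As one possible reading, the sketch below computes Spearman's rank correlation, which is less sensitive than Pearson's to the heavily skewed distribution of Wikipedia sizes. The choice of coefficient, the function names and the sample values are all assumptions.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation; assumes neither input is constant.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    # 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical Wikipedia sizes (articles) and F1 scores for four languages.
wiki_sizes = [10, 1000, 100000, 50]
f1_scores = [0.4, 0.6, 0.9, 0.5]
rho = spearman(wiki_sizes, f1_scores)
```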

Consequently, in our third hypothesis, we expect texts in Romance and Germanic languages to score better than the rest of the languages. Figure 5 shows the classification results.

When compared in binary groups, Romance and Germanic languages do obtain better results than the rest for the five metrics (Figures 6a-e). 

Fourth, based on Muller et al. (2021), we hypothesize that translations using Latin script will score better than translations in other scripts. Figure 7 shows the number of unknown-tokens, notably high for Coptic and Syriac scripts.

When compared in binary groups, the five criteria support this hypothesis (Figures 8a-e). This might explain the good overall results for some low-resource languages with small or no Wikipedia, but written in Latin script.
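How each translation was assigned to a script is not described in the abstract. One simple way to derive the binary grouping (Latin versus other scripts) is to inspect the Unicode names of the letters in a text and take the most frequent script prefix. This heuristic and the function names are assumptions for illustration only.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    # Heuristic: the first word of a letter's Unicode name is its script,
    # e.g. "LATIN SMALL LETTER A" or "SYRIAC LETTER BETH".
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

def script_group(text):
    # Binary grouping used for the comparison: Latin script versus the rest.
    return "latin" if dominant_script(text) == "LATIN" else "other"

latin_example = script_group("In principio creavit Deus caelum et terram")
coptic_example = script_group("\u2c81\u2c83\u2c85")
```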

Fifth, we expect better results for contemporary translations (20th and 21st century), as these centuries constitute the core of both the pre-training and Bible datasets. We obtained the year of publication for 68% of the corpus. Figure 9 shows a positive trend over time, reaching stability after the 18th century.

However, when compared in binary groups (20th-21st vs. rest), this is only supported by the unknown-tokens metric (Figure 10a).
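The binary split by period can be sketched as follows, assuming each translation carries an optional publication year (missing for part of the corpus) and that centuries are counted in the conventional way, so that 1901 to 2000 is the 20th century. The function names are hypothetical.

```python
from typing import Optional

def century(year: int) -> int:
    # Conventional counting: years 1901-2000 belong to the 20th century.
    return (year - 1) // 100 + 1

def period_group(year: Optional[int]) -> Optional[str]:
    # Translations without a known year cannot be assigned to a group.
    if year is None:
        return None
    return "20th-21st" if century(year) >= 20 else "rest"

groups = [period_group(y) for y in (1611, 1901, 2011, None)]
```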


Our evaluation reveals biases in favor of some high-resource families of languages (Germanic, Romance), although other languages (Haitian, Jamaican, Esperanto) perform reasonably well. At the historical level, the scores are high and stable not only for the 20th and 21st century as hypothesized, but since the 18th century, probably due to a more consistent spelling since then. However, the strongest bias is not towards language or period, but toward (Latin) script. Therefore, transliteration into Latin script and modernization could be mandatory steps for DH corpora interested in using mBERT. The community needs to discuss in which cases modernization and transliteration are acceptable and in which ways these limitations could be effectively mitigated.


Alammar, J. (2018). The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Visualizing Machine Learning One Concept at a Time. http://jalammar.github.io/illustrated-bert/ (accessed 7 November 2021).

Bender, E. M., Gebru, T., McMillan-Major, A. and Mitchell, M. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Virtual Event Canada: ACM, pp. 610–23 doi:10.1145/3442188.3445922. https://dl.acm.org/doi/10.1145/3442188.3445922 (accessed 3 November 2021).

Burnard, L., Schöch, C. and Odebrecht, C. (2021). In search of comity: TEI for distant reading. Journal of the Text Encoding Initiative (Issue 14). Text Encoding Initiative Consortium doi:10.4000/jtei.3500. https://journals.openedition.org/jtei/3500 (accessed 10 September 2021).

Calvo Tello, J. (2020). What is a Genre? A Graph Unified Model of Categories, Texts, and Features. Ottawa: ADHO https://hcommons.org/deposits/item/hc:31713/ (accessed 12 October 2020).

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., et al. (2021). Extracting Training Data from Large Language Models. ArXiv:2012.07805 [Cs]. http://arxiv.org/abs/2012.07805 (accessed 7 November 2021).

Christodouloupoulos, C. and Steedman, M. (2015). A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation (2): 375–95 doi:10.1007/s10579-014-9287-y.

Devlin, J. (2018). Multilingual BERT Documentation. https://github.com/google-research/bert/blob/a9ba4b8d7704c1ae18d1b28c56c0430d41407eb1/multilingual.md.

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805 (accessed 22 October 2021).

Ehrmanntraut, A., Hagen, T., Konle, L. and Jannidis, F. (2021). Type- and Token-based Word Embeddings in the Digital Humanities. http://ceur-ws.org/Vol-2989/long_paper35.pdf (accessed 12 October 2021).

Fonteyn, L. (2020). What about Grammar? Using BERT Embeddings to Explore Functional-Semantic Shifts of Semi-Lexical and Grammatical Constructions. https://2021.computational-humanities-research.org/cfp/ (accessed 12 October 2021).

Han, X. and Eisenstein, J. (2019). Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 4238–48 doi:10.18653/v1/D19-1433. https://aclanthology.org/D19-1433 (accessed 7 November 2021).

Holmes, D. I. (1991). Vocabulary Richness and the Prophetic Voice. Literary and Linguistic Computing (4): 259–68 doi:10.1093/llc/6.4.259.

Jannidis, F. and Konle, L. (2020). Domain and Task Adaptive Pretraining for Language Models. https://dh-abstracts.library.cmu.edu/works/10214 (accessed 12 October 2021).

Lee, J. and Yeung, C. Y. (2016). An Annotated Corpus of Direct Speech. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), pp. 1059–63 http://www.lrec-conf.org/proceedings/lrec2016/pdf/1061_Paper.pdf (accessed 20 April 2019).

Muller, B., Anastasopoulos, A., Sagot, B. and Seddah, D. (2021). When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 448–62 doi:10.18653/v1/2021.naacl-main.38. https://aclanthology.org/2021.naacl-main.38 (accessed 8 November 2021).

Neumann, K. J. (1990). The Authenticity of the Pauline Epistles in the Light of Stylostatistical Analysis. (Society of Biblical Literature Dissertation Series 120). Atlanta, GA: Scholars Press.

Novakova, I. and Siepmann, D. (2020). Phraseology and Style in Subgenres of the Novel: A Synthesis of Corpus and Literary Perspectives. Cham, Switzerland: Palgrave Macmillan.

Odebrecht, C., Burnard, L., Colorado, B. N. and Schöch, C. (2019). European Literary Text Collection (ELTeC): Release with 10 collections of at least 50 novels. Zenodo doi:10.5281/ZENODO.4274954. https://zenodo.org/record/4274954 (accessed 17 February 2021).

Pagel, J., Sihag, N. and Reiter, N. (2021). Predicting Structural Elements in German Drama. http://ceur-ws.org/Vol-2989/long_paper35.pdf (accessed 12 October 2021).

Radday, Y. T. (1973). The Unity of Isaiah in the Light of Statistical Linguistics. (Collection Massorah. Série 2, Etudes Quantitives et Automatisées 1). Hildesheim: Gerstenberg.

Resnik, P., Olsen, M. B. and Diab, M. (1999). The Bible as a Parallel Corpus: Annotating the ‘Book of 2000 Tongues’. Computers and the Humanities (1): 129–53 doi:10.1023/A:1001798929185.

de la Rosa, J., Pérez, Á., Sisto, M. de, Hernández, L., Díaz, A., Ros, S. and González-Blanco, E. (2021). Transformers analyzing poetry: multilingual metrical pattern prediction with transfomer-based language models. Neural Computing and Applications doi:10.1007/s00521-021-06692-2. https://doi.org/10.1007/s00521-021-06692-2 (accessed 22 November 2021).

Schöch, C., Erjavec, T., Patras, R. and Santos, D. (preprint). Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives. Modern Languages Open doi:10.5281/zenodo.4742420. https://zenodo.org/record/4742420 (accessed 10 September 2021).

Sims, M., Park, J. H. and Bamman, D. (2019). Literary Event Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3623–34 doi:10.18653/v1/P19-1353. https://aclanthology.org/P19-1353 (accessed 7 November 2021).

Tunstall, L., Werra, L. von and Wolf, T. (preprint). Natural Language Processing with Transformers: Building Language Applications with Hugging Face. Sebastopol: O’Reilly Media.

Underwood, T. (2021). Mapping the Latent Spaces of Culture. Using Large Digital Libraries to Advance Literary History. Humanities Commons https://tedunderwood.com/2021/10/21/latent-spaces-of-culture/ (accessed 3 November 2021).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. (2017). Attention Is All You Need. ArXiv:1706.03762 [Cs]. http://arxiv.org/abs/1706.03762 (accessed 8 November 2021).

Walczak, B. (2015). Role of the Bible in the development of languages and linguistics. An outline of the issue. Forum Lingwistyczne (2). Wydawnictwo Uniwersytetu Śląskiego / University of Silesia Press https://www.journals.us.edu.pl/index.php/FL/article/view/6066 (accessed 15 November 2021).

Zimmermann, R. (2010). Theologische Gattungsforschung. In Zymner, R. (ed), Handbuch Gattungstheorie. Stuttgart: Verlag J.B. Metzler, pp. 302–05.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022


Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/


Series: ADHO (16)

Organizers: ADHO