Accuracy is not all you need

paper, specified "long paper"
  1. 1. Ross Deans Kristensen-McLachlan

    Aarhus University

  2. 2. Ida Marie S. Lassen

    Aarhus University

  3. 3. Kenneth Enevoldsen

    Aarhus University

  4. 4. Lasse Hansen

    Aarhus University

  5. 5. Kristoffer L. Nielbo

    Aarhus University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Discussions around diversity and bias in language representations are a hot topic in contemporary natural language processing. Countless papers have pointed out that these representations can be shown to contain specific biases, such as in the case of both so-called static embeddings (Bolukbasi et al., 2016) and more state-of-the-art contextual approaches (Zhao et al., 2019).
In this paper, we contribute to on-going attempts to quantify and measure the effects of this bias for specific NLP tasks. We present results of an experiment which tests the effect of data biases on the performance of different NLP frameworks for named entity recognition (NER) in Danish. While the immediate results deal specifically with only this language, the methods employed can be adapted to other linguistic and cultural contexts, given appropriate modifications. We choose to focus specifically on biases in the sense of representational harms (Barocas et al., 2017) by investigating the performance differences of NER in Danish for different social groups, namely gender and ethnicity (Shah et al., 2019). Despite this quite narrow focus, we aim to contribute to wider discussions in the field of
fairML and bias in NLP.

Recent work has pointed out how unintended bias in NLP systems leads to systematic differences in performance for different demographic groups (Borkan et al, 2019). Building on these insights, frameworks, fairness metrics and recommendations for the field have been developed to quantify and mitigate bias (Borkan et al., 2019; Shah et al., 2019; Zhao et al., 2019; Blodgett et al., 2020; Czarnowska et al., 2021). More specifically relevant for this work, Zhao et al. (2018) have shown how data augmentation on training data can eliminate gender bias for coreference resolution. With our work, we propose another use of data augmentation, namely as a method to test the robustness of NLP models as well as uncover potential social biases in the model.


We define bias as systematic difference in error –
error disparity – as a function of a given sensitive features (Shah et al. 2019). In other words, bias in the model is measured through difference of performance accuracy when data is augmented with different gender and ethnicity features.

In Enevoldsen et al. (2021) a range of contemporary Danish NLP frameworks were subjected to a series of data augmentation strategies to test their robustness. These augmentations included random keystroke augmentation to simulate spelling errors; and spelling variations specific to the Danish language. Additionally, among the augmentation strategies were the following name augmentations:

Substitute all names (PER entities) with randomly sampled Danish names, respecting first and last names.
Substitute all names with randomly sampled names of Muslim origin used in Denmark (Meldgaard, 2005), respecting first and last names.
Substitute all names with sampled Danish male names, respecting first and last names.
Substitute all names with sampled Danish female names, respecting first and last names.

The purpose of these augmentations was to test specifically the robustness of named entity recognition in each of the Danish NLP frameworks given data which had been augmented relative to gender and ethnicity. If a framework performed just as well (or better) with these augmentation as without, this is taken to indicate robustness. If a framework performs worse, we are able to quantify exactly where the model is failing and, hence, where potential biases reside.
Table 1 illustrates general performance of contemporary Danish NLP frameworks on a range of tasks. We can see that larger, transformer-based models consistently outperform other models, particularly on NER tasks. At a general level, these results underline three well-known trends in deep learning and NLP: 1) larger models tend to perform better; 2) higher quality pre-training data leads to better models; and 3) multilingual models perform competitively with monolingual models (Brown et al., 2020; Raffel et al., 2020; Xue et al., 2021).


The results from the full range of augmentation strategies can be found in Enevoldsen et al. (2021). In Table 2, we see only the results following name augmentation. From this table, we can see that the NER performance of every model is affected to some degree by the data augmentations. What is immediately apparent, though, is that not all models are affected equally and not all augmentations cause as pronounced effects. Our results seem to demonstrate that Danish language models are relatively robust to the effects of gender. However, the same can not be said for Muslim names, which cause significantly worse performance for all models. This suggests that Danish NLP models contain a greater relative bias in terms of ethnicity than gender.


There are some notable limitations to the current study. Firstly, we have presented experimental results for a single, comparatively small Indo-European language. Nevertheless, we predict that similar results could be obtained for different languages, given an appropriate change of experimental conditions. Secondly, by making augmentation on male and female names we continue the folk conception of gender, where gender is understood as binary and static. Finally, we have not further separated Muslim names into typically male Muslim names and typically female Muslim names. Were our results stratified in this manner, we would be in the position to conduct a more intersectional analysis into the relative effect of gender and ethnicity. This more nuanced perspective is the goal of a future experiment.
However, the present results already paint a complex but essentially coherent picture. We have demonstrated that data augmentation is a simple and transparent way of testing the robustness of contemporary language models on tasks like named entity recognition. Moreover, we have shown that these data augmentation tasks can be used to test for specific biases in language models with respect to categories such as gender, race, and ethnicity. Doing so puts us in the position to evaluate the performance of specific language models not only in relation their overall performance metrics such as accuracy and micro-F1 scores but also their robustness and biases relative to these categories.
Our proposal is that in the domain of NLP and language modelling
accuracy is not all you need. This is especially true for researchers who apply NLP tools for text analysis in the humanities and social sciences. For example, a researcher working in the field of gender history might need their models to be particularly robust with respect to gender; a scholar of social media might have a specific reason to require that their model is particularly robust to different ethnicities represented in their data. Rather like the trade-off between precision and recall or between speed and accuracy, there may need to be a trade-off between the overall accuracy of a particular language model and its robustness in terms of diversity. By reflecting on bias in model selection we advocate for a move of responsibility for model biases back to the technologists developing and deploying these models instead of pushing it to the communities affected by biases in NLP systems (Blodgett et al., 2020).

Barocas, S., Crawford, K., Shapiro, A., & Wallach, H. (2017). The problem with bias: Allocative versus representational harms in machine learning, Proceedings of SIGCIS 2017.
Blodgett, S. L., Barocas, S., H. D. III, & Wallach, H. M. (2020). Language (technology) is power: A critical survey of “bias” in NLP, URL: arXiv:2005.14050.
Bolukbasi, T., Chang, K., Zou, J. Y., Saligrama, V. & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, URL: arXiv:1607.06520.
Borkan, D., Dixon, L., Sorensen, J., Thain, N., & Vasserman, L. (2019). Nuanced metrics for measuring unintended bias with real data for text classification, URL: arXiv:1903.04561.
Brown, T. B. , Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R, Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020) Language models are few-shot learners. URL: arXiv:2005.14165.
Czarnowska, P., Vyas, Y., & Shah, K. (2021). Quantifying social biases in NLP: A generalization and empirical comparison of extrinsic fairness metrics, URL: arXiv:2106.14574.
Enevoldsen, K., Hansen, L., & Nielbo, K. (2021). Dacy: A unified framework for Danish NLP, URL:, arXiv:2107.05295.
Gaut, A., Sun, T., Tang, S., Huang, Y., Qian, J., ElSherief, M., Zhao, J., Mirza, D., Belding, E. M., Chang, K., & Wang, W. Y. (2019). Towards understanding gender bias in relation extraction, URL: arXiv:1911.03642.
Meldgaard, E. V. (2005). Muslimske fornavne i danmark. URL: fornavne/, Københavns Universitet.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020) Exploring the limits of transfer learning with a unified text-to-text transformer, URL: arXiv:1910.10683.
Shah, D., Schwartz, H. A., & Hovy, D. (2019). Predictive biases in natural language processing models: A conceptual framework and overview, URL: arXiv:1912.11078.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., & Raffel, C. (2020) mT5: A massively multilingual pre-trained text-to-text transformer. URL: arXiv:2010.11934.
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods, URL: arXiv:1804.06876.
Zhao, J, Wang, T., Yatskar, M., Cotterell, R., Ordonez, V., & Chang, K.W. (2019). Gender bias in contextualized word embeddings, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp. 629–634. URL doi:10.18653/v1/N19-1064.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website:

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO