Exploring Lexical Diversities

Paper (long paper)
  1. Andreas Blombach

     Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg

  2. Stephanie Evert

     Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg

  3. Fotis Jannidis

     Julius-Maximilians-Universität Würzburg

  4. Steffen Pielström

     Julius-Maximilians-Universität Würzburg

  5. Leonard Konle

     Julius-Maximilians-Universität Würzburg

  6. Thomas Proisl

     Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg



The assumption that literary texts are particularly complex is one of the most important premises of work in literary studies (for example Koschorke 2016, Da 2019). This complexity can be perceived on many different levels, with lexical diversity being one of many determining factors. Different disciplines have proposed different measures over time, but only recently have attempts been made to consolidate research findings into a comprehensive overview (for example Jarvis 2013; Tweedie/Baayen 1998). Here, we propose a multi-dimensional model of lexical complexity. We provide a definition for each dimension and suggest a best-practice operationalization for most. These operationalizations are validated by comparing a collection of texts for adult readers with a collection of comparable texts aimed at children. Finally, we illustrate the usefulness of our approach by applying it to literary texts. Though we work with German texts, previous work on variability in different languages, including Chinese and Japanese, has shown that these measures are not language specific (Pielström et al. in preparation).

The validation corpora (Weiß & Meurers 2018) contain German non-fiction texts from the educational magazine “Geo”, a publication conceptually comparable to “National Geographic”, and from its offshoot for children, “Geolino”. For literary texts, we compare highbrow novels (161 works, approx. 17 million tokens) with “dime novels” (1,167 works in six different genres, approx. 40 million tokens), both under copyright. Dime novels are a type of fiction mass-produced in long-running series and sold at kiosks rather than in book stores.

Aspects of complexity and measurement
Quantifying diversity is no trivial task. As Jarvis (2013b) points out, existing measures of lexical diversity often lack an underlying construct definition and intuitive concepts of diversity vary. Jarvis proposes six dimensions to properly define the construct: variability, volume (which we do not consider separately), evenness, rarity, dispersion, and disparity. Additionally, we look at innovation, surprise, and density.

The most intuitive indicator of lexical diversity is the variability of the words used in a text. The most widely known measure is the type-token ratio (TTR).
TTR depends systematically on sample size. Among the solutions proposed for this problem, standardized TTRs (STTR) calculated from fixed-length text chunks provide a practical and intuitive solution (Fig. 1).

Figure 1: STTR in GEO and GEOlino
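The STTR described above can be sketched in a few lines of Python. The window size of 1,000 tokens is an illustrative choice, not necessarily the one used in the study:

```python
def sttr(tokens, window=1000):
    """Standardized type-token ratio: mean TTR over consecutive
    fixed-length chunks, removing the dependence on text length."""
    # split into non-overlapping, full-length chunks only
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:
        # text shorter than one window: fall back to the plain TTR
        return len(set(tokens)) / len(tokens)
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks)
```

Because every chunk has the same length, chunk-level TTRs are comparable and their mean no longer grows or shrinks with the overall text length.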

A text containing many rare words will generally be perceived as more difficult and more complex than a text with a higher proportion of very common words. We use a simple approach to model rarity. For each text, we compute the proportion of content words not included in the 5,000 most frequent content words from a large web corpus that covers many different registers, the DECOW16BX (Fig. 2, Schäfer and Bildhauer 2012, Schäfer 2015). 

Figure 2: Rarity in GEO and GEOlino
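The rarity measure reduces to a membership test against the reference list. In this sketch the top-5,000 list is a plain set of word forms; in the study it is derived from the content words of DECOW16BX:

```python
def rarity(content_words, top_5000):
    """Proportion of content-word tokens that are NOT among the
    5,000 most frequent content words of a large reference corpus."""
    if not content_words:
        return 0.0
    rare = sum(1 for w in content_words if w not in top_5000)
    return rare / len(content_words)
```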

According to Jarvis (2013b), the perceived lexical diversity is higher if the occurrences of a particular type are more dispersed, whereas a more clustered pattern produces an impression of redundancy. To measure this effect, we again use a window-based approach (Fig. 3). Inside a window, we calculate a dispersion score based on the Gini coefficient (Gini 1912) for each type and use the arithmetic mean of this score over all types with a frequency greater than one as dispersion measure for the whole text (see Blombach et al. in preparation for a detailed description).

Figure 3: Dispersion in GEO and GEOlino
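The exact dispersion score is defined in Blombach et al. (in preparation); the following is one plausible sketch, assuming the Gini coefficient is computed over the gaps between successive occurrences of a type within the window and inverted so that evenly spread occurrences score high:

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if n == 0 or total == 0:
        return 0.0
    # standard formulation via the rank-weighted cumulative sum
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * cum) / (n * total) - (n + 1) / n

def window_dispersion(tokens):
    """Mean of (1 - Gini of occurrence gaps) over all types with
    frequency > 1: evenly spread tokens -> high dispersion score."""
    positions = {}
    for i, t in enumerate(tokens):
        positions.setdefault(t, []).append(i)
    scores = []
    for pos in positions.values():
        if len(pos) > 1:
            gaps = [b - a for a, b in zip(pos, pos[1:])]
            scores.append(1 - gini(gaps))
    return sum(scores) / len(scores) if scores else 1.0
```

In the study this score is computed per window and averaged over windows; here a single window is shown for brevity.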


Lexical disparity follows the intuition that repetition also shows in the occurrence of similar words on a semantic level. To measure global disparity, a document is segmented, and a vector is generated for each segment by averaging over the vectors of its content words. The disparity of a segment is then calculated from its pairwise Euclidean distances to all other segments; the document's disparity is the mean over all its segment disparities (Fig. 4).

Figure 4: Disparity in GEO and GEOlino
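A minimal sketch of this computation, assuming pre-computed word vectors (e.g. from pretrained embeddings; the abstract does not specify which). Note that the mean over per-segment disparities is equivalent to the mean pairwise distance between segment centroids:

```python
import numpy as np

def disparity(segments):
    """segments: list of 2-d arrays, one per segment, each of shape
    (n_tokens, n_dims) holding the content-word vectors of that segment."""
    # one centroid vector per segment
    centroids = [np.asarray(seg, dtype=float).mean(axis=0) for seg in segments]
    # mean pairwise Euclidean distance between segment centroids
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(centroids)
             for b in centroids[i + 1:]]
    return float(np.mean(dists)) if dists else 0.0
```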

A text containing a higher proportion of content words can be considered denser and therefore more complex (Fig. 5).

Figure 5: Density in GEO and GEOlino
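Density is the simplest of the measures: given a content-word predicate (in practice derived from part-of-speech tags; the predicate here is a placeholder), it is a single ratio:

```python
def density(tokens, is_content):
    """Proportion of content words (nouns, verbs, adjectives, ...)
    among all tokens of a text."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_content(t)) / len(tokens)
```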


Most of the measures suggested here (variability, rarity, dispersion, and density) are implemented in our textcomplexity toolbox, which contains additional complexity measures as well.

We have also created an interactive application which allows users to visually explore our data, including correlations between different measures and the influence of parameters such as window size, case sensitivity, and the inclusion or exclusion of punctuation.

Application to Literature
Fig. 6 shows the measures of lexical complexity applied to six genres of dime novels and a set of highbrow novels. Counter to our expectations, science fiction and fantasy equal or even surpass the highbrow novels in some respects (disparity, density, dispersion, and rarity). We assume that different forms of lexical complexity are at work here: in science fiction and fantasy, noun-heavy prose depicts new worlds with new words; in highbrow literature, on the other hand, high variability reflects a stylistic ideal that aims to avoid repetition and display elegance. There might also be a difference in the scope at which authors control complexity, for example variability: we found less repetition in small windows in genre texts, whereas variability in highbrow literature increases with window size.
Fig. 7 shows that genre similarities can be perceived immediately using this kind of representation. A multi-dimensional model of lexical complexity allows a clearer understanding of genre differences.

Figure 6: Diversity per aspect. All dimensions have been scaled to values between 0 and 1

Figure 7: Radar plots highlighting the similarities between genres


References

Blombach, A., Evert, S., Jannidis, F., Konle, L., Pielström, S. & Proisl, T. (in preparation): Lexical Complexity in Texts. A Multidimensional Model.

Da, N. Z. (2019): The computational case against computational literary studies. In:
Critical Inquiry, 45(3), p. 601–639.

Falk, I., Bernhard, D. & Gérard, C. (2014): From Non Word to New Word: Automatically Identifying Neologisms in French Newspapers. In:
Proceedings of LREC 2014.

Gini, C. (1912):
Variabilità e Mutabilità. Contributo allo Studio delle Distribuzioni e delle Relazioni Statistiche. C. Cuppini, Bologna.

Jarvis, S. (2013a): Capturing the Diversity in Lexical Diversity. In:
Language Learning 63 (1), p. 87–106.

Jarvis, S. (2013b): Defining and Measuring Lexical Diversity. In: Jarvis, Scott / Daller, Michael (Eds.):
Vocabulary Knowledge. Human Ratings and Automated Measures. Amsterdam: John Benjamins. (= Studies in Bilingualism 47)

Klosa, A. & Lüngen, H. (2018): New German Words: Detection and Description. In:
Proceedings of the XVIII EURALEX, p. 559–569. Ljubljana.

Koschorke, A. (2016):
Komplexität und Einfachheit. p. 1–10. Stuttgart.

Ney, H., Essen, U. & Kneser, R. (1994): On structuring probabilistic dependences in stochastic language modelling. In:
Computer Speech & Language, Volume 8, Issue 1, p. 1–38.

Pielou, E.C. (1966): The measurement of diversity in different types of biological collections. In:
Journal of theoretical biology. 13: p. 131–144. doi:10.1016/0022-5193(66)90013-0

Pielström, S., Hodošček, B., Calvo Tello, J., Henny-Krahmer, U., Jannidis, F., Schöch, C., Du, K., Uesaka, A. and Tabata, T. (in preparation): Measuring Lexical Diversity of Literary Texts.

Schäfer, R. (2015): Processing and Querying Large Web Corpora with the COW14 Architecture. In:
Proceedings of Challenges in the Management of Large Corpora (CMLC-3) (IDS publication server), p. 28–34.

Schäfer, R. & Bildhauer, F. (2012): Building Large Corpora from the Web Using a New Efficient Tool Chain. In:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), p. 486–493.

Weiß, Z. & Meurers, D. (2018): Modeling the readability of German targeting adults and children: An empirically broad analysis and its cross-corpus validation. In:
Proceedings of the 27th International Conference on Computational Linguistics, p. 303–317, Santa Fe, New Mexico, USA.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO