A mixture model for a uni-modal word frequency distribution

  1. Harald Baayen

    Max Planck Institute for Psycholinguistics - University of Nijmegen

  2. Fiona Tweedie

    University of Glasgow


Baayen and Lieber [1] report a bi-modal density function for the word frequency distribution of morphologically complex words with the Dutch prefix ONT-. They argue that this distribution is a mixture of two distinct distributions. They hypothesize that the words in the higher-frequency ranges, typically denoting well-known concepts of the language, follow the lognormal distribution [2], and that the complex words in the lowest-frequency ranges, typically regular and fully predictable nonce-formations, have a distribution that is highly skewed to the left.

The aim of this paper is to show that mixture distributions can also arise in the case of uni-modal distributions. We show that the English translation equivalents for the Dutch suffix -HEID provide an objective means for distinguishing between the conceptual and anaphoric uses of this suffix. Our data suggest that the conceptual function of -HEID is instantiated by a lognormal distribution, whereas its anaphoric use is realized approximately by the inverse Gauss-Poisson distribution [3]. These results provide further support for the hypothesis advanced by Baayen and Lieber [1] that semantic differences can be reflected in word frequency distributions.


The Dutch suffix -HEID, which, like -NESS in English, coins abstract nouns from adjectives, has two distinct semantic functions. First, -HEID is used to create words for concepts. For instance, the Dutch word SNEL-HEID, literally QUICK-NESS, has SPEED as its English translation equivalent. Thus the concept of 'speed' is realized by a monomorphemic word in English, just as the concepts 'house' and 'tree' are realized by monomorphemic words. In Dutch, the concept for 'speed' happens to be realized by a morphologically complex word. We will henceforth refer to this use of -HEID as its conceptual use.

Second, -HEID is also used to refer to states of affairs that have been introduced previously in the discourse. For instance, if John has been described as grateful, this situation can later be referred to by 'John's gratefulness'. We will refer to this use of -HEID as its anaphoric use. The two functions of -HEID, concept-formation versus anaphoric discourse referencing, are not mutually exclusive. One and the same word can realize both functions. Nevertheless, the two functions can be distinguished quantitatively. Baayen and Neijt [4], using contextual clues for a sample of low- and high-frequency nouns in -HEID in a corpus of newspaper Dutch, show that the conceptual function is somewhat more typical for high-frequency words, while the anaphoric use is somewhat more typical for low-frequency words.


The present paper is a distributional study of the two semantic functions of -HEID in which the English translation equivalents of nouns in -HEID are used to gauge which function is most typical for a given complex word. Although -HEID is most commonly realized by -NESS, there are a great many other translation equivalents in English. Some examples are listed in Table 1. Apart from monomorphemic translation equivalents, we find phrasal equivalents for lexical gaps, and some 35 different suffixal equivalents, of which -NESS occurs most often, followed by -ITY.

Using the CELEX lexical database [5], we obtained a total of 2368 nouns in -HEID. Of these nouns, 702 words occurred in the Van Dale Dutch-English dictionary [6], and 1666 words are only attested in the corpus of 42 million words underlying the CELEX lexical database. The left-hand panels of Figure 1 show the estimated density functions for these distributions using logarithmically transformed frequencies. As bandwidth we used the log frequency range divided by 2(1 + log2(n)), with n the number of words (see Venables and Ripley [7], p. 137).
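The bandwidth rule described above can be sketched as follows. This is a minimal illustration, not the original S-PLUS computation: the frequencies are hypothetical, the natural logarithm is assumed for the frequency transform, a Gaussian kernel is assumed, and the bandwidth is used directly as the kernel standard deviation (S-PLUS scales its width argument differently).

```python
import math

def bandwidth(log_freqs):
    """Bandwidth as described in the text: range / (2 * (1 + log2(n)))."""
    n = len(log_freqs)
    value_range = max(log_freqs) - min(log_freqs)
    return value_range / (2.0 * (1.0 + math.log2(n)))

def gaussian_kde(log_freqs, grid, width):
    """Gaussian kernel density estimate on `grid`.

    Assumption: `width` is taken as the kernel standard deviation.
    """
    n = len(log_freqs)
    norm = 1.0 / (n * width * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / width) ** 2) for v in log_freqs)
        for x in grid
    ]

# Hypothetical token frequencies, NOT the CELEX counts.
freqs = [1, 1, 1, 2, 2, 3, 5, 8, 20, 45, 120, 400]
log_freqs = [math.log(f) for f in freqs]  # natural log is an assumption

h = bandwidth(log_freqs)
grid = [i * 0.25 for i in range(-8, 41)]  # log frequency from -2 to 10
density = gaussian_kde(log_freqs, grid, h)
print(f"bandwidth = {h:.3f}")
```

Because the denominator grows only with log2(n), the bandwidth shrinks slowly as more words are added, which keeps the estimate for the large sets from becoming overly spiky.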

Table 1. Examples of English translation equivalents for the Dutch suffix -HEID

monomorphemic equivalent

phrasal equivalent: incidental circumstance, dirty trick

equivalent in -ITY

equivalent in -ION

Figure 1. Density estimates for frequency distributions of sets of words in -HEID.

The top left panel plots the density function for the complete distribution of 2368 words. The majority of types are concentrated around the lowest log frequencies, but the slight bulge in the medium log frequency ranges suggests that we may be dealing with a mixture distribution of anaphoric words and concept words. When we remove the words that occur in the dictionary, which are prime candidates for being concept words, we obtain the distribution plotted in the center left panel. The probability mass of medium-frequency words is now substantially reduced. For this distribution we hypothesize that the anaphoric function is most prominent. These anaphoric words approximately follow an inverse Gauss-Poisson distribution (gamma = -0.5350, b = 0.0269, c = 0.0068) [3], [8]. The bottom left panel plots the density of the 702 words that appear in the dictionary with an English translation equivalent. This set of concept words is roughly lognormally distributed (m = 3.83, s = 1.72).
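To make the lognormal parameters concrete, the following sketch estimates m and s as the mean and standard deviation of the log frequencies and evaluates the corresponding density. The estimation method and the use of the natural logarithm are assumptions, since the text reports only the resulting values; the sample frequencies are hypothetical.

```python
import math
import statistics

def fit_lognormal(freqs):
    """Return (m, s): mean and sample standard deviation of log frequencies."""
    logs = [math.log(f) for f in freqs]  # natural log is an assumption
    return statistics.mean(logs), statistics.stdev(logs)

def lognormal_pdf(f, m, s):
    """Lognormal density at frequency f > 0 with log-scale parameters m and s."""
    z = (math.log(f) - m) / s
    return math.exp(-0.5 * z * z) / (f * s * math.sqrt(2.0 * math.pi))

# Hypothetical frequencies, NOT the dictionary words from the text.
sample = [3, 7, 12, 30, 46, 80, 150, 410, 900]
m_hat, s_hat = fit_lognormal(sample)
print(f"estimated m = {m_hat:.2f}, s = {s_hat:.2f}")

# Density under the parameters reported for the 702 dictionary words.
M_REPORTED, S_REPORTED = 3.83, 1.72
for f in (1, 10, 100, 1000):
    print(f, lognormal_pdf(f, M_REPORTED, S_REPORTED))
```

Under this reading, exp(m) is the median frequency of the fitted distribution, so a larger m shifts the whole set of words toward higher frequencies.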

It might be argued that we have thus far only illustrated the truism that dictionaries list more frequent words. In order to show that this difference in frequency ties in with a more interesting difference in the semantic function of words in -HEID, we partitioned the 702 words that appear in the dictionary into three sets: the 171 words with only a word in -NESS as translation equivalent, the 184 words with a word in -NESS as well as at least one other kind of translation equivalent, and the 347 words with one or more translation equivalents other than a word in -NESS. Given the results of Baayen and Neijt [4], our hypothesis is that words with only a translation in -NESS are more likely to enjoy anaphoric use and hence will have the lowest frequencies of use, while the words with no translation equivalent in -NESS, such as SNELHEID, 'speed', are somewhat more likely to instantiate the conceptual use of -HEID, and concomitantly will have the highest frequencies of use.

The right-hand panels of Figure 1 plot the density functions of these three sets, which approximately follow the lognormal distribution. Both t-tests and Kolmogorov-Smirnov two-sample tests show that the three sets differ significantly (p < 0.002 for all comparisons). As expected, the set of words with only a translation equivalent in -NESS has the lowest mean log frequency (m = 3.00, s = 1.44), the set with a translation equivalent in -NESS and at least one other kind of translation equivalent has an intermediate mean log frequency (m = 3.77, s = 1.61), while the set of words with no translation equivalent in -NESS has the highest mean log frequency (m = 4.26, s = 1.75). Since this last set only includes monomorphemic and phrasal translation equivalents as well as translation equivalents with various unproductive suffixes such as -TH, -DOM, and -HOOD, this set has the highest likelihood of containing words in -HEID with primarily a conceptual use.
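The reported group means, standard deviations, and set sizes are enough for a rough consistency check of the pairwise comparisons. The sketch below computes a Welch t statistic from these summaries and a two-sided p-value from a normal approximation. Both choices are assumptions: the text does not say which t-test variant was used, and the Kolmogorov-Smirnov tests cannot be reproduced without the underlying frequencies.

```python
import math
from statistics import NormalDist

# (mean log frequency, standard deviation, number of words) as reported in the text
GROUPS = {
    "only -NESS": (3.00, 1.44, 171),
    "-NESS and other": (3.77, 1.61, 184),
    "no -NESS": (4.26, 1.75, 347),
}

def welch_t(a, b):
    """Welch t statistic for two groups given as (mean, sd, n) summaries."""
    (m1, s1, n1), (m2, s2, n2) = a, b
    return (m2 - m1) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def two_sided_p_normal(t):
    """Two-sided p-value from a normal approximation (assumed adequate for large n)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

names = list(GROUPS)
results = {}
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        t = welch_t(GROUPS[names[i]], GROUPS[names[j]])
        p = two_sided_p_normal(t)
        results[(names[i], names[j])] = (t, p)
        print(f"{names[i]} vs {names[j]}: t = {t:.2f}, p = {p:.4g}")
```

Under these assumptions all three pairwise p-values fall below 0.002, which agrees with the significance level reported above.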

Note that as we move from the higher to the lower frequency ranges, the translation equivalents with -NESS become increasingly important. We take this to indicate that their anaphoric function becomes more and more dominant as well. For the set of words with only -NESS as translation equivalent for -HEID, the anaphoric function is probably more important than the conceptual function, although the fact that they are listed in the dictionary by itself suggests that they also enjoy some conceptual use. From this perspective, the words that do not occur in the dictionary, and that in general easily translate with -NESS, are prime candidates for being almost exclusively used anaphorically.

Additional evidence for the hypothesis that the higher-frequency formations in -HEID enjoy primarily conceptual use can be obtained from the general correlation between frequency and polysemy. It is well known that higher-frequency words tend to have more shades of meaning than lower-frequency words [9]. For our translation-based approach, this general correlation predicts that higher-frequency words should have more translation equivalents than lower-frequency words. If formations in -HEID indeed carry a conceptual function, they should be subject to this same correlation.

Figure 2 shows, by means of boxplots with the number of translations on the horizontal axis and log frequency on the vertical axis, that there is indeed a positive correlation between frequency and number of translation equivalents for our words in -HEID. The left-hand panel concerns the words without a translation equivalent in -NESS. These words correspond to the bottom right panel of Figure 1. The right-hand panel concerns the words with a translation equivalent in -NESS. The leftmost box-and-whiskers configuration corresponds to the upper right panel of Figure 1; the remaining configurations correspond to the center right panel of Figure 1. For both panels we observe that log frequency increases with the number of translation equivalents. The observed positive correlations are statistically reliable according to the Pearson and Spearman correlation tests. Note the main effect of the presence of a translation equivalent in -NESS on log frequency: words without a translation equivalent in -NESS (left panel) have higher frequencies than words with such a translation equivalent (right panel). Clearly, the higher-frequency words in -HEID are subject to the general correlation between polysemy and frequency that holds for the lexicon in general. As we move down the frequency range, the semantic richness of the words in -HEID decreases. For the lowest frequencies, only the bare rule-governed anaphoric function remains.
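The two correlation coefficients mentioned above can be computed as in the following sketch. The data pairs are hypothetical and serve only to show the calculation; the significance tests themselves are not reproduced here. Ties, which are unavoidable when the number of translation equivalents is a small integer, are handled with average ranks.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical (number of translation equivalents, log frequency) pairs.
n_translations = [1, 1, 1, 2, 2, 2, 3, 3, 4]
log_frequency = [2.1, 3.4, 2.8, 3.9, 3.1, 4.6, 4.8, 5.5, 6.2]

print(f"Pearson r    = {pearson(n_translations, log_frequency):.3f}")
print(f"Spearman rho = {spearman(n_translations, log_frequency):.3f}")
```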


The above analyses provide evidence that the frequency distribution of words in -HEID is a mixture distribution of concept words and anaphoric words. By using a bilingual dictionary, it is possible to make considerable progress in disentangling the two distributions. Unfortunately, however useful dictionaries may be, they provide a first approximation only. It is quite likely that at least some of the higher-frequency words among the formations that do not appear in the dictionary are also used to some extent as concept words. Similarly, anaphoric use cannot be denied even to the words which are prime candidates for expressing concepts, namely the words without a translation equivalent in -NESS and with three different translation equivalents. We are therefore currently investigating whether the statistical methods for analyzing mixture distributions [10] can be fruitfully applied. Given that we are probably dealing with a mixture of a lognormal distribution of concept words and an inverse Gauss-Poisson distribution for anaphoric words, a mixture model should provide a better fit to the frequency distribution as a whole than any of the LNRE models reviewed in [8]. A good mixture model, moreover, will allow us to specify for each frequency interval what the balance is likely to be between conceptual and anaphoric use of -HEID. We believe that the statistical results, which will be presented at the conference, are potentially relevant not only for the study of morphology, but also for techniques of word sense disambiguation in information retrieval as well as for the study of human morphological processing in psycholinguistics.
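A two-component mixture of the kind envisaged here might be written as below. This is a sketch only, since the fitted model is not specified in this text. The mixing proportion \pi, the component probability functions p_C (lognormal, concept words) and p_A (inverse Gauss-Poisson, anaphoric words), and the weights w_C and w_A are notation introduced for illustration; how the continuous lognormal component is put on the same footing as the discrete frequency counts is left open.

```latex
\begin{align}
p(f) &= \pi \, p_{C}(f \mid m, s) + (1 - \pi) \, p_{A}(f \mid \gamma, b, c),
  \qquad 0 \le \pi \le 1, \\
w_{C}(f) &= \frac{\pi \, p_{C}(f \mid m, s)}{p(f)},
  \qquad w_{A}(f) = 1 - w_{C}(f).
\end{align}
```

Under this formulation, w_C(f) is the estimated share of conceptual use at frequency f, which is the quantity needed to state the balance between conceptual and anaphoric use for each frequency interval.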


1. Baayen, R. H. and Lieber, R.: 1997, Word frequency distributions and lexical semantics, Computers and the Humanities, 30, pp. 281-291.

2. Carroll, J. B.: 1967, On Sampling from a Lognormal Model of Word Frequency Distribution, In: H. Kucera and W. N. Francis (Eds.), Computational Analysis of Present-Day American English, Brown University Press, Providence, pp. 406-424.

3. Sichel, H. S.: 1975, On a Distribution Law for Word Frequencies, Journal of the American Statistical Association, 70, pp. 542-547.

4. Baayen, R. H. and Neijt, A.: 1997, Productivity in context: a case study of a Dutch suffix, Linguistics, 35, pp. 565-587.

5. Baayen, R. H., Piepenbrock, R., and Gulikers, L.: 1995, The CELEX lexical database (CD-ROM), Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

6. Martin, W. and Tops, G. A. J. (Eds.): 1986, Van Dale Groot Woordenboek Nederlands-Engels, Van Dale Lexicografie, Utrecht/Antwerpen.

7. Venables, W. N. and Ripley, B. D.: 1994, Modern applied statistics with S-plus, Springer, New York.

8. Chitashvili, R. J. and Baayen, R. H.: 1993, Word Frequency Distributions, In: G. Altmann and L. Hrebicek (Eds.), Quantitative Text Analysis, Wissenschaftlicher Verlag Trier, Trier, pp. 54-135.

9. Koehler, R.: 1986, Zur linguistischen Synergetik: Struktur und Dynamik der Lexik, Brockmeyer, Bochum.

10. Titterington, D. M., Smith, A. F. M. and Makov, U. E.: 1985, Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, Chichester.

Conference Info

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC