The Time Course of Language Change

Patrick Juola
Dept of Mathematics and Computer Science - Duquesne University

Although an infinite number of monkeys at typewriters might produce
the complete works of Shakespeare, we can say with confidence that
the one that does produce _Hamlet_ is somewhat unusual and a
primate of marked preferences. For example, it obviously doesn't
like to type the letter 'z' and is rather
attached to the letter sequence space-'t'-'h'-'e'-space. These
preferences can be quantified accurately and, for instance, form
the basis of most compression programs (such as Stuffit or WinZip).
Common letters, frequent words, and hackneyed phrases are replaced
with short codes and abbreviations, while less frequent items -- in
the counting of the person/program doing the measurements --
are left untouched or even lengthened. Of course, the measurements
may not be exactly accurate, and the inaccuracies of measurement, such
as those that would be produced by using the preferences of one
monkey to compress the output of a second, can provide a precise
and detailed assessment of the "distance" between two monkeys.
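
To make the idea of compressing one monkey's output with another monkey's preferences concrete, here is a minimal Python sketch. It assumes the simplest possible preference model, single-character frequencies with add-one smoothing. That model is an illustrative stand-in, not the method used in this paper, and all function names are hypothetical.

```python
import math
from collections import Counter


def char_model(text, alphabet):
    """Add-one-smoothed character probabilities estimated from `text`."""
    counts = Counter(text)
    total = len(text) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}


def bits_per_char(model, text):
    """Average code length (bits/char) when `text` is coded with `model`."""
    return -sum(math.log2(model[ch]) for ch in text) / len(text)


def distance(a, b):
    """Extra bits per character paid for coding `b` with `a`'s preferences
    instead of `b`'s own (assumed toy definition of "distance")."""
    alphabet = set(a) | set(b)
    model_a = char_model(a, alphabet)
    model_b = char_model(b, alphabet)
    return bits_per_char(model_a, b) - bits_per_char(model_b, b)
```

Under this toy model, distance(a, a) is zero, and the value is larger when the second text relies on characters the first text rarely uses.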

Although information theory has known the above since the 1940s,
the volume of text required for reliable measurements (thousands
or millions of pages) has so far been prohibitive. However,
recent work (Wyner, 1996; Juola, 1997, 1998) provides a method
to apply this to the detailed measurement of "linguistic distance"
between two reasonably-sized samples (a few hundred to a few thousand characters).
Although the mathematics behind this new technique is somewhat complex,
the results are profound. This technique has been shown to be
effective for problems ranging from authorship attribution, through
genre identification, language identification, and even extending
to the inference of linguistic familial relationship. This paper
describes an experiment in using this technique to determine
the rate of language change in _National Geographic_ magazine.
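
This abstract does not reproduce the estimator from the cited papers, so the following Python sketch rests on an assumption: it uses one common match-length formulation, in which the cross-entropy of a sample relative to a database text is estimated as log2 of the database length divided by the mean length of the longest match (plus one) at each sample position. The function names are hypothetical and the code is unoptimized.

```python
import math


def longest_match(database, sample, start):
    """Length of the longest prefix of sample[start:] that occurs in database."""
    length = 0
    while (start + length < len(sample)
           and sample[start:start + length + 1] in database):
        length += 1
    return length


def cross_entropy_estimate(database, sample):
    """Assumed estimator: log2(len(database)) / mean(match length + 1),
    in bits per character. Long matches mean the sample is easy to predict
    from the database, which gives a low estimate."""
    mean_len = sum(longest_match(database, sample, i) + 1
                   for i in range(len(sample))) / len(sample)
    return math.log2(len(database)) / mean_len
```

A sample that reuses the database's phrasing produces long matches and a low estimate; an unrelated sample produces short matches and a high one.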

_National Geographic_ is an ideal experimental corpus for several
reasons. First, unlike a book (which may be the product of thirty
years of continuous writing and rewriting), magazine text can be
fairly accurately dated. Second, _NG_ is a closely-edited, high
prestige journal written in relatively formal English, and therefore
language change within this journal can be evidence of fairly widespread
and large-scale acceptance of the changes. Third, it has been
in continuous publication since 1888, yielding over one hundred
years of data. And, not least important for a researcher on a budget,
the entire corpus is available on CD-ROM for under $200 (US).
The experiment described here used data from the years 1939-2000,
taking between one and seven excerpts from each January issue.
The distance between each excerpt pair was determined and
plotted against the number of years between each excerpt pair.
For example, two excerpts, both from 1954, would have been
recorded as being zero years apart. Similarly, two excerpts
from 1964 would be zero years apart, while a document from
1964 would be ten years apart from one from 1954. Of course,
no document was paired with itself (which would have resulted in
a meaninglessly low near-zero difference).
Regressing measured differences to a linear fit with the number
of years then yielded a measurement of the effective "linguistic
distance" (change) per year, or the speed (distance over time)
of language change for the period studied.
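
The pairing and regression steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the experiment: `excerpts` is assumed to be a list of (year, text) tuples, `distance` is a placeholder for whatever text-distance function is chosen, and the fit is ordinary least squares.

```python
from itertools import combinations


def pairwise_points(excerpts, distance):
    """Return (years apart, distance) for every pair of distinct excerpts.

    `excerpts` is a list of (year, text) tuples; `distance` is any
    text-distance function (a placeholder, assumed for illustration).
    combinations() never pairs an excerpt with itself.
    """
    return [(abs(y1 - y2), distance(t1, t2))
            for (y1, t1), (y2, t2) in combinations(excerpts, 2)]


def slope_and_intercept(points):
    """Ordinary least-squares fit of distance = slope * years + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

If the distances are measured in bits, the returned slope is a rate of change in bits per year.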

The results of this experiment are both significant and intriguing.
For example, all experiments show a clear, statistically significant
(p < 0.05) result: language samples differ more the further apart
in time they are, even over periods as short as a decade.
However, the absolute rate of change over any given ten-year
period varied by as much as fivefold, suggesting that some periods
(of _National Geographic_) are characterized by significantly
faster change and that the English language itself
may have undergone similar periods of rapid or slow change.
In particular, the period of the Second World War and its
aftermath (1939-48) displayed notably _little_ change (less
than 0.004 bits/year), while the decade immediately following
had the greatest of all measured changes (0.0178 bits/year).
In fact, the language change over the 1940s was itself barely
above the threshold of "detectably and significantly different
from no change."

We further found that language change itself seems undirected,
in the sense that individual changes do not appear to sum
conveniently to an overall rate of change. The metaphor of
a drunkard walking around a street may be relevant here; although
s/he walks at a comfortable pace of one hundred meters per minute,
it may take much longer than two minutes to walk through a two
hundred meter alley, with the inevitable missteps, retracing
of footsteps, and random wanderings -- and in fact, the drunkard
might just as well enter the alley and come out again from the
same end s/he entered. A similar pattern was found in
linguistic change, where the overall effective rate of
change for the period 1939-78 was substantially less than the
effective rate for any single decade within that period.
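
The drunkard metaphor can be illustrated with a one-dimensional random walk. The Python sketch below is a toy model, not a model fitted to the data: it assumes every step has the same size and an equally likely direction, and the function name is hypothetical.

```python
import random


def effective_rate(steps, step_size=1.0, seed=0):
    """Net displacement per step of a 1-D walk with random step directions.

    Each individual step covers `step_size`, but steps can cancel, so the
    net rate over many steps can never exceed the single-step rate.
    """
    rng = random.Random(seed)
    position = 0.0
    for _ in range(steps):
        position += rng.choice((-step_size, step_size))
    return abs(position) / steps
```

A single step always yields a rate equal to `step_size`, whereas a forty-step walk yields a rate at or below it, since steps in opposite directions cancel.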

Each observation above raises further questions. In particular,
the finding that the observed change is markedly slower over the
1940s than over comparable periods is unusual and demands
explanation. The measured difference over this period of time barely
achieved significance; in fact, had a two-tailed, instead of one-tailed,
test been used, the effective rate of change would not have been
significantly different from no change at all.
It is commonly believed and often argued that wars drive
technological innovation -- might they, paradoxically,
hinder linguistic innovation? Although this theory is surprising,
counterintuitive, and therefore attractive, one must not lose sight
of other possibilities in explanation. For instance, the National
Geographic itself may not be representative of the language of the 1940s,
or even of magazines of the 1940s. The acute manpower shortage of
the war years may have limited the number of available writers and
editors, producing a more uniform production staff and document/writing
style. Public interest may have been focused on a more limited set of
geographic areas, possibly resulting in more articles discussing the same
(war-related) topics and fewer articles of broad inquiry. And, of
course, drawing conclusions about "wars" based on a single sample
describing a single war is statistically questionable, if not downright
irresponsible. A better and more scientifically interesting
counterhypothesis would be simply that language, or at least
the high-prestige register typical of _National Geographic_,
is a lagging indicator of cultural change, with language change
appearing only after the editors/gatekeepers decide that a new
usage is now acceptable. A less culturally conservative magazine
such as _Mad_ Magazine, also available on CD-ROM, might show
different periods of activity/inactivity. Much further work involving
a greater time period, a larger variety of text, and a detailed
historical inquiry into the publishing industry and the National
Geographic Society in particular, is clearly required.

These caveats aside, the findings are nonetheless intriguing and
suggestive. Similarly, one is free to speculate about the possible
reasons for the 1950s to be (significantly) the time of the fastest
effective language change. One possibility, for instance, is simply
that language change, especially in high-prestige registers, lags
social and technological change. Another possibility is that
the tremendous change of the 1950s merely reflected some sort of
stored-potential, the change that had been suppressed during the
linguistically conservative 1940s. In particular, most of the
technological inventions developed in the 1940s did not become
commonplace consumer items (and thus objects of discussion) for
some time. Finally, the 1950s were the era of the development
and widespread use of television, and it would be odd indeed if
such a far-reaching device did not change language as well as culture.

The direct answers to these social, cultural, and historical
questions aside, it is clear from these results that this technique
can provide an interesting and provocative source of information
about the time course of language change, that may or may not
be correlated with historical events and trends. For example,
the "moving averages" beloved of financial analysts could be
co-opted to look for sudden movements that correspond
to particular events that shape language, be they editorial
staff changes or news events (such as a disputed Presidential
election or a moon landing) that cause people to change what
they write or talk about. Identifying and exploiting such
correlations can provide unending amounts of work for historians,
or alternatively will provide a valuable quantitative adjunct to
the study of history and a useful guide for the reading of historical
texts. At the very least, however, this shows that information theory
can provide a valuable guide to the study of language change viewed
as a historical and cultural process.
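
As a rough illustration of the moving-average idea, the Python sketch below smooths a yearly series of distance measurements and flags years that depart sharply from the preceding window. This was not part of the experiment reported here; the function names, the trailing-window design, and the hand-picked `threshold` are all assumptions made for illustration.

```python
def moving_average(values, window):
    """Trailing simple moving average; one output per full window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def sudden_movements(values, window, threshold):
    """Indices whose value differs from the mean of the preceding `window`
    values by more than `threshold` (a hypothetical, hand-picked cutoff)."""
    return [i for i in range(window, len(values))
            if abs(values[i] - sum(values[i - window:i]) / window) > threshold]
```

Flagged indices would then be candidates for comparison against dated events such as staff changes; the sketch itself says nothing about cause.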

References:

Farach, M. et al. (1995). "On the entropy of DNA: Algorithms
and measurements based on memory and rapid convergence." In
the _Proceedings of the 6th Annual Symposium on Discrete Algorithms_.

Juola, Patrick. (1997). "What can we do with small corpora? Document
categorization via cross-entropy." In the _Proceedings of an
Interdisciplinary Workshop on Similarity and Categorization_,
Edinburgh, UK.

Juola, Patrick. (1998). "Cross-entropy and linguistic typology."
In the _Proceedings of New Methods in Language Processing 3_,
Sydney, Australia.

Juola, Patrick. (2000). "The rate of language change." Presented
at the _Third Conference on Quantitative Linguistics_, Prague,
Czech Republic.

Wyner, A. J. (1996). "Entropy Estimation and Patterns."
In the _Proceedings of the 1996 Workshop on Information Theory_.

Conference Info

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC