Rijksuniversiteit Groningen (University of Groningen)
Linguistics, archaeology and cultural history share a fascination with the means by which people express
identity, in particular their associations with the area in which they live, with their social class and gender,
perhaps with their ethnicity and profession. Computational studies of linguistic variation enable us to see the
degree to which various linguistic devices contribute to this expression of identity. The first task of these
computational studies is inevitably to bring some order to the plethora of data relevant to the study of cultural
expression. In the case of linguistics, this is speech and writing, and the relevant data is available in the form
of dialect atlases, compendia of varying linguistic forms collected in comparable ways from speakers of
varying geographic and social origins.
Given a large amount of dialect data, there is a good chance that one will encounter “noise”, i.e.,
inaccuracy, nongeographic variability, and incompatibility both in the choice of information recorded and in
the level of detail at which it is recorded. In addition, dialectologists have been aware since Kloeke and
Bloomfield that, even abstracting away from the noise, dialect varieties inevitably contain genuine linguistic
features with counterindications, exceptions, and gaps. There are furthermore many linguistic features to
explore, and many ways of combining them. Finally, it may be the case that it is difficult or even impossible
to validate results—there may be no consensus among dialectologists about which aspects of the geographic
distribution of linguistic variation are most significant. The LAMSAS data set, available at
http://us.english.uga.edu/lamsas/, is one such large, “noisy” and rebarbative set. The challenge is to identify
how linguistic similarity is expressed in such data.
We treat both lexical and phonetic differences in this talk, and we also examine the relative
contributions of pronunciation and lexis to dialect differentiation. In order to rise above the atomistic level of
individual sounds or lexical items, we employ two aggregate measures of distance: the (non-)identity of lexical
items on the one hand (essentially the measure proposed by Seguy, 1971, and elaborated on by Goebl,
1984), and a string similarity measure, applied to phonetic transcriptions, on the other. Because the
measures yield numeric characterizations of lexical and phonetic distance, they may be aggregated over many pairs
of similar concepts. In order to overcome the problem that there is little expert consensus, we propose a
numerical characterization of the fundamental dialectological postulate, namely that of “local coherence”:
nearby language variants tend to be similar to one another. Such a principle requires clarification as to which
language variants are to be included, and it is admittedly not generally true (e.g., Town Frisian is not
geographically coherent). We show nonetheless that the principle can be put to beneficial use in exploring dialect data
in an investigative phase of research. We illustrate the results of applying it to the entire LAMSAS data set,
and use it to help choose which infrequent data to omit from analysis and to evaluate the two modest
modifications we propose to Seguy and Goebl’s work.
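The two aggregate measures and the idea of local coherence can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the site names, word lists, transcriptions and coordinates are invented toy data, the string measure is taken to be a plain unit-cost edit distance (the abstract does not state the costs used), and `neighbour_distance` is a hypothetical stand-in for a local coherence score, namely the mean linguistic distance from each site to its geographically nearest site(s).

```python
from itertools import combinations
from math import hypot


def levenshtein(a, b):
    """Unit-cost edit distance between two sequences (e.g. phonetic segments)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute, or match at cost 0
        prev = curr
    return prev[-1]


def lexical_distance(a, b):
    """(Non-)identity of lexical items: 0 if the words match, 1 otherwise."""
    return 0 if a == b else 1


def aggregate(site_a, site_b, measure):
    """Mean distance over the concepts recorded at both sites (None if none shared)."""
    shared = [c for c in site_a if c in site_b]
    if not shared:
        return None
    return sum(measure(site_a[c], site_b[c]) for c in shared) / len(shared)


def pairwise(sites, measure):
    """Aggregate distance for every unordered pair of sites."""
    return {frozenset((a, b)): aggregate(sites[a], sites[b], measure)
            for a, b in combinations(sorted(sites), 2)}


def neighbour_distance(dist, coords, k=1):
    """Hypothetical coherence score: mean linguistic distance from each site
    to its k geographically nearest sites (lower means more locally coherent)."""
    total = 0.0
    for s in coords:
        others = sorted((t for t in coords if t != s),
                        key=lambda t: hypot(coords[s][0] - coords[t][0],
                                            coords[s][1] - coords[t][1]))
        nearest = others[:k]
        total += sum(dist[frozenset((s, t))] for t in nearest) / len(nearest)
    return total / len(coords)


# Invented toy data: three sites, three concepts, planar coordinates.
lexical = {
    "A": {"dragonfly": "darning needle", "pail": "pail", "creek": "brook"},
    "B": {"dragonfly": "darning needle", "pail": "pail", "creek": "creek"},
    "C": {"dragonfly": "snake doctor", "pail": "bucket", "creek": "creek"},
}
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}

lex_dist = pairwise(lexical, lexical_distance)
overall_mean = sum(lex_dist.values()) / len(lex_dist)
local_mean = neighbour_distance(lex_dist, coords, k=1)

# Phonetic comparison works on sequences of segments rather than whole words.
water_a = ("w", "ɔ", "t", "ə")
water_c = ("w", "ɑ", "d", "ɚ")
phon_dist = levenshtein(water_a, water_c)
```

In this toy data the mean distance between geographic neighbours (`local_mean`) is lower than the mean over all site pairs (`overall_mean`), which is the pattern the local coherence postulate predicts. A realistic analysis would also have to decide how to handle multiple responses per concept and whether to normalize edit distances by word length; the sketch leaves both open.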
The results of the lexical analysis confirm neither of the best known dialect divisions for the
LAMSAS area, i.e., neither Kurath’s nor Carver’s. Both Kurath and Carver relied on lexical analysis:
Carver’s (1987) analysis sees the North-South division as dominating dialect differences on the Atlantic
coast, while Kurath’s sees a significant “Midlands” area corresponding to southern Pennsylvania and
extending into West Virginia and the inland South. In our analyses, Kurath’s “Midlands Area” is split into
North and South, and the penultimate aggregation is unstable, confirming sometimes Kurath and sometimes
Carver. Further details confirm Kurath rather more than Carver, e.g., in recognizing a coastal South region.
Phonetic analysis confirms the primary significance of the North-South split.
Finally, we are in a position for the first time to evaluate the usual assumption of variationists that
lexical and phonetic variation usually “coincide fairly well” (Kurath and McDavid, 1961) in their association
with extralinguistic variables such as geography (where modern variationists would tend to add social class,
gender and age). In fact the two levels of linguistic structure do tend to correlate to a highly significant degree
(r = 0.65) in the LAMSAS data, and likewise therefore associate with the same geographical areas, but the
lexical data is much less consistent. To achieve the same consistency of measurement, we need to examine
ten pairs of lexicalizations for every single pair of pronunciations. We express our associations, and
ultimately our personal identity, whenever we use language, but the expression is ten times as recognizable in
speech as it is in writing.
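The correlation between the two levels can be sketched as a Pearson coefficient computed over site pairs, pairing each pair's aggregate lexical distance with its aggregate phonetic distance. The numbers below are invented for illustration and are not LAMSAS values; the abstract does not say how significance was assessed, and because distances between site pairs are not independent observations, a permutation-based procedure such as a Mantel test is the usual choice for that step (not shown here).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Invented aggregate distances for four site pairs, listed in the same order.
lexical_distances = [0.2, 0.5, 0.9, 0.4]
phonetic_distances = [0.1, 0.4, 0.8, 0.6]

r = pearson(lexical_distances, phonetic_distances)
```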
REFERENCES
John Nerbonne, with Wilbert Heeringa and Peter Kleiweg. “Edit Distance and Dialect Proximity.” In: David
Sankoff and Joseph Kruskal (eds.), Time Warps, String Edits and Macromolecules: The Theory and
Practice of Sequence Comparison, 1999, pp. v–xv.
John Nerbonne and Peter Kleiweg. “Lexical Distance in LAMSAS.” Submitted to: John Nerbonne and
William Kretzschmar (eds.), Computational Methods in Dialectometry: Special issue of Computers
and the Humanities, scheduled to appear in 2003.
Hosted at University of Georgia
Athens, Georgia, United States
May 29, 2003 - June 2, 2003
Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/