Self-Organizing Maps as an Approach to GIS Analysis of Linguistic Data

  1. William (Bill) Kretzschmar

    English - University of Georgia

  2. Jean-Claude Thill

    University at Buffalo, State University of New York (SUNY)

The nearest-neighbors method of Density Estimation and Complete Spatial Randomness methods in general
will be best applied in models that consider the status of individual linguistic features (e.g. Kretzschmar 1996,
Kretzschmar and Lee 1993). Self-Organizing Maps (SOM), by contrast, use grouping algorithms to model
dialects, not just features. The notion of “dialect” for each method, however, is not equivalent to the
traditional Neogrammarian or Bloomfieldian sense of the term, but instead derives from the mathematical
procedures used to build groups. In this paper we discuss the application of the technique to data from the
Linguistic Atlas of the Middle and South Atlantic States (LAMSAS). The programming that supports the SOM
project has been carried out with MapObjects for the layered displays, C++ for statistics, and Visual
Basic for integration of displays and functions; the procedure is fully automated, subject to user-selected
parameters. Besides implementing the SOM algorithm, the program is useful as a general-purpose
tool to recover information about speakers and to measure groups of speakers against the overall set of
speakers with univariate statistics.
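
The abstract does not say which univariate statistics the program computes. As one possible illustration only, the sketch below compares the share of a group using a given variant against the share in the overall set of speakers, using a one-sample z-score for a proportion. The function name, the choice of statistic, and the counts are all assumptions, not the LAMSAS program's method.

```python
import math

def proportion_z(group_hits, group_size, total_hits, total_size):
    """z-score of a group's usage rate against the overall usage rate.

    Assumed statistic (normal approximation to the binomial); the source
    does not specify which univariate test its program applies.
    """
    p0 = total_hits / total_size      # overall proportion using the variant
    if p0 in (0.0, 1.0):
        raise ValueError("overall proportion must be strictly between 0 and 1")
    p = group_hits / group_size       # proportion within the group
    se = math.sqrt(p0 * (1.0 - p0) / group_size)
    return (p - p0) / se

# Made-up counts for illustration: 30 of 40 group members use a variant
# that 500 of 1000 speakers use overall.
z = proportion_z(30, 40, 500, 1000)
```

A large positive or negative value would suggest that the group departs from the overall pattern for that variant; any threshold for "large" would be a further assumption.
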
The model for SOM is the neural network of the human brain. This is perhaps an overly romantic
characterization: in practice the method reduces the complexity of input data by reducing the
number of dimensions in which the data occur, until we have a two-dimensional lattice, or feature map. The basic
idea as applied to LAMSAS is that a set of input nodes, one for each of the 1162 LAMSAS speakers,
described by a set of N linguistic targets, will be processed statistically to form output nodes which may then be
compared for their relative similarity (our procedure follows the SOM algorithm as elaborated in Roussinov
and Chen 1998). This “similarity” is computed as the distance between nodes in N dimensions, corresponding
to the N linguistic targets. Processing takes place in N iterations, one for each of the N targets. The user must
specify how large a matrix of groups of speakers the statistic is to create (say, a 5 x 5 matrix of 25 groups),
and how many groups of groups will be established within the matrix (say, three relatively similar groupings
out of the 25 groups in the matrix).
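
The mapping from speakers to a small matrix of nodes can be sketched as follows. This is a minimal illustration using the textbook Kohonen online update, not the scaled variant of Roussinov and Chen (1998) that the project follows. The function names, the parameter defaults, and the toy speaker vectors (six invented speakers coded for presence or absence of four variants) are all assumptions.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x in N dimensions."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

def train_som(data, rows=5, cols=5, epochs=30, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)                      # present speakers in random order
        for i in order:
            x = data[i]
            decay = 1.0 - step / total_steps    # shrinks linearly toward 0
            lr = lr0 * decay
            radius = max(radius0 * decay, 0.5)
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # neighborhood weight
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])            # pull node toward speaker
            step += 1
    return weights

# Hypothetical toy data: 6 speakers x 4 binary features (1 = variant attested).
speakers = [
    [1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 1, 1],
    [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0],
]
som = train_som(speakers, rows=3, cols=3)
assignment = [best_matching_unit(som, s) for s in speakers]
```

Each entry of `assignment` is the matrix position of the node to which a speaker belongs. The second user parameter, the number of groups of groups, would require a further clustering step over the node weight vectors, which is not shown here.
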
For example, in a typical SOM run (say, on 19 different linguistic features, consisting of the three
common variants of the gully item, the nine common variants of the heavy rain item, and the seven common
variants of the thunderstorm item), the algorithm selected and displayed, as instructed by a parameter
setting, three groupings of speakers identified by color within a matrix of 25 smaller groups of speakers.
Within this matrix, the positions of the groups lie in an abstract space, not in geographical space; the speakers within each
of the 25 groups need not be located anywhere near each other. The SOM program also produces a graphical
representation of the “distance” between each of the 25 nodes (a “U-matrix” display), in which the darkness of
the colored diamond between each pair of adjoining nodes (the dots) indicates the degree of relation between them.
Each node of the matrix is convertible to a layered GIS display which shows which individual LAMSAS
respondents belong to the node. Some of the nodes contain large numbers of speakers, some only a few. A
display from one of the nodes may show just a single speaker (a frequent occurrence in this rain/storm/gully matrix), or it
may reveal a geographical cluster of speakers. By observing the groupings that the SOM algorithm creates,
and by inspecting the statistical outputs which underlie the groupings, we hope to learn more about the
general behavior of groups within the region.
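
The U-matrix distances and the node membership lists described above can be sketched in a few lines. This is an illustration under stated assumptions: the 2 x 2 weight grid, the speaker identifiers, and both function names are hypothetical, and the actual program renders these values as colored diamonds and GIS layers rather than as dictionaries.

```python
import math

def u_matrix(weights, rows, cols):
    """Distance between each pair of horizontally or vertically adjoining nodes."""
    dists = {}
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):       # right and down neighbors
                nb = (r + dr, c + dc)
                if nb in weights:
                    dists[((r, c), nb)] = math.dist(weights[(r, c)], weights[nb])
    return dists

def node_members(weights, speakers):
    """Assign each speaker to the closest node; returns {node: [speaker ids]}."""
    members = {node: [] for node in weights}
    for speaker_id, vec in speakers.items():
        bmu = min(weights, key=lambda n: math.dist(weights[n], vec))
        members[bmu].append(speaker_id)
    return members

# Hypothetical trained weights for a 2 x 2 map over two features.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
           (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
# Hypothetical speaker vectors keyed by invented identifiers.
speakers = {"s1": [0.1, 0.0], "s2": [0.9, 1.0], "s3": [0.0, 0.2]}

edges = u_matrix(weights, rows=2, cols=2)
members = node_members(weights, speakers)
```

A larger value in `edges` corresponds to a weaker relation between adjoining nodes. In `members`, a node may hold several speakers, a single speaker, or none, and the lists could then be joined to speaker coordinates for mapping.
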
Our research is now complete enough to suggest how useful this approach will be for the linguistic
goals of LAMSAS. We will compare results from many SOM runs on different combinations of features, and
then discuss the nature of the estimates produced by the SOM algorithm. We can then compare SOM results
to those derived from another method for creation of dialect models, the Levenshtein distance algorithm as
now being executed by Nerbonne, Heeringa, and Kleiweg.
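
For comparison, the basic Levenshtein distance can be sketched as below. This is the plain unit-cost edit distance between two sequences; it is an assumption that this simple form is a fair stand-in, since dialectometric applications typically work on phonetic transcriptions and may weight operations differently. The example strings are illustrative spellings, not LAMSAS data.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # deleting the first i items of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

d = levenshtein("gully", "gulley")
```

The function accepts any two sequences, so lists of phonetic segments could be passed in place of strings.
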
Kretzschmar, William A., Jr. 1996. “Quantitative Areal Analysis of Dialect Features”. Language Variation
and Change 8: 13–39.
---, and Jay Lee. 1993. “Spatial Analysis of Linguistic Data with GIS Functions”. International Journal of
Geographical Information Systems 7: 541–60.
Roussinov, Dmitri, and Hsinchun Chen. 1998. “A Scaling Self-Organizing Map Algorithm for Textual
Classification”. Artificial Intelligence 15: 81–112.

Conference Info

"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC

  • Language: English