New maps of text: a new way to account for the distribution of lexemes in texts

  1. M.M.A. Juillard

    Université de Nice Sophia Antipolis (University of Nice)

  2. N.X. Luong

    Université de Nice Sophia Antipolis (University of Nice)

Work text
Several scholars have attempted to study the repetition of lexical items in a text. It has for instance
been noted that a given word often occurs in bursts
or “rafales” in the text under scrutiny (Lafon, 1981) and an index for the topography of
repeated forms has also been devised in order to
make it possible to account for so-called “block
effects” (Serant & Thoiron, 1988). This paper will
illustrate a different attitude and demonstrate an
original procedure. We shall try to approach the
problem of repeated occurrences, and more generally that of the proximity between forms in a text,
by taking as our starting point certain patterns that
rest first and foremost on the basic notions of
topology and that are amenable to processing by
the methods of multidimensional analysis.
The two chief aspects of the discipline called
topology are the notions of neighbourhood and
equivalence of shape. The corpus for investigation
having been defined – several texts or a single text
considered as the sum of its parts – we choose a
significant unit as the elementary neighbourhood
of a word, i.e. a fixed length context which can be
at will the group, the sentence, a set of several
sentences, the paragraph etc. A neighbourhood
base having thus been established, the distribution
of a given lexeme is characterised by a sequence
of addresses corresponding to the neighbourhoods
and converted into a vector V called the characteristic
vector, whose component Vj is, for a given
j, the number of occurrences of the lexeme
in the j-th significant unit. The textual behaviour or
distribution of a set of lexemes can be observed
through the study of the corresponding characteristic vectors. These vectors carry valuable linguistic information and can be processed in a
variety of ways.
Word x, neighbourhood base: the sentence.
Addresses: 8-15-17-19-19-20-23-54 .........
Characteristic vector V1:
000000010000001010210010*(30 times)1 .....
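The conversion from addresses to a characteristic vector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the choice of 54 neighbourhoods (the last address shown) are ours, and addresses are taken to be 1-based sentence numbers.

```python
def characteristic_vector(addresses, n_units):
    """Count occurrences per neighbourhood; addresses are 1-based unit numbers."""
    vector = [0] * n_units
    for address in addresses:
        vector[address - 1] += 1
    return vector

# Addresses of word x as listed above; only the eight visible ones are used.
addresses = [8, 15, 17, 19, 19, 20, 23, 54]
v1 = characteristic_vector(addresses, n_units=54)

# Print the components as one digit string (valid while every count is below 10).
print("".join(str(count) for count in v1))
```

A repeated address, such as 19 here, simply raises the corresponding component to 2.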
There are many possible approaches:
A) Grouping together of neighbourhoods in order
to make up a larger significant unit.
Word x, with groups of 5 sentences.
Characteristic vector V5: 011410000001....
The number of sentences chosen here is of course
arbitrary; but it will be observed that all the occurrences of the word have been taken into account
and that their being grouped together in blocks or
bursts (“rafales”) is equally well shown in both
types of segmentation, the former being finer, the
latter more easily readable.
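As a rough sketch of approach A, the grouping amounts to summing consecutive components of the sentence-level vector. The helper name below is hypothetical, and the input is rebuilt from the eight addresses shown for word x only, so the result need not coincide digit for digit with the vector printed above.

```python
def group_components(vector, size):
    """Sum consecutive components in chunks of `size`; the last chunk may be shorter."""
    return [sum(vector[i:i + size]) for i in range(0, len(vector), size)]

# Sentence-level vector rebuilt from the eight addresses listed for word x (1-based).
addresses = [8, 15, 17, 19, 19, 20, 23, 54]
v1 = [0] * max(addresses)
for address in addresses:
    v1[address - 1] += 1

v5 = group_components(v1, size=5)

# Regrouping loses no occurrence: both vectors have the same total.
assert sum(v5) == sum(v1)
print(v5)
```

Changing `size` gives a coarser or finer segmentation without touching the underlying addresses.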
B) In V5 for instance, one can easily see blocks of
positive integers, each representing a portion of
text in which there is at least one occurrence of the
word x. The counting of blocks of various frequencies can lead to interesting measures.
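One possible way to extract such blocks is to split the vector into maximal runs of non-zero components. The sketch below assumes that reading of "block"; the summary chosen (length of each block and number of occurrences it contains) is our own illustration, not a measure defined by the authors.

```python
from itertools import groupby

def positive_blocks(vector):
    """Return the maximal runs of consecutive non-zero components."""
    return [
        list(run)
        for is_positive, run in groupby(vector, key=lambda count: count > 0)
        if is_positive
    ]

# The grouped vector as printed in the example for word x.
v5 = [0, 1, 1, 4, 1, 0, 0, 0, 0, 0, 0, 1]
blocks = positive_blocks(v5)

# For each block: (number of units spanned, occurrences inside the block).
summary = [(len(block), sum(block)) for block in blocks]
print(blocks)
print(summary)
```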
C) Multidimensional processing in order to isolate
the shape, similarity or dissimilarity of vectors:
This is achieved by bringing various vectors together in a table that can be directly processed by
factor analysis. Work of the same type consists in
obtaining distances between vectors and carrying
out the corresponding tree-analysis.
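To make the distance step concrete, here is a minimal sketch that computes pairwise distances between a few characteristic vectors. The three vectors are invented toy data and the Euclidean distance is an assumption on our part; the text does not say which distance is used, and the factor or tree analysis itself is not reproduced here.

```python
import math
from itertools import combinations

# Hypothetical characteristic vectors over the same six neighbourhoods.
vectors = {
    "word_a": [0, 1, 1, 4, 1, 0],
    "word_b": [0, 2, 1, 3, 0, 0],
    "word_c": [3, 0, 0, 0, 0, 2],
}

def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of vectors."""
    return {
        (name_a, name_b): math.dist(vec_a, vec_b)
        for (name_a, vec_a), (name_b, vec_b) in combinations(vectors.items(), 2)
    }

distances = pairwise_distances(vectors)
for pair, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(distance, 3))
```

The resulting table of distances is the kind of input a clustering or tree-building procedure would take.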
D) The vectors can be examined two by two and
similarity indices can be defined not only through
the traditional methods of statistics but also by
adopting a topological point of view.
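As an illustration only, two simple indices for a pair of vectors are sketched below: a cosine similarity, standing in for a conventional statistical measure, and the overlap between the sets of neighbourhoods each word occupies, offered as one loose reading of a neighbourhood-based view. Neither is claimed to be the index the authors define, and the two vectors are invented.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighbourhood_overlap(u, v):
    """Share of occupied neighbourhoods that both words occupy (Jaccard index)."""
    occupied_u = {i for i, count in enumerate(u) if count > 0}
    occupied_v = {i for i, count in enumerate(v) if count > 0}
    union = occupied_u | occupied_v
    return len(occupied_u & occupied_v) / len(union) if union else 0.0

# Hypothetical characteristic vectors of two words over six neighbourhoods.
u = [0, 1, 1, 4, 1, 0]
v = [0, 2, 1, 3, 0, 0]
print(cosine_similarity(u, v))
print(neighbourhood_overlap(u, v))
```

The first index is sensitive to how many times each word occurs in a neighbourhood, the second only to whether it occurs there at all.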
E) The sequence V1, V2, V3, ..., Vn of the components of a characteristic vector shows analogies
with a chronological series.
Certain techniques usually employed to handle
these series can be adapted to the sequence under
The paper will illustrate the power and versatility
of the method with various applications concerning syntax, lexis and stylistics as both the size of
the neighbourhood base and the nature of items to
be investigated are altered. Of interest to grammarians and lexicographers will be such topics as the
environment – with special attention devoted to
modal auxiliaries – of subject personal pronouns
in contemporary English. Specialists of syntax
and of logic, among others, will be interested in
the results concerning the occurrences within a
grammatically tagged corpus (the LOB corpus of
English texts or similar corpora) of the coordinating conjunction in variable length contexts and
particularly in strings of the type noun and noun,
adjective and adjective, verb and verb etc., these
results making it possible to distinguish between
the current versus resulting uses of copulas (Quirk
1985). Other fields of interest worth investigating
with a varying neighbourhood base are the environment of negative words and the proximity of
verbs to subordinating conjunctions, particularly
that in subject, object or complement clauses. Given a suitable corpus rich in sequences of the type
adjective+noun it would also be possible to highlight the existence of what the linguist J.P. Mahler
has called “salient feature copying” (Bolinger
1980) and more generally the subtle dialectic tension between difference and repetition (Deleuze
