Inversed N-gram viewer: Searching the space of word temporal profiles

Vincent Buntinx

EPFL (École polytechnique fédérale de Lausanne), Switzerland
vincent.buntinx@epfl.ch

Frédéric Kaplan

EPFL (École polytechnique fédérale de Lausanne), Switzerland
frederic.kaplan@epfl.ch


inversed n-gram viewer
n-gram viewer
temporal profiles
word frequencies
curve fitting

corpora and corpus activities
information retrieval
natural language processing
semantic analysis
text analysis
linguistics
media studies
data mining / text mining
English

Studies based on the visualization and analysis of temporal profiles of words using a so-called n-gram approach have been popular in recent years (Michel et al., 2011; Delahaye and Gauvrit, 2013). However, most studies so far discuss remarkable words that were found mainly through the researcher’s intuition in hunting for ‘interesting’ curves with the n-gram viewer. In this paper, we investigate how to invert the problem and automatically explore the space of temporal curves in search of words. For instance, we could ask the system to retrieve all curves ‘similar’ to a given one. This entails defining a method to describe temporal profiles and classify them according to a predefined distance.

Figure 1. Four examples of curves: (1) the word ‘1889’ is very popular in the year 1889 but quickly ceases to be of interest afterwards; (2) the term ‘olympique’ is very popular every four years; (3) the term ‘informatique’ keeps getting more popular over time; (4) the term ‘URSS’ is very popular only between 1950 and 1990.
The study presented in this paper uses a database of 4 million articles covering a period of 200 years, composed of digitized facsimiles of ‘Le Journal de Genève’ and ‘La Gazette de Lausanne’. Each article has been OCRed and indexed in an SQL database. For each indexed word, the yearly temporal profile showing the relative number of word occurrences in the corpus has been pre-computed. We designed an n-gram viewer that allows for the simultaneous comparison of up to four words. In order to compare these curves with one another, and therefore be able to retrieve ‘similar’ curves automatically from the database, we need a simple way of describing them. The problem could be approached in a very general manner (how to describe an unknown curve by approximating it on a given decomposition base), or it could take into account the fact that we are dealing with temporal profiles that can be described by a limited number of archetypical curves. The second option implies having some a priori knowledge of the kinds of curves one might encounter, but has the advantage of allowing a very compact description of general curves using a family of possible profiles. We chose this second option for this paper.
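As an illustration of this pre-computation step, the following Python sketch shows how such a yearly relative-frequency profile could be derived from raw counts; the function name, dictionaries, and year range are assumptions standing in for the paper’s actual SQL pipeline.

```python
import numpy as np

def yearly_profile(word_counts, total_counts, years):
    """Relative frequency of a word per year: its number of occurrences
    divided by the total number of tokens indexed for that year.
    `word_counts` and `total_counts` are hypothetical year -> count maps
    standing in for the SQL queries described in the text."""
    return np.array([word_counts.get(y, 0) / total_counts[y] for y in years])

# A 200-year window similar to the corpus described above (exact years assumed).
years = np.arange(1798, 1998)
```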
The first step of our curve analyzer is to associate a given curve with a preexisting curve family. This can be done in a hierarchical manner. At this stage of our research, we decompose curves into four basic families: ‘Dirac’ curves (with a single peak), periodic curves (with regular peaks of popularity), monotonic linear curves (either increasing or decreasing) and ‘square’ curves (associated with a predefined period). These four families clearly do not cover the entire spectrum of possible curves, but they do provide a starting point for investigating the space of ‘remarkable’ curves. To determine whether a given curve can be reasonably approximated by one of the four families, we carry out a sequence of tests, starting with the periodicity of the curve. To evaluate whether a curve is periodic, we simply compute its Fourier transform and automatically check for hidden periodicity. If the curve is detected to be periodic, we extract the period and try to fit a ‘comb’ function, as defined by the following formula:
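As a sketch of this periodicity test, one could compute the discrete Fourier transform of the profile and check whether a single non-zero frequency clearly dominates the spectrum; the dominance threshold below is an illustrative choice, not a value from the paper.

```python
import numpy as np

def dominant_period(profile, threshold=4.0):
    """Detect hidden periodicity via the Fourier transform: if one non-zero
    frequency clearly dominates the spectrum, return its period in years,
    otherwise None. The dominance threshold is an assumption."""
    profile = profile - profile.mean()            # remove the DC component
    spectrum = np.abs(np.fft.rfft(profile))
    freqs = np.fft.rfftfreq(len(profile), d=1.0)  # samples are one year apart
    k = spectrum[1:].argmax() + 1                 # strongest non-zero frequency
    if spectrum[k] > threshold * np.median(spectrum[1:]):
        return 1.0 / freqs[k]
    return None
```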

$$
f(x; a, b, m, t) =
\begin{cases}
0, & \text{if } x < m \\
a, & \text{if } x \ge m \text{ and } (x - m) \bmod t = 0 \\
b, & \text{if } x \ge m \text{ and } (x - m) \bmod t \neq 0
\end{cases}
$$

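A direct translation of this formula into code (assuming profiles sampled on integer years with integer m and t, so that the modulo test is exact) could look as follows:

```python
import numpy as np

def comb(x, a, b, m, t):
    """'Comb' model from the formula above: 0 before the onset year m,
    a on years where (x - m) is a multiple of the period t, b elsewhere.
    Years are integers, so the exact modulo comparison is safe when
    m and t are integers."""
    x = np.asarray(x)
    on_peak = (x >= m) & ((x - m) % t == 0)
    off_peak = (x >= m) & ((x - m) % t != 0)
    return np.select([on_peak, off_peak], [a, b], default=0.0)
```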
The fitting is done using the Particle Swarm Optimization (PSO) method (Trelea, 2003), and the quality of the fit to the theoretical curve is measured using the classical least-squares error (minimizing the sum of squared residuals).
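The following is a minimal particle swarm sketch in the spirit of this procedure, minimizing the sum of squared residuals between a profile and a parametric model; the swarm hyper-parameters are common textbook values, not those used by the authors. For the comb model, m and t would additionally be rounded to whole years inside the objective, since the modulo test requires integer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def pso_fit(x, y, model, lb, ub, n_particles=40, n_iter=200,
            w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimizer (a sketch in the spirit of
    Trelea, 2003) minimizing the sum of squared residuals between `y`
    and `model(x, *params)` over box-bounded parameters."""
    dim = len(lb)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    pos = rng.uniform(lb, ub, (n_particles, dim))   # initial particle positions
    vel = np.zeros_like(pos)

    def err(p):
        return np.sum((y - model(x, *p)) ** 2)      # least-squares error

    pbest = pos.copy()                               # per-particle best positions
    pbest_err = np.array([err(p) for p in pos])
    gbest = pbest[pbest_err.argmin()].copy()         # swarm-wide best position
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lb, ub)             # keep particles in bounds
        cur = np.array([err(p) for p in pos])
        improved = cur < pbest_err
        pbest[improved] = pos[improved]
        pbest_err[improved] = cur[improved]
        gbest = pbest[pbest_err.argmin()].copy()
    return gbest, float(pbest_err.min())
```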

Figure 2. Temporal profile (blue) of the word ‘olympique’ with fitted curve (green) and Fourier transform (right).
Using the model, the extracted period becomes a way of comparing periodic curves with one another. Figure 3 shows different curves sorted by periodicity.

Figure 3. Temporal profile (blue) of the words ‘olympique’, ‘fifa’, ‘recensement’, and ‘halley’ with fitted curve (green).
For each curve, the sum of the squared residuals between the actual curve and the fitted curve is calculated; this represents the comparison error between them. This measure is used to optimize the fitting and, once the optimization has been run for all predefined categories of curves, it determines which category the actual curve belongs to.
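This selection step can be sketched as fitting every registered family and keeping the one with the smallest residual error. The parameter bounds below are hypothetical; the ‘dirac’ (Fréchet) and ‘square’ families would be registered in the same way (see the sketches further down).

```python
import numpy as np

def linear(x, a, b):
    """Monotonic linear family (increasing or decreasing): a * x + b."""
    return a * np.asarray(x) + b

# Each family: model plus hypothetical parameter bounds for the PSO.
families = {
    'comb':   (comb,   [0.0, 0.0, 1798, 2],  [1.0, 1.0, 1997, 20]),
    'linear': (linear, [-1.0, -2000.0],      [1.0, 2000.0]),
    # 'dirac' (Fréchet) and 'square' would be registered the same way.
}

def classify(x, y):
    """Fit every predefined family and keep the one with the smallest
    sum of squared residuals, as described above."""
    fits = {name: pso_fit(x, y, model, lb, ub)
            for name, (model, lb, ub) in families.items()}
    best = min(fits, key=lambda name: fits[name][1])
    return best, fits[best]
```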

For ‘peaks’, we try to fit the curve with a classical model used in the analysis of ‘fads’ (see, e.g., its use in the analysis of Internet memes [Bauckhage et al., 2013]), namely the Fréchet curve (Fraga Alves and Neves, 2010), as defined by the following formula:

$$
f(x; a, s, m) =
\begin{cases}
\dfrac{a}{s} \left( \dfrac{x - m}{s} \right)^{-a-1} \exp\!\left( -\left( \dfrac{x - m}{s} \right)^{-a} \right), & \text{if } x > m \text{ and } s > 0 \\
0, & \text{otherwise}
\end{cases}
$$

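A direct implementation of this piecewise definition (assuming s > 0 is enforced by the optimizer’s bounds) might be:

```python
import numpy as np

def frechet(x, a, s, m):
    """Fréchet curve from the formula above: shape a, scale s, location m.
    Zero at or before the onset year m; assumes s > 0, which the PSO
    bounds are expected to guarantee."""
    x = np.asarray(x, float)
    out = np.zeros_like(x)
    mask = x > m
    z = (x[mask] - m) / s
    out[mask] = (a / s) * z ** (-a - 1) * np.exp(-z ** (-a))
    return out
```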
Using this model, fitted curves can be described by the position and value of their peak, leaving the optimization with only one degree of freedom. These three parameters can then be used to compare curves with one another, as shown in Figure 4.

Figure 4. The curve for the year 1889 is much like that for 1906 according to the fitted Fréchet curves. The years 1848 and 1940, because of their historical significance, have rather different parameters even though they belong to the same family. For the years 1900 and 1914, the fitting error is higher, suggesting that they might belong to a different category of curves.
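For illustration, the peak position and height can be recovered in closed form from the fitted parameters, since the mode of the Fréchet density is m + s(a/(1+a))^(1/a); the distance below, and its weighting of the two coordinates, are illustrative choices rather than the paper’s definition.

```python
import numpy as np

def peak(a, s, m):
    """Closed-form peak of a fitted Fréchet curve: position (year) and height,
    which serve as comparison features between curves."""
    x_peak = m + s * (a / (1 + a)) ** (1 / a)   # mode of the Fréchet density
    height = float(frechet(np.array([x_peak]), a, s, m)[0])
    return x_peak, height

def peak_distance(p1, p2, height_weight=100.0):
    """Illustrative distance between two fitted peaks (position, height);
    the weight bringing heights to year-like units is an assumption."""
    (x1, h1), (x2, h2) = p1, p2
    return np.hypot(x1 - x2, height_weight * (h1 - h2))
```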
The same approach can be conducted with monotonic linear curves and ‘square’ curves.
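The linear family was sketched above; the ‘square’ family could likewise be modelled as a constant level between an onset and an offset year (function name and parametrization assumed):

```python
import numpy as np

def square(x, a, m1, m2):
    """'Square' family: a constant level a between years m1 and m2, zero
    elsewhere (cf. 'URSS', popular only between 1950 and 1990 in Figure 1)."""
    x = np.asarray(x)
    return np.where((x >= m1) & (x <= m2), a, 0.0)
```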
We believe that inverting the problem of n-gram visualization, by enabling automatic search in a space of curves, could profoundly transform research in this area, going beyond the intuitive hunt for remarkable curves. It is likely that many different levels of information, combining semantics and grammatical constraints with historical contexts, are implicitly coded in n-gram temporal profiles.
Understanding how to classify and study these curves is important for harnessing the power of this set of statistical tools. The solution based on families of simple archetypical curves briefly described in this article is certainly not the only way of approaching this question, but it constitutes an initial attempt to demonstrate the potential of this line of research.

Bibliography

Bauckhage, C., Kersting, K. and Hadiji, F. (2013). Mathematical Models of Fads Explain the Temporal Dynamics of Internet Memes. International AAAI Conference on Weblogs and Social Media, Cambridge, MA, 8–11 July 2013.

Delahaye, J.-P. and Gauvrit, N. (2013). Culturomics: Le numérique et la culture. Paris: O. Jacob.

Fraga Alves, M. I. and Neves, C. (2010). Extreme Value Distributions. In Lovric, M. (ed.), International Encyclopedia of Statistical Science. Berlin: Springer-Verlag, pp. 493–96, doi:10.1007/978-3-642-04898-2.

Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014): 176–82, doi:10.1126/science.1199644.

Trelea, I. C. (2003). The Particle Swarm Optimization Algorithm: Convergence Analysis and Parameter Selection. Information Processing Letters, 85(6): 317–25, doi:10.1016/S0020-0190(02)00447-7.
