Extracting author-specific expressions using random forest for use in the sociolinguistic analysis of political speeches

Takafumi Suzuki

Authorship

1. Takafumi Suzuki

University of Tokyo

Work text


This study applies stylistic text classification using random
forest to extract author-specific expressions for use in the
sociolinguistic analysis of political speeches. In the field of
politics, the style of political leaders’ speeches, as well as their
content, has attracted growing attention in both English (Ahren,
2005) and Japanese (Azuma, 2006; Suzuki and Kageura, 2006).
One of the main purposes of these studies is to investigate
political leaders’ individual and political styles by analyzing their
speech styles. A common problem of many of these studies,
and of many sociolinguistic studies, is that the expressions
that are analyzed are selected solely on the basis of the
researcher’s interests or preferences, which can sometimes
lead to contradictory interpretations. In other words, it is
difficult to determine whether these kinds of analyses have
in fact correctly identified political leaders’ individual speech
styles and, on this basis, correctly characterised their individual
and political styles. Another problem is that political leaders’
speech styles may also be characterised by the infrequent use
of specific expressions, but this is rarely examined.
In order to solve these problems, we decided to apply stylistic
text classification and feature extraction using random forest
to political speeches. By classifying the texts of an author
according to their style and extracting the variables contributing
to this classification, we can identify the expressions specific
to that author. This enables us to determine his/her speech
style, including the infrequent use of specific expressions, and
characterise his/her individual and political style. This method
can be used for the sociolinguistic analysis of various types
of texts, which is, according to Argamon et al. (2007a), a
potentially important area of application for stylistics.
Experimental setup
We selected the Diet addresses of two Japanese prime
ministers, Nakasone and Koizumi, and performed two
classification experiments: we distinguished Nakasone’s
addresses from those of his 3 contemporaries (1980-1989),
and Koizumi’s addresses from those of his 8 contemporaries
(1989-2006). Because Nakasone and Koizumi were two of
the most powerful prime ministers in the history of Japanese
politics, and took a special interest in the content or style of
their own speeches (Suzuki and Kageura, 2007), their addresses
are the most appropriate candidates for initial analysis. Suzuki
and Kageura (2006) have demonstrated that the style of
Japanese prime ministers’ addresses has changed significantly
over time, so we compared their addresses with those of
their respective contemporaries, selected according to the
standard division of eras in Japanese political history. We
downloaded the addresses from the online database Sekai to
Nihon (The World and Japan) (www.ioc.u-tokyo.ac.jp/~worldjpn),
and applied morphological analysis to the addresses using ChaSen
(Matsumoto et al., 2003). We unified notational variants
that differed only in their use of kanji or kana in Japanese.
Table 1 sets out the number of addresses and the total number
of tokens and types for all words, particles and auxiliary verbs
in each category.
Table 1. Basic data on the corpora
As a classification method, we selected the random forest (RF)
method proposed by Breiman (2001), and evaluated the results
using out-of-bag tests. RF is known to perform extremely well
in classification tasks in other fields using large amounts of
data, but to date few studies have used this method in the area
of stylistics (Jin and Murakami, 2006). Our first aim in using RF
was thus to test its effectiveness in stylistic text classification.
A second and more important aim was to extract the
important variables contributing to classification (which are
shown as high Gini coefficients). The extracted variables
represent the specific expressions distinguishing the author
from others, and they can show the author’s special preference
or dislike for specific expressions. Examining these extracted
expressions enables us to determine the author’s speech style
and characterise his/her individual and political styles. In an
analogous study, Argamon et al. (2007b) performed gender-based
classification and feature extraction using SVM and
information gain, but because those are two separate procedures,
whereas RF itself returns the variables contributing to
classification, RF is more suitable for our purposes.
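The two RF ingredients mentioned above can be illustrated with a minimal sketch. It assumes that the "Gini coefficients" refer to the usual RF importance measure, in which the decrease in Gini impurity produced by each split on a variable is accumulated over all trees, and that out-of-bag testing evaluates each tree on the texts left out of its bootstrap sample. The frequencies, labels and threshold below are hypothetical and are not taken from the corpora.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the unused indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


# Hypothetical relative frequencies of one expression in six addresses.
# The target author uses it LESS often, yet the split is still informative.
freqs = [0.010, 0.012, 0.011, 0.030, 0.028, 0.031]
authors = ["target", "target", "target", "other", "other", "other"]
decrease = gini_decrease(freqs, authors, threshold=0.020)

in_bag, oob = bootstrap_split(len(freqs), random.Random(0))
```

In this toy case the split separates the two classes perfectly, so the decrease equals the parent impurity. The example also shows why an expression that an author avoids can rank as highly as one the author favours.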
We decided to focus on the distribution of particles and
auxiliary verbs because they represent the modality of
the texts, information representing authors’ personality
and sentiments (Otsuka et al., 2007), and are regarded as
reflecting political leaders’ individual and political styles clearly
in Japanese (Azuma, 2006; Suzuki and Kageura, 2006). We
tested 8 distribution combinations as features (see Table 2).
Though Jin (1997) has demonstrated that the distribution of
particles (1st order part-of-speech tag) is a good indicator of
the author in Japanese, the performance of these features,
especially auxiliary verbs, has not been explored fully, and
as the microscopic differences in features (order of part-of-speech
and stemming) can affect the classification accuracy, we
decided to test the 8 combinations.
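As a rough sketch of how such feature variants might be built, the code below computes relative frequencies of particles and auxiliary verbs from morphologically analysed tokens, with switches for the order of part-of-speech and for stemming. The `Token` fields, the tag names, the key format and the choice of all tokens as the denominator are assumptions made for illustration. They do not reproduce ChaSen's output format or the exact feature definitions used in the experiments.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    surface: str  # form as it appears in the text
    base: str     # base (dictionary) form, used when stemming
    pos1: str     # 1st-order part-of-speech, e.g. "particle"
    pos2: str     # 2nd-order part-of-speech, e.g. "binding"


# Hypothetical tag names mapped to the short labels p / av.
TARGET_POS = {"particle": "p", "auxiliary_verb": "av"}


def relative_frequencies(tokens, pos_order=2, stem=True):
    """Relative frequency of each particle / auxiliary verb in one text."""
    total = len(tokens)  # assumption: normalise by all tokens in the text
    counts = Counter()
    for tok in tokens:
        tag = TARGET_POS.get(tok.pos1)
        if tag is None:
            continue  # skip nouns, verbs, etc.
        word = tok.base if stem else tok.surface
        if pos_order == 2 and tok.pos1 == "particle":
            key = f"{tag}_{tok.pos2}_{word}"
        else:
            key = f"{tag}_{word}"
        counts[key] += 1
    return {key: n / total for key, n in counts.items()} if total else {}


# Romanised toy sentence ("watashi wa iki mashi ta"), for illustration only.
tokens = [
    Token("watashi", "watashi", "noun", "pronoun"),
    Token("wa", "wa", "particle", "binding"),
    Token("iki", "iku", "verb", "independent"),
    Token("mashi", "masu", "auxiliary_verb", "*"),
    Token("ta", "ta", "auxiliary_verb", "*"),
]
features = relative_frequencies(tokens, pos_order=2, stem=True)
surface_features = relative_frequencies(tokens, pos_order=1, stem=False)
```

With stemming, the inflected form 'mashi' is counted under its base form 'masu'; without it, each surface form becomes a separate variable.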
Results and discussion
Table 2 shows the precision, recall rates and F-values. Koizumi
displayed higher accuracy than Nakasone, partly because he
had a more individualistic style of speech (see also Figures 2
and 3), and partly because a larger number of texts were used
in his case. Many of the feature sets show high classification
accuracy (more than 70%) according to the criteria of an
analogous study (Argamon et al., 2007c), which confirms
the high performance of RF. The results also show that the
distribution of the auxiliary verbs and combinations can give
better performance than that of the particles used in a previous
study (Jin, 1997), and also that stemming and a deeper order of
part-of-speech can improve the results.
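For reference, the three measures can be computed as in the sketch below. It assumes that the F-value is the balanced F1 score (the harmonic mean of precision and recall), which the text does not state explicitly. The labels are invented and are not the experimental results.

```python
def precision_recall_f(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Invented labels: "K" = target author, "O" = contemporaries.
true_authors = ["K", "K", "K", "K", "O", "O", "O", "O"]
predicted = ["K", "K", "K", "O", "K", "O", "O", "O"]
precision, recall, f_value = precision_recall_f(true_authors, predicted, "K")
```

Here three of the four addresses predicted as "K" are correct and three of the four true "K" addresses are found, so all three measures equal 0.75.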
Table 2. Precisions and recall rates
Figure 1 represents the top twenty variables with high Gini
coefficients according to the classification of combinations of
features (2nd order and with stemming). The figure indicates
that several top variables had an especially important role in
classification. In order to examine them in detail, we plotted in
Figures 2 and 3 the transitions in the relative frequencies of the
top four variables in the addresses of all prime ministers after
World War 2. These include variables representing politeness
(‘masu’, ‘aru’, ‘desu’), assertion (‘da’, ‘desu’), normativeness
(‘beshi’), and intention (‘u’), and show the individual and political
styles of these two prime ministers well. For example, ‘aru’ is
a typical expression used in formal speeches, including Diet
addresses (Azuma, 2006), and the fact that Koizumi used this
expression less frequently than any other prime minister
indicates his approachable speech style and can explain his
political success. The figures also show that we can extract the
expressions that Nakasone and Koizumi used less frequently
than their contemporaries as well as the expressions they
used frequently. These results show the effectiveness of RF
feature extraction for the sociolinguistic analysis of political
speeches.
Figure 1. The top twenty variables with high Gini coefficients.
The notations of the variables indicate the name of
part-of-speech (p: particle, av: auxiliary verb) followed by (in
the case of particles) the 2nd order part-of-speech.
Figure 2. Transitions in the relative frequencies of the top
four variables (Nakasone’s case) in the addresses of all
Japanese prime ministers from 1945 to 2006. The ‘NY’s
between the red lines represent addresses by Nakasone,
and other initials represent addresses by the prime minister
with those initials (see Suzuki and Kageura, 2007).
Figure 3. Transitions in the relative frequencies of the top
four variables (Koizumi’s case) in the addresses of all
Japanese prime ministers from 1945 to 2006. The ‘KJ’s
between the red lines represent addresses by Koizumi, and
the other initials represent addresses by the prime minister
with those initials (see Suzuki and Kageura, 2007).
Conclusion
This study applied text classification and feature extraction
using random forest for use in the sociolinguistic analysis of
political speeches. We showed that a method that is relatively
new to stylistics performs fairly well, and enables us to extract
author-specific expressions. In this way, we can systematically
determine the expressions that should be analyzed to
characterise political leaders’ individual and political styles.
This method can be used for the sociolinguistic analysis of
various types of texts, which will contribute to further
expansion in the scope of stylistics. A further study will
include more concrete analysis of extracted expressions.
References
Ahren, K. (2005): People in the State of the Union: viewing
social change through the eyes of presidents, Proceedings
of PACLIC 19, the 19th Asia-Pacific Conference on Language,
Information and Computation, 43-50.
Argamon, S. et al. (2007a): Stylistic text classification using
functional lexical features, Journal of the American Society for
Information Science and Technology, 58(6), 802-822.
Argamon, S. et al. (2007b): Discourse, power, and Ecriture
feminine: the text mining gender difference in 18th and 19th
century French literature, Abstracts of Digital Humanities, 191.
Argamon, S. et al. (2007c): Gender, race, and nationality in
Black drama, 1850-2000: mining differences in language use in
authors and their characters, Abstracts of Digital Humanities,
149.
Azuma, S. (2006): Rekidai Syusyo no Gengoryoku wo Shindan
Suru, Kenkyu-sya, Tokyo.
Breiman, L. (2001): Random forests, Machine Learning, 45,
5-32.
Jin, M. (1997): Determining the writer of diary based on
distribution of particles, Mathematical Linguistics, 20(8), 357-
367.
Jin, M. and M. Murakami (2006): Authorship identification with
ensemble learning, Proceedings of IPSJ SIG Computers and the
Humanities Symposium 2006, 137-144.
Matsumoto, Y. et al. (2003): Japanese Morphological Analysis
System ChaSen ver.2.2.3, chasen.naist.jp.
Otsuka, H. et al. (2007): Iken Bunseki Enjin, Corona Publishing,
Tokyo, 127-172.
Suzuki, T. and K. Kageura (2006): The stylistic changes of
Japanese prime ministers’ addresses over time, Proceedings
of IPSJ SIG Computers and the Humanities Symposium 2006,
145-152.
Suzuki, T. and K. Kageura (2007): Exploring the microscopic
textual characteristics of Japanese prime ministers’ Diet
addresses by measuring the quantity and diversity of nouns,
Proceedings of PACLIC 21, the 21st Asia-Pacific Conference on
Language, Information and Computation, 459-470.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO
