Extracting Stylistic Distances from Texts for Forensic Linguistics Purposes

Authorship
  1. 1. Katerina T. Frantzi

    Dept. of Mediterranean Studies - University of the Aegean

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The paper proposes a method for the automatic processing
of forensic texts for the extraction of similarities and
dissimilarities to be used for authorship identification purposes.
Forensic Linguistics expresses the application of linguistics to
the analysis of forensic language in written or oral form, the
study of the language of the law, the study of legal interpretation
and translation, the alleviation of disadvantage produced by
language in legal processes, the provision of forensic linguistic
evidence and of linguistic expertise in issues of legal drafting
and interpretation (Conley & O’Barr, 1998; Shy 1998; Cotterill,
2002; McMenamin & Dongdoo 2002; Semino & Culpeper,
2002; Gibbons, 2003). Our forensic texts belong to a Greek
terrorist organization, “17 November” (Kassimeris, 1997;
Παπαχελάς & Τέλλογλου, 2002; Frantzi, 2004). 17N started
its action in 1975, while the summer/autumn of 2002, all (or
most) of the members were arrested. The case is currently in
court for the second time. 84 of the proclamations, that used to
follow a 17N attack, have been characterized as genuine
(Κάκτος, 2002). Their authorship is an open issue. The final
aim of this project is to map each of the 17N proclamations to
its possible author (Frantzi & Ananiadou, 2006). In this work
we propose a way to characterise an organization’s text, i.e.
proclamation, pleading, replication, by values on its stylistic
characteristics. The values are used for extracting similarities
and dissimilarities among texts.
We consider the following stylistics features: the average
sentence and word length, the number of different nouns, proper
nouns, adjectives, active voice verbs, passive voice verbs,
adverbs, the number of future tense references, the use of
conjunctions, preposition, pronouns, particles, determiners,
articles, question words, punctuation, content words. For each
pair of possibly owned by the same author writings, we create
a matrix which we call distance-matrix (Table 1). Each cell of
the matrix contains a number that corresponds to a stylistic
feature distance between the two writings. In order to compute the distance-matrix for a pair of writings,
we need to assign a value to each cell that corresponds to a
particular stylistic feature. This value gives the difference
(distance) between the two writings regarding the specific
feature. Then the distance between the two writings is evaluated
by the average value of all the feature-distances in the
distance-matrix. A feature-distance is assigned to a cell and is
calculated by the average of all phenomena distances for that
feature. The phenomena distances for a feature are calculated
by their normalized frequencies of occurrence in the two
writings. The algorithm for the processing of the
feature-distances is the following:
Each author is characterized by his/her pleadings and
replications. If we could match a proclamation to a pleading
(and replication) then we could possibly match a proclamation
to the corresponding author. We need to compare all pleadings
(and replications) to the proclamations, to find the smallest
“distance” in terms of their feature values. The smallest distance
assigns a proclamation to a pleading, i.e. an author.
In this work we apply the distance-table method for
comparisons between the pleadings and replications: we already
know the authors of the pleadings and proclamations, so we
can present how the distance-table works. We apply the method
to three writings: one pleading and two replications. Let us
consider Yotopoulos’s (the person accused to be the main leader
and one of the main authors of the proclamations) pleading of
18321 words, and replication of 2036 words, and Koufontinas’s
(also accused to be a leading member) replication of 326 words.
We will evaluate the distance-matrix for the pairs of writings
(Yotopoulos’s pleading, Yotopoulos’s replication) and
(Yotopoulos’s pleading, Koufontinas’s replication). We will
show the evaluation for one of the cells of the two
distance-matrixes, the distance-matrix[1,1], i.e. the cell that
keeps the distance regarding the stylistic feature of the particles’
use, while the evaluation is analogous for the rest of the cells.1
There are five particles in Greek: “να”, “για”, “θα”, “ας” and
“μα”. Table 2 gives the frequencies of occurrence and
normalized frequencies (to 10,000 words) for the five particles
found in Yotopoulos’s pleading (Yp) and replication (Yr) and
those in Koufodinas’s replication (Kr).
The distance-matrix[1,1] cell for the (Yotopoulos’s pleading,
Yotopoulos’s replication) distance-matrix holds the distance
regarding the use of particles. We evaluate the distance on the
use of each of the five particle between the two writings and
we assign to the distance-matrix1,1] cell the average of all the
particles’ distances. Then we do the same for the
distance-matrix1,1] cell of the (Yotopoulos’s pleading,
Koufodinas’s replication) distance-matrix. The particles’
distances for the particle “να” for the two pair of writings are
calculated according the above-given algorithm as:The
distance-matrix[1,1] cell for the (Yotopoulos’s pleading,
Yotopoulos’s replication) distance-matrix holds the distance
regarding the use of particles. We evaluate the distance on the
use of each of the five particle between the two writings and
we assign to the distance-matrix[1,1] cell the average of all the
particles’ distances. Then we do the same for the
distance-matrix[1,1] cell of the (Yotopoulos’s pleading,
Koufodinas’s replication) distance-matrix. The particles’
distances for the particle “να” for the two pair of writings are
calculated according the above-given algorithm as:
Particles_distance(Yp,Yr)[να]=|245.619 - 294.695| / (245.619
+ 294.695)=0.091 Particles_distance(Yp,Κr)[να]=|245.619 -
122.699| / (245.619 + 122.699)=0.334
Table 3 gives the results of the calculations for all the particles’
distances for those pair of writings. We get the average for each
of the columns of Table 3, i.e. 0.289 and 0.409. These averages
will be assigned to distance-matrixes.[1,1] cells for the two
pairs of writings distance-matrixes.
Table 4 gives all the feature-distances for the (Yotopoulos’s
pleading, Yotopoulos’s replication) distance-matrixwhile Table
5 for the (Yotopoulos’s pleading, Koufodinas’s replication)
one. When matrixes are filled up, we can evaluate the total distances
for the pairs of writings as the average of all the cell values:
matrix-distance(Yp, Kr)= 0.5593 matrix-distance (Yp,Yr)=
0.3382
As a result Yotopoulos’s pleading is linked to Yotopoulos’s
replication (smaller distance), which actually means that these
two writings have been found to be closer as for their language
style than the writings of Yotopoulos’s pleading and
Koufodinas’s replication.
Future work involves:
• Augmentation of the distance-matrix with more stylistics
features, e.g. the use of collocations (Frantzi & Ananiadou,
2007),
• use of weights for the stylistics features,
• application of the method for characterizing the
proclamations for the provision of authorship evidence,
• comparisons among the proclamations for their grouping
according to their linguistic profile.
1. We cannot present the evaluation of all cells here due to space
restrictions.
Bibliography
Conley, John M., and William M. O'Barry. Just Words – Law,
Language, and Power. Chicago and London: The University
of Chicago Press, 1998.
Cotterill, Janet, ed. Language in the Legal Process. MacMillan,
2002.
Frantzi, Katerina T. "Corpus Linguistics: What can it do with
Terrorism?" International Journal of Humanities 2.2 (2004):
1603-1608.
Frantzi, Katerina T., and Sophia Ananiadou. "Automatic
Authorship Identification." Proceedings of the International
Association of Forensic Linguistics 2nd European Conference
on Forensic Linguistics / Language & the Law. Barcelona,
Spain. 14-16.
Frantzi, Katerina T., and Sophia Ananiadou. "C-value for
Authorship Identification." Proceedings of the International
Association of Forensic Linguistics 8th Biennial International
Conference on Forensic Linguistics / Language & the Law –
IAFL 8, Seattle, Washington, July 12-15 2007. forthcoming.
Gibbons, John. Forensic Linguistics: An Introduction to
Language in the Justice System. Language in Society 32.
Blackwell Publishing Ltd, 2003.
Kassimeris, George. Europe’s Last Red Terrorists: The
Revolutionary Organization 17 November. C. Hurst & Co. Ltd,
1997.
McMenamin, Gerald R., and Choi Dongdoo, eds. Forensic
Linguistics : Advances in Forensic Stylistics. CRC Press, 2002.
Semino, Elena, and Jonathan Culpeper, eds. Cognitive Stylistics:
Language and Cognition in Text Analysis. Linguistic
Approaches to Literature 1. John Benjamins Publishing
Company, 2002.
Shuy, Roger W.>. The Language of Confession, Interrogation
and Deception . Empirical Linguistics Series. Sage Publications,
1998.
Κάκτος. 17N – Οι προκηρύξεις. Αθήνα: Κάκτος, 2002.
Παπαχελάς, Α., and T. Τέλλογλου. Αθήνα, Ελλάδα: Εστία,
2002.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Series: ADHO (2)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None