Università di Roma Tor Vergata
Università di Roma Tor Vergata
Università per Stranieri di Siena
Università di Roma Tor Vergata
This research is the result of a collaboration between the Departments of Literature and Medicine of the University of Rome “Tor Vergata”.
Marco Zanasi’s idea of coding and studying the dream reports of his patients at the Tor Vergata University Hospital Psychiatric Unit launched a team investigation of a challenging hypothesis: is there any correlation between the linguistic realization of dream reports and the psychopathology from which the dreamer suffers? So far, analyses of dream reports have focused mainly on actants, settings, and descriptors of the emotional condition during oneiric activity.
The goal of “Dream Coding” is to collect, transcribe, catalogue, file, study, and comment on dreams reported by selected patients at Tor Vergata University Hospital in Rome. In the group of observed patients, different psychopathological statuses are represented.
The aim of the project is to build up a code enabling a different reading of the reports that patients make of their dreams, a code that could in the future also provide new tools for initial and ongoing diagnosis of the psychopathology affecting the patients.
We then want to identify a set of linguistic features that correlate significantly, on a statistical basis, with the type of psychopathology. Numerous aspects, both theoretical and methodological, are discussed in this paper, ranging from the nature of the language variation to be investigated to the tag set to be used in corpus preparation for computer analysis.
The first issue the analysis has to solve is speech transcription: how to obtain an accurate transcoded text from the oral form. After considering several working hypotheses, we decided to use a “servile transcription” of speech, recording the narrators’ voices and transferring them to an electronic format.
At an early stage we realized that some important elements of the speech could be lost in this process, which is why we decided to reproduce interruptions and hesitations. Interruptions have been classified on the basis of their length. Starting from this consideration, a form (enclosed) has been created to catalogue the dreams together with other data, such as age, sex, address, birthday, job, and so on.
The form also contains information about the transcriber, so that the “noise” they add can be detected; this noise can be partially isolated by noticing the continuity of certain elements.
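As an illustration, each catalogue record could be modelled as follows. This is a minimal sketch; the field names are hypothetical, chosen simply to mirror the form described above.

    from dataclasses import dataclass

    @dataclass
    class DreamRecord:
        # Patient data collected on the cataloguing form
        patient_id: str
        age: int
        sex: str
        address: str
        birthday: str
        job: str
        diagnosis: str        # empty for patients without a prior diagnosis
        # Transcriber data, kept so that transcriber "noise" can be isolated
        transcriber_id: str
        transcript_path: str  # the .rtf file holding the servile transcription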
The data obtained are stored on the computer as word-processor files (.rtf format) to be analyzed later on. The working process has been divided into four steps:
1. first step: we collect the data of patients who already have a diagnosis;
2. second step: we analyze the dreams collected in the previous phase and identify recurrent lexical structures;
3. third step: we collect data from all other patients in order to analyze them and propose a tentative diagnosis;
4. fourth step: we compare the results for the first group of patients with those for the second, and try to create a “vademecum” of the particularities noticed from a linguistic point of view. We call this classification a “textual diagnosis”, to be compared with the “medical diagnosis” in order to verify correspondences and incongruities.
Although our research is at a preliminary stage, the statistical results of the text analysis obtained so far allow us to make a few observations. The first noticeable aspect is that differences between the dreams of psychopathological patients and those of controls are observed, which suggests the existence of specific features most likely correlated with the illness. We also observe a scale of narrative complexity progressively decreasing from psychotic patients to bipolar patients to controls. Since the patients recruited are in a remission phase, we may hypothesize that the variations observed are an expression of the underlying pathology and thus not due to delusions, hallucinations, hypomania, mania, or depression.
To support such a fine-grained analysis, the research group has started marking up the texts of the dream reports. The markup will allow computer-aided analyses of semantics, syntax, rhetorical organization, metaphorical references, and text architecture. Under the label “text architecture” we include items such as coherence, cohesion, and anaphoric chains. The complete quantitative results will be available soon, but it is already possible to show the set of variables for which the texts will be surveyed in order to account for the lexical variety we registered.
The syntactic and semantic opposition of NEW and GIVEN information is the basis on which both formal and content items will be annotated. As an example of the connection between formal and content items, the use of definite or indefinite articles will be linked to the following narrative shifts:
• a new character is introduced in the plot, be it a human
being or an animal;
• a shift to a character already in the plot;
• a new object appears in the scene;
• a shift to an object already in the plot;
• a shift of location without a verb group expressing the
transfer.
The difficulty of recognizing a shift in story time without an appropriate time adverb or expression in the discourse made it unnecessary to register this variation separately, since it would necessarily be correlated with the corresponding formal element.
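As an illustration, the five shift types above could be encoded with a small tag set linked to the article form that introduces them. The following is a minimal sketch; the tag names and data layout are our own hypothetical choices, not a fixed annotation standard.

    from enum import Enum

    class Shift(Enum):
        # The five narrative shifts listed above
        NEW_CHARACTER = "new character (human or animal) enters the plot"
        GIVEN_CHARACTER = "shift to a character already in the plot"
        NEW_OBJECT = "new object appears in the scene"
        GIVEN_OBJECT = "shift to an object already in the plot"
        LOCATION_SHIFT = "location shift without a verb group expressing the transfer"

    # Each annotation pairs a noun phrase and its article form with the
    # shift it realizes, e.g. an indefinite article introducing NEW information:
    annotation = ("a dog", "indefinite", Shift.NEW_CHARACTER)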
The non-coherent and often non-cohesive structure of many patients’ dreams proves relevant in explaining what the ratios indicate as a richer lexicon. The consistent, abrupt introduction of NEW pieces of information, especially when correlated with the formal features, is likely to be one of the most significant differences between controls and patients, as well as between the first two large psychopathological categories analyzed here, bipolar disorders and psychotic disorders.
Our research is an ongoing project that aims to enlarge the sample and to evaluate the possible influence of pharmacological treatments on the oneiric characteristics being analyzed.
We have developed a measurement system for interruptions that we call the “Scale of Silence”; it provides several special mark-up tags for interruptions during speech. The scale (currently in a testing phase) has three types of tags according to the length of the interruption (0-2 sec; 2-4 sec; 4 sec or more). Special tags are also used for Freudian slips and for the narrator’s self-corrections.
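A minimal sketch of how the scale might be applied during transcription follows; the tag names are hypothetical, since the scale itself is still being tested.

    def silence_tag(pause_seconds: float) -> str:
        """Map an interruption length to a Scale of Silence mark-up tag,
        using the three bands described above (0-2 sec; 2-4 sec; 4 or more)."""
        if pause_seconds < 2:
            return "<pause.short/>"
        elif pause_seconds < 4:
            return "<pause.medium/>"
        return "<pause.long/>"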
The Chinese version of JGAAP supports Chinese word segmentation first, followed by a feature selection process at the word level, in preparation for a later analytic phase. After obtaining a set of ordered feature vectors, we then use different analytical methods to produce authorship judgments.
Unfortunately, the errors introduced by the segmentation method(s) will almost certainly influence the final outcome, creating a need for testing.
Almost all methods for Chinese word segmentation developed so far are either structural (Wang et al., 1991) or statistical (Lua, 1990). A structural algorithm resolves segmentation ambiguities by examining the structural relationship between words, while a statistical algorithm usually compares word frequency and character co-occurrence probability to detect word boundaries. The difficulties in this study are ambiguity resolution and novel word detection (personal names, company names, and so on). We use a combination of maximum matching and conditional probabilities to minimize these errors.
Maximum matching (Liu et al., 1994) is one of the most popular structural segmentation algorithms. Scanning from the beginning of the text to the end is called Forward Maximal Matching (FMM); scanning the other way is called Backward Maximal Matching (BMM). A large lexicon that contains all the possible words in Chinese is usually used to find segmentation candidates for input sentences. Here we need a lexicon that contains not only general words but also as many personal names, company names, and organization names as possible, for detecting new words.
Before we scan the text, we apply certain rules to divide the input sentence into small linguistic blocks, such as splitting the document at English letters, numbers, and punctuation, giving us small pieces of character strings. The segmentation then starts from both directions of these small character strings. The major resource of our segmentation system is this large lexicon: we compare the linguistic blocks with the words in the lexicon to find possible word strings. If a match is found, one word is segmented successfully. We do this in both directions; if the results are the same, the segmentation is accomplished. If not, we take the result with fewer words. If the number of words is the same, we take the BMM result. As an example, suppose ABCDEFGH is a character string and our lexicon contains the entries A, AB, and ABC, but not ABCD. For FMM, we start from the beginning of the string (A). If A is found in the lexicon, we then look for AB; if AB is also found, we look for ABC, and so on, until the string is no longer found. Here ABCD is not in the lexicon, so we take ABC as a word and continue from character D to the end of the character string. BMM works in the opposite direction, starting with H, then GH, then FGH, and so forth.
Suppose the segmentation we get from FMM is
(a) A \ B \ CD \ EFG \ H
and the segmentation from BMM is
(b) A \ B \ C \ DE \ FG \ H
We take result (a), since it has fewer words. But if what we get from BMM is instead
(c) AB \ C \ DE \ FG \ H
we take result (c), since the number of words is the same in both methods.
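The whole procedure can be sketched as follows. This is a simplified illustration with a toy lexicon lookup, not our production segmenter, and it omits the conditional-probability component used for novel words.

    def fmm(text, lexicon, max_len=4):
        """Forward Maximal Matching: from the left, take the longest
        lexicon entry at each position, falling back to one character."""
        words, i = [], 0
        while i < len(text):
            for size in range(min(max_len, len(text) - i), 0, -1):
                if text[i:i + size] in lexicon or size == 1:
                    words.append(text[i:i + size])
                    i += size
                    break
        return words

    def bmm(text, lexicon, max_len=4):
        """Backward Maximal Matching: the same scan from the right end."""
        words, j = [], len(text)
        while j > 0:
            for size in range(min(max_len, j), 0, -1):
                if text[j - size:j] in lexicon or size == 1:
                    words.insert(0, text[j - size:j])
                    j -= size
                    break
        return words

    def segment(text, lexicon):
        """Apply the tie-breaking rules described above: if both scans
        agree, we are done; otherwise prefer fewer words; on a tie, keep BMM."""
        f, b = fmm(text, lexicon), bmm(text, lexicon)
        if f == b or len(f) < len(b):
            return f
        return b

For the string ABCDEFGH and the lexicon {A, AB, ABC}, fmm yields ABC \ D \ E \ F \ G \ H, exactly as in the walk-through above.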
After the segmentation step we take advantage of JGAAP’s features, adding different event sets according to the characteristics of Chinese and then applying statistical analysis to determine the final results. It is not clear at this writing, for example, whether the larger character set of Chinese will make character-based methods more effective than they are in languages written with the Latin alphabet (such as English). It is also not clear whether the segmentation process will produce the same kind of “function word” set that has proved so useful in English authorship attribution. The JGAAP structure (Juola et al., 2006; Juola et al., submitted), however, will make it easy to test our system using a variety of different methods and analysis algorithms.
In order to test the performance of our software on Chinese, we are in the process of constructing a Chinese test corpus. We will select three popular novelists and ten novels from each; eight novels per author will be used as training data and the other two as testing data. We will also test on blogs selected from the internet, following the same testing procedure as for the novels.
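A sketch of the planned per-author split follows; the function and its parameters are illustrative placeholders, not part of JGAAP.

    import random

    def split_corpus(novels_by_author, n_train=8, seed=0):
        """Per author: hold out two novels for testing and train on the
        remaining eight, as described above."""
        rng = random.Random(seed)
        train, test = {}, {}
        for author, novels in novels_by_author.items():
            shuffled = list(novels)
            rng.shuffle(shuffled)
            train[author] = shuffled[:n_train]
            test[author] = shuffled[n_train:]
        return train, test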
This research demonstrates, first, that the JGAAP structure can easily be adapted to the problems of non-Latin scripts and non-English languages, and second, provides some cues toward best practices for authorship attribution in Chinese. It can hopefully be extended to the development of authorship attribution systems for other non-Latin scripts.
References:
Patrick Juola. (in press). Authorship Attribution. Delft: NOW Publishing.
Patrick Juola, John Noecker, and Mike Ryan. (submitted). “JGAAP 3.0: Authorship Attribution for the Rest of Us.” Submitted to DH2008.
Patrick Juola, John Sofko, and Patrick Brennan. (2006). “A Prototype for Authorship Attribution Studies.” Literary and Linguistic Computing 21:169-178.
Yuan Liu, Qiang Tan, and Kun Xu Shen. (1994). The Word Segmentation Rules and Automatic Word Segmentation Methods for Chinese Information Processing (in Chinese). Qing Hua University Press and Guang Xi Science and Technology Press, page 36.
K. T. Lua. (1990). “From Character to Word: An Application of Information Theory.” Computer Processing of Chinese and Oriental Languages, Vol. 4, No. 4, pages 304-313, March.
Liang-Jyh Wang, Tzusheng Pei, Wei-Chuan Li, and Lih-Ching R. Huang. (1991). “A Parsing Method for Identifying Words in Mandarin Chinese Sentences.” In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 1018-1023, Darling Harbour, Sydney, Australia, 24-30 August.