The Introduction of Word Types and Lemmas in Novels, Short Stories and Their Translations

Authorship
  1. Mária Csernoch

    Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)


Introduction
Earlier analyses of the introduction of word types in literary works came to contradictory conclusions. Some authors suggested, in accordance with readers’ intuition, that the start of a new chapter usually coincides with a sudden increase in the number of newly introduced word types (NWT). Others found no clear connection between a rise in NWT and the beginning of chapters; rather, they found that increases in NWT appear at longer descriptions, for largely stylistic reasons.
Words do not occur randomly in texts, so the ultimate goal of building models based on word frequency distributions may not be the reproduction of the original text. Nevertheless, models based on the randomness assumption give reliable information about the structure of texts (for a review, see Oakes, 1998; Baayen, 2001). To further our knowledge in this field, the source of the bias between the original and the model-based texts should be examined. Baayen (1996; 2001) described a systematic overestimation of the expected vocabulary size and found that this bias disappears when the order of the sentences is randomized, indicating that the bias should not be attributed to constraints operating at the sentence level.
To show that this misfit is due to significant changes at the discourse level, we introduced several new concepts while building the model and analyzing the results (Csernoch, 2003). The fundamental step among these was to scrutinize NWT in hundred-token-long intervals rather than examining the overall vocabulary size. Next, instead of eliminating the bias between the artificial texts and the original works, we examined the significant protuberances on the graphs of NWT. Monolingual sets of works were processed first; then, to improve the comparison, we also analyzed original English texts together with their Hungarian translations, and a Hungarian text together with its English and German translations.
If the changes occur at the discourse level, the language in which the text is written should have no significance. In other words, neither syntactic nor semantic constraints at the sentence or paragraph level should matter; only events at the discourse level should substantially alter the flow of the text and thus produce considerable protuberances on the graphs of NWT.
Methods
Building the model
To analyze a text, we first counted the number of different word types and determined the frequency of each; a dynamic model was then built from these frequencies (Csernoch, 2003). The model generated an artificial text whose word types had the same frequencies as in the original text, and it reproduced the trends of the original text. However, changes that are only seasonal (the protuberances) did not appear in the artificial text. To locate these protuberances, the difference between the original and the model text was calculated. We then determined the mean (M) and the standard deviation (SD) of the difference. Protuberances falling outside the range M±2SD were considered significant.
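The procedure above can be illustrated with a minimal Python sketch. It is not the dynamic model of Csernoch (2003), whose details are not given here. As a loudly stated assumption, the expected curve is approximated by averaging NWT counts over random permutations of the tokens, which preserves the word-type frequencies under the randomness assumption. The function names, the 200-run default, and the simple list-of-tokens input are all hypothetical choices made for illustration.

```python
import random
import statistics


def new_types_per_interval(tokens, interval=100):
    """Count word types seen for the first time in each interval of tokens."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), interval):
        new = 0
        for tok in tokens[start:start + interval]:
            if tok not in seen:
                seen.add(tok)
                new += 1
        counts.append(new)
    return counts


def shuffled_baseline(tokens, interval=100, runs=200, seed=0):
    """Mean NWT per interval over random permutations of the tokens.

    ASSUMPTION: this is a stand-in for the dynamic model, not the model itself.
    """
    rng = random.Random(seed)
    pool = list(tokens)
    totals = [0.0] * len(new_types_per_interval(pool, interval))
    for _ in range(runs):
        rng.shuffle(pool)
        for i, count in enumerate(new_types_per_interval(pool, interval)):
            totals[i] += count
    return [t / runs for t in totals]


def significant_protuberances(observed, expected, k=2.0):
    """Indices of intervals whose difference lies outside M +/- k*SD."""
    diffs = [o - e for o, e in zip(observed, expected)]
    m = statistics.mean(diffs)
    sd = statistics.pstdev(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - m) > k * sd]
```

With a tokenized text in `tokens`, calling `significant_protuberances(new_types_per_interval(tokens), shuffled_baseline(tokens))` returns the indices of the hundred-token intervals that would be flagged under these assumptions.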
The distribution of the hapax legomena was also examined. Assuming that they are binomially distributed, their expected mean (Mh) and standard deviation (SDh) were calculated, and again those points that exceeded Mh+2SDh were considered significant.
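The hapax test can be sketched in the same way. The text does not spell out how the binomial distribution is parametrized, so the sketch below assumes that each token of an interval is a hapax legomenon with probability p = (number of hapax legomena) / (number of tokens), giving Mh = n·p and SDh = √(n·p·(1−p)) for an interval of n tokens. The function names are hypothetical, and a non-empty token list is assumed.

```python
import math
from collections import Counter


def hapax_per_interval(tokens, interval=100):
    """Count, per interval, the tokens that occur exactly once in the whole text."""
    freq = Counter(tokens)
    return [
        sum(1 for tok in tokens[start:start + interval] if freq[tok] == 1)
        for start in range(0, len(tokens), interval)
    ]


def binomial_threshold(tokens, interval=100, k=2.0):
    """Return (Mh, SDh, Mh + k*SDh) under the assumed binomial parametrization."""
    freq = Counter(tokens)
    n_hapax = sum(1 for count in freq.values() if count == 1)
    p = n_hapax / len(tokens)  # ASSUMPTION: per-token hapax probability
    mh = interval * p
    sdh = math.sqrt(interval * p * (1 - p))
    return mh, sdh, mh + k * sdh


def significant_hapax_intervals(tokens, interval=100, k=2.0):
    """Indices of intervals whose hapax count exceeds Mh + k*SDh."""
    _, _, threshold = binomial_threshold(tokens, interval, k)
    counts = hapax_per_interval(tokens, interval)
    return [i for i, count in enumerate(counts) if count > threshold]
```

Only the upper threshold is tested here, following the Mh+2SDh criterion stated above. A final interval shorter than `interval` is compared against the same threshold, which is a simplification.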
Texts compared to their translations
Original texts were compared not only to the model-generated artificial texts but also to their translations in other natural languages. In this study we analyzed the Hungarian novel SORSTALANSÁG by Imre Kertész and its English (FATELESS) and German (ROMAN EINES SCHICKSALLOSEN) translations, Rudyard Kipling’s THE JUNGLE BOOKS and their Hungarian translations (A DZSUNGEL KÖNYVE), and Lewis Carroll’s ALICE’S ADVENTURES IN WONDERLAND and THROUGH THE LOOKING GLASS and their Hungarian translations (ALICE CSODAORSZÁGBAN and ALICE TÜKÖRORSZÁGBAN).
These three languages were chosen because they differ in their morphological structures; it is hard to trace any common syntactic characteristic that all three share.
Analyzing lemmatized texts
To check whether the analyses of the raw, un-lemmatized texts give reliable information about the introduction of NWT, both the English and the Hungarian texts were lemmatized. The English texts were tagged and lemmatized by CLAWS (the Constituent Likelihood Automatic Word-tagging System) [1], while the morphological analysis of the Hungarian texts was carried out by Humor, with disambiguation based on a TnT tagger [2].
Results
Comparing the texts and their translations, we first found that the morphologically productive Hungarian texts had the smallest number of running words and lemmas but the largest number of hapax legomena, in both the lemmatized and un-lemmatized versions. In contrast, the English texts contained the most running words but the smallest number of hapax legomena.
An individual model was created for each text and language. Based on these models, the positions of the significant protuberances were traced and compared across the original texts and their translations. Regardless of the actual language, these protuberances occurred in most cases at the same position, that is, at the same event in the flow of the story.
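Because a text and its translation differ in length, interval indices cannot be compared directly. One hypothetical way to line them up is sketched below: each flagged interval is expressed as a fraction of the whole text, and two positions are paired when they lie within a small tolerance. This is only an approximation under stated assumptions. The comparison described above is by event in the story, and the function names and the 0.02 tolerance are illustrative choices, not part of the study.

```python
def relative_positions(flagged, n_intervals):
    """Map flagged interval indices to midpoints on a 0..1 scale of the text."""
    return [(i + 0.5) / n_intervals for i in flagged]


def match_positions(pos_a, pos_b, tolerance=0.02):
    """Pair each position in pos_a with the closest position in pos_b.

    A pair is kept only if the two relative positions differ by at most
    `tolerance` (ASSUMPTION: an arbitrary illustrative value).
    """
    matches = []
    for a in pos_a:
        close = [b for b in pos_b if abs(a - b) <= tolerance]
        if close:
            matches.append((a, min(close, key=lambda b: abs(a - b))))
    return matches
```

Positions left unpaired by `match_positions` would then be candidates for manual inspection against the story line.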
We could clearly establish that the protuberances were found at places where new, usually only marginally connected pieces of information were inserted into the text, rather than at new chapters. This finding was strengthened by a peculiarity of the English translation of SORSTALANSÁG: its chapter boundaries differ from those of the Hungarian and German texts, which further substantiates that the protuberances neither necessarily coincide with the beginning of a new chapter nor are hallmarks of one. Similarly, in the original Alice stories the chapter boundaries are eliminated by unusual typographic devices, while in the Hungarian translation these boundaries are restored to normal. Neither the English nor the Hungarian texts produced any protuberances at these places. In THE JUNGLE BOOKS we again found that the significant differences between the original text and the model are not necessarily at the beginning of a new tale, except where a new setting is introduced.
The fact that these descriptions have only a stylistic role in the text was further substantiated by examining the distribution of hapax legomena. The number of hapax legomena was found to be high at exactly the same positions in the text where protuberances in the number of newly introduced word types occurred.
Examining the lemmatized versions of the texts carried some risk: losing the affixes might eliminate changes in mode, time, style, etc., while, on the other hand, it might reveal events that are lost in the word types carrying the affixes. Since our dynamic model is capable of giving a relatively good estimate of the introduction of words, the question was whether using lemmas instead of word types would provide additional information when comparing the artificial texts and the translations to the original text.
In the English texts lemmatization did not reveal any additional information; the protuberances occurred at exactly the same places in the lemmatized as in the un-lemmatized versions. In the un-lemmatized Hungarian texts the first protuberance usually occurred later than in the corresponding English and German texts, although we were able to locate the corresponding protuberances by examining those that were somewhat below the level of significance. In these cases lemmatization helped, and we obtained clear protuberances reaching the level of significance in the lemmatized Hungarian texts.
The comparison of dynamic models built for lemmatized texts in different languages might also be used to analyze and compare the vocabulary of the original texts and their translations. It would, furthermore, enable the comparison of the stylistic tools used by the original author and the translator in the introduction of new words.
Summary
Using lexical statistical models for analyzing texts, we examined the explanation for the difference between the original and the model-based artificial text. It was found that changes at the discourse level, rather than at the sentence or paragraph level, are responsible for these differences. Two methods were used to show this. First, texts and their translations, in both lemmatized and un-lemmatized versions, were analyzed and compared to a dynamic model built on the randomness assumption; the significant changes on the graphs of the newly introduced word types occurred at corresponding positions within the translations. Second, the distribution of hapax legomena was compared to a binomial distribution; the significant differences between the original and the predicted distributions occurred at descriptions that are only loosely connected with what precedes and follows them. More importantly, these coincided with the significant changes in the newly introduced word types.
References
Baayen, R. H. (1996) The Effect of Lexical Specialization on the Growth Curve of the Vocabulary. Computational Linguistics 22. 455-480.
Baayen, R. H. (2001) Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht, Netherlands.
Csernoch, M. (2003) Another Method to Analyze the Introduction of Word-Types in Literary Works and Textbooks. Conference Abstract, The 16th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, Göteborg University, Sweden.
Oakes, M. P. (1998) Statistics for Corpus Linguistics. Edinburgh University Press.
[1] http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/
[2] http://corpus.nytud.hu/mnsz/index_eng.html

Conference Info

ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC
