Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)
INTRODUCTION
Models based on the frequency of word-types assume that tokens in a text occur randomly. Even with this constraint, many different strategies can be followed when selecting words randomly. The best models have proved to be those that assume that word-types are binomially distributed. A model based on this binomial distribution was first created by Baayen (Baayen, 1996a; Baayen, 1996b; Baayen et al., 1996; Tweedie and Baayen, 1998) and then used by Hoover (2003). In these models a constant is calculated for each type that occurs m times in the original text. Summing these constants over all types gives a predicted number of word-types. A model created this way is static, since it always provides the same constant for a selected M (M = N, where N is the number of tokens in the text). In the era of the new generation of personal computers, dynamic models can also be built on the same assumptions.
The ultimate goal of our studies was to build such a dynamic model. The question was whether this new model could help us explain the regularities in the introduction of word-types in a text, and support the hypothesis that significantly decreasing the size of the blocks into which a text is divided gives a more sophisticated insight into its overall structure. Namely, by doing so we can trace important narrative events (the introduction of a new scene, or the embedding in the flow of narration of a scene, setting or event only distantly related to the general flow of the text) and thus point to possible individual characteristics of the given narration. This hypothesis is tested by matching statistical data from the actual text against a model which generates the hypothetical distribution of the introduction of new word-types in a text. Any differences between real text data and data from the model are expected to show both the potential and the limitations of the approach.
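The static prediction described above can be illustrated with a minimal sketch. It assumes the usual binomial form, in which a type occurring f times in a text of N tokens is absent from a random sample of M tokens with probability (1 - f/N)^M; the exact constants used in the cited models may differ. The function name `expected_types` and its parameters are illustrative and are not taken from the cited work.

```python
from collections import Counter


def expected_types(tokens, sample_size):
    """Expected number of distinct types in `sample_size` randomly drawn tokens.

    Assumption: each draw picks a given type with probability f / N, where f is
    the type's frequency and N the text length (binomial model).
    """
    n = len(tokens)
    freqs = Counter(tokens)
    # Each type contributes the probability that it occurs at least once.
    return sum(1.0 - (1.0 - f / n) ** sample_size for f in freqs.values())
```

For a fixed sample size the function always returns the same value, which is the sense in which such a model is static.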
METHODS
To conduct the analysis of a given text, a program was developed which provides a model and creates an artificial text based on the frequency of the word-types in the original text. The essence of the model is that the relative frequencies of the word-types are counted, and a distribution function is generated from these frequencies. A number is then picked at random and mapped to a word by means of the distribution function. To provide results comparable to previous models and to analyze Hungarian literary works, the program is designed for English and Hungarian texts, but users can create their own character sets for further investigations.
Unlike the models described in previous reports (Baayen, 1996a; Baayen, 1996b; Hoover, 2003), this program provides a dynamic model, owing to its random selection based on the distribution function. The other difference from previous methods is that the program divides the texts into short, constant-length intervals, usually of 100 words. This method was found to be more accurate in two respects. First, the length of the intervals does not depend on the length of the text, so texts of different lengths can be compared more reliably. Second, the short intervals make it possible to trace subtle changes in the course of the story.
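The generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the study: tokens are taken as an already prepared list of words, the distribution function is represented by cumulative type counts, and the names `build_distribution` and `generate_text` are hypothetical.

```python
import bisect
import itertools
import random
from collections import Counter


def build_distribution(tokens):
    """Return the types and their cumulative counts (the distribution function)."""
    counts = Counter(tokens)
    types = list(counts)
    cumulative = list(itertools.accumulate(counts[t] for t in types))
    return types, cumulative


def generate_text(tokens, length, seed=None):
    """Create an artificial text by mapping random numbers to word-types."""
    rng = random.Random(seed)
    types, cumulative = build_distribution(tokens)
    total = cumulative[-1]
    artificial = []
    for _ in range(length):
        r = rng.randrange(total)  # random integer in [0, total)
        # The first cumulative count greater than r identifies the type.
        artificial.append(types[bisect.bisect_right(cumulative, r)])
    return artificial
```

Because every run draws a fresh random sequence, repeated calls give different artificial texts with the same underlying type frequencies.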
MATERIALS
The analyzed texts were chosen according to the following strategy: different works by one author, different volumes of a series by the same author, works of the same genre, and concatenated short stories by the same or by different authors, both in English and in Hungarian. The main source of the e-texts was the Internet. Works not found on the Internet were digitized manually.
Because the texts came from many different sources, they first had to be standardized: paragraphs added to the e-text that are not part of the original work were deleted, footnotes and endnotes were removed, differing typographic conventions were reconciled, the chapters were concatenated, mistakes that occurred during digitization were corrected, and finally the different file formats were converted into text files (with .txt extension). To analyze a text, the introduction of types was plotted first for the original work and then for the model, the artificial text. Each 100-word block was mapped to an integer, the number of types newly introduced in that block.
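The mapping of blocks to counts of newly introduced types can be sketched as follows. The function name `new_types_per_block` is hypothetical, tokens are assumed to be an already standardized list of words, and counting a final block shorter than the block size is a choice made for this sketch.

```python
def new_types_per_block(tokens, block_size=100):
    """Count, for each consecutive block, the types not seen in earlier blocks."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        new_types = set(block) - seen
        counts.append(len(new_types))
        seen |= new_types
    return counts
```

Applying the same function to the original text and to the artificial text yields two series that can be plotted against each other.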
RESULTS
The intervals into which the texts were divided were short enough to show subtle changes in the discourse. Such events occurred when a longish description was inserted into the text, when a new character with a new style (different from the style of the other characters) was introduced, and when foreign expressions or sentences appeared. The results, of course, show not only the changes that come from the logical flow of the story but also those where the text contains parts that are not related to the events. In both cases the monotonic decay in the introduction of types was somewhat reversed. In most cases, these factors produced more significant changes than the beginning of a new chapter.
The model was able to follow the changes that were due to the flow of the story, producing small humps in the otherwise decaying graph. However, traces of the surprising, unrelated events understandably never appeared in the model. By comparing the original text with the related model, we can pinpoint parts which are only loosely related to the story.
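One simple way to make such a comparison concrete is to flag the blocks in which the original text introduces markedly more new types than the model. The sketch below is only an illustration: the abstract does not specify a comparison rule, so the function name `flag_deviating_blocks` and the `threshold` parameter are assumptions.

```python
def flag_deviating_blocks(observed, modelled, threshold=5):
    """Return indices of blocks where observed new types exceed the model.

    `observed` and `modelled` are per-block counts of newly introduced types;
    `threshold` is a hypothetical cut-off chosen by the analyst.
    """
    return [
        index
        for index, (obs, mod) in enumerate(zip(observed, modelled))
        if obs - mod >= threshold
    ]
```

The returned block indices can then be checked against the text to see whether they coincide with inserted descriptions or other loosely related passages.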
It was also found that the length of the texts is of significant importance, which is in accordance with previously reported results (Holmes, 1994). This can explain the fact that concatenated short stories of the same genre did not produce large jumps at the beginning of a new story. In this sense the concatenated short stories behaved like a novel. The difference between concatenated stories and a novel was that the number of newly introduced types was higher in the concatenated text. The concatenated short stories also showed that the introduction of types is not characteristic of an author, since concatenated stories by different authors sometimes showed a higher repetition rate than stories by one author. Similar results were obtained with another method in earlier work (Baayen, 1996b), where it was found that differences in register [genre] may override differences in authorship.
The analysis covered not only literary works but also textbooks written for second language teaching. The currently accepted strategy for choosing the vocabulary of a textbook is to teach at least 1000 new words in each stage of a general course (120-140 hours' work) and, beyond this suggested minimum, as many words as possible (Cunningsworth, 1995). To present this number of new words, the textbooks should also contain a sufficient amount of text to make the courses effective.
The result of the analysis of the selected series shows that these books fail to fulfill this requirement. The introduction of new types and the high number of hapax legomena show that the vocabulary of a textbook in these senses is not different from the vocabulary of concatenated short stories.
SUMMARY
Hungarian and English literary works and English textbooks were analyzed to find regularities in the introduction of word-types in these works. To carry out the study, the texts were divided into short, constant-length intervals, usually of 100 words. Based on the frequency of the word-types in the original text, a model, an artificial text, was created. Comparing the original and the artificial text, we were able to find intervals in the original text that correspond to unpredicted, sometimes illogical events in the discourse of the text. Analyzing the textbooks, we learned that the introduction of word-types in these books resembled that of randomly chosen and then concatenated short stories. It seemed that the authors of the textbooks ignored the fact that not only should the number of word-types be increased, but the words should also be repeated a certain number of times in these books.
The analysis of the given works showed that the introduction of new types in a text depends mainly on the length and genre of the work, and is not a significant characteristic of the author. In further work I would like to analyze series of novels by different authors, to see how they differ from series by a single author. The analysis of other monolingual textbook series is also in progress. The results of this analysis can lead to a change in the planning of textbooks, in which designing the vocabulary will mean not only setting the number of words but also improving the effectiveness of vocabulary teaching.
REFERENCES
1. Baayen R. H. (1993) Statistical Models for Word Frequency Distributions: A Linguistic Evaluation. Computers and the Humanities 26. pp. 347-363.
2. Baayen R. H. (1996a) The Randomness Assumption in Word Frequency Statistics, Research in Humanities Computing 5 Selected Papers from the ACH/ALLC Conference, University of California, Santa Barbara, August 1995 pp.17-31.
3. Baayen R. H. (1996b) The Effect of Lexical Specialization on the Growth Curve of the Vocabulary. Computational Linguistics 22. pp. 455-480.
4. Baayen H., Halteren H., Tweedie F. (1996) Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution. pp. 121-131.
5. Cunningsworth A. (1995) Choosing your Coursebook. Heinemann
6. Hoover D. L. (2003) Another Perspective on Vocabulary Richness. Computers and the Humanities 37: pp. 151-178.
7. Holmes D. (1994) Authorship Attribution. Computers and the Humanities 28. pp. 87-106.
8. Tweedie F. J. and Baayen R. H. (1998) How Variable May a Constant be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32: pp. 323-352.
Hosted at Göteborg University (Gothenburg)
Gothenburg, Sweden
June 11, 2004 - June 16, 2004
105 works by 152 authors indexed
Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/