A papaga'lyokrol: Analysing rare terms in a large literary database

  Paul A. Fortier

    University of Manitoba

At the Kingston ALLC-ACH Conference, William Winder stated that although interesting in the short story "Un Coeur Simple" by Flaubert, the theme of parrots was too rare in French literature to be analysable by computer. In private conversation, I challenged this judgment, and agreed with Professor Winder that it would be interesting to see what could indeed be done with such a rarely occurring vocabulary item. This session is the result of this agreement.
An opening paper (Olsen) sets the context, both theoretical and practical, of the use of large databases to analyse literature from the perspective of the history of mentalities. The second paper (Clark) adapts Firthian analysis methods to the rare term of parrot(s). The third paper (Keen, Fortier) applies statistical techniques, notably control charts, to the problem. In a final brief paper, Professor Winder evaluates the success of the methods used from his perspective as a literature scholar with intimate knowledge of computer techniques. (Note: It is not certain that Professor Winder will be able to attend the conference in Debrecen; if he is not able to attend, his evaluation will be read for him by the session organiser.)

Practical and Theoretical Aspects of Large Databases
Mark Olsen
ARTFL Project, University of Chicago
E-MAIL : mark@barkov.uchicago.edu
1. Introduction
Several years ago, I engaged in a theoretical debate concerning the potential uses of electronic text in research and argued that the dominance of traditional paradigms directing Computer-aided Literature Studies had failed to produce a body of results or engage debates then (and still) raging over the nature of textuality. I stand by this general position, as stated in the abstract of that paper:
Computer-aided Literature Studies have failed to have a significant impact on the field as a whole. This failure is traced to a concentration on how a text achieves its literary effect by the examination of subtle semantic or grammatical structures in single texts or the works of individual authors. Computer systems have proven to be very poorly suited to such refined analysis of complex language. Adopting such traditional objects of study has tended to discourage researchers from using the tool to ask questions to which it is better adapted, the examination of large amounts of simple linguistic features. Theoreticians such as Barthes, Foucault and Halliday show the importance of determining the linguistic and semantic characteristics of the language used by the author and her/his audience. Current technology, and databases like the TLG or ARTFL, facilitate such wide-spectrum analyses. Computer-aided methods are thus capable of opening up new areas of study, which can potentially transform the way in which literature is studied [1].
The first half of my talk will address this general position and respond to some of the criticisms which have appeared in subsequent years. I hope that this relatively brief discussion will help provide a framework for the following papers. The second half of my proposed talk, however, will address methodological issues arising from analysis of large textual databases by focusing on three areas of more recent work we are doing.
2. Dynamic Linking of Full Text to Historical Reference Materials
In addition to expanding the main ARTFL database in areas of weakness, we are beginning a collaborative project to add many more texts by women writers and are expanding our holdings of earlier materials through collaborations with several institutions, most notably P. Kunstmann's team at the University of Ottawa. ARTFL has undertaken a series of projects to build collections of important reference materials in a variety of formats which are being dynamically linked to the main database. These materials include multiple editions of the Dictionnaire de l'Academie francaise (a collaboration with R. Wooldridge's team), Diderot's Encyclopedie and other materials. At this time, we have built simple mechanisms to move quickly from full texts in the main ARTFL databases to chronological searching of headwords across these reference material databases (currently in production). As we assemble more historical reference materials, we expect that this will form a powerful adjunct to the tracing of long-term changes in word use and meaning that the ARTFL database provides.
3. Dynamic Computational Linguistics Applications
ARTFL has had a long-standing relationship with the MultiLingual Theory and Technology team of Rank Xerox Research Centre Grenoble Laboratories. We are currently experimenting with the integration of "on-the-fly" text taggers to be used in several different ways. These include refinements of stylistic analysis. Our current target research applications are quantitative measures of changes in adverbial use, broad comparisons between different genres of text, and (using light bracketing) examination of active/passive use of verbs in reference to the gender of actors. We also expect to use dynamic tagging to segregate homographs and possibly to combine word and part-of-speech information as part of queries, e.g. find sentences where cour is a noun followed by any adjective. Finally, we have run some initial experiments on tagging large portions of the ARTFL database and writing search engines to exploit this information. The project requires considerable rethinking of the technical notion of a word object within our systems.
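The combined word and part-of-speech query mentioned above can be sketched as follows. This is only an illustration and not the ARTFL or Rank Xerox implementation: the function name find_noun_adjective, the representation of a sentence as a list of (word, tag) pairs, and the tag labels NOUN and ADJ are all assumptions made for the example.

```python
def find_noun_adjective(sentences, noun="cour", noun_tag="NOUN", adj_tag="ADJ"):
    """Return the sentences in which `noun`, tagged as a noun,
    is immediately followed by a token tagged as an adjective.

    Each sentence is a list of (word, tag) pairs; the tag names are
    hypothetical and would depend on the tagger actually used.
    """
    hits = []
    for sentence in sentences:
        # Walk over adjacent token pairs: (current token, next token).
        for (word, tag), (_, next_tag) in zip(sentence, sentence[1:]):
            if word.lower() == noun and tag == noun_tag and next_tag == adj_tag:
                hits.append(sentence)
                break  # one match is enough to keep the sentence
    return hits


# Toy tagged input, invented for the example.
tagged = [
    [("la", "DET"), ("cour", "NOUN"), ("royale", "ADJ")],
    [("la", "DET"), ("cour", "NOUN"), ("de", "ADP"), ("France", "PROPN")],
    [("il", "PRON"), ("fait", "VERB"), ("sa", "DET"), ("cour", "NOUN")],
]
matches = find_noun_adjective(tagged)
```

Because the test is applied to the tag as well as the word form, the same check would also keep a homograph with a different tag out of the results.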
4. Statistical Techniques: Frequencies and Beyond
A standard function of ARTFL for many years has been rapid generation of word frequencies by user-defined subcorpora, such as periods, genres, authors, etc. As a very rough indication of word use and evolution, this is a reasonably helpful measure. In my own work, I have developed systems to examine word collocation using standard statistical measures of the degree to which a "pole word" is related to words in its immediate context (Z-score measures, which simply give the number of standard deviations by which the actual distribution departs from an expected random distribution). In studies over the past couple of years I have experimented with different spans of context, including both linguistic elements of span definition (phrase, sentence, paragraph) and arbitrary spans of numbers of words (or characters). While the Z-score measure of relatedness helps filter out unimportant collocations, I find that it is biased towards low-frequency terms. Further, span definition appears to be more important in forming collocations which appear meaningful than variations in the statistical techniques used to identify them. I will indicate that short spans, at phrase level or merely two non-function words on each side, are the best indication of meaning. Finally, one of the proposed applications of dynamic computational linguistic techniques described above, "light bracketing" or partial parsing, may be a better mechanism to determine changing word meanings. Collocation is a statistical mechanism that does not take into account actual linguistic function (the assumption, for statistical purposes, that a text is a random distribution of terms clearly flies in the face of all linguistic rules). Identification of actual multi-word sequences, such as noun phrases or verb phrases, may be of considerably greater value in establishing changing word meanings than collocation.
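A minimal sketch of a Z-score collocation measure of the kind described above may help fix ideas. The abstract does not give the exact formula used at ARTFL, so the binomial approximation below is an assumption, as are the function name collocation_z_scores, the fixed span of words on each side of the pole word, and the toy token list.

```python
import math
from collections import Counter


def collocation_z_scores(tokens, pole, span=2):
    """Return {collocate: z} for words within `span` tokens of `pole`.

    Assumed model (not taken from the source): each slot in the windows
    around the pole word is filled at random, so a collocate with
    corpus probability p is expected p * slots times.
    """
    total = len(tokens)
    freq = Counter(tokens)
    pole_freq = freq[pole]
    if pole_freq == 0:
        return {}

    observed = Counter()
    slots = 0  # number of non-pole window positions actually examined
    for i, tok in enumerate(tokens):
        if tok != pole:
            continue
        for j in range(max(0, i - span), min(total, i + span + 1)):
            if j != i and tokens[j] != pole:
                observed[tokens[j]] += 1
                slots += 1

    scores = {}
    for word, k in observed.items():
        p = freq[word] / (total - pole_freq)  # chance of `word` in any slot
        expected = p * slots
        variance = expected * (1 - p)
        if variance > 0:
            scores[word] = (k - expected) / math.sqrt(variance)
    return scores


# Toy corpus, invented for the example.
toy = "le perroquet vert parle et le chat dort et le perroquet vert mange".split()
scores = collocation_z_scores(toy, "perroquet", span=1)
```

In this toy corpus "le" and "vert" each occur twice next to "perroquet", yet "vert" receives the higher score because it is rarer overall, which is a small-scale instance of the bias towards low-frequency terms noted above.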
5. Conclusion
Theoretical considerations should drive research in text-oriented disciplines as much as in any other area of the human sciences. It is even more evident that such considerations influence the development of computer systems to access and analyze text. The computer programmer and the literary scholar both begin by framing a set of questions and a set of methods/techniques/systems to answer such questions. I hope to indicate, in this talk, that my work at ARTFL is informed by a theoretical perspective that guides my own research efforts and the implementation of systems used by many scholars of French literature in North America.
1. Computers and the Humanities vol. 27 nos. 5-6 (1993); http://humanities.uchicago.edu/homes/mark/Signs.html

Disambiguating perroquet in the roman : Modernizing Firthian principles with computational tools
G. Aileen Clark
University of Ottawa
The idea of disambiguation is not a new one. However, computational tools of analysis have enabled modern researchers to resurrect once disregarded approaches, such as collocational analysis. In the early 1950s, the British semantician J.R. Firth proposed a theory of collocational study. In his work Speech, he suggested that by analyzing the distribution surrounding a given ambiguous word, one could then predict the environment that would connote one meaning versus another. Until recently, attempting this type of analysis seemed unthinkable. One hesitated because finding an exhaustive list of a word's occurrences would be nearly impossible. Thanks to computer databases such as ARTFL, we are now able to request a search for a word to analyze the distribution through collocation as Firth intended.
This paper shows how we can modernize our approach to linguistic disambiguation by using Web databases such as ARTFL to collect and sort data. It presents conclusions derived from a collocational analysis of perroquet, followed by a commentary on the significance of these findings as they apply to the study of exotic terminology within the history of the novel.
The term perroquet first appeared in French literature during the seventeenth century. Used to colour and beautify literary language, the term 'perroquet' belongs first to exotic vocabulary. Upon closer analysis, one also finds the term perroquet used in the literal sense, to mean "the bird" itself. The semantic duality of perroquet justifies further study of this term, as it appears in 17th century literature, but also as it manifests itself in modern and contemporary literature. The first point of interest lies in predicting which of the two meanings is connoted within a given context of perroquet. By studying the contextual environment which surrounds an ambiguous term like perroquet, the reader of the modern era can transcend temporal gaps, ultimately grasping the meaning of a text fragment. This fosters a better understanding of the work as a whole.
This study's main objective is to analyze the term perroquet within its distribution in order to disambiguate its meaning. By querying ARTFL in all types of documents, we found 771 occurrences. Given the nature of a collocative study, which relies on syntactic and semantic distribution, certain genres would risk skewing the results. This is the case with poetry, which arbitrarily chooses the distributional context that surrounds a given word. It is therefore appropriate to restrict the corpus to occurrences that appear in the roman. This restricted query of the ARTFL database found 482 occurrences of the word perroquet, all of which formed the corpus for study.
The second part of the research classifies the occurrences in such a way as to regroup occurrences with similar semantic or syntactic features. This part of the research involves creating a semantic classifier to significantly reduce the time spent on classifying the occurrences. For perroquet, semantic similarities within a 20-word distribution are identified. This allows prediction of the types of context that connote one function of perroquet rather than the other. This work builds on an earlier study of the term bienseance in 17th century France, and offers another example of a study which successfully uses the collocational approach through syntactic distributional similarities.
The first main classification identifies all the literal uses of perroquet. Perroquet, in collocation with one or more verbs with the semes (+perroquet, +action), connotes the literal meaning of "the bird." Using the same method of classification shows that exotic uses of perroquet show a predominant type of collocational relationship: the term perroquet is preceded by a comparison connector such as comme or ainsi que.
Thus, by using the context, one can predict the meaning of the word perroquet. Computers in collocational text analysis foster a method of semantic disambiguation which establishes contextual patterns. These in turn determine which meaning of perroquet is produced. By allowing the researcher to remain within the text, to read and analyze using nothing but the text itself, this method of disambiguating terms is a useful tool for linguistic analysis.
Having looked at the purely linguistic aspect of this study, one is led to question the practicality of this type of analysis. What is gained from knowing when perroquet is used in the exotic meaning instead of the literal meaning? The data acquired from the linguistic study of perroquet further literary analysis of the term as it appears within a certain genre (i.e., the roman). They are valuable from a literary study point of view, as they provide a basis from which to analyze the uses of exotic vocabulary in the novel, both modern and classical. Preliminary analysis suggests that the exotic uses of the term perroquet decrease over time in the novel. This suggests that, as society opened its borders and expanded its horizons, literature incorporated once exotic terminology as part of everyday vocabulary.
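The two collocational rules reported above (a preceding comparison connector for the exotic use, a co-occurring action verb for the literal use) can be sketched as a small rule-based classifier. This is not the semantic classifier built for the study. The verb list ACTION_VERBS is a hypothetical stand-in for verbs carrying the semes (+perroquet, +action), the three-word lookback for connectors is an assumption, and the 20-word distribution is interpreted here, also by assumption, as ten words on each side of the term.

```python
import re

# Comparison connectors named in the text, as token sequences.
COMPARISON_CONNECTORS = [("comme",), ("ainsi", "que")]
# Hypothetical verb forms standing in for the semes (+perroquet, +action).
ACTION_VERBS = {"vole", "crie", "mange", "siffle"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def classify(context, lookback=3, window=10):
    """Label the first occurrence of perroquet(s) in `context`."""
    tokens = tokenize(context)
    positions = [i for i, t in enumerate(tokens) if t in ("perroquet", "perroquets")]
    if not positions:
        return "no occurrence"
    i = positions[0]

    # Rule 1: a comparison connector shortly before the term -> exotic use.
    before = tokens[max(0, i - lookback):i]
    for connector in COMPARISON_CONNECTORS:
        n = len(connector)
        for k in range(len(before) - n + 1):
            if tuple(before[k:k + n]) == connector:
                return "exotic"

    # Rule 2: an action verb within the window -> literal use ("the bird").
    nearby = tokens[max(0, i - window):i + window + 1]
    if ACTION_VERBS & set(nearby):
        return "literal"
    return "unclassified"


# Invented example sentences, not drawn from the ARTFL corpus.
label_simile = classify("Il répétait sa leçon comme un perroquet.")
label_bird = classify("Le perroquet vole vers la fenêtre.")
```

Checking the connector rule first reflects only the order in which the rules are stated here; occurrences matching neither rule are left unclassified for manual reading rather than forced into a category.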
Patterns in a Rare Term: The Case of Parrots
Kevin J. Keen, Paul A. Fortier
University of Manitoba
Advances in computer technology, particularly in the area of graphics, mean that the person applying statistical techniques is no longer limited to testing hypotheses, which all too frequently simplify the underlying literary reality beyond recognition in order to achieve elegance in the statement of a statistical hypothesis. Software such as Minitab and JMP-IN facilitates the use of statistics as an exploratory tool, in a manner much more congenial to the aims and interests of scholars of literature.
The French noun "perroquet" (parrot) and its plural certainly qualify as a rare term, given that they appear a total of 771 times in the 114,521,745 words of the ARTFL database. This tempts one to extend the coverage to the extent possible by adding the term "perruche" (parakeet). It can be quickly demonstrated that the distribution pattern of "perruche" is statistically independent from that of "perroquet". In other words, such an extension of the semantic field is not justified.
In dividing up the period covered by the holdings of the ARTFL database (1600-1964), it quickly becomes apparent that simply taking equal-sized slices of the temporal continuum does not produce a good fit with the periodicity of French history. In any case, an equal number of words or even of texts is not found in equal temporal slices. Since relative frequencies are thus required, it makes sense to organise the data in terms of important periods of French history: changes of reign in the pre-revolutionary period, changes in regime after it.
Once relative frequencies of occurrence of the word perroquet(s) have been determined for each period, as well as the relative number of texts in each period which use the word at all, analysis can proceed. The tool chosen was control charts, a technique developed for industrial quality control, and now widely available in commercial statistical software. The underlying distributional model is the Poisson distribution (even though parrots do not eat fish), certainly the most appropriate for such a rare term.
Application of the model to the data reveals three main high points in the use of the word perroquet(s). First is the period from the death of Louis XIV to the fall of Napoleon, which corresponds with a period of French dominance in Europe and colonial rivalry with England. A second series of high points extends from 1850 to 1879, again a period of colonial expansion, following the turmoil attendant on the fall of Napoleon and a series of revolts and revolutions aimed at working out a new socio-political system in France. Less easily understood is the third peak during the years 1908-1926, covering as it does the end of the "Belle Epoque", the First World War, and the first half of the roaring twenties.
Although not able to explain all the patterns in the data, this exploratory analysis does indicate the influence of political and societal reality on the language of literature. Even the results that do not seem immediately to correspond to this reality are valuable because they foreground the area most worthy of further study. In any case, the usefulness of control charts elaborated in terms of the Poisson distribution for the analysis of rare terms in a large database is demonstrated by our results.
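As an illustration of a Poisson-based control chart for relative frequencies, the sketch below computes the limits of a u-chart, the standard chart for counts observed in samples of unequal size. The abstract does not say which chart type or settings were used, so the choice of a u-chart with three-sigma limits is an assumption, as is the function name u_chart. The counts and period sizes are invented solely to exercise the code and are not ARTFL figures.

```python
import math


def u_chart(counts, sizes, sigmas=3.0):
    """Return (center, rows) for a Poisson u-chart.

    counts: occurrences of the term in each period
    sizes:  size of each period's subcorpus (e.g. millions of words)
    Each row is (rate, lower_limit, upper_limit, signal).
    """
    center = sum(counts) / sum(sizes)  # pooled rate over all periods
    rows = []
    for c, n in zip(counts, sizes):
        # Under a Poisson model the rate c/n has variance center / n.
        half_width = sigmas * math.sqrt(center / n)
        lower = max(0.0, center - half_width)
        upper = center + half_width
        rate = c / n
        if rate > upper:
            signal = "high"
        elif rate < lower:
            signal = "low"
        else:
            signal = "in control"
        rows.append((rate, lower, upper, signal))
    return center, rows


# Invented counts and subcorpus sizes for three hypothetical periods.
center, rows = u_chart(counts=[10, 30, 14], sizes=[2.0, 1.0, 3.0])
signals = [row[3] for row in rows]
```

Because the limits widen as a period's subcorpus shrinks, a period with few words must show a proportionally larger rate before it is flagged, which is why the chart is suited to temporal slices of unequal size.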
Parrots Revisited
William Winder
This paper will evaluate the three preceding papers in the context
of the methods and focus of traditional, non-computational Flaubertian and 19th century literary scholarship, and
of the new orientations of computational criticism which are rapidly leading us to a "neo-wissenschaft" period in literary scholarship.

