Practical and Theoretical Aspects of Large Databases

Mark Olsen

Authorship

1. Mark Olsen

ARTFL Project - University of Chicago

Parent session

A papaga'lyokrol: Analysing rare terms in a large literary database , Paul A. Fortier

Original URL

http://web.archive.org/web/19980716093102/http://lingua.arts.klte.hu/allcach98/abst/abs14.htm

Work text


1. Introduction
Several years ago, I engaged in a theoretical debate concerning the potential uses of electronic text in research. I argued that Computer-aided Literature Studies, dominated by traditional paradigms, had failed to produce a body of results or to engage with the debates then (and still) raging over the nature of textuality. I stand by this general position, as stated in the abstract of that paper:
Computer-aided Literature Studies have failed to have a significant impact on the field as a whole. This failure is traced to a concentration on how a text achieves its literary effect by the examination of subtle semantic or grammatical structures in single texts or the works of individual authors. Computer systems have proven to be very poorly suited to such refined analysis of complex language. Adopting such traditional objects of study has tended to discourage researchers from using the tool to ask questions to which it is better adapted, the examination of large amounts of simple linguistic features. Theoreticians such as Barthes, Foucault and Halliday show the importance of determining the linguistic and semantic characteristics of the language used by the author and her/his audience. Current technology, and databases like the TLG or ARTFL, facilitate such wide-spectrum analyses. Computer-aided methods are thus capable of opening up new areas of study, which can potentially transform the way in which literature is studied [1].
The first half of my talk will address this general position and respond to some of the criticisms which have appeared in subsequent years. I hope that this relatively brief discussion will help provide a framework for the following papers. The second half of my proposed talk, however, will address methodological issues arising from analysis of large textual databases by focusing on three areas of more recent work we are doing.
2. Dynamic Linking of Full Text to Historical Reference Materials
In addition to expanding the main ARTFL database in areas of weakness, we are beginning a collaborative project to add many more texts by women writers and are expanding our holdings of earlier materials through collaborations with several institutions, most notably P. Kunstmann's team at the University of Ottawa. ARTFL has also undertaken a series of projects to build collections of important reference materials in a variety of formats, which are being dynamically linked to the main database. These materials include multiple editions of the Dictionnaire de l'Academie francaise (a collaboration with R. Wooldridge's team), Diderot's Encyclopedie and other materials. At this time, we have built simple mechanisms to move quickly from full texts in the main ARTFL databases to chronological searching of headwords across these reference material databases (currently in production). As we assemble more historical reference materials, we expect that this will form a powerful adjunct to the tracing of long-term changes in word use and meaning that the ARTFL database provides.
3. Dynamic Computational Linguistics Applications
ARTFL has had a long-standing relationship with the MultiLingual Theory and Technology team of Rank Xerox Research Centre Grenoble Laboratories. We are currently experimenting with the integration of "on-the-fly" text taggers, to be used in several different ways. These include refinements of stylistic analysis. Our current target research applications are quantitative measures of changes in adverbial use or broad comparisons between different genres of text and (using light bracketing) examination of the active/passive use of verbs in relation to the gender of actors. We also expect to use dynamic tagging to separate homographs and possibly to combine word and part-of-speech information as part of queries, e.g. find sentences where cour is a noun followed by any adjective. Finally, we have run some initial experiments on tagging large portions of the ARTFL database and writing search engines to exploit this information. The project requires considerable rethinking of the technical notion of a word object within our systems.
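A query that combines word and part-of-speech information, such as the cour example above, can be sketched as follows. This is only an illustration under stated assumptions: the tag names (NOUN, ADJ, DET, PREP, PROPN), the representation of a sentence as a list of (word, tag) pairs, and the function name find_tagged_bigram are all hypothetical, and nothing here reflects the actual ARTFL or Xerox tagger interfaces.

```python
def find_tagged_bigram(sentences, word, word_tag, next_tag):
    """Return sentences in which `word`, tagged `word_tag`, is
    immediately followed by any token tagged `next_tag`.

    Each sentence is a list of (word, tag) pairs. The tag set is a
    hypothetical one chosen for this sketch.
    """
    hits = []
    for sentence in sentences:
        # Walk over adjacent token pairs in the sentence.
        for (w, tag), (_, following_tag) in zip(sentence, sentence[1:]):
            if w.lower() == word and tag == word_tag and following_tag == next_tag:
                hits.append(sentence)
                break  # one match is enough to report the sentence
    return hits


# Hand-tagged sample sentences (invented for illustration).
sample = [
    [("la", "DET"), ("cour", "NOUN"), ("royale", "ADJ")],
    [("la", "DET"), ("cour", "NOUN"), ("de", "PREP"), ("France", "PROPN")],
]

for sentence in find_tagged_bigram(sample, "cour", "NOUN", "ADJ"):
    print(" ".join(w for w, _ in sentence))
```

Storing the tag alongside each token, rather than the bare word form, is one simple way to picture the rethinking of the "word object" that such queries require.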
4. Statistical Techniques: Frequencies and Beyond
A standard function of ARTFL for many years has been the rapid generation of word frequencies by user-defined subcorpora, such as periods, genres, authors, etc. As a very rough indication of word use and evolution, this is a reasonably helpful measure. In my own work, I have developed systems to examine word collocation using standard statistical measures of the degree to which a "pole word" is related to words in its immediate context (a Z-score measure, which simply gives the number of standard deviations by which the actual distribution departs from an expected random distribution). In studies over the past couple of years I have experimented with different spans of context, including both linguistically defined spans (phrase, sentence, paragraph) and arbitrary spans of a given number of words (or characters). While the Z-score measure of relatedness helps filter out unimportant collocations, I find that it is biased toward low-frequency terms. Further, span definition appears to be more important in forming collocations which appear meaningful than variations in the statistical techniques used to identify them. I will indicate that short spans, at phrase level or merely two non-function words on each side, are the best indication of meaning. Finally, one of the proposed applications of dynamic computational linguistic techniques described above, "light bracketing" or partial parsing, may be a better mechanism to determine changing word meanings. Collocation is a statistical mechanism that does not take into account actual linguistic function (the assumption, for statistical purposes, that a text is a random distribution of terms clearly flies in the face of all linguistic rules). Identification of actual multi-word sequences, such as noun phrases or verb phrases, may be of considerably greater value in establishing changing word meanings than collocation.
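The Z-score measure for a pole word can be sketched as follows. The abstract does not give the exact formula used at ARTFL, so this minimal sketch assumes one common formulation, a normal approximation to a binomial: if the collocate occurs with probability p among the non-pole tokens and the windows around the pole word contain S slots in total, the expected count is E = p * S and z = (K - E) / sqrt(E * (1 - p)), where K is the observed count. The function name, the stopword handling and the sample tokens are illustrative assumptions.

```python
import math
from collections import Counter


def collocation_z_scores(tokens, pole, span=2, stopwords=frozenset()):
    """Return {collocate: z} for words found within `span` non-function
    words on each side of `pole`.

    Assumed formulation (not taken from the source): a normal
    approximation to a binomial, treating the text as a random
    distribution of terms.
    """
    # Drop function words first, so the span counts content words only.
    content = [t for t in tokens if t not in stopwords]
    freq = Counter(content)
    pole_freq = freq[pole]
    if pole_freq == 0:
        return {}

    observed = Counter()
    window_slots = 0  # total number of context positions examined
    for i, tok in enumerate(content):
        if tok != pole:
            continue
        window = content[max(0, i - span):i] + content[i + 1:i + 1 + span]
        window_slots += len(window)
        observed.update(w for w in window if w != pole)

    non_pole_tokens = len(content) - pole_freq
    scores = {}
    for word, k in observed.items():
        p = freq[word] / non_pole_tokens   # chance of `word` in any one slot
        expected = p * window_slots
        variance = expected * (1 - p)
        if variance > 0:                   # skip degenerate cases
            scores[word] = (k - expected) / math.sqrt(variance)
    return scores


# Invented sample: "royale" follows "cour" both times it appears.
sample_tokens = ["cour", "royale", "a", "b", "c", "d",
                 "cour", "royale", "e", "f"]
for word, z in sorted(collocation_z_scores(sample_tokens, "cour", span=1).items()):
    print(word, round(z, 3))
```

Changing span, or passing a stopword set so that only non-function words count toward the span, gives a quick way to compare the span definitions discussed above. Because the expected count E shrinks with the collocate's overall frequency, a single co-occurrence of a rare word can already yield a sizeable score, which is the low-frequency bias noted in the text.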
5. Conclusion
Theoretical considerations should drive research in text-oriented disciplines as much as in any other area of the human sciences. It is even more evident that such considerations influence the development of computer systems to access and analyze text. The computer programmer and the literary scholar both begin by framing a set of questions and a set of methods/techniques/systems to answer such questions. I hope to indicate, in this talk, that my work at ARTFL is informed by a theoretical perspective that guides both my own research efforts and the implementation of systems used by many scholars of French literature in North America.
Reference
1. Computers and the Humanities, vol. 27, nos. 5-6 (1993); http://humanities.uchicago.edu/homes/mark/Signs.html

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/

References: http://web.archive.org/web/19990225164509/http://lingua.arts.klte.hu/allcach98/abst/jegyzek.htm

Attendance: ~60 (https://web.archive.org/web/19990128030244/http://lingua.arts.klte.hu/allcach98/listpar3.htm)

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
