Center for Survey Research and Methodology (ZUMA) - ZUMA
Integrated Publication and Information Systems Institute - GMD
Pattern concordances - TATOE calls XGrammar
Melina Alexa
Center for Survey Research and Methodology (ZUMA)
alexa@zuma-mannheim.de
Lothar Rostek
GMD - Integrated Publication and Information Systems Institute
rostek@darmstadt.gmd.de
Keywords: search pattern definition, text analysis
Introduction
This paper responds to the suggestions and wishes expressed by those who attended the demonstration of the Text Analysis Tool with Object Encoding (TATOE) at last year's ALLC/ACH96 conference in Bergen, Norway (Alexa and Rostek 1996). Those interested in using TATOE observed that one functionality would crucially support text analysis with this tool: the definition of search patterns which make use of already existing mark up. At that time we could only sketch the first steps we had made towards that goal.
One of the main features of TATOE is the combination of the typical corpus exploration functionalities, e.g. word searches, word frequencies, concordances of selected words, with an on-text mark up functionality. TATOE enables both importing texts which are already marked up and performing on-text mark up once the texts have been imported into the system. Furthermore, in TATOE the user can use one or more categorization schemata for mark up. By this means the analyst may create a very rich "personalized annotation database" with different kinds of information relating to different levels of textual interpretation and description. Having this information available, the user should have the means both to perform very specific searches and to mark up the search results automatically.
What TATOE lacked when it was presented last year was the possibility to define complex search patterns which combine different kinds of information based on the existing mark up, in order to extract frequency of occurrence lists and concordances according to the specific search pattern. Although it was possible in TATOE to search for a particular schema category and obtain all those instances which had been marked up or tagged according to that category, it was not possible to define a search pattern for a combination of categories belonging to a single schema, a combination of categories of different schemata, or a combination of schema categories with text strings. This was an obvious limitation for text analysis: TATOE enables the creation of a rich source of mark up information, and yet analysts were not sufficiently supported in performing more targeted and fine-grained searches.
We have taken last year's comments and suggestions on board and, based on our specific text analysis tasks and requirements, we have now enhanced TATOE with new features for obtaining concordances according to user-defined search patterns.
Pattern Definition
In order to provide concrete examples of the variety of corpus search needs which pattern definitions serve, we use the text type analysis of artist biography texts reported in Alexa (in press). This analysis was performed on a corpus of 88 English texts, with the main aim of empirically identifying those features which are specific to the text type analysed. The corpus had been automatically tagged with part of speech information. It was then imported into TATOE and subsequently marked up semi-automatically, firstly according to a semantic domain information schema, using such categories as artists, works of art, styles, etc., and secondly according to a thematic progression information schema (based on the systemic functional linguistics model for theme analysis (Halliday 1994)). Three categorization schemata had thus been defined according to three different descriptive levels: one for part of speech information, one for specific domain information and one for the thematic progression description of each corpus text.
We have implemented the user-defined search patterns in TATOE as regular expressions: besides terminal elements, the syntactic means for building compound elements are sequence, alternation, iteration and optional structures. Moreover, each search pattern can be given a name, and these names can also be used within search patterns as nonterminal symbols. With named patterns, a set of search patterns thus forms a context-free grammar. The pattern match is a partial parse: once the pattern matching process is initiated, the system looks iteratively for the longest prefix of the text token sequence which belongs to the formal language defined by the grammar.
As terminal elements of a pattern we allow not only word strings but also schema categories, e.g. the string 'he' followed by 'VBD' (a part of speech category: past tense verb) and followed by 'workOfArt' (a semantic domain category). In TATOE this specific pattern can be defined as: 'he' #VBD #workOfArt. A search pattern need not be based on a single categorization schema. On the contrary, it is exactly the combination of elements of different schemata, typically representing different layers of description, which makes search results more selective and enhances analysis.
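The pattern language described in the two preceding paragraphs can be illustrated with a minimal Python sketch. It is not the XGrammar implementation (which is Smalltalk-based): the tuple encoding of patterns, the names `ends`, `concordance`, `work` and `hePaintedWork`, and the toy tokens are all hypothetical. The sketch also assumes that mark up is attached to single tokens, so a multi-word work of art is written as (#workOfArt)+ through a named pattern, and that named patterns are not left-recursive.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, Set[str]]  # (word, categories attached to it by mark up)

def ends(node: tuple, toks: List[Token], i: int, rules: Dict[str, tuple]) -> Set[int]:
    """Return every token position at which `node` can stop when started at i."""
    kind = node[0]
    if kind == "lit":    # terminal: a word string
        return {i + 1} if i < len(toks) and toks[i][0] == node[1] else set()
    if kind == "cat":    # terminal: a schema category
        return {i + 1} if i < len(toks) and node[1] in toks[i][1] else set()
    if kind == "ref":    # nonterminal: the name of another pattern
        return ends(rules[node[1]], toks, i, rules)
    if kind == "seq":    # sequence: thread the reachable positions through each part
        reached = {i}
        for part in node[1:]:
            reached = {e for r in reached for e in ends(part, toks, r, rules)}
        return reached
    if kind == "alt":    # alternation: union of the branches
        return set().union(*(ends(branch, toks, i, rules) for branch in node[1:]))
    if kind == "opt":    # optional structure: skip it or take it
        return {i} | ends(node[1], toks, i, rules)
    if kind == "plus":   # iteration: one or more repetitions
        found: Set[int] = set()
        frontier = {i}
        while frontier:
            frontier = {e for r in frontier for e in ends(node[1], toks, r, rules)} - found
            found |= frontier
        return found
    raise ValueError(f"unknown pattern node: {kind}")

def concordance(name: str, toks: List[Token], rules: Dict[str, tuple]) -> List[Tuple[int, int]]:
    """Scan the text; at each position keep the longest match of the named pattern."""
    hits, i = [], 0
    while i < len(toks):
        stops = ends(rules[name], toks, i, rules)
        if stops and max(stops) > i:
            hits.append((i, max(stops)))
            i = max(stops)      # continue after the match
        else:
            i += 1
    return hits

# Invented toy text with part of speech and semantic domain categories.
toks: List[Token] = [
    ("Later", {"RB"}), ("he", {"PRP"}), ("painted", {"VBD"}),
    ("Blue", {"NNP", "workOfArt"}), ("Harbour", {"NNP", "workOfArt"}), (".", {"."}),
]
rules = {
    "work": ("plus", ("cat", "workOfArt")),
    "hePaintedWork": ("seq", ("lit", "he"), ("cat", "VBD"), ("ref", "work")),
}
hits = concordance("hePaintedWork", toks, rules)
print(hits)  # [(1, 5)]
```

Returning the set of all reachable end positions, rather than a single one, is what lets the scan pick the longest prefix even when an alternation or an optional element offers a shorter match first.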
To give some concrete examples for the text type analysis described briefly above: the analyst may define a search pattern for all prepositions appearing at the beginning of a sentence and followed by either a number or a sequence of proper nouns, in order to determine whether marked theme structures exist and, if so, of what kind. By marked theme structures we mean clauses whose starting point is not the subject, for instance phrases which denote temporal information and appear at the beginning of a clause. The pattern is: '.' #IN (#CD | (#NNP)+), where IN stands for prepositions, CD for cardinal numbers and NNP for proper nouns. The plus symbol stands for a sequence of one or more NNPs. As another example, the analyst may check for all instances in the corpus marked up as marked theme which are followed by an artist's name or by conjunctions or adverbs: (#markedTheme)+ ((#artist)+ | #CC | #RB), where CC stands for conjunctions and RB for adverbs.
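As a rough illustration of what the first pattern retrieves, the sketch below re-expresses '.' #IN (#CD | (#NNP)+) with Python's standard `re` module over a string of part of speech tags. This is a swapped-in technique for illustration only, not how TATOE or XGrammar evaluates patterns. It assumes exactly one tag per token and reads the leading '.' as the full stop that closes the previous sentence; the toy sentences and all variable names are invented.

```python
import re

# Invented toy text: (word, part of speech tag), one tag per token.
tagged = [
    ("He", "PRP"), ("left", "VBD"), (".", "."),
    ("In", "IN"), ("1950", "CD"), ("he", "PRP"), ("returned", "VBD"), (".", "."),
    ("In", "IN"), ("New", "NNP"), ("York", "NNP"), ("he", "PRP"),
    ("exhibited", "VBD"), (".", "."),
]

# Encode the tag sequence as one string, each tag followed by a space,
# and remember which character offset belongs to which token index.
offset_to_index = {}
pieces = []
position = 0
for index, (_, tag) in enumerate(tagged):
    offset_to_index[position] = index
    pieces.append(tag + " ")
    position += len(tag) + 1
offset_to_index[position] = len(tagged)   # sentinel for a match ending at the end
tag_string = "".join(pieces)

# '.' #IN (#CD | (#NNP)+) over the tag string. The lookbehind forces a match
# to start at a tag boundary. The two branches begin with different tags, so
# the ordered alternation of `re` cannot hide a longer match here.
pattern = re.compile(r"(?<![^ ])\. IN (?:CD |(?:NNP )+)")

hits = [(offset_to_index[m.start()], offset_to_index[m.end()])
        for m in pattern.finditer(tag_string)]
phrases = [[word for word, _ in tagged[start:end]] for start, end in hits]

print(hits)     # [(2, 5), (7, 11)]
print(phrases)  # [['.', 'In', '1950'], ['.', 'In', 'New', 'York']]
```

The second pattern in the text mixes categories from different schemata, so a token may carry several categories at once; a flat tag string like this one is no longer enough, and a matcher working on sets of categories per token would be needed.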
One formulates patterns which cover specific text analysis hypotheses. Interestingly, it is often easier to formulate patterns which cover regularities; in that case one is interested in those text positions which do not match the specified pattern. Therefore, allowing for both matching and non-matching pattern concordances is necessary.
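One possible reading of a non-matching concordance is sketched below: given the token spans a pattern matched, the non-matching positions are simply the gaps between them. How TATOE itself delimits non-matching text is not stated here, so the span representation and the function name `non_matching_spans` are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token positions, end exclusive

def non_matching_spans(hits: List[Span], n_tokens: int) -> List[Span]:
    """Return the stretches of text not covered by any matching span."""
    gaps: List[Span] = []
    cursor = 0
    for start, end in sorted(hits):
        if start > cursor:
            gaps.append((cursor, start))   # text between two matches
        cursor = max(cursor, end)
    if cursor < n_tokens:
        gaps.append((cursor, n_tokens))    # text after the last match
    return gaps

# Hypothetical matching spans in a text of 14 tokens.
matching = [(2, 5), (7, 11)]
non_matching = non_matching_spans(matching, 14)
print(non_matching)  # [(0, 2), (5, 7), (11, 14)]
```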
Implementation
To meet the needs specified above, we have integrated TATOE with the XGrammar tool (Rostek et al. 1993), a Smalltalk-based toolkit which contains a general top-down parser for user-specified "grammars" written in a BNF-like language. We have implemented a specific tool for the definition and maintenance of search patterns. To define a pattern, the analyst types in a BNF-like expression and the system produces a syntax graph (a graphical presentation of the pattern) which can be used to check, correct and refine the pattern.
The result of calculating a specified pattern concordance is displayed automatically as a concordance list in the main text pane of TATOE, with the total number of occurrences of the particular pattern shown at the top of the concordance. Once the calculation has been performed, the syntax graph is updated with frequency of occurrence information showing how many occurrences of each pattern element have been found. This provides immediate feedback about the distribution of the respective elements of a pattern.
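The per-element frequency feedback can be mimicked in a small sketch: each branch of the alternation is given a name, and every match increments the counter of the branch it used. The sketch relies on named groups of Python's `re` module rather than on a syntax graph, and the tag string and group names are invented for illustration.

```python
import re
from collections import Counter

# Invented tag sequence, one tag per token, each followed by a space.
tag_string = "PRP VBD . IN CD PRP VBD . IN NNP NNP PRP VBD . IN CD NNS . "

# '.' #IN (#CD | (#NNP)+) with one named group per alternative branch.
pattern = re.compile(r"(?<![^ ])\. IN (?:(?P<CD>CD )|(?P<NNP_seq>(?:NNP )+))")

counts: Counter = Counter()
for match in pattern.finditer(tag_string):
    counts["pattern"] += 1                      # total occurrences of the pattern
    for element, text in match.groupdict().items():
        if text is not None:                    # this branch took part in the match
            counts[element] += 1

print(dict(counts))  # {'pattern': 3, 'CD': 2, 'NNP_seq': 1}
```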
Before performing the pattern search, the user can choose whether the matching or the non-matching pattern concordance should be displayed. In either case, both results are stored in a history of occurrence lists, so that they can be inspected later without starting a new calculation. All stored occurrence lists can be further combined with Boolean operators to produce new pattern concordances.
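A history of occurrence lists and their Boolean combination can be sketched with plain Python sets. The sketch assumes that an occurrence list is a set of sentence numbers and that a search is a simple test on a sentence's tags; both are simplifications, and the names `history`, `run_search` and the toy data are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

Sentence = List[str]  # part of speech tags of one sentence

# Invented toy corpus of four tagged sentences.
sentences: List[Sentence] = [
    ["IN", "CD", "PRP", "VBD"],
    ["PRP", "VBD", "NNP"],
    ["IN", "NNP", "NNP", "PRP", "VBD"],
    ["PRP", "VBD", "IN", "CD"],
]

history: Dict[str, FrozenSet[int]] = {}

def run_search(name: str, test: Callable[[Sentence], bool]) -> None:
    """Store both the matching and the non-matching occurrence list."""
    matching = frozenset(i for i, s in enumerate(sentences) if test(s))
    history[name] = matching
    history["not " + name] = frozenset(range(len(sentences))) - matching

run_search("prep-initial", lambda s: s[:1] == ["IN"])
run_search("has-number", lambda s: "CD" in s)

# New concordances from stored lists, with no new search over the text.
both = history["prep-initial"] & history["has-number"]            # AND
either = history["prep-initial"] | history["has-number"]          # OR
prep_without_number = history["prep-initial"] - history["has-number"]  # AND NOT

print(sorted(both), sorted(either), sorted(prep_without_number))
# [0] [0, 2, 3] [2]
```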
Conclusion
We have presented how we have specified and implemented an additional functionality for the Text Analysis Tool with Object Encoding for defining pattern concordances. The integration of the XGrammar tool with TATOE provides flexible mechanisms for defining complex search patterns and enables the exploitation of different layers of mark up. Although in general we are fairly satisfied with the system speed when performing such searches, we think that this process can be further optimized and we intend to improve this in the future.
References
Alexa, Melina (in press): Text type analysis with TATOE. In Storrer, A. and B. Harriehausen (eds)(in press): Hypermedia fuer Lexikon und Grammatik. Tuebingen, Narr Verlag.
Alexa, Melina and Lothar Rostek (1996): Computer-assisted corpus-based text analysis with TATOE. Presented at ALLC/ACH96, Bergen, Norway. Abstracts, pp. 11-17.
Halliday, M.A.K. (1994): An Introduction to functional grammar (second edition). Edward Arnold, London.
Rostek, Lothar and Wiebke Moehr and Dietrich Fischer (1993): Weaving a web: the structure and creation of an object network representing an electronic reference work. In Electronic Publishing, vol 6(4), pp. 495-505.
Hosted at Queen's University
Kingston, Ontario, Canada
June 3, 1997 - June 7, 1997
Conference website: https://web.archive.org/web/20010105065100/http://www.cs.queensu.ca/achallc97/