Categorisation techniques in computer assisted reading and analysis of texts (CARAT) in the humanities

  1. Jean-Frédéric de Pasquale

    Laboratoire d'ANalyse Cognitive de l'Information (LANCI) - Université du Québec à Montréal (UQAM)

  2. Jean-Guy Meunier

    Université du Québec à Montréal (UQAM)

1.- Mathematical classification and categorisation strategies

There are two important recurring strategies in computer assisted reading and analysis of text (CARAT). The first relates to the classification process, which, through various clustering techniques, must discover classes of segments on the basis of some similarity criterion. This is typical in lexical, semantic, narrative, thematic or stylistic analysis. The second strategy pertains to categorisation in the information retrieval sense (not the cognitive one): the attribution of tags from a finite set of tags to each segment, sentence, or word of the whole text. These tags are used as descriptors for some aspect of the content. They may be morphological (e.g. Masc., Fem.) or syntactical (e.g. Noun, Verb), but they may also be semantic. For instance, semantic tags may define the individual senses of words (Paliouras, Karkaletsis and Spyropoulos 1999, Rastier 1994) by relating them to some conceptual, notional or ontological category such as "HUMAN", "MATERIAL OBJECT", "ETHICAL SUBJECT", etc. These two often interrelated operations have regularly been recognized as essential components of text analysis (Beaugrande 1980, Landow & Delany 1993, Jansen 1992, Hearst 1994, Hayes 1979, Barrett 1989, Rastier 1994, Robert & Bouillaguet 1997). It is through these two main operations that content analysis and interpretation of texts are usually performed. Although some of these operations can be computer assisted when they belong to the basic grammatical level (lemmatisation, morphological tagging, syntactic tagging), computer assistance is seldom found at the more complex semantic and logical levels. This is why systems such as NUDIST, ATLAS, etc. are so welcome (Alexa & Zuell 1999a, 1999b). These systems assist and manage the manual classification and categorisation process. But even so, these two operations are highly time consuming.
For relatively small corpora, such manual operations may be possible, but for large and complex philosophical or literary text corpora, or even a large corpus of psychological interviews, the process is labour intensive and becomes practically unrealizable. A possible solution to this problem calls upon more inductive or bottom-up strategies that are numerical and statistical. These classification and categorisation techniques are used in the information retrieval field and in what is increasingly called text mining (Hearst 1994, 1999). By comparison, these techniques are fast, easy to use and entirely or almost entirely automatic. The classification techniques are usually realized through various clustering strategies such as factor analysis, k-means, principal component analysis, etc. (Bouroche and Saporta 1980). The categorisation techniques are realized through neural nets (Wermter, Panchev and Arevian 1999), k-NN, linear regression (Yang and Liu 1999), decision trees (Lewis and Ringuette 1994), genetic algorithms (Tauritz, Kok & Sprinkhuizen-Kuyper 2000), etc. Both types of strategies may be combined, and both are known to have achieved significant success. Categorisation algorithms in recent research (Sebastiani 1999) may even score more than 80% on the breakeven point scale. But these techniques have not often been applied to humanities texts. Most of the time, the categorisation algorithms are used with simple and easy-to-process corpora (such as the standard test corpora, the different Reuters corpora); humanities texts, and even more so philosophical or literary texts or psychological interviews, need finer discriminations.
Our research aims to answer the following question: can these text classification and categorisation techniques be applied successfully to the reading and analysis of texts in the humanities and social sciences? A positive answer would allow important methodological innovations for computer text analysis as practised in these fields, because machine learning algorithms allow readers to build their own categories without an explicit theory of necessary and sufficient conditions for belonging to the categories. Some researchers (Hearst 1999) think that these text mining tools should be used as new scientific tools, just as microscopes or telescopes were. For the moment, we think more modestly that these methods have to be explored more systematically on large and complex corpora before we can pronounce on their strengths and weaknesses.
In our own research, we are exploring a few of these techniques and their combinations. We now know, through our own past research and the work of others, that the classification methods allow a good empirical thematic exploration of a corpus (Meunier, Remaki, Forest, 1999; Memmi, Meunier, Gabi, 1998) and may be used in the hypertextualisation of a corpus (Nault, Rialle, Meunier, 1999). More specifically, in this paper we shall concentrate mainly on the problem of assisting the automatic categorization of small segments of a philosophical text into a set of thematic categories. The main goal of this experiment is to provide a "proof of concept": is the idea of using these information retrieval tools in content analysis viable? More work must be done before we can give a definitive answer, but this experiment can give a general idea of the possibilities and limits of the current tools, the perceptron being one of the best of them.

2.- Methodology

Because of the particularly complex nature of humanities texts, our methodology comprises six main steps. In the first step, the text is filtered: all functional and subjectively non-pertinent words may be eliminated from the text, either manually or automatically. In the present experiment, for simplicity of evaluation, we skipped this step. In the second step, a set of categories or tags is chosen. This set of tags is the working hypothesis of the expert reader; the tags are usually drawn from the a priori knowledge that the expert has of the corpus. In the third step, the original text is automatically transformed into a matrix, using the Vector Space Model (Salton and McGill, 1983; Manning and Schütze, 1999). Here, each segment is seen as a binary vector, and each element of the vector represents the absence or presence of a specific word. The fourth step is the training step. Here, as usual with these algorithms, the expert reader manually tags a sample set of segments. Then a neural net "learns" what "counts" as typical exemplars of a particular tag. Technically, this learning is realized by defining a partition of the vector space by a hyperplane, using linear regression. In the fifth step, the neural net takes on the whole text and assigns the remaining segments to the categories. This is done by applying the categorisation techniques to the matrix built in the third step. In the sixth step, the categorised segments are presented to the expert for analysis and evaluation. Here the expert may accept or reject the categorisation against some template (e.g. the judgement of experts in the field or the expert's own working hypothesis). Further development will explore the possibility of using dynamic relevance feedback techniques (Salton and Buckley 1990), e.g. genetic algorithms (Nault, 1999).
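
As an illustration of the third step, the binary representation of segments can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the authors' implementation: the function names build_vocabulary and to_binary_matrix, the whitespace tokenisation with lowercasing, and the three toy segments are all hypothetical.

```python
def build_vocabulary(segments):
    """Map each distinct lowercased word to a column index (order of first appearance)."""
    vocabulary = {}
    for segment in segments:
        for word in segment.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def to_binary_matrix(segments, vocabulary):
    """One row per segment; entry j is 1 if word j occurs in the segment, else 0."""
    matrix = []
    for segment in segments:
        row = [0] * len(vocabulary)
        for word in segment.lower().split():
            if word in vocabulary:
                row[vocabulary[word]] = 1
        matrix.append(row)
    return matrix


# Hypothetical toy segments (not taken from the Russell corpus).
segments = [
    "knowledge by acquaintance",
    "knowledge of truths",
    "mind and matter",
]
vocabulary = build_vocabulary(segments)
matrix = to_binary_matrix(segments, vocabulary)

for segment, row in zip(segments, matrix):
    print(row, segment)
```

Each row records only the presence or absence of a word, so word frequency and word order inside a segment are discarded, as in the binary model described above.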

3.- The experiment

The preceding methodology has been applied to a philosophical text by Bertrand Russell (about 43 000 words). The text is segmented into 50-word segments. The categories chosen pertain to various philosophical dimensions that a Russellian discourse can present. The ones chosen here were: "PERCEPTION", "KNOWLEDGE", "MIND". The categories are not exclusive and do not form a structured ontology. The computer processing was carried out on an in-house system called CONTERM, in which perceptron neural net modules were included and specially programmed for this experiment. The one-layer perceptron algorithm is a classical but robust neural network. Current research seems to show that the multilayer perceptron is not better than the one-layer one in the text categorisation task.
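
The fixed-size segmentation and a one-layer perceptron of the kind described above can be sketched as follows. This is an illustrative sketch only, not the CONTERM implementation: the function names, the learning rate, the epoch limit and the four-word toy vocabulary are assumptions introduced here, and the training loop uses the classic perceptron error-correction rule, which may differ from the procedure actually used in the experiment.

```python
def segment_text(text, size=50):
    """Split a text into consecutive segments of `size` words (the last may be shorter)."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def predict(weights, bias, row):
    """Return 1 if the segment falls on the positive side of the hyperplane, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if activation > 0 else 0


def train_perceptron(rows, labels, epochs=20, rate=1.0):
    """Classic perceptron rule: weights change only on misclassified examples."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for row, label in zip(rows, labels):
            delta = label - predict(weights, bias, row)
            if delta != 0:
                errors += 1
                bias += rate * delta
                weights = [w + rate * delta * x for w, x in zip(weights, row)]
        if errors == 0:  # every training segment is classified correctly
            break
    return weights, bias


# Hypothetical binary vectors over the toy vocabulary
# ["knowledge", "acquaintance", "relation", "between"].
rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels = [1, 1, 0, 0]  # 1 = tagged "KNOWLEDGE" by the expert, 0 = not tagged

weights, bias = train_perceptron(rows, labels)
print(weights, bias)
print(predict(weights, bias, [0, 1, 0, 0]))  # unseen segment containing only "acquaintance"
```

After training, the learned weights can be read word by word: a large positive weight marks a word whose presence pushes a segment towards the category.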

4.- Results

After being trained on an initial set of segments, the system had to categorise the rest of the text. The results were positive. For example, it correctly categorised the following segment into the category "KNOWLEDGE":
"In this respect our theory of belief must differ from our theory of acquaintance, since in the case of acquaintance it was not necessary to take account of any opposite. (2) It seems fairly evident that if there were no beliefs there could be..."
But the segment:
"Some relations demand three terms, some four, and so on. Take, for instance, the relation 'between'. So long as only two terms come in, the relation 'between' is impossible: three terms are the smallest number that render it possible. York is between London."
is rightly rejected as not belonging to the category. As we can see, the machine learning tool manages to categorize the first segment as belonging to the category, although the word "knowledge" does not appear in it. This illustrates the basic reason for using such tools: the definition of a category learned by the algorithm may not be evident a priori to the user. It may thus heuristically deliver to the user segments that would not appear in a classical concordance or in a keyword retrieval.
Moreover, the system directly finds segments that can be considered prototypical of a category because of the high synaptic weights it attributes to certain words in them. For instance, words such as "acquaintance" (7.5), "knowledge" (5.5), "about" (4.0), "could" (4.0), "nature", "truths", "know", "should" (3.0), "reason" (2.5) in a segment are among those found to have the highest weights. This is common in neural network technologies (McLeod, Plunkett, Rolls 1998).
By using this algorithm with the Russell corpus, we cannot hope to reproduce the 80% results obtained by others with this kind of algorithm. But the results are encouraging: without any pre-filtering (lemmatisation, compound words, elimination of hapax, etc.), we obtained a recall of 0.658 and a precision of 0.531 in the test phase for the category "KNOWLEDGE". But for "MIND" and "PERCEPTION", the perceptron results are near random, probably due to the low cardinality of the positive training set.
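
For reference, recall and precision for a single category can be computed from the expert's tags and the system's tags as in the short Python sketch below. The function name precision_recall and the two label lists are hypothetical and serve only to show the arithmetic; they are not the data of the experiment.

```python
def precision_recall(predicted, actual):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN), for binary tags."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == 1 and a == 1)
    false_pos = sum(1 for p, a in pairs if p == 1 and a == 0)
    false_neg = sum(1 for p, a in pairs if p == 0 and a == 1)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical tags for eight test segments (1 = in the category, 0 = not).
actual = [1, 1, 1, 0, 0, 0, 1, 0]      # expert reader
predicted = [1, 1, 0, 1, 0, 0, 1, 1]   # system

precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 3 of 5 proposed segments are right; 3 of 4 relevant ones are found
```

Read this way, a recall above the precision means the system recovers most of the segments the expert would tag, at the cost of also proposing a fair number that the expert would reject.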

5.- Discussion

We can see that categorizing a philosophical text is not like categorizing sports or business news. We think that, because of the particular nature of philosophical texts, some specific modifications should be made to the process before the perceptron or another similar algorithm can be used with more precision in content and thematic analysis. Although the results of this experiment were positive, much more work has to be done in order to discover the various pertinent factors that come into play in the application of these numerical classification and categorisation strategies to humanities texts, and to increase the success of the categorisation. Among these we can cite: 1) more complex pre-filtering (lemmatisation, elimination of functional and subjectively non-pertinent words, use of a compound-word detector); 2) a better understanding of the nature of the category set and of the training set; 3) better parameters for correct segmentation for categorisation purposes (by words, by sentences or according to some predefined criterion); 4) a better design of the categorising algorithm, especially for dynamic corpora; 5) a specific evaluation strategy for benchmarking text categorisation against text interpretations by experts in the domain.

References

ALEXA, M. & C. ZUELL (1999) A review of software for text analysis. ZUMA: Mannheim.
BARRETT, E. (1989). The Society of Text. Hypertext, Hypermedia, and the Social Construction of Information. Cambridge, Mass.: MIT Press.
BEAUGRANDE, R. (1980) Text Discourse and Process. Longman.
BOUROCHE, J.M., SAPORTA, G. (1980), L'analyse des données, Paris, Presses Universitaires de France.
CARPENTER, G.A. & GROSSBERG, S. (1988) The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network, IEEE Computer 12(3): 77-88.
HAYES, P. J. (1980). "The Logic of Frames". In D. Metzing (Ed.), Frame Conceptions and Text Understanding. New York: Walter de Gruyter.
HEARST, M. (1994) Context and Structure in Automated Full-Text Information Access. PhD thesis, University of California at Berkeley.
HEARST, M. (1999) Untangling Text Data Mining, in the Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26.
JANSEN, S., OLESEN, J., PREBENSEN, H., & THARNE, T. (1992). Computational approaches to text Understanding. Copenhagen: Museum Tusculanum Press.
LACHARITÉ, N. (1989), Introduction à la méthodologie de la pensée écrite, Presses de l'Université du Québec, Québec.
LANDOW, G.P. & DELANY, P. (1993). The Digital Word: Text-Based Computing in the Humanities. Cambridge: MIT Press.
LEWIS, D.D., and M. RINGUETTE (1994), A comparison of two learning algorithms for text categorization, Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 81-93.
MANNING, C.D., SCHÜTZE, H. (1999), Foundations of statistical natural language processing, Cambridge, Mass. : MIT Press.
MCLEOD, P., PLUNKETT, K., ROLLS, E. T. (1998), Introduction to Connectionist Modelling of Cognitive Processes, Oxford University Press.
MEMMI, D. (2000), Le modèle vectoriel pour le traitement de documents, Les cahiers du laboratoire Leibniz, Leibniz-Imag, Grenoble.
MEUNIER, J.G., MEMMI, D., GABI, K. (1998) Dynamical Knowledge extraction from texts by Art Networks. Proceedings of Neurap. Marseille. pp. 205-210.
MEUNIER, J.G., REMAKI, L., FOREST, D. (1999), "Use of classifiers in Computer assisted reading and analysis of text", Proceedings of the 1999 International Conference on Imaging Science, Systems, and Technology (CISST'99), pp. 437-443.
NAULT, G., V. RIALLE and J.G. MEUNIER (1999), PROGEN: a Genetic-Based Semi-automatic Hypertext Construction Tool - first steps and experiment. In Smith, R. E. (eds.). GECCO-99: Proceedings of the Genetic and Evolutionary Computation Conference, July 13-17, Orlando, Florida USA. San Francisco, CA: Morgan Kaufmann.
PALIOURAS, G., and KARKALETSIS, V. and C. D. SPYROPOULOS, Learning rules for large vocabulary word sense disambiguation, Proceedings of IJCAI-99, 16th International Joint Conference on Artificial Intelligence, pp. 674-679, Morgan Kaufmann Publishers, San Francisco, US, 1999.
RASTIER, F. et al. (1994), Sémantique pour l'analyse. De la linguistique à l'informatique. Paris : Masson.
ROBERT, A. D., BOUILLAGUET, A., L'analyse de contenu, PUF, 1997.
RUSSELL, B. (1959), Problems of philosophy, London, Oxford University Press.
SALTON, G., & McGILL, M. (1983). Introduction to Modern Information Retrieval, New York: McGraw-Hill.
SALTON, G., BUCKLEY, C. (1990) Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science. 41(4) 288-297.
SEBASTIANI, F., Machine learning in automated text categorisation: a survey, Technical Report, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Number IEI-B4-31-1999, 1999.
TAURITZ, D.R., and KOK, J.N., and I.G. SPRINKHUIZEN-KUYPER, Adaptive information filtering using evolutionary computation, Information Sciences, Vol. 122, Number 2-4, pp. 121-140, 2000.
WERMTER, S., Panchev, C. and G. Arevian, Hybrid Neural Plausibility Networks for News Agents, Proceedings of AAAI-99, 16th Conference of the American Association for Artificial Intelligence, pp. 93-98, AAAI Press, Menlo Park, US, 1999.
YANG, Y., and X. LIU, A re-examination of text categorization methods, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pp. 42-49, ACM Press, New York, US, 1999.

Conference Info

In review


Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

94 works by 167 authors indexed

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC