Italian NLP: Computational Linguistics Meets Literary Text Analysis

Authorship
  1. Valentina Di Giovanni

    Dept of Italian - Royal Holloway University of London

  2. Roberto Bartolini

    Istituto di Linguistica Computazionale (ILC) (Institute for Computational Linguistics) - Consiglio Nazionale delle Ricerche (CNR)

  3. Alessandro Lenci

    Dipartimento di Linguistica - Università di Pisa

Work text
The paper will describe an experiment in applying a suite of Natural Language Processing (NLP) tools for Italian to literary texts. The work will show that state-of-the-art computational linguistics technology can be profitably used to support text-linguistic analysis and to gain deeper insight into an author's language. The corpus chosen for the reported experiment consists of a selection of Calvino's works, chosen along diachronic lines to cover his entire production (~500,000 words). The corpus was parsed through a "pipeline" of NLP tools (Italian NLP, http://foxdrake.ilc.cnr.it/~webtools) that performs a "shallow" syntactic analysis of the input texts. The resulting analyses were used to collect data on the distribution of a set of relevant features in Calvino's texts, in order to assess the impact of spoken language on the narrative through syntactic adjustments such as explicit mention of the subject, dislocations, hanging topics, ellipsis and repetition.

Italian NLP is a suite of linguistic processing tools based on the paradigm of "shallow NLP". Traditional full-parsing techniques seek to associate each sentence with a fully specified recursive structure, in order to identify its syntagmatic composition as well as the functional dependency relations among the identified constituents. The drawback of full parsing is that it is extremely costly for most existing systems, since it needs huge amounts of linguistic knowledge to work properly. By contrast, the main idea behind "shallow processing" is that NLP systems can resort to a shallower level of syntactic description which, although underspecified in various respects, still provides enough syntactic information to serve as the basis for higher-level analyses.

After a short outline of shallow parsing, the paper will describe the main features of the processing tools that make up the Italian NLP suite. The text processing chain starts with the morphological analyser (MAGIC), which assigns to each word form in the tokenised input all its possible lemmas, together with the morpho-syntactic features describing them. A shallow syntactic analysis is then carried out on the morphologically analysed text. It comprises chunking, a process of non-recursive text segmentation, and dependency analysis, which aims to identify the full range of functional relations (subject, object, modifier, complement, etc.) within each sentence.
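The first step of the chain can be illustrated with a minimal Python sketch. It is not MAGIC: the toy lexicon, the tag names and the functions `tokenise` and `analyse` are assumptions made for illustration only. It shows the one property described above, namely that each word form is paired with all of its candidate lemmas and features, leaving disambiguation to later stages.

```python
import re
from typing import NamedTuple


class Reading(NamedTuple):
    lemma: str
    pos: str
    features: str


# Hypothetical toy lexicon: each word form lists all its candidate readings.
TOY_LEXICON = {
    "la": [Reading("il", "DET", "fem|sing"),
           Reading("la", "PRON", "fem|sing|clitic")],
    "vecchia": [Reading("vecchio", "ADJ", "fem|sing"),
                Reading("vecchia", "NOUN", "fem|sing")],
    "porta": [Reading("porta", "NOUN", "fem|sing"),
              Reading("portare", "VERB", "ind|pres|3|sing")],
}


def tokenise(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def analyse(tokens):
    """Pair each token with every reading the toy lexicon lists for it.

    Forms missing from the lexicon get one placeholder reading tagged "UNK".
    """
    return [(tok, TOY_LEXICON.get(tok, [Reading(tok, "UNK", "")]))
            for tok in tokens]
```

Calling `analyse(tokenise("La vecchia porta."))` returns two readings for each of the three words, which is the kind of ambiguity ("the old door" versus "the old woman carries it") that the later syntactic stages have to cope with.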

Text chunking is carried out through a battery of finite state automata (CHUNK-IT, Federici et al., 1998), which takes as input a morphologically analysed and lemmatised text and segments it into an unstructured sequence of syntactically organized text units called chunks. Chunking requires only a minimum of linguistic knowledge: it relies on an 'empty' syntactic lexicon, which contains no information other than each entry's lemma, part of speech and morpho-syntactic features. The resulting analyses are flat: all chunks are represented at the same structural level, as daughters of the same top node. At this stage, a chunked sentence gives no information about the nature and scope of inter-chunk dependencies. These dependencies are identified during the last phase, dependency analysis, carried out by IDEAL (Italian DEpendency AnaLyzer, Lenci et al. 2001).
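A rough idea of finite-state chunking can be given in Python by encoding each part-of-speech tag as one character and running regular expressions over the resulting string. This is only a sketch under stated assumptions: the tag set, the chunk labels (`N_C`, `P_C`, `FV_C`, `ADV_C`, `U_C`), the patterns and the function `chunk` are illustrative and are not taken from CHUNK-IT. The input is assumed to be already disambiguated, one tag per token.

```python
import re

# One-character code per part of speech, so a tag sequence becomes a string.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "PRON": "O",
             "PREP": "P", "AUX": "X", "VERB": "V", "ADV": "R"}

# Ordered chunk patterns over the code string; earlier alternatives win.
CHUNK_PATTERNS = [
    ("P_C", r"PD?A*NA*"),    # prepositional chunk: prep (+ det) (+ adj) noun
    ("N_C", r"D?A*NA*|O"),   # nominal chunk, or a bare pronoun
    ("FV_C", r"X*V"),        # verbal chunk: optional auxiliaries + verb
    ("ADV_C", r"R"),         # adverbial chunk
    ("U_C", r"."),           # fallback: any other token is its own chunk
]
CHUNKER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in CHUNK_PATTERNS))


def chunk(tagged):
    """Segment (form, pos) pairs into a flat list of (label, forms) chunks."""
    codes = "".join(TAG_CODES.get(pos, "?") for _, pos in tagged)
    return [(m.lastgroup, [form for form, _ in tagged[m.start():m.end()]])
            for m in CHUNKER.finditer(codes)]
```

The output is deliberately flat: a list of sibling chunks with no nesting and no links between them, which mirrors the point that inter-chunk dependencies are left to the following stage.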

IDEAL includes two main components: (i) a core grammar; (ii) a syntactic lexicon of ~26,400 subcategorization frames for nouns, verbs and adjectives, derived from the Italian LE-PAROLE syntactic lexicon (Ruimy et al. 1998). The IDEAL core grammar consists of ~100 rules covering the major syntactic phenomena. The grammar rules are regular expressions (implemented as finite state automata) defined over chunk sequences and augmented with tests on chunk and lexical attributes. The rules are organized into two major modules: 1) structurally-based rules and 2) lexically-based rules.
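The two rule types can be sketched in Python as regular expressions over a string of chunk codes, with a lexical test added to the second rule. Everything here is hypothetical: the `Chunk` record, the toy frames in `TOY_FRAMES`, the relation names and the two rules are assumptions for illustration and do not reproduce IDEAL's grammar or the LE-PAROLE lexicon.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str        # chunk type, e.g. "N_C", "FV_C", "P_C"
    head: str        # lemma of the chunk's head word
    prep: str = ""   # lemma of the introducing preposition (P_C only)


CODES = {"N_C": "N", "FV_C": "V", "P_C": "P"}

# Hypothetical toy frames: prepositions each verb selects for a complement.
TOY_FRAMES = {"salire": {"su"}, "parlare": {"di", "con"}}


def subject_rule(chunks, codes):
    """Structurally-based: a nominal chunk right before a verbal chunk."""
    for m in re.finditer(r"NV", codes):
        yield ("subj", chunks[m.start() + 1].head, chunks[m.start()].head)


def pp_rule(chunks, codes):
    """Lexically-based: a prepositional chunk after the verb is a complement
    only if the verb's frame licenses its preposition, else a modifier."""
    for m in re.finditer(r"VN?(P)", codes):
        verb, pp = chunks[m.start()], chunks[m.start(1)]
        licensed = pp.prep in TOY_FRAMES.get(verb.head, set())
        yield ("comp" if licensed else "mod", verb.head, pp.head)


def analyse(chunks):
    """Apply each rule to the chunk sequence; return (relation, head, dep)."""
    codes = "".join(CODES.get(c.kind, "?") for c in chunks)
    deps = []
    for rule in (subject_rule, pp_rule):
        deps.extend(rule(chunks, codes))
    return deps
```

With the chunks for "il barone sale sugli alberi", the sketch labels "albero" a complement of "salire" because the toy frame lists "su"; replacing the verb with one that has no such frame yields a modifier instead. The structural rule is intentionally naive and would overgenerate on real text, for example with post-verbal subjects.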

It is worth stressing that the Italian NLP tools were developed for purposes and domains other than literary text analysis, mainly human language technology and language engineering applications. A challenging aspect of the experiment was precisely the customisation of computational linguistics technology to the needs of literary text investigation. The target linguistic analysis concerned the identification of features of spoken language in the narrative, with special regard to syntactic adjustments and sentences with a marked order of elements, in order to evaluate the impact of spoken Italian on written production. In spoken language, markedness phenomena serve the mise en relief of certain information in the utterance. The same mechanism can operate in written language whenever emphasis leads to the reiteration of information already provided (as with an overt subject, Italian being a pro-drop language) or to a repetition (anaphoric or cataphoric use of unnecessary pronouns), together with a shift of the focused element to a marked position in the sentence (hanging topic). The nature of the investigation, i.e. the use of a computational tool in the specific domain of narrative, led us to plan and develop essential changes within the framework of IDEAL, not just to meet the particular objectives concerning Calvino but to serve the narrative domain as a whole.
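One way such a feature could be quantified from dependency output is sketched below in Python. It counts subject relations whose dependent is a pronoun, a crude proxy for the overt subjects discussed above, and normalises the count per 1,000 tokens so that works of different length can be compared over time. The `Dep` record, the relation and tag names, and both functions are assumptions for illustration; the sketch does not describe the measurements actually taken in the experiment.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    relation: str        # e.g. "subj", "obj"
    head: str            # lemma of the governing word
    dependent: str       # lemma of the dependent word
    dependent_pos: str   # part of speech of the dependent


def overt_pronoun_subject_rate(deps, n_tokens):
    """Pronominal subjects per 1,000 tokens in one analysed text."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    hits = sum(1 for d in deps
               if d.relation == "subj" and d.dependent_pos == "PRON")
    return 1000.0 * hits / n_tokens


def diachronic_profile(works):
    """Return (year, label, rate) triples sorted by year.

    `works` is a list of (label, year, n_tokens, deps) tuples.
    """
    ordered = sorted(works, key=lambda work: work[1])
    return [(year, label, overt_pronoun_subject_rate(deps, n_tokens))
            for label, year, n_tokens, deps in ordered]
```

Any figures fed to these functions in a demonstration would be invented; the point is only the shape of the computation, a per-text normalised rate ordered chronologically.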

To obtain more precise results, some existing rules were adapted and new ones written. The adaptations and new rules start from the subcategorization of various VPs and aim at a better definition of their arguments. The rules were designed and ordered to follow the degree of embedding of the clauses, moving from the most embedded clause to the main clause. The paper will report the first experiments assessing the contribution of the new rules to more accurate syntactic analyses. Preliminary evaluations are encouraging, and interesting patterns of evolution in Calvino's language and style are emerging from the corpus processed with Italian NLP.

REFERENCES: Texts:

1. CALVINO I., Opere, Romanzi e racconti, 3 voll., Milano, Meridiani Mondadori, 1994.

REFERENCES: Studies:

1. ABNEY S. P., Parsing by Chunks, in R. C. BERWICK et al. (eds.), Principle-Based Parsing: Computation and Psycholinguistics, Kluwer, Dordrecht, 1991, 257-278.
2. BAZZANELLA C., Le facce del parlare. Un approccio pragmatico all'italiano parlato, Firenze-Roma, La Nuova Italia, 1994.
3. BERRETTA M., Il parlato italiano contemporaneo. In: L. Serianni-P. Trifone (eds.), Storia della lingua italiana, II, Scritto e parlato, Torino, Einaudi, 1994, 239-270.
4. BERRUTO G., Le dislocazioni a destra in italiano. In: H. Stammerjohann (cur.), Tema-Rema in italiano, Tübingen, 1986, 55-70.
5. BONSAVER G., Il mondo scritto. Forme e ideologia nella narrativa di Italo Calvino, Tirrenia, Torino, 1995.
6. CARROLL J., BRISCOE T., CALZOLARI N., FEDERICI S., MONTEMAGNI S., PIRRELLI V., GREFENSTETTE G., SANFILIPPO A., CARROLL G., ROOTH M., Specification of Phrasal Parsing, SPARKLE Deliverable 1.1, 1996.
7. CARROLL G., LIGHT M., PRESCHER D., ROOTH M., CARROLL J., BRISCOE T., KORHONEN A., MCCARTHY D., CALZOLARI N., FEDERICI S., MONTEMAGNI S., PIRRELLI V., Syntactic and Semantic Type and Selection, Deliverable 5.1, Work Package 5, EC project, 1997.
8. CHOMSKY N., Barriers, Cambridge MA, MIT Press, 1986.
9. DE MAURO T., Lessico di frequenza dell'italiano parlato, Milano, Etas, 1963.
10. FEDERICI S., MONTEMAGNI S., PIRRELLI V., Shallow Parsing and Text Chunking: a View on Underspecification in Syntax, in J. CARROLL (ed.), Proceedings of the Workshop on Robust Parsing, Prague, Czech Republic, 12-16 August 1996.
11. FEDERICI S., MONTEMAGNI S., PIRRELLI V., Chunking Italian: Linguistic and Task-oriented Evaluation, in Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 1998a.
12. FEDERICI S., MONTEMAGNI S., PIRRELLI V., An Analogy-based System for Lexicon Acquisition, SPARKLE Working Paper, 1998b.
13. FEDERICI S., MONTEMAGNI S., PIRRELLI V., CALZOLARI N., Analogy-based Extraction of Lexical Knowledge from Corpora: the SPARKLE Experience, in Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 1998c.
14. FEDERICI S., PIRRELLI V., Analogy, Computation and Linguistic Theory, in H. SOMERS, D. JONES (eds.), New Methods in Language Processing, London, University College London, 1996.
15. LENCI A., MONTEMAGNI S., PIRRELLI V., SORIA C., NETTER K., RAJMAN M., Corpora for Evaluation, ELSE (LE4-8340) Deliverable D5, 1999.
16. LENCI A., MONTEMAGNI S., PIRRELLI V., SORIA C., Where opposites meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation, in Proceedings of LREC-2000, Athens, Greece, 31 May - 2 June 2000, 625-632.
17. LENCI A., BARTOLINI R., CALZOLARI N., CARTIER E., Document Analysis, MLIS-5015 MUSI, Deliverable D3.1, 2001.
18. LONGOBARDI G., Reference and Proper Names: A Theory of N-Movement in Syntax and Logical Form, «Linguistic Inquiry», XXV (1994), 609-665.
19. MENGALDO P.V., Aspetti della lingua di Calvino. In: La tradizione del Novecento, Terza serie, Torino, Einaudi, 1991.
20. ROVENTINI A., Palomar: a Computer-aided Analysis of some Lexical and Stylistic Features, PhD thesis, Università di Pisa, 1973.
21. RUIMY N., BATTISTA M., CORAZZARI O., GOLA E., SPANU A., Italian Lexicon Documentation, WP3.11 LE-PAROLE, Pisa, 1998.
22. SORNICOLA R., Sul parlato, Bologna, Il Mulino, 1981.
23. VOGHERA M., Sintassi e intonazione nell'italiano parlato, Bologna, Il Mulino, 1992.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
