Beautiful lips and porcelain cheeks: extracting physical descriptions from recent Dutch fiction

paper, specified "long paper"
  1. Corina Koolen

    University of Amsterdam

  2. Sander Wubben

    University of Amsterdam

  3. Andreas van Cranenburgh

    University of Amsterdam

1. Introduction

In literary analysis, description, as opposed to narration, has often been underestimated as a part of fiction. Literary theorists such as Bal, Lopes and Nünning have however made a case for its relevance [1, 6, 8]. Lopes reviews how well-known theorists like Barthes have dismissed description as ‘extra’, irrelevant or stalling the plot; he counters these notions with the statement that “[d]escription and narration constitute the two most basic modes of structuring any prose fiction text” [6, p. 19]. How the plot is conveyed matters for how a text is judged. Literary theorist Wells, for instance, argues that description is the distinguishing factor between quality literature and ‘simple’ chick-lit novels [15]. Indeed, research has shown that literary novels contain significantly more noun phrases and prepositional phrases than chick lit, indicating a larger amount of description [5]. This paper takes the first steps in a larger project in which description in fiction is computationally analyzed, as opposed to the now popular computational analysis of narrative (see for instance [7]). The preliminary question that we want to answer is: how (well) can we extract descriptions from fiction? We test this in the current paper by zooming in on a specific domain: the physical description of fictional characters.
2. Motivation

Descriptions of physical appearance are chosen as a test case because they are more likely to occur in a current-day novel than, for instance, landscape description. Moreover, main characters are often introduced in the first chapters, which makes it possible in the case of manual tagging (which we have done) to tag only the first chapters of a novel. Finally, physical description is an interesting feature for further literary interpretation. Connotations of beauty in folk tales have been researched [e.g. 14], but this has not yet been done for novels.
3. Method

The corpus of [5] is used, consisting of 32 novels of recent Dutch fiction, half chick-lit, half literary novels. Two of them were tagged from beginning to end for descriptions of physicality, including clothing. One is a literary novel, De schilder en het meisje (‘The painter and the girl’) by Margriet de Moor, the other chick lit, Zwaar verliefd (‘Heavily in Love’) by Chantal van Gastel. Bal defines description as “a textual fragment in which features are attributed to objects” [1, p. 36], a definition we will follow. We tagged full sentences that were either mainly concerned with physical appearance (example 1a, Van Gastel), mentioned a single feature (1b, De Moor) or somewhere in between.
1a. Hij heeft mooie lippen. ‘He has beautiful lips.’
1b. Door de rook heen keek hij naar de porseleinen wangen van mevrouw Cloeck[.] ‘Through the smoke he watched madam Cloeck’s porcelain cheeks[.]’
For the extraction, two approaches are compared: (1) manual development of lexical-linguistic patterns and (2) a Naive Bayes and an SVM classifier. Because the patterns were manually developed on the basis of the two fully tagged novels, they were subsequently tested on the other 30 novels, for each of which the first 500 sentences were manually tagged.
3.1 Lexical-linguistic patterns

After an initial exploration of the tagged sentences in the two main novels, we manually developed patterns to detect sentences containing description; Hearst uses similar patterns to harvest hyponyms [3]. The exploration showed that sentences containing physical descriptions, as opposed to sentences without them, (a) contain more nouns and adjectives, (b) are regularly coupled with a few specific, static verbs, and (c) contain a couple of recurring base lexical-linguistic patterns, e.g., ‘He was [a manNP] [[withPP] [brown eyesNP]]’. Each pattern therefore combines linguistic and lexical information (see example 2 below); a set of 13 patterns was written. To perform extraction, the corpus was parsed with the Dutch parser Alpino [2, 12]. Alpino parse trees provide rich linguistic annotations of sentences, such as the grammatical function of constituents. The trees can be queried with XPath, which was integrated in Van Cranenburgh's TreeSearch interface [5]. Linguistic information alone does not suffice to target physical descriptions, however, so we used Cornetto, the Dutch WordNet [13], to expand a manually constructed lexicon of nouns and adjectives related to physical descriptions. The lists were cleaned to exclude words that were not relevant to the topic, resulting in a lexicon of almost 600 words.
An example of a pattern translated to an XPath query is:
//node[@cat="pp" and @rel="mod"]//node[%uiterlijkA%]/../node[%uiterlijkN% or %kleding%]
Example 2: This pattern searches for a modifying prepositional phrase which contains an adjective and a noun from the lexicon.
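To make the logic of such a query concrete, the following is a minimal sketch in Python of what the pattern in example 2 tests for. It is not the TreeSearch implementation: it walks a hand-written, heavily simplified Alpino-style tree with the standard library instead of evaluating XPath, it assumes that the %uiterlijkA%, %uiterlijkN% and %kleding% placeholders expand to membership tests on a lemma attribute, and the three word lists, the toy parse and the function name find_pp_descriptions are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Toy stand-ins for the lexicon lists (assumed contents, not the real lexicon).
UITERLIJK_A = {"bruin", "mooi", "porseleinen"}  # appearance adjectives
UITERLIJK_N = {"oog", "lip", "wang"}            # appearance nouns
KLEDING = {"jas", "jurk"}                       # clothing

# Simplified, hand-written parse of "Hij was een man met bruine ogen".
TOY_PARSE = """
<alpino_ds>
  <node cat="smain" rel="--">
    <node rel="su" lemma="hij" word="Hij"/>
    <node rel="hd" lemma="zijn" word="was"/>
    <node cat="np" rel="predc">
      <node rel="det" lemma="een" word="een"/>
      <node rel="hd" lemma="man" word="man"/>
      <node cat="pp" rel="mod">
        <node rel="hd" lemma="met" word="met"/>
        <node cat="np" rel="obj1">
          <node rel="mod" lemma="bruin" word="bruine"/>
          <node rel="hd" lemma="oog" word="ogen"/>
        </node>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def find_pp_descriptions(tree_root):
    """Return (adjective, noun) word pairs found inside modifying PPs."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent
                 for parent in tree_root.iter("node") for child in parent}
    hits = []
    for pp in tree_root.iter("node"):
        # //node[@cat="pp" and @rel="mod"]
        if pp.get("cat") != "pp" or pp.get("rel") != "mod":
            continue
        # //node[%uiterlijkA%]: any descendant whose lemma is a listed adjective
        for adj in pp.iter("node"):
            if adj is pp or adj.get("lemma") not in UITERLIJK_A:
                continue
            # /../node[%uiterlijkN% or %kleding%]: a child of the same parent
            for sibling in parent_of[adj]:
                if sibling.get("lemma") in UITERLIJK_N | KLEDING:
                    hits.append((adj.get("word"), sibling.get("word")))
    return hits


print(find_pp_descriptions(ET.fromstring(TOY_PARSE.strip())))
```

On the toy parse the function pairs the adjective ‘bruine’ with the noun ‘ogen’; a sentence is extracted as soon as at least one such pair is found.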
3.2 Machine learning

We cast the extraction of physical descriptions as a text classification task in order to use machine learning methods. The task is then to automatically assign a class to a given text (in our case: physical description or no physical description). Text classification is usually done at the document level, meaning that a class is predicted for each document [10]. Algorithmic methods used for the classification task vary widely. We used Naive Bayes classification and Support Vector Machines (SVM), two established, straightforward approaches to text classification [4, 9, 11], and adapted them to our task of classifying sentences: each sentence was classified as either a description or not, in order to extract the descriptions.
4. Results

4.1 Lexical-linguistic patterns

Precision, recall and F-measure were calculated for each pattern separately for the two main novels, for the test set of 30 novels, for a cumulative set of all pattern results, and for chick lit versus literary novels; the most important results can be found in Table 1. Sentences that were extracted more than once were counted as one hit.
                          F-measure (%)   Precision (%)   Recall (%)
Test set, all novels      31              29              35
Test set, literature      25              29              22
Test set, chick lit       18              28              13
Main novels               16              24              12
Table 1: Results for lexical-linguistic pattern-based extraction
An unexpected outcome was that the results were much better for the 30 novels in the test set than for the two novels on the basis of which the patterns were developed; the percentage of descriptions might be higher in the first chapters. Another interesting result was the performance on literary novels, which was better than on chick lit. An explanation might be that in chick lit, sentences are shorter [see 5], more often elliptic (‘And his mouth… He has beautiful lips. Precisely full enough.’) and regularly discuss physicality through dialogue, for which it is hard to develop patterns. Generic patterns, containing little more than lexical information, achieved higher scores than more specific ones. The specific patterns did improve the cumulative outcome. Further research is needed, but an expansion of the lexicon might raise performance.
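The cumulative scores for the patterns follow the standard definitions of precision, recall and F-measure, computed over the union of all pattern results so that a sentence extracted by several patterns counts once. A minimal sketch in Python, in which the function name, the pattern names and the sentence identifiers are all hypothetical:

```python
def evaluate_patterns(pattern_hits, gold):
    """Score the union of all pattern results against the tagged sentences.

    pattern_hits maps a pattern name to the set of sentence ids it extracted;
    gold is the set of sentence ids manually tagged as physical description.
    """
    # Taking the union counts a sentence found by several patterns only once.
    extracted = set().union(*pattern_hits.values())
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: sentence 3 is found by both patterns but counted once.
hits = {"pattern_a": {1, 2, 3}, "pattern_b": {3, 4}}
tagged = {2, 3, 5, 6}
print(evaluate_patterns(hits, tagged))
```

Scoring a single pattern amounts to passing a dictionary with only that pattern's results.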
4.2 Machine learning

We trained our classifiers on the two annotated novels. The features selected as input for the classifiers are words weighted with tf.idf, for which we considered the sentences as documents and the novels as the collection of documents. Experiments were also performed for bigrams and part-of-speech tags, but the results were comparable to the results we report here. We performed ten-fold cross validation on the set of sentences from each novel and both novels combined. We found that Naive Bayes outperforms SVM for this task, as can be observed in Table 2.
                             F-measure (%)   Precision (%)   Recall (%)
Both novels
  Naive Bayes                60              57              62
  SVM                        58              59              58
Zwaar verliefd
  Naive Bayes                62              61              64
  SVM                        57              59              56
De schilder en het meisje
  Naive Bayes                58              55              62
  SVM                        52              53              51
Table 2: Results for the Naive Bayes and SVM classifiers
Performance is considerably higher than that of the pattern-based approach. The skew of the class distribution (descriptions form only a small portion of a novel) makes this classification task a hard one, but overall this is a promising method. This machine learning approach can be regarded as a baseline: more sophisticated methods might yield better results.
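The sentence-level setup described in this section can be illustrated with a small, self-contained sketch in Python: every sentence is treated as a document, words are weighted with tf.idf, and a multinomial Naive Bayes classifier is trained on the weighted counts. This is not the code used for the experiments. The tokenization (lowercased whitespace splitting), the idf variant (log(N/df) + 1), the add-one smoothing, the six toy sentences and all names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tfidf(sentences):
    """Return one tf.idf vector per sentence, plus the idf table."""
    tokenized = [s.lower().split() for s in sentences]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumed idf variant: log(N / df) + 1, so no weight drops to zero.
    idf = {tok: math.log(len(tokenized) / df) + 1.0
           for tok, df in doc_freq.items()}
    vectors = [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def vectorize(sentence, idf):
    """Weight a new sentence with an existing idf table; skip unseen words."""
    counts = Counter(t for t in sentence.lower().split() if t in idf)
    return {tok: tf * idf[tok] for tok, tf in counts.items()}


class WeightedNaiveBayes:
    """Multinomial Naive Bayes over tf.idf weights with add-alpha smoothing."""

    def fit(self, vectors, labels, alpha=1.0):
        self.alpha = alpha
        self.vocab = {tok for vec in vectors for tok in vec}
        self.class_count = Counter(labels)
        self.weight = defaultdict(Counter)  # label -> token -> summed weight
        for vec, label in zip(vectors, labels):
            self.weight[label].update(vec)
        self.total = {c: sum(self.weight[c].values())
                      for c in self.class_count}
        return self

    def predict(self, vec):
        n_sentences = sum(self.class_count.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_count.items():
            score = math.log(count / n_sentences)  # log prior
            denom = self.total[label] + self.alpha * len(self.vocab)
            for tok, w in vec.items():
                prob = (self.weight[label][tok] + self.alpha) / denom
                score += w * math.log(prob)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data; the real training data are the two tagged novels.
TRAIN_SENTENCES = [
    "hij heeft mooie lippen", "zij heeft bruine ogen", "hij heeft blauwe ogen",
    "hij loopt naar huis", "zij belt haar moeder", "we gaan morgen naar huis",
]
TRAIN_LABELS = ["description"] * 3 + ["other"] * 3

vectors, idf = tfidf(TRAIN_SENTENCES)
model = WeightedNaiveBayes().fit(vectors, TRAIN_LABELS)
print(model.predict(vectorize("zij heeft mooie ogen", idf)))
```

In a ten-fold cross validation, the idf table and the classifier would be re-estimated on the training folds only and applied to the held-out fold.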
5. Conclusion

A comparison of two methods for extracting sentences containing physical descriptions paints a clear picture: extracting such information is a complex matter but not impossible, and machine learning performs better than the manual pattern-based approach. However, the main benefit of the manual tagging and patterns is the insight they give into the form of the sentences that contain the sought-after descriptions, whereas the bag-of-words approach of the machine learning method is limited to finding features based on individual words. A possibility for future research is to extend the patterns and the lexicon to see if the results can be improved, but we prefer to pursue a bottom-up approach. A combination of the methods could be fruitful: using the patterns as features for machine learning. We could also explore descriptions on a different textual level; especially for the chick-lit novels, where the use of ellipsis and dialogue confuses sentence extraction, larger fragments of text should be analyzed. Targeted topic modeling might be useful for this purpose.

1. Bal, M. (2009). Narratology: Introduction to the Theory of Narrative. Toronto: University of Toronto Press.
2. Bouma, G., Van Noord, G. and Malouf, R. (2001). Alpino: Wide-coverage computational analysis of Dutch. Language and Computers, 37(1). 45–59.
3. Hearst, M.A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th Conference on Computational Linguistics, vol. 2. 539–545.
4. Hearst, M.A., Dumais, S. T., Osuna, E., Platt, J., and Schölkopf, B. (1998). Support vector machines. Intelligent Systems and their Applications, IEEE, 13(4). 18–28.
5. Jautze, K., Koolen, C., Van Cranenburgh, A. and De Jong, H. (2013). From High Heels to Weed Attics: a Syntactic Investigation of Chick Lit and Literature. In Proceedings of the Second Workshop on Computational Linguistics for Literature.
6. Lopes, J. M. (1995). Foregrounded Description in Prose Fiction: Five Cross-literary Studies. Toronto: University of Toronto Press.
7. Mani, I. (2013). Computational Narratology. In The Living Handbook of Narratology. Eds. Hühn, P., Schmid W. and Schönert, J.
8. Nünning, A. (2007). Towards a Typology, Poetics and History of Description in Fiction. In Description in Literature and Other Media. Eds. Wolf W. and Bernhart W. Amsterdam, New York: Rodopi. 91–128.
9. Rish, I. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence, 3(22). 41–46.
10. Sebastiani, F. (2002). Machine learning in automated text categorization. In ACM computing surveys (CSUR), 34(1). 1–47.
11. Steinwart, I. and Christmann, A. (2008). Support vector machines. New York: Springer.
12. Van Noord, G. (2006). At last parsing is now operational. In TALN06. Verbum Ex Machina. Actes de la 13e conference sur le traitement automatique des langues naturelles. 20–42.
13. Vossen, P., Hofmann, K., De Rijke, M., Tjong Kim Sang, E., and Deschacht, K. (2007). The Cornetto Database: Architecture and User-scenarios. In Proceedings of the Dutch-Belgian Information Retrieval Workshop. 89–96.
14. Weingart, S. and Jorgensen, J. (2012). Computational Analysis of the Body in European Fairy Tales. Literary and Linguistic Computing, 28(4).
15. Wells, J. (2005). Mothers of Chick Lit? Women Writers, Readers, and Literary History. In Chick Lit: The New Woman’s Fiction. Eds. Ferriss S. and Young, M. New York: Routledge. 45–70.

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO