Modelling the relationship between morphosyntactic features and discourse relations in a multimodal corpus of primary school science diagrams

paper, specified "short paper"
  1. Tuomo Hiippala

    University of Helsinki

  2. Jonas Haverinen

    University of Helsinki

In 1998, Watanabe and Nagao published a pioneering article on the relationship between written language and pictorial representations in diagrams (Watanabe and Nagao 1998). By manually analysing 31 diagrams from Japanese books of flora that describe the shape, features and environment of plants, Watanabe and Nagao showed that morphosyntactic features of textual elements could be mapped to specific discourse relations that held between the text and pictorial representation of a plant. They also formulated a set of rules to support the computational processing of diagrammatic representations, which could be used to infer what kinds of relations hold between textual and pictorial elements.
From a contemporary standpoint, the diagrams studied by Watanabe and Nagao (1998) can be approached from the perspective of multimodality theory, which studies how human communication relies on intentional combinations of multiple “modes” of expression (Bateman et al. 2017). From a multimodal perspective, individual diagrams may be treated as instances of the diagrammatic semiotic mode, which integrates natural language and diverse visual expressive resources into a common discourse organisation (Hiippala and Bateman 2021). Against this backdrop, the rules formulated by Watanabe and Nagao (1998) can be treated as descriptions of the multimodal discourse structure of diagrams, which guides viewers towards interpretations of what combinations of modes mean in their context of occurrence (Bateman 2020).
In this contribution, we revisit the work of Watanabe and Nagao (1998) using a recently published multimodal corpus of 1000 primary school science diagrams in English. This openly available corpus, named AI2D-RST, contains multiple layers of cross-referenced annotations for expressive resources, layout and discourse structure, which have been created by trained experts (Hiippala et al. 2021). Our aim is to establish whether a mapping between morphosyntactic features and discourse relations similar to the one proposed by Watanabe and Nagao (1998) can be found in English-language diagrams that serve similar communicative goals, that is, to depict and explain various natural phenomena. Acknowledging the multimodal nature of diagrams, we also complement the morphosyntactic features with information about diagram layout and the use of lines.
In contrast to the manual analysis in Watanabe and Nagao (1998), we adopt a corpus-driven approach to examine discourse relations between textual and pictorial elements. We extract 2580 discourse relations from the AI2D-RST corpus that hold between pictorial and textual elements, focusing on relations that name entire objects (“identification”) or describe part-whole relations (“elaboration”). We extract the following features for each pair of elements: (1) whether the textual element consists of a nominal, clause, modifier or numeral, (2) the distance between elements in the layout, (3) the angle between pictorial and textual elements, and (4) whether the elements are connected using a line.
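The four features listed above could be computed along the following lines. This is a minimal sketch and not the authors' implementation: the `Box` class, the functions `layout_features` and `feature_vector`, the use of bounding-box centres, the normalisation of distance by the image diagonal, the angle convention and the one-hot encoding of the text type are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

# Text types named in the abstract; the one-hot encoding is an assumption.
TEXT_TYPES = ["nominal", "clause", "modifier", "numeral"]


@dataclass
class Box:
    """Hypothetical axis-aligned bounding box in pixels (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def centre(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def layout_features(picture, text, diagonal):
    """Return (distance, angle) from the picture centre to the text centre.

    Distance is divided by the image diagonal (an assumed normalisation).
    Angle is in degrees in [0, 360), with 0 to the right and 90 straight up.
    """
    px, py = picture.centre
    tx, ty = text.centre
    dx, dy = tx - px, ty - py
    distance = math.hypot(dx, dy) / diagonal
    # Negate dy because image coordinates grow downward.
    angle = math.degrees(math.atan2(-dy, dx)) % 360
    return distance, angle


def feature_vector(text_type, picture, text, diagonal, connected_by_line):
    """Combine text type, distance, angle and line connection into one list."""
    if text_type not in TEXT_TYPES:
        raise ValueError(f"unknown text type: {text_type}")
    one_hot = [1.0 if text_type == t else 0.0 for t in TEXT_TYPES]
    distance, angle = layout_features(picture, text, diagonal)
    return one_hot + [distance, angle, 1.0 if connected_by_line else 0.0]
```

For example, `feature_vector("numeral", picture_box, text_box, diagonal, True)` yields a seven-element vector for one pair of elements. How the corpus annotations are read into such boxes is left open here.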
We use the aforementioned features to train a random forest classifier with 10 decision trees to predict whether the textual element names or describes a pictorial element. We use 10-fold cross-validation to evaluate the classifier, which achieves an average macro F1-score of 0.86 (standard deviation: 0.06). An analysis of how much each feature contributes to classification decisions reveals that, apart from numerals, linguistic information is largely irrelevant. The distance between the pictorial and textual elements and whether they are connected using a line are the most important features for determining the function of a text element (see Figure 1).

Figure 1: The importance of each input feature, averaged over ten decision trees. Error bars show the standard deviation for each feature.
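The classification setup described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data are randomly generated placeholders rather than features from AI2D-RST, so the scores it produces say nothing about the corpus. The stratified folds, the fixed random seeds, the toy labelling rule and the use of impurity-based feature importances are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

FEATURES = ["nominal", "clause", "modifier", "numeral",
            "distance", "angle", "line"]

# Synthetic placeholder data; NOT the AI2D-RST features.
rng = np.random.default_rng(0)
n = 400
text_type = np.eye(4)[rng.integers(0, 4, size=n)]  # one-hot text type
distance = rng.random(n)                           # normalised distance
angle = rng.uniform(0, 360, size=n)                # angle in degrees
line = rng.integers(0, 2, size=n)                  # 1 if joined by a line
X = np.column_stack([text_type, distance, angle, line])

# Toy labels (0 = identification, 1 = elaboration) with 10% label noise.
y = ((line == 1) & (distance > 0.3)).astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

# Random forest with 10 trees, evaluated with 10-fold cross-validation
# using the macro-averaged F1-score.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
print(f"macro F1: mean={scores.mean():.2f}, std={scores.std():.2f}")

# Impurity-based importances per tree, then mean and standard deviation
# across the ten trees (the quantities a plot like Figure 1 would show).
clf.fit(X, y)
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)
for name, m, s in zip(FEATURES, mean_imp, std_imp):
    print(f"{name:>8}: {m:.3f} +/- {s:.3f}")
```

Impurity-based importances are only one option. Permutation importance on held-out folds is a common alternative when features differ in cardinality, as the binary and continuous features do here.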

Our results suggest that layout and diagrammatic elements such as arrows and lines are crucial for making inferences about the multimodal discourse structure of diagrams. Detecting textual elements and lines may thus help to unpack the structure of diagrams. This has broader implications for emerging work in digital humanities, particularly within the paradigm of “distant viewing” (Arnold and Tilton 2019) and the growing interest in applying computational methods to multimodal data (Wevers and Smits 2020; Smits and Kestemont 2021). Compared to work on purely linguistic data, computational treatment of multimodal data in digital humanities rarely addresses fundamental questions such as how to identify basic units of analysis and the discourse relations that hold between them. Understanding the structure of multimodal discourse is a prerequisite for performing more complex analyses that are now regularly pursued using linguistic data, such as tracking semantic shifts. Achieving a similar capability for multimodal data requires a deeper understanding of discourse structures within individual modes of communication, such as the diagrammatic semiotic mode, and of how individual modes are combined in multimodal artefacts.


Arnold, T. and Tilton, L. (2019). Distant viewing: analyzing large visual corpora. Digital Scholarship in the Humanities 34(Supplement 1): i3–i16.

Bateman, J.A., Wildfeuer, J. and Hiippala, T. (2017). Multimodality: Foundations, Research and Analysis. De Gruyter: Berlin.

Bateman, J.A. (2020). The foundational role of discourse semantics beyond language. In Zappavigna, M. and Dreyfus, S. (eds) Discourses of Hope and Reconciliation: On J. R. Martin’s Contribution to Systemic Functional Linguistics. Bloomsbury: London, pp. 39–55.

Hiippala, T. and Bateman, J.A. (2021). Semiotically-grounded distant viewing of diagrams: insights from two multimodal corpora. Digital Scholarship in the Humanities. DOI: 10.1093/llc/fqab063/6374705

Hiippala, T., Alikhani, M., Haverinen, J. et al. (2021). AI2D-RST: a multimodal corpus of 1000 primary school science diagrams. Language Resources & Evaluation 55: 661–688.

Smits, T. and Kestemont, M. (2021). Towards multimodal computational humanities: using CLIP to analyze late-nineteenth century magic lantern slides. In Proceedings of the Computational Humanities Research Conference (CHR), pp. 149–158.

Watanabe, Y. and Nagao, M. (1998). Diagram understanding using integration of layout information and textual information. In Proceedings of the 17th International Conference on Computational Linguistics (COLING), pp. 1374–1380.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO