A picture is worth a thousand words: Image analysis for the Digital Humanities: https://tutorial.thousandwords.art

workshop / tutorial
  1. Stuart James

    Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia, Italy

  2. Mathieu Aubry

    LIGM (UMR 8049), École des Ponts ParisTech, France

  3. Nanne van Noord

    Multimedia Analytics Lab (MultiX), University of Amsterdam, Netherlands

  4. Noa Garcia

    Institute for Datability Science (IDS), Osaka University, Japan

  5. Leonardo Impett

    Cambridge Digital Humanities (CDH), University of Cambridge, UK

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In this tutorial, we look at Computer Vision (CV) approaches developed to investigate Digital Humanities (DH) data, and more specifically fine art and cultural heritage. We will explain what these approaches can achieve and how, with a basic understanding of Python, they can be trained and applied to different types of visual data. By breaking the tutorial into five parts (one per presenter), we will provide an overview of research within CV and its current and future applications within DH. We additionally offer some reflections on the use of Asian data and its limitations and challenges, in light of the extensive ongoing discussion within the CV research community on bias in datasets and collections.

Tutorial Content

Part 1: Retrieval and Knowledge Graphs
The use of CV for distant reading in image collections generally takes the form of retrieval: searching a collection via an example image. To do this, a computational description of each image needs to be generated so that one image can be compared to another. How such a representation is learned is important for building a powerful retrieval system. While pre-trained approaches such as neural networks are useful, they fail to bridge the visual gap between photo-realistic images and typical art or humanities image collections. In this part, we explore how anyone can train a neural network representation specific to their dataset with varying degrees of supervision, and specifically how supervision provided through Knowledge Graphs (or the Semantic Web) can enhance the discriminative power of the representations.
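The retrieval setting described above can be sketched in a few lines: each image is reduced to a descriptor vector, and the collection is ranked by similarity to the query's descriptor. This is a minimal illustration with random vectors standing in for learned neural-network features; the feature dimensions and collection size are arbitrary assumptions, not values from the tutorial.

```python
import numpy as np

def cosine_similarity(query, gallery):
    """Cosine similarity between one query vector and each gallery row."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

# Toy stand-ins for image descriptors (in practice these would come from
# a neural network, ideally fine-tuned on the collection itself).
rng = np.random.default_rng(0)
gallery_features = rng.normal(size=(100, 128))  # 100 images, 128-dim each
query_features = gallery_features[42] + 0.01 * rng.normal(size=128)

scores = cosine_similarity(query_features, gallery_features)
ranking = np.argsort(-scores)  # indices of the best matches first
print(ranking[0])              # the near-duplicate image, index 42
```

Whatever produces the descriptors, retrieval itself reduces to this ranking step; the quality of the results depends almost entirely on how well the representation separates the images that matter for the research question.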

Part 2: Content-based analysis
Most Deep Learning image techniques rely on annotated collections. While these might be available for some well-studied types of documents, they cannot be expected for more specialized studies or sources. Instead, one would have to rely on techniques that do not require training data. This part will discuss several such techniques to establish links between artworks and historical documents, including the use of generic local features, synthetic data, self-supervised learning, and object discovery techniques. In addition, this will include examples of applications for repeated pattern discovery in artwork collections, fine-art alignment, document image segmentation, historical watermark recognition, scientific illustration propagation analysis, and unsupervised Optical Character Recognition. In all cases, it will be shown that standard approaches can give useful baseline results when tuned adequately, but that developing dedicated approaches that take into account the specificity of the data and the problem significantly improves the results.

Part 3: Multi-Task Learning
Multi-Task Learning (MTL) is an increasingly prominent paradigm in CV and in Artificial Intelligence in general. It centers on the ability to perform multiple tasks based on a single input. For instance, it is possible to predict from a single image of an artwork when it was made, by whom, and with what materials. Jointly performing these tasks involves specific modeling choices, resulting in clear benefits (robustness, improved performance), but it also has potential downsides (negative interference, increased complexity). In this part, we show when and how we might want to apply MTL, through a number of use cases, as well as an overview of the technical underpinnings. In addition, we highlight the possibilities MTL provides for interpretability by shedding light on the relations between tasks.
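The common MTL architecture behind such joint predictions is a shared representation feeding several task-specific heads. The sketch below illustrates the structure only: the weights are random, and the two hypothetical tasks (dating period and material, with made-up class counts) are assumptions for illustration, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "backbone": a single linear layer mapping a 128-dim image
# feature to a 64-dim representation used by every task.
W_shared = rng.normal(size=(128, 64))

# Task-specific heads: hypothetical dating-period classifier (10 classes)
# and material classifier (5 classes), both reading the shared features.
W_period = rng.normal(size=(64, 10))
W_material = rng.normal(size=(64, 5))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(image_feature):
    """One forward pass producing predictions for both tasks at once."""
    shared = np.tanh(image_feature @ W_shared)
    return softmax(shared @ W_period), softmax(shared @ W_material)

period_probs, material_probs = predict(rng.normal(size=128))
print(period_probs.shape, material_probs.shape)
```

Because both heads are trained through the same shared weights, improvements on one task can transfer to the other; the flip side is the negative interference mentioned above, when tasks pull the shared representation in incompatible directions.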

Part 4: Automatic interpretation
In CV, visual arts are often studied from an aesthetics perspective, mainly by analyzing the visual appearance of an art reproduction to infer its attributes (author, year of creation, theme, etc.), its representative elements (objects, people, locations, etc.), or to transfer the style across different images. However, understanding an artistic representation involves mastering complex comprehension processes, such as identifying the socio-political context of the artwork or recognizing the artist's main influences. In this part, we will explore fine-art paintings from both a visual and a language perspective. The aim is to bridge the gap between the visual appearance of an artwork and its underlying meaning by jointly analyzing its aesthetics and semantics. We will explore multimodal techniques for interpreting artworks, and we will show how CV approaches can learn to automatically generate descriptions for fine-art paintings in natural language.
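One family of multimodal techniques of this kind maps images and texts into a shared embedding space, so that a painting can be matched against a textual description (or vice versa) by simple similarity. The sketch below assumes such embeddings already exist; the random vectors, dimensions, and collection size are illustrative placeholders, not outputs of any real image or text encoder.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical pre-computed embeddings in a shared visual-textual space
# (in practice produced by jointly trained image and text encoders).
painting_embeddings = rng.normal(size=(50, 64))  # 50 paintings, 64-dim
# A description embedding close to painting 7, simulating a good match.
description_embedding = painting_embeddings[7] + 0.05 * rng.normal(size=64)

# Rank all paintings by similarity to the description.
sims = l2_normalize(painting_embeddings) @ l2_normalize(description_embedding)
best_match = int(np.argmax(sims))
print(best_match)  # painting 7
```

The same shared space supports the reverse direction, retrieving or generating text for a given painting, which is the basis of the description-generation approaches covered in this part.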

Part 5: Using Computer Vision within humanities research
Most models in CV research are built to solve specific problems with measurable outcomes (often tied to a set of reference datasets): pixelwise segmentation, object detection, image captioning, keypoint detection, etc. With many open-source computer vision models for each kind of task, we have a wide horizon of powerful tools at our disposal: yet most of them don’t easily fit with research questions in art history, visual culture studies, or the visual humanities more generally.
By dissecting a series of previous projects in this area, this part will look at how researchers have negotiated these connections, including complex and difficult questions of bias, interpretability, and the epistemology of computational results within the humanities (and especially within cultural history). We will look at several methodological modes compatible with the affordances of CV, including image replication, computational iconography, and the study of visual phenomena captured through notational systems.

Tutorial presenters’ brief bios

Stuart James, Istituto Italiano di Tecnologia (IIT) & UCL DH
Researcher (Assistant Professor) in Computer Vision at the Istituto Italiano di Tecnologia (IIT). Stuart's research focuses on visual reasoning to understand the layout of visual content, from iconography (e.g. sketches) to 3D scene understanding. He is a PI on the MEMEX RIA EU H2020 project and Co-PI on the RePAIR EU FET H2020 project. Stuart previously held PostDoc positions at IIT, University College London (UCL), and the University of Surrey, where he was also awarded his PhD for work on visual information retrieval using sketches. Stuart continues to hold an honorary position at UCL and UCL Digital Humanities.

Mathieu Aubry, École des Ponts ParisTech
Mathieu Aubry is a tenured researcher in the Imagine team of École des Ponts, focusing on Computer Vision and Digital Humanities. He obtained his PhD from ENS in 2015 and his Habilitation (HDR) in 2019. He has had a leading role in several digital humanities projects, such as the Young Researcher EnHerit ANR project on enhancing heritage image databases. He has been a keynote speaker at several venues, most recently the ACM Multimedia 2021 workshop on Structuring and Understanding of Multimedia heritAge Contents (SUMAC). He is an associate editor for CVIU and has been an area chair at numerous CV conferences.

Nanne van Noord, University of Amsterdam
Nanne van Noord is Assistant Professor in the Multimedia Analytics Lab of the University of Amsterdam. His research is focused on the intersection of multimedia analysis and visual culture. He did his PhD at Tilburg University on modeling the artist's style for recognition and reproduction, as part of the NWO Science4arts project Reassessing Vincent van Gogh. He previously worked in The Sensory Moving Image Archive (SEMIA) project, and coordinated the Computer Vision taskforce in the national infrastructure project CLARIAH.

Noa Garcia, Osaka University
Noa Garcia is an Assistant Professor at the Institute for Datability Science at Osaka University. Her research interests lie at the intersection of computer vision, natural language processing, and art. She has been involved in multiple projects related to computer vision for art, with a special focus on language description and interpretation. She moved to Japan in 2018, after completing her Ph.D. in Computer Science at Aston University, United Kingdom.

Leonardo Impett, University of Cambridge
Leonardo Impett is an Assistant Professor of Digital Humanities at the University of Cambridge. His main work has to do with computer vision for the "distant reading" of art history (CS applied to the humanities), and visual studies as a route to understanding computer vision (the humanities applied to CS). He was previously based at Durham University, Harvard University, the Max Planck Institute for Art History, and EPFL.


This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 870743

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO