Transfer Learning for Olfactory Object Detection:

paper, specified "long paper"
  1. 1. Mathias Zinnen

    Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg

  2. 2. Prathmesh Madhu

    Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg

  3. 3. Peter Bell

    Philipps-Universität Marburg

  4. 4. Andreas Maier

    Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg

  5. 5. Vincent Christlein

    Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Smells are an important, yet overlooked part of cultural heritage (Bembibre and Strlič, 2017). TheOdeuropa project analyzes large amounts of visual and textual corpora toinvestigate the cultural dimensions of smell in 16th – 20th century Europe.The study of pictorial representations bears a specific challenge: the substrate of smell is usually invisible (Marx, 2021). Object detection is a well-researched computer vision technique, and sowe start with the recognition of objects, which may then serve as a basisfor the indirect recognition of more complex, and possibly more meaningful,smell references such as gestures, spaces, or iconographic allusions (Zinnen, 2021).

Figure 1: Smell. The Five Senses. 1558 – 1617. Jan Pietersz Saenredam. National Gallery of Art.
However, the detection of olfactory objects in historical artworks is achallenging task. The visual representation of objects differs significantlybetween artworks and photographs (Hall et al., 2015). Since state-of-the-art object detectionalgorithms are trained and evaluated on large-scale photographic datasetssuch as ImageNet (Russakovsky et al., 2015), MS COCO (Lin et al., 2014), or OpenImages (Kuznetsova et al., 2020), their performancedrops significantly when applied to artistic data. This domain gap betweenstandard object detection datasets and artistic imagery can be mitigatedby training directly on artworks, either by using existing datasets or bycreating an annotated dataset for the target domain. Another challengeis the mismatch between object categories present in modern datasets andhistorical olfactory objects, caused by historical diachrony on the one hand (Marinescu et al., 2020), and the particularity of some smell-relevant objects on the other (Ehrich et al., 2021).

To overcome the domain gap and category mismatch between our applicationand the existing datasets, we apply transfer learning – a training strategy where machine learning algorithms are pre-trained in one domain and then
fine-tuned in another, greatly decreasing the amount of required training
data in the target domain (Sinno and Yang, 2009; Zhuan et al., 2020, Madhu et al., 2020).
We are continuously collecting and annotating artworks with possible
olfactory relevance from multiple museum collections. Based on these, we
created a dataset of olfactory artworks containing 16728 annotations on
2229 artworks. From this full set of annotations, we created a test set of
3416 annotations on 473 artworks, while the remaining data was used for

Figure 2: Category overlap between Odeuropa & OpenImages categories

A common transfer learning procedure is to use detection backbones that
have been pre-trained on ImageNet and fine-tune them for object detection
(Zhuang et al., 2020). We expand this strategy by an additional pre-training step, where we
train an ImageNet pre-trained object detection network (Ren et al., 2015) using different
datasets. Finally, we fine-tune the resulting model using our olfactory
artworks dataset (fig. 3).

For pre-training, we use three

Figure 3: Category overlap between Odeuropa & OpenImages categories
different datasets, deviating to varying
amounts from our olfactory artworks dataset in terms of categories and style(table 1): a) Same Categories, Different Styles - A subset of OpenImages (OI)
containing only odor objects results in a complete category match (fig. 2);
however, since OpenImages contains only photographs, there is a considerable
style difference. b) Different Categories, Same Styles - We apply two object
detection datasets from the art domain, which are more similar in terms of
style but contain different object categories, namely IconArt (IA,Gonthier et al., 2018) and
PeopleArt (PA,Westlake et al., 2016).

Table 2: Evaluation of object detection performance. The best performing model pre-trained with OI achieves an improvement of 6.5% pascal VOC mAP, and 3.4% COCO mAP over the baseline method without intermediate training. We report the evaluation for each pre-training dataset, averaged over five models, fine-tuned for 50 epochs on our olfactory artworks datasets. Best evaluation results are highlighted in bold. The merge of two datasets D1 and D2 is written as D1 ∪ D2.

Domain Similarity
Category Similarity
# Categories

OpenImages (OI)
Complete match

IconArt (IA)

PeopleArt (PA)

To ensure a fair comparison between the different pre-training datasets, we
reduce each of the datasets to the same size, train three models, and select
the best according to a fixed validation set for each dataset. Additionally,
we merge all three datasets, i. e., combining OI, IA, and PA, using the union
over their respective classes. The resulting models are then fine-tuned on the

training set of the olfactory artworks dataset and evaluated on a separate
test set. To mitigate random variations that can occur during the training
process, we train five separate models for each experimental setting and
report their average. Evaluation results are reported in pascal VOC (mAP 50, Everingham et al., 2010) and COCO mAP (mAP 50:95:5 (Lin et al., 2014), the two standard metrics to evaluate
object detection models. We conduct two separate sets of experiments: In
the first, we fine-tune the whole network, including the backbone, to assess
the detection performance under realistic conditions (table 2).

Table 3: Evaluation of object detection performance for fine-tuning of thedetection heads only. All pre-training schemes increase the detection perfor-mance, while pre-training with OI leads to the best results with an increaseof 7.7% mAP 50 or 4% COCO mAP. For every pre-training dataset, we report the evaluation averaged over five models, fine-tuned for 50 epochs onour olfactory artworks datasets each. Best evaluation results are marked in bold. The merge of two datasets D1 and D2 is written as D1 ∪ D2.

Pretraining Dataset
Pascal mAP (%)
COCO mAP (%)

None (Baseline)
16.8 (±1.3)
8.4 (±0.4)


23.3 (±0.5)

.8 (±0.4)

22.6 (±1.2)
10.9 (±0.9)

21.9 (±0.4)
10.5 (±0.2)



21.8 (±0.1)
10.5 (±0.3)



22.0 (±0.8)
10.6 (±0.3)



22.6 (±0.3)
10.8 (±0.2)



21.8 (±0.4)
10.5 (±0.2)

Figure 4: Figure 4: Exemplary object predictions for a detection model without intermediate training (left), with PeopleArt pretraining (middle), and ground truth bounding boxes (right). Painting: Boy holding a pewter tankard, by a still life of a duck, cheeses, bread and a herring. 1625 – 1674. Gerard van Honthorst. RKD Digital Collection (
We observe
a performance increase for all used pre-training datasets, with an increase
of 6.5%/3,4% boost in mAP 50 and COCO mAP, respectively, for the best
performing pre-training scheme, which was achieved using the OI dataset.
The exemplary object predictions in fig. 4 show that adding an additional
pre-training stage can increase the number of recognized objects.
In a second set of experiments, we train only the detection head while
the backbone remains frozen, to compare the quality of the intermediate
representations that have been learned using the different pre-training schemes
(table 3).

Pretraining Dataset
Pascal mAP (%)
COCO mAP (%)

None (Baseline)
11.7 (±0.2)
5.5 (±0.1)


19.4 (±0.3)

9.5 (±0.1)

13.8 (±0.4)
6.4 (±0.2)

13.5 (±0.2)
6.7 (±0.1)



16.0 (±0.3)
7.4 (±0.2)



14.6 (±1.0)
6.7 (±0.5)



15.8 (±0.7)
7.3 (±0.4)



16.4 (±0.6)
7.6 (±0.2)

While all pre-training schemes increase the performance, the
relative increase for the OI dataset is remarkably higher. This suggests that
the style similarity between the IA and PA datasets and our target dataset
is less important than we expected. We can not yet conclude whether the
superior performance of the OI dataset is due to the similarity in target
categories. It could also be caused by other properties of the dataset. Further
ablations, e. g., varying the set of OI categories are needed to more precisely
assess the impact of category similarity on the detection performance, which
we plan to conduct in a follow-up study. Interestingly, the performance
of the merged datasets increases even in cases where OI is not part of the
dataset merge. Given that we did not apply a sophisticated merging strategy,the performance increase for training with merged datasets is encouraging.
Developing strategies to improve the consistency of the merged dataset, e. g.,
weak labeling of categories not present in the respective merge partners,
represents another promising line of future research.
We conclude that including an additional stage of object-detection
pre-training can lead to a considerable increase in detection performance. While
our experiments suggest that style similarities between pre-training and target
dataset are less important than matching categories, further experiments are
needed to verify this hypothesis.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004469


Bembibre, C. and Strlič, M. (2017). Smell of heritage: a framework for the identification, analysis and archival of historic odours.
Heritage Science,
5(2): 1–11.

Ehrich, S. C., Verbeek, C., Zinnen, M., Marx, L., Bembibre, C. and Leemans, I. (2021). Nose-First. Towards an Olfactory Gaze for Digital Art History.
2021 Workshops and Tutorials-Language Data and Knowledge, LDK 2021. pp. 1–17.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J. and Zisserman, A. (2010). The pascal visual object classes (voc) challenge.
International Journal of Computer Vision,
88(2). Springer: 303–38.

Gonthier, N., Gousseau, Y., Ladjal, S. and Bonfait, O. (2019). Weakly Supervised Object Detection in Artworks. In Leal-Taixé, L. and Roth, S. (eds),
Computer Vision – ECCV 2018 Workshops, vol. 11130. (Lecture Notes in Computer Science). Cham: Springer International Publishing, pp. 692–709 doi:
(accessed 11 April 2022).

Hall, P., Cai, H., Wu, Q. and Corradi, T. (2015). Cross-depiction problem: Recognition and synthesis of photographs and artwork.
Computational Visual Media,
1(2). Springer: 91–103.

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., et al. (2020). The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale.
International Journal of Computer Vision,
128(7): 1956–81 doi:

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In Fleet, D., Pajdla, T., Schiele, B. and Tuytelaars, T. (eds),
Computer Vision – ECCV 2014, vol. 8693. (Lecture Notes in Computer Science). Cham: Springer International Publishing, pp. 740–55 doi:
(accessed 13 September 2021).

Madhu, P., Villar-Corrales, A., Kosti, R., Bendschus, T., Reinhardt, C., Bell, P., Maier, A. and Christlein, V. (2020). Enhancing human pose estimation in ancient vase paintings via perceptually-grounded style transfer learning.
ArXiv Preprint ArXiv:2012.05616.

Marinescu, M.-C., Reshetnikov, A. and Lopez, J. M. (2020). Improving object detection in paintings based on time contexts.
2020 International Conference on Data Mining Workshops (ICDMW). Sorrento, Italy: IEEE, pp. 926–32 doi:
(accessed 9 June 2021).

Marx, Lizzie (2021). Perfume and books of secret.
Exhibition Catalogue Mauritshuis, The Hague.

Pan, S. J. and Yang, Q. (2009). A survey on transfer learning.
IEEE Transactions on Knowledge and Data Engineering,
22(10). IEEE: 1345–59.

Ren, S., He, K., Girshick, R. and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks.
Advances in Neural Information Processing Systems,

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., et al. (2015). ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision,
115(3): 211–52 doi:

Westlake, N., Cai, H. and Hall, P. (2016). Detecting people in artwork with cnns.
European Conference on Computer Vision. Springer, pp. 825–41.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H. and He, Q. (2020). A comprehensive survey on transfer learning.
Proceedings of the IEEE,
109(1). IEEE: 43–76.

Zinnen, M. (2021). How to See Smells: Extracting Olfactory References from Artworks.
Companion Proceedings of the Web Conference 2021. Ljubljana Slovenia: ACM, pp. 725–26 doi:
(accessed 16 September 2021).

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website:

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO