Tutorial on Fuzzy String Matching with DeezyMatch

workshop / tutorial
Authorship
  1. Mariona Coll Ardanuy

    The Alan Turing Institute

  2. Kasra Hosseini

    The Alan Turing Institute

  3. Federico Nanni

    The Alan Turing Institute

  4. Valeria Vitale

    The Alan Turing Institute


Introduction
Fuzzy string matching is a common challenge when linking data in digital humanities projects, which often deal with noisy, historical, or non-standard text (Olieman et al., 2017). Named entities (place names in particular) often appear in a variety of forms, ranging from regional spelling differences to cross-linguistic or diachronic variation. Such variation may stem from a change in the political and cultural context, from a lack of standardization, or from a process of linguistic standardization. Digitized materials add a further, artificial layer of variation, introduced by optical character recognition (OCR) errors (Butler et al., 2017; Coll Ardanuy and Sporleder, 2017; De Wilde and Hengchen, 2017; van Strien et al., 2020).
Several studies have stressed the importance of fuzzy string matching for entity linking, especially in noisy and non-standard text (Coll Ardanuy et al., 2020; De Wilde and Hengchen, 2017; Hachey et al., 2013). However, to date, most entity linking systems rely on either exact or partially overlapping string matching, largely because of the high computation time required by most fuzzy string matching approaches, such as Levenshtein distance (Santos et al., 2018a). In this tutorial, we will introduce DeezyMatch (Hosseini et al., 2020), an open-source, user-friendly Python library for fuzzy string matching and candidate ranking for entity linking, developed in the Living with Machines project (https://livingwithmachines.ac.uk/). DeezyMatch builds and expands on Santos et al. (2018b), an approach to fuzzy string matching that uses a deep learning architecture to classify pairs of toponyms as either potentially referring to the same entity or not. It integrates recent deep learning advances and has been specifically designed to be flexible, user-friendly, and fast, and is therefore ready to be used in real entity linking scenarios.
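To make the cost of edit-distance matching concrete, the following minimal sketch ranks gazetteer entries by Levenshtein distance using only the Python standard library. It is not part of DeezyMatch; the function names, the mini-gazetteer, and the OCR-damaged query are hypothetical. Each comparison fills a dynamic-programming table whose size is the product of the two string lengths, and the query must be compared against every candidate, which becomes slow for large gazetteers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def rank_by_edit_distance(query, candidates, top_n=3):
    """Score every candidate against the query and return the closest ones."""
    scored = [(levenshtein(query, candidate), candidate) for candidate in candidates]
    return sorted(scored)[:top_n]


# Hypothetical mini-gazetteer and OCR-damaged query (long s read as "f").
gazetteer = ["Manchester", "Winchester", "Chester", "Leicester"]
ranking = rank_by_edit_distance("Manchefter", gazetteer)
print(ranking)
```

The linear scan in `rank_by_edit_distance` is the part that does not scale: every new query repeats the full pass over the gazetteer.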
We will demonstrate how DeezyMatch can mitigate the problem of name variation in noisy, historical, or non-standard data. We will show how to create string pair datasets for training and testing a DeezyMatch model, and how trained models can be used to retrieve candidate entities from a gazetteer or knowledge base. By way of motivation, we will discuss real digital humanities examples that require fuzzy string matching and show how DeezyMatch can be used to tackle them. The tutorial will focus on the following case studies:

Case study 1: We will show how a DeezyMatch model can be created from token-level alignments of OCRed text and their manual corrections. We will use the aligned tokens generated by van Strien et al. (2020) from a corpus of OCRed newspaper texts (from the National Library of Australia's Trove digitized newspaper collection) aligned with corrections made by volunteers (Evershed and Fitch, 2014). We will train a DeezyMatch model that learns OCR transformations from newspaper data and show how it can be used to find a match for a given OCRed query from a pool of potential candidates in a specific knowledge base.

Case study 2: We will show how to create DeezyMatch models trained on name variations of places, which will enable us to find the best gazetteer entry for a given query. As an example, we will show how these models can be used to consolidate data about names of heritage locations in Arabic-speaking countries, as in the Heritage Gazetteer of Libya (https://slsgazetteer.org/). Currently, the high level of spelling variation in Arabic place names (across time and transcriptions) makes it difficult to consolidate data held in different archives and collections, which at the moment rely on exact string matching to find connections. We will show how DeezyMatch can make it easier to associate a heritage location with a number of variant names, thus improving the accuracy of data and metadata and facilitating alignment with other knowledge bases such as Wikidata or Geonames.
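As a rough illustration of what a string pair dataset for either case study might look like, the sketch below turns aligned (noisy, reference) tokens into labelled pairs. The aligned tokens are invented examples, the helper names are hypothetical, and the tab-separated layout with TRUE/FALSE labels is an assumption about the training input; the DeezyMatch documentation should be consulted for the exact format it expects. Negative pairs are produced here by simple random sampling, which is only one of several possible strategies.

```python
import random


def build_string_pairs(aligned_tokens, seed=0):
    """Turn (noisy, reference) alignments into labelled string pairs.

    Each alignment yields one positive pair, plus one negative pair made by
    matching the noisy string with a randomly chosen different reference.
    """
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    references = sorted({reference for _, reference in aligned_tokens})
    pairs = []
    for noisy, reference in aligned_tokens:
        pairs.append((noisy, reference, "TRUE"))
        others = [r for r in references if r != reference]
        if others:
            pairs.append((noisy, rng.choice(others), "FALSE"))
    return pairs


def to_tsv(pairs):
    """Serialise pairs as one tab-separated triple per line (assumed layout)."""
    return "\n".join("\t".join(pair) for pair in pairs)


# Invented OCR alignments, not taken from the Trove data.
aligned = [("Melboume", "Melbourne"), ("Sydncy", "Sydney"), ("Adelaidc", "Adelaide")]
pairs = build_string_pairs(aligned)
print(to_tsv(pairs))
```

The same function applies to place-name variants: replace the OCR alignments with (variant name, preferred name) tuples drawn from a gazetteer.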

This is a hands-on tutorial: participants will be shown how to train a DeezyMatch model and use it for candidate ranking. We will allocate time at the end for discussion, including how to adapt DeezyMatch to different digital humanities projects in different languages and time periods.
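The idea behind candidate ranking can be sketched without DeezyMatch itself: represent each gazetteer name as a vector once, then rank entries by their similarity to the query vector. In the sketch below, character bigram counts and cosine similarity stand in for the learned representations of a trained model, so this is a simplified substitute and not the library's method or API. The entry identifiers and variant lists are illustrative and are not drawn from the Heritage Gazetteer of Libya.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Bag of character n-grams, with '#' marking the string boundaries."""
    padded = f"#{text.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(gazetteer):
    """Vectorise every variant name once, so queries reuse the vectors."""
    return {
        entry_id: [char_ngrams(name) for name in names]
        for entry_id, names in gazetteer.items()
    }


def rank_candidates(query, index, top_n=2):
    """Rank entries by the best similarity between the query and any variant."""
    query_vec = char_ngrams(query)
    scored = [
        (max(cosine(query_vec, vec) for vec in vectors), entry_id)
        for entry_id, vectors in index.items()
    ]
    return sorted(scored, reverse=True)[:top_n]


# Illustrative entries with hypothetical identifiers.
gazetteer = {
    "place-001": ["Tripoli", "Tarabulus", "Trablus"],
    "place-002": ["Benghazi", "Bengasi", "Banghazi"],
    "place-003": ["Sabratha", "Sabrata"],
}
index = build_index(gazetteer)
top = rank_candidates("Tarablus", index)
print(top)
```

Because the gazetteer vectors are computed once in `build_index`, each query costs only one vectorisation plus vector comparisons, which is the design choice that makes vector-based candidate ranking attractive for large knowledge bases.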
We will build on the experience gained from delivering two previous tutorials:

December 2020: "Linking and Enriching GeoData through Test and Play: a tutorial on DeezyMatch", as part of the LinkedPasts conference (Mariona Coll Ardanuy, Kasra Hosseini, Katherine McDonough, and Federico Nanni), followed by a round table. It was held virtually, with around 40 participants. Link to the tutorial: https://github.com/LinkedPasts/LaNC-workshop/tree/main/deezymatch

July 2021: "Best practices in collaborative coding and on using GitFlow for data science research", as part of the Digital Humanities & Research Software Engineering virtual summer school, hosted by the Alan Turing Institute. It was held virtually, with 25 participants. The focus was not so much on fuzzy string matching as on collaborative coding. Link to the tutorial: https://github.com/alan-turing-institute/DH-RSE-Summer-School/tree/main/Day%201/gitflow

Outline
This is a half-day tutorial which will cover the following core content:

Part 1: Introduction to DeezyMatch and motivation [60 min]

[10 min] Introduction to fuzzy string matching and entity linking
[30 min] Description of case studies, data collection, and data preparation
[20 min] Overview of DeezyMatch

Part 2: Interactive hands-on session [80 min]

[10 min] Demo 1: candidate ranking using a pre-trained model
[20 min] Hands-on exercise
[10 min] Touch base
[10 min] Demo 2 and hands-on session: DeezyMatch training and candidate ranking
[20 min] Hands-on exercise
[10 min] Touch base

Part 3: Discussion and feedback [40 min]

[20 min] How to adapt DeezyMatch for your project
[20 min] Questions

Instructors

Mariona Coll Ardanuy: Mariona is a computational linguist at the Alan Turing Institute, working on the Living with Machines project. Her research interests lie at the intersection between the humanities and language technology.

Kasra Hosseini: Kasra is a Research Data Scientist at The Alan Turing Institute. He is interested in (artificially) intelligent systems, machine learning, and data analysis and visualisation.

Federico Nanni: Federico is a Research Data Scientist at The Alan Turing Institute. He is a historian by training, and his work explores the intersections between digital humanities, computational social science, and natural language processing.

Valeria Vitale: Valeria is a researcher in the field of digital cultural heritage. She works at the Alan Turing Institute as a Research Associate on the Machines Reading Maps project.

Target audience
Based on past experience, we believe the number of participants should be 20 at most. Participants should have some experience in programming in Python and running scripts, and ideally be interested in entity linking or fuzzy string matching.

Funding statement
This work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1). The Living with Machines project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with the Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.

Bibliography

Butler, J. O., Donaldson, C. E., Taylor, J. E., and Gregory, I. N. (2017). Alts, Abbreviations, and AKAs: historical onomastic variation and automated named entity recognition. Journal of Map & Geography Libraries, 13(1), 58-81.

Coll Ardanuy, M., Hosseini, K., McDonough, K., Krause, A., van Strien, D., and Nanni, F. (2020). A deep learning approach to geographical candidate selection through toponym matching. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems (pp. 385-388).

Coll Ardanuy, M., and Sporleder, C. (2017). Toponym disambiguation in historical documents using semantic and geographic features. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage (pp. 175-180).

De Wilde, M., and Hengchen, S. (2017). Semantic enrichment of a multilingual archive with linked open data. Digital Humanities Quarterly.

Evershed, J., and Fitch, K. (2014). Correcting noisy OCR: Context beats confusion. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (pp. 45-51).

Hachey, B., Radford, W., Nothman, J., Honnibal, M., and Curran, J. R. (2013). Evaluating entity linking with Wikipedia. Artificial Intelligence, 194, 130-150.

Hosseini, K., Nanni, F., and Coll Ardanuy, M. (2020). DeezyMatch: A flexible deep learning approach to fuzzy string matching. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 62-69).

Olieman, A., Beelen, K., van Lange, M., Kamps, J., and Marx, M. (2017). Good applications for crummy entity linkers? The case of corpus selection in digital humanities. In Proceedings of the 13th International Conference on Semantic Systems (pp. 81-88).

Santos, R., Murrieta-Flores, P., and Martins, B. (2018a). Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth, 11(9), 913-938.

Santos, R., Murrieta-Flores, P., Calado, P., and Martins, B. (2018b). Toponym matching through deep neural networks. International Journal of Geographical Information Science, 32(2), 324-348.

van Strien, D., Beelen, K., Coll Ardanuy, M., Hosseini, K., McGillivray, B., and Colavizza, G. (2020). Assessing the impact of OCR quality on downstream NLP tasks. Special Session on Artificial Intelligence and Digital Heritage: Challenges and Opportunities, in Proceedings of the 12th International Conference on Agents and Artificial Intelligence (pp. 484-496).

Conference Info

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO