The Vectorian API – A Research Framework for Semantic Textual Similarity (STS) Searches

poster / demo / art installation
Authorship
1. Bernhard Liebl

    Universität Leipzig (Leipzig University)

2. Manuel Burghardt

    Universität Leipzig (Leipzig University)

Work text


In the humanities, texts are often quoted, referenced, or alluded to (see Bamman & Crane, 2008). To automatically detect complex cases of such intertextual references, it is not enough to match two texts on a purely lexical level; the semantic level must also be taken into account (see Fig. 1).

Figure 1: Example of an intertextual reference to Shakespeare’s “Macbeth”.
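
To make the distinction concrete, the following minimal Python sketch contrasts purely lexical overlap with embedding-based similarity, using the publicly available spaCy library and its en_core_web_md vectors. The sentence pair is illustrative and not taken from the paper's own example.

    import spacy

    # Requires: python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

    a = nlp("Out, out, brief candle!")
    b = nlp("The short flame of life is extinguished.")

    # Lexical overlap: shared lowercased word forms (punctuation ignored).
    tokens_a = {t.lower_ for t in a if t.is_alpha}
    tokens_b = {t.lower_ for t in b if t.is_alpha}
    jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

    # Semantic similarity: cosine over averaged word vectors.
    print(f"lexical overlap (Jaccard): {jaccard:.2f}")   # 0.00: no shared words
    print(f"embedding similarity:      {a.similarity(b):.2f}")  # clearly > 0

Here the two sentences share no word forms at all, so any lexical matcher scores them as unrelated, while their embedding similarity is clearly positive.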

The task at hand is typically referred to as semantic textual similarity (STS; Cer et al., 2017), and neural text embeddings have long been recognized as a foundational building block of state-of-the-art solutions (Zhelezniak et al., 2020). In recent years, many different approaches to neural embeddings have emerged and have been found suitable for different application scenarios. One line of approaches focuses on providing a single embedding for a span of text, as in the case of sentence embeddings (Wang et al., 2021). Another line of approaches uses high-quality word embeddings and builds algorithms on top of them to operate on spans of tokens. Common approaches in this group are optimal transport (Kusner et al., 2015), fuzzy sets (Zhelezniak et al., 2019), and various statistical approaches (Zhelezniak et al., 2020). Interestingly, most of the existing tools for the detection of intertextuality – for instance Tesserae (Scheirer et al., 2016) or Tracer (Büchler et al., 2014) – do not utilize such neural embeddings at all.
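
As a concrete illustration of the token-level line of approaches, the sketch below computes the Word Mover's Distance between short texts using the off-the-shelf gensim library and pre-trained GloVe vectors. The model choice, tokenization, and example sentences are simplifying assumptions for illustration, not the Vectorian's own pipeline.

    import gensim.downloader as api

    # Pre-trained 50-dimensional GloVe vectors (small, for illustration).
    vectors = api.load("glove-wiki-gigaword-50")

    a = "old men have their whims".split()
    b = "elderly people keep their odd habits".split()
    c = "the stock market fell sharply today".split()

    # Word Mover's Distance: the minimal cost of "moving" the embedded
    # words of one text onto the other (optimal transport; Kusner et al.,
    # 2015). Lower distance = more similar. Recent gensim versions need
    # the POT package installed for this call.
    print(vectors.wmdistance(a, b))  # smaller distance: paraphrase
    print(vectors.wmdistance(a, c))  # larger distance: unrelated topic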

To close this gap, we present the Vectorian, an intertextuality search engine (see Manjavacas et al., 2019) that aims to serve as a research framework for running STS queries with established embedding methods on both the token and the span level. Besides various types of embeddings, the Vectorian can also combine custom alignment algorithms and further NLP operations, such as the weighting of POS tags. Two notable features of the Vectorian framework are the fast instantiation of new search indices on pre-processed corpora – including full support for pre-computed static and contextual embeddings – and a fast, optimized alignment search implementation that scales reasonably well to moderately sized corpora.

In our poster, we present the Vectorian API as a software demonstration. The Vectorian can be used to experiment with various configurations of embeddings and alignments for different tasks of intertextuality detection. Figure 2 shows the overall workflow of the Vectorian API (https://poke1024.github.io/vectorian/index.html).

Figure 2: The Vectorian workflow and core elements of the API.

First, the Importer is used to preprocess text resources. Essentially, the documents are segmented into tokens and sentences. If a document contains additional structural XML markup, the importer can also be customized to parse this information. Moreover, POS tags are annotated using spaCy. The result of this step is a Corpus of segmented and annotated Documents. In the next step, the corpus is enriched with contextual information for each word to provide an additional layer for semantic analyses. This is solved via embeddings: at this point, different TokenEmbeddings are calculated and stored as vectors. The Vectorian implements various static (e.g. fastText, GloVe) as well as contextual (e.g. BERT-based) token embeddings.
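
A rough functional equivalent of this preprocessing step can be sketched directly with spaCy; the Vectorian's own Importer and Corpus classes are not reproduced here, and the input text is illustrative.

    import spacy

    # Segment a raw text into sentences and tokens and annotate POS tags --
    # roughly what the Importer does before a Corpus is assembled.
    nlp = spacy.load("en_core_web_md")
    doc = nlp("Old men have their whims. We read them anyway.")

    for sent in doc.sents:
        for token in sent:
            # token.vector is a static embedding; contextual embeddings
            # (e.g. BERT-based) would come from a transformer model instead.
            print(token.text, token.pos_, token.vector[:3])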

For technical reasons, SentenceEmbeddings are generated in a later step, if required. At this stage, all the necessary steps have been taken to instantiate a Session, which is an optimized in-memory representation of the given corpus and the selected embeddings. The purpose of a session is to generate a searchable Index of the embedded corpus.
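
The essence of such an index can be sketched in a few lines of numpy: store L2-normalized vectors in a matrix so that a query reduces to a single matrix-vector product. This is a generic sketch of the idea, not the Vectorian's actual Session or Index implementation.

    import numpy as np

    class CosineIndex:
        """A minimal in-memory index over embedding vectors."""

        def __init__(self, vectors):
            # Normalize rows so that dot products equal cosine similarities.
            norms = np.linalg.norm(vectors, axis=1, keepdims=True)
            self._matrix = vectors / np.maximum(norms, 1e-12)

        def query(self, vector, k=5):
            # Return the indices and scores of the k most similar rows.
            v = vector / max(np.linalg.norm(vector), 1e-12)
            scores = self._matrix @ v
            top = np.argsort(-scores)[:k]
            return [(int(i), float(scores[i])) for i in top]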

For the similarity comparison in SentenceSim, a similarity measure (e.g. cosine similarity) is defined first. Next, the approach for the actual string comparison is chosen. This can be a local, global, or semi-global Alignment approach (Aluru, 2005) with variable gap costs, or the Word Mover’s Distance (Kusner et al., 2015). Finally, there is an option to use the previously annotated POS tags as additional weights. The idea of POS weights is based on Batanović & Bojić (2015) and ensures that differing tokens that share the same POS are rated as more similar than tokens with a POS mismatch. SentenceSim also allows for an entirely alternative approach to comparing the query and document partitions, namely one that utilizes the aforementioned SentenceEmbeddings. With this approach, sentences are represented as embedding vectors of their own, which means that similarity can be assessed directly by comparing sentence vectors.
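
The interplay of token similarities, gap costs, and POS weights can be illustrated with a generic Smith-Waterman-style local alignment over a precomputed similarity matrix. The gap cost and POS penalty values below are illustrative assumptions; the Vectorian's own alignment implementation is more general.

    import numpy as np

    def pos_weighted_similarity(cosine, same_pos, penalty=0.5):
        # Down-weight token pairs whose POS tags differ
        # (cf. Batanović & Bojić, 2015); 0.5 is an arbitrary choice.
        return np.where(same_pos, cosine, cosine * penalty)

    def local_alignment_score(sim, gap_cost=0.3):
        # Smith-Waterman-style local alignment over a query x document
        # token similarity matrix, with a linear gap cost.
        n, m = sim.shape
        h = np.zeros((n + 1, m + 1))
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                h[i, j] = max(
                    0.0,                                  # start a new local match
                    h[i - 1, j - 1] + sim[i - 1, j - 1],  # align token i with j
                    h[i - 1, j] - gap_cost,               # skip a query token
                    h[i, j - 1] - gap_cost,               # skip a document token
                )
        return h.max()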

Once the index has been created, a Query can be run against the previously created corpus. Figure 3 shows an example query for the Shakespeare phrase “old men’s crotchets”. Two example results retrieved by the Vectorian are also provided. These results illustrate how the Vectorian evaluates every word according to the selected embeddings and then provides a score for its match to the original query.

Figure 3: Example results for a query matched against the predefined corpus through the Vectorian API.
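
The alternative SentenceEmbeddings route to answering such a query can be approximated with the public sentence-transformers library. The model choice and the candidate sentences below are illustrative assumptions, not the corpus or results shown in Figure 3.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "old men's crotchets"
    candidates = [
        "elderly people and their odd little habits",
        "the ship sailed at dawn",
    ]

    # Encode query and candidates once, then rank by cosine similarity.
    q = model.encode(query, convert_to_tensor=True)
    c = model.encode(candidates, convert_to_tensor=True)
    for text, score in zip(candidates, util.cos_sim(q, c)[0].tolist()):
        print(f"{score:.3f}  {text}")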

With our poster, we hope to spark discussion with the DH community on how to apply and further develop the Vectorian API, which we believe will be a useful resource for many kinds of intertextuality research in the digital humanities.

Bibliography
Aluru, S. (Ed.). (2005). Handbook of Computational Molecular Biology. Chapman and Hall/CRC. https://doi.org/10.1201/9781420036275
Bamman, D. & Crane, G. (2008). The logic and discovery of textual allusion. In Proceedings of the 2008 LREC Workshop on Language Technology for Cultural Heritage Data.
Batanović, V. & Bojić, D. (2015). Using Part-of-Speech Tags as Deep Syntax Indicators in Determining Short Text Semantic Similarity. Computer Science and Information Systems, 12(1), 1–31.
Büchler, M., Burns, P. R., Müller, M., Franzini, E., & Franzini, G. (2014). Towards a historical text re-use detection. In Text Mining (pp. 221–238). Springer, Cham.
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. https://doi.org/10.18653/v1/S17-2001
Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From Word Embeddings to Document Distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Manjavacas, E., Long, B., & Kestemont, M. (2019). On the Feasibility of Automated Detection of Allusive Text Reuse. arXiv:1905.02973 [cs], May.
Scheirer, W., Forstall, C., & Coffee, N. (2016). The sense of a connection: Automatic tracing of intertextuality by meaning. Digital Scholarship in the Humanities, 31(1), 204-21.
Wang, K., Reimers, N., & Gurevych, I. (2021). TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning. In Findings of the Association for Computational Linguistics: EMNLP 2021.
Zhelezniak, V., Savkov, A., Shen, A., Moramarco, F., Flann, J., & Hammerla, N. Y. (2019). Don’t settle for average, go for the max: Fuzzy sets and max-pooled word vectors. In Proceedings of ICLR 2019.
Zhelezniak, V., Savkov, A., & Hammerla, N. (2020). Estimating Mutual Information Between Dense Word Embeddings. In Proceedings of ACL 2020.


Conference Info


ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO