Evaluating Semantic Relation Alignment in Historical Bibles usingHistorical Embeddings

lightning talk
  1. 1. Maria Berger

    Deutsches Forschungszentrum für Künstliche Intelligenz (German Research Center for Artificial Intelligence) (DFKI GmBH)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Current techniques for typical NLP tasks such as plagiarism detection and paraphrase prediction do not necessarily work well for historical language text due to the heavy modification (caused by cul- tural and linguistic change, induction of errors, se- mantic change in meaning) that historical text is encountered with. To perform reuse (e.g., cita- tions, allusion) detection using machine or deep learning techniques, models are rare and text for a specific time period of genre exists seldom.Previously, we modeled changes in parallel, his- torical English Bibles determining the ratio and type of modification. This work gives an under- standing of the degree, type and partition of mod- ification in historical text, which again is impor- tant to consider for example upfront a stylometry task. Modification can reach from inflection to replacement of semantically related words. Our approach therefore used normalizing tools and looked-up lexical databases (e.g., WordNet and BabelNet Navigli and Ponzetto (2010)) to identity semantic relations (synonymy, hyper(o)nymy, co- hyponymy, see Fig. 1). We found 90% of the rela- tions that two statistically aligned words share.Now, we attempt to relate the last 10% using embedding representations of words from one pre- existing embedding model—based on historical English ranging from 1800 to 1900, henceforth StanHistEmb(1)—and one self-trained model— based on the Royal Society Corpus (Kermes et al., 2016) ranging from 1665 to 1869, henceforth RoySocEmb). We think the latter performs bet- ter being tailored well to the actual historical vo- cabulary. We complete the work with first manual evaluation. Before, Faruqui et al. (2015) refined word vec- tor representations using precise relations of semantic lexicons. Evaluation against shows a per- formance increase of 13% compared to earlier ap- proaches. Naya (2015) use neural models to pre- dict the correct hypernym of a word in a vec- tor space and achieves an accuracy of 20% for predicting whether the correct vector is among the 10 nearest. Weeds et al. (2014) train an SVM on word vector pairs achieving an error- rate-reduction of about 15% and find that distance scores between paired vectors work well for en- tailment tasks while adding vectors is useful for co-hyponymy determination.Data Used: We use aligned Bibles from 1500–1900 (each with each other). Berkeley Aligner (Liang et al., 2006) aligns verse-aligned Bibles on token-level. Figure 1: Modification examples during alignment Recall Using StanHistEmb: First, we calcu- late the cosine distance for two word vectors us- ing the StanHistEmb vectors attempting to retrieve both words from each couple in the model. We find a cosine distance to 260,641 couples, which are normally distributed over the distance scale. Figure 2: Distribution of cosine distance measures of the word couples determined with the StanHistEmb (o) and the RoySocEmb (x) modelRecall Using RoySocEmb: Now, we calcu- late the cosine similarity for two words using a self-trained RoySocEmb model and the Gensim li- brary. We find a cosine score to 302,812, much more than using StanHistEmb. Here, couples are even better normally distributed (Figure 2, curve of x), but values are lower due to a better tailored model, which in the sum is much smaller while supporting more historical vocabulary. Precision Using StanHistEmb: We manually evaluate 138 sample couples that have a cosine similarity(5) > .5. We assign 1.0 when two words are synonyms and 0.0 when there is no semantic relation. Principally, scores from 0.0 to 0.4 represent false positives. While many couples scored above .5 are strongly related or synonym. We not yet manually evaluated sam- ples for RoySocEmb. However, we looked up the same samples using that model and find 119 out of which 69 scored > .5 and 77 scored > .4. How- ever, looking at the skipped couples, they tend to be false positives for high similarity. Table 1: Semantic relation score between word cou- ples filtered using a cosine similarity of > .5 using the StanHistEmb modelForecast: We plan to investigate the selftrained model of Early Modern English more deeply. Specially to see in more detail whether and how the size of the model influences the accu- racy of the words’ relations. Therefore, we con- sider to refine the pre-trained model by retrain- ing and consider joining both models, because the RoySocEmb fits our time, but not our domain.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO