The Hebrew Bible (the Tanakh) is the most ancient and sacred collection of Jewish texts. Throughout the history, additional religious Jewish texts have been written such as the Mishna, the Babylonian Talmud, and many more. These additional texts are often related to (or inspired by) the Bible. As such, many of them quote verses
The Bible is divided into basic text elements called
from the Bible (as in Figure 1). Depending mostly on their frequency and location within the text, the quotations may indicate a weak or strong semantic relation between a given text and a specific portion of the Bible. Knowing these semantic relations may be beneficial for those interested in studying or investigating the Bible.
Nowadays, a variety of Jewish texts are publicly available over the Internet, yet the identification of Bible quotations within them is often sparse and sometimes entirely absent. Moreover, the existing identification lacks a rigorous representation, which makes it difficult to automatically infer semantic correspondence and to develop supporting software applications.
We report an ongoing project that aims to establish the machinery for the automatic detection and rigorous representation of quotations of Bible verses within Jewish texts. The project consists of three interleaving components. In the first component, an algorithm for identifying Bible quotations in text is developed. In the second, the results of executing the algorithm on a large and open text corpus are represented as a
Linked Data graph (RDF dataset). In the third component, we develop a web frontend for making the dataset accessible to end users. Exposing the data to end users may also engage their participation in data-driven crowdsourcing (Ched et al, 2015), and hence, will serve to collectively help in improving the dataset quality. In what follows, we elaborate on each of the project components.
Quotation detection is gaining popularity in fields such as copyright enforcing and political analysis, and within ancient texts (Ernst-Gerlach and Crane, 2008; Gesche et al, 2016). The algorithms in use share common characteristics, yet each domain brings its own specificities and challenges. Given an input text, our algorithm first matches maximal n-grams
n-gram is a contiguous sequence of n words from a text.
in the text to candidate Bible verses. For example, the green bigram (2-gram) in the first line of Figure 1 will have one matching verse, since its text ('לך לך') appears in exactly one Bible verse. This matching is maximal, since the words that appear before and after the bigram are not part of the quoted verse.
A portion of ancient Jewish Text (from Midrash Raba), that quotes two Bibles verses. Quotations to the same verse are marked in a similar color. Note that each quotation refers only to a part of the verse (1-4 words of it).
A first challenge that we face is related to variations found between the quoting text and the original Bible text, mostly related to the omission (or inclusion) of Hebrew vowel letters. As an example, consider the red quotation in the second line of the figure, that contains the word 'המוריה', where in the original Bible source the 'ו' (vav) vowel is omitted. We have implemented two alternative solutions, one is based on
fuzzy search (Levenshtein distance), and the other on
exact search performed simultaneously on two versions of the Bible, with and without vowels.
Not all verse candidates are valid quotations of Bible verses in the text. For instance, the phrase 'בית אביך' in the third line of the figure (underlined) is a common phrase that appears in eleven different Bible verses. Nevertheless, the phrase is mentioned in a different context, which is not related to any of them. False candidates occur mostly in bigrams and trigrams (3-grams), and the algorithm makes an effort to filter them out. One approach is to keep a candidate if a matching candidate appears in a larger n-gram in the same text. For instance, the green bigrams and trigram shown in the figure are reported as valid quotations since there is a 4-gram that quotes the same verse in the text ('אל הארץ אשר אראך', line 3). We are considering additional filtering approaches related to statistical data inference and machine learning. We are also creating collections of labeled data for a systematic evaluation of the algorithm.
The detected quotations are represented as RDF Linked Data, making them accessible to machines for standard consumption and integration. We use a lightweight ontology that we have defined, augmented with standard properties taken from known ontologies such as RDF, RDFS, and Dublin Core (DC). We are working on the integration of additional ontologies such as CIDOC-CRM, FRBR, and SPAR. Key ontology classes are
Book is composed of
Sections, that may be composed of other
Sections, and eventually of
Text elements. Each Bible verse is a node of type
Text in the RDF graph. To date, our graph contains a total of 23,206
Text nodes of Bibles verses. Additional 355,181
Text nodes represent text elements within other Jewish books (where quotations are searched for). An edge from a
Text node of the latter kind to one of the former kind indicates a 'quotes' relationship. Nodes of class
Quotation hold additional details such as the exact location wherein a quotation appears in the text.
A SPARQL query that retrieves all text elements quoting the first verse of the Bible.
A Linked Data graph may be accessed by expert users using the SPARQL query language. An example SPARQL query is shown in Figure 2. To make our data widely accessible, we have implemented a graphical web frontend that acts like a search engine for Bible verses. A user selects a set of verses from the Bible, and then being presented with all text elements that quote one or more verses from the set. (The elements are retrieved from the RDF graph.) The results are sorted by significance, and may be filtered using predefined categories. We plan to enhance the web interface with data-driven crowdsourcing support, where the crowd will help in improving the accuracy of the algorithm by marking false negatives (places in the text that the algorithm has missed), as well as false positives (incorrect detections). The web tool, as well as the detection algorithm and related artifacts, are accessible via our main
Ched, L. and Lee, D. and Milo, T. (2015).
Data-driven Crowdsourcing: Management, Mining, and Applications. International Conference on Data Engineering (ICDE), Tutorial.
Ernst-Gerlach, A. and Crane, G. (2008).
Identifying Quotations in Reference Works and Primary Materials. Research and Advanced Technology for Digital Libraries, 78-87.
Gesche, S. and Egyed-Zsigmond, E. and Calabretto , S. (2016).
Was it better before? Automated Quotation Detection in Ancient Texts. CORIA-CIFED, 167-182.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Hosted at El Colegio de México, Universidad Nacional Autónoma de México (UNAM) (National Autonomous University of Mexico)
Mexico City, Mexico
June 26, 2018 - June 29, 2018
340 works by 859 authors indexed
Conference website: https://dh2018.adho.org/