Mining an Interpolated Commentary for Linguistic Markup

paper, specified "long paper"
Authorship
  1. 1. Joshua Waxman

    Stern College for Women, Yeshiva University, United States of America

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Introduction
The Babylonian Talmud is a multi-volume legal, ethical, narrative, and religious work containing scholastic discourse spanning several centuries and two countries - Iraq and Israel. It is written in a mixture of Biblical Hebrew, Biblical Aramaic, Middle Hebrew and Babylonian Aramaic, all Semitic languages written in Hebrew characters. The Talmud is the focus of religious and academic study, and there have been several Talmud-related digital humanities projects in recent years, e.g. (Waxman, 2021a; Satlow and Sperling, forthcoming; Zhitomirsky-Geffet and Prebor, 2019; Jutan and Regenbaum, 2019).
However, the Talmudic text is difficult for both humans and machines to parse, for several reasons. The Hebrew alphabet is consonantal; vowels are represented by optional diacritics, which are missing in the Talmud. Hebrew words are highly inflected, and it’s sometimes difficult to separate stem from morphology. בצל could mean “in shade”, “in the shade”, “onion”, or “onion of”. Each word is thus heavily ambiguous, and each sentence exponentially so. The Talmudic corpus contains more than 1.8 million words with a unique linguistic profile. Its multiple languages, each with unique vocabulary and word forms, introduces further ambiguity. It’s unpunctuated, so sentence and phrase boundaries are uncertain, introducing ambiguity into dependency or constituent parses and NLP tasks such as NER and relation extraction. The Talmud frequently code-switches between its four Semitic languages, increasing ambiguity / incomprehensibility. Our prior Digital Humanities work (Waxman, 2021a) compensated by focusing primarily on an aligned English translation and projected to the Hebrew side, but there are good reasons to focus on the original text rather than translation. How can we build richly-tagged Talmudic corpora and associated linguistic tools?
Sefaria, an open-source Jewish library, includes a digital edition of the Talmud. They segmented the text of the Talmud into paragraphs, and aligned many rabbinic commentaries to these paragraphs. Among these commentaries are Steinsaltz’s English commentary and his Modern Hebrew Commentary. Each commentary is “interpolated”, meaning that there is the literal Talmudic text (bolded), which is extremely concise in style, as well as elaborative gloss text (unbolded), turning brief cryptic statements into well developed ideas and sentences. One such interpolated paragraph, drawn from Taanit 17a, appears in Figure 1.

The Hebrew interpolated commentary doesn’t precisely match Talmud’s original mix of Hebrew and Aramaic. Sometimes, it draws from variant textual sources, so words and phrases are missing. Since Steinsaltz’s goal is to produce flowing Hebrew sentences, in Hebrew / Aramaic hybrid words and cognates, he’ll sometimes replace Aramaic prefixes with their Hebrew equivalent. Thus the Talmud’s דאסור became שאסור in Figure 1. Some abbreviations and acronyms are expanded in the commentary. Still, we realized that we could word-align the Talmud with commentary and project linguistic features to create a rich Talmudic text.

Approach
We first tokenized the Hebrew commentary, separating off punctuation and gloss, and word-aligned it with the original Talmudic text. Match-reward was based on overlap of character bigrams of the source and target words.
We projected punctuation from the Hebrew commentary to the Talmudic text, recovering punctuation embedded within the gloss and applying a heuristic by which stronger punctuation replaced weaker punctuation (e.g. period for comma). By analogy, if the bolded English words in Figure 1 were Talmudic text, we would produce the sentence:
If so, even nowadays as well. and matching pairs of quotation marks surrounding literal text are selected for projection as well. We segmented the now-punctuated sentences using the projected periods, question marks, exclamation marks, and interrobangs (Waxman, 2022).

Steinsaltz deals with arcane Middle Hebrew words / phrases by bolding the original Hebrew and following it with unbolded Modern Hebrew translation, in parentheses. For Aramaic words / phrases, he gives the bolded original and provides the Modern Hebrew in brackets, with literal translation in small text (Figure 1) and glossed prefixes and phrases in regular-sized text. He also employs parentheses to source Biblical verses.
We next developed a rough heuristic to mark Talmudic words for language. Words default to Hebrew. When encountering open parentheses, if the following word is a Hebrew Biblical book, the words until the open quote are marked as Biblical Hebrew. If a partly Aramaic book, the quoted phrase is marked as Biblical Aramaic. When encountering open brackets, the preceding phrase until punctuation is marked as Aramaic. We trained an RNN-CRF model on this corpus to perform language identification for new texts (Waxman, 2021b).

Our new work builds on this. We produce an aligned translation corpus. Words are generally word-aligned, but we align full parenthesized / bracketed phrases to align to words / phrases (see Figure 2). For instance, the Aramaic phrase אי הכי, meaning “if so”, is paired with the equivalent Hebrew phrase אם כך, and the individual Aramaic word יין, “wine” is paired with its equivalent Hebrew יין. Some errors are present, for instance the prefix ד, "of" in דשתויי, wasn’t present in the Hebrew commentary. This translation corpus is input for our POS tagging. We employ Dicta’s API (
Shmidman et al., 2020) to obtain ranked POS and morphological analyses for the original Talmudic text, selecting the top Aramaic choice when we know the word is Aramaic. We also POS-tag the translated Hebrew side. Figure 3 shows an Aramaic phrase, where אין means “yes/indeed”, the same as Hebrew כן. Where there is overlap of POS options, we’ve eliminated some ambiguity and rerank the POS selections.

Finally, we perform discourse classification of our segmented sentences, to classify sentences as statements, questions, or answers. Features include sentence-initial and final words / phrases, punctuation, and commentary’s framing words such as ושואלים / ומשיבים (“they ask / answer” - see the first Hebrew word in Figure 1). Future work involves tweaking heuristics to handle more edge cases and improve accuracy, particularly in quoted text, and developing semantic grammars to discover named entities / relations from our punctuated sentences.

Results
Our punctuation results are generally in the 90-95% range for precision and recall (see Table 1). Quotes have 95.6% recall and 93.1% precision.

Table 2 shows accuracy for our language identification process, using a CRF model. Using an RNN-CRF model we achieve 95% accuracy (93% on non-punctuation). Our POS tagging / discourse analysis are harder to assess on a large scale, but we’ll publish the resulting corpora.

Bibliography

Jutan, D. and Regenbaum, S. (2019). The Future of the Talmud: A Digital Humanities Case Study. Georgia Tech.

https://dilac.iac.gatech.edu/node/66

Satlow M. and Sperling, M. (forthcoming). The Rabbinic Network. In
AJS Review.

Shmidman, A., Shmidman, S., Koppel, M., and Goldberg, Y.
(2020). Nakdan: Professional Hebrew Diacritizer.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
pp. 197-203:

Waxman, J. (2021a). A graph database of scholastic relationships in the Babylonian Talmud. In Quené et al. (eds),
Digital Scholarship in the Humanities. Oxford University Press, pp. ii277–ii289.

Waxman, J. (
2021b). Projecting and Detecting Code Switching in the Babylonian Talmud. Presentation at
ACH 2021.

Waxman, J. (2022). Projecting Punctuation from an Interpolated Translation and Commentary. In Chesner et al (eds),
Jewish Studies in the Digital Age
,

Studies in Digital History and Hermeneutics

. De Gruyter.

Zhitomirsky-Geffet, M. and Prebor, G. (2019). SageBook: Toward a cross-generational social network for the Jewish sages’ prosopography. In
Digital Scholarship in the Humanities. Oxford University Press, pp.
676–695.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO