Finding Inexact Quotations Within a Tibetan Buddhist Corpus

poster / demo / art installation
Authorship
  1. Benjamin Eliot Klein

    Tel-Aviv University

  2. Nachum Dershowitz

    Tel-Aviv University

  3. Lior Wolf

    Tel-Aviv University

  4. Orna Almogi

    Universität Hamburg (University of Hamburg)

  5. Dorji Wangchuk

    Universität Hamburg (University of Hamburg)

1 Introduction
One thing that literary scholars routinely look for – regardless of the specific field – is textual citations, where one work quotes or paraphrases another work. In historical works, even quotations are frequently quite inexact. To complicate matters further, there is often no clear indication that a passage is being quoted, let alone which work is being cited. It is, therefore, only natural to use algorithmic tools to search for such occurrences in texts and present the results to scholars for consideration.

One such corpus is the Tibetan Buddhist canon. Altogether, there are more than 300 volumes, averaging about 800 pages (400 folios) of 200 words each. In addition to the canon, there are many other important collections in the Tibetan Buddhist literary corpus. And, of course, there are many Buddhist corpora in other languages. Scholarly editions are still wanting for many of these works, so we have set out to design computerized tools to help deal with the masses of data.

The most relevant previous work is by Prasad and Rao [2], who search for citations within Sanskrit texts. They break the text into units (lines, say) and then compare each potential citation with each unit in the cited corpus using approximate matching (Smith-Waterman-Gotoh). To constrain the search, they first sort the units, so that they need only compare units that begin similarly. We approach the problem of reducing the complexity of the search differently. We borrowed an algorithm designed for finding all (sufficiently long) approximate subsequence matches in genomic data and adapted it for finding common approximate subtexts between two large corpora. This involved parallelizing the algorithm, adding some simple preprocessing, and adding some less trivial post-processing.

It is important to distinguish between various string-matching tasks. Given two or more passages known to be similar, or several recensions of the same work, one can seek the best alignment. The task we address, finding all approximately similar texts, short or long, that appear at arbitrary locations within large corpora, is vastly different from the alignment tasks that Juxta (available at http://www.juxtasoftware.org/) and CollateX (http://collatex.net/) help solve. In order to address the reviewers’ concerns, we tried these tools. Juxta's results on our texts were completely irrelevant; CollateX, after running for a long time, produced an empty output.

2 The Corpus
In this exploratory work, we compared two major Buddhist texts, transliterated into Latin characters from the Tibetan: the Sūtrasamuccaya and the Śikṣāsamuccaya. Each is over 150,000 words long.

The Sūtrasamuccaya. The “Compendium of [Citations from Mahāyāna] Sūtras” is a compilation ascribed to the famous Nāgārjuna (2nd century CE). The text has survived only in its Tibetan (P5330, D3934) and Chinese translations. It is an important source for early Mahāyāna sūtras, and it is invaluable for the study of the early phase of Mahāyāna. The Tibetan translation was done by the famous Tibetan translator Ye-shes-sde (8th century), in collaboration with Jinamitra and Śīlendrabodhi.

The Śikṣāsamuccaya. The “Compendium of Teachings” (i.e., citations from mostly early Mahāyāna sūtras) was compiled by the famed Indian scholar Śāntideva (7th c.). It was translated into Tibetan by the same translator, Ye-shes-sde (8th c.), in collaboration with Jinamitra and Dānaśīla. Later, the translation was revised by the Tibetan translator Blo ldan shes rab (1059–1109) in collaboration with the Kāśmirian scholar Tilakakalaśa.

3 Method
We designed an algorithm to solve the problem of finding local regions with high similarity in the two texts. The main workhorse is an efficient algorithm for solving the “threshold all against all” variant of the problem, based on that of Barsky et al. [1]. It finds all pairs of maximal substrings of the two texts that have at least some minimal length (here L0 = 60) and whose edit distance is bounded by a given value (here k = 10).
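For intuition only, the matching criterion can be illustrated with a brute-force sketch. This is not the graph-based algorithm of Barsky et al. [1]: it compares fixed-length windows rather than maximal substrings, and its running time makes it unusable on texts of this size. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


def naive_window_matches(s: str, t: str, min_len: int, k: int):
    """Yield (i, j) where s[i:i+min_len] and t[j:j+min_len] are within k edits.

    Illustrative baseline only: it calls edit_distance once per pair of
    window positions, so it is far too slow for large corpora.
    """
    for i in range(len(s) - min_len + 1):
        for j in range(len(t) - min_len + 1):
            if edit_distance(s[i:i + min_len], t[j:j + min_len]) <= k:
                yield i, j
```

With the parameters reported above, one would call `naive_window_matches(s, t, min_len=60, k=10)`; the cost of doing so on full texts is what motivates the specialized algorithm.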

We worked with transliterations of the Tibetan texts. The texts contain multiple spaces, line breaks, page numbers, punctuation marks and the like. In a preprocessing step, we clean the texts by removing all such material.
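A cleaning step of this kind might look as follows. This is a minimal sketch, not the actual preprocessing: the exact rules are not specified here, and the choice to keep the apostrophe (which stands for a letter in transliterated words such as 'thor) is an assumption, as is the function name.

```python
import re


def clean_transliteration(raw: str) -> str:
    """Strip numbers and punctuation and normalize whitespace (assumed rules)."""
    text = re.sub(r"\d+", " ", raw)            # page and line numbers
    text = re.sub(r"[^\w\s'’]", " ", text)     # punctuation, keeping apostrophes
    text = re.sub(r"\s+", " ", text)           # runs of spaces and line breaks
    return text.strip()
```

For example, `clean_transliteration("gtor  ba | 12\n'thor/ ba.")` yields `"gtor ba 'thor ba"`.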

The texts we are using are very long and therefore we created a parallel version, running on large overlapping chunks of length l = 25000 on a cluster of processor cores. After collecting all the results, some post-processing steps are required in order to build a non-redundant and meaningful collection of local regions with high similarity. Splitting increases the quantity of overlapping results. In addition, some results are very near to each other and should be merged into a longer match. We address these issues by uniting every pair of overlapping or nearby results.
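The splitting and merging steps can be sketched as follows. The representation of a match as a tuple `(a_start, a_end, b_start, b_end)` of half-open character ranges, the `gap` tolerance, and both function names are assumptions made for illustration; the actual merging criteria may differ.

```python
def overlapping_chunks(text, size, overlap):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield start, text[start:start + size]


def _near(lo1, hi1, lo2, hi2, gap):
    """True if two ranges overlap or lie within `gap` characters of each other."""
    return lo1 <= hi2 + gap and lo2 <= hi1 + gap


def merge_matches(matches, gap):
    """Unite matches whose ranges are overlapping or nearby in BOTH texts."""
    merged = []
    for m in sorted(matches):
        absorbed = True
        while absorbed:  # keep absorbing until m touches nothing in `merged`
            absorbed = False
            for other in merged:
                if (_near(m[0], m[1], other[0], other[1], gap)
                        and _near(m[2], m[3], other[2], other[3], gap)):
                    m = (min(m[0], other[0]), max(m[1], other[1]),
                         min(m[2], other[2]), max(m[3], other[3]))
                    merged.remove(other)
                    absorbed = True
                    break
        merged.append(m)
    return merged
```

Results found within a chunk would have the chunk offset added to their positions before merging, so that all matches refer to positions in the full texts.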

A second problem arises when a meaningful result is shorter than the length threshold and has a very small edit distance. In this scenario, the algorithm extends the result until the minimal length constraint is satisfied, yielding a less meaningful result. We solve this issue by applying local alignment to each result, which removes these uninformative extensions.
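The trimming idea can be illustrated with a textbook Smith-Waterman local alignment, which returns the best-scoring pair of substrings and discards poorly matching flanks. The scoring values below are arbitrary illustrative choices, not the ones used in this work, and the function name is hypothetical.

```python
def local_alignment_span(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, (a_start, a_end), (b_start, b_end)) of the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])
```

Applied to a result that was padded to reach the minimal length, the returned spans cover only the part that actually matches.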

From the final collection of matched substrings we get two main outcomes. The first, more obvious one is the set of local regions of high similarity in the two texts. We built a dedicated interface that has tools for investigating the matches. It presents the two texts side by side and a list of all the matches in descending order of their edit distance. Using it, one can focus on a specific match, and see the relevant substrings in both texts. The substrings are also presented in another window, aligned to each other for convenient comparison. Additionally, by selecting a substring in one of the texts, one can see all the matches that overlap with the selection. See the screenshot in Fig. 4.

The second outcome is computing statistics on all the results. This requires us to carefully align each result, as the quality of the statistics depends on the alignment of each single word. The alignment is done by a variant of global alignment that penalizes gaps that occur between words (or at one of the ends of the string) differently from gaps that occur within a word. This simple alignment allows us to derive meaningful statistics on differences between collections.
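One way to realize such a position-dependent gap penalty is sketched below. It is an assumption-laden illustration rather than the variant actually used: the cost values, the definition of a word boundary (start, end, or adjacent to a space), and the function names are all hypothetical.

```python
def is_word_boundary(s, pos):
    """True if the slot before s[pos] is at an end of s or next to a space."""
    return pos == 0 or pos == len(s) or s[pos - 1] == " " or s[pos] == " "


def align_word_aware(a, b, sub=1, gap_within=2, gap_between=1):
    """Global alignment; gaps opened at word boundaries are cheaper (assumed costs)."""
    def gap(other, pos):
        return gap_between if is_word_boundary(other, pos) else gap_within

    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap(b, 0)
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap(a, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (sub if a[i - 1] != b[j - 1] else 0),
                cost[i - 1][j] + gap(b, j),   # a[i-1] faces a gap placed in b at slot j
                cost[i][j - 1] + gap(a, i),   # b[j-1] faces a gap placed in a at slot i
            )
    # Trace back to recover the aligned strings, using "-" for gaps.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (sub if a[i - 1] != b[j - 1] else 0)):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap(b, j):
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these costs, a whole omitted word is charged at the cheaper between-word rate, so the alignment tends to keep words intact rather than splitting them around a gap.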

4 Results
Overall, 2514 matches were found between the texts. These matches cover a significant fraction of the texts: 9.15% of the Sūtrasamuccaya and 10.85% of the Śikṣāsamuccaya consist of regions for which at least one match in the other text was found. Some of the matches are quite long, as can be seen in the histogram in Fig. 1. Example matches, as exported to files by the developed research tool, are presented in Fig. 2.
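Coverage of this kind is the fraction of characters that fall inside at least one match, counting overlapping matches only once. A minimal sketch of that arithmetic, assuming matches are given as half-open (start, end) character ranges within one text (the function name and the numbers in the usage note are made up for illustration):

```python
def coverage(intervals, text_length):
    """Fraction of a text covered by the union of half-open (start, end) ranges."""
    covered, end = 0, 0
    for lo, hi in sorted(intervals):
        lo = max(lo, end)        # skip the part already counted
        if hi > lo:
            covered += hi - lo
            end = hi
    return covered / text_length
```

For instance, `coverage([(0, 10), (5, 15), (20, 30)], 100)` gives 0.25, because the first two ranges overlap and together cover only 15 characters.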

Sample matches were verified and found to be correct. In some cases, however, they needed to be extended to regions before or after the marked texts. Many of the omissions and substitutions were not surprising, such as cing --> zhing, bting --> gding, and po wang --> pi bang. Some of the variations seem to be simple typos, introduced either by the scribe of the original or during the digital transliteration. Typical transcription errors include substitutions between b and p, or ng and d, which appear similar in Tibetan. In some cases, it is not clear whether a variant stems from an accidental typo or is in fact a substantive variant. For example, gtor means to scatter and ’thor, which was correctly matched in some of the quotations, means to be scattered. Fig. 3 exhibits the “confusion matrix”, i.e., the most common letter substitutions between the two texts.
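Given aligned pairs of matched substrings, substitution counts of this kind can be tallied as in the following sketch. It assumes each pair consists of two equal-length strings with "-" marking gaps, and it counts single Latin characters, so transliteration digraphs such as zh or ng are split; the function name is hypothetical.

```python
from collections import Counter


def substitution_counts(aligned_pairs):
    """Count (row_char, col_char) substitutions over aligned string pairs."""
    counts = Counter()
    for row_text, col_text in aligned_pairs:
        for x, y in zip(row_text, col_text):
            if x != y and x != "-" and y != "-":   # ignore matches and gaps
                counts[(x, y)] += 1
    return counts
```

Calling `substitution_counts(pairs).most_common(20)` then lists the most frequent substitutions, which is the information a confusion matrix displays.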

5 Discussion
Two well-known and well-studied works were chosen as a trial case for our first experiment, namely, transliterated texts of the two Buddhist works that were translated into Tibetan in the 8th century. Because of the large number of shared citations found in these works, they made for a good trial case for algorithmically locating matches. At the same time, because these are two different anthologies (i.e., not two versions of one and the same work), they are different enough to provide sufficiently many instances of approximate matches and discrepancies.

Besides the value of the citations themselves, in the absence of a critical edition of one of the works, statistics regarding the types of variations (particularly in the usage of particles) hint at the nature of the editing done by the 11th-century revisers. Through the statistics, scholars may be able to learn about some stylistic differences and editorial practices. Continuing this line of work, we may even have a case where through a careful analysis of the differences one could become aware of some philosophical differences and developments.

Fig. 1: Histogram depicting the count (y-axis) of matching texts of a certain length (x-axis) [the length in Sūtrasamuccaya]. As can be seen, while most of the matches are for texts of up to 200 characters (median=61), there are also matches for texts of a few thousand characters. Note that some texts (including long ones) have multiple matches.

It has been argued before that the so-called “revisions” often involve only very minor and unimportant changes, and indeed some of the revisers had often been accused by some Tibetans of plagiarism. The difficulty, however, is that in most cases we only have access to the revised version(s), and thus cannot compare the revision with the initial translation. Our alignment and statistical tools can help scholars trace and re-evaluate this phenomenon.

Lastly, going beyond direct matches, it may be possible to identify two different texts that share similar passages, which are paraphrases, not citations. The phenomenon of borrowing is very common in Buddhist texts. In order to identify and locate such cases, the large number of Tibetan texts at our disposal would need to be searched and compared. The scholarly implications promise to be far reaching, as this would enable the discovery of the history and emergence of texts and scriptures, and allow for the estimation of the popularity of certain texts that are cited more often (and to determine in which circles these are cited). Moreover, since in the case of the Tibetan canonical texts, translated material is being used, the translation and editorial practices can also be explored.

Fig. 2: Examples of quotations found. (a) The quotation with the 90th highest score, including its context, before and after. Line breaks have no significance. (Vertical bars serve as commas.) Two hues depict the two texts, red/pink for the Sūtrasamuccaya and blue/magenta for the Śikṣāsamuccaya.

Fig. 2: Examples of quotations found. (b) Part of the quotation with the fourth highest score, which is relatively clean except for some omissions.

Fig. 3: Substitution counts between the Sūtrasamuccaya text (rows) and the Śikṣāsamuccaya (columns).

Fig. 4: A screenshot of the user interface, with the two matching texts displayed side by side. The text regions for which matching texts exist are emboldened. The first of the two texts that match the blue text is shown in red. This match has a score of 343 and is ranked 29th out of all matches. The panel at the bottom of the screen displays the two texts aligned character by character.

References
[1] Barsky, M., Stege, U., Thomo, A., Upton, C. (2008): A graph approach to the threshold all-against-all substring matching problem. ACM Journal of Experimental Algorithmics 12

[2] Prasad, A.S., Rao, S. (2010): Citation matching in Sanskrit corpora using local alignment. In: Jha, G. (ed.): Sanskrit Computational Linguistics. Lecture Notes in Computer Science, Vol. 6465. Springer, Berlin, Heidelberg, 124–136

Conference Info

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO