Orthographic Variety in Medieval Slavic Texts: How to Study and Model it?

  1. Milena Dobreva

    Institute of Mathematics and Informatics

  2. Dobrislav Dobrev

    Institute of Mathematics and Informatics

The Medieval Slavic written tradition, as opposed to the Latin and Greek ones, is characterised by a high level of variety on all linguistic levels. A famous comparison of a single Gospel sentence across more than 100 manuscripts found no two identical versions. Researchers study these variations carefully because they yield important knowledge about the development of the Slavic languages, both synchronically and diachronically.
The text variety in Medieval Slavic texts has not yet been studied by quantitative, computer-implemented methods (e.g. [1] contains no paper on the statistical analysis of Medieval Slavic texts). Our study is an initial attempt to collect quantitative data on the orthographic variety in different witnesses of a Medieval Slavic text.
The study of text variety phenomena is important for two reasons which go beyond the simple ‘learning more about language’. First, it might be of help in building more adequate computer models of a text in standards like SGML. Second, it can assist the development of tools for computer processing of such texts.
The Studied Text
The Medieval Slavic Psalter belongs to the group of texts serving the needs of the church. For this reason, one would expect the text of the Psalter to be more ‘homogeneous’ than non-canonical texts. We chose the Psalter for this quantitative study because numerous witnesses are available and because it belongs to the group of canonical texts.
Recently, the results of an attempt to study the tradition of the Psalter using information technology applications were published in [2]. The author studies the lexical variation in the Medieval Slavic Psalter using material from seven Psalters belonging to different written traditions. We used the same text sources in this study, which will make it possible to compare the findings of the two approaches in the future.
We made a comparative study of excerpts from six Psalters created in different places and at different times:
Codex Sinaiticus, 10th century, belonging to the Bulgarian written tradition (the only manuscript in our sample written in Glagolitic)
Bolognian Psalter, 13th century, belonging to the Bulgarian written tradition
Serbian Psalter, 13th century, belonging to the Serbian written tradition
Norovian Psalter, 13th century, belonging to the Russian written tradition
Kievian Psalter, 14th century, belonging to the Russian written tradition
Genadievian Psalter, 15th century, belonging to the Russian written tradition
For comparison, we included the modern Church Slavic Psalter text as a seventh set of texts for study.
From all these Psalters¹, Psalms 1, 39, 40, 41, 44, 45, 64, 73, 74, 75, 89, 91, 98, 102 and 134 were included in the experiment. Thus, we have a set of 104 text fragments taken from different parts of the Psalter. At the same time, we have the same selection of texts from different manuscripts, which is an ideal starting point for experiments on the importance of orthographic differences between manuscript witnesses. The underlying logic is that if two excerpts representing the same piece of text but taken from distinct witnesses differ statistically, then the orthographic variation is substantial enough to separate the texts by origin.
An important problem in statistical analysis is choosing a text excerpt size sufficient for analysis. We started by studying the data on single psalms, taking into account that a psalm usually consists of about 1000 letters (a standard excerpt size for the study of letter frequencies). In this way we worked with whole small texts (psalms) instead of artificially cut 1000-letter fragments, which are convenient for comparing absolute measures but do not represent structurally meaningful text units.
Since in some cases 1000 letters are not enough to give a clear result, we also calculated the relative frequencies for excerpts consisting of two, three, four, five and more randomly chosen psalms from those already available, and included them as additional data for our experiments. The resulting data set thus included 210 excerpts from each manuscript: 15 of each size (one psalm, two psalms, three psalms, and so on up to fourteen psalms).
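The pooling of psalms into larger excerpts can be illustrated with a short Python sketch. It is not the procedure actually used in the study: the function names, the fixed random seed and the choice to draw psalms without repetition inside one excerpt are assumptions made for illustration, and the sketch does not prevent the same combination of psalms from being drawn twice.

```python
import random
from collections import Counter


def pooled_relative_frequencies(texts, alphabet):
    """Relative frequency of each alphabet letter over the pooled texts."""
    counts = Counter()
    for text in texts:
        # Only characters listed in the alphabet are counted.
        counts.update(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    return {ch: (counts[ch] / total if total else 0.0) for ch in alphabet}


def build_excerpts(psalms, alphabet, sizes=range(1, 15), per_size=15, seed=0):
    """Build pooled excerpts for one manuscript (hypothetical helper).

    psalms: dict mapping psalm number -> text of that psalm.
    Returns a list of (size, chosen psalm numbers, relative frequencies).
    """
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    numbers = sorted(psalms)
    excerpts = []
    for size in sizes:
        for _ in range(per_size):
            # Draw `size` distinct psalms and pool their letters.
            chosen = tuple(sorted(rng.sample(numbers, size)))
            freqs = pooled_relative_frequencies(
                [psalms[n] for n in chosen], alphabet
            )
            excerpts.append((size, chosen, freqs))
    return excerpts
```

With the default arguments (sizes one to fourteen, 15 excerpts per size) the function returns 14 × 15 = 210 excerpts for a manuscript, matching the count given above.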
Building Criteria for the Statistical Study
The traditional approach to studying the orthographic variety in Medieval Slavic texts is qualitative. The decision about when and where a text was written is based on checking the occurrences of graphemes and groups of graphemes which form lists of the specific features of a certain region, time period and, in some cases, literary school. For the Medieval Slavic texts such features, as presented in the latest edition of the Academic Grammar of Old Bulgarian [3], form five groups of letter strings which vary across Bulgarian, Serbian and Russian texts (e.g. nasals, jers, iotated vowels, and groups containing r and l).
Knowing these differences, we would expect that the use of the groups which are typical for the Russian texts, for example, will show a difference in the quantitative characteristics compared to the use of the same groups in Serbian or in Bulgarian manuscripts.
For this reason, we decided to study the relative frequencies of all letters of the Medieval Cyrillic alphabet and of all members of such characteristic groups. Our basic aim was to check whether the qualitative characteristics of a text’s origin also lead to quantitative differences between texts. These differences should be significant enough to allow a clear distinction between the texts that correlates with the place or time of their creation.
Thus, we compiled a list of 55 letters and five groups of characteristic features (those include the possible substitutions presented above). For all such strings we obtained the relative frequencies of their use in all text excerpts included in our study.
To do this, after some initial experiments with TACT, we created a Word for Windows macro which counts all strings of interest. The macro receives as input the names of two files containing (i) the alphabet of the text, since we are processing non-Latin text, and (ii) the sequence of all strings which have to be counted. A third input is the name of a directory where all text files to be processed are stored. Thus all data are processed in batch mode, without the need to enter the names of all text files manually (which for a set of more than 100 files would be a waste of time and effort).
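The batch counting performed by the macro can be approximated by the following Python sketch. It is an illustration under stated assumptions, not a reconstruction of the original Word macro: the one-entry-per-line file format, the `.txt` extension, the UTF-8 default and all function names are hypothetical.

```python
from pathlib import Path


def read_entries(path, encoding="utf-8"):
    """Read one entry per line, skipping blank lines (assumed file format)."""
    lines = Path(path).read_text(encoding=encoding).splitlines()
    return [line.strip() for line in lines if line.strip()]


def count_directory(alphabet_file, strings_file, text_dir, encoding="utf-8"):
    """Count every alphabet letter and every string of interest per file.

    Returns a dict: file name -> {string: number of occurrences}.
    """
    alphabet = read_entries(alphabet_file, encoding)
    strings = read_entries(strings_file, encoding)
    # Letters first, then the characteristic strings, without duplicates.
    targets = alphabet + [s for s in strings if s not in alphabet]
    results = {}
    # Batch mode: every text file in the directory is processed in turn.
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding=encoding)
        # str.count counts non-overlapping occurrences.
        results[path.name] = {t: text.count(t) for t in targets}
    return results
```

Passing the encoding as a parameter plays the role that the separate alphabet file plays in the macro: the same counting code can then be applied to texts stored in different encoding systems.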
Since the international academic community holds different views on the qualitative features, this experiment may be conducted with different sets of strings constructed according to different views. Moreover, supplying the name of a file storing the alphabet makes our macro usable for texts transmitted in different encoding systems.
Experiments - Cluster Analysis
Once the letter frequencies have been obtained from all studied excerpts, the data are transferred to STATISTICA for Windows, where the relative letter frequencies are calculated.
Thus we obtain a large data file which can be used for different statistical analyses. All studied strings (alphabet letters and strings of research interest) form the columns of an extensive table. Every text excerpt is presented as a single row in this table. Such a row contains all calculated frequencies together with an identifier of the row and labels giving the name of the manuscript, its dating and its origin (Bulgarian, Serbian or Russian).
This data organisation allows us to choose a subset of texts for study (as a subset of all available rows), and/or to choose some of the alphabet letters or groups for analysis. From the viewpoint of the Slavist, this means that we are able to study, e.g., the usage of nasal vowels in one text, in all the texts belonging to the same manuscript, in all texts classified as Bulgarian, Russian or Serbian, or in the whole available group of texts. We can also decide whether to study the use of jers, of jers together with nasals, of iotated vowels, or of all available strings (this allows users to choose the most informative subset of data for their concrete needs).
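The table layout and the two kinds of selection (rows by label, columns by string) can be sketched in Python as follows. The field names and the `select` helper are hypothetical, and the frequency values are placeholders chosen for illustration only; they are not measurements from the manuscripts.

```python
# One row per text excerpt: an identifier, descriptive labels, and the
# relative frequencies of the studied strings.
# All numbers are placeholders, not measured values.
rows = [
    {"id": 1, "manuscript": "Bolognian", "century": 13, "origin": "Bulgarian",
     "freqs": {"ъ": 0.05, "ь": 0.04, "ѫ": 0.02}},
    {"id": 2, "manuscript": "Serbian", "century": 13, "origin": "Serbian",
     "freqs": {"ъ": 0.00, "ь": 0.09, "ѫ": 0.00}},
    {"id": 3, "manuscript": "Kievian", "century": 14, "origin": "Russian",
     "freqs": {"ъ": 0.06, "ь": 0.03, "ѫ": 0.00}},
]


def select(rows, origin=None, manuscript=None, columns=None):
    """Return the rows matching the labels, restricted to the given columns."""
    selected = []
    for row in rows:
        if origin is not None and row["origin"] != origin:
            continue
        if manuscript is not None and row["manuscript"] != manuscript:
            continue
        freqs = row["freqs"]
        if columns is not None:
            # Keep only the strings of interest, e.g. the two jers.
            freqs = {c: freqs[c] for c in columns}
        selected.append({**row, "freqs": freqs})
    return selected


# Example: the use of jers in all excerpts labelled as Russian.
russian_jers = select(rows, origin="Russian", columns=["ъ", "ь"])
```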
We started our statistical experiments with cluster analysis. The purpose of this type of analysis is to check whether texts belonging to several classes can be grouped using a statistically calculated abstract distance between the studied objects. All studied objects are grouped in clusters according to the distances measured between them.
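The idea of grouping objects by distance can be sketched in a few lines of Python. The source does not state which distance measure or linkage rule was used in STATISTICA, so Euclidean distance and single linkage are assumed here purely for illustration; the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two frequency vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors, n_clusters):
    """Single-linkage agglomerative clustering (assumed method).

    vectors: dict mapping an excerpt label -> list of relative frequencies.
    Returns a list of sets of labels, one set per cluster.
    """
    # Start with every excerpt in its own cluster.
    clusters = [{label} for label in vectors]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] |= clusters[j]  # merge cluster j into cluster i
        del clusters[j]             # safe because i < j
    return clusters
```

A clustering counts as ‘correct’ in the sense used here when each resulting set contains excerpts from one manuscript or one written tradition only.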
Having Bulgarian, Russian and Serbian texts (which form Bulgarian, Russian and Serbian classes), we conducted a set of experiments checking whether the analysis of texts of different manuscript origin leads to correct ‘clustering’.
The experiments used as variables the frequencies of different sets of alphabet letters and ‘qualitative’ strings in texts from different manuscripts. The results show clear clustering of the texts from different manuscripts when the size of the text excerpt studied is five or more psalms. The texts showing the closest results belong to the group of Russian manuscripts.
In some of the experiments, even an excerpt of a single psalm was sufficient to obtain clearly distinguishable clusters containing the texts from the different manuscripts. This holds for the differentiation of the Bulgarian and Serbian texts based on the use of jers.
The results of our work were discussed with specialists in Medieval Slavic Studies and judged useful; the quantitative study of orthographic variety will continue with other types of texts.
We hope that in the future our study will help to build a clearer picture of the historical development of the Slavic languages during the Middle Ages. It will also contribute to the methods for studying and modelling witnesses of texts with significant variations.
The work of M. Dobreva on this topic was partially supported by an RSS grant 481/1997, Cyril and Methodius and the Early Medieval Slavic World: Byzantium and the Slavs in the 9th Century AD.
¹ The only exception is the Serbian Psalter, in which Psalm 1 is lost.
1. D. Birnbaum, A. Bojadzhiev, M. Dobreva, A. Miltenova (eds.), Proceedings of the 1st Int. Conference Computer Processing of Medieval Slavic Manuscripts, 24-28 July 1995, Blagoevgrad.
2. M. Camuglia, The Psalter: from the Oral Tradition to the Written one, in: Palaeobulgarica, vol. XX (1996), N 1, pp. 3-13.
3. I. Duridanov et al. (eds.), Gramatika na starobalgarskija ezik, Sofia, 1993. (Grammar of the Old Bulgarian Language, in Bulgarian).

