Distant reading of Russian Soviet diaries (Prozhito Database)

paper, specified "long paper"
  1. 1. Anastasia Bonch-Osmolovskaya

    National Research Unversity Higher School of Economics

  2. 2. Viktoria Vorobieva

    National Research Unversity Higher School of Economics

  3. 3. Artem Kriukov

    National Research Unversity Higher School of Economics

  4. 4. Maria Podriadchikova

    National Research Unversity Higher School of Economics

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


The diary as a genre has always been ambiguous and complex, balancing “between literary and historical writing, between the spontaneity of reportage and reflectiveness of the crafted text, between selfhood and events, between subjectivity and objectivity, between the private and the public” (Langford and West 1999, 8). For a long time, researchers have been struggling to define its specifics, and the historical framework has only recently been replaced by an intertextual approach. It has become realizable mostly due to the works of French structuralist Philippe Lejeune (Lejeune 2009).
Due to its genre ambiguity and comprehensiveness, the diary gains a lot of interest both from researchers and common readers. In Russia, this interest was embodied in the Prozhito project—a large database of intimate papers written by people in the Russian Empire and modern Russia but mostly in the Soviet Union. Сreated in 2015,
Prozhito has become a source for reflection on history on both national and personal scale.

Prozhito database offers an opportunity to explore diaries on a large scale, with computable methods. However, opportunities go hand-in-hand with challenges, and from this perspective, a researcher who studies the
Prozhito database faces certain intricacies:

Prozhito database is heterogeneous. Unlike printed literary works, not a lot of diaries are well preserved. Unsurprisingly, the diaries written by prominent people or created during dramatic times have more chances to be saved.

The diary is an intimate and personal writing, and this aspect influences the data as well. Typos and incomprehensible passages inevitably affect the research process and its outcomes.

Nevertheless, the
Prozhito database is still an encompassing source of material. From a quantitative perspective, several research questions can be asked:

How various historical events are reflected in the diaries of that time?
How do people express their feelings in writing? Are there any patterns that people use to describe their emotions?
Assuming that diaries are diverse and heterogeneous, is it possible to create a classification that would take this heterogeneity into account?

To answer these questions, we conducted several researches based on Digital Humanities methods—topic modeling, collocation analysis, and classification tasks.

How historical events are reflected in the diaries

We compared Soviet publicist articles with diaries created at the same time in order to define how historical names or events are covered in these different texts.
With topic modeling applied, we found that collocations like “Soviet power” (“sovetskaya vlast’”), “class struggle” (“klassovaya bor’ba”), and “class enemy” (“klassovyi vrag”) were popular both in the articles and diaries throughout the whole period presented in data. Some of them like the phrase “enemy of the people” (“vrag naroda”) saw a dramatic growth during the time of the enormous purges.

Figure 1. Number of the bigram “enemy of the people” during the selected period.
With distant and close reading methods combined, we could find how people expressed their attitude towards these events in writing. We found a correlation between the records mentioning the Civil (1917-22/23) and the Second World wars (1939-45) in which people compared their severe experience and conditions.
These outcomes were defined owing to analysis of collocations and co-occurrences. We tried a similar approach in answering the second question—whether quantitative methods are viable for analyzing emotions in diaries.

How to analyze emotions expressed in writing

Using the bootstrapping method, 6283 records connected with the love topic were selected. It was challenging to define words and phrases used for expressing feelings of love owing to flexible language structure. Nevertheless, splitting data into groups was fruitful in comparing patterns during different time periods.
We started by comparing the proportion of emotional records depending on the period. Due to the database's heterogeneity, we found two periods with a large proportion of emotional records—at the beginning of the 20th century and during the Second World War.

Figure 2. Emotional records ratio by period.
We selected repeating collocations and analyzed how their frequency changed over time. The postwar period that had not a lot of emotional records has seen a dramatic growth of several bigrams such as “to love passionately” (“strastno lubit’”) and “to love terribly” (“uzhasno lubit’”). The ratio of these bigrams is plummeting during the Second World War when other bigrams like “beloved person” (“lubimyi chelovek”) rocketed. Being separated, people expressed their sorrow in writings to beloved ones.

Figure 3. Ratio of the selected bigrams by period.

Figure 4. Ratio of the selected bigrams by period.
The way people describe feelings is changing throughout history. Undoubtedly, shifting in words’ meaning makes these explorations challenging. Nevertheless, certain strong patterns can be observed despite these limitations.

How to classify heterogeneous data

Diaries balance between historicity and emotionality, which raises classification issues. We concluded that a proper classification should be based on the parameters apart from genre and author. Instead, we can classify diary records relying on the author's intention to express a certain type of information—a description of a particular life episode, emotion, or even literary text.
Adding such aspects as eventual density and writing style, a preliminary classification can be prepared:

Object of description
Eventual density
Writing style

Occurred events: everyday life

Occurred events: specific activity
High or middle
Mostly formal

Feelings, emotions and reflections
Middle or low

Occurred events: everyday life

Fictional events

After deleting short (less than 500 symbols) texts, we trained a logistic regression model on randomly selected records—documents were represented as vectors using the TfidfVectorizer tool.
At the next stage, in a team of annotators, we manually marked up two datasets twice—one with unbalanced data consisting of 3780 records and a balanced dataset with 2240 records. After annotating, we evaluated the consistency of markup.

Figure 5. Consistency of markup
Using built datasets, we tried various linear classifiers such as logistic regression, Naive Bayesian classifier, and Support vector machine. Logistic regression shows better results than other models (f-measure=0.81).

Figure 6. The error matrix of the basic solution on the balanced dataset.

Figure 7. The error matrix of the basic solution on the unbalanced dataset.
EPISODE and NAR categories were ambiguous. After combining them into one category NAR_EPISODE, the f-measure of the category increased; for a balanced dataset it was 0.78, for an unbalanced one—0.88.

Figure 8. The error matrix of the basic solution on the balanced dataset.

Figure 9. The error matrix of the basic solution on the unbalanced dataset.
The resulting classification can be variously applied. It becomes possible to analyze what topics are peculiar for authors of a certain age and how individual narrative behavior alters throughout life.

Figure 10. Ratio of different categories in the diaries of authors of different ages.

Figure 11. Oleg Amitrov’s narrative behavior—his diaries are among the most thorough in the database.

In our project, we present a framework based on a quantitative method for approaching diaries. We considered both historical and emotional aspects, and these investigations resulted in designing a classification, which, in its turn, is fruitful for new opportunities of computable research in this field.


Langford, Rachael and Russell West. (1999). Introduction: Diaries and Margins.
Marginal Voices, Marginal Forms: Diaries in European Literature and History. Amsterdam: Rodopi.
pp. 6-21.

. (2009).
On diary
. Honolulu: University of Hawaii Press.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO