Introducing MPCD – Middle Persian Corpus and Dictionary

poster / demo / art installation
  1. Claes Neuefeind

    Universität zu Köln (University of Cologne)

  2. Francisco Mondaca

    Universität zu Köln (University of Cologne)

  3. Øyvind Eide

    Universität zu Köln (University of Cologne)

  4. Iris Colditz

    Ruhr-Universität Bochum

  5. Thomas Jügel

    Ruhr-Universität Bochum

  6. Kianoosh Rezania

    Ruhr-Universität Bochum

  7. Alberto Cantera

    Free University Berlin, Germany

  8. Chagai Emanuel

    Hebrew University

The project "Zoroastrian Middle Persian – Digital Corpus and Dictionary (MPCD)" aims to create a comprehensive, open-access corpus of Zoroastrian Middle Persian texts in Pahlavi script, accompanied by a digital Middle Persian-English dictionary based on this corpus. Started in mid-2021, MPCD is funded by the DFG as a long-term project with a total duration of nine years. The cooperative project is being carried out at the universities of Bochum, Berlin, Jerusalem and Cologne.

While the partners in Bochum, Berlin and Jerusalem focus on the philological aspects of the project, the Cologne Center for eHumanities (CCeH) is responsible for the technical implementation of a collaborative working environment, which at the same time serves as the user interface for research on and analysis of the processed resources. A common data model is of key importance to both the philological work and the technical design of the application, and it is therefore the focus of this poster.

Scope of the project
Middle Persian was the official language and lingua franca of the Sasanian Empire (3rd-7th century) and was of high cultural and supra-religious importance. From late antiquity to the early Islamic period it connected the different areas of the Iranian East and West in both linguistic and cultural terms. However, the extensive corpus of Middle Persian texts has only been partially indexed to date and there is no comprehensive lexicographical resource covering the full variety of its vocabulary.
The aim of the MPCD project is to fill this gap by creating a corpus of all Zoroastrian Middle Persian texts in Pahlavi script (about 54 texts, approx. 687,000 words). This corpus will be made accessible in transliteration and transcription (cf. Rezania 2020) as well as in manuscript photographs of the 15 oldest codices, some of which can be obtained from Alberto Cantera's CAB project (Corpus Avesticum Berolinense). This comprehensive digital corpus will subsequently be used as the basis for a digital Middle Persian-English dictionary, expected to comprise approx. 7,000 lemmata.

With its close interlocking of text and dictionary, the project complements existing text collections on Middle Persian such as TITUS (Thesaurus of Indo-European Text and Language Materials) and extends existing concise dictionaries such as MacKenzie (1971) or Nyberg (1964/1974). The project is conceived as a basis for identifying internal and external factors in the complex fabric of the texts of Zoroastrian Middle Persian literature, and for providing an adequate means for a differentiated analysis of cultural, religious and social history.

Modeling MPCD
The digital corpus and dictionary represent two closely interlocked analytical tools with different emphases – text structure and semantics – that are also closely intertwined in the work process. This has to be taken into account by the internal data model, which at the same time determines the corpus structure and the collaborative working environment.
At the current stage of the project we focus on the corpus model. The corpus consists of texts (see figure 1), each holding basic metadata and a number of sentences. The metadata comprises basic information such as sigle, title, creation date, source, the responsible editor and their collaborators, while each sentence element contains the full sentence, one or more translations, an optional comment and one or more tokens.

Fig. 1:
Excerpt of the corpus model reflecting a single text.
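
The text and sentence structure described above can be illustrated with a minimal Python sketch. This is not the project's actual schema: the class names, field names and the simplification of tokens to plain strings are assumptions made for illustration, and the example values are placeholders rather than corpus data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sentence:
    """One sentence of a text (hypothetical field names)."""
    text: str                      # the full sentence
    translations: List[str]        # one or more translations
    tokens: List[str]              # one or more tokens, simplified to strings here
    comment: Optional[str] = None  # optional comment

    def __post_init__(self) -> None:
        # The description requires at least one translation and one token.
        if not self.translations:
            raise ValueError("a sentence needs at least one translation")
        if not self.tokens:
            raise ValueError("a sentence needs at least one token")


@dataclass
class Text:
    """One text of the corpus with its basic metadata (hypothetical field names)."""
    sigle: str
    title: str
    creation_date: str
    source: str
    editor: str
    collaborators: List[str] = field(default_factory=list)
    sentences: List[Sentence] = field(default_factory=list)

    def token_count(self) -> int:
        """Total number of tokens over all sentences of this text."""
        return sum(len(sentence.tokens) for sentence in self.sentences)


# Placeholder values only; none of this is real corpus data.
example = Text(
    sigle="X",
    title="Placeholder text",
    creation_date="unknown",
    source="placeholder source",
    editor="N.N.",
    sentences=[
        Sentence(
            text="w1 w2 w3",
            translations=["placeholder translation"],
            tokens=["w1", "w2", "w3"],
        )
    ],
)
print(example.token_count())  # prints 3
```

Validating the "one or more" constraints at construction time is one possible design choice; a collaborative editing environment might instead allow incomplete sentences while work is in progress.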

Each token (see figure 2) holds information on the token language, a transcription and a transliteration, its lemma, and optional information on the text structure that marks the beginning of a new section, folio, etc. In addition, the token model includes both morphosyntactic annotations and lexicographic information; the latter is intended to serve as a direct link to the corpus-based dictionary.

Fig. 2:
Excerpt of the corpus model reflecting a single token.
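
As a rough illustration of the token description above, a token record might be sketched in Python as follows. All class and field names are hypothetical, the dictionary link is modeled as a plain identifier by assumption, and the example values are placeholders rather than attested forms.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StructureMarker:
    """Marks the beginning of a new structural unit (hypothetical)."""
    kind: str   # for example "section" or "folio"
    label: str  # identifier of the unit that starts at this token


@dataclass
class Token:
    """One token with its annotations (hypothetical field names)."""
    language: str
    transcription: str
    transliteration: str
    lemma: str
    # Optional text-structure information attached to this token.
    structure: Optional[StructureMarker] = None
    # Morphosyntactic annotations as feature/value pairs.
    morphosyntax: Dict[str, str] = field(default_factory=dict)
    # Assumed link to a dictionary entry, modeled here as an identifier.
    dictionary_entry_id: Optional[str] = None

    def starts_new_unit(self) -> bool:
        """True if this token opens a new section, folio or similar unit."""
        return self.structure is not None


# Placeholder values only; these are not attested Middle Persian forms.
token = Token(
    language="pal",  # ISO 639-3 code for Pahlavi, used here by assumption
    transcription="placeholder-transcription",
    transliteration="PLACEHOLDER",
    lemma="placeholder-lemma",
    structure=StructureMarker(kind="folio", label="1r"),
    morphosyntax={"upos": "NOUN", "Number": "Sing"},
)
print(token.starts_new_unit())  # prints True
```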

The morphosyntactic annotations will largely follow the Universal Dependencies standard, which is adapted for the MPCD project by determining the subset of tags necessary for the annotation of Middle Persian and by adding Pahlavi-specific tags. These fine-grained token-level annotations will allow for differentiated searches by linguistic parameters, to be implemented on the basis of Elasticsearch; search and CRUD operations will be available via a GraphQL API (cf. Mondaca et al. 2019a and 2019b).
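
To give an idea of what a search by linguistic parameters could look like, the following Python sketch builds a request body in the Elasticsearch Query DSL (a `bool` query with `term` filters). It only constructs the JSON-like dictionary and does not contact a server. The function name `build_token_query` and the index field names `upos`, `lemma` and `feats.*` are assumptions, as is the premise that these fields are indexed as exact-match keyword fields; the project's actual mapping and GraphQL schema are not described here.

```python
from typing import Any, Dict, List, Optional


def build_token_query(
    upos: Optional[str] = None,
    lemma: Optional[str] = None,
    feats: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build an Elasticsearch request body filtering tokens by annotation.

    Field names are hypothetical. Each given parameter becomes one
    exact-match `term` filter; all filters must hold at the same time.
    """
    filters: List[Dict[str, Any]] = []
    if upos is not None:
        filters.append({"term": {"upos": upos}})
    if lemma is not None:
        filters.append({"term": {"lemma": lemma}})
    # Sort feature names so the generated query is deterministic.
    for name, value in sorted((feats or {}).items()):
        filters.append({"term": {f"feats.{name}": value}})
    if not filters:
        # No constraints given: match every token.
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"filter": filters}}}


# Example: all past-tense verbs, using Universal Dependencies tag names.
query = build_token_query(upos="VERB", feats={"Tense": "Past"})
print(query)
```

Filters are used instead of scored clauses because annotation values either match or do not, so relevance ranking adds nothing for this kind of search.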

With its focus on the data model, the poster will provide a compact overview of the MPCD project, reflecting the corpus structure, the transcription process and the philological decisions as well as the implications for the technical design of the working environment to be established.


MacKenzie, D. N. (1971): A Concise Pahlavi Dictionary. London/New York/Toronto.

Mondaca, F., Rau, F., Neuefeind, C., Kiss, B., Kölligan, D., Reinöhl, U., Sahle, P. (2019a): C-SALT APIs - Connecting and Exposing Heterogeneous Language Resources. In: Book of Abstracts of the Digital Humanities Conference 2019 (DH2019) 09.07-12.07.2019. Utrecht, Netherlands.

Mondaca, F., Schildkamp, P., Rau, F. (2019b): Introducing Kosh, a Framework for Creating and Maintaining APIs for Lexical Data. In: Electronic Lexicography in the 21st Century. Proceedings of the eLex 2019 Conference, Sintra, Portugal. Brno: Lexical Computing CZ, s.r.o., pp. 907–921.

Nyberg, H.S. (1964): A Manual of Pahlavi. Part I: Texts, Alphabets, Index, Paradigms, Notes and an Introduction. Wiesbaden.

Nyberg, H.S. (1974): A Manual of Pahlavi. Part II: Ideograms, Glossary, Abbreviations, Index, Grammatical Survey, Corrigenda to Part I. Wiesbaden.

Rezania, K. (2020): A Suggestion for the Transliteration of Middle Persian Texts in Zoroastrian Middle Persian: Digital Corpus and Dictionary (MPCD): A Three Layered Transliteration System. Estudios Iranios y Turanios 4: pp. 153–73.

Conference Info

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO