Open Islamicate Texts Initiative: a Machine-Readable Corpus of Texts Produced in the Premodern Islamicate World

Maxim Romanov; Masoumeh Seydi; Sarah Bowen Savant; Matthew Thomas Miller

Authorship

1. Maxim Romanov
   Universität Wien (University of Vienna)

2. Masoumeh Seydi
   Leipzig University of Applied Sciences

3. Sarah Bowen Savant
   Aga Khan University, London

4. Matthew Thomas Miller
   University of Maryland, College Park

Work text


The written heritage of the “Islamicate” cultures that stretch from modern Bengal to Spain is as vast as it is understudied and underrepresented in the digital humanities. (Introduced by the University of Chicago historian Marshall Hodgson, the term “Islamicate” refers to anything Islamic and non-Islamic, religious and non-religious, that was produced in the vast geographical region we refer to as the “Islamic world”; see Hodgson, 1974.) The sheer volume and diversity of the surviving works produced in Arabic and Persian in the premodern period make this body of texts ideal for computational analysis. While a great number of texts have been digitized over the past two decades, OpenITI is the first corpus of Islamicate texts that is open, machine-readable, and aims to be comprehensive. OpenITI strives to provide the essential textual infrastructure in Persian and Arabic for new forms of macro-textual analysis and digital scholarship. The corpus is already actively used in several ERC projects.

Currently, OpenITI includes about 4,300 unique book titles (predominantly in Arabic) by over 1,800 authors, which amounts to approximately 750 million words (1.46 billion words with all versions). Most texts have been collected from open-access online collections of high-quality digital reproductions of premodern texts, such as http://shamela.ws/, http://shiaonlinelibrary.com/, and https://ganjoor.net/.

Figure 1. Chronological distribution of authors and books in OpenITI.

OpenITI is organized in compliance with Canonical Text Services (CTS) guidelines as implemented in the CapiTainS Suite (Clérice et al., 2017), with two exceptions made for practical purposes. First, we implemented human-readable URNs (uniform resource names), as this allows for easier subsetting of the corpus and makes it easier to engage non-DH specialists in Islamicate studies in collaboration. Second, we are postponing conversion to TEI XML, since it poses a number of challenges for right-to-left languages, connected scripts, and extensive texts (many texts are over 1 million words).

Figure 2. CTS URN Structure: al-Ḏahabī’s Taʾrīḫ al-islām.

Figure 3. CapiTainS-Compliant Folder Structure: Versions of al-Ġazālī’s Iḥyāʾ ʿulūm al-dīn incorporated into the repository 0525AH within OpenITI.

CTS offers a powerful mechanism for building expandable and interoperable corpora. The power of CTS lies in the URN, which provides permanent canonical references to texts and is used by CTS to identify or retrieve passages of text (Figure 2). The tree-like structure of CTS URNs, together with the modular folder organization (Figure 3), allows one to easily expand the corpus in a decentralized manner and to accommodate as many texts and their versions, in any number of languages, as might be necessary.
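
The relationship between a tree-like URN and a modular folder layout can be sketched in a few lines of Python. This is only an illustration: the figures are not reproduced in this text, so the dot-separated `author.book.version` form, the nesting scheme, and the identifier `0505Ghazali.IhyaCulumDin.ExampleLib001-ara1` are assumptions made for the example, not the documented OpenITI specification.

```python
from typing import NamedTuple


class TextURN(NamedTuple):
    """Three levels of a hypothetical human-readable URN."""
    author: str   # e.g. death year plus a short author name (assumed)
    book: str     # short book title (assumed)
    version: str  # source identifier plus language code (assumed)


def parse_urn(urn: str) -> TextURN:
    """Split a dot-separated URN into author, book, and version levels."""
    parts = urn.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected author.book.version, got {urn!r}")
    return TextURN(*parts)


def folder_path(urn: TextURN) -> str:
    """Build a nested path that mirrors the URN tree, one folder per level."""
    return "/".join([urn.author, f"{urn.author}.{urn.book}", ".".join(urn)])


# Hypothetical identifier, for illustration only.
example = parse_urn("0505Ghazali.IhyaCulumDin.ExampleLib001-ara1")
print(folder_path(example))
```

Because each level of the identifier maps onto its own folder, a new version of a book or a new book by the same author can be added without touching any existing path, which is the property that makes decentralized expansion straightforward.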

Texts have been automatically converted into our custom format, OpenITI mARkdown (AR stands for Arabic). Designed to facilitate conversion of raw texts into machine-actionable formats, this flavor of markdown: 1) simplifies work with the multivolume texts that make up the core of the corpus; and 2) helps to avoid the problems that arise when paired symbols (e.g., angle brackets), LTR and RTL languages, and connected scripts occur in the same document, where even a simple editing task becomes overly complicated. Additionally, mARkdown will facilitate conversion into TEI XML, the de facto standard for publishing digital editions. A detailed description of OpenITI mARkdown can be found at https://maximromanov.github.io/mARkdown/.

Figure 4. OpenITI mARkdown: A sample text (biography) tagged morphologically and semantically (using EditPad Pro, https://www.editpadpro.com/).

Available on GitHub (https://github.com/OpenITI), OpenITI includes four different types of repositories:

Raw texts repositories (naming pattern: `RAWrabica\d+`), where texts collected from online libraries are stored in their original formats. The total count exceeds 50,000 files. Texts here must be manually reviewed and moved into the main working repositories (over 14,000 have been processed).

Main working repositories (pattern: `\d{4}AH`), where texts in mARkdown are grouped into 25-year periods, based on authors’ death dates. This makes repositories more manageable, allowing scholars to focus on relevant texts. Manual editing, reviewing, and tagging of texts happens here.

Instantiations (pattern: `i\.\w+`) include all texts of the corpus adapted for use with specific software. For example, in `i.stylo`, texts are renamed and reformatted as required by the `stylo` package for R (see Eder et al., 2016).

Releases (planned) will have citable, time-stamped versions of the corpus.
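
The naming patterns listed above are regular expressions, so they can be applied directly to sort repository names by type. The sketch below uses the three patterns exactly as quoted; the labels `raw`, `working`, and `instantiation`, the sample name `RAWrabica005000`, and both function names are hypothetical. The period helper assumes that a working repository is named after the upper bound of its 25-year span, which is consistent with al-Ġazālī (commonly cited death year 505 AH) being placed in `0525AH`, but the rule itself is not stated in the text.

```python
import re
from typing import Optional

# Naming patterns quoted in the text; the dictionary keys are illustrative labels.
REPO_PATTERNS = {
    "raw": re.compile(r"RAWrabica\d+"),
    "working": re.compile(r"\d{4}AH"),
    "instantiation": re.compile(r"i\.\w+"),
}


def classify_repo(name: str) -> Optional[str]:
    """Return the repository type whose pattern matches the whole name."""
    for kind, pattern in REPO_PATTERNS.items():
        if pattern.fullmatch(name):
            return kind
    return None


def working_repo_for(death_year_ah: int, span: int = 25) -> str:
    """Assumed rule: round the death year up to the next multiple of `span`."""
    upper = -(-death_year_ah // span) * span  # ceiling division
    return f"{upper:04d}AH"


print(classify_repo("RAWrabica005000"))  # raw
print(classify_repo("0525AH"))           # working
print(classify_repo("i.stylo"))          # instantiation
print(working_repo_for(505))             # 0525AH
```

Using `fullmatch` rather than `search` keeps a name such as `0525AH-backup` from being mistaken for a working repository.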

To achieve comprehensiveness, we are currently working on improving Arabic-script OCR (see Kiessling et al., 2017) and on developing a digital text production pipeline (in collaboration with SHARIAsource, Harvard Law School). Called CorpusBuilder, this user-friendly, web-based, open-source application will significantly lower the technological barriers to entry and help us involve colleagues, librarians, and citizen scientists from around the world in our collaborative project of corpus creation.

Bibliography

Clérice, T., Munson, M. and Almas, B. (2017). Capitains/Capitains.Github.Io: 2.0.0. Zenodo. doi:10.5281/zenodo.570516. https://zenodo.org/record/570516 (accessed 29 April 2019).

Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R: A Package for Computational Text Analysis. The R Journal, 8(1): 107–121.

Hodgson, M. G. S. (1974). The Venture of Islam: The Classical Age of Islam, 3 vols. Chicago: University of Chicago Press.

Kiessling, B., Miller, M. T., Savant, S. B. and Romanov, M. G. (2017). Important New Developments in Arabographic Optical Character Recognition (OCR). Al-ʿUṣūr Al-Wusṭá (The Journal of Middle East Medievalists), 25: 1–13.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info


ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019


Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO
