Corpus Linguistics for Multidisciplinary Research: Coptic Scriptorium as Case Study

paper, specified "short paper"
  1. Caroline T. Schroeder

    University of the Pacific

Work text

The Coptic language is the last phase of the Egyptian language family, descended ultimately from the language of the ancient hieroglyphs. Coptic Scriptorium has developed a multidisciplinary research platform that combines core Corpus Linguistics tools and methods with the methods of other disciplines. This paper will argue that this collaborative, interdisciplinary approach allows for the creation of research resources that enrich even work within a single discipline.

Coptic Scriptorium has created the first open source natural language processing tools for any phase of the Egyptian language family, including a tokenizer, normalizer, part-of-speech tagger, language-of-origin tagger (for loan words from Greek, Latin, and other languages), and lemmatizer. We have also contributed annotated data to the Universal Dependencies treebank project. A fully searchable corpus annotated with these tools is available online, and all tools and corpora can be downloaded from our GitHub repositories.
This paper will argue that multidisciplinary collaboration improves even disciplinary research. Three examples are provided here; these and others will be demonstrated live in the short paper.
Collaboration with Egyptologists who were creating a TEI Coptic lexicon file enabled the creation of an online Coptic Dictionary, in which words in our searchable database are hyperlinked to the dictionary entries. The dictionary entries in turn show frequency statistics for the terms in our database. This collaboration benefits Egyptology by providing an open source corpus for teaching and research that is linked to a dictionary, and it benefits corpus linguistics by providing clear frequency data and lexical resources for linguists.
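
The link between dictionary entries and corpus frequencies described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the lemma strings, the entry IDs, and the function name `frequency_by_entry` are all invented placeholders, and the lexicon is assumed to be a simple mapping from lemma to dictionary entry.

```python
from collections import Counter

# Hypothetical lemmatized corpus; placeholder strings stand in for Coptic lemmas.
corpus_lemmas = ["lemma_a", "lemma_b", "lemma_a", "lemma_c", "lemma_a", "lemma_b"]

# Hypothetical lexicon: lemma -> dictionary entry ID (IDs invented for illustration).
lexicon = {"lemma_a": "entry-001", "lemma_b": "entry-002"}


def frequency_by_entry(lemmas, lexicon):
    """Count corpus occurrences per dictionary entry.

    Lemmas that have no dictionary entry are skipped.
    """
    counts = Counter(lemmas)
    return {lexicon[lemma]: n for lemma, n in counts.items() if lemma in lexicon}


entry_frequencies = frequency_by_entry(corpus_lemmas, lexicon)
```

Under these assumptions, each dictionary entry can display the number of times its lemma occurs in the corpus, and each corpus token can link back to its entry through the same mapping.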
Collaboration with Religious Studies scholars has enabled us to include in our corpora transcriptions of Coptic manuscripts that have never before been published in print. Scholars in Religious Studies have provided transcriptions of texts to the project, enabling scholars in other disciplines, such as Linguistics, to conduct computational corpus research on important, previously inaccessible texts. Likewise, Religious Studies scholars can use the database to conduct philological and historical research on religious texts.
Coptic Scriptorium also annotates manuscript information of interest to archivists, philologists, and codicologists within a multilayer annotation model. This enables codicologists, philologists, and archivists to use the query syntax of our corpus linguistics database (ANNIS) to investigate research questions about scribal practices, spelling and morphology, and other manuscript-related issues across multiple manuscripts, drawing on metadata such as repository information and the dates and locations of the original manuscripts.
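
To illustrate the idea of querying several annotation layers together with manuscript metadata, here is a minimal sketch in plain Python. It is not ANNIS and does not use the ANNIS query syntax; the class names, the layer names (`orig`, `norm`, `pos`), the metadata key `repository`, and all sample values are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Annotation layers for one token: layer name -> value.
    layers: dict


@dataclass
class Document:
    # Manuscript-level metadata (keys here are hypothetical).
    metadata: dict
    tokens: list = field(default_factory=list)


def find_tokens(documents, layer_filters, metadata_filters=None):
    """Return (document, token) pairs matching every layer and metadata filter."""
    metadata_filters = metadata_filters or {}
    hits = []
    for doc in documents:
        # Skip documents whose metadata does not match all requested values.
        if any(doc.metadata.get(k) != v for k, v in metadata_filters.items()):
            continue
        for tok in doc.tokens:
            if all(tok.layers.get(k) == v for k, v in layer_filters.items()):
                hits.append((doc, tok))
    return hits


# Invented sample data; placeholder strings stand in for real transcriptions.
docs = [
    Document({"repository": "Repository A"}, [
        Token({"orig": "w1", "norm": "w1", "pos": "N"}),
        Token({"orig": "w2x", "norm": "w2", "pos": "V"}),
    ]),
    Document({"repository": "Repository B"}, [
        Token({"orig": "w3", "norm": "w3", "pos": "N"}),
    ]),
]

# Nouns restricted to one repository, then nouns across all manuscripts.
nouns_in_a = find_tokens(docs, {"pos": "N"}, {"repository": "Repository A"})
all_nouns = find_tokens(docs, {"pos": "N"})
```

Because each token carries its original and normalized forms side by side, the same kind of filter could compare the two layers to surface spelling variation within manuscripts from a given repository or period.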
We presented the very beginnings of the Coptic Scriptorium project at DH 2014 in Switzerland. This short paper will demonstrate the extensive progress made as a result of collaboration and interdisciplinary partnerships.


Conference Info


ADHO / EHD - 2018

Hosted at El Colegio de México, Universidad Nacional Autónoma de México (UNAM) (National Autonomous University of Mexico)

Mexico City, Mexico

June 26, 2018 - June 29, 2018

340 works by 859 authors indexed

Series: ADHO (13), EHD (4)

Organizers: ADHO