Managing Ancient Textual Corpora through READ: Optimizing Text Input and Text Analysis, Multilingual Support, Recovery and Preservation

panel / roundtable
  1. Carlos Pallan Gayol

    University of Bonn

  2. Deborah Anderson

    University of California Berkeley

  3. Gabrielle Vail

    University of North Carolina at Chapel Hill

  4. Christine Hernandez

    Tulane University - Loyola University New Orleans

  5. Céline Tamignaux

    University of Bonn

  6. Andrew Glass

    Microsoft Bing

  7. Stephen White

    Università Ca' Foscari

  8. Francesco Borghesi

    University of Sydney

  9. Lorenzo Calvelli

    Università Ca' Foscari

  10. Ian McCrabb

    University of Sydney

Work text

Moderated by Carlos Pallán Gayol

University of Bonn, Germany

and Deborah Anderson

University of California at Berkeley

General panel description:
Working with historic text corpora poses specific difficulties for scholars and students: manuscripts and other text materials may be dispersed across different institutions, making it hard to compare materials; the materials may have restricted access, limiting the ability to view and/or publish text recorded on an artifact; and the writing may be difficult to reproduce digitally in a standardized font, impeding the ability to search the text digitally. In addition, proprietary software tools can make sharing texts and commenting on them nearly impossible. This panel will discuss new tools that address the challenges of working with any historic text, drawing on examples from projects working on Mayan hieroglyphic and Latin texts. The tools are extensible and can be used for a wide variety of scripts and languages.
This panel will include four papers, each of which brings a different perspective to the challenges of working with historic texts, and discusses how the Research Environment for Ancient Documents (READ) platform can be used to address those hurdles. A common thread linking all papers together is the evolution of the READ platform, originally created to support alphasyllabic languages such as Gandhari and Sanskrit, but later expanding to accommodate both fully alphabetic languages (Latin) and more recently, logosyllabic ones (Mayan and Egyptian hieroglyphs).
Each of the following papers will be presented for twenty minutes, leaving ten minutes for questions and discussion at the end of the session.
The first paper addresses ongoing work to integrate three significant image collections of Mayan hieroglyphic texts on the READ platform that come from two different institutions (INAH in Mexico and Tulane University, New Orleans). The paper highlights the advantages of having a significant thematic overlap between collections that can be readily consulted, analyzed, and annotated by researchers in a single open-access repository.
The second paper discusses the development of advanced input methods, enabling real-time typing and accurate rendering of Mayan texts by relying on two Unicode-compliant OpenType fonts and a novel virtual keyboard. These tools will make the Mayan corpus fully machine-readable, while also bypassing image restrictions and the limited availability of certain research materials.
The third paper will showcase the READ software system by describing the sophisticated database architecture at its core. The architecture bestows READ with increased flexibility to accommodate the varying needs and standards of different research communities working with various ancient languages and scripts. READ provides a range of tools to support palaeographical, phonological, morphological and lexicographical analyses.
The fourth paper will discuss various case studies of current research on Latin epigraphic, manuscript, and printed text relying on READ. The authors will address some of the highly specialized tools and the flexibility that this innovative platform offers for performing specific tasks, such as identifying text forgeries or more accurately characterizing text genres through comparative and iterative pattern analyses.

Recovering, Integrating and Preserving Textual Archives and Collections of Historical and Archaeological Value: The Case of the Maya Hieroglyphic Corpus

Gabrielle Vail,

The University of North Carolina at Chapel Hill

Christine Hernandez,

Tulane University

Céline Tamignaux

University of Bonn, Germany

and Carlos Pallán Gayol

University of Bonn, Germany

Access by researchers to extant Maya hieroglyphic texts is complicated by the global dispersal of the primary documentation across institutional and personal archives, which may in turn be site- or country-based. Our goal is to reunite Maya texts in a virtual environment to develop a comprehensive repository in which materials from different collections can coexist and dialog within a single platform. Not only does this allow the development of a more historically informed archive to house the documentation history for each ancient text, but it makes it possible to seamlessly integrate data from Maya sites across international borders within a single corpus, while multiple types of documentation for specific objects can be readily compared.
We have taken steps toward building collaborations with institutions holding significant collections. Through these partnerships, and operating within the institutions’ policies, it is possible to bring collections together within an open-access platform designed to include the complete corpus of prehispanic Maya texts. Presently, our platform integrates four such collections: two from the National Institute of Anthropology and History (INAH), both focused on sites within Mexico; one from Tulane University in New Orleans, which has a broader focus across the Maya region; and the renowned Linda Schele Drawings collection. The first of the Mexican collections includes historical photographs from INAH’s Technical Archive from the mid- to late 20th century, which often enhance analysis by showing monuments in a better state of preservation than today. Several photographs from this collection also show monuments that were later lost, or monuments still in situ that were later removed from their original contexts. The second INAH collection consists of systematic digital image acquisition carried out by the Agimaya-INAH project between 2006 and 2011 at several Maya archaeological sites, museums, and storage facilities in Mexico. The Tulane repository houses the collection of Merle Greene Robertson, which consists of primary documentation of carved texts from over 60 sites across the Maya area spanning the entire prehispanic period, beginning with the Late Preclassic (c. 400 BC-AD 250). These include rubbings of carved monuments, drawings of carved and painted texts, and extensive photographic documentation. The best documented sites are Palenque, Yaxchilan, Bonampak, and Chichen Itza. Lastly, the extensive Linda Schele Drawings collection is available through a partnership with the Los Angeles County Museum of Art (LACMA) and David Schele, and it encompasses over a thousand high-quality drawings made by the late Mayanist Linda Schele at Maya sites such as Palenque, Copan, Yaxchilan and Chichen Itza.

The Mayan-READ platform that we are currently developing is a powerful tool for managing and integrating robust collections of annotated images and metadata. To illustrate its workings, we will present examples from the sites of Palenque and Yaxchilan, for which considerable overlap of available documentation exists. Additionally, we will show examples of historic images showing portions of monuments that have since been lost. Once these collections are integrated into a single, comprehensive resource, scholars will be able to access them, generate their own analyses and digital editions of the hieroglyphic texts on several levels, and connect them with READ’s additional resources, such as period-specific glypharies, character lists, quadrat lists, site-specific syllabaries, and glossaries cross-referencing terms attested in the hieroglyphic corpus across multiple Mayan languages.
The Mayan-READ tool is also designed to pair documentation with entirely digital renderings of glyphic texts based on text input performed with encoded characters. Having these digital renderings online allows researchers to study and discuss them, even when access to high quality images may be restricted due to institutional policies. Moreover, monuments that are damaged, incomplete, or fragmented can be reconstituted by superimposing extant portions from different documentary sources over the digital renderings using an advanced multilayer viewer tool. Ultimately, the resources we discuss are geared towards creating an environment where these ancient documents can be recovered contextually to a significant degree, reintegrated within the larger Maya corpus, and preserved for future scholarship.

Figure 1. Screen capture showing database integration of historical records from INAH’s Technical Archive collections (by Céline Tamignaux and Carlos Pallán, in collaboration with INAH, Mexico). Images © Instituto Nacional de Antropología e Historia, México

Advanced Text Input, Rendering and Visualization of ancient Mayan texts: towards a fully digital repository of encoded texts

Andrew Glass

Carlos Pallán Gayol

University of Bonn, Germany

Recent breakthroughs in OpenType font development and virtual keyboards made by our team will enable users, for the first time, to perform text input with encoded Mayan characters directly in a standard document or on a website. Two prototype fonts under development realistically render the sign repertoire found in the Mayan hieroglyphic codices and the expanded character set of the stone inscriptions from the Classic period. Since Mayan glyphic signs were not written in linear fashion but arranged into glyph-blocks, we addressed the challenge of rendering non-linear sign sequences. By thoroughly mapping the many possible arrangements that individual signs can take within these clusters, we are now able to describe them with only a small set of descriptors and joining controls. In our text-input method, we indicate not only the precise signs and variants involved, but also their specific cluster configurations. Thus, the user types a sequence and presses a conversion key to convert it to the associated hieroglyph. Structuring is dynamic, based on internal font logic for the signs and joining controls. The prototype font uses a technique that will be fully Unicode conformant once Mayan has been added to the scripts supported by this standard.
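To make the idea of describing a glyph-block with a small set of joining controls concrete, the following is a minimal sketch in Python. It is not the project's encoding or font logic: the ASCII operators (`.` for side-by-side, `:` for stacked, parentheses for grouping), the placeholder sign names `s1`, `s2`, `s3`, and the equal-share layout rule are all assumptions made for illustration.

```python
import re

# Hypothetical notation: sign names, "." (side by side), ":" (stacked), "(" ")".
TOKEN = re.compile(r"[A-Za-z0-9]+|[.:()]")


def parse(text):
    """Parse a typed sequence into a nested tree of rows and stacks."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():
        tok = take()
        if tok is None or tok in ".:)":
            raise ValueError("expected a sign or '('")
        if tok == "(":
            node = stack()
            if take() != ")":
                raise ValueError("missing ')'")
            return node
        return tok  # a single sign name

    def row():
        parts = [factor()]
        while peek() == ".":
            take()
            parts.append(factor())
        return parts[0] if len(parts) == 1 else ("row", parts)

    def stack():
        parts = [row()]
        while peek() == ":":
            take()
            parts.append(row())
        return parts[0] if len(parts) == 1 else ("stack", parts)

    tree = stack()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


def layout(node, x=0.0, y=0.0, w=1.0, h=1.0):
    """Assign each sign a box (x, y, width, height) inside a unit block."""
    if isinstance(node, str):
        return [(node, x, y, w, h)]
    kind, parts = node
    n = len(parts)
    boxes = []
    for i, part in enumerate(parts):
        if kind == "row":    # split the width equally, left to right
            boxes += layout(part, x + i * w / n, y, w / n, h)
        else:                # "stack": split the height equally, top to bottom
            boxes += layout(part, x, y + i * h / n, w, h / n)
    return boxes


for box in layout(parse("s1.(s2:s3)")):
    print(box)
```

Here one sign fills the left half of the block while two others are stacked in the right half. A real font would size signs by their shapes rather than splitting the space equally, and would need further controls for arrangements such as insertion or overlap.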
While some of the patterns into which Mayan signs could be arranged closely resemble the relatively simple sign-clusters found in other scripts, such as Chinese-Japanese-Korean-Vietnamese (CJKV) characters, the Mayan script’s greater degree of visual complexity required us to expand on strategies recently developed for Egyptian hieroglyphs by introducing additional joiner controls. This enables rendering of arbitrary glyph blocks, including complex arrangements with as many as seven signs in a single block. Rendering Mayan texts authentically also requires replicating the ancient layouts by which blocks were arranged. In general, Classic texts were meant to be read from left to right and from top to bottom in paired columns (i.e., A1, B1, A2, B2). Accordingly, we are experimenting with a layout manager that allows users to display texts on a grid with any number of columns and rows using paired columns. This system will allow blocks to be arranged either in a purely vertical fashion, as in the Codices, or in a purely horizontal way (e.g., on the rim of a ceramic vessel). It should also support circular layouts (e.g., stone disks).
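The paired-column reading order described above (A1, B1, A2, B2, and so on) can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not the layout manager itself; the function name `paired_column_order` and the choice to read a final unpaired column on its own are assumptions.

```python
from string import ascii_uppercase


def paired_column_order(n_cols: int, n_rows: int) -> list[str]:
    """Return block labels in double-column reading order.

    Columns are lettered A, B, C, ... and rows numbered from 1. Each pair of
    columns is read left to right, row by row, before moving to the next pair.
    """
    if not 0 < n_cols <= len(ascii_uppercase):
        raise ValueError("this sketch only labels columns A-Z")
    order = []
    for left in range(0, n_cols, 2):
        # A trailing odd column forms a "pair" of one (an assumption).
        pair = ascii_uppercase[left:min(left + 2, n_cols)]
        for row in range(1, n_rows + 1):
            for col in pair:
                order.append(f"{col}{row}")
    return order


print(paired_column_order(4, 2))
# ['A1', 'B1', 'A2', 'B2', 'C1', 'D1', 'C2', 'D2']
```

A purely vertical text corresponds to a single column (`n_cols=1`). A purely horizontal text such as a vessel rim would use a different rule, reading one row straight across, and circular layouts are not covered by this sketch.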

The Mayan-READ tool that our team is currently developing addresses these text-input challenges from a holistic perspective, recognizing that contributing digital editions of Mayan texts to our repository requires access to auxiliary tools, such as high-quality images of monuments and metadata from various collections, updated catalogs of Mayan characters (glypharies), syllabaries specific to the region and period under study, lexical lists of attested terms in the script, and glossaries able to cross-reference several thousand cognate sets across extant Mayan languages. We will illustrate this integration at work by showing how our tools enable users to annotate passages from the Dresden Codex and the site of Chichen Itza, render them electronically in realistic, non-linear fashion with Unicode-compliant, encoded characters, and create online publications and digital editions of texts.
Our team is currently integrating these resources to provide users with a fully open-access platform that brings together the combined output of multiple researchers and projects in a vast open-source, online Mayan text repository. These standard-conformant electronic texts will be fully machine-readable, thereby enhancing access, searchability, and interchangeability, with benefits for long-term preservation. These encoded Mayan texts are also expected to be of value in faithfully rendering historical documents without depending solely on images owned by historical collections, which may carry usage restrictions placed by various institutions.

Figure 2. Our virtual keyboard for Mayan hieroglyphs, together with the dedicated OpenType fonts being created, allows for fast text input and layout of realistic digital renderings of Mayan glyph-blocks in complex arrangements or quadrats (shown here: glyphs from the Dresden Codex, Almanac 61).
Work by Andrew Glass and Carlos Pallan, NcodeX Project.

Digital Text Analysis: Automating Text Workflows and Capturing the Standard of Practice

Stephen White

Ca’ Foscari University of Venice, Italy

READ (Research Environment for Ancient Documents) is a web-based software platform that supports scholars in researching and presenting studies of ancient documents. It is designed especially for documents such as manuscripts, inscriptions, and coins, which have a preserved representation on a physical surface. The system supports workflows for text transcription, translation, and palaeographic, phonological, morphological, and lexicographic analyses, as well as the ability to represent multiple interpretations (readings) of a single document, which can be aligned in various output presentation formats.
READ uses a linked data model that represents individual entities such as syllables, words and segments (e.g. the location of a glyph on the writing surface), along with the layers of interpretation identified by the researcher, such as physical lines of glyphs, grammatical structures and lemmata with attested forms. At its core, READ captures the text transcription, the segment, and the individual link between a transcription and a glyph. It uses defined ontologies to tag the wide range of research information gathered from the different disciplines using specialized tools. The system also manages the complexity of the linked data using software engineering techniques such as constraint systems, state machines and state tables.
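As a rough illustration of what such a linked model looks like in practice, the sketch below links transcribed syllables to image segments and words to ordered syllables, with one simple constraint check. It is not READ's actual schema: the class names (`Segment`, `Syllable`, `Token`, `Edition`), the fields, the image identifier and the pixel coordinates are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """A region on an image of the writing surface (pixel bounding box)."""
    image_id: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Syllable:
    """A transcribed unit linked to the segment where it was read."""
    id: int
    transcription: str
    segment: Segment


@dataclass
class Token:
    """A word: an ordered sequence of syllable ids plus free-form tags."""
    id: int
    syllable_ids: list[int]
    tags: dict[str, str] = field(default_factory=dict)


class Edition:
    """Holds the entities and enforces that links point at known entities."""

    def __init__(self) -> None:
        self.syllables: dict[int, Syllable] = {}
        self.tokens: dict[int, Token] = {}

    def add_syllable(self, syllable: Syllable) -> None:
        self.syllables[syllable.id] = syllable

    def add_token(self, token: Token) -> None:
        # Constraint: a token may only reference syllables already recorded.
        missing = [i for i in token.syllable_ids if i not in self.syllables]
        if missing:
            raise ValueError(f"token {token.id} links to unknown syllables {missing}")
        self.tokens[token.id] = token

    def text_of(self, token_id: int) -> str:
        ids = self.tokens[token_id].syllable_ids
        return "".join(self.syllables[i].transcription for i in ids)

    def segments_of(self, token_id: int) -> list[Segment]:
        ids = self.tokens[token_id].syllable_ids
        return [self.syllables[i].segment for i in ids]


# Illustrative data only: the image id and coordinates are made up.
edition = Edition()
edition.add_syllable(Syllable(1, "bu", Segment("img-001", 120, 40, 30, 34)))
edition.add_syllable(Syllable(2, "dha", Segment("img-001", 152, 40, 28, 34)))
edition.add_token(Token(10, [1, 2], tags={"pos": "noun"}))

print(edition.text_of(10))            # budha
print(len(edition.segments_of(10)))   # 2
```

Because words refer to syllables by link rather than by copying text, a corrected reading of one syllable is reflected in every word, line or analysis that references it, and any word can be traced back to its regions on the image.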
READ was originally designed to streamline research workflows for ancient Gandhāran texts (manuscripts, documents, inscriptions and coins) written in Kharoṣṭhī (an akṣara-based script), with knowledge of the writing system encoded into entities, relationships, state tables and lookup tables. It has recently been extended to other language types, such as Mayan (logosyllabic) and Latin (alphabetic). This presentation discusses the challenges encountered while extending READ to support these writing systems and the different standards of practice used by the researchers who study them. Particular focus is given to text import (parsing), text-critical markup, palaeography, and presentation formats.

READ Extension to Alphabetic Languages: Case Studies for Latin

Francesco Borghesi,

University of Sydney, Australia

Lorenzo Calvelli,

Ca’ Foscari University of Venice, Italy

Ian McCrabb,

University of Sydney, Australia

and Stephen White

Ca’ Foscari University of Venice, Italy

This paper reports on progress of a research collaboration involving the University of Sydney (Australia), Ca’ Foscari University of Venice (Italy), Brown University (USA) and Prakaś Foundation (Sydney), to undertake the extension of the READ (Research Environment for Ancient Documents) platform from its original support for alphasyllabary languages (Gandhari, Pali and Sanskrit) to support for the alphabetic language group. The extension of the underlying READ model and enhancement of READ editors to support presentation of alphabetic languages provides a comprehensive platform for Latin epigraphic, manuscript and printed text research, and allows for the transcription, research, understanding, analysis and publishing of scholarly editions and studies.
In the first section, the authors will outline the formation of the collaboration, emphasizing the advantages of a multi-institutional strategy and of alignment with an organization with specialist technical and project management expertise, in spite of the challenges that it poses in terms of the rigidity of the academic structure and funding difficulties. They will show some preliminary results of case studies on texts ranging from ancient Roman epigraphy to fifteenth-century incunabula, and outline the relationship of these research outputs to each of the team’s research projects.
The project has exercised and extended READ’s palaeographic features to explore the identification of inscribed forgeries. The underlying READ design model, which atomizes data into its smallest indivisible components and links and sequences these entities, supports a READ module with the capacity to tag individual characters with palaeographic properties and generate palaeographic reports. This granular annotation of individual characters may also support the identification of epigraphic forgeries through faceted comparison of individual characters.
The project has extended READ’s generic annotation and sequencing features and exercised a ‘strata management’ workflow, to structure the implementation of multiple orthogonal analysis strata (grammatical, structural, semantic and syntactic) on an edition substrate (linked text and image). An analysis stratum can be any thematic complex of annotations, sequences and semantic links identified with an existing edition substrate. Uniquely labelling the constituents supports collaborating researchers in managing the ownership (for editing) and visibility (for exposure) of an analysis stratum themselves.
This stratified implementation was predicated on the delivery of READ capability through the READ Workbench portal. Hosted at the University of Sydney, Workbench delivers READ configured for language/script for individual projects, as software as a service. READ Workbench supported the management of research consultants and research specialists in the development of edition substrates and their collaborations to register their own analysis strata in support of diverse research objectives.
