Linguistic Issues in the Entry, Character-Encoding, Processing, and Rendering of Sanskrit

panel / roundtable
Authorship
  1. Peter Scharf

    Brown University, Linguistics - University of Pennsylvania

  2. Malcolm D. Hyman

    Harvard University, Classics Dept. - Max Planck Institute for the History of Science (Max-Planck-Institut für Wissenschaftsgeschichte)

  3. Venu Govindaraju

    University at Buffalo, State University of New York (SUNY)

  4. Ralph Bunker

    Maharishi University of Management

Work text

This panel will study the principles upon which text-encoding schemes for Sanskrit are based and the uses to
which they are suited. The issues under investigation range from character encoding to more complex markup
such as the indication of word boundaries. The choices made in the design of a text-encoding scheme have
important ramifications for text-processing functions such as data entry, efficiency of encoding, linguistic
processing (e.g., morphological and phonological analysis), and rendering. Speakers will present solutions
that utilize new technologies such as XML and OpenType. Although the discussion in this panel will focus on
Sanskrit, many issues are relevant also to other languages and writing systems.
Character-encoding schemes may range from the purely sound-based to the purely graphic. In a
sound-based scheme, the basic unit of analysis is the phone (speech sound). A sound-based scheme may
either take the phone as an atomic unit or decompose the phone into a bundle of phonetic features. Taking the
phone as the atomic unit, a purely phonemic system includes only the distinctive sounds (phonemes) of the
language in its inventory. Most systems which take the phone as the atomic unit, however, extend their
inventory to include at least some contextually conditioned sounds, and most sound-based systems mix
phonemic and phonetic principles. In graphic schemes, the basic unit is the graph (written shape). Graphic
schemes may involve three principles. They may take the character as an atomic unit and encode only graphemes; they may decompose the character into a set of strokes and encode partial glyphs; or they may
encode complex glyphs (ligatures).
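The contrast among these granularities can be illustrated with the Sanskrit conjunct syllable "kṣa". The sketch below is illustrative only: the sound-based list and the ligature code are hypothetical, while the grapheme sequence is the actual Unicode Devanagari representation.

```python
# The syllable "kṣa" under three encoding granularities (illustrative).

# 1. Sound-based: a sequence of phoneme symbols, one code per phone.
phonemic = ["k", "ṣ", "a"]

# 2. Grapheme-based: Unicode Devanagari stores KA + VIRAMA + SSA; the
#    inherent /a/ of the final consonant is left implicit.
graphemic = "\u0915\u094D\u0937"                    # क्ष
print([f"U+{ord(c):04X}" for c in graphemic])       # KA, VIRAMA, SSA

# 3. Ligature-based: a single code for the whole conjunct glyph, as some
#    legacy 8-bit Devanagari fonts did (this code point is made up).
ligature_code = {"kṣa": 0xF001}
```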
Sanskrit, the primary culture-bearing language of India, is written in the Devanagari script. This script
includes both syllabic and alphabetic features. Consonant graphs imply an inherent short /a/ vowel, unless
another vowel is explicitly indicated or the absence of a vowel is indicated by the virama sign. Consonant
sequences are written as ligatures; traditional Sanskrit orthography requires glyphs for more than a thousand
such sequences. We will survey and categorize current encodings for Sanskrit, including general
standards-based schemes for Indic languages, and specialized schemes used by Indologists for Sanskrit both
in Devanagari and in Roman transliteration. Moreover, we will examine the principles upon which the various
schemes are based, the applications to which they are best suited, and their potentials and shortcomings.
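The inherent-vowel convention described above is visible directly in Unicode code points: a bare consonant letter implies /a/, a virama suppresses the vowel, and a dependent vowel sign overrides it. A minimal demonstration:

```python
# Devanagari KA under the inherent-vowel convention (Unicode code points).
KA, VIRAMA, I_SIGN = "\u0915", "\u094D", "\u093F"

ka = KA             # क   bare consonant letter: pronounced /ka/
k  = KA + VIRAMA    # क्  virama suppresses the inherent vowel: /k/
ki = KA + I_SIGN    # कि  dependent vowel sign overrides it: /ki/

for s in (ka, k, ki):
    print(s, [f"U+{ord(c):04X}" for c in s])
```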
After clarifying the general issues, Hyman and Scharf will discuss their sound-based encoding
scheme. This scheme has been used for representation of a digital library of Sanskrit texts and as the internal
encoding for an automatic Sanskrit morphological analyzer. Tools are available that facilitate transliteration
and re-encoding into a number of standard formats, including Unicode.
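Such re-encoding between a Roman sound-based scheme and Unicode Devanagari can be sketched as a table-driven conversion. The toy scheme below is an illustration of the technique, not Scharf and Hyman's actual encoding; it handles only a few letters and applies the inherent-vowel and virama conventions.

```python
# Toy table-driven transliteration from a simple Roman scheme to Unicode
# Devanagari (illustrative only; not the authors' actual scheme).
CONSONANTS = {"k": "\u0915", "t": "\u0924", "m": "\u092E", "r": "\u0930"}
INDEPENDENT_VOWELS = {"a": "\u0905", "i": "\u0907", "u": "\u0909"}
VOWEL_SIGNS = {"a": "", "i": "\u093F", "u": "\u0941"}  # inherent /a/: no mark
VIRAMA = "\u094D"

def to_devanagari(word):
    out, prev_cons = [], False
    for ch in word:
        if ch in CONSONANTS:
            if prev_cons:
                out.append(VIRAMA)          # consonant cluster: join with virama
            out.append(CONSONANTS[ch])
            prev_cons = True
        else:
            # after a consonant use the dependent sign, else the full letter
            out.append(VOWEL_SIGNS[ch] if prev_cons else INDEPENDENT_VOWELS[ch])
            prev_cons = False
    if prev_cons:
        out.append(VIRAMA)                  # word-final bare consonant
    return "".join(out)

print(to_devanagari("kim"))    # किम्
print(to_devanagari("mati"))   # मति
```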
The word-boundary problem has limited the utility of Sanskrit digitized text. Word boundaries are
often obscured phonologically in Sanskrit by the replacement of consecutive vowels with a single sound or
graphically by the rendering of consecutive consonants with a single ligature. Word indices, concordances,
grammatical and lexical analysis, however, require access to word-boundary information. Earlier work in text
encoding has addressed this problem in ad hoc ways, typically by using Roman transcription and by undoing
sandhi. The former sacrifices flexibility in rendering and the latter destroys prosodic information. An ideal
system would provide all the information necessary for both flexible rendering and linguistic analysis. Work
currently under way at Brown to create a computational implementation of a lexically based production
grammar and parser promises to allow automated processing that can correctly divide digitized Sanskrit text
into its component words.
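The word-boundary problem can be made concrete with a toy sandhi reverser. Vowel sandhi merges the final and initial vowels of adjacent words (e.g. a + ā > ā), so a splitter must enumerate the possible underlying pairs at each merged vowel and check the candidates against a lexicon. The rule table and lexicon below are a minimal sketch, not the Brown parser:

```python
# Toy reversal of Sanskrit vowel sandhi (a minimal sketch, not the
# lexically based parser described above). Each merged vowel maps to the
# (word-final, word-initial) pairs that could have produced it.
VOWEL_SANDHI = {
    "ā": [("a", "a"), ("a", "ā"), ("ā", "a"), ("ā", "ā")],
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī")],
    "o": [("a", "u"), ("a", "ū"), ("ā", "u"), ("ā", "ū")],
}

def candidate_splits(surface, lexicon):
    """Propose (left, right) word pairs whose sandhi yields `surface`."""
    out = []
    for i, ch in enumerate(surface):
        for final_v, initial_v in VOWEL_SANDHI.get(ch, []):
            left = surface[:i] + final_v
            right = initial_v + surface[i + 1:]
            if left in lexicon and right in lexicon:
                out.append((left, right))
    return out

# rāma + ālayaḥ ("Rama's abode"): a + ā merge into ā on the surface.
print(candidate_splits("rāmālayaḥ", {"rāma", "ālayaḥ"}))
```

A real splitter would also need consonant sandhi rules and morphological analysis to rank competing segmentations.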
Govindaraju is compiling Devanagari character databases and digitized Hindi lexica and tagged text
corpora to serve as an accurate and comprehensive benchmark to test algorithms used in the field of
Devanagari OCR research. He is also developing tools for truthing scanned documents. He will discuss
techniques for word and line separation, character segmentation, lexically driven and lexicon-free techniques
for word recognition, and linguistic tools for post-processing.
Bunker surveys the support that OpenType fonts provide for ligatures. Because Unicode separates
rendering issues from character encoding, ligature selection need not be encoded in the text but can be left to
the font. Bunker’s research aims at developing a database of all ligatures found in printed Devanagari texts
and software that allows a user to build a customized OpenType font automatically. Any application that
supports OpenType fonts will then be able to render Sanskrit text with the user’s choice of ligatures. This
solution leverages the emerging OpenType standard to allow high-quality Devanagari typography without
cumbersome data-entry conventions or specialized software.
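Unicode does, however, provide joiner controls that let the text request a particular treatment of a conjunct without encoding the ligature itself: a ZERO WIDTH NON-JOINER after the virama asks for the explicit-virama form, and a ZERO WIDTH JOINER asks for the half-form. How each sequence actually renders still depends on the font and shaping engine. A small demonstration of the three encodings of "kṣa":

```python
# Unicode joiner controls for Devanagari conjuncts. Rendering of each
# sequence is ultimately up to the font and shaping engine.
KA, SSA, VIRAMA = "\u0915", "\u0937", "\u094D"
ZWNJ, ZWJ = "\u200C", "\u200D"

full_conjunct   = KA + VIRAMA + SSA          # क्ष  font may form the ligature
explicit_virama = KA + VIRAMA + ZWNJ + SSA   # क्‌ष  suppress the ligature
half_form       = KA + VIRAMA + ZWJ + SSA    # क्‍ष  request the half-form of क

for s in (full_conjunct, explicit_virama, half_form):
    print([f"U+{ord(c):04X}" for c in s])
```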
After the four fifteen-minute presentations, the panel will openly discuss cross-linguistic issues, the
potential for mutual application of techniques developed for various scripts and languages, and the
development of intelligent techniques to advance Sanskrit text-processing.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003
"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

83 works by 132 authors indexed


Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None