Unicode in Multilingual Text Projects: A Status Report from the Script Encoding Initiative, UC Berkeley

paper
Authorship
  1. Deborah Winthrop Anderson

    Department of Linguistics - University of California Berkeley

Work text
The latest version of the international character encoding standard Unicode, Unicode 4.0, contains over 96,000 characters. This represents a remarkable achievement, for it expands the number of scripts (and the languages using those scripts) that can be represented more fully on the Web, in email, and generally for the electronic transfer of texts. Fortunately, Unicode is increasingly supported in software and fonts. The result of the widespread adoption of Unicode is that text materials for many modern and historical languages are now more widely accessible and capable of being transmitted, without requiring fonts with proprietary (non-standard) encodings. This talk provides an overview of outstanding issues related to character encoding and Unicode in multilingual text projects as observed by the Script Encoding Initiative project at UC Berkeley. It will conclude with a wish-list of ways in which members of the academy can help more fully.

The Script Encoding Initiative (SEI) was established at UC Berkeley in 2002 in order to relay various unmet needs of academic multilingual electronic text projects to the character encoding standards bodies and to promote and explain Unicode to scholars. SEI has four objectives.

One goal is to give a presence and voice to the academic point of view (as opposed to that of computer companies) within the character encoding standards process, particularly to speak up for the encoding of historic and minority scripts at the Unicode Technical Committee meetings. Currently no university is a full member of the Unicode Consortium; Columbia University and Tamil Virtual University are the only Associate Members, but they don't regularly attend the Unicode Technical Committee meetings.

The second goal of this project is to encourage the participation of scholars and other users so that scripts missing from Unicode are proposed. The number of missing scripts is still over 90, with approximately one-third being modern minority scripts and two-thirds being historic scripts. The active participation of scholars is critical: proposals need specialized input from experts in order to be complete and to cover their needs. The outstanding scripts are often less well known, so the task is to locate scholars, explain the standards process, and make clear that encoding these scripts in the international standard Unicode is a necessity.

The third goal is to promote Unicode and an understanding of what it covers -- and what it does not -- more generally amongst academic groups. With the increasing adoption of XML for the Web, Unicode is playing a more prominent role in electronic text projects because it is the default character encoding standard for XML. Text projects now have to grapple head-on with specific problems when dealing with Unicode: How can one include a character (or script) that is not yet in Unicode? How should variants of characters be handled? Given a choice of several Unicode characters, which should be used? How does the font or markup come into play vis-à-vis character encoding? Although the revision of the TEI Guidelines (P5) will address some of these issues, it has not yet been published. SEI was established to help provide guidance in these areas, at least from the Unicode perspective.
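Two of the questions above can be made concrete with a short sketch. It is only an illustration, not a procedure recommended by SEI: it uses Python's standard `unicodedata` module, the function names `audit_characters` and `same_text` are invented for this example, and what counts as "unassigned" depends on the Unicode version bundled with the Python interpreter in use. The first function lists characters in a text that are private-use or unassigned code points; the second shows that a precomposed character and its base-plus-combining-mark spelling compare equal once both are normalized.

```python
import unicodedata


def audit_characters(text):
    """List characters that are not standard, assigned Unicode characters.

    Returns (index, code point label, problem) tuples. The result depends on
    the Unicode version shipped with the running Python interpreter.
    """
    findings = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        label = f"U+{ord(ch):04X}"
        if category == "Co":
            # General category Co: private-use code point (for example the PUA).
            findings.append((index, label, "private use"))
        elif category == "Cn":
            # General category Cn: not assigned in this Unicode version.
            findings.append((index, label, "unassigned"))
    return findings


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # U+E000 is the first code point of the Private Use Area.
    print(audit_characters("abc\ue000d"))
    # "e" + COMBINING ACUTE ACCENT versus precomposed U+00E9.
    print(same_text("e\u0301", "\u00e9"))
```

A project could run such an audit over its corpus to find where non-standard characters occur before deciding how to document or mark them up.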

The fourth objective of the Script Encoding Initiative is to raise funds for the writing of Unicode proposals by veteran Unicode proposal authors, scholars, and graduate students, and for font designers to work on the creation of free Unicode fonts. To date, work on Unicode proposals has been largely a volunteer effort, with little financial backing. Funding is needed to ensure that scripts are proposed in a timely way; otherwise the process will drag on for years to come. A call for donations for script proposal authoring, posted to a number of email lists (e.g., the Ancient Near East list and the Unicode email list), has resulted in a small number of donations. This project has submitted an application to NEH, and additional grant-writing is expected. More remarkable, however, is that no funding or basic support from any university for this project -- or for Unicode proposal work in general -- has been received. The cause may lie in part in the economic situation in the U.S., and in California in particular. However, there also appears to be a fundamental disconnect about why the university, the one place where the lesser-known scripts are regularly studied, should be involved in standards work. While online text projects (for pedagogical and historical preservation, as well as general communication capabilities for modern language communities) have drawn attention and received funding, script encoding still takes a back seat, though it provides the standard upon which all multilingual text projects should be based. At a time when multilingual capabilities are being touted, attention to the encoding of the outstanding scripts (and missing characters) is needed.

A number of general recommendations will conclude this talk:

Scholars should work closely with others in their field to arrive at a "best practices" set of guidelines on character encoding, realizing that Unicode will not encode variants or precomposed characters, and that Unicode is not intended as a means to capture paleographic details. The guidelines should be posted on a publicly accessible website.
Whatever method a project uses to cover missing characters, document that use fully. If the Private Use Area (PUA) is used, plan how conversion to Unicode will be implemented if and when the characters are accepted into Unicode.
Work with the Unicode Technical Committee (through SEI, if desired) on characters or scripts that are needed but missing from Unicode (cf. http://linguistics.berkeley.edu/~dwanders/alpha-script-list.html).
Promote the participation of fellow scholars in the Unicode proposal review process. A list of currently proposed scripts that need comments is posted at: http://linguistics.berkeley.edu/~dwanders/ScriptsNeedInput.html
Advocate greater participation and funding from universities (and governments) for Unicode script encoding.
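
The recommendation on documenting PUA use and planning for later conversion might look like the following minimal sketch. Everything project-specific here is an assumption made for illustration: the function name `convert_pua`, the mapping table, and the particular code points in it are hypothetical and do not describe any real project's assignments. The idea is simply that a project keeps one documented table from its private-use code points to standard characters, applies it once those characters are encoded, and gets a report of any private-use characters the table does not yet cover.

```python
import unicodedata


def convert_pua(text, mapping):
    """Replace documented PUA characters and report the ones left over.

    mapping: dict from a single private-use character to its standard
    replacement string (a hypothetical, project-maintained table).
    Returns (converted text, sorted list of unmapped private-use characters).
    """
    converted = []
    unmapped = set()
    for ch in text:
        if ch in mapping:
            converted.append(mapping[ch])
        else:
            # Keep the character, but flag it if it is still private use.
            if unicodedata.category(ch) == "Co":
                unmapped.add(ch)
            converted.append(ch)
    return "".join(converted), sorted(unmapped)


if __name__ == "__main__":
    # Hypothetical table: this project used U+E000 for a character that is
    # assumed, for the example only, to correspond to U+0250.
    project_table = {"\ue000": "\u0250"}
    new_text, remaining = convert_pua("a\ue000b\ue001", project_table)
    print(repr(new_text))
    print([f"U+{ord(ch):04X}" for ch in remaining])
```

Keeping the table in one place, alongside the project's public documentation, means the conversion can be rerun each time further characters are accepted.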

Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None