The Future of Markup is Multilingual

Alejandro G. Bia-Platas; Manuel Sanchez-Quero

Authorship

1. Alejandro G. Bia-Platas

Libraries - University of Alicante

2. Manuel Sanchez-Quero

Libraries - University of Alicante

Original URL

http://web.archive.org/web/20040903094418/http://www.hum.gu.se/allcach2004/AP/html/prop119.html

Introduction

Markup is based on mnemonics (i.e. element names, attribute names and attribute values). These mnemonics carry meaning, which is one of the most interesting features of markup. Markup allows us to define the structure of a text in a way that can be both processed by computer programs and understood by humans. This human understanding is lost when the encoder does not have a good command of the language the mnemonics are based on. For example, a Spanish encoder who does not know English will find applying or understanding TEI markup with the original English-based TEI mnemonics difficult and error-prone.

So, by multilingual markup we mean applying markup using mnemonics in one's own language while still following the rules of the original markup vocabulary. In our experience, a markup vocabulary exactly equivalent to TEI can be developed based on Spanish, Catalan, French and almost any other language, and the tools for translating back and forth to the original TEI core can be built automatically and applied in a transparent and easy way. When we build markup vocabularies equivalent to TEI but in the local language, the structural facilities and constraints of the markup scheme remain the same; only the markup terms used by DTDs, Schemas or documents are different, and the document structure becomes remarkably clearer for the encoder.

In this paper we will show and defend the benefits of using multilingual markup vocabularies for large digitization projects, such as reduced learning times, fewer errors, and increased production, all due to using markup tags in the local language. We will also describe our implementation of multilingual markup, based on the automatic generation of translating scripts using XSLT (Extensible Stylesheet Language Transformations) for document instances as well as DTDs or Schemas. We will then discuss alternative implementations that look promising. Finally, we will present the conclusions drawn from implementing and using this technology within the Miguel de Cervantes Digital Library.

We will also comment on the creation of a TEI Multilingual Markup Special Interest Group (TEI-MM-SIG) and the involvement of the TEI META Workgroup in the development and full implementation of a multilingual term-bank for the TEI.

Markup, meaning and multilinguality

One of the key aspects of structural markup is the meaning it conveys, which depends on our ability to understand it.

In 1998 Robin Cover wrote: "How does XML help with the encoding of information at the semantic level? ... New users sometimes refer to XML as semantic markup, and may be heard to praise XML for its ability to express semantic clarity through markup. ... Someone who uses a text editor to examine an XML document ... will readily judge the XML document more meaningful with respect to the information objects represented by text. The markup itself is a form of 'metadata', explaining to us what the constituent elements are (by name), and how these information objects are structured into larger coherent units." [1]

Sperberg-McQueen et al. [2] supported the usefulness of markup as a source of meaning: "The function of markup is not random. Markup has meaning. ... Why worry about this question?: For better markup language documentation, for better QA (verification), for better automated processes (translation, normalization, query), to provide a way to survey current practice (relevance for software developers) ... and because it's interesting. Because markup means something, ... we know certain things. I.e. because we see certain markup, we are allowed (licensed) to make certain inferences." They concluded that "the meaning of markup is the set of inferences it licenses."

So understanding XML tags is key to correctly delimiting complex text structures for further automated processing. This understanding may be compromised when tag names (elements, attributes and attribute values) are in a foreign language.

The largest group of workers in our digital library is by far the proofreading and markup team, comprising about 40 people. They are graduates from different humanities fields, none of them related to the English language. It is in this area that the need for, and importance of, translating the original English markup into the local language (Spanish) becomes evident.

We learned from practice that using a tagset in a foreign language, compared to using a tagset in our own language, increases learning time and reduces the quality and amount of digital text production, since tag names are mnemonics that may sound familiar to English speakers but are hard for speakers of other languages to understand and memorize. Giving our encoders the possibility of applying tags in Spanish has increased the amount and quality of digital text production.

After successfully using XML-TEI for some time, we embarked on the project of translating TEI element names, attribute names and attribute values into Spanish. We then developed the translation tools that provide automatic conversion to and from the main English TEI core. These automatic conversion programs translate not only the markup of XML documents but also the corresponding DTDs.

Then we repeated the experience with Catalan and ran some tests with French. We are now in the process of building TEI tagsets and translations for several other languages. The purpose is to have many official translations of the TEI tagset, but one core version (the original one). Automating the translation of the tags is vital to ensure easy interchange of documents among projects using different languages. In this way, from the structural and semantic point of view, the tagset is the same; only the names change.

We also believe that having multilingual versions of a given tagset, like TEI, can facilitate its acceptance and use in many parts of the world, such as Latin America, where the use of XML for electronic publishing is still uncommon. This may be of special interest to digital libraries and digital publishers worldwide, but especially within the European Union, where multilingual projects can benefit considerably.

Automatic generation of markup translators

We started by defining the set of possible translations of element names, attribute names, and attribute values to the different target languages. We stored this information in an XML multilingual translation mapping document. An example of this document and its DTD follows:

TRANSLATION MAPPING DOCUMENT FOR ENGLISH, SPANISH AND FRENCH (SAMPLE):

<TAGMAP>
...
<ELEMENT en="body" sp="cuerpo" fr="corps">
</ELEMENT>
...
<ELEMENT en="div0" sp="div0" fr="div0">
<ATTR en="lang" sp="lengua" fr="langue">
</ATTR>
<ATTR en="type" sp="tipo" fr="type">
<VALUE en="news" sp="noticias" fr="nouvelles"/>
<VALUE en="suggestions" sp="sugerencias" fr="suggestions"/>
<VALUE en="biblnews" sp="novedades" fr="publications"/>
</ATTR>
</ELEMENT>
...
<ELEMENT en="p" sp="parrafo" fr="paragraphe">
<ATTR en="align" sp="alinear" fr="aligne">
<VALUE en="left" sp="izq" fr="gauche"/>
<VALUE en="right" sp="der" fr="droite"/>
<VALUE en="center" sp="centro" fr="centre"/>
<VALUE en="justify" sp="justificar" fr="justifie"/>
</ATTR>
<ATTR en="indent" sp="sangria" fr="retraitpositif">
<VALUE en="left" sp="izq" fr="gauche"/>
<VALUE en="right" sp="der" fr="droite"/>
<VALUE en="both" sp="ambas" fr="lesDeux"/>
<VALUE en="none" sp="ninguna" fr="aucune"/>
</ATTR>
<ATTR en="specialindent" sp="sangriaespecial" fr="retraitnegatif">
<VALUE en="none" sp="ninguna" fr="aucune"/>
<VALUE en="firstline" sp="primeralinea" fr="premiereLigne"/>
<VALUE en="french" sp="francesa" fr="francaise"/>
</ATTR>
</ELEMENT>
...
</TAGMAP>

DTD FOR THE ABOVE FILE:

<!ELEMENT TAGMAP (ELEMENT)+ >

<!ELEMENT ELEMENT (ATTR)* >

<!ATTLIST ELEMENT
en CDATA #REQUIRED
sp CDATA #REQUIRED
fr CDATA #REQUIRED>

<!ELEMENT ATTR (VALUE)* >

<!ATTLIST ATTR
en CDATA #REQUIRED
sp CDATA #REQUIRED
fr CDATA #REQUIRED>

<!ELEMENT VALUE EMPTY >

<!ATTLIST VALUE
en CDATA #REQUIRED
sp CDATA #REQUIRED
fr CDATA #REQUIRED>
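
As an illustration of how such a mapping document can be consumed, the following is a minimal Python sketch (not the authors' implementation, which uses XSLT). It reads an abridged copy of the sample mapping and builds a nested lookup table for one language pair. The function name `load_tagmap` and the shape of the table are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample mapping document shown above.
TAGMAP_XML = """
<TAGMAP>
  <ELEMENT en="body" sp="cuerpo" fr="corps"/>
  <ELEMENT en="p" sp="parrafo" fr="paragraphe">
    <ATTR en="align" sp="alinear" fr="aligne">
      <VALUE en="left" sp="izq" fr="gauche"/>
      <VALUE en="right" sp="der" fr="droite"/>
    </ATTR>
  </ELEMENT>
</TAGMAP>
"""

def load_tagmap(xml_text, src, dst):
    """Return {src element: (dst element, {src attr: (dst attr, {src value: dst value})})}.

    Hypothetical helper: src and dst are language codes used as attribute
    names in the mapping document ("en", "sp", "fr").
    """
    table = {}
    for el in ET.fromstring(xml_text).findall("ELEMENT"):
        attrs = {}
        for at in el.findall("ATTR"):
            # Attribute values are scoped to their attribute, as in the mapping.
            values = {v.get(src): v.get(dst) for v in at.findall("VALUE")}
            attrs[at.get(src)] = (at.get(dst), values)
        table[el.get(src)] = (el.get(dst), attrs)
    return table

# One table per direction; both come from the same mapping document.
en_to_sp = load_tagmap(TAGMAP_XML, "en", "sp")
sp_to_en = load_tagmap(TAGMAP_XML, "sp", "en")
print(en_to_sp["p"])
```

Because attributes are nested under elements and values under attributes, the same English value (for example `left`) can have its translation looked up in the context of the attribute it belongs to.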

This mapping document, which contains all the structural information needed to develop the language converters, is read by the transformations generator, which was built as an XSLT script [3]. XSL can be used to process XML documents in order to produce other XML documents or plain text documents. Since XSL stylesheets are themselves XML, they can be generated as the output of an XSL transformation. In this way, for each of the languages contained in the multilingual translation mapping file, we produced both an English to local language XSL transformation and a local language to English XSL transformation, which ensures two-way convertibility of XML documents.
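
The idea of generating one stylesheet per language pair can be sketched as follows. The generator described above is itself written in XSLT; this sketch uses Python instead, purely for illustration, and emits the text of an XSLT 1.0 stylesheet from an abridged mapping. The function name `generate_xslt` and the template layout are assumptions, the documents are assumed to use no namespaces, and attribute values are assumed not to contain apostrophes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

TAGMAP_XML = """<TAGMAP>
  <ELEMENT en="body" sp="cuerpo" fr="corps"/>
  <ELEMENT en="p" sp="parrafo" fr="paragraphe">
    <ATTR en="align" sp="alinear" fr="aligne">
      <VALUE en="left" sp="izq" fr="gauche"/>
      <VALUE en="right" sp="der" fr="droite"/>
    </ATTR>
  </ELEMENT>
</TAGMAP>"""

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

def generate_xslt(tagmap_xml, src, dst):
    """Return the text of a stylesheet that renames src-language markup to dst."""
    out = [
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">',
        # Identity template: anything without a mapping is copied unchanged.
        '<xsl:template match="@*|node()">',
        '<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>',
        '</xsl:template>',
    ]
    for el in ET.fromstring(tagmap_xml).findall("ELEMENT"):
        s_el, d_el = el.get(src), el.get(dst)
        # One template per element: rename it, then process attributes and children.
        out += [
            f'<xsl:template match={quoteattr(s_el)}>',
            f'<xsl:element name={quoteattr(d_el)}>',
            '<xsl:apply-templates select="@*|node()"/>',
            '</xsl:element>',
            '</xsl:template>',
        ]
        for at in el.findall("ATTR"):
            # One template per attribute, scoped to its parent element.
            out.append(f'<xsl:template match={quoteattr(s_el + "/@" + at.get(src))}>')
            out.append(f'<xsl:attribute name={quoteattr(at.get(dst))}>')
            values = at.findall("VALUE")
            if values:
                out.append('<xsl:choose>')
                for v in values:
                    test = ".='" + v.get(src) + "'"
                    out.append(f'<xsl:when test={quoteattr(test)}>{escape(v.get(dst))}</xsl:when>')
                # Values that are not in the mapping pass through untouched.
                out.append('<xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>')
                out.append('</xsl:choose>')
            else:
                out.append('<xsl:value-of select="."/>')
            out.append('</xsl:attribute>')
            out.append('</xsl:template>')
    out.append('</xsl:stylesheet>')
    return "\n".join(out)

# Two stylesheets per target language, one for each direction.
en_to_sp_xslt = generate_xslt(TAGMAP_XML, "en", "sp")
sp_to_en_xslt = generate_xslt(TAGMAP_XML, "sp", "en")
print(en_to_sp_xslt)
```

Python's standard library has no XSLT processor, so applying the generated stylesheets would require an external XSLT 1.0 engine. The sketch only shows that both directions can be produced mechanically from the same mapping.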

For each target language we also generate a DTD or a Schema translator. In our first attempts, this took the form of a C++ and Lex parser (see figure 1). Then we changed the approach, and now we first convert the DTD to a W3C Schema, then translate the Schema to the local language, and finally we generate an equivalent translated DTD (see figure 2). This approach has the advantage of not using complex parsers (only XSLT) and also solves the translation of Schemas, which is an interesting goal in itself (see figure 3).
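
The middle step of this pipeline, translating the names inside a W3C Schema, can be sketched as below. This is a simplified Python illustration rather than the XSLT actually used. It assumes that every element is declared at the top level of the schema, that attributes are declared inside the element's own declaration, and that child elements are referenced with `ref`. The table `EN_TO_SP` and the function `translate_schema` are hypothetical. The DTD to Schema and Schema to DTD conversions are not covered.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # keep the xs: prefix when serializing

# Hypothetical table: {element: (name, {attr: (name, {value: value})})}.
EN_TO_SP = {
    "body": ("cuerpo", {}),
    "p": ("parrafo", {"align": ("alinear", {"left": "izq", "right": "der"})}),
}

SCHEMA = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="body">
    <xs:complexType><xs:sequence>
      <xs:element ref="p" maxOccurs="unbounded"/>
    </xs:sequence></xs:complexType>
  </xs:element>
  <xs:element name="p">
    <xs:complexType mixed="true">
      <xs:attribute name="align">
        <xs:simpleType><xs:restriction base="xs:string">
          <xs:enumeration value="left"/>
          <xs:enumeration value="right"/>
        </xs:restriction></xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def translate_schema(schema_text, table):
    """Rename element names, attribute names and enumerated values in a schema."""
    root = ET.fromstring(schema_text)
    for decl in root.findall(f"{{{XS}}}element"):  # top-level declarations only
        src_name = decl.get("name")
        if src_name not in table:
            continue
        dst_name, attrs = table[src_name]
        decl.set("name", dst_name)
        for at in decl.iter(f"{{{XS}}}attribute"):
            if at.get("name") in attrs:
                dst_attr, values = attrs[at.get("name")]
                at.set("name", dst_attr)
                for enum in at.iter(f"{{{XS}}}enumeration"):
                    old = enum.get("value")
                    enum.set("value", values.get(old, old))
    for ref in root.iter(f"{{{XS}}}element"):  # references to renamed elements
        if ref.get("ref") in table:
            ref.set("ref", table[ref.get("ref")][0])
    return ET.tostring(root, encoding="unicode")

translated_schema = translate_schema(SCHEMA, EN_TO_SP)
print(translated_schema)
```

Only names and enumerated values change. Content models, occurrence constraints and datatypes are left exactly as they were, which is the sense in which the translated schema remains structurally equivalent to the original.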

We considered only a one-way translation from the English DTD/Schema to a local language DTD/Schema, since we assumed that the DTD/Schema would first be built in the original language (English) and then translated to the local language. We saw no need to translate the local language DTD/Schema back to English (dashed line), but this transformation could easily be generated if the need arises, allowing maintenance and modifications to be done in the local language and then translated to English.

Many other markup translators can be built to other languages in the same way, as shown by our tests with Catalan and French.

Fig. 1: Automatic generation of markup translators. This figure describes the generation of XSL transformations and C++ parsers to convert English markup and DTDs to Spanish.

Fig. 2: DTD translation using XSLT and an intermediate Schema. This figure describes the same process as figure 1, but using only XSLT.

Fig. 3: Schema translation using XSLT. The solution shown in figure 2 for translating DTDs, by first converting them to Schemas and then using XSLT, implicitly solves the translation of Schemas.

Usage and implementation alternatives

We think that markup in the local language should only be used for tasks that require human intervention, like the creation and maintenance of documents. For automated processing and document interchange, we think it is more convenient to use markup in the language of the original standard. In this way, processing tools like stylesheets need not be translated to the local language; instead, the document is translated to the original tagset.

An alternative, and perhaps the most effective, implementation of multilingual markup could be a translating interface integrated into an XML editor. In this way, we would have virtual views of the document with markup in different languages that could be toggled at the touch of a button, without actually having to translate the document file. An implementation like this is possible today, but can only be done by the software companies that build XML editors. This built-in solution would not require the DTD/Schema to be translated. An editor like this would need to load the mapping information (tag-map), as well as the DTD and the document instance (see figure 4).

A compromise solution that can be integrated into some XML editors by expert users is to build macros that automatically apply the translation to local language on opening the document, and the translation back to English on closing. This would not be as handy as a one-key language-toggling solution, but can be implemented by users. Additional macro programming would also be required for translation before validation and before applying further processing like XSLT.
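
The open and close steps of such a macro can be sketched as a pair of functions that translate the stored English markup to the local language and back. The sketch below is in Python and renames nodes directly instead of running XSLT, so it illustrates the round trip only. `EN_TO_SP`, `translate`, `on_open` and `on_save` are hypothetical names and do not correspond to any particular editor's macro API.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: {element: (name, {attr: (name, {value: value})})}.
EN_TO_SP = {
    "body": ("cuerpo", {}),
    "p": ("parrafo", {"align": ("alinear", {"left": "izq", "right": "der"})}),
}

def invert(table):
    """Build the reverse table (local language back to English)."""
    return {
        d_el: (s_el, {d_at: (s_at, {dv: sv for sv, dv in vals.items()})
                      for s_at, (d_at, vals) in attrs.items()})
        for s_el, (d_el, attrs) in table.items()
    }

def translate(xml_text, table):
    """Rename elements, attributes and attribute values according to table."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if node.tag not in table:
            continue  # unmapped elements are left as they are
        new_tag, attrs = table[node.tag]
        node.tag = new_tag
        for name, value in list(node.attrib.items()):
            if name in attrs:
                new_name, values = attrs[name]
                del node.attrib[name]
                node.set(new_name, values.get(value, value))
    return ET.tostring(root, encoding="unicode")

SP_TO_EN = invert(EN_TO_SP)

def on_open(stored_text):
    """Hypothetical 'open' hook: show the encoder Spanish markup."""
    return translate(stored_text, EN_TO_SP)

def on_save(edited_text):
    """Hypothetical 'close' hook: store the document with English markup."""
    return translate(edited_text, SP_TO_EN)

stored = '<body><p align="left">Hola</p></body>'
opened = on_open(stored)
print(opened)
print(on_save(opened) == stored)
```

The same `on_save` translation would have to run before validation or further processing such as XSLT, since those steps expect the original tagset. Note that ElementTree's default parser discards comments and processing instructions, so a real macro would need a parser that preserves them.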

If multilingual markup becomes a common practice, the mapping structure with the name equivalences for markup translation could well be included as part of a new form of Schema. In any case, this use should be specified and formally integrated into the XML family of standards.

Fig. 4: XML Editor for Multilingual Markup. Apart from the document instance and the DTD/Schema for validation, a mapping structure with information for the translation is also required.

Conclusions

Are the advantages of using a general and widespread markup vocabulary like TEI lost? Not at all. The two main advantages of using a general markup vocabulary like TEI are document interchangeability and community support (which includes training and tool sharing). Since markup terms can be very easily and automatically translated to the original TEI tagset, interchangeability is not lost, and tools like XSLT scripts can still be used unchanged after markup translation. Training materials, however, may need to be translated or adapted, but this is not due to the use of multilingual markup but to the need of non-English-speaking encoders to have documentation in their language.
In our experience, learning times were noticeably reduced.
Production times were also reduced, and markup quality increased. Encoders were satisfied and more confident in their task.
By using markup in one's own language, the meaning of markup is not lost, and the document structure becomes much clearer.
Scholars and students approved of being able to handle documents with markup in the same language as the text.
Cooperative multilingual projects may benefit from the possibility of easily translating the markup to each encoder's language.
Sometimes new non-standard vocabularies are developed just because doing so seems comparatively easier than learning a standard vocabulary in a foreign language. Being able to use a standard vocabulary in one's own language removes much of the incentive to develop a new custom vocabulary to fulfil a local markup requirement. This may help spread the use of XML vocabularies like TEI or DocBook in non-English-speaking countries.
Spreading the use of standard markup vocabularies is good for document interchangeability.

Future work

A special interest group on multilingual markup (TEI-MM-SIG) has been created within the TEI Consortium to exploit and expand the benefits of using multilingual markup. During its first meeting, at the 2003 TEI annual meeting, the idea, tools and possibilities of multilingual markup were introduced, and the objectives of the group were established. Some of them are:

Translate all TEI mnemonics into different languages. This should be done by TEI users from different language zones, with interest in using markup in their own language. This is one of the main reasons to become a member of this SIG.
Using the different sets of mnemonics, we should build an official repository of TEI terms, i.e. a multilingual TEI Term-Bank. The TEI META Workgroup will provide technical support for implementing this Term-Bank.
Beta-test the new term-sets and tools. This is another reason to join this group: to be the first to use this technology and provide feedback to improve it.
Study the technical possibilities, limitations and challenges of multilingual markup. There are many aspects to be discussed and decisions yet to be made. To give an example, there may be problems to overcome if we want to build mnemonics using accented or oriental characters.

Bibliography

1. Robin Cover, "XML and Semantic Transparency", Cover Pages, October 23, 1998 (revised November 24, 1998). http://www.oasis-open.org/cover/xmlAndSemantics.html
2. C. M. Sperberg-McQueen, Claus Huitfeldt and Allen Renear, "Meaning and Interpretation of Markup: not as simple as you think", in Extreme Markup Languages, Montreal, 15 August 2000.
3. Michael Kay, XSLT Programmer's Reference, 1st ed., Wrox Press, 1102 Warwick Road, Acocks Green, Birmingham, B27 6BH, UK, 2000. ISBN 1-861003-12-9

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
