The Multilingual Markup Website

Alejandro Bia; Juan Malonda; Jaime Gomez

Authorship

1. Alejandro Bia

University Miguel Hernández

2. Juan Malonda

University Miguel Hernández

3. Jaime Gomez

University Alicante

Work text


INTRODUCTION
Markup is based on mnemonics (i.e. element names, attribute names and attribute values). These mnemonics carry meaning, which is one of the most interesting features of markup. Human understanding of this meaning is lost when the encoder does not understand the language the mnemonics are based on. By “multilingual markup” we refer to the use of parallel sets of tags in various languages, and the ability to switch automatically from one to another.
We started working with multilingual markup in 2001, within the Miguel de Cervantes Digital Library. By 2003, we had built a set of tools to automate the use of multilingual vocabularies [1]. This set of tools translates both XML document instances and XML document validators (we first implemented DTD translation, and then Schemas [2]). We first translated the TEI tagset, and more recently the Dublin Core tagset [3], to Spanish and Catalan. Other languages were added later (see footnote 1).
We now present a Multilingual Markup Website that provides this type of translation service for public use.
PREVIOUS WORK
When we started this multilingual markup initiative in 2001, there were very few similar attempts to be found [4]. Today they are still scarce [5, 6].
Concerning document content, XML provides built-in support for multilingual documents: it provides the predefined xml:lang attribute to identify the language used in any part of a document. However, in spite of allowing users to define their own tagsets, XML does not explicitly provide a mechanism for multilingual tagging.
THE MAPPING STRUCTURE
We started by defining the set of possible
translations of element names, attribute names, and attribute values to a few target languages (Spanish, Catalan and French). We stored this information in an XML translation mapping document called “tagmap”, whose structure in DTD syntax is the following:
<!ELEMENT tagmap (element)+ >
<!ELEMENT element (attr)* >
<!ATTLIST element
    en CDATA #REQUIRED
    es CDATA #REQUIRED
    fr CDATA #REQUIRED >
<!ELEMENT attr (value)* >
<!ATTLIST attr
    en CDATA #REQUIRED
    es CDATA #REQUIRED
    fr CDATA #REQUIRED >
<!ELEMENT value EMPTY >
<!ATTLIST value
    en CDATA #REQUIRED
    es CDATA #REQUIRED
    fr CDATA #REQUIRED >
Fig. 1. Structure of the original tagmap.xml file
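As an illustration of how this nested structure resolves ambiguity, the following minimal sketch looks up an attribute translation in the context of its parent element. The sample tagmap content, the translations in it, and the function name attribute_translation are assumptions made for the example; they are not taken from the project's actual mapping file.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagmap instance following the DTD in Fig. 1.
# The translations are illustrative only.
TAGMAP = """
<tagmap>
  <element en="div" es="div" fr="div">
    <attr en="type" es="tipo" fr="type">
      <value en="chapter" es="capitulo" fr="chapitre"/>
    </attr>
  </element>
  <element en="name" es="nombre" fr="nom">
    <attr en="type" es="clase" fr="genre"/>
  </element>
</tagmap>
"""

def attribute_translation(tagmap, element, attribute, source, target):
    """Translate an attribute name in the context of its parent element."""
    for el in tagmap.iter("element"):
        if el.get(source) != element:
            continue
        for attr in el.iter("attr"):
            if attr.get(source) == attribute:
                return attr.get(target)
    return None  # no translation recorded for this element/attribute pair

tagmap = ET.fromstring(TAGMAP)
# The same English attribute name maps to different Spanish names,
# depending on the element it belongs to.
print(attribute_translation(tagmap, "div", "type", "en", "es"))   # tipo
print(attribute_translation(tagmap, "name", "type", "en", "es"))  # clase
```

Because every attribute is nested under its element, a global attribute would have to be listed once per element in such a file.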
This structure is fairly simple, and proved useful to support the mnemonic equivalences in various languages. It was meant to solve ambiguity problems, such as two attributes with the same name in English that should be translated to different names in a given target language.
For this purpose, the structure obliges us to include all the attribute names for each element, together with their translations. The problem with this is global attributes, which in this approach need to be repeated once for each element. This made the maintenance of the file cumbersome. Sebastian Rahtz then proposed another structure, under the assumption that an attribute name has the same meaning in all cases, no matter which element it is associated with, and accordingly has only one target translation in a given language. This is usually the case, and although in theory there could be cases of double meaning, as mentioned above, they do not seem to appear within the TEI. So the currently available “teinames.xml” file follows Sebastian’s structure. Note that “element”, “attribute” and “value” appear at the same level, instead of nested:
<!ELEMENT i18n (element | attribute | value)+ >
<!ELEMENT element (equiv | desc)* >
<!ATTLIST element
    ident CDATA #REQUIRED >
<!ELEMENT attribute (equiv | desc)* >
<!ATTLIST attribute
    ident CDATA #REQUIRED >
<!ELEMENT value (equiv)* >
<!ATTLIST value
    ident CDATA #REQUIRED >
<!ELEMENT equiv EMPTY >
<!ATTLIST equiv
    xml:lang CDATA #REQUIRED
    value CDATA #REQUIRED >
In 2004, we discussed the idea of adding brief text descriptions to each element: the same brief descriptions found in the TEI documentation, but translated to all supported languages. This would allow the structure to provide help or documentation services in several languages, as another multilingual aid. This capability was then added to the “teinames.xml” file structure, although the translations of all the descriptions still need to be completed:
<!ELEMENT desc (#PCDATA) >
<!ATTLIST desc
    xml:lang CDATA #REQUIRED >
Fig. 2. Structure of the teinames.xml file.
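To make the flat structure concrete, here is a small sketch that reads a fragment shaped like the DTD above and builds per-language lookup tables. The fragment itself, the assumption that ident holds the English name, and the helper names load_equivalences and description are all illustrative rather than part of the published file.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment in the teinames.xml structure; entries are illustrative.
TEINAMES = """
<i18n>
  <element ident="head">
    <equiv xml:lang="es" value="encabezado"/>
    <equiv xml:lang="fr" value="entete"/>
    <desc xml:lang="en">contains any type of heading</desc>
    <desc xml:lang="es">contiene cualquier tipo de encabezado</desc>
  </element>
  <attribute ident="type">
    <equiv xml:lang="es" value="tipo"/>
  </attribute>
</i18n>
"""

def load_equivalences(root, lang):
    """Return {kind: {ident: translated name}} for one target language."""
    tables = {"element": {}, "attribute": {}, "value": {}}
    for entry in root:
        for equiv in entry.findall("equiv"):
            if equiv.get(XML_LANG) == lang:
                tables[entry.tag][entry.get("ident")] = equiv.get("value")
    return tables

def description(root, kind, ident, lang):
    """Return the brief description of an entry in the given language, if any."""
    for entry in root.findall(kind):
        if entry.get("ident") == ident:
            for desc in entry.findall("desc"):
                if desc.get(XML_LANG) == lang:
                    return desc.text
    return None

root = ET.fromstring(TEINAMES)
spanish = load_equivalences(root, "es")
print(spanish["element"]["head"], spanish["attribute"]["type"])
print(description(root, "element", "head", "es"))
```

Since attributes are listed once at the top level, a global attribute needs a single entry regardless of how many elements carry it.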
THE MULTILINGUAL MARKUP WEB SERVICE
By means of a simple input form, the markup of a structured file can be automatically translated to the chosen target language. The user can choose a file to process (see figure 3) by means of a “Browse” button.
Fig. 3. The Multilingual Markup Translator form.
Currently, only TEI XML document instances are allowed. In the near future, the translation of TEI DTDs, W3C Schemas and RELAX NG schemas will be added, and later, other markup and metadata vocabularies will be supported, such as DocBook and Dublin Core.
The system uses file extensions to identify the type of file submitted. Allowed file extensions are: .xml for document instances, .dtd for DTDs, .xsd for W3C Schemas, and .rng for RELAX NG schemas.
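The extension check can be sketched as follows. This is only an illustration of the rule stated above: the function name classify_upload, the case-insensitive comparison and the error message are assumptions, not details of the website's code.

```python
from pathlib import Path

# File types listed in the text, keyed by lowercase extension.
FILE_TYPES = {
    ".xml": "document instance",
    ".dtd": "DTD",
    ".xsd": "W3C Schema",
    ".rng": "RELAX NG schema",
}

def classify_upload(filename):
    """Return the kind of file, or raise ValueError for unsupported extensions."""
    suffix = Path(filename).suffix.lower()  # assumption: extensions are case-insensitive
    try:
        return FILE_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix or '(none)'}") from None

print(classify_upload("quijote.xml"))
print(classify_upload("TeiXLite.dtd"))
```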
The document to be uploaded must be valid and
well-formed. If the document is not valid, the translation
will not be completed successfully, and an error page will
be issued. Once the source file has been chosen, the user
must indicate the language of the markup of this source
file, as well as the target language desired for the output.
This is done by means of radio buttons.
It would not be necessary to indicate the language of the markup of the source file if it were implicit in the file itself. We thought of three ways to do this:
1. Use the name of the root tag to indicate the language of the vocabulary of the XML document. In this way, TEI.2 would be standard English-based TEI, TEIes.2 would indicate that the document has been marked up using the Spanish tagset, and in the same way TEIfr.2, TEIde.2 and TEIit.2 would indicate French, German and Italian, for instance.
2. Add an attribute to the root element to indicate the language of the tagset. For instance, <TEI.2 markupLang="it"> would indicate that the markup is in Italian.
3. Use the name of the DTD to indicate the language of the tagset. TeiXLite.dtd would be English, while TeiXLiteFr.dtd would be the French equivalent.
Option 3 is by far the worst method, since a document
instance may lack a DOCTYPE declaration, and there
may be lots of customized TEI DTDs everywhere with
very different and unpredictable names. However,
options 1 and 2 are reasonably good methods to identify
the language of the markup. Consensus is needed to make
one of them the common practice.
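Options 1 and 2 could be detected along the following lines. This is a sketch under stated assumptions: no such convention has been agreed, the function name markup_language is hypothetical, and giving the markupLang attribute precedence over the root element name is an arbitrary choice made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Option 1: TEI.2 (English by default), TEIes.2, TEIfr.2, TEIde.2, ...
ROOT_NAME = re.compile(r"^TEI(?P<lang>[a-z]{2})?\.2$")

def markup_language(xml_text, default="en"):
    """Guess the language of the tagset from the root element of a document."""
    root = ET.fromstring(xml_text)
    # Option 2: an explicit attribute on the root element.
    declared = root.get("markupLang")
    if declared:
        return declared
    # Option 1: a language code embedded in the root element name.
    match = ROOT_NAME.match(root.tag)
    if match:
        return match.group("lang") or default
    return None  # neither convention applies

print(markup_language("<TEIes.2/>"))
print(markup_language('<TEI.2 markupLang="it"/>'))
```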
IMPLEMENTATION DETAILS
For the website pages we used JSP (dynamic pages) and HTML (static pages), served by a Tomcat 5.5 web server. For the translations we used XSLT, as described in [1, 2, 3].
AUTOMATIC GENERATION OF MARKUP TRANSLATORS USING XSLT
The XSLT model is designed to transform one input XML file into one output file (see figure 4), which can be XML, HTML, XHTML or plain text, including program code. It does not allow the simultaneous processing of two input files.
Fig. 4. The XSLT processing model.
There are certain cases in which we would like to process two input files together, as in markup translation (see figure 5).
Fig. 5. The ideal transformation required.
As XSLT does not allow this, two alternatives occurred to us, both comprising two transformation steps.
The first approach is to generate translators automatically. As Douglas Schmidt said: “I prefer to write code that writes code, than to write code” [7]. This is what we have done for the MMWebsite: we pre-process the translation map in order to generate an XSLT translation script that has the translation knowledge embedded in its logic. This generated script can then perform all the document-instance translations required. The mapping structure supports equivalences among several languages, so we should generate a translator for every possible pair of languages. Whenever the mapping structure is modified, a new set of translators must be generated. Fortunately, this is an automated process.
Fig. 6. Pre-generation of a translating XSLT script, to then translate the document instance.
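The idea of generating a translator from the mapping can be sketched in a few lines. The authors' generator is itself an XSLT script; the Python version below is only a stand-in that shows the shape of the output, and the function name generate_translator and the sample names are assumptions. It covers element and attribute names but not attribute values.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

def generate_translator(element_names, attribute_names):
    """Build an XSLT 1.0 stylesheet (as text) that renames elements and attributes.

    Both arguments map source-language names to target-language names.
    """
    parts = [
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">',
        # Identity template: anything without a translation is copied unchanged.
        '  <xsl:template match="@*|node()">',
        '    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>',
        '  </xsl:template>',
    ]
    for source, target in sorted(element_names.items()):
        parts += [
            f'  <xsl:template match={quoteattr(source)}>',
            f'    <xsl:element name={quoteattr(target)}>',
            '      <xsl:apply-templates select="@*|node()"/>',
            '    </xsl:element>',
            '  </xsl:template>',
        ]
    for source, target in sorted(attribute_names.items()):
        parts += [
            f'  <xsl:template match={quoteattr("@" + source)}>',
            f'    <xsl:attribute name={quoteattr(target)}>',
            '      <xsl:value-of select="."/>',
            '    </xsl:attribute>',
            '  </xsl:template>',
        ]
    parts.append('</xsl:stylesheet>')
    return "\n".join(parts)

# Hypothetical English-to-Spanish names.
stylesheet = generate_translator({"head": "encabezado"}, {"type": "tipo"})
ET.fromstring(stylesheet)  # raises ParseError if the generated text is not well-formed
print(stylesheet)
```

Swapping the keys and values of the two dictionaries would yield the translator for the opposite direction, which is why one mapping file is enough to produce a translator for each language pair.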
The other alternative would be to merge the two input files into a single new XML structure, and then to process that file, which would contain both the XML document instance and the translation mapping information (see figure 7). This implies joining the two XML tree structures as branches of a higher-level root.
Fig. 7. Merging the two files before applying XSLT.
Although this approach may prove useful for some problems, we did not use it for the MMWebsite, because the file-merging preprocessing must be done for each file to be translated, which increases the web service response time. Using pre-generated translators proved to be a faster solution.
This limitation, which is inherent in the XSLT processing model, could be avoided by using a general-purpose programming language such as Java instead.
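For comparison, the merging alternative described above amounts to wrapping both trees under a new root before transformation. The sketch below assumes hypothetical wrapper element names (merged, document, mapping); the text does not specify any.

```python
import copy
import xml.etree.ElementTree as ET

def merge_inputs(document_root, mapping_root):
    """Join a document instance and a mapping as branches of a new root."""
    merged = ET.Element("merged")                    # hypothetical wrapper name
    doc_branch = ET.SubElement(merged, "document")   # hypothetical branch name
    doc_branch.append(copy.deepcopy(document_root))  # copy, so the inputs stay untouched
    map_branch = ET.SubElement(merged, "mapping")    # hypothetical branch name
    map_branch.append(copy.deepcopy(mapping_root))
    return merged

document = ET.fromstring("<TEI.2><text><body/></text></TEI.2>")
mapping = ET.fromstring('<i18n><element ident="body"/></i18n>')
print(ET.tostring(merge_inputs(document, mapping), encoding="unicode"))
```

The copy of the whole document on every request is the per-file cost that the text identifies as the drawback of this approach.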
HOW WE ACTUALLY DO IT
The mapping document, which contains all the structural information needed to build the language converters, is read by the transformation generator, which was itself built as an XSLT script. XSL can be used to process XML documents in order to produce other XML documents or plain text. As XSL stylesheets are themselves XML, they can be generated as XSL output. We used this feature to generate automatically both an English-to-local-language XSL transformation and a local-language-to-English XSL transformation for each of the languages contained in the multilingual translation mapping file. In this way we ensured two-way convertibility for XML documents (see figure 8).
Fig. 8. Schema translation using XSLT.
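Two-way convertibility can be checked with a round trip: translate a document to the local language, translate it back, and compare it with the original. The sketch below uses a plain Python renaming function in place of the generated XSL transformations, and the names translate_markup and invert, as well as the sample vocabulary, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def translate_markup(root, element_names, attribute_names):
    """Return a copy of the tree with names replaced; text content is untouched."""
    new = ET.Element(element_names.get(root.tag, root.tag))
    for name, value in root.attrib.items():
        new.set(attribute_names.get(name, name), value)
    new.text, new.tail = root.text, root.tail
    for child in root:
        new.append(translate_markup(child, element_names, attribute_names))
    return new

def invert(names):
    """Reverse a mapping, refusing mappings that are not one-to-one."""
    inverse = {target: source for source, target in names.items()}
    if len(inverse) != len(names):
        raise ValueError("mapping is not one-to-one, so it cannot be reversed")
    return inverse

# Hypothetical English-to-Spanish names.
elements = {"head": "encabezado", "note": "nota"}
attributes = {"type": "tipo"}

original = ET.fromstring('<div type="chapter"><head>Prologue</head><note>n1</note></div>')
spanish = translate_markup(original, elements, attributes)
back = translate_markup(spanish, invert(elements), invert(attributes))

print(ET.tostring(spanish, encoding="unicode"))
print(ET.tostring(back) == ET.tostring(original))
```

The round trip only works when each name has exactly one counterpart in the other language, which is why invert rejects mappings where two source names share a translation.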
For each target language we also generate a DTD or Schema translator. In our first attempts, this took the form of a C++ and Lex parser. Later, we changed the approach: we now first convert the DTD to a W3C Schema, then translate the Schema to the local language, and finally (optionally) generate an equivalent translated DTD. This approach has the advantage of not requiring complex parsers (only XSLT), and it also solves the translation of Schemas. In our latest implementation, the user can freely choose among DTD, W3C Schema and RELAX NG, both for input and for output, which allows a format conversion during the translation process.
Markup translators for many other languages can be built in the way described here.
CONCLUSIONS
Among the observed advantages of using markup in one’s own language are reduced learning times, fewer errors and higher productivity. It may also help spread the use of XML vocabularies such as DC, TEI, DocBook and many others into non-English-speaking countries. Cooperative multilingual projects may benefit from the possibility of easily translating the markup to each encoder’s language. Last, but not least, scholars of a given language feel more comfortable tagging their texts with mnemonics based on their own language.
FUTURE WORK
Multilingual Help Services: As already mentioned, brief descriptions of elements and attributes in different languages have been added to the mapping structure. This allows for multilingual help services, such as generating a glossary, in the chosen language, of the elements and attributes used in a given document or in a given DTD/Schema. We are working on adding this feature.
Footnotes
♣ This work is part of the METASIGN project, and has been supported by the Ministry of Education and Science of Spain through the grant number: TIN2004-00779.
1 Translations of the TEI tagset by: Alex Bia and Manuel Sánchez (Spanish), Régis Déau (French), Francesca Mari (Catalan), Arno Mittelbach (German)
References
Endnotes
[1] Bia, A., Sánchez, M., and Déau, R. (2003) Multilingual Markup of Digital Library Texts Using XML, TEI and XSLT. In XML Europe 2003 Conference and Exposition, Organized by IDEAlliance, 5-8 May 2003, Hilton Metropole Hotel, London, p. 53, http://www.xmleurope.com/
[2] Bia, A., and Sánchez, M. (2004) The Future of Markup is Multilingual. In ACH/ALLC 2004: Computing and Multilingual, Multicultural Heritage. The 16th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, 11-16 June 2004, Göteborg University, Sweden, p. 15-18, http://www.hum.gu.se/allcach2004/AP/html/prop119.html
[3] Bia, A., Malonda, J., and Gómez, J. (2005) Automating Multilingual Metadata Vocabularies. In DC-2005: Vocabularies in Practice, Eva Mª Méndez Rodríguez (ed.), p. 221-229, 12-15 September 2005, Carlos III University, Madrid. ISBN 84-89315-44-2. http://dc2005.uc3m.es/
[4] Pei-Chi WU (2000) Translation of Multilingual Markup in XML. In International Conference on the theories and practices of Electronic Commerce, Part II, Session 14, pages 21-36, Association of Taiwan Electronic Commerce, Taipei, Taiwan, October 2000. http://www.atec.org.tw/ec2000/PDF/14.2.pdf
[5] Bryan, J. (2002) KR’s Multilingual Markup, TechNews Volume 8, Number 1: January/February 2002 http://www.naa.org/technews/TNArtPage.cfm?AID=3880
[6] Cover, R. Markup and Multilingualism, last visited online 2005-4-25 at Cover Pages: http://xml.coverpages.org/multilingual.html
[7] Schmidt, D. (2005) Opening Keynote, MoDELS 2005: ACM/IEEE 8th International Conference on Model Driven Engineering Languages and Systems, Montego Bay, Jamaica, 2-7 October 2005.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC
