Using Software Modeling Techniques to Design Document and Metadata Structures

  1. Alejandro Bia

    University Miguel Hernández

  2. Jaime Gomez

    University of Alicante


This paper discusses the applicability of modelling methods originally meant for business applications
to the design of the complex markup vocabularies used for XML Web-content production.
We are working on integrating these technologies
to create a dynamic and interactive environment for the design of document markup schemes.
This paper focuses on the analysis, design and
maintenance of XML vocabularies based on UML. It considers the automatic generation of Schemas from a visual UML model of the markup vocabulary, as well as the generation of DTDs and also pieces of software, like input forms.
Most authors who have treated the relationship between UML and XML [5, 7] targeted only business applications and did not consider complex document modelling for massive and systematic production of XML content for the Web. In a Web publishing project, we need to produce hundreds of XML documents for Web publication.
Digital Library XML documents that model the structure of literary texts and include bibliographic information (metadata), plus processing and formatting instructions, are far more complex than the XML data we usually find in business applications. Figure 1 shows a small document model based on the TEI. Although it may seem complex, it is only a very small TEI subset.
This type of markup is not as simple and homogeneous as conventional structured data. In these documents we usually find a wide variety of deeply nested elements, and there are many exceptional cases that can lead to unexpected markup situations that also need to be covered. Complex markup schemes like TEI [9] and DocBook [1] are good examples of this versatility.
However, no matter how heterogeneous and unpredictable humanities markup may be, software engineers have to deal with it in a systematic way, so that automatic processes can be applied to these texts to produce useful output for Web publishing, indexing and searching, pretty printing, and other end-user facilities and services. There is also a need to reduce content production times and costs by automating and systematizing content production. For this, software, documentation and guides of good practice have to be developed.
Building all this automation, and these methods and procedures, within the complexity of humanities content structuring can be called Document Engineering. The purpose is to reduce costs and facilitate content production by setting constraints, rules and methods, and by implementing automation wherever and whenever possible.
XML, DTDs or Schemas, XSL transforms, CSS stylesheets and Java programming are the usual tools to enforce the rules, constraints and transformations necessary to turn the document structuring problem into a systematic automated process that leads to useful Web services. But the wide variety of Schema types, and the individual limitations of each of them, make the task of setting up a production environment like this very difficult.
On one hand we need a markup vocabulary that can
cover all document structuring requirements, even the most
unusual and complex, but that is simple enough for our purposes. In other words, we need the simplest DTD/Schema that fits our needs. We previously treated the
problem of DTD/Schema simplification in [2, 3].
But DTD/Schema simplification, although useful, doesn’t solve all the problems of Document Engineering, like building transformations to obtain useful output or assigning behaviour to certain structures (like popup notes, linking, and triggering services). Environments of this kind are usually built incrementally. The design information, if any, is dispersed across many pieces of software (Schemas, transformations, Java applets and servlets), or does not exist at all. A system like this includes document design (DTD/Schemas), document production techniques and tools (XSL and Java), document exploitation tools (indexing, searching, metadata, dictionaries, concordances, etc.) and Web design.
UML modelling may be the answer to join all those bits and pieces into a coherent design that reduces design cost, improves the quality of the result, provides documentation and may even simplify maintenance. UML modelling for massive Web content production may also lead to automatic generation of some of the tools mentioned.
Apart from modelling the structure of a class of
documents (as DTDs and Schemas do), UML can capture other properties of elements:
- Behaviour: this is related to event-oriented functions (e.g. popup notes).
- Additional powerful validation features (e.g. validating the consistency of certain fields, like author name, against a database).
- Customization of document models to provide different views or subsets of the markup scheme to different users (e.g. DTDs for development of different types of news).
We believe that the dynamic and interactive environment
described here will be very useful to professionals
responsible for designing and implementing markup schemes for Web documents and metadata.
Although XML standards for text markup (like TEI and DocBook) and metadata markup (e.g. MODS, EAD) are readily available [8], tools and techniques for automating the process of customizing DTD/Schemas and adding postprocessing functionality are not.
As Kimber and Heintz define it [7], the problem is: how do we integrate traditional system engineering modelling practice with traditional SGML and XML document analysis and modelling?
According to David Carlson [5], eXtensible Markup Language (XML) and Unified Modelling Language (UML) are two of the most significant advances from the fields of Web application development and object-oriented modelling.
We are working on integrating these technologies to create a dynamic and interactive environment for the design of document markup schemes (see figure 2).
Our approach is to expand the capabilities of VisualWade (see note 1) to obtain a tool that allows the visual analysis, design and maintenance of XML vocabularies based on UML. Among our targets are the automatic generation of different types of DTD/Schemas from a visual UML model of the markup vocabulary, code generation where possible (like generating HTML forms or XSLT), documentation, and special enhanced validators that can perform verifications beyond those allowed by DTDs or Schemas (like verification of certain element content or attribute values against a database).
Fig. 2: An environment for the design of document markup schemes.
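The enhanced validation mentioned above (checking element content against a database) could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `authors` table, the `author` element name and the sample names are hypothetical, and an in-memory SQLite database stands in for whatever authority database a project actually uses.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_authority_db(names):
    """Build a hypothetical in-memory authority table of accepted author names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO authors (name) VALUES (?)", [(n,) for n in names])
    return conn


def unknown_authors(xml_text, conn):
    """Return the content of every <author> element that is not in the database.

    This is a check that a DTD or Schema alone cannot express, since it
    depends on data held outside the document.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for author in root.iter("author"):
        name = (author.text or "").strip()
        row = conn.execute(
            "SELECT 1 FROM authors WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            missing.append(name)
    return missing


conn = make_authority_db(["Miguel de Cervantes", "Lope de Vega"])
doc = "<bibl><author>Miguel de Cervantes</author><author>Lope de Bega</author></bibl>"
problems = unknown_authors(doc, conn)
print(problems)
```

A validator of this kind would normally run after ordinary DTD or Schema validation, so that it only has to deal with structurally valid documents.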
Carlson [5] suggests a method based on UML class
diagrams and use case analysis for business applications,
which we adapted for modelling document markup.
A UML class diagram can be constructed to visually
represent the elements, relationships, and constraints of an XML vocabulary (see figure 3 for a simplified example).
Then all types of Schemas can be generated from the UML diagrams by means of simple XSLT transformations applied to the corresponding XMI representation of the UML model.
Fig. 3: Example of a UML class diagram (partial view).
The UML model information can be stored in an XML document according to the XMI (XML Metadata Interchange) standard as described by Hayashi and Hatton [6]: “Adherence to the [XMI] standard allows other groups to easily use our modelling work and because the format is XML, we can derive a number of other useful documents using standard XSL transformations”. In our case, these documents are Schemas of various types as well as DTDs. Like Schemas, DTDs can also be generated from the XMI representation of the UML model (dotted line), but as DTDs are simpler than Schemas, and all types of Schemas contain at least the same information as a DTD, DTDs can also be generated directly from the Schemas.
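The idea of deriving a DTD from a serialized class model can be illustrated with a small sketch. It makes two assumptions that depart from the text: Python is used in place of XSLT, and the input is a hypothetical, heavily simplified model format (`model`, `class`, `child`, `attribute`) rather than real XMI, which is namespaced and much more verbose.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified stand-in for an XMI export of a UML class diagram.
MODEL = """
<model>
  <class name="text">
    <child ref="front" card="?"/>
    <child ref="body" card="1"/>
  </class>
  <class name="front"><child ref="p" card="*"/></class>
  <class name="body"><child ref="p" card="+"/></class>
  <class name="p" mixed="true">
    <attribute name="id" type="ID" use="optional"/>
    <child ref="note" card="*"/>
  </class>
  <class name="note" mixed="true">
    <attribute name="type" type="CDATA" use="required"/>
  </class>
</model>
"""


def content_model(cls):
    """Translate the children of one class into a DTD content model."""
    names = [c.get("ref") for c in cls.findall("child")]
    if cls.get("mixed") == "true":
        if not names:
            return "(#PCDATA)"
        return "(#PCDATA | " + " | ".join(names) + ")*"
    if not names:
        return "EMPTY"
    parts = []
    for child in cls.findall("child"):
        card = child.get("card", "1")
        # Cardinality "1" has no DTD suffix; "?", "*" and "+" map directly.
        parts.append(child.get("ref") + ("" if card == "1" else card))
    return "(" + ", ".join(parts) + ")"


def attlist(cls):
    """Translate the attributes of one class into an ATTLIST declaration."""
    attrs = cls.findall("attribute")
    if not attrs:
        return None
    lines = []
    for a in attrs:
        default = "#REQUIRED" if a.get("use") == "required" else "#IMPLIED"
        lines.append(f"  {a.get('name')} {a.get('type')} {default}")
    return f"<!ATTLIST {cls.get('name')}\n" + "\n".join(lines) + ">"


def model_to_dtd(xml_text):
    root = ET.fromstring(xml_text)
    out = []
    for cls in root.findall("class"):
        out.append(f"<!ELEMENT {cls.get('name')} {content_model(cls)}>")
        declaration = attlist(cls)
        if declaration:
            out.append(declaration)
    return "\n".join(out)


dtd = model_to_dtd(MODEL)
print(dtd)
```

Generating other Schema languages from the same model would follow the same pattern, with one output function per target format.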
In many cases, code generation from a high-level model is also possible. Code generation may include JavaScript code to implement behaviour for certain elements like popup notes, hyperlinks, image display controls, etc. This is also the case for HTML input forms, which can be generated from Schemas as shown by Suleman [10].
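As a rough illustration of form generation, the sketch below builds an HTML input form from a list of field descriptions. It is not Suleman's method: the `FIELDS` structure, the field names and the `build_form` helper are hypothetical, and in practice the field list would be read from a Schema rather than written by hand.

```python
from html import escape

# Hypothetical metadata fields; a real tool would derive these from a Schema.
FIELDS = [
    {"name": "title", "label": "Title", "required": True},
    {"name": "author", "label": "Author", "required": True},
    {"name": "language", "label": "Language", "choices": ["en", "es", "fr"]},
]


def field_to_html(field):
    """Render one field as a label plus a text input or a select box."""
    name = escape(field["name"], quote=True)
    label = escape(field["label"])
    required = " required" if field.get("required") else ""
    if "choices" in field:
        options = "".join(
            f'<option value="{escape(c, quote=True)}">{escape(c)}</option>'
            for c in field["choices"]
        )
        control = f'<select name="{name}" id="{name}"{required}>{options}</select>'
    else:
        control = f'<input type="text" name="{name}" id="{name}"{required}>'
    return f'<label for="{name}">{label}</label> {control}'


def build_form(fields, action="#"):
    """Wrap all rendered fields in a single HTML form."""
    rows = "\n".join("  <p>" + field_to_html(f) + "</p>" for f in fields)
    return f'<form method="post" action="{escape(action, quote=True)}">\n{rows}\n</form>'


form_html = build_form(FIELDS)
print(form_html)
```

Enumerated values map naturally to select boxes and required elements to the `required` attribute, which is why a schema carries enough information to drive this kind of generation.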
We have successfully experimented with the generation of XSLT skeletons for XML transformation, which saves a lot of time. Usually XSL transforms produce fairly
static output, like nicely formatted HTML with tables of contents and hyperlinks, but not much more. In exceptional
cases we can find examples of more sophisticated
This high level of flexible interactivity is the real payoff from the UML-XML-XSLT-browser chain.
This sort of functionality is usually programmed
specifically for individual projects, given that it’s highly
dependent on the nature of the markup in any given
document. We aim to provide the ability to specify this at the UML level. For instance, a note could be processed differently according to its type attribute and then be
displayed as a footnote, a margin note, a popup note, etc. In certain cases it can be hooked to a JavaScript function to be popped up in a message window or in a new browser instance according to attribute values. In this sense, we could provide a set of generic JavaScript functions which
could retrieve content from elements and display it in
various ways (popup, insertions, etc.) or trigger events (like a dictionary lookup).
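The type-dependent processing of notes described above can be sketched as a simple dispatch table. This is an illustrative Python sketch, not the generic JavaScript library the text envisions: the type values (`foot`, `margin`, `popup`), the CSS class names and the client-side function `showNote` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_footnote(text):
    return f'<span class="footnote">{escape(text)}</span>'


def render_margin(text):
    return f'<aside class="margin-note">{escape(text)}</aside>'


def render_popup(text):
    # showNote is a hypothetical client-side JavaScript function.
    return (
        '<a href="#" onclick="showNote(this); return false;" '
        f'data-note="{escape(text, quote=True)}">*</a>'
    )


# One renderer per value of the note's type attribute.
RENDERERS = {
    "foot": render_footnote,
    "margin": render_margin,
    "popup": render_popup,
}


def render_note(note):
    """Pick a renderer from the type attribute, falling back to a footnote."""
    text = "".join(note.itertext()).strip()
    renderer = RENDERERS.get(note.get("type"), render_footnote)
    return renderer(text)


sample = ET.fromstring(
    '<p>Text<note type="popup">See the 1605 edition.</note> more '
    '<note type="margin">Variant</note><note>Plain</note></p>'
)
rendered = [render_note(n) for n in sample.iter("note")]
for fragment in rendered:
    print(fragment)
```

Keeping the mapping in a table means that a new presentation style only requires a new renderer and one new entry, which is the kind of information a UML-level specification could supply.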
We should look for document models that allow all kinds of presentation, navigation and cognitive metaphors:
- Sequential reading
- Text reuse (links and includes)
- Non-sequential reading
- Hyperlinks
- Collapsible text
- Foot notes, margin notes, popup notes
- The folder metaphor
- TOCs, indexes and menus
All the elements in a structured document have associated semantics and a behaviour or function (as in the above example, a popup note must appear in a popup window when a link to it is pressed). This is not reflected in conventional document models: a DTD/Schema may say that a note is a popup note: ... but the behaviour of this note is not stated at all. Some postprocessing must be implemented for the popup effect to happen. A UML-based document model can incorporate the expected behaviour as methods in a class diagram.
As additional aiding tools for this project we have incorporated two of our earlier developments:
First, the automatic simplification of DTDs based on sample sets of files [2, 3]. This tool can be applied to obtain simplified DTDs and Schemas customized to fit a collection of documents exactly.
Second, automatic translation of element and attribute names, which can be applied when multilingual markup is required. A detailed explanation of the multilingual markup project can be found in [4].
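The first of these tools, example-based simplification, can be illustrated with a deliberately naive sketch. It is not the algorithm of [2, 3]: it only records which elements occur in a set of sample documents, and which children each was seen with, and then drops everything else from a vocabulary. The vocabulary list and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def observed_structure(samples):
    """Map each element name seen in the samples to the set of its child names."""
    children = {}
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        for parent in root.iter():
            kids = children.setdefault(parent.tag, set())
            kids.update(child.tag for child in parent)
    return children


def prune(vocabulary, samples):
    """Keep only the vocabulary elements that actually occur in the samples."""
    used = observed_structure(samples)
    return {name: sorted(used[name]) for name in vocabulary if name in used}


# Hypothetical full vocabulary and sample collection.
FULL_VOCABULARY = ["text", "front", "body", "back", "p", "hi", "lg", "l", "note", "table"]
SAMPLES = [
    "<text><body><p>One <hi>two</hi></p></body></text>",
    "<text><body><p>Three</p><lg><l>Four</l></lg></body></text>",
]

simplified = prune(FULL_VOCABULARY, SAMPLES)
for name, kids in simplified.items():
    print(name, kids)
```

A real simplifier would also have to infer ordering, optionality and repetition from the samples, which this sketch ignores.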
See figure 2 for an idea of how these tools interact with the UML document modelling.
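The translation of element and attribute names mentioned above amounts to a renaming pass over the document tree. The sketch below shows the idea in Python rather than XSLT, with hypothetical Spanish-to-English mapping tables; it is not the implementation described in [4].

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping tables for a Spanish-to-English translation.
ELEMENT_NAMES = {"texto": "text", "cuerpo": "body", "nota": "note"}
ATTRIBUTE_NAMES = {"tipo": "type"}


def translate(root, elements, attributes):
    """Rename elements and attributes in place; unmapped names are kept."""
    for node in root.iter():
        node.tag = elements.get(node.tag, node.tag)
        items = list(node.attrib.items())
        node.attrib.clear()
        for key, value in items:
            node.set(attributes.get(key, key), value)
    return root


source = '<texto><cuerpo><p>Hola<nota tipo="popup">n</nota></p></cuerpo></texto>'
tree = translate(ET.fromstring(source), ELEMENT_NAMES, ATTRIBUTE_NAMES)
translated = ET.tostring(tree, encoding="unicode")
print(translated)
```

Because the mapping is a plain table, inverting it gives the translation in the opposite direction, provided the names map one to one.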
The techniques described here can also be used for
modelling metadata markup vocabularies.
Concerning the described set of DTD/Schema design tools, the integration of UML design with example-based automatic simplification and multilingual vocabulary capabilities is expected to be a very useful and practical design aid. However, we experienced some limitations in the use of UML. While commercial non-UML products like XML Spy or TurboXML use custom graphical tree representations to handle XML schemas, with very handy collapsing and navigating capabilities, most general-purpose UML design environments lack these specialized features.
One of the downsides of UML is that it is less friendly when working with the low-level aspects of modelling [11]. For instance, it is easy to order the elements of a sequence in a tree, but it is very tricky to do so in UML.
Although UML proves very useful for modelling document structures of small to medium complexity (metadata applications and simple documents), UML models for medium to large schemas (100 to 400 elements), like those used for complex DL documents, become practically unmanageable (see note 2). The diagrams become overloaded with too many class boxes and lines and end up being unreadable. This problem could be solved, or at least mitigated, by enhancing the interfaces of UML design programs with newer and more powerful display functions. Facilities like intelligent collapsing or hiding of diagram parts or elements, overview maps (see figure 3), zooming, 3-D layouts, partial views, and other browsing capabilities would certainly help to solve the problem.
♣ This work is part of the METASIGN project, and has been supported by the Ministry of Education and Science of Spain through grant number TIN2004-00779.
1 VisualWade is a tool for software development based on UML and extensions. It was developed by our research group, named IWAD (Ingeniería Web y Almacenes de Datos - Web Engineering and
Data-Warehousing), at the University of Alicante. This group also developed the OOH Method (for more information see
2 The DTD used by the Miguel de Cervantes DL for its literary documents contains 139 different elements. The “teixlite” DTD, a simple and widely used XML-TEI DTD, contains 144 elements.
[1] Allen, T., Maler, E., and Walsh, N. (1997) DocBook
DTD. Copyright 1992-1997 HaL Computer
Systems, Inc., O’Reilly & Associates, Inc., Fujitsu Software Corporation, and ArborText, Inc.
[2] Bia, A., Carrasco, R. (2001), Automatic DTD
Simplification by Examples. In ACH/ALLC 2001. The Association for Computers and the Humanities,
The Association for Literary and Linguistic
Computing, The 2001 Joint International Conference, pages 7-9, New York University, New York City,
13-17 June 2001.
[3] Bia, A., Carrasco, R., Sanchez, M. (2002) A Markup Simplification Model to Boost Productivity
of XML Documents. In Digital Resources for the
Humanities 2002 Conference (DRH2002), pages 13-16, University of Edinburgh, George Square, Edinburgh EH8 9LD - Scotland - UK, 8-11 September 2002.
[4] Bia, A., Sánchez, M., Déau, R. (2003) Multilingual Markup of Digital Library Texts Using XML, TEI and XSLT. In XML Europe 2003 Conference and
Exposition, page 53, Hilton Metropole Hotel,
London, 5-8 May 2003. IDEAlliance, 100 Daingerfield
Road, Alexandria, VA 22314.
[5] Carlson, D. (2001) Modeling XML Applications with UML. Object Technology Series. Addison-Wesley, 2001.
[6] Hayashi, L., Hatton, J. (2003) Combining UML, XML and Relational Database Technologies. The Best of All Worlds For Robust Linguistic Databases. In Proceedings of the IRCS Workshop on Linguistic Databases, pages 115-124, University of Pennsylvania, Philadelphia, USA, 11-13 December 2001. SIL
[7] Kimber, W.E., Heintz, J. (2000) Using UML to
Define XML Document Types. In Extreme Markup Languages 2000, Montreal, Canada, 15-18 August 2000.
[8] Megginson, D. (1998) Structuring XML Documents. Charles Goldfarb Series. Prentice Hall, 1998.
[9] Sperberg-McQueen, M., Burnard, L., Bauman, S., DeRose, S., and Rahtz, S. (2001). Text Encoding Initiative: The XML Version of the TEI Guidelines. © 2001 TEI Consortium (TEI P4, Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition).
[10] Suleman, H. (2003) Metadata Editing by Schema. In Traugott Koch and Ingeborg Solvberg, editors,
Research and Advanced Technology for Digital
Libraries: 7th European Conference, ECDL 2003, volume 2769, pages 82-87, Trondheim, Norway, August 2003. Springer-Verlag.
[11] Marchal, B. (2004) Design XML vocabularies with UML tools. March 31st, 2004, or


Conference Info



Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.


Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC

  • Keywords: None
  • Language: English
  • Topics: None