XML-Print. Typesetting arbitrary XML documents in high quality

paper, specified "long paper"
Authorship
  1. 1. Lukas Georgieff

    Fachhochschule Worms (Worms University of Applied Sciences)

  2. 2. Marc Wilhelm Küster

    Fachhochschule Worms (Worms University of Applied Sciences)

  3. 3. Thomas Selig

    Fachhochschule Worms (Worms University of Applied Sciences)

  4. 4. Martin Sievers

    Trier Center for Digital Humanities (Kompetenzzentrum für elektronische Erschließungs- und Publikationsverfahren in den Geisteswissenschaften) - Universität Trier

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction/Motivation

Also in the age of electronic publishing print publications often remain the points of reference. While many humanities' projects finally build on XML and in particular TEI 1 and embrace electronic publications, they still want or need to target print publications as one or even the main form of sharing the results of their scholarship with the community. Paper remains the principal scholarly format accepted in many circles and in spite of all activities on long-term digital archiving 23, it remains a central medium to disseminate and conserve patrimonial content over the decades and centuries.However, how can you combine an XML-based workflow with the need to publish high-quality, multilingual print output respecting the often arcane typesetting requirements of scholarly texts in humanities publishing and notably in the realm of critical editions? While there are commercial 45 and free 6 78 products out there that can help for some parts of the job, they are too expensive or too difficult to use in most humanities projects. This was the starting point of the DFG-funded XML-Print, an Open Source project that tackles the typesetting requirements for multilingual critical editions while offering a user-friendly frontend.XML-Print has already been presented to the DH community in two well-received short papers 910 on specific aspects of the project's progress and technical challenges. In this long paper we present the project’s overall results.
Typesetting Features

XML-Print consists of two components:
an interactive graphical user interface (GUI) based on Eclipse, to define the rules for typesetting the XML texts in direct interaction with the source XML text
a command-line typesetting engine, written from bottom up in F#, to actually transform the XML text into pdf for print
Normally scholars will interact with the GUI to map their XML structures on layout rules. The typesetting engine is then transparent to them. However, more automated workflows can integrate the engine directly. In section 5 we present examples for both scenarios. Beyond most standard typesetting functionalities of XSL-FO 1112 XML-Print supports in particular the following three requirements specific for publishing in humanities scholarship:
Columns

A page always consists of different rectangular regions to add header/footer, marginals and the main text. However, this main text is often not limited to a one-column layout, but rather flows in multiple columns. As XSL-FO lacks in this requirement, XSL-FO+ adds a special interface to set-up arbitrary complex column-layouts, even mixed on one page. Each column has its own width and writing direction (left-to-right vs. right-to-left) allowing even “exotic” layouts to be applied within XML-Print.
Cross references

When using cross-references we must use placeholders, not only in the main text, but also for the header and footer of a page, where page number and sectioning information are commonly used, and for apparatus’ entries, where typographic information like referenced line numbers can change during the editorial process.

Fig. 1: Two user-defined reference systems for an edition
XML-Print incorporates a concept of “reference fields” to define structural and typographic elements to be counted. This way the user can even combine these two types, e.g. for having a global line count and a local one being reset at a specific XML structure. In addition the corresponding “title” of an reference field can as well be made available.
Apparatuses

Based on reference fields users can define “referencing schemas” to be used in apparatuses. Predefined typographic elements like a global page, column and line numbering can be combined with user-defined fields, arbitrary fixed strings and special characters, e.g. a non-breakable space.As the concrete output of the schema might depend on previous apparatus’ entries, exceptions with regard to repeated items, e.g. same chapter, can be formulated as well.
General Architecture and Technologies

Standards

Modern functional programming languages reusing established frameworks and libraries allow to build a high-quality, multilingual typesetting engine generating archiving-ready PDF/A-1 13 much faster than even a decade ago. With a comparatively small development team we have been able to meet the project’s major objectives within the funding period.
XML-Print builds on Open Standards, especially on XML as the input language, XSL-FO to express formatting and XSLT to transform data from XML to XSL-FO. The project has extended XSL-FO to cover features such as apparatus and advanced referencing not currently supported by the specification (XSL-FO+).For the typesetting engine the project uses the .NET functional language F# 14, running on the cross-platform Open Source .NET implementation mono. To handle OpenType 15 fonts and generate pdf we have settled on the Open Source library iText 16 that exists for both .NET and Java. We contributed to the library's support for some of advanced OpenType features such as “real” small caps and aspects of bidirectional scripts.
Algorithms

A major advantage of functional programming languages is the lack of mutable variables and states. Algorithms are commonly more compact and easier to parallelize without mutable variables to share across multiple threads. The XML-Print backend for example parallelizes the parsing of certain XSL-FO elements and the rendering of chapters respectively page sequences.
Initially the rendering module was mainly based on the iText library. Now we are replacing all iText algorithms by our self-developed algorithms. They are specialized on the requirements of XML-Print and so are more efficient and easier to extend. We also decided to develop an own line-breaking algorithm. Going beyond the algorithm by Knuth and Plass 17, we take advantage of today’s hardware capacities.
The line breaking algorithm creates a tree structure for all possible line combinations of an entire paragraph. The best path in this tree structure is calculated by taking several characteristics into account, e.g. interword spaces, hyphenations, etc. The final implementation will be parallelized and produce a tree structure with “cross-connected” nodes, i.e. nodes that represent identical paragraph sections are reduced and replaced by additional arcs, thus increasing the efficiency by avoiding redundant line combinations. Figure 1 illustrates the process of line breaking. Each node represents a possible line. The numbers at the arcs represent the processing order. Equal numbers on the same level mean a parallelized section. The bolded path represents the final paragraph, consisting of the nodes Line_1^2, Line_2^5, Line_3^2 and Line_4^4.

Fig. 2: Line breaking algorithm
Graphical User Interface

XML-Print addresses beginners as well as expert users. For the latter the GUI has to offer enough details while the former should not be overwhelmed by too many information at first. To achieve this goal XML-Print categorizes functionality and provides a basic and an expert layer.

Fig. 3: GUI for XML-Print
We face, however, one inherent problem. To guarantee a high-quality output the typesetting incorporates a rich repertoire of typesetting logic and features which the user expects to appear somewhere in the graphical user interface. It is not always possible to shield users from these inherent complexities, while mapping all possible options onto GUI elements.
Apart from defining formats and declaring "mappings" between XML elements and corresponding formats, the GUI offers several possibilities to modify aspects of typesetting, from preprocessing the XML source to altering the PDF output format, from configuring the page size to influencing hyphenation.
Use Case Examples

Edition “Kurt Schwitters: Wie Kritik zu Kunst wird”

During the starting phase of the XML-Print project, staff members of the editorial project “Kurt Schwitters: Wie Kritik zu Kunst wird” 18 already used recent version of the software to proofread their XML transcriptions. At a later stage, formats and mappings for critical and commentary apparatus were added.
Dictionaries: The “Trierer Wörterbuchnetz”

The “Deutsche Wörterbuch von Jacob und Wilhelm Grimm” is a digitized version of the leading German Dictionary with more than 300.000 entries stored as XML inside a database. The pdf is a byproduct, creating pdf files on the fly is the only effective approach. The XML-Print typesetting engine was wrapped via a simple webservice interface, allowing remote access. Decoupling typesetting engine and GUI improves on flexibility and scalability, as it adds the options of cluster-processing and batch processing to the standard interactive processing.
Outlook

To ensure the long-term viability of the project beyond the end of funding in May 2014 XML-Print integrates with TextGrid 19 to complement the current stand-alone mode. In parallel we build up a community on SourceForge http://sourceforge.net/projects/xml-print/, reaching even now more than 50 monthly downloads on average.Further development of XML-Print is also intimately linked to the needs of its customers, especially the critical editions using it, evolving in response to specific requirements.XML-Print is there to play a key role in creating, sharing and preserving our digital and non-digital textual cultural heritage and humanities digital resources on one of the most durable media yet known to humankind – paper.
References

1. Burnard, Lou, and Syd Bauman. (2007). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative.
2. Hedges, M, A Jordanous, S Dunn, C Roueche, Marc Wilhelm Küster, Thomas Selig, M Bittorf, and W Artes. (2012). New Models for Collaborative Textual Scholarship. In 6th IEEE International Conference on Digital Ecosystems Technologies (DEST), 1–6. doi:10.1109/DEST.2012.6227933.
3. Neuroth, Heike. (2009). Nestor Handbuch.
4. Global Publishing Solution. (2013). 3B2 (Aka APP, Arbortext Print Publisher). 2013 ed. Accessed November 29. www.3b2.com/3b2/.
5. Adobe- 2013. Desktop Publishing (DTP), Digital Publishing | Adobe InDesign CC, www.adobe.com/lu_de/products/indesign.html, accessed: 29-Oct-2013.
6. Knuth, D E. 1986. The TEXbook (Computers & Typesetting Volume a).
7. Apache Foundation. (2013). Apache FOP - a Print Formatter Driven by XSL Formatting Objects and an Output Independent Formatter. Xmlgraphics.Apache.org. xmlgraphics.apache.org/fop/.
8. Lamport, Leslie. (1986). LaTeX: User's Guide & Reference Manual. Addison-Wesley 1986.
9. Burch, Thomas, Martin Sievers, Marc Wilhelm Küster, Claudine Moulin, Roland Schwarz, and Yu Gan. (2012). XML-Print: an Ergonomic Typesetting System for Complex Text Structures. In Abstracts of Digital Humanities 2012, 375–379. Hamburg. www.dh2012.uni-hamburg.de/wp-content/uploads/2012/07/HamburgUP_dh2012_BoA.pdf.
10. Küster, Marc Wilhelm, Thomas Selig, Lukas Georgieff, Martin Sievers, and M Bittorf. (2013). XML-Print: Addressing Challenges for Scholarly Typesetting. In Abstracts for Digital Humanities 2013, 269–272. Lincoln, NE. dh2013.unl.edu/abstracts/files/downloads/DH2013_conference_abstracts_print.pdf.
11. Berglund, Anders, ed. (2006). Extensible Stylesheet Language (XSL) Version 1.1. W3C. www.w3.org/TR/2006/REC-xsl11-20061205/.
12. Pawson, D. (2002). XSL-FO: Making XML Look Good in Print.
13. ISO, and ISO TC 171 SC 2. (2005). ISO 19005-1:2005 Document Management -- Electronic Document File Format for Long-Term Preservation -- Part 1: Use of PDF 1.4 (PDF/a-1). 19005. 1st ed. ISO.
14. Smith, Chris. 82012). Programming F# 3.0. 2nd ed. O'Reilly Media. www.amazon.de/Programming-F-3-0-Chris-Smith/dp/1449320295.
15. Microsoft, Adobe. 2009. “Microsoft Typography - OpenType Specification.” Microsoft.com. 21.09.2009. www.microsoft.com/typography/otspec/default.htm. Accessed 2013-10-29.
16. “iText Core | iText Software.” 2013. “iText Core | iText Software.” Itextpdf.com. itextpdf.com/product/itext. Accessed: 2013-10-29.
17. Stanford University. Computer Science Dept, D E Knuth, and M F Plass. (1980). Breaking Paragraphs Into Lines.
18. Kocher, Ursula, and Isabel Schultz. (2013). Wie Kritik Zu Kunst Wird: Kurt Schwitters' Strategien Der Produktiven Rezeption. Avl.Uni-Wuppertal.De. www.avl.uni-wuppertal.de/forschung/projekte/wie-kritik-zu-kunst-wird.html.
19. TextGrid Konsortium. (2010). TextGrid: Über TextGrid.www.textgrid.de/ueber-textgrid.html.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO