From lighthouse to the moon: a guiding light to the corpus of Jules Verne

Nicholas John Hayward

Authorship

1. Nicholas John Hayward

Center for Textual Studies and Digital Humanities - Loyola University, Chicago

Original URL

https://github.com/ADHO/dh2015/blob/master/xml/HAYWARD_Nicholas_John_From_lighthouse_to_the_moon__a_gu.xml

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

From lighthouse to the moon: a guiding light to the corpus of Jules Verne

Hayward
Nicholas John

Center for Textual Studies and Digital Humanities, Loyola University Chicago, United States of America; Department of Computer Science, Loyola University Chicago, United States of America
ancientlives@gmail.com

2014-12-19T13:50:00Z

Paul Arthur, University of Western Sidney

Locked Bag 1797
Penrith NSW 2751
Australia
Paul Arthur

Converted from a Word document

DHConvalidator

Paper

Long Paper

virginia woolf
jules verne
framework
textual studies
corpus

corpora and corpus activities
image processing
information retrieval
interface and user experience design
literary studies
publishing and delivery systems
digitisation
resource creation
and discovery
scholarly editing
software design and development
text analysis
french studies
information architecture
content analysis
bibliographic methods / textual studies
programming
translation studies
data mining / text mining
English

The Woolf Online project,
1 which recently completed its second phase of development at Loyola University Chicago’s Center for Textual Studies and Digital Humanities, in collaboration with the Department of Computer Science, sought to address the following fundamental questions about the nature and development of literature:

• How does a literary text come into being?
• What kinds of influence are at work upon the writer during the process of initial composition, and thereafter?
The Woolf Online project sought to investigate various ways in which different recoverable histories of a particular text could be used to illuminate the process of its composition. By recording the history of a particular text—its cultural, political, and autobiographical contexts and their interaction—visually we began to answer some of the following important questions:
• How is textual history related to other histories of a text?
• What use does literary criticism make of textual and contextual histories?
Publication of Digital Scholarly Editions
As part of the development of the Woolf Online project, we developed an extensible development and publication framework called Mojulem,
2 for editing, publication, and visualisation of digital scholarly editions. We are now continuing this development with the Verne Digital Corpus,
3 which focuses upon the work of Jules Verne, including original French language editions and their myriad, often questionable, English language translations. Mojulem allows us to build on the concept of ‘knowledge sites’, as suggested by Peter Shillingsburg,
4 supplementing a core publication framework with modules/plugins such as OCR, editors, and image viewers.

Mojulem also enables us to host multiple projects within one installed framework, thereby enabling cross-project research, where applicable, and the option to aggregate specified data. Development of Mojulem, with the Woolf Online project and Verne Digital Corpus as examples of the current ongoing working environments, initially followed the need for four underlying core structures. These structures include CorPix, CorTex, CorCode, and CorForm, which are detailed as follows.
CorPix
Manuscripts and printed texts materially unite the iconic and lexical, the autographic and allographic,
5 whereas all digital representations separate these constituent elements into images and transcriptions. With Mojulem projects, the common default display reunites the image and transcription by mapping the one to the other at the pixel level. CorPix software currently includes eHinman,
6 Transparent,
7 TransparentOCR, Magnify, and Zoom. For example, pixel-level positioning and coordinate fixing is an inherent feature of both Transparent and TransparentOCR, within both editor and visualisation tools. eHinman is a digital adaptation of the original Hinman collator, and enables fade from the image of a page from one copy to another, thereby enabling a visual collation of multiple copies. Transparent is used within both the visualisation and editing stages of a project’s development, enabling editor and user alike to view the image as the primary entry consideration for the project.

CorTex
The CorTex is the stable resource containing the merged or compacted plain text transcriptions of the variant expressions of a work. It stores all information about text and variations, ready to be extracted for display of variation amongst versions; it is not necessary to recompute them. The CorTex is the entity to which all standoff properties (markup, annotations, links, etc.) point and on whose stability the system depends. It is the source of each version’s text and variation from other texts. The stability and endurance of the CorTex is protected by multiplying duplicate copies locked with a digital signature, which verifies for each user that a CorTex copy is viable. Analysis of the CorTex variable forms provides statistical feedback to guide the production of a conflated text—for example, with the English language translations of a given Verne edition. Whilst these statistical results are no guarantee of an ultimately correct translation, they offer a conflated text with the highest viable agreement amongst the provided collated texts. Textual disagreements are currently resolved by assigning probability values, a higher value defining a greater probability of accuracy and agreement amongst the collated texts. Using such probability results, we are currently able to filter problematic passages in each translation to conflate a text with the highest probability of agreement amongst the translations per edition.
These results can then be provided for further research and assessment, and act as a suitable starting guide for further analysis of the conflated text, and translation in the example of Verne’s texts.
CorCode
CorCode is the add-on value of analysis, argument, and explanation. Mojulem stores markup separately, as standoff properties, applying it as the user invokes it for the rendering of a specific item’s image or text within a given visualisation, such as a transparent view of a page of the Initial Holograph Draft of ‘To the Lighthouse’.
8 To do this, Mojulem includes an editor that saves text and encoding separately, and filters for converting legacy, code-embedded transcriptions, including TEI-encoded documents, into separate forms with markup analysed into properties, and filters for reversing this process.

CorForm
A CorForm
9 is a CSS stylesheet, containing special formatting rules, used to transform the overlapping properties of the CorCode into HTML. Each CorCode has a default CorForm, but other CorForms can be used in combination or as alternatives. Since a CorTex may have many CorCodes, and each CorCode many CorForms, structuring or formatting of the text can be attained by specifying some combination of already available resources, or by supplying new ones.

The CorPix, CorTex, CorCode, and CorForm are aggregated for a project within the Mojulem framework. Each such item is identified by a unique key, which is used as an index into the repository or database.
In addition to the initial four cores, identified above, we have begun development of CorAssess, allowing effective assessment and analysis of Cor data for the Verne translations and conflated English language texts.
CorAssess
CorAssess works in tandem with the CorTex to provide analysis of statistical and end results relative to the conflated text output. CorAssess allows us to visualise where text has been conflated based upon resolved disagreements, the points of disagreement and resultant probabilities between collated texts per edition, and variance between collated texts, and also visualise alternative resolution patterns relative to variation distance for given points of disagreements in the conflated text. With Verne texts, for example, we will be able to visualise conflation decisions and offer alternatives for given decisions and disagreements, where applicable, in our conflated English language translations.
Why Verne?
After Woolf Online, we chose to focus upon the corpus of Jules Verne, including original French language editions and English language translations. We have begun collecting, collating, and preparing digitised copies of as many digitised editions as extant online. We have also been digitising early editions to provide an ever-growing dataset of Verne material.
The nature of early English language translations of Verne’s editions is a continuing source of frustration for those interested in the works of Jules Verne. His early categorisation as predominantly a children’s author in English language countries, unlike the publishing by Hetzel, coupled with early restricted access to original French language editions, simply compounded the issue.
The corpus of Jules Verne offers an interesting opportunity for literary and contextual analysis coupled with data processing and automated analysis. The myriad existing digitised English language translations, including US and British variant editions, often more prevalent than their counterpart—the original French editions in our current digitised corpus—allows us to examine the development of those texts by comparing agreements, disagreements, omissions, and continuing revisions in said translations since a novel’s first edition. We are hoping to use this analysis to filter the noise of years of collective translations to collate a unified English translation for each French language edition.
We will then be offering a comparison of French language edition against a filtered, collated English language edition. This will allow further consideration of the requisite merits of the English language translation directly juxtaposed to the original French language edition.
Conclusion
The development and combination of the initial four cores, CorPix, CorTex, CorCode, and CorForm, within the modular and adaptable framework Mojulem, allowed the second phase of the Woolf Online project to begin to approach the fundamental questions about the nature and development of literature, as briefly outlined in the introduction. With the addition of CorAssess, we are now beginning to address additional issues with the publication, transmission, and development of texts. We are also testing, and proving, the viability of Mojulem beyond the Woolf Online project.
The corpus of Jules Verne provides a particularly fascinating opportunity to test these cores and provide a resultant conflated, English language translation per extant French language edition.
This paper will briefly introduce the Mojulem framework and its initial four cores, grounded in the example of the Woolf Online project, and detail the ongoing developments to augment this work with the above new work on the corpus of Jules Verne, and the ongoing Verne Digital Corpus.
Notes
1. Project site is available at
http
://
www
.woolfonline
.
com.

2. Our current test framework for the Woolf Online project, as of 3 November 2014, can be viewed at
http://
dhdev
.
ctsdh
.
luc
.
edu
/
projects
/
edfu
/.

3. Initial development project site is available at
http
://
www
.scifidocs
.
com
/.

4. P. L. Shillingsburg,
From Gutenberg to Google: Electronic Representations of Literary Texts (Cambridge University Press, 2006).

5. Nelson Goodman noted this distinction to separate art forms with a unique authentic form—for example, painting and sculpture—from art forms that are performative with multiple copies, including writing and music.
6. An earlier stand-alone example can be viewed at
http
://
dhdev
.
ctsdh
.
luc
.
edu
/testing
/
imaging
/
ehinman
-
dep
/
v
6/.

7. A test example page of the Initial Holograph Draft can be viewed at
http
://
dhdev
.
ctsdh
.
luc
.
edu
/
projects
/
edfu
/?
node
=
content
/image
/
gallery
&
project
=1&
parent
=2&
taxa
=6&
content
=353&
pos
=21 and transcription =
http
://
dhdev
.
ctsdh
.
luc
.
edu
/
projects
/
edfu
/?
node
=
content
/
text
/
transcriptions
&
project
=1&
parent
=2&
taxa
=6&
content
=5136&pos
=21.

8. For example, Image =
http
://
dhdev
.
ctsdh
.
luc
.
edu
/
projects
/
edfu
/?
node
=
content
/
image
/
gallery
&
project
=1&
parent
=6&
taxa
=24&
content
=385&
pos=49 and transcription =
http
://
dhdev
.ctsdh
.
luc
.
edu
/
projects
/
edfu
/?
node
=
content
/
text
/
transcriptions
&
project
=1&
parent
=2&
taxa
=6&
content
=5164&
pos
=49.

9. An earlier TEI-derived stand-alone example of this concept, using material from the Malory project, can be viewed at
http://
dhdev
.
ctsdh
.
luc
.
edu
/
testing
/
tei
/
teiparser
_
full
/.

Full text license: CC BY 4.0

From lighthouse to the moon: a guiding light to the corpus of Jules Verne

1. Nicholas John Hayward

ADHO - 2015

"Global Digital Humanities"