Processing tangles in the "Frankenstein Variorum"

Long paper
  1. Elisa Eileen Beshero-Bondar

    Penn State Erie, The Behrend College, United States of America

  2. Mia Borgia

    Penn State Erie, The Behrend College, United States of America

  3. Jacqueline Chan

    Penn State Erie, The Behrend College, United States of America

  4. Raffaele Viglianti

    Maryland Institute for Technology in the Humanities, University of Maryland, United States of America


Computer-aided collation is like a power loom that inevitably tangles up threads caught in the machinery. Automating a tedious process magnifies the complexity of error-correction, calling for new tooling to help us smooth the weaving process. For the DH 2022 conference we seek to share our efforts in the
Frankenstein Variorum project (hereafter referred to as
FV) to automate corrections to machine-assisted collation and thereby to refine our collation pre-processing and post-processing algorithms. 

FV began during the recent 1818-2018 bicentennial celebrating the first publication of Mary Shelley's novel and exists now as a partial working prototype. We are constructing a digital variorum edition that highlights alterations to the novel
Frankenstein over five key moments from its first drafting in 1816 to its author’s final revisions by 1831. Source "threads" for the
FV collation "weave" include two well-established digital editions: the Pennsylvania Electronic Edition, an early hypertext edition produced at the University of Pennsylvania in the mid-1990s by Stuart Curran and Jack Lynch, and the Shelley-Godwin Archive's edition of the manuscript notebooks (hereafter S-GA), published in 2013 by the University of Maryland. The FV interface relies on a backend processing pipeline that uses the software collateX to collate textual data, including markup, from each edition. Finalizing our work requires iterative efforts to refine our collation pre- and post-processing algorithms.

The Gothenburg Model, conceptualized in 2009 by the developers of collateX and Juxta, organizes automated collation into a workflow of distinct, iterative stages: tokenizing and normalizing the texts to be collated, determining at which points the texts align and diverge, and visualizing the results of collation. Our current efforts on the
FV involve post-processing the software-generated collation data. We seek to automate the identification of common patterns of misalignment and to produce a more accurate rendering of collation units mapped to the S-GA edition.
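The Gothenburg stages can be illustrated with a minimal Python sketch; here difflib stands in for collateX's alignment algorithm, and the two witnesses are invented sample readings, not FV data:

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Stage 1: split a witness into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(token):
    """Stage 2: reduce a token to a form used only for comparison."""
    return token.lower()

witness_a = "I beheld the wretch - the miserable monster"
witness_b = "I beheld the wretch , the miserable monster"

norm_a = [normalize(t) for t in tokenize(witness_a)]
norm_b = [normalize(t) for t in tokenize(witness_b)]

# Stage 3: alignment -- difflib is a stand-in for collateX here.
opcodes = SequenceMatcher(None, norm_a, norm_b).get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, norm_a[i1:i2], "|", norm_b[j1:j2])
```

The opcode sequence (equal, replace, equal) marks where the witnesses agree and where they diverge, which is the information the visualization stage consumes.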

At ADHO 2022, we wish to discuss two efforts:
1. improving collation alignment, and
2. improving our visualization of the S-GA edition within our variorum.

1. Improving collation alignment: We are applying XSLT to seek out patterns of “spurious alignment” generated by our collation software, collateX. The collation algorithm tends to err optimistically, seeking alignment of completely divergent passages on single words like “and” or “the.” One solution to this is to remove these words entirely during the pre-processing stage, but we rejected this approach because we consider the changes from “and” to “or”, or “the” to “an” to be significant. Since we generate collation output as critical apparatus markup in TEI-conformant XML, we are opting to locate patterns of divergence using XSLT as a post-processing stage.
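The pattern we search for can be sketched schematically in Python (the project itself uses XSLT; the apparatus snippet and the stopword list below are simplified inventions, not FV's actual collation output):

```python
import xml.etree.ElementTree as ET

FUNCTION_WORDS = {"and", "or", "the", "a", "an", "of", "to"}

# Simplified, namespace-free stand-in for critical apparatus markup:
# text between <app> elements is where all witnesses agree.
sample = """<p>
  <app><rdg wit="#f1818">He sprang forward</rdg><rdg wit="#f1831">It leapt up</rdg></app>
  and
  <app><rdg wit="#f1818">rushed down the stairs</rdg><rdg wit="#f1831">vanished into the night</rdg></app>
</p>"""

root = ET.fromstring(sample)
apps = list(root.iter("app"))

# Flag a single function word wedged between two divergent <app> blocks:
# a likely "spurious alignment" anchoring otherwise unrelated passages.
suspects = []
for i, app in enumerate(apps[:-1]):
    between = (app.tail or "").split()
    if len(between) == 1 and between[0].lower() in FUNCTION_WORDS:
        suspects.append(between[0])

print(suspects)  # -> ['and']
```

Once flagged, such anchors can be dissolved so the surrounding divergent passages merge into a single, more honest unit of variance.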

2. Improving our visualization of the S-GA edition:  Representing the
S-GA edition accurately is a challenge because we needed to
re-sequence its encoding to prepare it for tokenization and normalization in our collation algorithm. In the
S-GA's TEI markup, marginalia in the manuscript notebook pages were encoded at the ends of each page file, and they were given attributes that indicate their insertion points in the running flow of the text. It was necessary to re-sequence the order of text on the page to move the marginalia from the end of each file to its insertion point so that we could prepare a continuous sequence of text—the thread of the
S-GA—to compare with the threads of the other four editions. Resequencing the
S-GA meant following a clearly signaled trail of ids and pointers in the original encoding. While it would indeed be convenient to display the re-sequenced TEI in our edition viewer, we seek instead to
map our collation data back onto the original source document by pulling the source document’s code into our reading interface. We seek to apply the information we learned from the collation to point back to specific passages in the source document, identifying them by line and string position in the original files in order to pull those particular passages into our interface viewer. This is an ambitious challenge involving stand-off pointers that require counting characters in the source document for precise identification of a variant passage in the
S-GA’s original encoding. 
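The re-sequencing step can be illustrated schematically; the element names below are invented for illustration, and the S-GA's actual TEI markup is considerably richer than this:

```python
import xml.etree.ElementTree as ET

# Schematic stand-in for the S-GA pattern: marginal additions sit at the
# end of the page file, and an empty <ptr target="#..."/> marks each one's
# insertion point in the running text.
page = ET.fromstring(
    '<page>'
    '<line>the creature <ptr target="#m1"/> opened his eyes</line>'
    '<margin id="m1">dull yellow</margin>'
    '</page>'
)

margins = {m.get("id"): m for m in page.findall("margin")}

def resequence(line, margins):
    """Splice each pointed-to marginal passage into the running text."""
    parts = [line.text or ""]
    for child in line:
        if child.tag == "ptr":
            target = child.get("target").lstrip("#")
            parts.append(margins[target].text)
        parts.append(child.tail or "")
    return "".join(parts)

flowed = resequence(page.find("line"), margins)
print(flowed)  # -> "the creature dull yellow opened his eyes"
```

The continuous string this produces is the "thread" of the S-GA that enters the collation alongside the other editions.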

We need to improve our stand-off pointing mechanism to the
S-GA edition. Our interface needs to pinpoint in the original
S-GA files the precise location of the variant passages identified by the collation process. To improve this, we are revisiting the process of re-sequencing the document in the first place. We are retracing our steps in the early XSLT written for the project to set clearer markers to identify the locations of marginalia passages in the original
S-GA files. Those markers need to be delivered to the collation output XML, to assist in calculating the XPath and string locations of variant passages in the source S-GA files.
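At its core, the stand-off calculation reduces to locating a passage by line and character offset in a source file. A minimal sketch, with invented sample text rather than S-GA data:

```python
def locate(source, passage):
    """Return (line_number, offset_in_line) of the first occurrence of
    passage in source, for stand-off pointers back into the original file."""
    idx = source.index(passage)                   # absolute character offset
    line = source.count("\n", 0, idx) + 1         # 1-based line number
    col = idx - (source.rfind("\n", 0, idx) + 1)  # 0-based offset in line
    return line, col

source = "one fatal night\nI beheld the wretch\nwhom I had created\n"
print(locate(source, "the wretch"))  # -> (2, 9)
```

The real task is harder than this sketch because the passage must be found in the original, unre-sequenced encoding, which is why clear markers carried through from the re-sequencing stage matter.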

At ADHO 2022 we will share our efforts in both of these areas, in the hope of encouraging lively discussion in the textual scholarship community. If we are to smooth the tangled webs of collation, perhaps we need to be able to follow our own complex algorithms backwards to the threading of the machine.


Conference Info


ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19


Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO