University of Texas, Austin
University of Victoria
University of Victoria
Trinity College Dublin
Introduction: This paper presents the results of the Modernist Versions Project’s (MVP) survey of existing tools for digital collation, comparison, and versioning. The MVP’s primary mission is to enable interpretations of modernist texts that are difficult without computational approaches. We understand versioning as the process through which scholars examine multiple witnesses of a text in order to gain critical insight into its creation and transmission [1] and then make those witnesses available for critical engagement with the work. Collation is the act of identifying different versions of a text and noting changes between texts. Versioning is the editorial practice of presenting a critical apparatus for those changes. To this end, the MVP requires tools that: (1) identify variants in TXT and XML files, (2) export those results in a format or formats conducive to visualization, (3) visualize them in ways that allow readers to identify critically meaningful variations, and (4) aid in the visual presentation of versions.
The MVP surveyed and assessed an array of tools created specifically for aiding scholars in collating texts, versioning them, and visualizing changes between them. These tools include: (1) JuxtaCommons, (2) DV Coll, (3) TEI Comparator, (4) Text::TEI::Collate, (5) Collate2, (6) TUSTEP, (7) TXSTEP, (8) CollateX, (9) SimpleTCT, (10) Versioning Machine, and (11) HRIT Tools (nMerge). We also examined version control systems such as Git and Subversion in order to better understand how they might inform our understanding of collation in textual scholarship. This paper presents the methodologies of the survey and assessment as well as the MVP’s initial findings.
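Requirements (1) and (2) above, identifying variants and exporting them in a form suitable for visualization, can be illustrated with a minimal sketch. It uses Python's standard difflib module rather than any of the surveyed tools, and it should not be read as how those tools work internally. The function name collate, the record fields, and the two sample readings are all assumptions invented for illustration.

```python
import difflib
import json


def collate(witness_a: str, witness_b: str) -> list[dict]:
    """Return word-level variants between two plain-text witnesses."""
    # Naive whitespace tokenization; real collation tools normalize far more.
    tokens_a = witness_a.split()
    tokens_b = witness_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    variants = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # keep only the places where the witnesses differ
        variants.append({
            "type": tag,            # "replace", "insert", or "delete"
            "position": a_start,    # token offset in witness A
            "a": " ".join(tokens_a[a_start:a_end]),
            "b": " ".join(tokens_b[b_start:b_end]),
        })
    return variants


# Invented sample readings, not quotations from any edition.
serial = "the old harbour was quiet in the evening light"
book = "the old harbor was very quiet in the evening light"

variants = collate(serial, book)
# JSON is one possible export format that a visualization layer could consume.
print(json.dumps(variants, indent=2))
```

A list of variant records like this one separates the detection of differences from their presentation, which is the division of labor the requirements describe.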
Problem: Part of the MVP’s mandate is to find new ways of harnessing computers to find differences between witnesses and to then identify the differences that make a difference (Bateson). In modernist studies, the most famous example of computer-assisted collation is Hans Walter Gabler’s use of the Tübingen System of Text Processing tools (TUSTEP) to collate and print James Joyce’s Ulysses in the 1970s and 1980s. Yet some constraints, such as those identified by Wilhelm Ott in 2000, remain in the field of textual scholarship, especially where collation and versioning applications are concerned. Ott writes, “scholars whose profession is not computing but a more or less specialized field in the humanities have to be provided with tools that allow them to solve any problems that occur without having to write programs themselves. This leads to a concept of relatively independent programs for the elementary functions of text handling” (97). Indeed, the programs available for collation work have proliferated since 2000, including additions to TUSTEP (TXSTEP) as well as the newest web-based collation program, JuxtaCommons.
Accordingly, the MVP has reviewed tools currently available for collation work in order to provide an overview of the field and to identify software that might be further developed in order to create a collating, versioning, and visualization environment. Most of these tools were developed for specific projects, and thus do what they were designed to do quite well. Our question is whether we can modify existing tools to fit the needs of our project or whether a suite of collation and visualization tools needs to be developed from scratch. This survey is thus an attempt to chart the tools that may be useful for the kinds of collation and versioning workflows our team is developing specifically for modernist studies, so we can then test methods based on previous tools and envision future developments to meet emerging needs. Our initial research with Versioning Machine and JuxtaCommons suggests that there is potential for bringing tools together to create a more robust versioning system. Tools such as the Versioning Machine work well if one is working with TEI P5 documents; however, we are equally interested in developing workflows that do not rely upon the TEI, or do not require substantive markup. Finally, we are examining whether version control systems such as Git present viable alternatives to the versioning methods now prevalent in textual studies.
Method: Our method adapts the rubric Hans Walter Gabler devised for surveying collation and versioning tools in his 2008 white paper, “Remarks on Collation.” We first assessed the code and algorithms underlying each tool on our list, and we then tested each tool using a literary text. In this particular case, we used two text files and two TEI XML files from chapter three of Joseph Conrad’s Nostromo, which we have in OCR-corrected and TEI-Lite marked-up states from the 1904 serial edition and the 1904 first book edition. During each test, we used a tool assessment rubric (available upon request) to maintain consistent results across each instance. All tests were accompanied by research logs for additional commentary and observations made by our research team.
Preliminary Findings: Our preliminary findings suggest that:
• Many existing collation tools are anchored in obsolete technologies (e.g., TUSTEP, originally written in Fortran, has undergone major upgrades but still relies on its own scripting language and front end to operate; DV-Coll was written for DOS, though it has been updated for use with Windows 7).
• Many of the tools present accessibility obstacles because they are desktop-only entities, making large-scale collaborative work on shared materials difficult and prone to duplication and/or loss of work. Of the tools that offer web-based options, JuxtaCommons is the most robust.
• The “commons” approach to scholarly collaboration is among the most promising directions for future development. We suggest that the metaphor of the commons is useful for tool development in versioning and collation as well as for building scholarly community (e.g., MLA Commons). In this regard, we note the particular usefulness of the JuxtaCommons collation tool and of the Modernist Commons environment for preparing digital texts for processing. The latter, under development by Editing Modernism in Canada, is currently being extended to integrate collation and versioning functions.
• Version control alternatives to traditional textual studies-based versioning and visualization present an exciting set of possibilities. Although the use of Git, GitHub, and Gist for collating, versioning, and visualizing literary texts has not gained much traction, we see great potential in this line of inquiry.
• Developers and projects should have APIs in mind when designing tools for agility and robustness across time. Web-based frameworks allow for this type of collaborative development, and we are pleased to see that Juxta has released a web service API for its users.
• During tool development, greater attention must be given to extensibility, interoperability, and flexibility of functionality. Because many projects are purpose-built, they are often difficult to adapt to non-native corpora and divergent workflows.
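The version control finding above can be made concrete with a small sketch. Systems such as Git typically report changes as line-based unified diffs; the example below produces a diff in that format with Python's standard difflib module, not with Git itself. The file names and the sample lines are invented for illustration.

```python
import difflib

# Invented sample lines standing in for two witnesses of the same passage.
serial_lines = [
    "The harbour was quiet.",
    "No ship had come in for a week.",
]
book_lines = [
    "The harbor was quiet.",
    "No ship had come in for a week.",
]

# unified_diff yields the line-oriented format familiar from version control.
diff = list(difflib.unified_diff(
    serial_lines,
    book_lines,
    fromfile="serial.txt",  # hypothetical file names
    tofile="book.txt",
    lineterm="",
))
print("\n".join(diff))

# Skip the two file-header lines, then keep only removed and added lines.
changed = [line for line in diff[2:] if line[0] in "+-"]
```

In this sketch a one-word spelling variant marks the whole line as changed. That granularity suits source code, but prose collation usually needs word-level or finer comparison, which is one reason the fit between version control and textual scholarship needs testing.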
References
Bateson, G. (1972). Steps to an Ecology of Mind. New York: Ballantine.
Gabler, H. (2008). “Remarks on Collation.” Academia.edu. http://www.academia.edu/167070/_Remarks_on_Collation_ (accessed 14 Mar. 2013).
Gabler, H. (2000). “Towards an Electronic Edition of James Joyce’s Ulysses.” Literary and Linguistic Computing 15(1): 115-20.
McGann, J. J. (1991). The Textual Condition. Princeton, NJ: Princeton University Press.
Ott, W. (2000). “Strategies and Tools for Textual Scholarship: The Tübingen System of Text Processing Programs (TUSTEP).” Literary and Linguistic Computing 15(1): 93-108.
Reiman, D. H. (1987). “Versioning.” Romantic Texts and Contexts. Columbia: University of Missouri. 167-179.
Notes
1. For a definition of the “social” life of texts, see Jerome McGann’s The Textual Condition. For a definition of “versioning,” see Donald Reiman, who writes, “In those cases where the basic problem facing the scholar or reader involves two or more radically differing versions that exhibit quite distinct ideologies, aesthetic perspectives, or rhetorical strategies, the alternative to ‘editing,’ as conventionally understood, may be what I call ‘versioning’” (169).
Hosted at University of Nebraska–Lincoln
Lincoln, Nebraska, United States
July 16, 2013 - July 19, 2013
Conference website: http://dh2013.unl.edu/
Series: ADHO (8)
Organizers: ADHO