Brown University
Data Curation Nightmare: Migrating VM/CMS to
GNU/Linux in 2 weeks
Data curation has been described as "activities [that] enable
data discovery and retrieval, maintain quality, add value, and
provide for re-use [of digital assets] over time" [1]. Data migration
is a major and necessary (but not sufficient) part of long-term
data curation.
At the time of the project described, the authors had worked
at Brown University in the computing arena since 1985 and
1979 respectively. Up until the mid-1990s this work was almost
exclusively on or related to the IBM VM mainframe system,
but from the mid-1990s onward work and computer usage
gradually migrated off the mainframe to Unix systems, GNU/Linux
systems, and desktop Macintosh computers. By early
2009 almost no one at Brown was using the mainframe.
However, the authors were each responsible for a significant
quantity of data and programs on the mainframe, including
various system utilities, application programs each had written,
and almost all of the historical digital assets of the Women
Writers Project. In 2009, Brown decided to retire the mainframe.
Thus the authors migrated a large quantity of digital data,
comprising widely ranging kinds of files, from a legacy IBM VM
system to a modern GNU/Linux system. This was a significant
undertaking, not just because of the quantity of data (~3 GiB),
but because a VM system is *very* different from a GNU/Linux
system. Differences include:
– files are named with a different naming convention
– files are stored in one of several internal formats, none of
which is like that on a GNU/Linux system
– files are grouped in a flat, rather than hierarchical, system
– the underlying character encoding is not only not Unicode, it's
not ASCII -- it is EBCDIC, which is quite different
– some files have been automatically "packed" or compressed
by the system
– many files had been put into compressed archives (not unlike
ZIP files) that could not be read on a GNU/Linux system
– many files were stored not on the mainframe itself, but on
tapes that we could not read without the mainframe
– some bits of metadata on the mainframe have no counterpart
in GNU/Linux
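Several of these differences can be made concrete with a short sketch. The following is not the authors' code; it is a minimal illustration, assuming a text file stored as fixed-length 80-byte records with no newline delimiters and encoded in EBCDIC code page 037 (Python's standard `cp037` codec). The code page and record length actually used on the Brown system are not stated in this abstract, and the function names are hypothetical.

```python
def cms_to_posix_name(cms_name: str) -> str:
    """Map a CMS 'filename filetype [filemode]' identifier to 'filename.filetype'.

    The filemode (e.g. 'A1') has no direct GNU/Linux counterpart, so this
    sketch simply drops it; a real migration would record it elsewhere.
    """
    parts = cms_name.split()
    if len(parts) < 2:
        raise ValueError(f"expected 'fn ft [fm]', got {cms_name!r}")
    filename, filetype = parts[0], parts[1]
    return f"{filename}.{filetype}"


def decode_fixed_records(raw: bytes, lrecl: int = 80,
                         encoding: str = "cp037") -> str:
    """Turn fixed-length EBCDIC records into newline-terminated text.

    Assumptions (not taken from the paper): every record is exactly
    `lrecl` bytes, padded with EBCDIC blanks, and `encoding` is the
    right EBCDIC code page for the data.
    """
    if len(raw) % lrecl != 0:
        raise ValueError("data length is not a multiple of the record length")
    lines = []
    for start in range(0, len(raw), lrecl):
        record = raw[start:start + lrecl].decode(encoding)
        # Trailing blanks are padding in a fixed-length record.
        lines.append(record.rstrip(" "))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Build two fake 80-byte EBCDIC records to stand in for mainframe data.
    raw = ("HELLO".ljust(80) + "WORLD".ljust(80)).encode("cp037")
    print(cms_to_posix_name("PROFILE EXEC A1"))
    print(decode_fixed_records(raw), end="")
```

Note that this conversion is lossy: the record length, the padding, and the filemode are all discarded, which is why a plain text conversion alone could not meet the goal of moving the data back to another VM system.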
Moreover, this project had to be executed under very tight
time constraints without administrative support.
This particular project was further complicated because
it was envisioned not just as a curation of textual data (e.g.,
program source code, text formatting files, or SGML files),
but as preservation of executable programs as well. One goal
was to be able to move this data to a different IBM VM system
without loss of any crucial information, such that the programs
could still be run. It is worth noting that we were not archiving
these materials in the analog archivist's sense -- i.e., we did not
decide what material to keep and what to discard; we kept
essentially all of it.
While it must be the case that others have migrated data
off of IBM VM systems, the authors are not aware of any
similarly ambitious projects. Having no blueprint to follow,
the authors had to invent a method for transferring data from
VM to GNU/Linux in a manner that would both keep all of its
original properties intact (so that it could subsequently be
moved to another VM system), and simultaneously permit direct
use of cross-platform data (e.g., source code, text formatting
documents, JPEGs, etc.). This method involved writing at least
five separate programs, including a 1,751-line C program
that can be used to extract files from VM "DISK DUMP" format
files (essentially disk images) that have been transferred to a
GNU/Linux system. (In theory this would work just as well on a
Mac OS X system, and perhaps even on a Windows system.)
This paper treats this project as a case study, interesting both
because it describes what many younger DHers would think
of as a foreign or archaic system (or both) that was in heavy
use at Brown a mere 20 years ago, and is still in use today (as
of 2013-11-01, the latest version was released 2013-07-23),
and because it shows how significant a problem
migration can present.
References
1. Cragin, Melissa H.; Heidorn, P. Bryan; Palmer, Carole
L.; Smith, Linda C. (2007), An Educational Program on Data
Curation; poster session presented at ACRL STS 2007.
www.ala.org/ala/mgrps/divs/acrl/about/sections/sts/conferences/posters07.cfm
hdl.handle.net/2142/3493
Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne
Lausanne, Switzerland
July 7, 2014 - July 12, 2014
Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/
Attendance: 750 delegates according to Nyhan 2016
Series: ADHO (9)
Organizers: ADHO