What is Transcription?

Authorship
  1. Claus Huitfeldt
     University of Bergen
  2. C. Michael Sperberg-McQueen
     Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology; World Wide Web Consortium (W3.org)

Background
One common task in digital humanities is the creation of
digital representations of cultural artifacts, often in the
form of transcriptions of existing physical documents. When
we say, though, that a resource is a transcription of a particular
document, what do we mean? What inferences are licensed by
that claim? Given a transcription and adequate knowledge of
the rules it follows, what do we learn about the document of
which it is a transcription?
This paper proposes a formal account of transcription, as a way
of elucidating the concept as it applies in scholarly editing and
in the creation of digital resources. This approach may also
provide insight into the nature of electronic representations of
cultural artifacts, digital preservation, and text encoding more
generally. It may have practical consequences for quality
assurance and for the representation of markup semantics in
formal systems.
In this abstract, we first present the basic outlines of our
approach and our model of transcription, for simplicity's sake
addressing only the very simplest cases. We then address some
issues pertaining to less simple cases and discuss elaborations
of the model. Because this is work in progress, the abstract is
more sketchy on the latter issues, on which we expect to report
more extensively in the paper as presented at the conference.
General outline of the approach
In general, one document (the transcription, T) is said to be
a transcription of another document (the exemplar, E) if T
was copied out from E with the intent, successfully achieved,
of providing a faithful representation of a text as witnessed in
E. Often the purpose is to make this representation easier to
make use of than E itself. (For example, T may be easier to
read or duplicate than E, or may be able to travel while E
cannot.) A full account of transcription would also have to consider
the purposes for which a transcription is made and the uses to which
it may be put. These matters may be essential in deciding whether T
is a transcription of E. In
this paper, however, we investigate to what extent it is possible
to characterize, without taking such matters into account, the
relationship between T and E which allows the one to serve as
a record of information derived from the other. In the simplest
cases, where both E and T have an unproblematic, clearly
legible inscription, this relation is simple: the two documents
contain the same sequence of letters, spaces, punctuation marks
and other symbols.
But what does it mean to say that a letter in a manuscript is “the
same” as, e.g., the corresponding letter in a typed or printed transcription of that manuscript?
In simple cases, the question rarely arises, yet in transcriptions
of historical documents it may be urgent. The difficulty of
reading often consists precisely in the difficulty of deciding
which letters the various marks on a manuscript page indicate
(if any), and may be one reason for transcribing in the first
place.
In order to establish a more precise account of the “sameness”
of sequences of letters, we begin by distinguishing the concepts
document, mark, token and type.
By a document we understand an individual physical object
containing marks. A mark is a perceptible feature (normally
something visible, e.g. a line in ink) of a document. Marks may
be identified as tokens in so far as they are instances of types,
and collections of marks may be identified as sequences of
tokens in so far as they are instances of sequences of types. In
other words, a mark is a token if, but only if, it is understood
as instantiating a type.
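These distinctions admit of a first, purely illustrative rendering in
Alloy [Jackson 2006], the modeling language used for the formalization
discussed below. The following fragment is a minimal sketch; its
signature and field names (Document, Mark, Token, Type, instanceOf)
are expository choices, not fixed parts of the model.

  -- Illustrative sketch: documents contain marks; a token is a mark
  -- read as an instance of exactly one type.
  sig Type {}
  sig Document { marks: some Mark }
  sig Mark {}
  sig Token extends Mark { instanceOf: one Type }
  -- every mark belongs to exactly one document
  fact EachMarkInOneDocument {
    all m: Mark | one d: Document | m in d.marks
  }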
The distinction among marks, tokens, and types may be applied
at various levels: letters, words, sentences, texts. The
identification and separation of the marks, tokens, and types of
a document requires a competent reader. A theory of
interpretation or of reading is not the concern of our paper,
however — our goal is only to elucidate what it means to say
that T is a transcription of E, however the interpretive processes
resulting in T are properly understood.
Based on this outline of the relation between E and T and the
distinction among documents, marks, tokens, and types, the
following gives a first approximation to a more formal account
of the meaning, in the simplest cases, of the statement that T
is a transcription of E:
1. E and T are documents. (It follows that E and T each contain
marks.)
2. The marks of each document are interpreted as instantiating
a sequence of types:
• Each mark is identified as constituting one and only one
token.
• Each token is identified as instantiating one and only
one specific type.
• To the tokens of each document, an ordering is assigned
which is total, relative and linear.
3. The two sequences of tokens thus identified instantiate
identical sequences of types.
A more explicit account of these points is best given not in
prose but in a more formal notation. The full paper will provide
translations of the description above into the modeling language
Alloy [Jackson 2006]; for brevity, these formalisms are omitted
from this abstract.
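As an indication of what such a translation might look like, the
following minimal Alloy module renders the three conditions for the
simplest case. It is a sketch only: it rests on the simplifying
assumption that the tokens of each document can be read in a single
linear order, modeled here with Alloy's built-in sequences, and the
names reading and transcribes are chosen for exposition.

  -- Minimal illustrative module: T transcribes E (simplest case) iff
  -- their token sequences instantiate identical sequences of types.
  sig Type {}
  sig Token { instanceOf: one Type }
  sig Document {
    -- the tokens of the document, in their assigned linear order
    reading: seq Token
  }
  pred transcribes [t, e: Document] {
    -- same length, and the same type at every position
    #t.reading = #e.reading
    all i: t.reading.inds |
      t.reading[i].instanceOf = e.reading[i].instanceOf
  }
  -- ask the Analyzer for a small instance of two distinct documents
  -- standing in this relation
  run { some disj t, e: Document | transcribes[t, e] } for 4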
This simple account agrees well enough with common usage.
Vander Meulen and Tanselle (1999), for example, describe
transcription as “the effort to report — insofar as typography
allows — precisely what the textual inscription of a manuscript
consists of”. We take “textual inscription” to denote the physical
tokens of the document, and the method of reporting the textual
inscription of the exemplar to be the creation of another
document whose textual inscription produces (when rightly
read) the same sequence of abstract types.
Even commonplace phenomena require further elaboration of
this model (see below), but this simple model already illustrates
some essential features of transcription.
• In a transcription, “an A is an A is an A” — one instance
of a letter type is (for purposes of transcription)
interchangeable with any other. Detailed information about
the shape of individual marks is lost; the only salient
information about the token in the exemplar is which type
it maps to.
• The set of letter types distinguished in a transcription helps
define that transcription; the choice of an appropriate
inventory is a basic responsibility of the transcriber and a
fertile source of disagreements.
• The formal relation between exemplar and transcript is
symmetric and transitive. (Not necessarily so the concept
of transcription — just the formal relation.)
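The last point can be checked mechanically: if the illustrative module
above is extended with an assertion along the following lines, the
Alloy Analyzer will search a bounded scope for counterexamples to
symmetry and transitivity. Again, this is only a sketch of how such a
check might be written.

  -- Appended to the illustrative module above: the formal relation
  -- (identity of type sequences) is symmetric and transitive,
  -- whatever we decide about the concept of transcription.
  assert FormalRelationSymmetricTransitive {
    all t, e: Document |
      transcribes[t, e] implies transcribes[e, t]
    all a, b, c: Document |
      (transcribes[a, b] and transcribes[b, c]) implies transcribes[a, c]
  }
  check FormalRelationSymmetricTransitive for 5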
Application and elaboration
A satisfactory formal account of transcription must deal
with complications ignored above and will require
elaboration of the basic outline. Our account has some affinity
with Nelson Goodman's theory of notation [Goodman 1976].
We have no general characterization of these complications
yet, so we introduce them here in the form of examples:
Some transcriptions must account both for systematic ambiguity
of tokens in the writing system (a particular shape sometimes
instantiates one type, sometimes another) and for uncertainty about which token is actually present. These cases present
special difficulties for the choice of the appropriate type
inventory.
The mapping from mark to type may depend on context: marks
of similar shape may be read now as commas, now as full stops.
Paleographic transcriptions often mark line breaks by vertical
bars; the type sequence in the transcript thus appears to include
items not present in that of the original. Shall the line breaks
of the exemplar be understood as a special kind of letter token?
Some transcribers normalize or regularize the spelling of the
exemplar, sometimes silently. The unit of transcription, in such
cases, is the lexical item or word form, rather than the letter.
Scribal abbreviations pose a challenge: expanding them destroys
the one-to-one correspondence between the tokens in the
original and in the transcript. How can such expansions best
be modeled? Perhaps each token in the exemplar corresponds
to one or more tokens in the transcript? But then what of
shorthand transcriptions made for private use, in which
abbreviations are not expanded but introduced? The relation
between the type sequences of exemplar and transcript must
be more complex than simple identity.
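One possible elaboration, sketched here purely for illustration,
replaces identity of type sequences with an explicit correspondence in
which each exemplar token answers to one or more transcript tokens
(the direction would be reversed for shorthand transcriptions made for
private use). The signatures Exemplar, Transcript and Expansion, and
the constraints placed on them, are assumptions made for this sketch,
not settled parts of the model.

  -- Illustrative variation: expansion of abbreviations modeled as a
  -- correspondence from exemplar tokens to transcript tokens.
  sig Type {}
  sig Token { instanceOf: one Type }
  sig Document { reading: seq Token }
  one sig Exemplar, Transcript extends Document {}
  one sig Expansion { corr: Token -> Token }
  fact WellFormedExpansion {
    -- the correspondence links exemplar tokens to transcript tokens
    Expansion.corr in Exemplar.reading.elems -> Transcript.reading.elems
    -- each exemplar token is rendered by at least one transcript token
    all x: Exemplar.reading.elems | some x.(Expansion.corr)
    -- each transcript token renders exactly one exemplar token
    all y: Transcript.reading.elems | one (Expansion.corr).y
  }
  run {} for 4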
Sometimes transcriptions include marks to which no mark in
the exemplar seems to correspond. Vertical lines in a
paleographic transcription may (as suggested above) be
interpretable as transcribing line breaks in the exemplar. But if
the transcription provides line numbers as well, it's difficult to
stretch the notion of allographic variation to cover them.
Similarly, transcriptions may contain expansions of
abbreviations, letters or words supplied because they are assumed to
have been omitted from the exemplar by slips of the pen, transcriber's
comments on the wording or physical appearance of the exemplar, etc.,
i.e. “additional” content not found in the exemplar.
The most fundamental problem, however, lies in the assumption, tacit
in the preceding remarks, that the text represented
in a particular document, and thus the document itself, can
usefully be represented as a one-dimensional sequence of types.
This view appears painfully simplistic in the light of recent
years' work on XML and text encoding. Even in conventional
printed books, the characters seldom admit of only one possible
sequence. And digital representations seldom limit themselves
to preserving the sequence of graphemes in the exemplar;
normally, they use markup to identify larger textual units and
make explicit aspects of the text which may lack any
conventional orthographic representation.
One might take the structural units identified by markup (e.g.
the elements of an XML document) as instantiating complex
types; such an approach would in some ways resemble
Goodman's analysis of ‘characters’, with the markup used to
make the identification of complex characters explicit and
reliable. Alternatively, one might treat marked up documents
as representing a mixed notation, with character data content
operating in one way, and markup in a different way. Goodman,
for example, analyses drama as a mixed notation, in which the
dialog functions as a “score”, but the stage directions function
as a “script”; the analogy with markup may prove informative.
Whatever the approach, a key challenge for any formal
treatment of transcription is to make it fit into the more powerful
notions of textual structure reflected in modern markup systems.
This is the most important topic to be addressed in the further
development of the formal model of transcription.
Bibliography
Driscoll, Matthew. "Levels of Transcription." Electronic Textual
Editing. Ed. Lou Burnard, Katherine O'Brien O'Keeffe and
John Unsworth. New York: Modern Language Association of
America, 2006. 254-261.
Goodman, Nelson. Languages of Art. Indianapolis, IN: Hackett
Publishing Company, 1976.
Huitfeldt, Claus. "Multi-dimensional Texts in a One
Dimensional Medium." Computers and the Humanities 28
(1995): 235-241.
Huitfeldt, Claus. "Philosophy Case Study." Electronic Textual
Editing. Ed. Lou Burnard, Katherine O'Brien O'Keeffe and
John Unsworth. New York: Modern Language Association of
America, 2006. 181-196.
Jackson, Daniel. Software Abstractions: Logic, Language, and
Analysis. Cambridge, MA: MIT Press, 2006.
Kline, Mary-Jo. A Guide to Documentary Editing. Second
edition. Baltimore, MD: Johns Hopkins University Press, 1998.
Robinson, Peter. The Transcription of Primary Textual Sources
Using SGML. Office for Humanities Communication
Publications, Number 6. [Oxford: OHC], 1994.
Robinson, Peter, and Elizabeth Solopova. "Guidelines for the
Transcription of the Manuscripts of the Wife of Bath's
Prologue." The Canterbury Tales Project Occasional Papers
Volume I. Ed. Norman Blake and Peter Robinson. Office for
Humanities Communication Publications, Number 5. [Oxford:
OHC], 1993.
Stevens, Michael E., and Steven B. Burg. Editing Historical
Documents: A Handbook of Practice. Walnut Creek, CA: AltaMira
Press, 1997.
Vander Meulen, David L., and G. Thomas Tanselle. "A System
of Manuscript Transcription." Studies in Bibliography 52
(1999): 201-212.


Conference Info

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007
