What is Transcription? (Part 2)

  1. 1. C.M. Sperberg-McQueen

    World Wide Web Consortium (W3.org)

  2. 2. Claus Huitfeldt

    University of Bergen

  3. 3. Yves Marcoux

    Université de Montréal

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In earlier work, Huitfeldt and Sperberg-McQueen have
sketched out a series of formal models of transcription,
characterized by gradually increasing complexity
and descriptive power, but not, in the end, by a satisfactory
explication of transcription [Huitfeldt/Sperberg-
McQueen 2008].
The essential proposition of these models may be called
the Reading Identity Hypothesis: namely, that in simple
cases the relation between a transcription and its exemplar
is that each can be read as containing (a sequence of
tokens which map to) the same sequence of types. When
types and tokens are identified at the character level, for
example, a document containing marks which can be
read as tokens representing the sequence of characters
Come, sit thee downe upon this flowry bed,
can be regarded as a transcription of another document
just in case that document, too, contains marks which
can similarly be read, as tokens of those types, in that order.
The addition of various meta-characters in transcription
(e.g. explicit marking of line breaks in the exemplar)
is modeled by allowing readings to be selective, ignoring
some marks in the document being read. Some common
variation in transcription practice is modeled by postulating
different token/type mappings with different type
inventories: the example just given, for example, will
count as a transcription of one line in Shakespeare's First
Folio only if long and short s are treated as the same
grapheme and if u and v are normalized. When long
and short s are not mapped to the same grapheme, and
u and v are distinguished based on their letter form and
not on current usage, then a transcription will contain a
sequence of tokens readable as: Come, ſit thee downe vpon this flowry bed,
Prominent among the shortcomings of the reading-identity
hypothesis as formulated in [Huitfeldt/Sperberg-Mc-
Queen 2008] is the explicit assumption that a document,
and its transcription, consist essentially of a sequence of
characters (or more generally of tokens).
In this paper, we will assume that a document identifies
not just a sequence of typed tokens, but a structure consisting
of typed relations (such as containment or dominance)
between sequences or sets of typed objects (such
as paragraphs or section headings). The relation between
documents can then be modeled as a relation between
such structures, allowing transcriptions to agree or disagree
not only on character sequences, but also on document
structure, and on assignments of types to document
objects and their relations.[1]
Based on these assumptions we will propose a formal
model of document structure which is able to capture basic
notions underlying conventional markup formalisms
such as the tree structures of XML, the concurrent trees
of XCONCUR [Hilbert et al. 2005], the sequence-plusranges
model of LMNL [Tennison/Piez 2002], and the
directed acyclic graphs of Goddag structures [Sperberg-
McQueen/Huitfeldt 2000]. The formalism we develop is
not intended and should not be understood as an alternative
to any of these, only as a way of trying to reduce
them to a common denominator in terms of which it is
possible to determine that two transcriptions do or do not
agree on particular aspects of the exemplar.
We believe the model can be used to explain, at an appropriate
level of abstraction, how it is possible that
representations using different markup formalisms and
their associated vocabularies can agree or disagree on
the transcription of a given exemplar.
For example, in most current practice, the line given
above as an example is likely to be embedded in a document
structure showing that it occurs within a particular
play, act, scene, and speech: for example
<l>Come, sit thee downe vpon this
flowry bed,</l>
or, using a different XML vocabulary (TEI),
<div type="act" n="IV">
<head>Actus Quartus</head>
<div type="scene" n="i">
<sp who="Titania">
<l>Come, &longs;it thee downe &v;pon
this flowry bed,</l>
In older electronic texts the transcription is likely to
use a different markup language entirely (here COCOA
<T Midsummer Night's Dream>
<A 4> ...
<S 1>
<C Titania>
Come, *sit thee downe vpon this flowry bed,
These three transcriptions identify, by means of their
markup, textual structures which are substantially similar,
but different in details. All three, for example, can
be read as indicating that the line in question is part of
the first scene of the fourth act, and that it is spoken by
Titania, but they differ in details (act 4 or Act IV or Actus
Quartus? “Tita.” or “Titania”? or both?)
One challenge for a formal model of transcription is to
distinguish essential differences among these transcriptions,
which represent disagreements about the text of
the exemplar (and thus disagreements among the transcribers)
on the one hand, from inessential differences
(or in some cases essential differences which happen not
to relate to questions of transcription per se) which follow
from the different markup languages used, or from
the choice of textual features to mark up, on the other
The main body of the paper is the definition of the formal model, followed by an extended example showing the
transcription of a short document in several markup languages
and the representation or analysis of these transcriptions
in the abstract model of transcription.
The formal model associates to each transcription a set
of statements (or assertions) about the transcribed exemplar.
The statements may be expressed in any convenient
notation, e.g. some form of first-order logic. Since different
transcription practices and contexts may consider
widely different sets of textual features as relevant, the
exact predicates allowed in the statements can vary from
one transcription to another.
Essentially, the statements corresponding to a transcription
are obtained as follows. First a set of basic facts
about the exemplar is derived in a straightforward manner
directly from the encoding of the transcription. Then,
the closure of that set under inference is taken. The inference
rules will vary with the specific transcription practice,
conventions, and context in which the transcription
was performed, and can also include special rules for
“translating” statements from one combination of transcription
practice/conventions/context to another.
If two transcriptions agree in every detail (regardless of
how they are coded), the sets of inferences derived from
them will be equal. Intuitively, if one set is a subset of
the other, this indicates that the transcriptions do not disagree,
but that one of them is more detailed or thorough
than the other.
If neither set is a subset of the other, then we have to
look at the (closure under inference of the) union of the
sets. If the union is consistent (i.e., does not allow inferring
both a statement and its negation), the transcriptions
still do not disagree, but each include aspects of the exemplar
that the other does not. However if the union is
inconsistent, then we can conclude that the transcriptions
disagree, and even say on what particular point(s) they
Note that the construction of the set of basic facts from
the encoding of the transcription is not a simple mapping
from generic IDs (or, in general, markup configurations)
to predicate names. Thus, for instance, different generic
IDs could map to the same predicate name. Some markup
may also be ignored, and even entire elements, like
elements that are not part of the transcription as such (for
example, the TEI header[2]). More complex methods of
constructing the set of basic facts are also possible.
Apart from its intrinsic interest, a formal model of transcription
is of use in efforts to support a formal description
of the meaning of various constructs in well known
markup schemes, in which the meaning of certain elements
is inextricably linked to the fact that they can contain,
or be used in, transcriptions of existing documents.
We hope to conclude with some observations about the
applicability of our model in the formal description of
markup vocabularies.
DeRose, Steven J., David G. Durand, Elli Mylonas, and
Allen H. Renear. 1990. “What is Text, Really?” Journal
of Computing in Higher Education 1.2: 3-26.
Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt.
2005. “Making CONCUR work.” In Proceedings of
Extreme Markup Languages 2005. On the Web at
Huitfeldt, Claus, and C. M. Sperberg-McQueen. 2008.
“What is transcription?” L&LC 23.3 (2008): 295-310.
Available on the Web at <URL:http://llc.oxfordjournals.
Power, Richard, Donia Scott, and Nadjet Bouayad-Agha.
2003. “Document structure”. Computational Linguistics
29.2: 211-260.
Renear, Allen, David Durand, and Elli Mylonas. 1996.
“Refining our Notion of What Text Really Is: The Problem
of Overlapping Hierarchies”. In Research in Humanities
Computing ed. Nancy Ide and Susan Hockey.
Oxford: OUP, 1996.
Sperberg-McQueen, C. M., and Claus Huitfeldt.
2000.“GODDAG: A Data Structure for Overlapping Hierarchies”
paper given at Digital Documents: Systems
and Principles. 8th International Conference on Digital
Documents and Electronic Publishing, DDEP 2000,
5th International Workshop on the Principles of Digital
Document Processing, PODDP 2000, Munich, Germany,
September 13-15, 2000. Published in DDEP-PODDP
2000, ed. P. King and E.V. Munson. Lecture Notes in
Computer Science 2023. Berlin: Springer, 2004, pp.
139-160. Available on the Web at <URL:http://www.
Tennison, Jeni, and Wendell Piez. 2002.“The layered
markup and annotation language (LMNL)”Late-breaking
talk given at Extreme Markup Languages 2002 (not
in proceedings). Slides on the Web at <URL:http://www.
2002/Tennison02/EML2002Tennison02.zip> [1] This assumption in some ways resembles the celebrated
assertion, widely known as the OHCO hypothesis,
that “text is an ordered hierarchy of content objects”
[DeRose et al. 1990] but differs from it in some ways.
We are speaking about documents (physical objects) not
texts (abstract objects); we say that documents identify
certain abstract structures not that they are constituted by
those structures; our assumption does not imply that the
abstractions involved are “content objects”, nor that they
are hierarchically structured, nor that they are ordered.
Our assumption does however share with the OHCO
hypothesis the idea that documents have structure more
complex than a simple sequence of types or tokens. See
also the retraction of the hierarchical part of the OHCO
hypothesis by several of its authors [Renear et al. 1996]
and the linguistically motivated analysis of document
structure in [Power / Scott / Bouayad-Agha 2003].
[2] Elements such as the TEI header may however be
very helpful in determining the practice/conventions/
context of the transcription.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info


ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

  • Keywords: None
  • Language: English
  • Topics: None