Form, Content, and the Philosopher's Stone

Paul Caton
Brown University


I.

Buzzetti (2002) believes strongly-embedded SGML/XML markup produces inadequate digital representations of texts. He considers the text model such markup assumes--the "OHCO" model[1]--confused about its object and incapable of appropriately representing textual content. Renear et al. (2003) respond to one part of Buzzetti's critique by asserting the value of OHCO-type markup as a data model in its own right. This paper, in contrast, examines the case Buzzetti makes for a "deep" data model that reflects the *form of the content* and from which any truly adequate digital representation of a text must flow. In his argument I find occlusions, silences, and ambiguities that, I conclude, fatally undermine his project.


II.

Buzzetti specifies two criteria for an adequate digital representation of a text. These are exhaustivity--"[it] should in no way impoverish the informative content of the text" (61)--and exploitability, that is, its "liability ... to automatic processing and its functionality with respect to the critical operations of reconstructing or interpreting the text" (62). He argues that descriptive markup of the kind exemplified by the Text Encoding Initiative's scheme does not meet these criteria because, despite the foundational claim of OHCO-type markup to be addressing essential content instead of superficial form (DeRose et al. 1990), it still confuses the form of the text's expression with the form of the text's content. In mistaking surface for depth and reproducing the form of expression, the underlying text model cripples itself with respect to exploitability. In effect Buzzetti gives a radical turn of the screw to one of SGML's fundamental justifications: the notion of creating an Ur-representation of content from which many different forms can be generated. We should see the entire text as a generated expression whose form we must not take to be the form of the thing it was generated from; the model that adequately represents the one cannot adequately represent the other. The originary 'thing' is a structure of informative content: abstract, non-linear, possessing its own properties and behaviours, which an adequate data model (and thus an adequate text model) should represent through an appropriate formalism.
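To make the contrast concrete, here is a minimal sketch (mine, not Buzzetti's; the element names are hypothetical, loosely TEI-flavoured) of the position he is attacking: a single, hierarchically encoded representation from which different surface forms are generated on demand.

# Illustrative sketch only (not from the paper): a toy OHCO-style encoding
# with hypothetical, TEI-flavoured element names, and two different surface
# forms generated from the same content hierarchy.
import xml.etree.ElementTree as ET

ENCODED = """
<div type="poem">
  <head>Song</head>
  <lg>
    <l>Drink to me only with thine eyes,</l>
    <l>And I will pledge with mine;</l>
  </lg>
</div>
"""

def as_plain_text(root):
    # One generated expression: a plain reading text.
    lines = [root.findtext("head", default="").strip()]
    lines += [l.text.strip() for l in root.iter("l")]
    return "\n".join(lines)

def as_html(root):
    # Another generated expression: a simple HTML rendering of the same content.
    parts = [f"<h2>{root.findtext('head', default='').strip()}</h2>"]
    parts += [f"<p>{l.text.strip()}</p>" for l in root.iter("l")]
    return "\n".join(parts)

root = ET.fromstring(ENCODED)
print(as_plain_text(root))
print(as_html(root))

On Buzzetti's account both generated renderings, and indeed the encoding itself, remain on the plane of expression; none of them gives us the form of the content.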


III.

In humanities computing at least, OHCO-type markup has come to dominate digital representations of full texts for scholarly use, which means that the majority of the products of more than a decade of encoding effort do not meet Buzzetti's criteria. The easy pragmatic observation would be that the very prevalence of OHCO-type markup suggests many people have found it quite adequate for their purposes to date,[2] but for a scholarly response this obviously will not do: the case should be properly judged on its own merits. We might assert that Buzzetti's critique is bound to have no lasting impact because a number of the arguments have been made before--albeit not within such a sophisticated theoretical framework--and if OHCO-type markup survived them then, it will survive them again. Survival, though, is not the same as triumph. Responding to Buzzetti, Renear et al. acknowledge that OHCO proponents have in the past let these issues lie unresolved and that he is right to foreground them again. They also acknowledge some history of terminological/conceptual confusion but argue--I believe correctly--that Buzzetti overstates the case and that his examples "are far from convincing evidence for systematic conflation [of expression form with content form]." However, because their main concern is with defending against a particular charge that Buzzetti takes up from Raymond, Tompa, and Wood (1992, 1996) regarding SGML's supposed lack of a "standard semantics," they do not address the main thrust of Buzzetti's argument.


IV.

The text encoding community has well rehearsed the strengths and limitations of OHCO-type markup.[3] The real interest of Buzzetti's critique lies in his attempt to conceptualise what *should* underpin digital representations. It is clear that Buzzetti has in mind not a digital representation of "a text,"[4] but a comprehensive digital edition, an agglomeration of data that is the raw material from which any desired view can be generated and upon which any scholarly analysis can be based. He cites as congenial to his argument ideas on treating textual materials from Theodore Nelson and from Manfred Thaller, and much of what he says implies a classically data-centric approach. The database has always served not just as a store, but also as a base for algorithmically (re)constructing views of the data.

Equally clear, though, is that Buzzetti's notion of generation has been strongly influenced by the two linguistic models he co-opts: Chomsky's Transformational Grammar and Hjelmslev's Glossematics. Codd's relational data model in itself implies absolutely no stratigraphical or temporal relation between any of its components. Such relations can be modeled within a relational database, but only as a function of the user's choice of domains and attributes; they are products of the user's modelling, not inherent in the relational model itself. Buzzetti's "adequate digital representation," however, clearly involves working 'backwards' or 'downwards' from the manifest data. It looks not simply to collect and label what is, but to treat what is as a result or surface effect and to store instead what 'what is' came from. With this comes a notion of reduction, of constraint: the multiplicity of what is comes from different operations on a limited set of essential data objects. Hence the appeal for Buzzetti of linguistic models that 'reverse engineer' data as it appears to us. Johansen (1993), for example, speaks of Hjelmslev's desire to arrive at a calculus of language, a set of axioms and functions from which all possible acceptable sentences can be derived; similarly, in a Transformational Grammar approach, a set of transformation rules and a set of basic sentence structures can generate all the legal syntactic variations. However, because these models are specific to the abstractions of language, Buzzetti cannot bring them in as actual candidates for the data model he thinks is necessary; instead they serve as analogies. The problem is that while each model's general character conveys well enough what Buzzetti wants, its specifics do not.
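A toy illustration (mine, purely hypothetical; the rules below stand in for no actual grammar) may make the attraction of such models concrete: a handful of rewrite rules over a small stock of underlying elements derives every legal surface string, so the multiplicity of 'what is' reduces to a limited generative core.

# Illustrative sketch only: a toy generative 'calculus' of the kind the
# linguistic analogy gestures at. The rules are invented for the example;
# they implement nothing of Chomsky's or Hjelmslev's actual models.
from itertools import product

RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["scribe"], ["text"]],
    "V":  [["copies"], ["emends"]],
}

def expand(symbol):
    # Return every terminal string derivable from a symbol.
    if symbol not in RULES:          # a terminal: already a surface form
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for combo in product(*(expand(s) for s in rhs)):
            results.append([word for part in combo for word in part])
    return results

for sentence in expand("S"):
    print(" ".join(sentence))   # e.g. "the scribe emends the text"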

For example, Buzzetti's argument depends heavily upon Hjelmslev's four-layered model of semiosis, which "allows us," he says, "to clearly distinguish between the form of the representation and the form of the content represented" (64). However, to leave it at that (that is, at a distinction between representation and content) is really to say no more than that the association between signifier and signified, though binding, is arbitrary, which is to say no more than Saussure's model does. The whole point of what makes Hjelmslev's model different from Saussure's is the further distinction between form and substance, on both the content and expression planes, a distinction that involves an extremely subtle interrelationship between the two layers on each plane. Yet this distinction--so visible in the diagrammatic representation of Hjelmslev's model that Buzzetti includes--disappears from the discussion itself, along with any complications it introduces to the simple mapping Buzzetti wants to establish between the form of the expression and traditional markup.[5]


V.

Such moments in Buzzetti's text are symptomatic of an illusory quest. His critique assumes an ultimate textual essence that we can capture in a supermodel that preserves all informative content and serves all scholarly needs; we just have to push further in from the surface form than OHCO does to find it. On close reading of Buzzetti's argument, however, the rhetorical paths that promise to lead there instead peter out, or bring us back to the 'surface.' Buzzetti quite legitimately desires to move beyond the limitations of OHCO to digital texts unconstrained by a single representational form, towards a comprehensive representation. Seduced by two powerful linguistic models that seem to promise a path from effect to cause, surface to depth, incidental to essential, he assumes the existence of a comprehensive data model: a grand unified abstraction of a text that accounts for all textual phenomena and enables all textual analysis. The subtlety and scope of the case he puts forward make his, as Renear et al. have already acknowledged, a landmark contribution to markup theory, but one which I find, in the end, unconvincing.


Notes


[1] For convenience, and because of the close association between SGML's formal grammar and its assumed text model, I will hereafter refer simply to "OHCO-type" encoding.

[2] Anecdotally, this author has been told by practitioners of both 'deep' and 'shallow' encoding that advanced features they provide to enable scholarly exploitation of the encoding get little use. This minimal usage can, of course, be interpreted in quite different ways.

[3] Representative discussions include Huitfeldt 1994, Renear et al. 1996, Renear 1997, and Caton 2001.

[4] Here meaning as defined in Caton 2001, i.e. "a piece of text of known extent with formal generic markers at the beginning and end, the whole constituting a single written utterance" (1).

[5] We should note that Hjelmslev's model, though extremely suggestive, is not without problems itself; see the excellent discussion in the first chapter of Johansen 1993.


References

Buzzetti, Dino. 2002. Digital representation and the text model. New Literary History 33: 61-88.

Caton, Paul. 2001. Markup's current imbalance. Markup Languages: Theory and Practice 3:1 (Winter), 1-13.

DeRose, Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. 1990. What is text, really? Journal of Computing in Higher Education 2:1 (Winter), 3-26.

Huitfeldt, Claus. 1994. Multi-dimensional texts in a one-dimensional medium. Computers and the Humanities 28: 235-241.

Johansen, Jorgen Dines. 1993. Dialogic semiosis: an essay on signs and meaning. Bloomington: Indiana University Press.

Raymond, Darrell R., Frank Tompa, and Derick Wood. 1992. Markup reconsidered. First International Workshop on Principles of Document Processing. Washington, D.C., October 1992.

Renear, Allen H., David Durand, and Elli Mylonas. 1996. Refining our notion of what text really is. Research in Humanities Computing. Nancy Ide and Susan Hockey, eds. Oxford: Oxford University Press. 263-280.

Renear, Allen H. 1997. Out of praxis: three (meta) theories of textuality. Electronic Text: Investigations in Method and Theory. Kathryn Sutherland, ed. Oxford: Clarendon Press. 107-126.

Renear, Allen H., David Dubin, Michael Sperberg-McQueen, and Claus Huitfeldt. 2003. Text markup: data structure vs. data model. Paper presented at the Joint International Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC 2003), Athens, Georgia, July 2003.


Conference Info

ACH/ALLC 2004, hosted at Göteborg University, Gothenburg, Sweden, June 11-16, 2004. Organizers: ACH, ALLC.