From Markup to Analysis: Culture Claims and Code in the Digital Archive

paper, specified "long paper"
  1. Julia Flanders

    Northeastern University

  2. Elizabeth Maddock Dillon

    Northeastern University


Following the lead offered in the "Text Analysis Meets Text Encoding" panel session at DH2012 in Hamburg [1] and in the 2013 debate between Matthew Jockers and Julia Flanders on "A Matter of Scale" [2], the research value of basic modeling for text analysis seems comparatively clear, even if it is as yet unrealized in practice. However, the payoff from the more complex forms of modeling that are evident in typical TEI-encoded thematic research collections has been less clearly demonstrated. The usefulness of such modeling is evident insofar as it supports the publication of these collections: information about document structures is used to produce display formatting and navigation, and information about content features such as named entities and genre is commonly used in searching. But the larger claims made for digital resources (that they "allow new questions to be asked" [3]) require us to distinguish between these kinds of functional value and what we might call "real research value." The real research value of the modeling is reflected in the ability of the data to support complex inferences, at scale, that materially contribute to humanities research arguments. Huitfeldt et al., in "Meaning and Interpretation of Markup" [4], offer a foundational demonstration of how markup "licenses inferences" in formal terms.
But projects developing TEI-encoded archival collections have not yet articulated in detail how that markup might lead researchers from comparatively simple inferences ("this paragraph contains a reference to such-and-such a person") to more complex ones ("if the name of a person classified in our personography as a war hero appears inside an advertisement within a magazine published after the date of the war in question, the advertiser may be using that person's identity to promote the goods being advertised"; "people whose names appear in both advertisements and poetic dedications command greater social capital than those whose names appear in either genre alone"). And the conceptual leap from complex statements of this kind to larger conclusions is greater still.
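
As an illustration only, the first of the complex inferences quoted above can be sketched in a few lines of Python. Everything specific here is an assumption made for the example: the personography fields (`role`, `war_ended`), the identifiers, the use of `div/@type` to carry genre, and the pairing of each fragment with a publication year are hypothetical and do not describe either project's actual encoding. The result is a list of candidates for interpretation, not a conclusion.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
NS = 'xmlns="http://www.tei-c.org/ns/1.0"'

# Hypothetical personography: identifiers and fields are invented for this sketch.
PERSONOGRAPHY = {
    "#p1": {"role": "war hero", "war_ended": 1783},
    "#p2": {"role": "merchant", "war_ended": None},
}

# Hypothetical documents: (publication year from metadata, encoded fragment).
DOCUMENTS = [
    (1785, f'<div {NS} type="advertisement"><p>Fine rum, as favoured by '
           f'<persName ref="#p1">Admiral Example</persName>.</p></div>'),
    (1779, f'<div {NS} type="advertisement"><p>Cloth, praised by '
           f'<persName ref="#p1">Admiral Example</persName>.</p></div>'),
    (1786, f'<div {NS} type="dedication"><p>To '
           f'<persName ref="#p1">Admiral Example</persName>.</p></div>'),
    (1786, f'<div {NS} type="advertisement"><p>Sold by '
           f'<persName ref="#p2">A. Merchant</persName>.</p></div>'),
]

def candidate_endorsements(documents, personography):
    """Return (person id, year) pairs where a person classified as a war hero
    is named inside an advertisement published after the war ended."""
    hits = []
    for year, xml in documents:
        root = ET.fromstring(xml)
        for div in root.iter(TEI + "div"):
            if div.get("type") != "advertisement":
                continue  # genre condition: only advertisements qualify
            for name in div.iter(TEI + "persName"):
                ref = name.get("ref")
                person = personography.get(ref)
                if (person and person["role"] == "war hero"
                        and year > person["war_ended"]):
                    hits.append((ref, year))
    return hits

print(candidate_endorsements(DOCUMENTS, PERSONOGRAPHY))  # [('#p1', 1785)]
```

Each condition in the function corresponds to a separate layer of modeling: the `persName` reference, the personography classification, the genre of the containing division, and the document-level date. The further step, from "named in a post-war advertisement" to "used to promote the goods", remains an interpretive act that the code does not perform.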

This paper seeks to explore how the modeling of textual data for humanities research connects to the high-level research questions humanities scholars address in their scholarly writing. It offers a detailed description of the modeling (transcription, text encoding, metadata) and the high-level research goals for two closely related digital collections: the Early Caribbean Digital Archive and the Women Writers Project. It then traces critically the inferential steps by which we seek to get from the data being captured to the theoretical concepts ("culture", "geography", "influence", etc.) that animate the research. The goal of the paper is to provide a much more exacting and thorough understanding of the complexity of data modeling required to support the argumentative nuance and conceptual subtlety of real-world, high-quality humanities research.

In particular, our focus is on the possibilities and challenges of knowledge production afforded by modeling and encoding archives of materials that concern marginalized persons and non-canonical texts and histories. The two projects under discussion here are engaged in bringing to visibility texts and narratives that had previously been submerged beneath (or concealed within) more culturally prominent discursive forms. The Women Writers Project is currently engaged in a collaborative research project funded by the National Endowment for the Humanities, titled "Cultures of Reception", which gathers and digitizes periodical reviews of late 18th- and early 19th-century women's writing to support the study of patterns of reception in an emerging transatlantic literary culture. The Early Caribbean Digital Archive is digitizing a variety of Caribbean textual sources from the same period, with special emphasis on the study of the emerging culture of commodity circulation and its relation to the transatlantic slave trade and submerged narratives of race and gender. Both projects involve detailed TEI encoding of textual sources animated by research goals such as these:

trace and map the relations between texts as a function of time, human agency, and geography
bring into visibility relations between locations of print activity across the Caribbean archipelago
show relations among individuals, such as printers, consumers, merchants, runaway slaves, missionaries, plantation owners, abolitionists, military figures, and colonial political figures
map relations between legislation, commodity prices, geography
map the geographic circulation of literary tropes
trace changes in the culture of reviewing over time with respect to the emergence of a transatlantic literary culture
bring to visibility the evaluative frames of reference within which women's writing is read
trace the cultural frame of reference for reviewers in England and in North America

In both cases, the projects must make conceptual and inferential bridges between the specific assertions constituted in the markup (observations about genre, named entities, time and location, textual structure, references to circulating cultural objects such as commodities and texts) and concepts operating at a much more abstract level: "culture", "geography", "influence", "relations", "frames of reference." Humanities scholars are comfortable using terms like these in their writing, as the currency of methodologies from cultural studies to cultural geography indicates, but what forms of evidence and inferential reasoning do they entail at the level of textual markup?

We can unpack here, in a preliminary way, the kinds of reasoning through which these bridges might be built, and the final paper will explore these in more detail. First, there is a set of direct modeling activities (transcription and markup) through which the texts are constituted as research evidence. The activities of transcription produce from the source document assertions about the existence of strings of characters, and markup allows the transcriber and editor to identify areas where this evidence is ambiguous or missing. Through markup the editors can also identify specific strings as references to certain types of named entities (persons, places, books, publishing houses, ships, shipping companies, legislative bodies, etc.) and can associate these references with their targets to provide unambiguous entity identification via linked data authority records. At a structural level, the markup can also be used to identify the genre and format of texts and parts of texts, and to associate metadata (author, date and place of production, etc.) with these.

Following these activities, we must venture to make inferences based on our modeling. For example, the markup allows us to give greater precision to inferences common in text analysis: instead of judging collocation based on raw word proximity, we can identify word pairs or groups as being within the same textual feature (paragraph, poem, letter, advertisement, heading). In specific cases, we may be able to infer something more from such collocation, such as a connection between two authors mentioned in the same paragraph of a review. Genre and format information may enable us to sharpen these inferences further: mentioning an author in a review means something different from mentioning an author in a dedication; a commodity carries a different cultural freight when listed in an advertisement, a bill of lading, a receipt, a letter, or a legislative document. The name of an enslaved person means something different when it appears in a bill of sale, a runaway slave notice, or a poem. Taking metadata into account, we can also localize documents in space and time, which gives us the possibility of (cautiously) identifying trends, causation, and cultural significance.
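
The contrast between raw word proximity and collocation within a textual feature can be made concrete with a small sketch. The fragment below is invented for the example, and the use of `div/@type` for genre and `persName/@ref` for entity identification is an assumption about how a collection might be encoded, not a description of either project's schema. The function counts pairs of persons named within the same structural unit and keeps the genre of the containing division attached to each pair.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment; identifiers and content are illustrative only.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0"><body>
  <div type="review">
    <p>The author <persName ref="#a1">A</persName> is compared with
       <persName ref="#a2">B</persName>.</p>
    <p>A later paragraph mentions <persName ref="#a3">C</persName>.</p>
  </div>
  <div type="dedication">
    <p>To <persName ref="#a1">A</persName> and
       <persName ref="#a3">C</persName>.</p>
  </div>
</body></text>"""

def cooccurrences(xml, unit="p"):
    """Count pairs of person references that share a structural unit,
    keyed by the genre of the enclosing division.

    Assumes divisions are not nested; nested divisions would need
    extra handling to avoid counting a unit more than once."""
    root = ET.fromstring(xml)
    counts = Counter()
    for div in root.iter(TEI + "div"):
        genre = div.get("type", "unknown")
        for block in div.iter(TEI + unit):
            refs = sorted({n.get("ref")
                           for n in block.iter(TEI + "persName")
                           if n.get("ref")})
            for pair in combinations(refs, 2):
                counts[(genre, pair)] += 1
    return counts

for (genre, pair), n in cooccurrences(SAMPLE).items():
    print(genre, pair, n)
```

In the sample, B and C are named only a sentence apart, but because they fall in different paragraphs they are not paired. A and C are paired only in the dedication, so the genre travels with the pair and can be weighed differently from a pairing found in a review.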

In building these bridges, we need to be attentive to the gaps or weak inferential points as we move from the modeling to the research. For example, what does it mean to infer relationships between entities from their proximity within documents? Where are these inferences strongly grounded (for instance, co-authorship reflected in metadata records) and where are they weak (for instance, two names appearing in the same paragraph of a historical account)? Or, in another vein, is the model of geography that emerges from textual attestation (i.e., the inventory of place names and location references) adequate for the kinds of geographical analysis we want to do? Or again, what does "circulation" (whether of physical objects or of ideas) look like as attested in data of this kind, and how do we discover it? Finally, what is the relation between the encoding categories we have identified here and our knowledge production with respect to marginalized texts, persons, and narratives? How will our modeling decisions erase or repeat historical occlusions in the archive, not only by determining what aspects of the text are marked but also by imposing existing frames of knowledge on the archive?
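
One modest way to keep such weak points visible is to record, for every inferred relation, the basis on which it was inferred, so that strongly and weakly grounded relations are never merged into a single undifferentiated network. The sketch below is an assumption-laden illustration: the `Relation` type, the two basis labels, and the two-level grading are hypothetical and simply restate the two examples given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    a: str       # identifier of the first entity
    b: str       # identifier of the second entity
    basis: str   # how the relation was inferred

# Hypothetical grading, taken from the two examples in the text:
# co-authorship recorded in metadata is strongly grounded, while
# co-occurrence in a paragraph of a historical account is weak.
STRENGTH = {
    "metadata_coauthorship": "strong",
    "same_paragraph": "weak",
}

def grade(relations):
    """Attach a grade to each relation; unknown bases stay 'ungraded'
    rather than being silently treated as evidence."""
    return {r: STRENGTH.get(r.basis, "ungraded") for r in relations}

relations = [
    Relation("#a1", "#a2", "metadata_coauthorship"),
    Relation("#a1", "#a3", "same_paragraph"),
    Relation("#a2", "#a3", "shared_place_name"),
]
for relation, strength in grade(relations).items():
    print(relation.a, relation.b, relation.basis, strength)
```

Keeping the basis on each relation means a later analysis can filter or weight by it, and an unanticipated basis is surfaced as "ungraded" instead of passing unnoticed.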

1. Bauman, Syd, David Hoover, Karina Van Dalen-Oskam, and Wendell Piez (2012). Text Analysis Meets Text Encoding. Panel session, DH2012, University of Hamburg, July 2012.

2. Flanders, Julia and Matthew L. Jockers (2013). A Matter of Scale. Keynote Lecture from the Boston Area Days of Digital Humanities Conference. Northeastern University, Boston, MA. March 18, 2013.

3. Our Cultural Commonwealth (2006). The report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences. ACLS.

4. Huitfeldt, Claus, C. M. Sperberg-McQueen, and Allen Renear (2001). Meaning and Interpretation of Markup. Markup Languages: Theory & Practice 2.3: 215-234.


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO