Developing for Distant Listening: Developing Computational Tools for Sound Analysis By Framing User Requirements within Critical Theories for Sound Studies

paper, specified "long paper"
  1. Tanya Clement

    University of Texas, Austin


The Council on Library and Information Resources (CLIR) and the Library of Congress (LoC) issued a 2010 report that suggests that if we do not use sound archives, our cultural heritage institutions will not preserve them. Nancy Davenport, former president of CLIR, concludes that users want unfettered access and better discovery tools for what she calls “deep listening” (what Charles Bernstein calls “close listening”) or “listening for content, in note, performance, mood, texture, and technology.” It is a typical digital humanities problem: without a better understanding of what such listening entails, we cannot build tools that afford such listening; and, because we lack the tools, humanists struggle to imagine how to describe the access they want, a difficulty Jerome McGann calls “imagining what you don’t know.” In an attempt to imagine how to facilitate distant listening with computation, this paper positions user requirements for critical listening software within the context of critical listening theories.
Critical Listening Theories

Walter J. Ong once announced that recording technologies have heralded a new age in the study of the “voice, muted by script and print.” Humanists hold a range of theories and perspectives on how to study music or sound aesthetics in experimental poetry or how to contextualize sounds of the recording space (such as the whir of an old air conditioner or babies crying in the background) or the recording machine (such as the clicks and pops and pauses).
In particular, theoretical perspectives on “the voice” are useful for identifying the role that computationally discoverable sonic features can play in close listening. Most such theories, for example, position sonic vocal traits as meaningful only within the context of a structural code for meaning such as language. Roland Barthes, for instance, identifies two aspects of the voice in vocal music that contribute to meaning making: the pheno-song, which refers to the structured elements of a piece such as speech or melody (or language codes), and the geno-song, which is the material or corporeal aspect of the voice, the “volume of the singing and speaking voice, the space where significations germinate” (or sonic features). Privileging the pheno-song as more productive for communicating meaning, Barthes maintains that the geno-song, having “nothing to do with communication, representation (of feelings), expression,” is a system for transmitting that meaning.
Similarly, Michel Chion asserts that sonic features have meaning but that our lack of a descriptive system or hermeneutics prevents us from making sense of these features. Chion approaches sound study by parsing listening into causal (for the source of the sound), semantic (to interpret a message), and reduced (to identify sonic traits) listening. Chion argues that reduced listening precludes meaning making for two reasons: (1) the “fixity” of sonic features required for close listening to sonic traits makes sound “physical data” that do not represent what was actually spoken or actually heard in real time “presence”; and (2) our language for describing how we make meaning with such traits is “totally inadequate.” Consequently, because of issues of fixity and inadequate identifiers, reduced listening is “an enterprise that is new, fruitful, and hardly natural.”
This argument that the voice is only meaningful in the context of speech that transmits a message, however, is a logocentric theoretical stance that has been readily contested. Adriana Cavarero, who seeks to “understand speech from the perspective of the voice instead of from the perspective of language,” wants to “pull speech itself from the deadly grip of logocentrism.” Cavarero critiques the viewpoint of scholars such as Walter Ong and Marshall McLuhan, who at once essentialize the voice as “presence” and disembody and mythicize orality. Similarly, Mladen Dolar considers a “linguistics of non-voices” including coughing, hiccups, babbling, screaming, laughing, and singing, placing these sounds outside of the phonemic structure yet not outside of the linguistic structure. He argues that “It is not that our vocabulary is scanty and its deficiency should be remedied: faced with the voice, words structurally fail.” Finding possibilities for study in aspects of the voice such as accent, intonation, and timbre, Dolar asks the question at the heart of all of these queries: “how can we pursue this dimension of the voice?”
User Needs

User perspectives on the kinds of access and analysis advanced technologies with sound can facilitate were gathered by Clement as part of the HiPSTAS (High Performance Sound Technologies for Access and Scholarship) project. HiPSTAS is an NEH-funded, year-long Institute for Advanced Topics in the Humanities for librarians, information scientists, and humanities scholars who work with spoken word collections. Such collections include PennSound’s poetry archive, the American Folklife Center of the Library of Congress, oral histories in StoryCorps, and recordings from more than 50 tribes across Native America in the American Philosophical Society’s Native American Collection among other collections of interest to the participants. User perspectives were gathered from the 20 participants in three ways: (1) through Institute applications; (2) through pre-Institute interviews; and (3) through post-Institute surveys and project reports.
This data shows that participants did not phrase their research interests in terms of defining the sonic features that map to specific cultural characteristics of “the voice” in spoken word recordings. One participant, for example, who was interested in working with the PennSound archive, wanted to consider “media ecologies” by analyzing “sounded affinities between poets” or “concepts of community poetics through sound”; this participant wanted “to look at groups of poets who have a common locale in terms of their community formation” and to use these clusterings to investigate how software “may or may not track affinities across gender lines.” Another participant analyzing PennSound was interested in “identifying, exploring, and categorizing performance variants of the same texts,” such as “an aural/visual equivalent to the Versioning Machine,” and in enabling “a kind of distant listening, flagging and visualizing generic features of poetry performance traditions (Is there, for example, a New York School style of oral delivery?).” Other concerns were focused on reorienting how the archive is discoverable by enabling a “batch analysis of an audio corpus to mark the aural/perceptual relationships of poetry performance and extra-poetic ‘asides’ (which often provide significant contextualization of the poems but which may be ‘invisible’ in the Pennsound archive).”
Another participant interested in the APS recordings wanted to think through “how a digital archive can recover intangible and ephemeral yet deeply powerful social experiences of sound,” including “[w]hat themes of identity, gendered relations, and intercultural relations, may be heard in the Native speakers’ and singers’ expressions and performances of the recorded stories and songs in the collections”; this participant wondered, “how might we thematize and index sounds to address issues of indigenous sonic embodiment in files from which we can hear but not necessarily see the speakers and singers? What are the [sonic] differences and similarities among performers of similar source material? How do these performative differences/similarities map or not map onto other factors (race, gender, region, class, age, etc.)?” Also interested in the APS Native American collections, another participant wanted to analyze these holdings in order to classify Navajo speakers against a map of origin that illustrates the location of each speaker. With the ultimate goal of “develop[ing] a cultural map to show spheres of influence of those language-speaking approaches on the stories and motifs across time and in proximity to historical centers of tribal trauma,” this participant wanted to use software “to determine whether dialectical region or if proximity to historical centers of tribal trauma (e.g. boarding school experiences or Navajo Long Walk) influence that speaker’s . . . Beauty Way and Protection Way approaches to speaking the Diné language.”

Ultimately, can a computer be taught to distinguish between paralinguistic commentaries and formal (or informal) poetry readings or a Beauty Way speaker and a Protection Way speaker within a large collection of sound files? This paper attempts to imagine these possibilities by positioning what users want to do with sound within a critical framework of listening theories that understand “the voice” as a cultural phenomenon that reflects the resonance between linguistic and non-linguistic features of sound. This paper will primarily frame user requirements gathered as part of HiPSTAS within listening theories in the humanities. However, this paper will also briefly mention possible ways forward including a use case in which PennSound poets and scholars use TEI speech tags for tagging tempo, rhythm, loudness, pitch, tension, and voice across PennSound poetry files in order to enable machine learning with ARLO (Adaptive Recognition with Layered Optimization) software, a machine learning application for analyzing sound on Stampede, an NSF petascale HPC system at the Texas Advanced Computing Center. Finally, we need to understand what users want to do with sound and the theories behind critical listening in the humanities before we can design distant listening tools that afford sound scholarship.
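As a rough illustration of the question posed above, the following minimal sketch shows one way hand-tagged prosodic features could train a simple classifier to separate poetry readings from extra-poetic asides. It is not ARLO's method, which this paper does not describe; it swaps in a plain nearest-centroid classifier. The feature values, the 0 to 1 scaling, the labels, and the function names `centroids` and `classify` are all hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical hand-tagged segments: (tempo, loudness, pitch), each scaled
# to 0..1. The numbers and labels are invented for illustration only and
# are not drawn from PennSound or from any HiPSTAS data.
TAGGED = [
    ((0.30, 0.70, 0.60), "reading"),
    ((0.35, 0.65, 0.55), "reading"),
    ((0.80, 0.40, 0.30), "aside"),
    ((0.75, 0.35, 0.35), "aside"),
]


def centroids(examples):
    """Average the feature vectors that share a label."""
    groups = defaultdict(list)
    for features, label in examples:
        groups[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in groups.items()
    }


def classify(features, model):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda label: dist(features, model[label]))


model = centroids(TAGGED)
guess = classify((0.78, 0.38, 0.32), model)  # an untagged, fast and quiet segment
```

A sketch like this only makes the dependency visible: the classifier can separate categories no better than the tagged features describe them, which is why the descriptive vocabulary debated in the listening theories above matters for tool design.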

Council on Library and Information Resources and the Library of Congress, The State of Recorded Sound Preservation in the United States: A National Legacy at Risk in the Digital Age. Washington DC: National Recording Preservation Board of the Library of Congress, 2010, 157.
Jerome McGann (2008), Radiant Textuality: Literature After the World Wide Web (New York: Palgrave).
Walter J. Ong (1967), The Presence of the Word: Some Prolegomena for Cultural and Religious History (New Haven: Yale University Press), 88.
Marjorie Perloff and Craig Douglas Dworkin, eds. (2009), The Sound of Poetry, the Poetry of Sound (Chicago: University of Chicago Press); Adalaide Morris, ed. (1998), Sound States: Innovative Poetics and Acoustical Technologies (Chapel Hill, NC: The University of North Carolina Press); Jonathan Sterne, ed. (2012), The Sound Studies Reader (New York: Routledge).
Roland Barthes (1978), Image-Music-Text. (New York: Hill and Wang), 182.
Barthes, Image, 182.
This seems related to Roland Barthes’s three distinct types of listening in his essay “Listening”: the first represents a listener on “alert” as prey or predator, as mother or child, as lover on the lookout; the second represents “deciphering,” in which “what the ear tries to intercept are certain signs”; the third is the “intersubjective” listening of the psychoanalyst. Roland Barthes, The Responsibility of Form, trans. Richard Howard (New York: Hill and Wang, 1985), 245; Michel Chion, “Three Listening Modes,” in The Sound Studies Reader, ed. Jonathan Sterne, 48-53 (New York: Routledge, 2012), 50-51.
Chion, “Three Listening Modes,” 51.
Chion, “Three Listening Modes,” 50; emphasis added.
Adriana Cavarero (2012), “Multiple Voices,” in The Sound Studies Reader, ed. Jonathan Sterne, 520-532 (New York: Routledge), 530, 531.
Mladen Dolar (2012), “The Linguistics of the Voice,” in The Sound Studies Reader, ed. Jonathan Sterne, 539-554 (New York: Routledge), 552.
Dolar, “Linguistics,” 539.
Dolar, “Linguistics,” 544.
Please see the project site at
Please see the participant list and links to their project interests at
Project reports are publicly available at

Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO