Small Data projects/Big Data research: contemporary problems and historical solutions

paper, specified "long paper"
  1. Daniel O'Donnell

    University of Lethbridge

  2. Nathan Woods

    University of Lethbridge

  3. Barbara Bordalejo

    University of Lethbridge

Work text

Humanist resistance to understanding the material they work with as “data” is well documented (e.g. Marche, 2012; Fish, 2012). As O’Donnell has argued:

In other domains, data are generated through experiment, observation, and measurement. Darwin goes to the Galapagos Islands, observes the finches, and fills notebooks with what he sees. His notes (i.e. his “data”)... are “the facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.” Given the extent to which they are generated, it has been argued that they might be described better as capta, “taken,” than data, “given.”
The material of humanities research traditionally is much more datum than captum, finch than note…. [S]uch material... is often unique and its interpretation is usually provisional, depending on broader understandings of purpose, context and form that are themselves open to analysis, argument and modification. In the humanities, we more often end up debating why we think something is a finch than [listing] what we can conclude from observing it (O’Donnell, 2016).
Scale is an important result of this distinction. While experimental and observational approaches to data generation can produce immense datasets, the more dialogic approach taken traditionally by Humanities and Cultural Heritage (HCH) researchers often leads to the development of relatively small, closely analysed datasets or even datapoints — the edition of a single novel or shipping register; a collection of comics; the oeuvre of a single artist or school (Borgman, 2015; Borgman, 2007; Golub and Liu, 2021). 
This difference often results in a mismatch between research infrastructures and the needs of many small-data HCH researchers and users. The evolution of Open Science/Scholarship Infrastructure (OSI) offers a case in point. OSI typically assumes a research workflow and understanding of the purpose and nature of data in which data is clearly distinct from analysis and, traditionally, represents the raw material of research rather than a research output in its own right (Flanders, 2009; Jockers and Flanders, 2013). OSI, in this use case, provides a forum for the registration and publication of what was in many cases previously unpublished and considered unpublishable (Gray et al., 2002; Kratz and Strasser, 2015).
This is different from the typical understanding of the role of “data” (whether recognised as such or not by the researchers) in traditional small-data HCH editions and exhibits. In these projects “data,” which in this case we are defining as the mediated and curated representation of primary texts and objects intended to be used by others as a proxy for access to the originals, is understood to be a principal research output in its own right. In contrast to the workflow contemplated by contemporary OSI, in which data precedes and is published separately from analysis, in this workflow data is commonly given pride of place: published with the accompanying analysis and intended to be used directly by the end-user. In these cases, “analysis” is, if anything, often treated almost as a form of metadata rather than a distinct set of results derived from an underlying dataset. 
This paper explores the question of Research Data Management (RDM) in this context: for small-data projects involving the representation of HCH texts and objects. Our primary focus is the understanding of what we will call “data” (regardless of the views of the researchers themselves about this terminology) inherent in such projects’ research and publication workflows (O’Donnell, 2018). We will expand on the distinction drawn above between “capta” and “data” (recognising at the same time the instability of each term’s valency, cf. O’Donnell, 2016 and Drucker, 2011) and focus on how such “representational data” is captured, reproduced, and used in traditional HCH “primary-source” research workflows and use cases such as editions, catalogues, and exhibits.
The paper contrasts this analysis of HCH small-data practices and use cases with the understanding of data and research/RDM workflows implied or described by various specific examples of OSI (e.g. Zenodo, Figshare, Open Science Framework, Humanities Commons). We use this contrast to identify ways in which OSI can be adapted to support such small-data work and make it available for “big data” research (for an example of this adaptation within the sciences themselves, see Ferguson et al., 2014; Seltmann et al., 2013; Biodiversity Literature Repository, 2013; Cui et al., 2010).  
This is not the first time HCH researchers have struggled with this problem. An important part of the paper is an exploration of how such adaptations have been made in the past. While the contemporary conversation around big data has focused on features such as quantification, automation, standard graphical representation of data patterns, and the compilation of large datasets, we build upon recent scholarship in the history of science that reframes the pre-history of this current conversation to consider alternate precedents (Aronova et al., 2010).
Our focus in that case is on HCH projects, such as the Oxford English Dictionary (OED) and the Corpus Inscriptionum Latinarum (Daston, 2017), that built big data projects from distributed data networks. As the architects of these projects demonstrated in the 19th century, it is possible, often with considerable effort, to reuse such work for big-data ends. The OED, after all, is a ‘big data’ project built on the basis of a collection (i.e. a ‘dataset’) of 1.8 million quotations collected from thousands of books (i.e. ‘small-data’ projects; see Oxford English Dictionary; Trench, 1860). These data networks compiled individual quotations and inscriptions as published compendia or editions.
By exploring these big data precedents in the history of the humanities, our paper contributes in critical and practical ways to the contemporary and ongoing discussion on the organisation of data projects in Digital Humanities (Antonijević, 2015; Antonijević Ubois, 2016; Borgman, 2015; Posner, 2013). While recent scholarship has focused on the work of organising big data Digital Humanities projects, we argue this conversation has often conflated issues of RDM with issues of digital practices in knowledge production. In our concluding discussion, we explore how these earlier examples of data management might enrich contemporary discussion, particularly around the development of scholarly tools and research data management infrastructure. In reframing how digital practice and data might be related and combined, we reconsider in practical and historical terms how broader genealogies suggest alternative portraits of how humanities data is compiled, organised, used and shared. 


Antonijević, S. (2015). Amongst Digital Humanists: An Ethnographic Study of Digital Knowledge Production. First published. Basingstoke New York, NY: Palgrave Macmillan doi:10.1057/9781137484185.

Antonijević Ubois, S. (2016). Developing Research Tools via Voices from the Field Text World. (accessed 20 April 2022).

Aronova, E., Baker, K. S. and Oreskes, N. (2010). Big Science and Big Data in Biology: From the International Geophysical Year through the International Biological Program to the Long-Term Ecological Research (LTER) Network, 1957–Present. Historical Studies in the Natural Sciences: 183–224 doi:10.1525/hsns.2010.40.2.183.

Biodiversity Literature Repository (2013). Biodiversity Literature Repository [Project]

Borgman, C. L. (2007). Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, Mass: MIT Press.

Borgman, C. L. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press.

Cui, H., Jiang, K. (Yang) and Sanyal, P. P. (2010). From text to RDF triple store: An application for biodiversity literature. Proceedings of the American Society for Information Science and Technology (1): 1–2 doi:10.1002/meet.14504701415.

Daston, L. (2017). The Immortal Archive: Nineteenth-Century Science Imagines the Future. Science in the Archives: Pasts, Presents, Futures. London and Chicago: UChicago Press.

Drucker, J. (2011). Humanities Approaches to Graphical Display. Digital Humanities Quarterly

Ferguson, A. R., Nielson, J. L., Cragin, M. H., Bandrowski, A. E. and Martone, M. E. (2014). Big data from small data: data-sharing in the ‘long tail’ of neuroscience. Nature Neuroscience (11). Nature Research: 1442–47 doi:10.1038/nn.3838.

Fish, S. (2012). Mind Your P’s and B’s: The Digital Humanities and Interpretation. Opinionator (accessed 30 March 2013).

Flanders, J. (2009). The Productive Unease of 21st-century Digital Scholarship. (3) (accessed 12 May 2013).

Golub, K. and Liu, Y.-H. (2021). Information and Knowledge Organisation in Digital Humanities: Global Perspectives. 1st ed. London: Routledge doi:10.4324/9781003131816. (accessed 17 January 2022).

Gray, J., Szalay, A. S., Thakar, A. R., Stoughton, C., and others (2002). Online scientific data curation, publication, and archiving. Virtual Observatories, vol. 4846. International Society for Optics and Photonics, pp. 103–07.

Jockers, M. and Flanders, J. (2013). A Matter of Scale. Faculty Publications – Department of English

Kratz, J. E. and Strasser, C. (2015). Researcher Perspectives on Publication and Peer Review of Data. (2). Public Library of Science: e0117619 doi:10.1371/journal.pone.0117619.

Marche, S. (2012). Literature Is Not Data: Against Digital Humanities. Los Angeles Review of Books

O’Donnell, D. P. (2016). The bird in hand: Humanities research in the age of Open Data. In Figshare (ed), The State of Open Data Report. Digital Science, pp. 38–39 doi:10.6084/m9.figshare.4036398.v1.

O’Donnell, D. P. (2018). Humanities Data and their Research Use. Paper presented at the Open Science Infrastructures for Big Cultural Data, International Masterclass, Plovdiv, Bulgaria (accessed 14 January 2021).

Oxford English Dictionary. History of the OED. Oxford English Dictionary

Posner, M. (2013). No Half Measures: Overcoming Common Challenges to Doing Digital Humanities in the Library. Journal of Library Administration (1): 43–52 doi:10.1080/01930826.2013.756694.

Seltmann, K. C., Pénzes, Z., Yoder, M. J., Bertone, M. A. and Deans, A. R. (2013). Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. (Ed.) Moreau, C. S. (2): e55674 doi:10.1371/journal.pone.0055674.

Trench, R. C. (1860). On Some Deficiencies in Our English Dictionaries: Being the Substance of Two Papers Read Before the Philological Society, Nov. 5, and Nov. 19, 1857. Oxford English Dictionary. J. W. Parker and Son

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO