Research Center as Distant Publisher: Developing Non-Consumptive Compliant Open Data Worksets to Support New Modes of Inquiry

paper, specified "short paper"
  1. Robert McDonald

    Libraries - Indiana University, Bloomington

Work text


In the original Google Books Settlement Agreement in 2008 (Courant 2009), funds were to be set aside to create a research center that would enable researchers worldwide to perform data mining and analysis on texts, both in the public domain and under copyright, in a manner that was secure and compliant with U.S. copyright law. That center was never created under the settlement, because the court rejected the agreement in 2011. Nevertheless, in 2011 the HathiTrust Digital Library (HTDL) announced that Indiana University Bloomington and the University of Illinois at Urbana-Champaign would run the HathiTrust Research Center (HTRC) under a cooperative funding agreement with the HathiTrust Board of Governors and the University of Michigan. Since 2014, the HTRC has offered, as an active production service, tools to analyze a set of out-of-copyright content of around 4.4 million volumes. In 2016, the HTRC plans to enable analysis of the entire 15-million-volume corpus currently held by the HTDL, the largest digital academic library in North America.

HTRC and Non-Consumptive Research

The HTRC has developed a process to define and work within the concept of non-consumptive computational access in order to support fair use of the HTDL corpus, as defined within the Google Books Settlement Agreement that was part of the Authors Guild et al. v. Google Inc. case.

Currently the HTRC defines the process for non-consumptive use of the HTDL corpus as:

Research in which computational analysis is performed on one or more books, but not research in which a researcher reads or displays.

Operationally, from the perspective of the HTRC research cyberinfrastructure, the HTRC defines non-consumptive research as:

That which requires that no action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions can result in sufficient information gathered from a collection of copyrighted works to reassemble pages from the collection.

This concept has been further refined in the course of the development of the HTRC Data Capsule (Zeng et al. 2014) for secure data analysis and the development of the HTRC Workset Ontology (Jett et al. 2016), and has been codified in the recently released HathiTrust Research Center Non-Consumptive Use Research Policy (HathiTrust Digital Library 2016).
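As an illustration of the operational definition above, the following minimal Python sketch reduces a page to unordered token counts. It is a toy under stated assumptions, not the HTRC's implementation: the function name `page_features` and the simple tokenizer are hypothetical. It shows that two pages containing the same words in a different order yield identical features, so the counts themselves do not carry the word order needed to reassemble a page.

```python
import re
from collections import Counter


def page_features(page_text: str) -> dict:
    """Reduce one page to unordered token counts (a bag of words).

    Hypothetical helper for illustration only; not an HTRC API.
    """
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(tokens)
    # Sort keys alphabetically so the output keeps no trace of reading order.
    return dict(sorted(counts.items()))


page_a = "the cat sat on the mat"
page_b = "on the mat the cat sat"

features_a = page_features(page_a)
features_b = page_features(page_b)

# Different word order, identical features.
print(features_a == features_b)
```

Discarding word order is only one ingredient; a sketch like this does not by itself establish that a data set is non-consumptive.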

HTRC as Publisher

During the course of work with scholars using the HTRC tools and services to create derivative non-consumptive data sets, the Center has often taken on a set of the roles traditionally played by publishers. These data sets are reviewed by members of the HTRC staff for compliance with non-consumptive use standards prior to release to the authors.

As part of this work, the HTRC has offered as a service the capability to publish these non-consumptive, compliant data sets using a DOI scheme (Downie et al. 2015). This service enables the creation of new derivatives (Downie et al. 2015) of published non-consumptive compliant data sets.

A second benefit of opening access to these data sets is the ability to replicate current experiments that have been developed using the HTDL corpus and the HTRC tool set. From this standpoint the HTRC functions as a distant publisher of non-consumptive compliant data sets in support of new models of research inquiry.

Distant Publishing as Concept

Prior to defining the concept of distant publishing, it is first instructive to understand distant reading within the context of digital humanities. Distant reading was first codified in 2000 by noted humanist and scholar Franco Moretti:

Distant reading: where distance . . . is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes - or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. (Moretti 2000)

Moretti later expanded the concept in his 2013 monograph of the same name (Moretti 2013). Much like Moretti's definition, which focuses on enabling a broader view of the text, the distant publisher enables a broader view of data sets by bringing to bear the current corpus of computational tools for large-scale textual data mining and analysis. HTRC as a distant publisher is removed by at least one degree from the creator, and remains distinct from any standardized concept of publisher. Yet data sets are published under the rubric of the HTRC, and these publications are freed from the constraints of copyright in this context due to their non-consumptive nature. Thus we define distant publishing as:

Publication of a non-consumptive data set outside of any standardized publishing construct, removed by x degrees from the original creator, openly available to the community of scholars for replication and available for re-use in support of the advancement of knowledge.

This definition is one that the HTRC aims to further refine in the coming years. We welcome broader thoughts on this concept from those working to preserve open research data and the software that makes that data accessible for use in scientific experimental replication and re-use for the long-term benefit of the scholarly community.

Distant Publishing Use Cases

The HTRC is developing models that support our current definition of distant publishing. Several use cases, outlined below, illustrate these models.

• Extracted Features Worksets - HTRC expects this concept to be further refined as we move toward the second round of HTRC Advanced Collaborative Support grants, which will be funded in summer 2016. Our most advanced case for distant publishing at this point is the publication and release of our main extracted features workset. The current workset is a prototype based on the 4.8-million-volume public domain collection from the HTDL. Through 2016-17 this workset will be redefined to include more of the HTDL collection. Building on this initial workset publication, scholars such as Ted Underwood (Underwood et al. 2013), Colin Allen (Murdock, Zeng, and Allen 2016), and Matthew Wilkens (Wilkens 2013) have further refined the workset.

• HT+Bookworm - The HathiTrust+Bookworm (HT+BW) project (Organisciak et al. 2016) presents textual content through interactive visualization. Whereas HT+BW has previously been used in standalone contexts with pre-determined metadata, it now enables scholars to analyze custom personal collections from within the larger corpus and to use HT+BW as a supplement to other uses of the HTRC. This concept could eventually become a new possibility for derived workset publication in its own right.

• HTRC Workset Ontology - Currently in development, the HTRC Workset Ontology is part of a collections data model developed by the Workset Creation for Scholarly Analysis project (HTRC 2016), an HTRC research initiative funded by the Andrew W. Mellon Foundation. The resulting HTRC Workset data model is designed to help humanities scholars describe the selected portions of the HTDL corpus that serve as the objects of their research. The resulting worksets are persistent, citable, and can be assessed by other scholars for reuse in additional research processes.
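The idea of a persistent, citable, and reusable workset can be sketched as a small data structure. The Python below is a hypothetical illustration only: the class `Workset`, its field names, and the example identifiers are assumptions and do not reproduce the HTRC Workset Ontology vocabulary. It also assumes one possible reading of "degrees" of removal, namely the number of derivation steps between a workset and the workset it ultimately descends from.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Workset:
    """Hypothetical minimal workset description (not the HTRC ontology)."""

    identifier: str              # persistent identifier; placeholder values below
    title: str
    creator: str
    volume_ids: Tuple[str, ...]  # identifiers of the volumes in the workset
    derived_from: Optional["Workset"] = None  # parent workset, if any

    def degrees_from_source(self) -> int:
        """Count derivation steps back to the root workset (assumed meaning)."""
        degrees = 0
        current = self.derived_from
        while current is not None:
            degrees += 1
            current = current.derived_from
        return degrees

    def cite(self) -> str:
        """Build a simple citation string from the description."""
        return f'{self.creator}. "{self.title}". {self.identifier}'


# Placeholder data, invented for the example.
base = Workset(
    identifier="example:base",
    title="Example base workset",
    creator="Example Center",
    volume_ids=("vol-1", "vol-2", "vol-3"),
)
subset = Workset(
    identifier="example:subset",
    title="Example subset",
    creator="A. Scholar",
    volume_ids=("vol-1", "vol-3"),
    derived_from=base,
)

print(subset.cite())
print(subset.degrees_from_source())
```

Making the record immutable (`frozen=True`) mirrors the requirement that a published workset stay stable once it is cited; a refinement is expressed as a new workset that points back to its parent rather than as a change to the original.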


Today's digital scholars are embracing new opportunities to explore their disciplines through the type of enhanced computational analysis that the HTRC provides. As the Center works to define emerging possibilities within the context of non-consumptive research, distant publishing will enable us to engage with the community of open data and open software publishers to ensure that our collections are accessible, open and available for the next generation of distant readers and their plans for new forms of scholarship.


The author would like to thank the Executive Management Team of the HTRC, J. Stephen Downie (Co-Director), Beth A. Plale (Co-Director), Beth Namachchivaya, and John M. Unsworth, and all of the staff of the HathiTrust Research Center and the HathiTrust Digital Library for their contributions to the tools and services that make the concepts in this paper and the research of our users possible.


This article is licensed under CC BY 4.0.


Courant, P. N. (2009). “The Stakes in the Google Book Search Settlement”. The Economists' Voice 6 (9). Walter de Gruyter GmbH. doi:10.2202/1553-3832.1665.

Zeng, J., Ruan, G., Crowell, A., Prakash, A., and Plale, B. (2014). “Cloud Computing Data Capsules for Non-Consumptive Use of Texts”. In Proceedings of the 5th ACM Workshop on Scientific Cloud Computing - ScienceCloud '14. Association for Computing Machinery (ACM). doi:10.1145/2608029.2608031.

Jett, J., Cole, T. W., Maden, C., and Downie, J. S. (2016). “The HathiTrust Research Center Workset Ontology: A Descriptive Framework for Non-Consumptive Research Collections”. Journal of Open Humanities Data 2 (March). Ubiquity Press Ltd. doi:10.5334/johd.3.

HathiTrust Digital Library. (2016). “HathiTrust Research Center Non-Consumptive Use Research Policy.”

Downie, J. S., Capitanu, B., Underwood, T., Organisciak, P., Bhattacharyya, S., Auvil, L., Fallaw, C. (2015). “Extracted Feature Dataset from 4.8 Million HathiTrust Digital Library Public Domain Volumes”. HathiTrust Research Center. doi:10.13012/j8td9v7m.

Downie, J. S., Underwood, T., Capitanu, B., Organisciak, P., Bhattacharyya, S., Auvil, L., Fallaw, C. (2015). “Word Frequencies in English-Language Literature 1700-1922 (0.2)”. HathiTrust Research Center. doi:10.13012/J8JW8BSJ.

Moretti, F. (2000). “Conjectures on World Literature”. New Left Review 1 (January): 57-58.

Moretti, F. (2013). Distant Reading. Verso.

Underwood, T., Black, M. L., Auvil, L., and Capitanu, B. (2013). “Mapping Mutable Genres in Structurally Complex Volumes”. In 2013 IEEE International Conference on Big Data. Institute of Electrical & Electronics Engineers (IEEE). doi:10.1109/bigdata.2013.6691676.

Murdock, J., Zeng, J., and Allen, C. (2016). “Towards Cultural-Scale Models of Full Text”.

Wilkens, M. (2013). “Literary Geography at Corpus Scale”. In Proceedings of Digital Humanities 2013. Alliance of Digital Humanities Organizations.

Organisciak, P., Bhattacharyya, S., Auvil, L., Unnikrishnan, L., Schmidt, B., Shamim, M., McDonald, R., Downie, J., Aiden, E. (2016). “Adding Flexibility to Large-Scale Text Visualization with HathiTrust+Bookworm”. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Krakow, pp. 854-856.

HTRC. (2016). “Workset Creation for Scholarly Analysis - A HathiTrust Research Center Project Funded by the Andrew W. Mellon Foundation.” http: //


Conference Info


ADHO - 2017

Hosted at McGill University, Université de Montréal

Montréal, Canada

Aug. 8, 2017 - Aug. 11, 2017

438 works by 962 authors indexed

Series: ADHO (12)

Organizers: ADHO