From a Reference Book to Research Data: Literary Bibliographies as Sources for the Data-driven Research

paper, specified "long paper"
  1. Vojtěch Malínek

    Institute of Czech Literature - Czech Academy of Sciences

  2. Tomasz Umerle

    Institute of Literary Research - Polish Academy of Sciences

  3. Piotr Wciślik

    Institute of Literary Research - Polish Academy of Sciences


Rise in data-driven studies
Data-driven research on bibliographical data in the field of literary studies has been on the rise since the beginning of the 21st century, with researchers such as Franco Moretti (Moretti, 2005; 2013), Matthew Jockers (Jockers, 2013), Katherine Bode (Bode, 2012; 2017), and Hoyt Long and Richard So (Long and So, 2013) leading the way in empirical and theoretical research and discussion.
Notable and established methods of data-driven research – like “distant reading” (Moretti), or “macroanalysis” (Jockers) – have gained considerable recognition in the scholarly field. However, recently the quality and representativeness of datasets for performing such research have come under scrutiny (Bode, 2017). Bode calls for data-driven research that is based on datasets that are richer, better documented, curated, and more systematic in their scope.
While Bode addresses the scholarly community in the first place, this paper, written from the perspective of data producers, recognises the challenges facing existing literary bibliographic resources in the age of data-driven research. Drawing on experience with two large bibliographic databases (Czech Literary Bibliography [CLB] and Polish Literary Bibliography [PBL]), this paper assesses how literary bibliographies adapt to the need for advanced data uses, and how the application of data-driven methods in literary research is revolutionising the way bibliographies are prepared, standardised and published. In conclusion, the paper identifies the gap between data studies and data production, which should be bridged through interdisciplinary cooperation within digital humanities.

Literary bibliographies: from auxiliary subdiscipline to producers of research data
Literary bibliography (subject bibliography of literary matters) has traditionally been an auxiliary subdiscipline in the humanities, and its research goals were relatively limited. The place of bibliography within the larger scientific context, and the very meaning of the term, has shifted in recent decades. Until the mid-20th century that meaning was broad and covered a variety of “problems of organizing and selecting documents” (Buckland, 2011: 36), while more recently these problems have been attributed mainly to disciplines connected to library and information sciences (LIS), and the term “bibliography” has narrowed in scope to the “detailed examination of printed books as physical objects” (ibid.).
As a result of this shift, bibliographers came to play a rather passive role in developing the research potential of the data gathered by subject bibliographies in the humanities. Such development could only be driven by the expectations of researchers in the field, and these, until recently, have been rather modest, rarely demanding more than the provision of references.
The rise of data-driven literary research has had a far-reaching impact on literary bibliographies, redefining the role of bibliographers as data specialists, their core operations, the setup of bibliographical departments, and the modes of gathering and publishing metadata.
The way data-driven research uses bibliographic data amounts, in fact, to a redefinition of the bibliographic record itself. While its main purpose used to be to provide reference, i.e. an accurate representation of the described document (book, article, etc.), this aura of “raw data” (Gitelman, 2013) has now dissipated in favor of a more nuanced understanding, which presents the bibliographic record as already a statement of both cultural meaning and certain research assumptions that need to be examined in their own right. This is especially relevant in the case of literary bibliographies. As texts constitute the quintessential primary source of literary studies, the advances of data-driven research in this field depend on an acute awareness of the possible implications of the decisions, methods, and tools used by literary bibliographers. Conversely, they depend on whether literary bibliography can open up to and assimilate the insights coming from, e.g., the sociological and cultural developments within historical bibliography (McKenzie, 1986; Gupta, 2015), or the new wave of documentation studies (Lund, 2009; Buckland, 2015), which examines, e.g., theoretical issues of “documenting” and its implications in contemporary culture (Day, 2014; Ferraris, 2012).

CLB and PBL as scholarly data infrastructures
CLB and PBL are examples of scholarly infrastructures that gather rich and large datasets that aim at reconstructing the “literary system” in the most comprehensive way possible (Bode, 2017: 88). They are unique resources of open bibliographic metadata, covering respectively 250 and 70 years, each containing approx. 0.7 million records in database form and approx. 1.5 million records in the form of scanned pages/cards of bibliographic information. The departments responsible for their creation have been running continuously for 70 years, with teams of approx. 15 bibliographers working full-time. In recent decades they have transformed their bibliographic output from the original printed form (books) into database versions with proprietary systems and public interfaces that present large, deeply structured datasets.
The information they gather represents an augmented perspective on the complexities of literary culture as a whole. It deals with literary, artistic and scientific publications (books, journals, plays), their reception, literary events (competitions, prizes), and the organization of literary scientific life. Special attention is paid to the actors of literary life (authors, publishers, journalists, scholars etc.; with emphasis on the attribution of authorship, codenames, pseudonyms etc.), and to cultural and scientific institutions and their activities.

For each year CLB and PBL process thousands of literary books, and hundreds of journals in order to register information on a broad range of literary subjects outlined above.
The idea behind literary bibliographies is that of recreating the complexities of the literary system through bibliographic metadata:

- they aim for completeness of information, processing relevant sources (nearly) in their totality – books, parts of books, proceedings, articles from journals and newspapers, in print and electronic (CLB) form;
- they deal with the broad cultural environment: they register daily journals and the presence of literary issues on radio, TV and the Internet (CLB);
- they aim for methodological stability through decades of processing sources of information, and they rely on stable teams of bibliographers (specialists in literary studies);
- they use the same infrastructure for a broad range of topics – information about literary figures, works, organisations, events etc. is interconnected;
- they are long-running projects with a stable financial foundation.

The “missing link” between data producers and data researchers
While certain researchers – as Bode argues – fail to recognize or describe the limitations of the datasets they work on, a more thorough and sustained discussion between data researchers and data producers could lead to both better research and better preparation of bibliographic data.

Cooperation in bridging empirical data studies in literary research and regular data production will lead to a better understanding of how certain literary issues (sociological, theoretical, thematic, etc.) can and should be represented in the form of bibliographic datasets, not only from a technological point of view (which is less of an issue), but from a literary point of view.

The paper will expand on the complexities of five crucial issues arising from the CLB and PBL databases that should be the subject of such analysis:
1) the very basic yet underappreciated issue of the readiness of bibliographic databases for quantitative analysis and statistical representation; bibliographies have not traditionally taken this kind of use of their data into account (very minor or relatively unimportant information can produce many records); researchers can take this into account, but the issue is so prevalent that it needs to be addressed;
2) balancing the stability of metadata (especially subject headings) with sensitivity to changes in terminology; in large databases subject headings can be changed without any “memory” of the previous ones, which is sometimes needed – in the case of mistakes or inherent prejudices (Knowlton, 2005) – but can sometimes induce ahistoricity;
3) how to account for incompleteness and heterogeneity in the data and how to assess their impact on data research; not all incomplete information is retrospectively traceable, as literary bibliographic databases have not prioritised possible statistical uses; for example, for different years there might be journals that have not been processed for very practical reasons (lack of access to the issue, etc.), or materials might have been processed using different methods (from autopsy, from data extraction etc.);
4) the introduction of “new” or “atypical” forms of documents into literary bibliographic databases – such as Internet documents, grey literature, samizdat, performance, stand-up, video games etc. – with an awareness of the limited workforce available (issues of prioritizing);
5) the relative lack of subject analysis of artistic texts in bibliographical databases: How should it be accounted for? Can it be supplemented with the results of textual studies? How big an issue is it for the development of further research?
Issues like these cannot be solved by data producers alone, but rather through the cooperation of data producers and users, especially researchers. Broader cooperation between both sides can significantly expand the possibilities for data-driven research (not only) in the field of literary studies.
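To make the first and third issues more concrete, the following is a minimal sketch of how a researcher might correct raw record counts for uneven source coverage, assuming the bibliography documented how many journals were expected and how many were actually processed for each year. The `YearCoverage` structure, its field names, the 0.9 threshold and all figures are hypothetical illustrations and do not reflect the actual CLB or PBL data models.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class YearCoverage:
    """Hypothetical per-year coverage documentation for a bibliography."""
    year: int
    records: int             # bibliographic records registered for the year
    journals_expected: int   # journals that fall within the bibliography's scope
    journals_processed: int  # journals actually processed by bibliographers


def coverage_ratio(c: YearCoverage) -> Optional[float]:
    """Share of in-scope journals that were processed (None if undefined)."""
    if c.journals_expected == 0:
        return None
    return c.journals_processed / c.journals_expected


def records_per_journal(c: YearCoverage) -> Optional[float]:
    """Record count normalised by the number of processed journals."""
    if c.journals_processed == 0:
        return None
    return c.records / c.journals_processed


def flag_low_coverage(rows: List[YearCoverage], threshold: float = 0.9) -> List[int]:
    """Return the years whose coverage ratio falls below the threshold."""
    flagged = []
    for c in rows:
        ratio = coverage_ratio(c)
        if ratio is not None and ratio < threshold:
            flagged.append(c.year)
    return flagged


# Invented figures, for illustration only.
rows = [
    YearCoverage(year=1950, records=1200, journals_expected=100, journals_processed=100),
    YearCoverage(year=1951, records=900, journals_expected=100, journals_processed=60),
]

for c in rows:
    print(c.year, c.records, coverage_ratio(c), records_per_journal(c))
print("Low-coverage years:", flag_low_coverage(rows))
```

In this invented example the raw count drops from 1200 to 900 records, which could be read as a decline in literary activity, while the number of records per processed journal actually rises from 12 to 15. Such a correction is only possible where data producers record the gaps in processing, which is the kind of documentation the cooperation described above would need to establish.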
Activities of Czech Literary Bibliography Research Infrastructure are supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project Code LM2015059).


Bode, K. (2012). Reading by numbers. Recalibrating the literary field. London, New York: Anthem Press.

Bode, K. (2017). The Equivalence of “Close” And “Distant” Reading. Or, toward a New Object for Data-Rich Literary History. Modern Language Quarterly, 78(1): 77–106.

Buckland, M. (2011). Data management as bibliography. Bulletin of the American Society for Information Science and Technology, 37(6): 34–37.

Buckland, M. (2015). Document Theory: An Introduction. In Willer, M., Gilliland, A. J., Tomić, M. (eds), Records, Archives and Memory: Selected Papers from the Conference and School on Records, Archives and Memory Studies, University of Zadar, Croatia, May 2013. Zadar: University of Zadar, pp. 223–37.

Day, R. E. (2014). Indexing It All: The Subject in the Age of Documentation, Information, and Data. Cambridge, London: MIT Press.

Ferraris, M. (2012). Documentality: Why it is Necessary to Leave Traces. New York: Fordham University Press.

Gitelman, L. (ed) (2013). “Raw Data” is an oxymoron. Cambridge, London: MIT Press.

Gupta, S. (2015). Consumable Texts in Contemporary India. Uncultured Books and Bibliographical Sociology. Basingstoke: Palgrave Macmillan.

Jockers, M. (2013). Macroanalysis. Digital Methods & Literary History. Chicago: University of Illinois Press.

Knowlton, S. A. (2005). Three Decades Since Prejudices and Antipathies: A Study of Changes in the Library of Congress Subject Headings. Cataloging and Classification Quarterly, 40(2): 123–45.

Long, H. and So, R. (2013). Network Science and Literary History. Leonardo, 46(3): 274.

Lund, N. W. (2009). Document theory. Annual Review of Information Science and Technology, 43(1): 1–55.

Maryl, M. and Wciślik, P. (2016). Remediations of Polish Literary Bibliography: Towards a Lossless and Sustainable Retro-Conversion Model for Bibliographical Data. In Digital humanities 2016: Conference Abstracts. Kraków: Jagiellonian University & Pedagogical University, pp. 621–23.

McKenzie, D. F. (1986). Bibliography and the sociology of texts. London: The British Library.

Moretti, F. (2005). Graphs, maps, trees. Abstract Models for Literary History. London: Verso.

Moretti, F. (2013). Distant Reading. London: Verso.
