Uncovering the Black Fantastic: Piloting Text Similarity Methods for Finding “Lost” Genre Fiction in HathiTrust (Poster)

Nikolaus Nova Parulian; Ryan Dubnicek; Glen Layne-Worthey; Seretha Williams; Clarissa West-White; Isabella Magni; J. Stephen Downie

Authorship

1. Nikolaus Nova Parulian

HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA
2. Ryan Dubnicek

HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA
3. Glen Layne-Worthey

HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA
4. Seretha Williams

Augusta University
5. Clarissa West-White

Carl S. Swisher Memorial Library, Bethune-Cookman University, Daytona Beach, FL USA
6. Isabella Magni

HathiTrust Research Center, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN USA
7. J. Stephen Downie

HathiTrust Research Center, iSchool, University of Illinois, Champaign, IL USA

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Background
Within the tradition of speculative fiction, many sub-genres of literature engage with world-building and speculation about the future of humanity, civilization, and our institutions. While the term “Afrofuturism” is commonly used to describe the ways in which artists and creators of the African Diaspora engage with the intersections of race and technology in their works, the less contested term "Black Fantastic" more accurately reflects transcultural iterations of world-building (Iton, 2008). While there is an active scholarly community studying the Black Fantastic (Third Stone, 2021), the use of computational methods in this study is inhibited by a lack of digital collections of Black Fantastic literature. Though the rise of digital libraries has led to new studies and insights into forms and histories of literary genres (Schöch, 2017; Underwood, 2016; Gittel, 2021; Wilkens, 2016), before genre-specific texts can be studied algorithmically, they must first be identified amidst the massive holdings of digital libraries, such as those of the HathiTrust. This project outlines efforts to use text similarity methods to uncover Black Fantastic texts that may be hidden thanks to incomplete metadata or to cataloging practices not conducive to fine-grained genre identification.

Methods & Analysis
Catalog searches in the HathiTrust Digital Library (HTDL) of previously-identified Black Fantastic titles revealed a preliminary list of 13 unique titles. Because metadata records for these volumes do not include “black fantastic” or “afrofuturism,” common library catalog tools would not uncover these known items nor other potential Black Fantastic works. This challenge presented an opportunity to deploy technical methods, specifically text-similarity clustering, to reveal volumes of interest potentially hidden within the HTDL. We started by collecting samples of Black-authored fiction and manually labeling those identified as belonging to the Black Fantastic genre, then analyzed the lexical features that best distinguish these from both general fiction and Black-authored non-Fantastic fiction.
We randomly sampled 100 volumes each of Black-authored and general fiction, then used HTRC Extracted Features data to aggregate word usage in each volume. Our first attempt at distinguishing Black Fantastic texts necessarily relied on a very limited, 13-volume sample set. We aggregated frequently-used words most closely associated with each genre and generated a Latent Dirichlet Allocation (LDA) topic model (Blei, et al., 2003; Rehurek and Sojka, 2011) to illustrate the different sets of characteristic words for each sub-genre. Figure 1 shows these words for each text class.

(a) General fiction (b) Black-authored fiction

(c) Black Fantastic fiction
Figure 1. Topic model representation of distinctive words in each sub-genre of the sample. We categorized each genre with a 4-topic model. Words that most distinguish topics from the general fiction category include “duke”, “car”, “dollars”, “president”, and others, whereas topics from Black-authored fiction rely most on words such as “preacher”, “church”, religious”, “worship”, “encampments”. Moreover, with the addition of words distinct to Black-authored fiction, the Black Fantastic topical signal relies more on words such as “politer”, “bodkin”, “sundered”, “libels”, “amble”, “oracular”, etc.
Next, we use distinct word sets to develop a classification model for the three sub-genres. We experimented with a hierarchical classification system, first attempting to distinguish Black-authored from general fiction, and then to identify the Black Fantastic sub-genre from among all Black-authored fiction. We used the Stochastic Gradient Descent (SGD) algorithm (Bottou, 2012; Pedregosa, et al., 2011) for our classification models based on Term Frequency-Inverse Document Frequency (TF-IDF) features. Figure 2 shows the confusion matrix for each classification of the test set. As we can see, the classification of Black-authored fiction works well, with 90% accuracy on average, but the Black Fantastic classification is less effective, with 67% accuracy and 75% recall. This further implies that our choice of distinctive tokens as a feature to differentiate fictional sub-genres indeed works, but to a lesser degree for detection of the Black Fantastic as a genre. These errors might result from our limited Black Fantastic training set. Beginning with more Black Fantastic texts, and identifying their most distinctive words, might improve the classification model.

Figure 2. Confusion matrices for a hierarchical classification model. The left-hand matrix represents classification into two collections: general and Black-authored fiction; and the right-hand represents the classification of the Black-authored set into Black Fantastic and general Black-authored works.

Future Work
Our next phase will rely on an enriched Black Fantastic dataset in order to improve our classification accuracy. We will also experiment with different features, such as full text (rather than our bag-of-words approach). In spite of limited results for our ultimate goal of discovering new Black Fantastic texts, our success with this model for distinguishing related categories gives reason for optimism given more robust data.
Note: This submission is a companion piece to another one focused on the pedagogical aspects of the “Black Fantastic” project.

Bibliography

Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3(null): 993–1022.

Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Montavon, G., Orr, G. B. and Müller, K.-R. (eds), Neural Networks: Tricks of the Trade: Second Edition. (Lecture Notes in Computer Science). Berlin, Heidelberg: Springer, pp. 421–36 doi:
10.1007/978-3-642-35289-8_25.
https://doi.org/10.1007/978-3-642-35289-8_25 (accessed 1 June 2022).

Gittel, B. (2021). An Institutional Perspective on Genres: Generic Subtitles in German Literature from 1500-2020. Journal of Cultural Analytics, 6(1). Department of Languages, Literatures, and Cultures: 22086 doi:
10.22148/001c.22086.

Iton, R. (2008). In Search of the Black Fantastic: Politics and Popular Culture in the Post-Civil Rights Era. (Transgressing Boundaries: Studies in Black Politics and Black Communities). New York: Oxford University Press doi:
10.1093/acprof:oso/9780195178463.001.0001.
https://oxford.universitypressscholarship.com/10.1093/acprof:oso/9780195178463.001.0001/acprof-9780195178463 (accessed 1 June 2022).

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al. (2011). Scikit-learn: Machine Learning in Python. The Journal of Machine Learning Research, 12(null): 2825–30.

Rehurek, R., and Sojka, P. (2011) "Gensim–python framework for vector space modelling." NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic 3, no. 2.

Schöch, C. (2017). Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama. Digital Humanities Quarterly, 011(2).

Underwood, T. (2016). Genre Theory and Historicism. Journal of Cultural Analytics, 2(2). Department of Languages, Literatures, and Cultures: 11063 doi:
10.22148/16.008.

Wilkens, M. (2016). Genre, Computation, and the Varieties of Twentieth-Century U.S. Fiction. Journal of Cultural Analytics, 2(2). Department of Languages, Literatures, and Cultures: 11065 doi:
10.22148/16.009.

About This Journal | Third Stone | Rochester Institute of Technology
https://scholarwor
ks.rit.edu/thirdstone/about.html
(accessed 10 December 2021).

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2022

"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO