An Adaptive Methodology: Machine Learning and Literary Adaptation

paper, specified "long paper"
Authorship
  1. Grant Glass

    University of North Carolina at Chapel Hill, United States of America



Introduction

Using one of the most adapted texts in history, Robinson Crusoe, I ask whether a computer can find adaptations that scholars have yet to identify. Through testing the effectiveness of different machine learning techniques for text embedding on small groups of full-length texts, I determine the best model for the task, the Universal Sentence Encoder, and then use it to build a deep-neural-network binary classifier trained on a large dataset of adaptation and random texts. I attempt to implicitly teach the computer the plot of Crusoe, rather than having it make decisions based on stylistic details, which is a pitfall of traditional techniques. It is my hope that this novel pipeline will help other scholars work with large units of text at the plot level.

Works like Daniel Shore's Cyberformalism: Histories of Linguistic Forms in the Digital Archive, Andrew Piper's Enumerations: Data and Literary Study, and Ted Underwood's Distant Horizons: Digital Evidence and Literary Change have all attempted to change literary methodology by using algorithms to find patterns and features in texts. While these methodologies utilize many machine learning techniques, they have met with considerable pushback from the larger humanities community.

See Da, Nan Z. "The computational case against computational literary studies." Critical Inquiry 45.3 (2019): 601-639.

At the same time, these new methodologies force scholars to think about modeling and conceiving of literary texts differently (McCarty). In this shift toward modeling literary texts, the question that often arises is how we can frame literary questions in a format that a machine learning algorithm can understand.

This problem might be best considered against Daniel Defoe's The Life and Surprising Adventures of Robinson Crusoe, which has never been out of print in its three-hundred-year history and has amassed thousands of editions, not to mention a plethora of movies and TV shows. Using this text gives us enough material to make these machine learning algorithms viable. By teaching a machine learning algorithm the story of Crusoe (feeding it different adaptations), we can ask whether it can begin to distinguish between a random story and a Crusoe-like story. Could it identify new adaptations of Crusoe that have yet to be discovered?

Data Description

In early experiments to determine a suitable text embedding technique, I used four different texts: the "Original Text," the 1719 first edition of Robinson Crusoe by Daniel Defoe; a "close" adaptation, Jenichiro Oyabe's A Japanese Robinson Crusoe; a "far" science fiction adaptation called The Happy Castaway (1965); and a random text, Pride and Prejudice by Jane Austen (1813).

"Close" refers to similarity in setting, characters, and plot to the original.

"Far" refers to only the plot being loosely similar to the original text.

I chose these texts based on my own scholarship on Robinson Crusoe: the "close" adaptation follows the exact storyline but reimagines the protagonist as a Japanese man in America, the "far" adaptation takes the same plot but changes everything else about the text, and the random text is stylistically similar but has no character or plot similarity to Crusoe.

The final project utilized two large datasets. The first corpus was a random pooling of 2,188 texts from the Eighteenth Century Collections Online (ECCO) Text Creation Partnership (TCP).

https://textcreationpartnership.org/tcp-texts/ecco-tcp-eighteenth-century-collections-online/

See Welzenbach, Rebecca. "Making the Most of Free, Unrestricted Texts: a first look at the promise of the Text Creation Partnership."

This data is freely available through the ECCO-TCP website and was verified against the corresponding CSV file, a preview of which is included in Table 1.

Table 1: ECCO-TCP CSV File describing all the data in the corpus.
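As an illustration only (not the author's workflow), the downloaded corpus could be checked against that CSV with a few lines of pandas; the column name and directory layout below are hypothetical.

# Sketch: verify downloaded ECCO-TCP texts against the metadata CSV.
# The "TCP ID" column and file layout are assumptions, not taken from the paper.
from pathlib import Path
import pandas as pd

metadata = pd.read_csv("ecco_tcp_metadata.csv")              # the CSV previewed in Table 1
downloaded = {p.stem for p in Path("ecco_tcp_texts").glob("*.txt")}
listed = set(metadata["TCP ID"].astype(str))

print(f"{len(downloaded & listed)} of {len(listed)} listed texts are present on disk")
print(f"{len(downloaded - listed)} files on disk are not listed in the metadata")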

The second corpus included 1,484 texts drawn from a variety of versions of Robinson Crusoe, pulled from HathiTrust using HathiTrust's rsync service.

https://www.hathitrust.org

See Christenson, Heather. "HathiTrust."

Core Methodology

The first experiment to determine which method would generate the best text embeddings for this task used sklearn's TfidfVectorizer to build the embeddings of our training data.

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

I calculated the cosine similarity score between each text and the reference text (i.e. the Original Text) using sklearn's metrics.pairwise module. I then experimented with Google Research's Universal Sentence Encoder (USE), the latest pretrained model available, updated 2020.

https://tfhub.dev/google/universal-sentence-encoder/4

See Yang, Yinfei, et al. "Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax."

While originally meant for generating sentence-level embeddings, the model does not require a set maximum sequence length, a useful property that allows us to represent full-length texts of varying lengths as a fixed-dimensional embedding.
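As an illustration of this comparison (not the author's exact script), a minimal Python sketch of both embedding strategies might look like the following; the file names are assumptions, and the USE model is loaded from the TensorFlow Hub URL cited above.

# Sketch: compare TF-IDF and USE embeddings of four full-length texts
# against the 1719 Robinson Crusoe reference. File names are hypothetical.
from pathlib import Path

import tensorflow_hub as hub
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = {
    "original": Path("robinson_crusoe_1719.txt").read_text(encoding="utf-8"),
    "close": Path("a_japanese_robinson_crusoe.txt").read_text(encoding="utf-8"),
    "far": Path("the_happy_castaway.txt").read_text(encoding="utf-8"),
    "random": Path("pride_and_prejudice.txt").read_text(encoding="utf-8"),
}
names, docs = list(texts), list(texts.values())

# Baseline: TF-IDF vectors, cosine similarity to the reference text (row 0).
tfidf = TfidfVectorizer().fit_transform(docs)
tfidf_scores = cosine_similarity(tfidf[0], tfidf).ravel()

# USE: each full-length text becomes a single 512-dimensional vector.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
use_vectors = use(docs).numpy()
use_scores = cosine_similarity(use_vectors[:1], use_vectors).ravel()

for name, t, u in zip(names, tfidf_scores, use_scores):
    print(f"{name:>8}  tfidf={t:.3f}  use={u:.3f}")

The cosine similarity scores summarized in Figure 1 come from this kind of comparison against the reference text.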
In the end, I chose to work with the USE embeddings because they were more context-aware and, as a result, more discriminative than the other candidates. Notice (in Figure 1) that the text I determined (as the human expert) to be 'close' to the reference text, while indeed the closest, still showed a cosine similarity of only 0.528. Further, the texts determined to be 'random' and 'far' were significantly further from both the 'close' text and the reference text, yet very close to each other, which is what we might expect from a model that has learned semantic relationships particularly well (after all, why should Pride and Prejudice be closer to Robinson Crusoe than The Happy Castaway, when both are unrelated to it by plot?). Note that with the BERT embeddings, Pride and Prejudice turned out to be closer to Robinson Crusoe, which we posit is due to the nature of the sentence-level embeddings: the representation learned captures similarity in the stylistic, linguistic, grammatical, and lexical sense more than similarity of plot.

Figure 1: Results of the initial method across different texts (1.0 being closest to the original text). USE = Universal Sentence Encoder.
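The downstream classifier itself is described only at a high level here; a minimal sketch of one way to realize a deep-neural-network binary classifier over precomputed 512-dimensional USE embeddings follows. The layer sizes, training settings, and file names are assumptions for illustration, not the paper's reported configuration.

# Sketch: binary classifier over USE embeddings.
# 1 = Crusoe adaptation, 0 = random ECCO-TCP text.
# File names and hyperparameters are hypothetical.
import numpy as np
import tensorflow as tf

X = np.load("use_embeddings.npy")  # shape (n_texts, 512)
y = np.load("labels.npy")          # shape (n_texts,)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
model.fit(X, y, validation_split=0.2, epochs=20, batch_size=32)
model.save("crusoe_classifier.keras")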

Final Results
The model performs exceptionally well on the validation and test sets, identifying the adaptations (denoted by class 1 in Figure 2) with near-perfect precision and recall.

Figure 2: The classification results from the model. 0 denotes non-adaptations and 1 denotes adaptations.
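The per-class precision and recall summarized in Figure 2 could be reproduced on a held-out split with sklearn's classification_report; the saved model and file names below are hypothetical and assume the classifier sketch above.

# Sketch: evaluate the trained classifier on a held-out test set.
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report

model = tf.keras.models.load_model("crusoe_classifier.keras")  # hypothetical saved model
X_test = np.load("use_embeddings_test.npy")
y_test = np.load("labels_test.npy")

y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred,
                            target_names=["non-adaptation (0)", "adaptation (1)"]))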

Current Conclusions and Future Work

The potential pitfall of this technique is that I will not be able to measure how similar a text is to Robinson Crusoe at a more fine-grained textual level, but I have already attempted that in previous work using doc2vec. By using this new technique to look at a larger window of text than a sentence, we can find works that share a similar plot, which begins to suggest a new model of adaptation centered on plot rather than setting or characters. The challenge becomes where exactly the plot gets worked out: what unit of text can tell us that? If we can begin to think about where the plot is encoded in the text, and make the window of analysis match that unit, then we can move forward to other machine learning problems.

Bibliography

Chaudhary, Vishrav, et al. "Low-Resource Corpus Filtering using Multilingual Sentence Embeddings." arXiv preprint arXiv:1906.08885 (2019).

Christenson, Heather. "HathiTrust." Library Resources & Technical Services 55.2 (2011): 93-102.

Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

Foster, Thomas C., and David De Vries. How to Read Literature Like a Professor. Harper Collins Publishers, 2012.

Frye, Northrop. Anatomy of Criticism: Four Essays. Princeton University Press, 2020.

McCarty, Willard. Humanities Computing. 2014.

Moretti, Franco. Distant Reading. Verso Books, 2013.

Piper, Andrew. Enumerations: Data and Literary Study. University of Chicago Press, 2018.

Rae, Jack W., et al. "Compressive transformers for long-range sequence modelling." arXiv preprint arXiv:1911.05507 (2019).

Sanders, Julie. "Adaptation/Appropriation." The Encyclopedia of the Novel (2010).

Shore, Daniel. Cyberformalism: Histories of Linguistic Forms in the Digital Archive. JHU Press, 2018.

Underwood, Ted. Distant Horizons: Digital Evidence and Literary Change. University of Chicago Press, 2019.

Watt, Ian. The Rise of the Novel. University of California Press, 2001.

Welzenbach, Rebecca. "Making the Most of Free, Unrestricted Texts: a first look at the promise of the Text Creation Partnership." (2011).

Yang, Yinfei, et al. "Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax." arXiv preprint arXiv:1902.08564 (2019).


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO