Recent years have witnessed an impressive surge of interest in Natural Language Generation. Advances in neural language modeling in particular have boosted the capacities of computational text generation systems, resulting in more realistic generated text and the capability to generate text in a variety of genres or styles (Ficler et al., 2017). Beyond scholarship, text generation is currently attracting a great deal of attention in the arts, for instance with the emergence of artist communities such as botnik.org. While the limited semantic coherence of generated text remains a concern, the grammatical correctness and stylistic qualities of these artificial texts are remarkably convincing (Manjavacas et al., 2017).
The current study set out to investigate how well human readers are able to distinguish between authentic and synthetic text fragments, with an interpretive focus on the linguistic cues that readers seem to rely on in their authenticity judgments. As a case study, we turned to the domain of Hip-Hop lyrics, known for its explicit content and typical rhythmical delivery (Adams, 2009; Potash et al., 2015). Through a large-scale experiment in the form of a serious game, we crowdsourced authenticity judgments for original and generated rap lyrics. Through statistical modeling of the resulting database of authenticity judgments, the present study aims to (i) enhance our understanding of the cognitive processes at play in the perception of authentic and synthetic artifacts in cultural production and (ii) improve text generation systems.
We compiled a large corpus of Hip-Hop lyrics from the Original Hip-Hop Lyrics Archive (ohhla.com), an online archive of Hip-Hop songs. This corpus, amounting to approximately 38M tokens in about 64k songs, was preprocessed in a pipeline that included line and stanza detection, tokenization, syllabification, and grapheme-to-phoneme conversion (to detect word stresses and rhyme words). These data were used to train a neural language model (LM), which makes use of recurrent connections to model longer-term dependencies (Hochreiter et al., 1997). With the help of an LM, a sentence of n words, w_1 to w_n, can be generated by following the generative process expressed in Equation 1:
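Assuming the standard autoregressive (chain-rule) factorization used by language models, which is what the preceding sentence describes, Equation 1 can be written as follows. The index t is introduced here for illustration.

```latex
P(w_1, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1}) \tag{1}
```

Generating a sentence then amounts to drawing w_1 from P(w_1), then w_2 from P(w_2 | w_1), and so on until the sentence ends.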
During generation, we sample from the output distribution at each step, feeding in the token sampled at the previous step; depending on the sequences the model was trained on, this token can also be a single character or syllable. We trained three kinds of LMs at different levels: a character-level LM, a syllable-level LM and a hierarchical LM (Chung et al., 2016). The latter is similar to a word-level LM in that it decomposes the probability of a sentence into the conditional probabilities of its words, but it additionally decomposes the probability of each word on the basis of its characters. We also experimented with a conditional variant of each of the three models, which is conditioned on sentence length and final sentence phonology (i.e. the phonological representation after the last stressed vowel). Figure 1 shows examples of generated text.
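The sampling loop described above can be illustrated with a minimal sketch. It is not the recurrent neural LM used in the study: as a stand-in, it assumes a toy character-level bigram model estimated from three made-up lines, so each step conditions only on the previous character rather than on the full history. The function names, the boundary markers "^" and "$", and the toy corpus are all hypothetical.

```python
import random


def train_bigram_counts(lines):
    """Count character bigrams; '^' marks line start and '$' marks line end."""
    counts = {}
    for line in lines:
        symbols = ["^"] + list(line) + ["$"]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts.setdefault(prev, {})
            counts[prev][nxt] = counts[prev].get(nxt, 0) + 1
    return counts


def sample_line(counts, rng, max_len=80):
    """Sample one line, feeding each sampled character back in as context."""
    prev, out = "^", []
    while len(out) < max_len:
        candidates = counts[prev]
        symbols = list(candidates)
        weights = [candidates[s] for s in symbols]
        # Draw the next symbol from the distribution given the previous one.
        nxt = rng.choices(symbols, weights=weights, k=1)[0]
        if nxt == "$":  # end-of-line symbol: stop generating
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


# Made-up training lines, for illustration only (not from the corpus).
toy_corpus = ["keep it real", "keep it live", "real talk"]
counts = train_bigram_counts(toy_corpus)
sample = sample_line(counts, random.Random(0))
print(sample)
```

Switching to syllables or words would only change the unit that is counted and sampled; the loop itself stays the same.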
Figure 1: Generated samples from the experiment randomly extracted from different difficulty bins (e.g. 25%-50% refers to examples in the 25%-50% difficulty percentile according to a logistic classifier). Models correspond to character-level (C), syllable-level (S) and hierarchical (H). The trailing "+" indicates a conditional model.
Serious game
At the heart of this study lies a crowdsourcing experiment carried out at a popular music festival. In a serious game, we collected authenticity judgments that measured how well participants could distinguish between authentic fragments and artificial fragments generated by one of our models. In order to efficiently communicate the game’s purpose in the media, we publicized this experiment as a so-called ‘Turing test’, although the description below will make clear how the set-up is markedly different from the “imitation game” which Turing (1950) originally proposed. Each game took the form of a series of (independent) questions, each of which had to be solved within a time limit of 15 seconds. The questions randomly alternated between two kinds:
Type-A: presented with an authentic and an artificially generated fragment, the participant had to decide which fragment was authentic.
Type-B: presented with a single fragment, the participant had to decide whether the fragment was authentic or generated.
Type-B questions involved less reading but only presented participants with a single fragment, meaning that participants were unable to compare two fragments. Each game allowed the player to answer at least 11 questions, and the player was awarded one point per correct answer. After the first 10 questions, the game switched into a ‘sudden death’ mode, allowing the player to continue until an incorrect answer was given. The length of fragments was randomly varied between 2 and 4 lines. We ranked pairs of generated and authentic texts in terms of difficulty (see below). Pairs were then collected into bins of increasing difficulty. After 5 questions, the questions presented would be sampled from the next, more difficult bin.
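The scoring and progression rules above can be sketched as follows. This is an illustrative reading of the description, not the game's actual implementation: it assumes that the difficulty bin advances after every block of 5 questions and stays at the hardest bin once it is reached, and that an incorrect answer among the first 10 questions does not end the game. The function names and parameters are hypothetical, and the 15-second time limit is not modeled.

```python
def bin_index(question_number, n_bins, questions_per_bin=5):
    """Return the 0-based difficulty bin for a 1-based question number.

    Assumption: the bin advances every `questions_per_bin` questions and
    is capped at the hardest bin.
    """
    return min((question_number - 1) // questions_per_bin, n_bins - 1)


def play(answers, regular_questions=10):
    """Score a sequence of booleans (True = correct answer).

    Returns (score, questions_answered). One point per correct answer;
    after the regular questions, the first incorrect answer ends the game.
    """
    score = 0
    answered = 0
    for correct in answers:
        answered += 1
        if correct:
            score += 1
        elif answered > regular_questions:
            break  # sudden death: the game stops at the first mistake
    return score, answered


# A player who answers 12 questions correctly and fails the 13th.
print(play([True] * 12 + [False, True]))
# Bins for the first 12 questions, assuming 4 difficulty bins.
print([bin_index(q, n_bins=4) for q in range(1, 13)])
```

Under this reading, every player answers the first 10 questions plus at least one question in sudden-death mode, which matches the stated minimum of 11 questions.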
Figure 2: Example of a type-A question in the game's interface.
Modeling authenticity
Each fragment for the game was enriched with a set of linguistic features deemed relevant for modeling the difference between authentic and synthetic Hip-Hop. These included morphological, lexical and syntactic characteristics (see below), which we refer to as “cues”. Most of these can be argued to capture some aspect of the linguistic complexity of the fragments. Using these features, we fit a Bayesian logistic regression model (see, e.g., Hoffman & Gelman, 2014) with participants as a random effect against two response variables: (i) whether the text is authentic or generated (modeling objective authenticity) and (ii) the participant’s actual authenticity judgment (modeling perceived authenticity). In addition, we study and control for biases in authenticity perception, learning effects, and the linguistic cues that participants learned to exploit to solve the game.
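One way to write down such a model, as a sketch under assumptions rather than the exact specification used here, is a logistic regression with a random intercept per participant. The notation is introduced for illustration: y_{ij} is the response for fragment i as seen by participant j, x_{ik} is the value of cue k for fragment i, and u_j is the participant-level intercept. The choice of a random intercept (rather than random slopes) is an assumption, and the priors are left unspecified.

```latex
\begin{aligned}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \\
\operatorname{logit}(p_{ij}) &= \alpha + u_j + \sum_{k} \beta_k \, x_{ik} \\
u_j &\sim \mathcal{N}(0, \sigma_u^2)
\end{aligned}
```

For response (i), y_{ij} encodes whether the fragment is authentic; for response (ii), it encodes whether participant j judged it to be authentic. Comparing the coefficients \beta_k across the two fits shows which cues participants over- or underweight.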
We restrict this discussion to our most salient results and refer the reader to the presentation for further details.
On average, the authenticity detection accuracy was above chance level (50%), with participants answering correctly 60.5% of the time. With 58% median accuracy, participants performed slightly worse on questions of type-B than of type-A, suggesting that the task becomes harder in the absence of a reference point. In addition, we observed marked differences in the authenticity detection accuracy for the three aforementioned language models. Fragments generated by the hierarchical and word-level models were markedly harder to detect than character-level fragments.
As shown in Figure 3a, there is considerable evidence of a learning effect in both question types, especially towards the beginning of the game. Importantly, the learning effect must be explained differently depending on question type. By design, any learning effect in type-A questions can only involve the accuracy of identifying the original fragment. For type-B questions, however, the learning effect seems to reflect a shift from a bias towards “generated” to a bias towards “original”, as can be seen in Figure 3b.
Figure 3: Marginal effects with 95% credible intervals of trial number for both type-A and type-B questions (a). Marginal effects plots showing the effect of trial number on guess accuracy for “Authentic” and “Generated” in type-B questions (b).
To estimate the objective feature importance, we performed a logistic regression analysis with whether a text was original or generated as the dependent variable and the linguistic cues as predictors (Figure 4a); we applied the same procedure to model the perceived authenticity of text fragments (Figure 4b). The odds ratios show interesting (dis)similarities in feature importance between objective and perceived authenticity. The average depth of syntax trees, for instance, suggests that generated text fragments have considerably less complex sentence constructions, and this was clearly picked up by participants as well. Interestingly, Figure 4b shows that humans readily overestimated the positive weight of some feature types, e.g. the proportion of politically incorrect words (pc words), indicating that humans underestimated a machine’s ability to produce foul language. These observations point to specific aspects for future text generation research to improve on.
Figure 4: Log-odds ratios for objective (a) and subjective (b) feature importance.
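As a reading aid for the log-odds in Figure 4, the short sketch below shows how a logistic regression coefficient translates into an odds ratio and into a change in probability. The helper names and the example coefficient are hypothetical and are not values estimated in this study.

```python
import math


def odds_ratio(log_odds):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(log_odds)


def shift_probability(p, log_odds):
    """Probability obtained by adding log_odds to the logit of baseline p."""
    logit = math.log(p / (1 - p)) + log_odds
    return 1 / (1 + math.exp(-logit))


# Hypothetical coefficient for one cue; not a result from the paper.
beta = math.log(3)
print(odds_ratio(beta))
print(shift_probability(0.5, beta))
```

A coefficient of zero corresponds to an odds ratio of 1, meaning the cue carries no information about authenticity; positive and negative coefficients push the judgment in opposite directions.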
