The Telltale Hat: LDA and Classification Problems in a Large Folklore Corpus

paper, specified "long paper"
Authorship
  1. David Mimno

    Department of Information Science - Cornell University

  2. Peter Michael Broadwell

    Libraries - University of California, Los Angeles (UCLA)

  3. Timothy Roland Tangherlini

    Department of Asian Languages and Cultures - University of California, Los Angeles (UCLA), Scandinavian Section - University of California, Los Angeles (UCLA)

Work text
Introduction
Classification is a vexing problem in folkloristics. Indexing collections that often include tens of thousands of records is essential, but neither fully manual nor fully automated methods are adequate. In this work, we combine human notions of genre and topic classification with computational classifiers and topic analysis to produce an index that is both appropriate for scholarly goals and robust in the presence of ambiguity.

Traditional scholarly indexes have been limited by time and technology. Although broad genre classifications such as “ballad”, “folktale”, and “legend” are well established, these formal classifications are coarse and do little more than sort the materials into large, internally diverse groupings. Most standard classification schemes assign each record to a single classification and do not allow for cross-genre classification (e.g., a ballad and a legend about the same murder will be in different categories) [1, 2, 3, 4]. The inadequacy of these classification schemes has significantly constrained research on verbal folklore, particularly because such categorizations are often the only available topic index for any given collection.

New unsupervised machine learning methods offer scalability but lack human intelligence. Clustering algorithms partition a corpus into groups of documents that are similar. Topic modeling is more flexible, allowing each document to express multiple automatically detected themes. But such methods usually rely on simple bag-of-words representations that miss aspects of a text that are clear to readers familiar with the corpus. In addition, patterns found by algorithms may be statistically valid but uninteresting to scholars.

We explore the problem of classification in a large corpus (~35,000 records) of nineteenth-century Danish folklore and suggest possible solutions to these problems through classification and topic-modeling strategies that combine human labels with machine learning. We consider two classification schemes for the collection: in the first, each document receives one label, whereas the second assigns multiple labels to each document.

One label per story
The original collector assigned each story to exactly one of 36 labels, but we are most interested in “borderline” stories that could fit in many classes. These “liminal” stories not only reveal the challenges to classification that arise when a system can only accommodate a single label—as in the original index—but also help researchers to discover stories that are anomalous.

An excellent example of such an anomalous story appears in our target corpus, Danske sagn [Danish Legends] [5]:

DS_I_056:

Per Overlade was out one evening shooting hares. It was up on Kræn Møller’s field. Kræn was in the process of moving his farm, and the old farm had not been completely disassembled yet, and Per intended to hide amid the old frame that was still standing and shoot a hare or two. But when he gets there, he sees an old man who is sitting in there with a red cap on who nods to him. Per gets scared and doesn’t dare go in there, and so he doesn’t catch any hares.

Originally labeled as a story about “mound dwellers/hidden folk,” the story could just as easily be classified in several other categories: poaching, household guardian spirits (nisse, suggested by the old man’s red hat), and law breaking, to name but three. The story also touches on shifting agricultural practices and the significant reorganization of the Danish landscape in the early 1800s, when farms were routinely dismantled and moved out onto the newly reapportioned fields.

Where else could the editor/archivist have placed this story? To answer this question, we train a Naïve Bayes classifier by estimating a word-frequency histogram for each label. We then measure the similarity of a document to each of the resulting histograms, taking care to remove the word counts for the “query” document from the histogram for its original label. For many stories, the “true” label is the closest, but not in this case. Its top five labels in order are:

ID   Story label
36   Our forebears' way of thinking and spiritual life
35   Outdoor life
29   Witches and their sport
27   Being in league with the Devil
1    Mound dwellers/hidden folk

Although the first assignment is so broad as to be of little use, which underscores the inadequacy of the original index, the association of the story with label 35 highlights its affinity to stories about hunting and poaching, while label 29 reflects the story's connection with hares, animals most commonly associated with witches.
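
The leave-one-out comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name rank_labels, the smoothing constant alpha, and the choice of a uniform prior over labels are all assumptions, and the similarity is taken to be the smoothed multinomial log-likelihood of the story under each label's histogram.

```python
import math
from collections import Counter

def rank_labels(query_tokens, query_label, docs, alpha=0.01):
    """Rank labels for one story by smoothed multinomial log-likelihood.

    docs is a list of (label, tokens) pairs that includes the query story.
    alpha is an assumed additive smoothing constant; labels are treated as
    equally likely a priori (also an assumption).
    """
    # One word-frequency histogram per label.
    histograms = {}
    for label, tokens in docs:
        histograms.setdefault(label, Counter()).update(tokens)
    vocab_size = len({w for h in histograms.values() for w in h})

    query_counts = Counter(query_tokens)
    scores = {}
    for label, hist in histograms.items():
        if label == query_label:
            # Remove the query story's own counts from its original label.
            hist = hist - query_counts
        total = sum(hist.values())
        scores[label] = sum(
            n * math.log((hist[w] + alpha) / (total + alpha * vocab_size))
            for w, n in query_counts.items()
        )
    # Most similar label first.
    return sorted(scores, key=scores.get, reverse=True)
```

Calling rank_labels(tokens, original_label, docs)[:5] would give a top-five list of the kind shown in the table; a story is a candidate "liminal" case when its original label is not ranked first.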

Additionally, we can use this classification scheme to initialize a 36-topic model, creating one topic per original label. We assign each word token to the same topic as the label of its document. We then resample topic assignments for each word token in turn. Given the topic assignments of the tokens in a document, we can rank the topics for that document. After one sweep through the entire corpus, the “Mound dwellers” topic still accounts for more than 80% of the tokens in the story of Per Overlade, but after 10 sweeps, only 21% of the words remain in that topic. “Our forebears' way of thinking” and “Being in league with the Devil” instead account for a greater proportion, with the “Devil” topic triggered by words about shooting hares. Overall, after 10 sweeps the topic for the original label still accounts for the majority of tokens in 74% of the stories in the collection.
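
A rough sketch of this label-initialized resampling is given below. The text does not specify the sampler, so this assumes a standard collapsed Gibbs update for LDA with symmetric Dirichlet priors; the function name and the hyperparameters alpha and beta are hypothetical, and labels are assumed to be integers from 0 to K-1.

```python
import random
from collections import Counter

def label_initialized_gibbs(docs, labels, num_sweeps, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists; labels: one integer label per document.

    Returns per-token topic assignments after num_sweeps passes.
    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    num_topics = max(labels) + 1
    vocab_size = len({w for doc in docs for w in doc})

    # Initialization: every token takes the topic of its document's label.
    z = [[labels[d]] * len(doc) for d, doc in enumerate(docs)]
    doc_topic = [Counter(zd) for zd in z]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for doc, zd in zip(docs, z):
        for w, k in zip(doc, zd):
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                # Sample a new topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z
```

With num_sweeps=0 the function returns the pure label initialization; ranking the topics of a story after a given number of sweeps then amounts to counting the values in its row of the returned assignments.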

As we increase the number of sweeps through the corpus, the relationship between the topics of the model and the original labels becomes attenuated. At 100 sweeps, the majority of tokens remains in the original class for only 39% of the stories. In our sample story, the prominent topics are “From the time of villeinage”, “Wiverns and small creepy-crawlies”, “Our forebears' way of thinking”, and “Death portents”. Words about shooting and hares are now assigned to the “Wiverns” topic, indicating that we should be careful in using these labels. The “Death portents” topic is represented by the words forskrækket (scared) and sidder (sitting).
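
The retention figures quoted above (the share of stories whose original class still holds the majority of tokens) could be computed along these lines. This is a sketch under one assumption that the text leaves open: "majority" is read as a strict majority, more than half of a story's tokens. The function name is hypothetical.

```python
from collections import Counter

def label_retention(assignments, labels):
    """Fraction of stories whose original label's topic holds more than
    half of the story's tokens (strict majority, an assumed reading).

    assignments: per-story lists of topic ids; labels: original label ids.
    """
    retained = 0
    for topic_ids, label in zip(assignments, labels):
        if topic_ids and 2 * Counter(topic_ids)[label] > len(topic_ids):
            retained += 1
    return retained / len(labels)
```

Tracking this value after each sweep gives a simple curve of how quickly the topics drift away from the original index.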

Finding anomalous stories is not simply a question of precision and recall: the very fact that a story is “missed” in a given classification makes it particularly interesting. One of the jobs of the folklorist is to reconstruct the imaginary boundaries of the belief world, so stories that question or test those boundaries are the ones that are most important. Computationally cross-validating a traditional human-generated index, as described above, is an effective way to discover such liminal cases.

Multiple human-generated labels
We can also construct computational story classifiers when editors assign more than one label to each document. Human experts have catalogued a subset of the documents in our target corpus by assigning multiple labels to each document from a modern ontology that includes aspects of stories such as people, locations, and events. We would like to know how these labels map to the words in the documents, but simply counting the words in every document assigned to a label may result in noisy histograms. To improve our ability to interpret the results, we use a labeled topic model to learn which words are associated with which labels.

Multiple labels add complexity but allow us to make stronger assumptions. Since each document has more than one label, we cannot easily translate these labels into word-level assignments as in the previous experiment. On the other hand, we can be reasonably certain that the absence of a label implies that it is not relevant. Similar to Labeled LDA [6], we can therefore estimate word-topic assignments under the constraint that words can only be assigned to one of the labels for the document, or to a “Background” label that can absorb frequent words not related to any label. We then re-estimate topic-word distributions given these assignments, and repeat the process as needed.
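
One way to picture this constrained estimation is the sketch below. It is a simplified stand-in rather than the procedure actually used: each token is given a hard (argmax) assignment to its most probable allowed label instead of being sampled, document-level label proportions are ignored, and the function names, the BACKGROUND constant, and the smoothing value beta are all hypothetical.

```python
from collections import Counter

BACKGROUND = "Background"  # hypothetical name for the catch-all label

def initial_histograms(docs):
    """Noisy start: count every token under each of its document's labels
    and under the background label. docs is a list of (labels, tokens)."""
    hist = {BACKGROUND: Counter()}
    for labels, tokens in docs:
        for label in sorted(labels) + [BACKGROUND]:
            hist.setdefault(label, Counter()).update(tokens)
    return hist

def reestimate(docs, hist, beta=0.01):
    """One iteration: assign each token to its most probable allowed label
    (the document's own labels or the background), then recount."""
    vocab_size = len({w for _, tokens in docs for w in tokens})
    totals = {label: sum(h.values()) for label, h in hist.items()}

    def prob(label, word):
        # Smoothed probability of the word under the label's histogram.
        return (hist[label][word] + beta) / (totals[label] + beta * vocab_size)

    new_hist = {label: Counter() for label in hist}
    for labels, tokens in docs:
        allowed = sorted(labels) + [BACKGROUND]
        for w in tokens:
            best = max(allowed, key=lambda label: prob(label, w))
            new_hist[best][w] += 1
    return new_hist
```

Starting from initial_histograms(docs) and calling reestimate repeatedly mirrors the assign-then-re-estimate loop; the ranked word list for a label at any stage is hist[label].most_common().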

To evaluate the resulting word distributions, the original creator of the ontology marked individual words that are highly relevant to each label. At each stage of the algorithm, we have a ranked list of words for each label. Given these relevance judgments, we can compute mean average precision (MAP) for the model at each stage. Under the initial noisy distributions, MAP computed over the top 20 ranked words is 0.26. After the first iteration, MAP increases to 0.33, but it then falls in subsequent iterations, indicating that the model may be overfitting.
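
For reference, a truncated MAP of this kind could be computed as in the sketch below. Several normalizations of average precision at a cutoff are in use and the text does not say which one was applied; this version divides by min(k, number of relevant words), which is an assumption, as are the function names.

```python
def average_precision_at_k(ranked_words, relevant, k=20):
    """Average precision over the top k ranked words for one label.

    relevant is the set of words marked as relevant to the label.
    Normalized by min(k, len(relevant)); this choice is an assumption.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, word in enumerate(ranked_words[:k], start=1):
        if word in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))

def mean_average_precision(rankings, relevance, k=20):
    """Mean of per-label average precision. rankings maps each label to
    its ranked word list; relevance maps each label to its relevant set."""
    scores = [
        average_precision_at_k(words, relevance.get(label, set()), k)
        for label, words in rankings.items()
    ]
    return sum(scores) / len(scores)
```

Evaluating mean_average_precision on the ranked lists produced after each iteration gives the kind of stage-by-stage comparison described above.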

Consistent differences in ranking quality provide insight into the labels. We are more successful at finding words related to concrete themes such as people, animals, and objects. For more abstract labels, such as story resolutions and actions or events, the model was mostly unsuccessful. But there are exceptions: we identified no words related to the label “Farmer”, despite the fact that this is a very common label, while event labels such as “Disease” and “Death” yielded many specific words.

Conclusion
We demonstrate that classification and topic modeling methods can be used to improve existing manual annotations in a collection of Danish folklore. We find that incorporating human labels into machine learning methods, even when the labels are noisy or incomplete, produces indexes that have the benefits of both scholarly domain expertise and data-driven analysis. We believe that these results are applicable to many corpora, both in digital humanities and in the wider document analysis community.

References
1. Uther, Hans-Jörg. (2004). The Types of International Folktales: A Classification and Bibliography, Based on the System of Antti Aarne and Stith Thompson. FF Communications. Helsinki: Suomalainen Tiedeakatemia.

2. Grundtvig, Svend, Axel Olrik, Hakon Grüner-Nielsen, Karl-Ivar Hildeman, Erik Dal, Iørn Piø, Thorkild Knudsen, Svend Nielsen, and Nils Schiørring, eds. (1966–1976 [1853–1976]). Danmarks gamle Folkeviser. 12 volumes. Copenhagen: Universitets-Jubilæets Danske Samfund (Akademisk forlag).

3. Taylor, Archer. (1934). An Index to "The Proverb". FF Communications 113. Helsinki: Suomalainen Tiedeakatemia.

4. Christiansen, Reidar T. (1958). The Migratory Legends. FF Communications 175. Helsinki: Suomalainen Tiedeakatemia.

5. Kristensen, Evald Tang. (1892). Danske sagn, som de har lydt i folkemunde. Århus and Silkeborg: Århus Folkeblads Bogtrykkeri.

6. Ramage, Daniel, David Hall, Ramesh Nallapati, and Christopher D. Manning. (2009). Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 248-256.

Conference Info

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO