Palimpsest: Improving Assisted Curation of Loco-specific Literature
University of Edinburgh, United Kingdom
University of Edinburgh, United Kingdom
Yahoo Labs, London
University of St. Andrews, United Kingdom
Paul Arthur, University of Western Sidney
Locked Bag 1797
Penrith NSW 2751
Converted from a Word document
interfaces and technology
natural language processing
data mining / text mining
This paper reports on interdisciplinary work carried out for the Palimpsest project, focusing on mining literary works set in Edinburgh, a UNESCO City of Literature.
1 The project’s aim is to use text mining to scour accessible literary works and find those mentioning Edinburgh or places within it. We ground ‘loco-specific’ passages of text by identifying their latitudes and longitudes, so that scholars and the public can both geographically explore their fictional city. Palimpsest is a collaboration between literary scholars studying the use of place in literature and computer scientists working on text mining and information visualisation. Through a range of maps and accessible visualisations, users are able to explore the spatial relations of the literary city at particular times in its history, in the works of specific authors, or across eras and writers.
We present an overview of the project workflow and describe the assisted curation process adopted. It involves automatic retrieval and ranking of accessible literature according to its loco-specificity followed by manual selection of ranked documents, resulting in a set of literary works identified as set in Edinburgh. We report on the fine-tuning of the retrieval and ranking prototype based on literary scholar annotators’ feedback.
Figure 1 shows the Palimpsest workflow. The input data is made of five literary document collections amounting to approximately 380,000 works, most of which are out of copyright, as well as a small set of modern books from authors who are well known for their literature being set in Edinburgh (incl. Irvine Welsh, Alexander McCall Smith, and Muriel Spark). The out-of-copyright collections are varied in content and contain literary fiction and nonfiction genres. The data is first indexed using Indri 5.6
2 and ranked using a set of 1,633 Edinburgh place name queries.
3 We use the Indri inference network language model-based ranking approach (Strohman et al., 2005). The ranking score of a document is increased given certain metadata information (including a set of favoured Library of Congress codes and subject terms) or down-weighted for ambiguous Edinburgh place names. We combine the score for genre in the metadata with the location query retrieved from the content of the book. The output of the document retrieval component is a set of ranked Edinburgh-specific candidate documents per collection.
This data was loaded into a web-based annotation tool for manual curation. All Edinburgh place names occurring in the document and snippets surrounding them were displayed to aid in the annotators’—three literary scholars from the School of Literature at the University of Edinburgh—decision making. The subset of works which were manually curated as Edinburgh-specific are further processed by text mining, which geo-references place names by grounding them to their latitude/longitude coordinates using the Edinburgh Geoparser (Grover et al., 2010)
4 and in particular the Edinburgh gazetteer that is being developed in Palimpsest. The output (geo-referenced location mentions and snippets) is stored in the Palimpsest database, which is accessible via web-based visualisations.
By assisted curation we refer to the process of semiautomatically curating a set of Edinburgh-specific literature from all accessible literature. Related endeavours have relied on the collection of titles, or passages, by a few individuals or via crowdsourcing (e.g., Edinburgh Reads
5 run by Edinburgh Libraries or Global Bookmap
6). The idea for Palimpsest arose out of an initial prototype that visualises a small set of extracts manually collected by literary scholars at the University of Edinburgh.
7 Such an approach results in high-quality data with the disadvantage of missing less well-known but potentially interesting works. In Palimpsest we consider the entire pool of accessible literature accessible to determine a subset of highly ranked Edinburgh-specific candidates automatically using location-based document retrieval. The aim is to uncover a large range of Edinburgh-specific literature, not only famous and well-read titles. Assisted curation by means of text mining alone has shown encouraging results in other domains (e.g., Kristjansson et al., 2004; Alex et al., 2008). We combine text mining and information retrieval for assisted curation and show how user feedback can improve the technical stages to this process.
The manual annotation of the ranked candidates to select actual Edinburgh-specific literature was done using the annotation tool displayed in Figure 2. All ranked documents are displayed on the left-hand panel, listing the title of each work, the author and publication date if available, a link to the original source document, and a list of location mentions identified within the book. When clicking on a title, additional information appears in the right-hand panel, including a graph showing occurrences of place names within a document and snippets containing Edinburgh place names. Based on this information and by following the link to the original source, the annotators can determine a work as being Edinburgh-specific or not, enter further comments, and identify the start and end content pages of a document. When clicking the submit button, a document annotation is saved to the database and disappears from the panel on the left.
An item can be annotated using the annotation scheme shown in Figure 3. We consider documents annotated as
yes (except) as Edinburgh-specific within Palimpsest.
8 The scheme was developed by the annotators while working on an initial ranking of HathiTrust documents.
We used the HathiTrust collection (253,350 documents) to develop the retrieval and ranking component. This resulted in 20,542 ranked candidate documents containing one or more Edinburgh place names. Over a period of two weeks, the annotators curated the ranked documents in order. This resulted in 1,710 annotated documents, of which 200 were considered Edinburgh-specific literature.
Initially, the annotators reacted enthusiastically to the annotation and discovered several works set in Edinburgh of which they were unaware (e.g.,
John and Betty’s Scotch History Visit or
Noctes Ambrosianae). As they worked through the documents, however, they lost trust in the ranking. They noticed relevant documents appearing far down the list and sometimes had to go through many documents to find a positive example. They also recorded a list of ambiguous place names (
High Street or
Trinity) mostly referring to other locations as well as a list of words in titles suggesting nonliterary content (
dictionary). Finally, they observed that most Edinburgh-specific documents contain a reference to
Edinburgh or a variant.
Improving the Ranking
Based on this feedback, we then fine-tuned the retrieval component. We used the set of 1,710 annotated works as an evaluation set to determine the effect of a modification. There is a body of research on using relevance judgments for improving information retrieval, a good summary of which appears in Manning et al. (2008). We tested the initial ranking (baseline), the following three measures and their combination:
(a) Down-weighting ambiguous place names identified by the annotators.
(b) Removing documents containing nonliterary title words (
(c) Ensuring that
Edinburgh or one of its variants (
Edinburrie, etc.) occurs in the work.
Figure 4 shows that down-weighting of ambiguous place names (a) resulted in a small improvement in average precision (MAP) (Baeza-Yates and Ribeiro-Neto, 1999). Filtering documents with nonliterary title words (b) had the highest increase in MAP. The condition of Edinburgh or a variant to appear in the document (c) decreased MAP slightly. However, it resulted in a large decrease in the number of ranked documents, reducing the workload of the annotators significantly. We therefore consider measure (c) to be beneficial as well. When combining all three measures, the retrieval component yielded an improved MAP score of 0.1684 (compared to the baseline MAP of 0.1307), and the workload of documents to be curated was reduced by 60%.
The assisted curation process undertaken in Palimpsest attempts to keep the user in the loop during iterative technical development. We received useful feedback from the literary scholars on issues that appeared as they curated documents and considered their suggestions in changing the underlying methods for ranking Edinburgh-specific literature. Our results show that document retrieval performance improved and curation workload was reduced as a result. The improved method was subsequently applied to all document collections, which resulted in very positive feedback from the curators, reporting that the ranking improved considerably.
Palimpsest is funded by the AHRC (Digital Transformations in the Arts and Humanities—Big Data, PI: Prof. Loxley). We would like to thank Dr Anderson, Dr Otty, and Dr Thomson, who were in charge of the curation, and Dr Harris-Birtill, who helped with the data processing for the annotation tool.
3. This includes entries appearing in at least three of five resources used to construct the Edinburgh gazetteer (OpenStreetMap, OSLocator, Royal Commission for Ancient Historic Monuments of Scotland, Historic Scotland, QuatroShapes of Edinburgh areas).
8. We excluded poetry, but we annotated it (
prob. not) to be able to work on it in future.
Alex, B., Grover, C., Haddow, B., Kabadjov, M., Klein, E., Matthews, M., Roebuck, S., Tobin, R. and Wang, X. (2008). Assisted Curation: Does Text Mining Really Help? In
Biocomputing 2008: Proceedings of the Pacific Symposium on Biocomputing, pp. 556–67.
Baeza-Yates, R. and Ribeiro-Neto, B. (1999).
Modern Information Retrieval. Addison Wesley Longman, London.
Grover, C., Tobin, R., Byrne, K., Woollard, M., Reid, J., Dunn, S. and Ball, J. (2010). Use of the Edinburgh Geoparser for Georeferencing Digitised Historical Collections.
Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences,
Kristjansson, T. T., Culotta, A., Viola, P. and McCallum, A. (2004). Interactive Information Extraction with Constrained Conditional Random Fields. In
Proceedings of the 19th National Conference on Artificial Intelligence
, San Jose, CA. AAAI Press, pp. 412–18.
Manning, C. D., Raghavan, P. and Schütze, H. (2008).
Introduction to Information Retrieval. Cambridge University Press, Cambridge.
Strohman, T., Metzler, D., Turtle, H. and Croft, W. B. (2005). Indri: A Language-Model Based Search Engine for Complex Queries, CIIR Technical Report, ext. version.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Hosted at Western Sydney University
June 29, 2015 - July 3, 2015
280 works by 609 authors indexed
Conference website: https://web.archive.org/web/20190121165412/http://dh2015.org/
Series: ADHO (10)