Corpus-DB: a Scriptable Textual Corpus Database for Cultural Analytics

lightning talk
  1. Jonathan Reeve

    Columbia University

Work text

Corpus-DB is a database and query framework which solves the problems of text retrieval, text cleaning, corpus compilation, and metadata aggregation that often form the first step for researchers in computational text analysis. Traditionally, scholars interested in studying a collection of texts, such as novels set in London, Bildungsromane, sestinas, or poems written in 1889, have had to assemble their corpora manually, which can be a prohibitively laborious process. Corpus-DB gathers full texts and metadata from Project Gutenberg, the British Library, and other sources; cleans the texts; adds metadata found via Wikidata, Goodreads, Wikipedia, and elsewhere; and provides the result as a free, open, and easily scriptable API. This enables rapid prototyping of text analysis projects, as well as advanced queries of these corpora, providing easy answers to questions such as the average Goodreads star rating of novels set in London, or the median publication date of detective novels.
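
The kind of aggregate question mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not describe Corpus-DB's actual endpoints or schema, so the sketch works on a small in-memory list of made-up records, and the field names (setting, genre, year, rating) and the helpers select, average_rating, and median_year are hypothetical.

```python
from statistics import mean, median

# Hypothetical records, shaped the way an aggregated-metadata source might
# return them. Field names and values are invented for illustration and are
# NOT Corpus-DB's real schema or data.
SAMPLE_RECORDS = [
    {"title": "Novel A", "genre": "detective", "setting": "London", "year": 1887, "rating": 4.0},
    {"title": "Novel B", "genre": "detective", "setting": "Paris", "year": 1868, "rating": 3.5},
    {"title": "Novel C", "genre": "bildungsroman", "setting": "London", "year": 1861, "rating": 3.0},
    {"title": "Novel D", "genre": "detective", "setting": "London", "year": 1902, "rating": None},
]


def select(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def average_rating(records):
    """Mean of the available ratings, or None if no record has one."""
    ratings = [r["rating"] for r in records if r.get("rating") is not None]
    return mean(ratings) if ratings else None


def median_year(records):
    """Median publication year, or None if no record has a year."""
    years = [r["year"] for r in records if r.get("year") is not None]
    return median(years) if years else None


if __name__ == "__main__":
    # "Average star rating of novels set in London"
    print(average_rating(select(SAMPLE_RECORDS, setting="London")))
    # "Median publication date for detective novels"
    print(median_year(select(SAMPLE_RECORDS, genre="detective")))
```

Records with a missing rating are skipped rather than counted as zero, since aggregated metadata of this kind is often incomplete.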


Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus; an online conference was held instead. Data for this conference were initially prepared and cleaned by May Ning.

Series: ADHO (15)

Organizers: ADHO