Crowdsourcing Performing Arts History with NYPL's ENSEMBLE

paper, specified "short paper"
  1. Doug Reside

     New York Public Library

  2. Ben Vershbow

     New York Public Library

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

1. Introduction

The New York Public Library for the Performing Arts holds in its collection over one million programs documenting a large number of the major theater, music, and dance events performed around the country since the end of the Civil War. Although the collection grows each month, the Library estimates that it currently holds approximately 125,000 dance, 400,000 music, and over one million theater programs. These programs are valuable as individual artifacts, of course, but as an aggregated collection they serve as a sort of analog database of performing arts history. Unfortunately, querying this “database” is, at present, very inefficient. The materials are available only to researchers who come to New York to view them in person, and can be viewed only one at a time. Further, many are printed on crumbling paper that may not survive many more examinations, even by careful researchers.
In early 2013, motivated both by our responsibility to preserve these artifacts and out of a desire to better expose the data they contain, we launched an effort to create digital images of our program collection and organize a crowd-sourced effort to transcribe and structure the information contained within it. The project, launched in beta under the name Ensemble1 in June of 2013, is now part of a new NEH-funded Digital Humanities Implementation grant to create tools for crowd-sourced transcription projects2. This paper will discuss the lessons the team learned from the beta release as well as the modifications we are planning for the upcoming full release in 2014.
2. Behind the Beta

Although it is our goal to scan and transcribe every program in the Library’s collection, for the beta release we scanned 5 reels of microfilm containing 200 programs connected to theatrical productions performed in New York City between 1860 and 1930. We selected this content for several reasons:
The relatively low cost and high efficiency of microfilm scanning allowed us to add a relatively large number of programs to our initial set very cheaply and quickly. Although in most cases we would prefer to digitize originals, the programs preserved on these reels no longer exist in our collections.
Performing arts events from this period are not well-documented by other online databases (such as the Internet Broadway Database3 or Playbill Vault4).
These programs are almost certainly in the public domain; therefore they can be scanned and published online in their entirety. (If programs printed between 1923 and 1950 ever were in copyright, their copyrights were likely not renewed, and so they passed into the public domain 28 years after publication.)
This period was an especially fertile time in the development of the American performing arts; Carnegie Hall was built5, and Vaudeville and American Musical Theater both developed during these decades.
3. What we learned

The beta release of Ensemble has, as of this writing, produced almost 11,500 transcriptions of data from our initial test set of 200 programs. Although this is significant, it falls far short of the activity seen by other crowd-sourcing projects released by NYPL Labs. In its first three years, the menus transcription project has had over 1.2 million dishes transcribed. Over 60,000 buildings were checked in the first days of the “Building Inspector” app6. By comparison, participation in Ensemble is very low.
In part, these lower numbers may reflect the relative difficulty of the task Ensemble assigns to its users. Rather than asking for a simple transcription (as the Menus project does), users of Ensemble are required to identify relationships among text on the page (and occasionally to bring to it their own understanding of the theater industry). For instance, a program in our collection purports to be a record of “Jesse L. Lasky’s Artistic Novelty: Fleurette.” A user assigned this program must determine whether Jesse L. Lasky is the playwright, the producer, or perhaps the director.
In some cases, our interface may not even have an appropriate category. Lacking a consistently adopted schema for performing arts data, the user is required to engage in a bit of amateur taxonomy. In our first official release, we plan to revise and publish our schema, and make it easier for novice users to perform less demanding tasks while saving more challenging assignments for “advanced” levels of the game. Zooniverse's transcription project, Old Weather7, has had success with a similar approach.
Following the model of citizen science projects like Zooniverse’s Galaxy Zoo8, Ensemble requires “agreement” by several users before accepting a crowd-sourced transcription as correct. The level of agreement among different transcriptions of the same text is processed by our systems and will eventually be used to determine which assertions are stored in the database that users of Ensemble construct. In our initial version, we attempted to expose this quality-assurance (“user agreement”) process. Our hope was that those who were suspicious of the accuracy of any database constructed by the “crowd” would be somewhat reassured once they understood how the process worked. More often, though, we found that users who were the first to transcribe a fact, and then saw that the system had a low “degree of confidence” in the work they had just submitted (since no one else had yet “agreed” with them), misunderstood what they were being told and felt either insulted or disheartened. We quickly removed these visualizations (although we may find a way to incorporate them, more clearly contextualized, in the final version of the tool).
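The paper does not specify the agreement algorithm itself; a minimal sketch of one plausible quorum scheme, assuming a fixed agreement threshold and simple string normalization (both hypothetical details, not Ensemble's actual implementation):

```python
from collections import Counter

def accept_transcription(submissions, quorum=3):
    """Accept a crowd-sourced value once enough independent
    submissions agree, after light normalization.

    submissions: raw strings typed by different users for one field.
    Returns the accepted value, or None if nothing has reached quorum.
    """
    normalized = Counter(s.strip().lower() for s in submissions)
    value, count = normalized.most_common(1)[0] if normalized else (None, 0)
    return value if count >= quorum else None

# A field transcribed by four users; three agree after normalization,
# so the value is accepted. The "degree of confidence" shown to early
# transcribers would be low until this quorum is reached.
votes = ["Jesse L. Lasky", "jesse l. lasky ", "Jesse L. Lasky", "Jesse Lasky"]
print(accept_transcription(votes))  # prints: jesse l. lasky
```

Under a scheme like this, the first transcriber of a fact always sees a sub-quorum count, which is exactly the state the paper reports users misreading as criticism of their work.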
4. Potential uses of the data

Of course, the reason for engaging the crowd to produce this dataset in the first place is the assumption that it will be useful to future researchers. Some initial use cases we have imagined include:
Aggregating archives: At present, researching historic performing arts events can be difficult as most of the primary sources are held in collections centered on a particular person. For example, if a scholar is researching the George M. Cohan musical Little Johnny Jones, he or she will quickly discover that there is no large Little Johnny Jones collection at any major library. Once all of the data in our programs is available, however, a researcher could write a computer program that, given the title of a show, could generate a list of people associated with it and automatically search WorldCat for libraries that hold archives related to these people.
Discovering untold biographies: The lives of star performers and successful writers are often studied, but the careers of those members of a production whose role is less visible, but no less vital, often go unchronicled. Ensemble will enable researchers to track, for instance, which stage managers are most often associated with successful plays, which cellists were featured in the best orchestras of the 1920s, and which constellation of artists and technicians is most often associated with the success or failure of a production of a Shakespeare play.
Mapping the arts: Where in New York City in 1920 would one most likely find an opera performed? What about a burlesque? A jazz concert? By opening up the data in the programs, it will be possible for software developers and geographers to combine the performance data with our digitized historical map collection and plot the kinds of performing arts events performed in particular regions of the City during a defined time period. This data may confirm or overturn scholarly assumptions about the geographical history of the city.
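The archive-aggregation query from the first use case above could be sketched as follows. The record layout and field names here are hypothetical (the Ensemble schema is still being revised), and the final WorldCat step is left as a comment rather than an invented API call:

```python
# Hypothetical program records, as crowd-sourced transcription might
# eventually structure them; not the actual Ensemble schema.
programs = [
    {"title": "Little Johnny Jones",
     "people": [{"name": "George M. Cohan", "role": "author"},
                {"name": "Sam H. Harris", "role": "producer"}]},
    {"title": "Fleurette",
     "people": [{"name": "Jesse L. Lasky", "role": "producer"}]},
]

def people_for_show(title):
    """Return every person credited on any program for the given show."""
    return sorted({person["name"]
                   for program in programs if program["title"] == title
                   for person in program["people"]})

# Each name could then be passed to a union catalog search such as
# WorldCat to locate archival collections held under that person's name.
for name in people_for_show("Little Johnny Jones"):
    print(name)  # prints George M. Cohan, then Sam H. Harris
```

The point of the sketch is the pivot: the show title, which no archive is organized around, resolves to a set of personal names, which archives are organized around.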
It is possible that the most illuminating and exciting uses of the data in these programs have yet to be imagined because, at the moment, surprisingly little information from this period is available at all. It is our hope that Ensemble will soon become the backbone of an extensive, linked, open set of performing arts data that will allow researchers of all kinds to discover new information about the rich history of the performing arts in New York.

1. Ensemble (2013). Ensemble: Help Build an Open Database of the Performing Arts. Web. 27 Oct. 2013.
2. Announcing 6 Digital Humanities Implementation Grant Awards (July 2013) | National Endowment for the Humanities (2013). Web. 27 Oct. 2013.
3. IBDB: The Official Source for Broadway Information. Web. 27 Oct. 2013.
4. The Largest Internet Database of Broadway Information - Playbill Vault. Web. 27 Oct. 2013.
5. History of the Hall | Carnegie Hall. Web. 27 Oct. 2013.
6. Building Inspector by NYPL. Web. 27 Oct. 2013.
7. Alexandra Eveleigh, Charlene Jennett, Stuart Lynn and Anna Cox (2013). “I want to be a Captain! I want to be a Captain!”: Gamification in the Old Weather Citizen Science Project. Web. 26 Oct. 2013.
8. Galaxy Zoo. Web. 27 Oct. 2013.


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO