Creating a Meaningful Genre Schema and Metadata using IMDb data for a Large-Scale Digital Humanities Project in Media Studies

lightning talk
  1. Cindy Conaway

    SUNY Empire State College

  2. Diane Shichtman

    SUNY Empire State College

Work text

We are currently engaged in a long-term DH project examining the social networks of TV and film actors and crews across more than 32,500 media items they worked on, starting in 1938. The primary source is the Internet Movie Database (IMDb). IMDb is one of the most robust databases available and provides free downloadable data about the actors and crew who worked on various media items, including TV, movies, and video games. However, it is problematic in a number of ways, as we presented at DH2018. IMDb appears to be under-researched as a source for media studies in DH: many scholars focus on fan activity, or on film actors and directors only, and most do not question IMDb genres. This presentation will explore the contrast between IMDb’s methods and the schema we have had to create. It also discusses some of the challenges we faced, and how we think this could aid other researchers creating public digital humanities databases.

Steve Neale writes in Genre and Contemporary Hollywood, “genres can be approached from the point of view of the industry and its infrastructure, from the point of view of their aesthetic traditions, from the point of view of the broader socio-cultural environment upon which they draw and into which they feed, and from the point of view of audience understanding and response.” We must consider all of these concepts, as our work encompasses not only Hollywood film, the subject of most genre analysis, but media of every type, including much that is obscure or forgotten. IMDb’s genre methodology was inadequate, so we turned to industry-focused websites to enhance our understanding.

Other schemas use very vague macro descriptors, or idiosyncratic descriptors allowing a media item to be included in multiple “lists.” A movie in the Library of Congress is simply Comedy, Drama, Action, etc. AFI divides films into “best” but also “most thrilling” (including action, horror, and adventure).
The Telegraph writes that Netflix’s “genres, based on a complicated algorithm that uses reams of data about users' viewing habits . . . number in the tens of thousands” including commonly-accepted genres like “Action” but also “Family Watch Together TV.” We attempted to taxonomize, expanding and limiting the options available to encompass the full range of programming without becoming so esoteric that the categories cannot be clustered together and measured. We have created a vocabulary/schema to allow for clear communication as part of Public Humanities. We cannot, for example, talk about the representation of women in a particular subgenre if we do not have a shared understanding of it. If other scholars also use this schema (or work with us to adapt it), each media item can be described in a way that allows for effective and relatively consistent coding by multiple scholars.

Genre on IMDb is handled in ways that are not useful for analysis because terms are used inconsistently. There is not enough inter-rater reliability, the tags are misleading, and scholars do not all agree on how to use them. As Deb Verhoeven states in “Mapping the Movies,” a project like her team’s “only works if the existing data collection is both sufficiently comprehensive and thoroughly reliable, since it will have to be accepted by all partners” (Verhoeven). There is little consistent agreement on genre among scholars or fans, as borne out in genre theory. What IMDb designates as “genres” actually combines traditional genres, subgenres, and target audience categories. As IMDb allows those who enter the data to select any number of these terms, and many fans enjoy labeling media items to fit multiple lists, it becomes impossible to analyze media items using IMDb’s categories. In large part this is because the database relies heavily on users (sometimes cast/crew members, agents, producers, and fans) for its data and for much editing.
As Wasserman et al. point out, “Although user editing allows a reference website such as IMDb to be up-to-date, it diffuses the responsibility for fact-checking, leading to greater uncertainty about accuracy and objectivity of information” (Wasserman).

It has taken significant additional research and reorganization to use the data effectively because, as media researchers Marsden et al. explain, there is not enough agreement about metadata. While most people can tell a Western from Science Fiction, IMDb makes it more difficult to deal with hybrid genres such as Dramedies or Family movies, or with cases where a particular movie or show combines genres, such as superimposing Western generic concepts onto Science Fiction, or an Action/Adventure movie with a strong romantic plot. The schemas used by the Library of Congress, Netflix, Amazon and others are too reductive or imprecise for our purposes.

Therefore, we not only had to create a taxonomy with a variety of categories, including subjects, styles, modes, and purposes, but also had to write our own concise definitions for these categories. We will share parts of our schema in this presentation.
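
As an illustration only, a faceted vocabulary of the kind described above might be represented along the following lines. The four facet names (subject, style, mode, purpose) come from the abstract; the term labels, definitions, class and function names are hypothetical and are not taken from the authors' schema. The sketch shows two ideas from the text: every term carries a concise definition, and a hybrid item (for example, a Western set in a Science Fiction world) can hold several terms that remain checkable against one shared vocabulary.

```python
from dataclasses import dataclass

# Facet names follow the abstract; everything else below is a hypothetical example.
FACETS = ("subject", "style", "mode", "purpose")


@dataclass(frozen=True)
class Term:
    facet: str
    label: str
    definition: str


class Vocabulary:
    """A controlled vocabulary in which every term belongs to one facet."""

    def __init__(self):
        self._terms = {}

    def add(self, facet, label, definition):
        if facet not in FACETS:
            raise ValueError(f"unknown facet: {facet}")
        if not definition.strip():
            raise ValueError("every term needs a definition")
        self._terms[(facet, label)] = Term(facet, label, definition)

    def validate(self, coding):
        """Return a list of problems found in one coder's coding of an item."""
        problems = []
        for facet, labels in coding.items():
            for label in labels:
                if (facet, label) not in self._terms:
                    problems.append(f"{facet}:{label} is not in the vocabulary")
        return problems


def coder_overlap(coding_a, coding_b):
    """Jaccard overlap of two codings: shared terms divided by all terms used."""
    pairs_a = {(f, l) for f, labels in coding_a.items() for l in labels}
    pairs_b = {(f, l) for f, labels in coding_b.items() for l in labels}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Hypothetical terms and definitions, for illustration only.
vocab = Vocabulary()
vocab.add("subject", "Western", "Set on or evoking the American frontier.")
vocab.add("subject", "Science Fiction", "Premised on speculative science or technology.")
vocab.add("mode", "Dramedy", "Blends comic and dramatic registers.")

# Two coders describe the same hybrid item.
coder_1 = {"subject": ["Western", "Science Fiction"]}
coder_2 = {"subject": ["Western"]}

problems = vocab.validate(coder_1)
overlap = coder_overlap(coder_1, coder_2)
```

A simple overlap score like this is only a rough stand-in for a formal inter-rater reliability measure, but it shows why a shared vocabulary matters: codings can only be compared when both coders draw on the same defined terms.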


Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at

Data for this conference were initially prepared and cleaned by May Ning.

Conference website:


Series: ADHO (15)

Organizers: ADHO