Tracking Culture on the Web; An Experiment

Stéfan Sinclair; Michael Picheca

Authorship

1. Stéfan Sinclair

McGill University, McMaster University, Department of Languages, Literatures & Cultures - University of Alberta

2. Michael Picheca

Computing and Software - McMaster University

Work text


A couple of years ago a friend suggested that a way to get people
interested in trends in popular culture would be to create an online
cultural stock market or "horse race" where people could bet on cultural
items and then watch their stock go up or down as the objects rose or fell
in popularity. The problem was how to measure the popularity of a cultural
item like "Star Wars" or "XML". It wasn't until later that it occurred to
us that we could ask WWW search engines for the number of pages indexed that
included the word or phrase in question as a way of tracking the relative
popularity of the item. The problem was whether and how one could
systematically gather such information from search engines and whether such
information would provide a reliable guide to cultural shifts. This paper
reports on a two-stage experiment that we ran, first to design a system
capable of tracking items, and second to gather a significant amount of
data so as to see if the system did in fact reflect known events in popular
culture and contemporary ideas. In effect we wanted to see if we could
treat the WWW as an enormous text corpus with the search engines as our
text-analysis tools for the purpose of cultural and intellectual study.

In the presentation we will do three things:

1. We will discuss the case for using the WWW to track ideas and culture.

2. We will report on the initial tests of a system for tracking selected
items and the resulting design.

3. We will report on the results of a four-month study during which we
gathered data daily on selected items and compared it to known events.

In a paper that Rockwell and Bradley gave at the ALLC-ACH conference in
Paris in 1994, "A Growing Fascination With Dialogue: Bibliographic
Databases and the Recent History of Ideas", they reported on a technique of
using bibliographic databases for tracking the recent history of ideas.(1)
In that paper they argued that databases provide evidence of changes in the
symptoms of intellectual culture comparable to the types of evidence that
epidemiologists use to track epidemics.(2) The problem with the technique
they used is that bibliographic databases reflect academic work not popular
and commercial subculture. Bibliometrics is useful for tracking
bibliographic trends, less so for cultural trends. The WWW on the other
hand can be argued to be a better reflection of popular and commercial
subculture. The WWW has the additional advantage that it is the work of the
millions who write WWW pages and is therefore not keyworded by experts who
might, as cataloguers do in the case of bibliographic databases, impose
their organizational categories on the evidence. The WWW is a significant
expression of North American culture and therefore better represents the
relative complexity of the whole blooming buzzing confusion. Further, the
WWW is already digitized and in a relatively standardized format so that it
can be searched and indexed as a growing whole (albeit with great difficulty). It
is the accessibility of the WWW to quantitative methods that makes it ideal
for tracking the movement of ideas and popular culture.

The problem with the WWW as evidence is that it is not conveniently
organized into a database that one can search diachronically. For this
reason we turned to the popular search engines that index WWW pages and
provide statistics on demand as a reasonable source of evidence for the WWW
as a whole. This is not an original idea; in the presentation we will show
some "voyeur" pages that let you see which popular terms others are
searching for.(3) Unfortunately, when we contacted the owners of such
pages to see if they would collaborate with us we were rebuffed. Such
statistics are a closely guarded secret with commercial value. The tactic
we settled on was to test the feasibility of a system that would gather
statistics from the search engines for terms we chose. We ran a series of
tests to see if we could gather data regularly from the search engines. We
also wanted to see what the resource implications of such a system were,
given that ultimately we might want to track thousands of items. One of the
results of our tests was that we found that statistics on
news articles gave us greater variation and more detail over short periods.
In effect WWW page statistics seem to be useful for tracking long term
change while news articles are more responsive to short term changes. In
the presentation we will review the tests and resulting statistics from
this first phase.
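
As an illustration of the kind of extraction such a system needs, the following Python sketch pulls a hit count out of the text of a results page. The wording of the summary line, the pattern COUNT_PATTERN, and the function name extract_hit_count are all assumptions made for this example; each engine phrases its counts differently, and this is not the code of the system described here.

```python
import re

# Assumed phrasing of a results summary line (hypothetical); real engines
# word their counts differently and would each need their own pattern.
COUNT_PATTERN = re.compile(
    r"(\d[\d,]*)\s+(?:results|matches|documents)", re.IGNORECASE
)


def extract_hit_count(page_text):
    """Return the first hit count found in page_text, or None if absent."""
    match = COUNT_PATTERN.search(page_text)
    if match is None:
        return None
    # Strip thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))
```

Returning None rather than zero when no count is found keeps a failed extraction distinguishable from a term that genuinely has no hits.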

Once we demonstrated that this could be done we built a system that
gathered data from three search engines (Excite, Yahoo, and Thunderstone)
on both WWW page statistics and news article statistics for a selection of
words and phrases in three areas: guitars and popular music, popular movies
and characters, and text-analysis and markup languages. The system gathers
these statistics every night and writes them to a database to which we
built a front end that can plot items over time. (In the presentation we
will demonstrate the WWW front end where viewers can view the data by item,
by search engine, and by date range.) The system was run for four months
(September 1999 to January 2000) and the data was exported to Excel for
further analysis. In a number of cases we found a clear correlation between
spikes in activity and known events. For example, the release of the latest
James Bond movie was reflected in the data we gathered by a dramatic surge
in hits for news articles for the phrase. In the presentation we will
present the statistics for selected items over the period and comment on
what events we believe these statistics reflected. We believe these
correspondences empirically demonstrate the usefulness of this technique.
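
The nightly store and the front-end views (by item, by search engine, and by date range) could be sketched roughly as follows. This is a minimal illustration using SQLite from the Python standard library; the table name hits, its columns, and the functions open_store, record, and series are hypothetical and do not describe the database actually used.

```python
import sqlite3

# Hypothetical schema: one row per nightly observation of a term on one
# engine, for either WWW pages ('web') or news articles ('news').
SCHEMA = """
CREATE TABLE IF NOT EXISTS hits (
    day       TEXT NOT NULL,
    engine    TEXT NOT NULL,
    source    TEXT NOT NULL,
    term      TEXT NOT NULL,
    hit_count INTEGER NOT NULL,
    PRIMARY KEY (day, engine, source, term)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, day, engine, source, term, hit_count):
    """Save one observation; re-running a night overwrites that night's row."""
    conn.execute(
        "INSERT OR REPLACE INTO hits VALUES (?, ?, ?, ?, ?)",
        (day, engine, source, term, hit_count),
    )
    conn.commit()


def series(conn, term, engine, source, start, end):
    """Return (day, hit_count) pairs for one term, engine, and date range."""
    rows = conn.execute(
        "SELECT day, hit_count FROM hits "
        "WHERE term = ? AND engine = ? AND source = ? "
        "AND day BETWEEN ? AND ? ORDER BY day",
        (term, engine, source, start, end),
    )
    return rows.fetchall()
```

Storing days as ISO dates (for example 1999-11-01) lets plain string comparison select a date range, and the composite primary key keeps one value per term, engine, source, and night.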

There are a number of theoretical problems with this approach to tracking
ideas and cultural items that will be discussed in the presentation. We
summarize them here with questions and tentative answers:

1. Does the WWW reflect popular culture or does it reflect only the culture
of the community of its authors? While the WWW undoubtedly reflects only
the interests of a geographically and economically limited set of people,
it is still an enormous body of evidence, and there are few alternatives if
one wishes to avoid impressionistic studies or reliance on "top ten" lists
generated by the media on selected topics. Whatever the degree to which the WWW reflects
popular culture there are enough people authoring WWW pages to argue that
it is interesting what they are writing about if one can accurately measure
it.

2. Do counts of words or phrases accurately reflect interest in culture?
This is a problem common to any form of text-analysis. Certain cultural
topoi are no doubt not going to be tracked by a system that searches only
for words and phrases; that is what cultural historians are for.
Disambiguation is another problem. That said, we believe that information
about key words and phrases over time could provide useful evidence for
more sophisticated interpretation and aggregation if it can be shown to be
something that can be gathered and if it can be shown to reflect real
events. Further, searching for key words and phrases is the way many people
access information on the WWW so there is some justification for using this
approach.

More importantly our system is designed to track "hits" over time. We
believe that what is important is not the number of "hits" for an item, but
changes in the number and comparisons between items. It is hard to say what
it means if there are over 4 million WWW pages devoted to the "Spice
Girls", but if that number changes dramatically that may indicate a change
in interest in the subject.
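
The emphasis on change rather than absolute counts can be made concrete with a small Python sketch. It assumes a list of daily hit counts for a single item; the function names relative_changes and spikes and the default threshold of 0.5 (a rise of more than 50 percent in one day) are illustrative choices, not values taken from the study.

```python
def relative_changes(counts):
    """Return the day-over-day fractional change for a list of hit counts."""
    changes = []
    for prev, curr in zip(counts, counts[1:]):
        if prev == 0:
            # A fractional change from zero is undefined.
            changes.append(None)
        else:
            changes.append((curr - prev) / prev)
    return changes


def spikes(counts, threshold=0.5):
    """Return the indices of days whose count rose by more than threshold
    (as a fraction of the previous day's count)."""
    return [
        i + 1
        for i, change in enumerate(relative_changes(counts))
        if change is not None and change > threshold
    ]
```

Because the changes are expressed as fractions of the previous day's count, series for items of very different absolute size can be compared on the same scale.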

3. Can we trust data gathered from search engines that are not open to
scrutiny? Certainly not. This is why we can't use the "voyeur" pages and
why we gathered statistics from more than one engine and for both WWW pages
and news articles. That said, the search engines are the best source of
statistics (without building our own spiders) and their statistics do seem
to match known events. Further, as mentioned above, the search engines are
used to find information - they are part of the culture of the Internet and
consequently would have to be taken into account anyway.

4. Is it ethical to gather such statistics from search engines? Given the
general climate of concern about the gathering and aggregation of data on
the Internet it is worth asking whether this system or ones like it pose
ethical problems. As we do not gather information about individual WWW
pages or authors it is unlikely that such a system could be used to predict
more than general trends, but the fact remains that systems like ours could
be designed to track the interests of an individual author over time. A
more pressing issue we faced arose when, in our initial tests to measure
the correlation between the number of items searched for and the time
it took to conclude the searches, our server was denied access by one
particular search engine, Northern Light. In an e-mail exchange they
explained that their service was available for humans not robots and that
such robot searches depress click-through advertising rates which in turn
could cause financial harm to their investors. Respecting their wishes, we
dropped them from the list of engines to search.

Pragmatically, the best way to test the value of such statistics was to
implement the system and see if the results correlated with events, which
they did in the cases we could confirm. We believe our experiment shows that such
statistics can be a useful monitor of changes in culture, with certain
reservations, and we will conclude by discussing how we plan to implement
the system for targeted research and as a teaching tool. The system was
built so that it can now become a module in a larger system where students
could play at investing in culture or researchers can provide lists of
terms related to a field to track over a given period.

Notes

(1) Rockwell, Geoffrey and John Bradley, "A Growing Fascination With
Dialogue: Bibliographic Databases and the Recent History of Ideas" was
presented at the ALLC-ACH '94 conference in Paris in April 1994.

(2) For more on the epidemiology of culture and ideas see Dan Sperber,
_Explaining Culture: A Naturalistic Approach_. Oxford: Blackwell, 1996.
Sperber's project is more ambitious than ours and differs in interesting
ways. For him the epidemiology of culture should help us generate natural
explanations of how ideas are transmitted by linking cognitive psychology
and sociology/anthropology. We are skeptical that a technique such as ours
could do this. At best it can accurately track changes in the symptoms of
culture, not explain what is happening in the minds of people.

(3) See www.searchterms.com. The idea that search engines are gathering
information about trends that could be used is not original. Eric Knight,
the owner of Searchterms.com, describes the Internet as a "cultural
barometer". The companies that run the search engines no doubt gather and
sell information about market trends; unfortunately they don't make it
available for academic study.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

94 works by 167 authors indexed

Affiliations need to be double-checked.

Conference website: https://web.archive.org/web/20011127030143/http://www.nyu.edu/its/humanities/ach_allc2001/

Attendance: 289 (https://web.archive.org/web/20011125075857/http://www.nyu.edu/its/humanities/ach_allc2001/participants.html)

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
