Criticism Mining: Text Mining Experiments on Book, Movie and Music Reviews

Paper
Authorship
  1. Xiao Hu, University of Illinois, Urbana-Champaign
  2. J. Stephen Downie, University of Illinois, Urbana-Champaign
  3. M. Cameron Jones, University of Illinois, Urbana-Champaign

Work text

1. INTRODUCTION
Many networked resources now provide critical consumer-generated reviews of humanities materials, including online stores, review websites, and various forums such as public and private blogs, mailing lists and wikis. Many of these reviews are quite detailed, covering not only the reviewers’ personal opinions but also important background and contextual information about the works under discussion. Humanities scholars should be able to easily gather and then analytically examine these reviews to determine, for example, how users are impacted and influenced by humanities materials. Because the ever-growing volume of consumer-generated review text precludes simple manual selection, the time has come to develop robust automated techniques that assist humanities scholars in the location, organization and analysis of critical review content. To this end, the authors have
conducted a series of very promising large-scale experiments
that bring to bear powerful text mining techniques to
the problem of “criticism analysis”. In particular, our
experimental results concerning the application of the Naïve
Bayes text mining technique to the “criticism analysis”
domain indicate that “criticism mining” is not only feasible but
also worthy of further exploration and refinement. In short,
our results suggest that the formal development of a
“criticism mining” paradigm would provide humanities scholars with a sophisticated analytic toolkit that will open rewarding new avenues of investigation and insight.
2. EXPERIMENTAL SETUP
Our principal experimental goal was to build and then evaluate a prototype criticism mining system that could automatically predict:
1) the genre of the work being reviewed (Experimental Set 1 (ES1));
2) the quality rating assigned to the reviewed item (ES2);
3) whether a review concerns a book or a movie, especially for items in the same genre (ES3);
4) whether a book review concerns fiction or non-fiction (ES4).
In this work, we focused on the movie, book and music reviews published on www.epinions.com, a website devoted to consumer-generated reviews. Each review on epinions.com is associated with both a genre label and a numerical quality rating expressed as a number of stars (from 1 to 5), with higher ratings indicating more positive opinions. The genre labels and the rating information provided the ground truth for the experiments. We selected and downloaded 1800 book reviews, 1650 movie reviews and 1800 music reviews from the most popular genres represented on epinions.com. As in our earlier work (Hu et al. 2005), the distribution of reviews across genres and ratings was made as even as possible to eliminate analytic bias. Each review contains a title, the reviewer’s star rating of the item, a summary, and the full review content. To make our criticism mining approach generalizable to other sources of criticism materials, we only processed the full review text and the star rating information.
Figure 1 illustrates the movie, book and music genre taxonomies used in our experiments.

Figure 1: Book, movie and music genres from epinions.com used in the experiments. Genres with the same superscripts are overlapping genres used in the “Books vs. Movie Reviews” experiments (ES3).
The same data preprocessing and modeling techniques were applied to all experiments. HTML tags were removed, and the documents were tokenized. Stop words and punctuation marks were not stripped, as previous studies suggest these provide useful stylistic information (Argamon and Levitan 2005; Stamatatos et al. 2000). Tokens were stemmed to unify different forms of the same word (e.g., plurals). Documents were represented as vectors in which each attribute value was the frequency of occurrence of a distinct term. The classification model was a Naïve Bayes text classifier, which has been widely used in text mining because of its robustness and computational efficiency (Sebastiani 2002). The experiments were implemented in the Text-to-Knowledge (T2K) framework, which facilitates the fast prototyping of text mining techniques (Downie et al. 2005).
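To make this pipeline concrete, here is a minimal sketch in Python using scikit-learn and NLTK rather than the T2K framework the experiments actually used. The HTML stripping, retained stop words and punctuation, stemming, and term-frequency vectors follow the description above; all names and the data loading are illustrative assumptions.

    # Minimal sketch of the pipeline described above, using scikit-learn
    # and NLTK instead of the authors' T2K framework. All identifiers are
    # illustrative assumptions, not the authors' actual code.
    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    stemmer = PorterStemmer()

    def strip_html(text):
        # Remove HTML tags and lowercase, per the preprocessing step above.
        return re.sub(r"<[^>]+>", " ", text).lower()

    def stem_tokens(text):
        # Whitespace tokenization plus stemming; stop words and punctuation
        # are deliberately retained as stylistic evidence.
        return [stemmer.stem(tok) for tok in text.split()]

    model = make_pipeline(
        CountVectorizer(preprocessor=strip_html, tokenizer=stem_tokens,
                        token_pattern=None),  # raw term-frequency vectors
        MultinomialNB(),                      # Naive Bayes text classifier
    )
    # model.fit(review_texts, genre_labels)   # hypothetical training call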
3. GENRE CLASSIFICATION TESTS (ES1)
Figure 2a provides an overview of the genre classification tests. The confusion matrices (Figures 2b, 2c and 2d) illustrate which genres are more distinguishable from the others and which are more prone to misclassification. Bolded values represent the successful classification rate for each medium (Figure 2a) or genre (Figures 2b, 2c and 2d).

Figure 2: Genre classification data statistics, results and confusion matrices. The first rows in the confusion matrices represent predictions (P); the first columns represent ground truth (T). 5-fold random cross-validation on book and movie reviews; 3-fold random cross-validation on music reviews.
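As an illustration of this evaluation protocol, the following sketch shows how cross-validated predictions and a confusion matrix could be computed with scikit-learn, continuing the hypothetical `model`, `texts` and `genres` objects from the earlier sketch. Note that scikit-learn's convention is the transpose of the figures here: rows are ground truth, columns are predictions.

    # Hedged sketch of the evaluation: k-fold cross-validated predictions
    # plus a confusion matrix. `model`, `texts` and `genres` are the
    # hypothetical objects from the pipeline sketch above.
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import accuracy_score, confusion_matrix

    predictions = cross_val_predict(model, texts, genres, cv=5)  # 5 random folds
    print(accuracy_score(genres, predictions))    # overall classification rate
    print(confusion_matrix(genres, predictions))  # rows: truth; cols: prediction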
As Figure 2a shows, the overall precisions are impressively high (67.70% to 78.89%) compared to the random-selection baseline (8.33% to 11.11%, i.e., one divided by the number of genres in each medium). The identification of some genres is very reliable, e.g., “Music & Performing Arts” book reviews (89%) and “Children” movie reviews (95%). Some understandable confusions are also apparent, e.g., “Documentary” and “Education” movie reviews (31% confusion). High confusion values appear to indicate that such genres semantically overlap. Furthermore, such confusion values may also indicate pairs of genres that create similar impressions and impacts on users. For example, there may be a formal distinction between the “Documentary” and “Education” genres, but the two appear to affect significant numbers of users in similar, interchangeable ways.
4. RATING CLASSIFICATION TESTS (ES2)
We first tested the classification of reviews according to quality rating as a five-class problem (i.e., classes representing the individual ratings of 1, 2, 3, 4 and 5 stars). Next we conducted two binary classification experiments: 1) negative versus positive review “group” identification (i.e., 1 or 2 stars versus 4 or 5 stars); and 2) ad extremis identification (i.e., 1 star versus 5 stars). Figure 3 presents the dataset statistics, the corresponding results and the confusion matrices.

Figure 3: Rating classification data statistics, results and confusion matrices. The first rows in the confusion matrices represent predictions (P); the first columns represent ground truth (T). 5-fold random cross-validation on book and movie reviews; a single iteration on music reviews.
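For concreteness, here is a small sketch of how the two binary tasks could be derived from the five-star labels. The grouping follows the text above; the function name and toy data are assumptions.

    # Sketch of deriving the two binary rating tasks from five-star labels.
    # `reviews` is a tiny illustrative stand-in for the real dataset.
    reviews = [("great read", 5), ("mediocre", 3), ("awful", 1), ("good", 4)]

    def group_rating(stars):
        """Negative-vs-positive grouping: 1-2 stars vs. 4-5 stars.
        3-star reviews fall outside both groups and are excluded."""
        if stars in (1, 2):
            return "negative"
        if stars in (4, 5):
            return "positive"
        return None

    grouped = [(text, group_rating(s)) for text, s in reviews if group_rating(s)]
    # Ad extremis variant: keep only the 1-star and 5-star reviews.
    extremes = [(text, s) for text, s in reviews if s in (1, 5)]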
The classification precision scores for the binary rating tasks are quite strong (80.13% to 86.25%), while the five-class scores are substantially weaker (36.70% to 44.82%). However, the five-class confusion matrices show that the system’s errors are “reasonable”: it mostly confuses adjacent categories (e.g., 1 star with 2 stars, 4 stars with 5 stars).
5. MOVIE VS. BOOK REVIEW TESTS (ES3)
We first ran a binary classification experiment with movie and book reviews of all genres. We then compared reviews in each of the six genres common to books and movies. To prevent oversimplification of the classification task, we eliminated words that directly suggest the categories: “book”, “movie”, “fiction”, “film”, “novel”, “actor”, “actress”, “read”, “watch”, “scene”, etc. Eliminated terms were selected from those which occurred most frequently in one category but not both.

Figure 4: Overview statistics of book and movie review classification experiments. All results are from 5-fold random cross-validation.
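A plausible sketch of this term-elimination step is given below: rank terms by frequency within each category and flag those that rank highly in exactly one of the two. The paper does not report its exact selection cutoff, so `top_n` is an assumption.

    # Sketch of the suggestive-term filter: terms frequent in one category
    # but not both are candidates for elimination. `top_n` is an assumed
    # cutoff; the paper does not specify one.
    from collections import Counter

    def suggestive_terms(book_docs, movie_docs, top_n=200):
        # Each *_docs argument is a list of tokenized reviews.
        book_freq = Counter(tok for doc in book_docs for tok in doc)
        movie_freq = Counter(tok for doc in movie_docs for tok in doc)
        book_top = {term for term, _ in book_freq.most_common(top_n)}
        movie_top = {term for term, _ in movie_freq.most_common(top_n)}
        # Symmetric difference: frequent in exactly one of the two categories.
        return (book_top | movie_top) - (book_top & movie_top)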
The results in Figure 4 show the classifier is remarkably accurate (consistently above 94.28% precision) in distinguishing movie reviews from book reviews, both across mixed genres and within single genre classes. A post-experiment examination of the reviews confirmed that the results were not simply based upon suggestive terms like those we had eliminated pre-experiment. It can therefore be inferred that users criticize books and movies in quite different ways. This is an important finding, and identifying the key features contributing to such differences is a task for future work.
6. FICTION VS. NON-FICTION BOOK REVIEW TEST (ES4)
As in ES3, we eliminated such suggestive words as “fiction”, “non”, “novel”, “character”, “plot”, and “story” after examining high-frequency terms of each category. The classification results are shown in Figure 5.

Figure 5: Fiction and non-fiction book review classification data statistics, results and confusion matrix. The first row in the confusion matrix represents predictions (P); the first column represents ground truth (T). Results are from 5-fold random cross-validation.
The precision of 94.67% not only verifies that our system performs well on this task but also indicates that reviews of the two categories differ substantially. It is also noteworthy that more non-fiction book reviews (9%) were mistakenly predicted as fiction reviews than the other way around (2%). Closer analysis of the features causing this asymmetry is left for future work.
7. CONCLUSIONS AND FUTURE WORK
Consumer-generated reviews of humanities materials represent a valuable research resource for humanities scholars. Our series of experiments on the automated classification of reviews verifies that important information about the materials being reviewed can be extracted using text mining techniques. All our experiments were highly successful in terms of both classification accuracy and the logical placement of confusion in the confusion matrices. Thus, “criticism mining” based upon the relatively simple Naïve Bayes model has been shown to be both viable and robust. This finding promises to make the ever-growing consumer-generated review resources useful to humanities scholars.
In our future work, we plan to broaden our understanding by exploring text mining techniques beyond the Naïve Bayes model (e.g., decision trees, neural networks, support vector machines). We will also work towards a system that can automatically mine arbitrary bodies of critical review text, such as blogs, mailing lists, and wikis. We also hope to conduct content and ethnographic analyses to help answer the “why” questions that pertain to these results.
References:
Argamon, S., and Levitan, S. (2005). Measuring the Usefulness of Function Words for Authorship Attribution. Proceedings of the 17th Joint International Conference of the ACH and ALLC.
Downie, J. S., Unsworth, J., Yu, B., Tcheng, D., Rockwell, G., and Ramsay, S. J. (2005). A Revolutionary Approach to Humanities Computing?: Tools Development and the D2K Data-Mining Framework. Proceedings of the 17th Joint International Conference of the ACH and ALLC.
Hu, X., Downie, J. S., West, K., and Ehmann, A. (2005). Mining Music Reviews: Promising Preliminary Results. Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR).
Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1-47.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Text Genre Detection Using Common Word Frequencies. Proceedings of the 18th International Conference on Computational Linguistics (COLING).


Conference Info


ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tübingen, at the ALLC/ACH conference in 2002; a Steering Committee was appointed at the ALLC/ACH meeting in 2004 in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None