Using Temporal Information on Topic Mining

paper, specified "long paper"
Authorship
  1. Takeaki Uno

    National Institute of Informatics, Japan

  2. Ryota Kobayashi

    University of Tokyo

  3. Yuka Takedomi

    National Institute of Informatics, Japan

  4. Takako Hashimoto

    Chiba University of Commerce

Work text


Introduction
Analyzing social media data (e.g., Twitter and Reddit) is essential for characterizing and understanding social and political problems. In practice, it is impossible to read and digest all social media posts manually. Thus, the underlying topics must be extracted from the text automatically.

Clustering is a suitable tool for extracting topics from social media data. However, it remains challenging to discover interpretable topics from a corpus of noisy, short texts such as social media posts. A limitation of existing clustering methods, such as topic models (e.g., LDA [1]), is that they extract topics based only on word information (e.g., word frequency in the post). A notable feature of social media data is that it is streaming data: each item consists of a text and a timestamp (the posted time). However, it is not obvious how to utilize this temporal information to find interpretable topics. For instance, we could classify tweets based only on their timestamps, but this would not yield interpretable topics, because many users post tweets on various topics at the same time. Here we propose a clustering algorithm that utilizes both the word and the temporal information of the posts.

Proposed Clustering Algorithm

We propose a two-stage clustering algorithm that discovers coarse-grained topics by leveraging both textual and temporal information [2]. Suppose that a huge volume of Twitter data, consisting of text bodies (tweets) and timestamps (posted times), is available.

In the first stage, we extract fine-grained topics (micro-clusters) by clustering the word co-occurrence graph of the tweets. We used the Data Polishing algorithm [3,4] to obtain the micro-clusters, that is, groups of tweets sharing a fine-grained topic. The Data Polishing algorithm was modified to account for the bias in the tweet data caused by the large number of retweets; this modification also improved the computational efficiency of the algorithm significantly. The temporal pattern of a micro-cluster is obtained by computing the posting frequency of all the tweets in that micro-cluster.

In the second stage, we discover the coarse-grained topics (e.g., a reaction to breaking news) by clustering the time series of the micro-clusters. We used K-Spectral Centroid (K-SC) clustering [5] to group micro-clusters that share a similar temporal pattern. The K-SC algorithm is an extension of K-means that is robust to scaling and shifting of the time series. If several fine-grained topics originate from the same item of news, we can expect their temporal patterns to be similar; this is the motivation for focusing on the temporal pattern to find coarse-grained topics.
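
To make the pipeline concrete, the following is a minimal sketch in Python of the two-stage procedure described above. It is illustrative only: the first stage approximates Data Polishing with a simple word co-occurrence graph and its connected components, and the second stage replaces the full K-SC procedure with a K-means-style loop using a shift- and scale-invariant distance. The helper names (micro_clusters, time_series, ksc_distance, cluster_time_series), the tweet field names, and the dependencies on numpy and networkx are assumptions for this sketch, not part of the original work.

import itertools
import numpy as np
import networkx as nx


def micro_clusters(tweets, min_shared_words=2):
    """Stage 1 (stand-in): group tweets whose word co-occurrence is dense."""
    graph = nx.Graph()
    token_sets = [set(t["text"].lower().split()) for t in tweets]
    graph.add_nodes_from(range(len(tweets)))
    for i, j in itertools.combinations(range(len(tweets)), 2):
        if len(token_sets[i] & token_sets[j]) >= min_shared_words:
            graph.add_edge(i, j)
    # Connected components approximate the dense micro-clusters that
    # Data Polishing would extract as pseudo-cliques.
    return [sorted(c) for c in nx.connected_components(graph) if len(c) > 1]


def time_series(cluster, tweets, n_bins=48):
    """Posting-frequency histogram (temporal pattern) of one micro-cluster."""
    stamps = np.array([tweets[i]["timestamp"] for i in cluster], dtype=float)
    hist, _ = np.histogram(stamps, bins=n_bins)
    return hist.astype(float)


def ksc_distance(x, y):
    """Shift- and scale-invariant distance in the spirit of K-SC:
    min over shifts q and scalings alpha of ||x - alpha * shift(y, q)|| / ||x||."""
    best = np.inf
    for q in range(len(y)):
        y_q = np.roll(y, q)
        denom = y_q @ y_q
        alpha = (x @ y_q) / denom if denom > 0 else 0.0
        d = np.linalg.norm(x - alpha * y_q) / (np.linalg.norm(x) + 1e-12)
        best = min(best, d)
    return best


def cluster_time_series(series, k, n_iter=20, seed=0):
    """Stage 2 (stand-in): K-means-style clustering with the K-SC distance."""
    rng = np.random.default_rng(seed)
    centers = [series[i] for i in rng.choice(len(series), size=k, replace=False)]
    labels = np.zeros(len(series), dtype=int)
    for _ in range(n_iter):
        labels = np.array([
            np.argmin([ksc_distance(s, c) for c in centers]) for s in series
        ])
        for j in range(k):
            members = [series[i] for i in range(len(series)) if labels[i] == j]
            if members:  # plain mean as a simplified centroid update
                centers[j] = np.mean(members, axis=0)
    return labels


# Example usage on a toy data set (hypothetical field names "text" and "timestamp"):
# tweets = [{"text": "vaccine news today", "timestamp": 1.0}, ...]
# clusters = micro_clusters(tweets)
# series = [time_series(c, tweets) for c in clusters]
# labels = cluster_time_series(series, k=3)

Note that the centroid update above is a plain mean for brevity; the published K-SC algorithm instead computes each centroid as the minimizer of the scale-invariant distance via an eigenvector computation.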

First, the proposed algorithm was applied to large-scale Twitter data related to the "COVID-19 vaccine" [2]. We discovered three types of coarse-grained topics in the Twitter data (26 million tweets): A. reactions to the news (8 topics), B. reactions to tweets (2 topics), and C. others (2 topics). While topics of types A and B exhibit a clear peak in their temporal pattern, type C does not exhibit peaks: it shows persistent activity. This type comprises rumors, fake news, alerts about fake news, jokes, tips, etc., which are continuously posted on Twitter. Notably, the Data Polishing method can extract even such less popular topics.

Next, we evaluated the computational efficiency of the proposed algorithm: how fast does the algorithm process the data? We compared the proposed algorithm with five existing algorithms for finding topics from tweets: LDA [1], K-means, MeanShift, agglomerative clustering, and Data Polishing [3, 4]. We found that the proposed algorithm is much faster than the existing methods. For instance, the proposed algorithm is more than 300 times faster than LDA on a data set of 100,000 tweets.

Bibliography

[1] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55:77–84.

[2] Hashimoto, T., et al. (2021). Two-stage Clustering Method for Discovering People's Perceptions: A Case Study of the COVID-19 Vaccine from Twitter. IEEE BigData 2021.

[3] Hashimoto, T., et al. (2021). Analyzing Temporal Patterns of Topic Diversity using Graph Clustering. The Journal of Supercomputing, 77:4375–4388.

[4] Uno, T., et al. (2017). Micro-clustering by Data Polishing. IEEE BigData 2017, 1012–1018.

[5] Yang, J. and Leskovec, J. (2011). Patterns of temporal variation in online media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 177–186.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO