Emerging Topic Detection in Twitter Stream Based On High Utility Pattern Mining

The paper presents a novel method for detecting topics in Twitter streams using High Utility Pattern Mining (HUPM), which integrates word frequency and utility based on growth rates. It introduces a dynamic minimum utility threshold and a Topic-tree (TP-Tree) for post-processing to refine candidate topic patterns. Experimental results show that this method outperforms existing techniques in terms of topic recall and efficiency across various datasets.

Uploaded by

Huyen Ngoc

EMERGING TOPIC DETECTION IN TWITTER STREAM BASED ON HIGH UTILITY PATTERN MINING
Hyeok-Jun Choi - Cheong Hee Park
1. Introduction
The process of extracting and summarizing trending issues in the form of useful information is called topic detection.
The paper proposes a topic detection method for Twitter using High Utility Pattern Mining (HUPM). The proposed method considers both the frequency of words over tweets and the utility of words, which is defined based on the growth rate in appearance frequency. A technique to dynamically determine the minimum utility threshold for each chunk of tweets is presented.
The paper also defines a Topic-tree (TP-Tree) for post-processing, which extracts actual topic patterns from the candidate topic patterns generated by HUPM.
2. Related Work
2.1. Feature-pivot methods
• In feature-pivot based methods, a topic is expressed as a group of words, and the goal is to determine the word groups that appear simultaneously in a document set.
• Representative methods: (Mathioudakis & Koudas, 2010), (Weng & Lee, 2011), (Zhang et al., 2010), (Petkos et al., 2014; Huang et al., 2015; Gaglio et al., 2015), (Li et al., 2012; Aiello et al., 2013), (Erra et al., 2015).
• The method treats hashtag words as ordinary terms, so popular hashtag words can be used as the words representing emerging topics.

2.2. Document-pivot methods
• Document-pivot based methods are generally characterized by the method used to represent documents and to measure the similarity between documents.
• Representative methods: (Phuvipadawat & Murata, 2010; Sankaranarayanan et al., 2009; O’Connor et al., 2010), (Becker et al., 2011), (Petrovic et al., 2010), (Zhou & Chen, 2014).

2.3. Probabilistic topic model
• A topic is expressed as a probability distribution over words, and documents are considered to be probability distributions over topics.
• Representative methods: (Quercia et al., 2012; Blei et al., 2003; Hofmann, 1999), (Kim et al., 2012).
3. High Utility Pattern Mining
• Definition 1 [Transaction table]: If a tweet is considered to be a transaction, the words in the tweet can be treated as items, together with their frequencies in the tweet.
• Definition 2 [External utility, Internal utility, Utility]: The external utility of item i, expressed as eu(i), is the value of the item. The internal utility of item i, expressed as iu(i,T), is the frequency of the item in transaction T. The utility of item i for the transaction T, u(i,T), is:
$u(i,T) = eu(i) \times iu(i,T)$ (1)
• Definition 3 [Itemset utility, Transaction utility, Transaction-weighted utility]: Let X denote a subset of the items included in transaction T. The utility of the itemset X, u(X,T), the transaction utility of the transaction T, tu(T), and the transaction-weighted utility of X, twu(X), are:
$u(X,T) = \sum_{i \in X} u(i,T)$ (2)
$tu(T) = \sum_{i \in T} u(i,T)$ (3)
$twu(X) = \sum_{T : X \subseteq T} tu(T)$ (4)
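The definitions above can be illustrated with a minimal Python sketch. The toy transactions, the external utility values, and the function names (`u_item`, `u_itemset`, `tu`, `twu`) are hypothetical and chosen only for illustration; they follow the standard HUPM definitions of utility, transaction utility, and transaction-weighted utility.

```python
from typing import Dict, Iterable, List

# Hypothetical toy data: each transaction (tweet) maps a word to its
# internal utility iu(i, T), i.e. its frequency in that tweet.
transactions: List[Dict[str, int]] = [
    {"goal": 2, "chelsea": 1},
    {"goal": 1, "final": 1, "chelsea": 1},
    {"final": 3},
]

# Hypothetical external utilities eu(i).
eu: Dict[str, float] = {"goal": 3.0, "chelsea": 2.0, "final": 1.0}


def u_item(i: str, T: Dict[str, int]) -> float:
    """Utility of item i in transaction T: eu(i) * iu(i, T)."""
    return eu[i] * T[i]


def u_itemset(X: Iterable[str], T: Dict[str, int]) -> float:
    """Utility of itemset X in T: sum of the item utilities over X."""
    return sum(u_item(i, T) for i in X)


def tu(T: Dict[str, int]) -> float:
    """Transaction utility: sum of the utilities of all items in T."""
    return sum(u_item(i, T) for i in T)


def twu(X: Iterable[str], db: List[Dict[str, int]]) -> float:
    """Transaction-weighted utility: sum of tu(T) over all T containing X."""
    items = set(X)
    return sum(tu(T) for T in db if items.issubset(T))
```

With this toy data, `twu({"goal", "chelsea"}, transactions)` adds the transaction utilities of the first two tweets only, because the third tweet does not contain both words.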
4. Topic Detection based on High Utility Pattern Mining
4.1. Computation of Utility for Words
Tweets generated in time order are denoted as Ti, and the Twitter stream as TS = T1, T2, T3, … TS is represented as a sequence of batches B1, B2, B3, …

Let f(Bt,i) denote the frequency of word i in batch Bt. The difference in the frequency of word i between the current batch BL and the previous batch BL-1 is:
dif(i) = f(BL,i) – f(BL-1,i) (5)

• dif(i) > 0: the frequency of the word is increasing
• dif(i) < 0: the frequency of the word is decreasing
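The frequency difference in Eq. (5) can be sketched in Python as follows. This is only an illustration under an assumption: f(Bt,i) is taken to be the total number of occurrences of word i in the batch, which the slides do not state explicitly. The function names `word_frequency` and `dif` are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Batch = List[List[str]]  # a batch is a list of tokenized tweets


def word_frequency(batch: Batch) -> Counter:
    """f(B, i): occurrences of each word i in batch B (assumed definition)."""
    freq: Counter = Counter()
    for tweet in batch:
        freq.update(tweet)
    return freq


def dif(prev_batch: Batch, cur_batch: Batch) -> Dict[str, int]:
    """dif(i) = f(B_L, i) - f(B_{L-1}, i) for every word seen in either batch."""
    f_prev = word_frequency(prev_batch)
    f_cur = word_frequency(cur_batch)
    # Counter returns 0 for missing words, so new or vanished words work too.
    return {w: f_cur[w] - f_prev[w] for w in set(f_prev) | set(f_cur)}


# Hypothetical example batches.
previous = [["cup", "final"], ["final"]]
current = [["goal", "final"], ["goal"], ["goal", "cup"]]
changes = dif(previous, current)
```

In this example `changes["goal"]` is positive (the word is increasing), `changes["final"]` is negative (decreasing), and `changes["cup"]` is zero.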

The rate of frequency increase of word i:
Rate(i) = [equation missing] (6)

The external utility for word i included in batch BL:
[equation missing] (7)

Figure 1. The flowchart of the proposed method for emerging topic detection in a Twitter stream.
4.2. Determining a Minimum Utility Threshold
• Let s(X) denote all tweet posts containing the words in X, and l(T) the length of a tweet post T:
[equation missing] (4)
[equation missing] (8)
  • α: the utility average of words
  • β: the average length of tweets
  • γ = |s(X)|: the number of tweets
• The number of selected words = [expression missing] (9)
  • lower-bound
  • upper-bound
min-util = avg([expression missing]) (10)

4.3. Generation of Candidate Topic Patterns
In the HUPM described in (Liu & Qu, 2012), most of the patterns generated in this manner include redundant patterns, where some short patterns appear repeatedly within longer patterns.
The paper calls the patterns generated by HUPM candidate topic patterns and applies post-processing to eliminate the redundant patterns.
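To make the redundancy among candidate topic patterns concrete, here is a naive Python sketch that drops every candidate that is strictly contained in a longer candidate. This is not the TP-Tree procedure from the paper, whose details are not given in these slides; it is a simple subset filter used only to illustrate what "short patterns repeated inside long patterns" means. The function name `drop_contained_patterns` and the sample patterns are hypothetical.

```python
from typing import FrozenSet, Iterable, List

Pattern = FrozenSet[str]


def drop_contained_patterns(candidates: Iterable[Pattern]) -> List[Pattern]:
    """Keep only patterns that are not a proper subset of another candidate.

    Naive O(n^2) illustration; NOT the TP-Tree post-processing of the paper.
    """
    unique = set(candidates)
    kept = [p for p in unique if not any(p < q for q in unique)]
    # Sort longest first, then alphabetically, for a deterministic result.
    return sorted(kept, key=lambda p: (-len(p), sorted(p)))


# Hypothetical candidate topic patterns.
candidates = [
    frozenset({"fa", "cup"}),
    frozenset({"fa", "cup", "final"}),
    frozenset({"cup"}),
    frozenset({"goal"}),
]
actual = drop_contained_patterns(candidates)
```

Here `{"fa", "cup"}` and `{"cup"}` are discarded because both are contained in `{"fa", "cup", "final"}`, while `{"goal"}` survives.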
4.4. Extraction of Actual Topic Patterns
• A TP-Tree (Topic-tree) is constructed to effectively remove the redundancy from the candidate topic patterns.
  • The utility of the pattern p, PU(p), is defined as follows: [equation missing]
  • One term in this definition is the sum of the external utilities of the words included in the pattern.
5. Experimental Results

5.1. Twitter Data
• FA Cup Final (FA): Ground-truth data includes 13 topics in 13 intervals.
• Super Tuesday (ST): Ground-truth data contains 22 topics in 8 intervals.
• US Elections (US): Ground-truth information is given for 64 topics in 26 intervals.

5.2. Data Preprocessing
1. All characters were converted to lowercase.
2. Tweets collected through the Twitter Streaming API often include HTML tags. The HTML tags were replaced with whitespace, and URLs included in the tweets were removed.
3. Tokenization, including all hashtags, was performed with Lucene’s StandardAnalyzer.
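The three preprocessing steps can be sketched in Python as below. The paper uses Lucene's StandardAnalyzer for tokenization; the regular-expression tokenizer here is only a stand-in and will not reproduce Lucene's behavior exactly. Keeping the leading `#` on hashtags is an assumption, and the patterns `TAG_RE`, `URL_RE`, `TOKEN_RE` and the function `preprocess` are hypothetical names.

```python
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <b> or </a>
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) and www-style URLs
TOKEN_RE = re.compile(r"#?\w+")                 # words, optionally with a leading '#'


def preprocess(tweet: str) -> List[str]:
    """Lowercase, replace HTML tags with whitespace, remove URLs, tokenize."""
    text = tweet.lower()            # step 1: lowercase
    text = TAG_RE.sub(" ", text)    # step 2a: HTML tags -> whitespace
    text = URL_RE.sub(" ", text)    # step 2b: remove URLs
    return TOKEN_RE.findall(text)   # step 3: simple stand-in tokenizer


tokens = preprocess("Chelsea WIN the <b>FA Cup</b>! http://t.co/abc #FACup")
```

For the sample tweet, the tags and the URL disappear and the hashtag is kept as a single token.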

5.3. Measures for Performance Evaluation

• Topic recall and topic relevance: Topic recall is the ratio of the topics successfully detected among the ground-truth topics. Topic relevance is the ratio of the topics matched to some ground-truth topic among the topics found by a method.
• Keyword precision and keyword recall: Keyword precision is the ratio of correctly detected keywords out of the total number of keywords for the found topics matched to some ground-truth topic.
• F-measure: Computed from keyword precision and keyword recall.
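The measures above can be sketched in Python as follows. Several points are assumptions not stated in the slides: the matching between detected and ground-truth topics is taken as given (the `matches` dictionary), keyword recall is assumed to be the ratio of correctly detected keywords to the keywords of the matched ground-truth topics, and the F-measure is assumed to be the usual harmonic mean. All names and sample topics are hypothetical.

```python
from typing import Dict, List, Set

Topics = List[Set[str]]


def topic_recall(matches: Dict[int, int], n_ground_truth: int) -> float:
    """Share of ground-truth topics matched by at least one detected topic."""
    return len(set(matches.values())) / n_ground_truth


def topic_relevance(matches: Dict[int, int], n_detected: int) -> float:
    """Share of detected topics that match some ground-truth topic."""
    return len(matches) / n_detected


def keyword_precision(detected: Topics, truth: Topics, matches: Dict[int, int]) -> float:
    """Correct keywords over all keywords of the matched detected topics."""
    correct = sum(len(detected[d] & truth[g]) for d, g in matches.items())
    total = sum(len(detected[d]) for d in matches)
    return correct / total if total else 0.0


def keyword_recall(detected: Topics, truth: Topics, matches: Dict[int, int]) -> float:
    """Assumed definition: correct keywords over keywords of matched ground truth."""
    correct = sum(len(detected[d] & truth[g]) for d, g in matches.items())
    total = sum(len(truth[g]) for g in matches.values())
    return correct / total if total else 0.0


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed form of the F-measure)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Hypothetical topics; matches maps detected-topic index -> ground-truth index.
truth = [{"chelsea", "goal", "drogba"}, {"liverpool", "carroll"}]
detected = [{"chelsea", "goal", "wembley"}, {"weather", "rain"}]
matches = {0: 0}

kp = keyword_precision(detected, truth, matches)
kr = keyword_recall(detected, truth, matches)
```

In this example one of two ground-truth topics is found, so topic recall and topic relevance are both 0.5, and two of the three keywords in the matched topic are correct.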

5.4. Performance Comparison
• Tested with a setting of 27 combinations.

5.5. Parameter Sensitivity
• Figure 4 compares the performance depending on the parameter selection.
• It shows the topic recall when the three parameters were varied.
• Overall, for the three datasets, topic recall was good when the first parameter was above 0.025, the second was in the range of 200 to 400, and the third was in the range of 70 to 90.
6. Conclusions
In this paper, the authors proposed a method for detecting topics from Twitter streams using HUPM. The proposed method includes a stage for calculating the utilities of words in each batch of tweets by the sliding window technique, a stage for determining the min-util for each batch, and a stage for extracting actual topic patterns from the candidate topic patterns using the TP-Tree.

They experimentally analyzed the topic detection performance of the proposed method in comparison with other methods. Notably, the proposed method showed a topic recall 5% higher than the other compared methods for the ST dataset, 6% higher for the US dataset, and 8% higher for the FA dataset. Regarding the time spent on topic detection, the proposed method demonstrated a short running time for the three datasets.

THANK YOU!