Topic Shift Detection in Online Discussions using Structural Context
Conference Paper · July 2019
DOI: 10.1109/COMPSAC.2019.00155
2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC)
Topic Shift Detection in Online Discussions using Structural Context
Yingcheng Sun Kenneth Loparo
Case Western Reserve University Case Western Reserve University
Cleveland, OH, USA Cleveland, OH, USA
[email protected] [email protected]

Abstract—Topic shift occurs frequently in online discussions, and automatically detecting topic shift can help to better capture the main clues and obtain relevant answers from a large number of comments. Traditional topic-shift detection methods calculate text similarity and have limited success because they ignore semantic relatedness. In this paper, we propose a new topic shift detection model that uses conversational structure to enrich the context information and word embedding to build the semantic associations for each comment-post pair. Experiments show that the proposed model leads to better performance in terms of precision, recall, and F1 score.

0 Texas serial bomber made video confession before blowing himself up
1 What are the chances we ever see the video?
2 The same as the chances of the Browns winning Super Bowl.
3 Browns run too many bad plays in the last quarter.
4 I take the browns to the super bowl every morning
5 Zero, videos like this are locked down and used for training purposes
6 Here I am thinking how bad can it be?
7 I want to know what kind of phone he has?
8 An old analog one from the 90's.
9 Nokia brick, amazingly durable.

Figure 1. An example thread of user replies on a news article about the Texas serial bomber's video¹. Blue indicates an on-topic comment and red indicates an off-topic comment. Number i is the i-th comment.
Keywords- topic shift; online discussions; structural context

I. INTRODUCTION

While a nested conversation in social media usually starts from the topics discussed in the initial post, the topic often changes through comments and replies, leading to topic shift or topic drift [1]. Automatically detecting topic shift in online discussions can help to capture the main clues of discussion threads and to filter irrelevant replies, bringing the conversation back on topic and improving members' experiences of online communities, especially on question-and-answer sites [2]. Conventional topic shift detection models are based on comparing the text similarity between comments and the initial post [3][4], but comments in online discussions are short, omit background information, and are sometimes sparse with many co-referenced expressions [5], so traditional topic shift detection models that use only literal similarity may not work well. Figure 1 illustrates a real discussion thread of user comments on a news article about the "Texas serial bomber's video". In Figure 1, comments 7, 8 and 9 discuss the phone the "bomber" used, which is related to the news article, but comments 2, 3 and 4 (colored red) talk about the "Browns in the Super Bowl", which is clearly a shift from the original topic. Neither group of comments shares any words with the news article, so comments in both groups would be identified as "topic shift" if we used the text similarity value as the metric. To address this issue, in this paper we propose a topic shift detection model that uses word embedding as the vector representation to build the semantic relationship between comments and the post, and uses the tree structure that each discussion thread inherently exhibits as context information to enrich the background knowledge for each comment. Experiments show that our model effectively improves topic shift detection performance.

II. TOPIC SHIFT DETECTION MODEL

Word embedding trained on a large corpus has been shown to capture the semantic and syntactic features of words, so that similar words are close to each other in the embedding space [5]. Compared to word occurrence measurement, word embedding can improve the accuracy of topic shift detection, but it may have issues with topic-related comments of low semantic similarity. For example, comment 6 in Figure 1, asking about the video content, is related to the topic but may be incorrectly classified into the "topic shift" group even using word embedding, because its semantic similarity with post 0 is also low. We thus need to introduce more clues from the contextual environment.

In online discussions, users can easily participate by submitting comments or writing replies to those that draw their attention. In writing a reply, a user reads the initial post or headline, browses the comments and selects one for a reply. By writing a reply, a user explicitly expresses their interest in the topic(s) of the discussion thread, thereby enlarging the discussion tree by adding leaf nodes. The main topics of a reply may not be closely related to comments located at a distance in the discussion thread, but will definitely be responsive to the comment it directly replies to. We thus design our model based on the intuition that the topic distribution of a node can be inferred from its parent, children and sibling nodes besides itself. Figure 2 shows the tree structure of the example in Figure 1 and its topic shift detection process using word embedding and structural context, following the above intuition. First, we use word embedding to obtain the matrix of vectors. In this paper, we choose the 100-dimensional GloVe² word embedding pre-trained on Wikipedia as the vector representations. We

¹ https://www.reddit.com/r/news/comments/867njq/texas_serial_bomber_made_video_confession_before/?st=juj3moys&sh=3a890c20
² https://nlp.stanford.edu/projects/glove/
978-1-7281-2607-4/19/$31.00 ©2019 IEEE 948
DOI 10.1109/COMPSAC.2019.00155
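The first step of Section II — turning each comment into a vector and scoring it against the root post by cosine similarity — can be sketched as follows. This is a minimal illustration, not the authors' code: the whitespace tokenizer and the tiny random `glove` dictionary are stand-ins for a real pre-trained 100-dimensional GloVe lookup.

```python
import numpy as np

def comment_vector(text, glove, dim=100):
    """Mean of the word vectors of the comment's tokens; zeros if none are known."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

# Toy stand-in for the 100-dimensional GloVe lookup table used in the paper.
rng = np.random.default_rng(0)
glove = {w: rng.normal(size=100) for w in
         "texas serial bomber made video confession what are the chances we ever see".split()}

root = comment_vector("Texas serial bomber made video confession", glove)
reply = comment_vector("What are the chances we ever see the video?", glove)
print(cosine(root, reply))  # raw similarity of comment 1 to post 0
```

In practice the lookup would be loaded from the pre-trained GloVe file (e.g. glove.6B.100d), one `word v1 ... v100` line at a time, and stacking the per-comment vectors row by row yields the vector matrix for the whole discussion tree.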
Figure 2. Tree structure and topic shift detection process of the example in Figure 1: word embedding produces the vector matrix, cosine similarity to the root is computed, structural context information is added bottom-up over levels 1-4, and nodes are classified. Blue indicates an on-topic comment and red indicates an off-topic comment; the shade of color represents topic similarity to the root, the deeper the shade, the larger the similarity.
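The bottom-up re-scoring that Figure 2 depicts — combining each node's own cosine similarity with those of its parent, children and siblings — can be sketched as below. The weights 0.56/0.24/0.15/0.05 are the values reported in the paper; the reply tree follows Figure 2, the raw similarity values are made up for illustration, and feeding already-updated child scores into their parents during the bottom-up pass is our reading of the paper, not a confirmed detail.

```python
# Weights reported in the paper: node itself, parent, children, siblings.
WI, WP, WC, WS = 0.56, 0.24, 0.15, 0.05

def depth(n, parent):
    """Depth of node n in the reply tree (the root post has depth 0)."""
    d = 0
    while parent.get(n) is not None:
        n, d = parent[n], d + 1
    return d

def smooth(sim, parent, children):
    """Bottom-up pass: S'i = wi*Si + wp*Sp + (wc/Mi)*sum(Sc) + (ws/Ni)*sum(Ss)."""
    new = {}
    # Deepest level first, as in Figure 2's "bottom up" computation.
    for i in sorted(sim, key=lambda n: depth(n, parent), reverse=True):
        s = WI * sim[i]
        p = parent.get(i)
        if p is not None:
            s += WP * sim[p]
            sibs = [c for c in children.get(p, []) if c != i]
            if sibs:  # Ni siblings, averaged
                s += WS * sum(sim[c] for c in sibs) / len(sibs)
        kids = children.get(i, [])
        if kids:  # Mi children; deeper nodes were already updated (assumption)
            s += WC * sum(new[c] for c in kids) / len(kids)
        new[i] = s
    return new

# Reply tree of Figure 2 and illustrative raw cosine similarities to the root.
parent = {1: 0, 7: 0, 5: 1, 2: 1, 8: 7, 6: 5, 3: 2, 9: 8, 4: 3}
children = {0: [1, 7], 1: [5, 2], 7: [8], 5: [6], 2: [3], 8: [9], 3: [4]}
sim = {0: 1.0, 1: 0.6, 7: 0.5, 5: 0.55, 2: 0.05, 8: 0.45,
       6: 0.2, 3: 0.05, 9: 0.25, 4: 0.05}
scores = smooth(sim, parent, children)
# With these values, the off-topic chain 2-3-4 keeps lower scores than
# comments 6 and 9, which are pulled up by their on-topic parents.
```

A final threshold on the smoothed scores would separate "topic shift" from on-topic comments; the paper does not state the threshold, so it would have to be tuned on labelled data.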
can obtain the matrix corresponding to the word embedding for the entire discussion tree, with each row being a comment's vector representation. Next, we compute the cosine similarity between each comment and the initial post, namely the root node of the tree, and rank the comments by their values; the darker the color in the figure, the larger the value. We can see that comments 2, 3 and 4 are "topic shifted" and comments 6 and 9 are "not topic shifted", but because they all have low values, it is difficult to tell them apart. We thus use structural context in the next step to provide additional background information for each node. We calculate the topic similarity S'i of node i from its original cosine similarity value Si and the similarity values of its parent Sp, children Sc and siblings Ss:

S'i = wi·Si + wp·Sp + (wc/Mi)·Σ Sc + (ws/Ni)·Σ Ss

where Mi and Ni are respectively the numbers of children and siblings of node i (if any), and w is the weight. Experiments on 800 comments show that the topic reliance of a node on the four types of nodes above is in descending order: wi > wp > wc > ws. This is straightforward: the content of the node itself is the most important indicator, and a node replies to the content of its parent but may discuss other aspects, so wp is also important but smaller than wi; the same holds for wc and ws. In the example of Figure 2, wi, wp, wc and ws are set to 0.56, 0.24, 0.15 and 0.05 respectively, according to the statistical results obtained from our experiments. We compute the topic similarity in a "bottom-up" order, from the highest level to level 1. With the newly calculated topic similarity values, we can see that the topics discussed in nodes 6 and 9 are closer to the root, so only nodes 2, 3 and 4 are classified as "topic shift" comments.

III. EXPERIMENT

To reduce domain bias [3], we collected 200 comments from discussion threads in each of four different domains from Yahoo News and Reddit, giving 800 comments in total. We systematically identified a number of main topics in each of the posts and examined whether and how many of those main topics changed as the threads evolved. Using this information, we categorized topical changes into topic shifted (254 comments) and not shifted (546 comments).

With the labelled data as a gold standard, we calculated the precision, recall, and F1 score of our proposed topic shift detection model and of the traditional text-similarity-based model. The comparative results are provided in Figure 3.

Figure 3. Comparison of F1 scores for the data sets of four domains. (Bar chart comparing "Text Similarity" with "Structural Context and Word Embedding" across Entertainment, Sports, Politics and Health; y-axis from 0 to 0.8.)

The results show that our model achieves better performance than the traditional text-similarity-based model in all four domains, with average precision, recall and F1 scores of 0.67, 0.654 and 0.661, compared to 0.61, 0.57 and 0.5985. They also show that it is harder to detect topic shift or connectedness in the entertainment domain than in the other three, because many popular movies, shows and topics are not listed in GloVe, so their relatedness fails to be detected. In the future, we plan to use more data to further develop and test our model.

ACKNOWLEDGMENT

This work was supported by the Ohio Department of Higher Education, the Ohio Federal Research Network and the Wright State Applied Research Corporation under award WSARC-16-00530 (C4ISR: Human-Centered Big Data).

REFERENCES

[1] Lifna, C.S. and Vijayalakshmi, M., 2015. Identifying concept-drift in Twitter streams. Procedia Computer Science, 45, pp. 86-94.
[2] Park, A., Hartzler, A.L., Huh, J., Hsieh, G., McDonald, D.W. and Pratt, W., 2016. "How Did We Get Here?": Topic drift in online health discussions. Journal of Medical Internet Research, 18(11), p. e284.
[3] Topal, K., Koyuturk, M. and Ozsoyoglu, G., 2016. Emotion- and area-driven topic shift analysis in social media discussions. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 510-518. IEEE.
[4] Li, Q., Sun, Y. and Xue, B., 2012. Complex query recognition based on dynamic learning mechanism. Journal of Computational Information Systems, 8, pp. 8333-8340.
[5] Li, C., Duan, Y., Wang, H., Zhang, Z., Sun, A. and Ma, Z., 2017. Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Transactions on Information Systems (TOIS), 36(2), p. 11.