Flu Gone Viral: Syndromic Surveillance of Flu On Twitter Using Temporal Topic Models
Liangzhe Chen* , K. S. M. Tozammel Hossain* , Patrick Butler, Naren Ramakrishnan, B. Aditya Prakash
Department of Computer Science, Virginia Tech, VA, USA
Email: {liangzhe, tozammel, pabutler, naren, badityap}@cs.vt.edu
* Authors contributed equally to this work.

Abstract: Surveillance of epidemic outbreaks and spread from social media is an important tool for governments and public health authorities. Machine learning techniques for nowcasting the flu have made significant inroads into correlating social media trends to case counts and prevalence of epidemics in a population. There is a disconnect between data-driven methods for forecasting flu incidence and epidemiological models that adopt a state-based understanding of transitions, which can lead to sub-optimal predictions. Furthermore, models for epidemiological activity and for social activity such as that on Twitter predict different shapes and have important differences. We propose a temporal topic model to capture hidden states of a user from their tweets and aggregate states in a geographical region for better estimation of trends. We show that our approach helps fill the gap between phenomenological methods for disease surveillance and epidemiological models. We validate this approach by modeling the flu using Twitter in multiple countries of South America. We demonstrate that our model can consistently outperform plain vocabulary assessment in flu case-count predictions, and at the same time get better flu-peak predictions than competitors. We also show that our fine-grained modeling can reconcile some contrasting behaviors between epidemiological and social models.

I. INTRODUCTION

Online web searches and social media such as Twitter and Facebook have emerged as surrogate data sources for monitoring and forecasting the rise of public health epidemics. The celebrated example of such surrogate sources is arguably Google Flu Trends, where user query volume for a handcrafted vocabulary of keywords is harnessed to yield estimates of flu case counts. Such surrogates thus provide an easy-to-observe, indirect approach to understanding population-level events.

Recent research has brought intense scrutiny on Google Flu Trends, often negative. Lazer et al. [17] provide many reasons for Google Flu Trends' lackluster performance. Some of these reasons are institutional (e.g., a cloud of secrecy about which keywords are used in the model, affecting reproducibility and verification); some are operational (e.g., lack of periodic re-training); others could be indicative of more systemic problems, e.g., that the vocabulary for tracking might evolve over time, or that greater care is needed to distinguish which aspects of search query volume should be used in modeling. These problems are not unique to Google Flu Trends, and can resurface with other surveillance strategies.

Our work is motivated by such considerations and we aim to better bridge the gap between syndromic surveillance strategies and contagion-based epidemiological modeling such as SI, SIR, and SEIS [12]. In particular, while models of social activity have been inspired by epidemiological research, recent work [20], [26], [23] has shown that there are key aspects along which they differ from biological contagions. Specifically, evidence from [20], [9] shows that the activity profile (or the number of new people using a hashtag/keyword) shows a power-law drop; in contrast, standard epidemiological models exhibit an exponential drop [12]. Also, there is some evidence that hashtags of different topics show an exposure curve which is not monotonic, resembling a complex contagion [23].

We show that we can reconcile the apparently contrasting behaviors with a finer-grained modeling of biological phases as inferred from tweets. For example, sample tweets “Down with flu. Not going to school.” and “Recovered from flu after 5 day, now going to the beach” denote different states of the users (also see Figure 1(a)). We argue that, once we correct for the epidemiological state a user belongs to, the social and biological activity time-series are actually similar. Hashtags and keywords merge users belonging to different epidemiological phases. We separate these states by using a temporal topic model. In addition, thanks to the finer-grained modeling, our approach gets better predictions of the incidence of flu cases than direct keyword counting and also sometimes gets better predictions of flu peaks than sophisticated methods like Google Flu Trends. Our contributions are:

1) We propose a temporal topic model (HFSTM) for inferring hidden biological states for users, and an EM-based learning algorithm (HFSTM-FIT) for modeling the hidden epidemiological state of a user.
2) We show via extensive experiments using tweets from South America that our learner indeed learns meaningful word distributions and state transitions. Further, our method can better forecast the flu trend as well as flu peaks.
3) Finally, we show how, once corrected for the state information using our learnt model, the social contagion activity profile fits better with standard epidemiological models.

Our work can be seen as a stepping stone to better understanding of contagions that occur in both biological and social spheres.

II. RELATED WORK

The most closely related work comes from three areas; we discuss them next in this section.

Epidemiology: In the epidemiological domain, various compartmental models (which explicitly model states of each
[Figure 1. Panel (a), toy example: states S, E, I and S/R annotated with sample tweets, including “Had good sleep this morning!”, “Going to see my favourite band”, “My neck hurts”, “I am in bed with the worst flu”, “I should have gotten the vaccine”, “No word can describe the amount of pain I am in”, “Starting to feel better” and “Going to the concert tonight”. Panel (b), state transitions learnt by our model HFSTM: diagram over states S, E and I with edge labels .98, .01, .53, .02, .95, .04 and .47.]
Fig. 1. Comparison between expected state transitions and the state transitions learnt by our model. (a) A toy example showing possible user states and a tweet associated with each state. (b) State transition probabilities learnt by HFSTM (see Sec. III).
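As a rough illustration of how a transition diagram like the one in Fig. 1(b) can be used to label a user's tweet sequence with states, the sketch below runs standard Viterbi decoding over the three states S, E and I. This is a simplified stand-in, not the HFSTM inference procedure: only the initial distribution [0.98, 0.02, 0.00] is taken from Sec. C, while the transition matrix `TRANS`, the emission table `EMIT` and the reduction of each tweet to a single symbol (`normal`, `symptom`, `flu`) are hypothetical values chosen for the example.

```python
STATES = ["S", "E", "I"]

# Initial distribution as reported in Sec. C.
INIT = {"S": 0.98, "E": 0.02, "I": 0.00}

# Hypothetical transition probabilities (rows sum to 1), shaped like an SEIS chain.
TRANS = {
    "S": {"S": 0.95, "E": 0.05, "I": 0.00},
    "E": {"S": 0.00, "E": 0.50, "I": 0.50},
    "I": {"S": 0.40, "E": 0.00, "I": 0.60},
}

# Hypothetical emission probabilities: each tweet is reduced to one coarse symbol.
EMIT = {
    "S": {"normal": 0.90, "symptom": 0.09, "flu": 0.01},
    "E": {"normal": 0.20, "symptom": 0.70, "flu": 0.10},
    "I": {"normal": 0.10, "symptom": 0.30, "flu": 0.60},
}


def viterbi(observations, states, init, trans, emit):
    """Return the most likely state sequence for a non-empty observation list."""
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: init[s] * emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans[p][s])
            scores[s] = best[-1][prev] * trans[prev][s] * emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


print(viterbi(["normal", "symptom", "flu", "flu"], STATES, INIT, TRANS, EMIT))
# → ['S', 'E', 'I', 'I']
```

Plain probabilities are used here for readability; for long tweet streams one would normally switch to log-probabilities to avoid numerical underflow.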
A generative process for the model is shown in Alg. 1. A binary variable l determines whether or not a word is
1 Code and vocabulary can be found here: http://people.cs.vt.edu/liangzhe/code/hfstm.html
2) Datasets: We collected tweets generated from 15 countries in South America for the period Dec 2012 to Jan 2014 using Datasift's Twitter collection service (footnote 2), which pre-processes the data and detects the geo-location for tweets.

We create a training dataset, TrainData, using the tweets from Jun 20, 2013 to Aug 06, 2013, which contains a peak of infections. We created two evaluation sets: TestPeriod-1, using tweets from Dec 01, 2012 to Jul 08, 2013, which contains the rising part of a flu infection peak; and TestPeriod-2, from Nov 10, 2013 to Jan 26, 2014, which is from a different flu season. To create the training data we perform keyword and phrase checking (from our vocabulary) to identify a set of users who have potentially tweeted a flu-related tweet. We then fetch their tweet streams from the Twitter API for the training period. We then use the Datasift service to preprocess these tweets (stemming, lemmatization, etc.), and get our final training dataset of roughly 34,000 tweets.

We collected data from the Pan American Health Organization (PAHO [21]) for the ground-truth reference dataset for flu case counts (trends). PAHO plays the same role in South America as the CDC does in the USA. Note that PAHO gives only per-week counts.

2 http://datasift.com/

B. Word distribution for each flu-state

In short, our model learns meaningful topic word distributions for the flu states. Figure 3 shows a word cloud for each flu state learned by HFSTM (we renormalized each word distribution after removing the generic block-word). The most frequent words in each state match well with the S, E and I states in epidemiology. As shown in the figure, the S state has normal words, the E state starts to gather words which indicate exposure to, or the approach of, the disease ('pain', 'throat'), while the I state gets many typical flu-related words ('flu', 'fever').

(a) S state (b) E state (c) I state
Fig. 3. The translated word cloud for the most probable words in the S, E and I state-topic distributions as learnt by HFSTM on TrainData. Words are originally learned and inferred in Spanish; we then translate the result using Google Translate for ease of understanding. The size of a word is proportional to its probability in the corresponding topic distribution. Our model is able to tease out the differences in the word distributions between them.

C. State transition

We show the state transition diagram learned by our model in Figure 1(b). The initial state probability learned is [0.98, 0.02, 0.00]: a tweet starts at state S with high probability, at state E with probability 0.02, and at state I with almost zero probability. When a transition occurs, a tweet in state S tends to stay in state S, a tweet in state E is very likely to enter state I, while a tweet in state I either stays infected or recovers and goes back to state S. All these observations match closely with the standard epidemiological SEIS model and intuition.

We also investigate the most likely state sequence for each user learned by our model. Using the probabilities learned by our model, we take a sequence of tweets from one user, and use MLE to estimate the state each tweet is in. Table II shows one example of these transitions (we show the English version translated using Google Translate). As we can see, our model is powerful enough to learn the Exposed state, before the user is infectious. This also shows the accuracy of our transition probabilities between the flu states.

Date   | Tweet Message                                                                         | State
29 Jul | I hate pork chops - . -                                                               | S
29 Jul | I just want to leave my house to eat what I like my                                   | S
29 Jul | I’m dying of sleep , headache and sore throat but I will because I have mathematical  | E
29 Jul | That itv program brainwashed my mom , now I want to take juice or eat cereal          | S
29 Jul | Everything would be perfect if I hurt your throat                                     | E
30 Jul | I’m sure I have a fever because I hear weird sounds                                   | I
30 Jul | I will survive because I am macabre empire                                            | I
30 Jul | I want to go to the doctor - . -                                                      | I
30 Jul | Natural orange juice for the sick                                                     | I
30 Jul | spicy ham tkm                                                                         | I

TABLE II. EXAMPLE STATE SEQUENCE FOR A USER AS LEARNT FROM OUR MODEL FROM REAL-WORLD TWEETS (TRANSLATED TO ENGLISH USING GOOGLE TRANSLATE). We used HFSTM to classify tweets to different states. As we can see, our model can capture the difference between different states and also the state transitions.

D. Fitting flu trend

Additionally, to test the predictive capability of our model, we design a flu case-count prediction task on our test datasets, after training on TrainData. We compare three models: (A) the baseline model, which uses classical linear regression techniques and word counts to predict case count numbers; (B) our model HFSTM; and (C) GFT (Google Flu Trends). In all three cases we use the same LASSO-based linear regression model to predict the number of cases of influenza-like illnesses recorded by PAHO (the ground truth). We predict weekly values, as both PAHO and GFT give counts only on a weekly basis.

The baseline model uses a set of features created from the counts of 114 flu-related words. We count the number of occurrences of these words in the testing data; these word counts are then collated into a single feature vector, defined as the number of tweets containing a single word per week. We then regress this set of counts against the PAHO case counts for each week.

Our model improves upon the baseline model by incorporating the state of the user when a word was tweeted. In this way we capture the context of a word/tweet as implied by our HFSTM model. For our model, the feature vector is created from a count of the top 20 words from each state, with the state appended to each word, such that (cold, S) is counted differently from (cold, I).

For GFT, we directly collect data from the Google Flu Trends website (footnote 3), and then apply the same regression as used in the other methods to predict the number of infection cases. Note that as GFT is a state-of-the-art production system with highly optimized proprietary vocabulary lists, we do not expect to beat it consistently, yet as we describe later, we note some interesting results.

3 http://www.google.org/flutrends

Fig. 4(a) shows the aggregated cases for TestPeriod-1, and Fig. 4(b) shows the same cases for TestPeriod-2. We make several observations. Firstly, it is clear from the figures that HFSTM outperforms the baseline method (of keyword counting) in both cases, demonstrating that the state knowledge is important and that our model is learning that information correctly (the RMSE differences between HFSTM and the baseline for the two plots are about 250 and 70 respectively). Secondly, we also see that the predictions from our model are qualitatively comparable to the state-of-the-art GFT predictions, even though our method was implemented only as a research prototype without sophisticated optimizations. In fact, for Figure 4(b), our model HFSTM even outperforms GFT (with an RMSE difference of about 37). Significantly, in both cases, GFT clearly overestimates the peak, which our method does not (this is an important issue with GFT which was also documented and observed in the context of another US flu season [6]). These results show that including the epidemiological state information of users via our model can potentially benefit the prediction of infection cases dramatically.

E. Bridging the Social and the Epidemiological

off exponentially as expected from standard epidemiological models.

To test our hypothesis, we chose commonly occurring flu keywords, namely enfermo (sick), fiebre (fever) and dolor (pain), for the analysis. Firstly, we count the total occurrences of these keywords in TestPeriod-1. For each keyword we identify the falling part of its activity curve. We then fit each curve with a power-law function and an exponential function. As expected from [20], Fig. 5(a) shows that the power-law function provides a much better fit of the falling part of the curve compared to the exponential function (RMSE scores for the power-law and exponential functions are ∼320.31 and ∼469.35 respectively).

Secondly, to study the effect of our model on the activity profiles of these keywords, we count the total occurrences of these keywords in tweets posted only by infected users (i.e., by those users our model places in state I). In contrast to the previous figure, we see that the exponential fit (RMSE score ∼147.48) is now much better than the power-law fit (RMSE score ∼275.50) (see Fig. 5(b)), matching what we would expect from an epidemiological model like SEIS. This demonstrates that finer-grained modeling can explain differences between the biological activity and the social activity which is used as its proxy.
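The comparison of power-law and exponential fits on the falling part of an activity curve can be sketched as follows. The text does not say how the curves were fitted, so this example assumes the simplest option, ordinary least squares in log space; the function names and the synthetic decay series are hypothetical and only serve to show the RMSE comparison.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_power_law(ts, counts):
    """Fit counts ~ c * t**b by regressing log(count) on log(t); needs t > 0, count > 0."""
    a, b = linear_fit([math.log(t) for t in ts], [math.log(c) for c in counts])
    return lambda t: math.exp(a) * t ** b


def fit_exponential(ts, counts):
    """Fit counts ~ c * exp(b * t) by regressing log(count) on t; needs count > 0."""
    a, b = linear_fit(ts, [math.log(c) for c in counts])
    return lambda t: math.exp(a + b * t)


def compare_fits(ts, counts):
    """RMSE of both fitted curves, measured on the original (not log) scale."""
    power = fit_power_law(ts, counts)
    expo = fit_exponential(ts, counts)
    return {
        "power_law": rmse(counts, [power(t) for t in ts]),
        "exponential": rmse(counts, [expo(t) for t in ts]),
    }


# Synthetic falling curves (time measured in steps since the peak, starting at 1).
ts = list(range(1, 11))
exp_decay = [1000.0 * math.exp(-0.5 * t) for t in ts]
pow_decay = [1000.0 * t ** -1.5 for t in ts]

print(compare_fits(ts, exp_decay))  # exponential RMSE is near zero here
print(compare_fits(ts, pow_decay))  # power-law RMSE is near zero here
```

Fitting in log space weights relative rather than absolute errors, so the RMSE values it yields on real keyword counts may differ from those of a direct nonlinear fit.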
[Figure 4: two panels plotting Case Count against Date. (a) TestPeriod-1, x-axis from Jan 2013 to Jul 2013; (b) TestPeriod-2, x-axis from Nov 24, 2013 to Feb 02, 2014.]
Fig. 4. Evaluation for the two test scenarios: (a) TestPeriod-1 and (b) TestPeriod-2. Comparison of the week-to-week predictions against PAHO case counts using the three models: baseline model, HFSTM, and GFT (Google Flu Trends). Our model outperforms the baseline, and is comparable to GFT, beating it in case of (b). GFT overestimates the peak in both test periods.
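The state-aware features behind the HFSTM predictions compared in Fig. 4 can be sketched roughly as below: each weekly feature counts tweets containing a given word while the author is in a given state, so that (cold, S) and (cold, I) are separate columns. The function name, the input layout (week, state, tokens) and the toy data are assumptions made for illustration, and the LASSO regression step that follows in the paper is not shown.

```python
from collections import Counter, defaultdict


def state_word_features(tweets, vocab_by_state):
    """Build weekly counts of (word, state) pairs.

    tweets: iterable of (week, state, tokens) tuples, where state is the label
        inferred for that tweet (hypothetical input layout).
    vocab_by_state: dict mapping each state to the set of words tracked for it,
        e.g. the top words of that state's topic.
    Returns (weeks, feature_names, matrix), with one matrix row per week that
    has at least one matching tweet.
    """
    feature_names = sorted(
        (word, state) for state, words in vocab_by_state.items() for word in words
    )
    known = set(feature_names)
    per_week = defaultdict(Counter)
    for week, state, tokens in tweets:
        # Count tweets containing the word, not repeated occurrences in a tweet.
        for word in set(tokens):
            if (word, state) in known:
                per_week[week][(word, state)] += 1
    weeks = sorted(per_week)
    matrix = [[per_week[week][name] for name in feature_names] for week in weeks]
    return weeks, feature_names, matrix


# Toy data: the same word "cold" lands in different columns depending on state.
tweets = [
    (1, "S", ["cold", "beer"]),
    (1, "I", ["cold", "fever"]),
    (2, "I", ["fever", "fever"]),
]
vocab = {"S": {"cold"}, "I": {"cold", "fever"}}

weeks, names, matrix = state_word_features(tweets, vocab)
print(names)   # → [('cold', 'I'), ('cold', 'S'), ('fever', 'I')]
print(matrix)  # → [[1, 1, 1], [0, 0, 1]]
```

Each row of `matrix` would then be paired with that week's PAHO case count as the regression target.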
related states and topics even with an enlarged and noisier vocabulary.

Acknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1353346, by the Maryland Procurement Office under contract H98230-14-C-0127, by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337, and by the VT College of Engineering. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the respective funding agencies.

REFERENCES

[1] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Predicting Flu Trends using Twitter Data. In IEEE Conference on Computer Communications Workshops, pages 702–707. IEEE, 2011.
[2] M. Andrews and G. Vigliocco. The Hidden Markov Topic Model: A Probabilistic Model of Semantic Representation. Topics in Cognitive Science, 2(1):101–113, 2010.
[3] E. Beretta and Y. Takeuchi. Global Stability of an SIR Epidemic Model with Time Delays. The Journal of mathematical biology, 33(3):250–260, 1995.
[4] D. Blei and J. Lafferty. Dynamic Topic Models. In ICML, pages 113–120, 2006.
[5] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[6] D. Butler. When Google got Flu Wrong. Nature, 494(7436):155–156, 2013.
[7] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, J. Chen, P. Butler, E. Nsoesie, S. Mekaru, J. Brownstein, M. Marathe, and N. Ramakrishnan. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In SIAM International Conference on Data Mining, 2014.
[8] N. Christakis and J. Fowler. Social Network Sensors for Early Detection of Contagious Outbreaks. PLoS ONE, (9), 09 2010.
[9] R. Crane and D. Sornette. Robust Dynamic Classes Revealed by Measuring the Response Function of a Social System. In PNAS, 2008.
[10] J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, and L. Brilliant. Detecting Influenza Epidemics using Search Engine Query Data. Nature, 457(7232):1012–1014, 2008.
[11] A. Gruber, M. Rosen-Zvi, and Y. Weiss. Hidden Topic Markov Models. Artificial Intelligence and Statistics (AISTATS), 2007.
[12] H. W. Hethcote. The mathematics of infectious diseases. SIAM Review, 42, 2000.
[13] L. Hong, D. Yin, J. Guo, and B. Davison. Tracking Trends: Incorporating Term Volume into Temporal Topic Models. In the 17th ACM SIGKDD, pages 484–492, 2011.
[14] J. Jacquez and C. Simon. The Stochastic SI Model with Recruitment and Deaths I. Comparison with the Closed SIS Model. Mathematical Biosciences, 117(1):77–125, 1993.
[15] A. Lamb, M. J. Paul, and M. Dredze. Separating fact from fear: Tracking flu infections on twitter. In North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
[16] V. Lampos, T. De Bie, and N. Cristianini. Flu detector: Tracking epidemics on twitter. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part III, ECML PKDD’10, pages 599–602, 2010.
[17] D. Lazer, R. Kennedy, G. King, and A. Vespignani. The parable of google flu: Traps in big data analysis. Science, 343(6176):1203–1205, 2014.
[18] K. Lee, A. Agrawal, and A. Choudhary. Real-time disease surveillance using twitter data: Demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pages 1474–1477. ACM, 2013.
[19] M. Li and J. Muldowney. Global stability for the SEIR model in epidemiology. Mathematical Biosciences, 125(2):155–164, 1995.
[20] Y. Matsubara, Y. Sakurai, B. A. Prakash, L. Li, and C. Faloutsos. Rise and fall patterns of information diffusion: model and implications. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pages 6–14, 2012.
[21] PAHO. Epidemic disease database, pan american health organization. http://ais.paho.org/phip/viz/ed flu.asp, Dec. 2012.
[22] M. Paul and M. Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. In Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), 2011.
[23] D. M. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In Proceedings of the 20th international conference on World wide web, pages 695–704, 2011.
[24] A. Sadilek, H. Kautz, and V. Silenzio. Predicting disease transmission from geo-tagged micro-blog data. In AAAI Conference on Artificial Intelligence, 2012.
[25] X. Wang and A. McCallum. Topics Over Time: a non-Markov Continuous-time Model of Topical Trends. In the 12th ACM SIGKDD, pages 424–433, 2006.
[26] J. Yang and J. Leskovec. Patterns of temporal variation in online media. In WSDM, pages 177–186, 2011.
[27] J. Yang, J. McAuley, J. Leskovec, P. LePendu, and N. Shah. Finding progression stages in time-evolving event sequences. In the 23rd International Conference on World Wide Web, pages 783–794, 2014.