
The Pulse of News in Social Media: Forecasting Popularity

Roja Bandari (Department of Electrical Engineering, UCLA, Los Angeles, CA 90095-1594)
Sitaram Asur (Social Computing Group, HP Labs, Palo Alto, CA 94304)
Bernardo A. Huberman (Social Computing Group, HP Labs, Palo Alto, CA 94304)
[email protected]  [email protected]  [email protected]

Abstract

News articles are extremely time sensitive by nature. There is also intense competition among news items to propagate as widely as possible. Hence, the task of predicting the popularity of news items on the social web is both interesting and challenging. Prior research has dealt with predicting eventual online popularity based on early popularity. It is most desirable, however, to predict the popularity of items prior to their release, fostering the possibility of appropriate decision making to modify an article and the manner of its publication. In this paper, we construct a multi-dimensional feature space derived from properties of an article and evaluate the efficacy of these features as predictors of online popularity. We examine both regression and classification algorithms and demonstrate that, despite randomness in human behavior, it is possible to predict ranges of popularity on Twitter with an overall accuracy of 84%. Our study also serves to illustrate the differences between traditionally prominent sources and those immensely popular on the social web.

1 Introduction

News articles are very dynamic due to their relation to continuously developing events that typically have short lifespans. For a news article to be popular, it is essential for it to propagate to a large number of readers within a short time. Hence there exists a competition among different sources to generate content that is relevant to a large subset of the population and becomes virally popular.

Traditionally, news reporting and broadcasting have been costly, which meant that large news agencies dominated the competition. But the ease and low cost of online content creation and sharing have recently changed the traditional rules of competition for public attention. News sources now concentrate a large portion of their attention on online media, where they can disseminate their news effectively to a large population. It is therefore common for almost all major news sources to have active accounts in social media services like Twitter to take advantage of the enormous reach these services provide.

Due to the time-sensitive aspect and the intense competition for attention, accurately estimating the extent to which a news article will spread on the web is extremely valuable to journalists, content providers, advertisers, and news recommendation systems. It is also important for activists and politicians, who increasingly use the web to influence public opinion.

However, predicting the online popularity of news articles is a challenging task. First, context outside the web is often not readily accessible, and elements such as local and geographical conditions and various circumstances that affect the population make this prediction difficult. Furthermore, network properties such as the structure of the social networks that propagate the news, influence variations among members, and the interplay between different sections of the web add further layers of complexity to the problem. Most significantly, intuition suggests that the content of an article must play a crucial role in its popularity. Content that resonates with a majority of readers, such as a major worldwide event, can be expected to garner wide attention, while content relevant only to a few may not be as successful.

Given the complexity of the problem due to the factors mentioned above, a growing number of recent studies (Szabó and Huberman 2010), (Lee, Moon, and Salamatian 2010), (Tatar et al. 2011), (Kim, Kim, and Cho 2011), (Lerman and Hogg 2010) make use of early measurements of an item's popularity to predict its future success. In the present work we investigate a more difficult problem: the prediction of social popularity without using early popularity measurements, instead considering solely features of a news article prior to its publication. We focus this work on observable features in the content of an article as well as its source of publication. Our goal is to discover whether any predictors relevant only to the content exist and whether it is possible to make a reasonable forecast of the spread of an article based on content features.

The news data for our study was collected from Feedzilla (www.feedzilla.com), a news feed aggregator, and measurements of the spread are performed on Twitter (www.twitter.com), an immensely popular microblogging social network. Social popularity for the news articles is measured as the number of times a news URL is posted and shared on Twitter.

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
To generate features for the articles, we consider four different characteristics of a given article, namely:

• The news source that generates and posts the article
• The category of news the article falls under
• The subjectivity of the language in the article
• The named entities mentioned in the article

We quantify each of these characteristics by a score, making use of different scoring functions. We then use these scores to generate predictions of the spread of the news articles using regression and classification methods. Our experiments show that it is possible to estimate ranges of popularity with an overall accuracy of 84% considering only content features. Additionally, by comparing with an independent rating of news sources, we demonstrate that there exists a sharp contrast between traditionally popular news sources and the top news propagators on the social web.

In the next section we provide a survey of recent literature related to this work. Section 3 describes the dataset characteristics and the process of feature score assignment. In Section 4 we present the results of the prediction methods. Finally, in Section 5 we conclude the paper and discuss future possibilities for this research.

2 Related Work

Stochastic models of information diffusion as well as deterministic epidemic models have been studied extensively in an array of papers, reaffirming theories developed in sociology such as the diffusion of innovations (Rogers 1995). Among these are models of viral marketing (Leskovec, Adamic, and Huberman 2007), models of attention on the web (Wu and Huberman 2007), cascading behavior in the propagation of information (Gruhl et al. 2004), (Leskovec et al. 2007), and models that describe heavy tails in human dynamics (Vázquez et al. 2006). While some studies incorporate factors for content fitness into their models (Simkin and Roychowdhury 2008), they only capture this in general terms and do not include detailed consideration of content features.

Salganik, Dodds, and Watts performed a controlled experiment on music, comparing the quality of songs versus the effects of social influence (Salganik, Dodds, and Watts 2006). They found that song quality did not play a role in the popularity of highly rated songs and that it was social influence that shaped the outcome. The effect of user influence on information diffusion motivates another set of investigations (Kempe, Kleinberg, and Tardos 2003), (Cosley et al. 2010), (Agarwal et al. 2008), (Lerman and Hogg 2010).

On the subject of news dissemination, (Leskovec, Backstrom, and Kleinberg 2009) and (Yang and Leskovec 2011) study temporal aspects of the spread of news memes online, with (Lerman and Ghosh 2010) empirically studying the spread of news on the social networks of Digg and Twitter and (Sun et al. 2009) studying Facebook news feeds.

A growing number of recent studies predict the spread of information based on early measurements (using early votes on Digg, likes on Facebook, click-throughs, and comments on forums and sites). (Szabó and Huberman 2010) found that the eventual popularity of items posted on YouTube and Digg has a strong correlation with their early popularity; (Lee, Moon, and Salamatian 2010), (Jamali and Rangwala 2009), and (Tatar et al. 2011) predict the popularity of a thread using features based on early measurements of user votes and comments. (Kim, Kim, and Cho 2011) propose the notion of virtual temperature of weblogs using early measurements. (Lerman and Hogg 2010) predict Digg counts using stochastic models that combine design elements of the site (which in turn lead to collective user behavior) with information from early votes.

Finally, recent work on variation in the spread of content has been carried out by (Romero, Meeder, and Kleinberg 2011) with a focus on categories of Twitter hashtags (similar to keywords). This work is aligned with ours in its attention to the importance of content in variations of popularity; however, they consider categories only, with news being one of the hashtag categories. (Yu, Chen, and Kwok 2011) conduct similar work on social marketing messages.

3 Data and Features

This section describes the data, the feature space, and feature score assignment in detail.

3.1 Dataset Description

The data comprises a set of news articles published on the web within a defined time period and the number of times each article was shared by users on Twitter after publication. This data was collected in two steps: first, a set of articles was collected via a news feed aggregator; then the number of times each article was linked to on Twitter was found. In addition, for some of the feature scores, we used a 50-day history of posts on Twitter. The latter will be explained in Section 3.2 on feature scoring.

Figure 1: Log-log distribution of tweets.

Online news feed aggregators are services that collect and deliver news articles as they are published online. Using the API of a news feed aggregator named Feedzilla, we collected news feeds belonging to all news articles published online during one week (August 8th to 16th, 2011), which comprised 44,000 articles in total. The feed for an article includes a title, a short summary of the article, its URL, and a time-stamp. In addition, each article is pre-tagged with a category either provided by the publisher or in some manner determined by Feedzilla.
Figure 2: Normalized values for t-density per category and links per category.

A fair amount of cleaning was performed to remove redundancies, resolve naming variations, and eliminate spam through the use of automated methods as well as manual inspection. As a result, over 2,000 of the 44,000 items in the data were discarded.

The next phase of data collection was performed using Topsy (http://topsy.com), a Twitter search engine that searches all messages posted on Twitter. We queried for the number of times each news link was posted or reshared on Twitter (tweeted or retweeted). Earlier research on news meme buildup and decay (Leskovec, Backstrom, and Kleinberg 2009) suggests that popular news threads take about 4 days until their popularity starts to plateau. Therefore, we allowed 4 days for each link to fully propagate before querying for the number of times it had been shared.

The first half of the data was used in category score assignment (explained in the next section). The rest we partitioned equally into 10,000 samples each for training and test data for the classification and regression algorithms. Figure 1 shows the log distribution of total tweets over all data, demonstrating a long-tail shape which is in agreement with other findings on the distribution of Twitter information cascades (Zhou et al. 2010). The graph also shows that articles with zero tweets lie outside the general linear trend of the graph because they did not propagate on the Twitter social network.

Our objective is to design features based on content to predict the number of tweets for a given article. In the next section we describe these features and the methods used to assign values, or scores, to the features.

3.2 Feature Description and Scoring

The choice of features is motivated by the following questions: Does the category of news affect its popularity within a social network? Do readers prefer factual statements or do they favor a personal tone and emotionally charged language? Does it make a difference whether famous names are mentioned in the article? Does it make a difference who publishes a news article?

These questions motivate the choice of the following characteristics of an article as the feature space: the category the news belongs to (e.g. politics, sports, etc.), whether the language of the text is objective or subjective, whether (and which) named entities are mentioned, and the source that published the news. These four features are chosen based on their availability and relevance, and although it is possible to add other available features in a similar manner, we believe the four features chosen in this paper to be the most relevant.

We would like to point out that we use the terms article and link interchangeably, since each article is represented by its URL.

Category Score
News feeds provided by Feedzilla are pre-tagged with category labels describing the content. We adopted these category labels and designed a score for them which essentially represents a prior distribution on the popularity of categories. Figure 2 illustrates the prominence of each category in the dataset. It shows the number of links published in each category as well as its success on Twitter, represented by the average number of tweets per link for each category. We call the average number of tweets per link the t-density and we will use this measure in score assignments for some other features as well.

t-density = Number of Tweets / Number of Links

Observe in Figure 2 that news related to Technology receives more tweets on average and thus has a more prominent presence in our dataset, and most probably on Twitter as a whole. Furthermore, we can see categories (such as Health) with a low number of published links but a higher t-density (tweets per link). These categories perhaps have a niche following of loyal readers who are intent on posting and retweeting their links.

We use t-density to represent the prior popularity of a category. In order to assign a t-density value (i.e. score) to each category, we use the first 22,000 points in the dataset to compute the average number of tweets per article link in that category.
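As a concrete illustration, the category score is simply a grouped average. The sketch below is a minimal Python example, assuming a pandas DataFrame named articles with hypothetical columns "category" and "tweets" holding the 22,000 scoring points; it is not the code used in the paper.

import pandas as pd

def category_tdensity(articles: pd.DataFrame) -> pd.Series:
    """t-density per category: total tweets divided by the number of links."""
    grouped = articles.groupby("category")["tweets"]
    return grouped.sum() / grouped.count()  # equivalent to grouped.mean()

# scores = category_tdensity(articles.iloc[:22000])
# articles["category_score"] = articles["category"].map(scores)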
Subjectivity
Another feature of an article that can affect the amount of online sharing is its language. We want to examine whether an article written in a more emotional, more personal, and more subjective voice resonates more strongly with readers. Accordingly, we design a binary feature for subjectivity, assigning a value of zero or one based on whether the news article or commentary is written in a more subjective voice rather than using factual and objective language. We make use of a subjectivity classifier from LingPipe (Alias-i. 2008), a natural language toolkit using machine learning algorithms devised by (Pang and Lee 2004). Since this requires training data, we use transcripts from well-known TV and radio shows belonging to Rush Limbaugh (http://www.rushlimbaugh.com) and Keith Olbermann (http://www.msnbc.msn.com/id/32390086) as the corpus for subjective language. On the other hand, transcripts from CSPAN (http://www.c-span.org) as well as the parsed text of a number of articles from the website FirstMonday (http://firstmonday.org) are used as the training corpus for objective language. These two training sets provide a very high training accuracy of 99%, and manual inspection of the final results confirmed that the classification was satisfactory. Figure 3 illustrates the distribution of average subjectivity per source, showing that some sources consistently publish news in a more objective language and a somewhat smaller number in a more subjective language.

Figure 3: Distribution of average subjectivity of sources.
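The same kind of binary subjectivity feature can be reproduced with any text classifier trained on subjective versus objective corpora. The sketch below uses scikit-learn's multinomial Naive Bayes as a stand-in for the LingPipe classifier used in the paper; subjective_docs and objective_docs are hypothetical lists of transcript strings, and the choice of model and features is illustrative only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_subjectivity_classifier(subjective_docs, objective_docs):
    """Train a binary subjectivity classifier: 1 = subjective voice, 0 = objective."""
    texts = list(subjective_docs) + list(objective_docs)
    labels = [1] * len(subjective_docs) + [0] * len(objective_docs)
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(texts, labels)
    return model

# clf = train_subjectivity_classifier(subjective_docs, objective_docs)
# subj = clf.predict([title + " " + summary])[0]  # the Subj feature for one article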
Named Entities
In this paper, a named entity refers to a known place, person, or organization. Intuition suggests that mentioning well-known entities can affect the spread of an article, increasing its chances of success. For instance, one might expect articles on Obama to achieve a larger spread than those on a minor celebrity, and it has been well documented that fans are likely to share almost any content on celebrities like Justin Bieber, Oprah Winfrey, or Ashton Kutcher. We made use of the Stanford-NER entity extraction tool (http://nlp.stanford.edu/software/CRF-NER.shtml) to extract all the named entities present in the title and summary of each article. We then assign scores to over 40,000 named entities by studying the historical prominence of each entity on Twitter over the timeframe of a month. The assigned score is the average t-density (as defined in Section 3.2) of each named entity. To assign a score for a given article we use three different values: the number of named entities in the article, the highest score among all the named entities in the article, and the average score among the entities.
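Given a per-entity t-density table, the three article-level entity features reduce to a count, a maximum, and a mean. A minimal sketch, assuming entity_scores is a dict mapping entity strings to their historical t-density and extract_entities is a hypothetical wrapper around the NER tool; defaulting unseen entities to a score of zero is an assumption, not something specified in the paper.

def entity_features(text, entity_scores, extract_entities):
    """Return (Ent_ct, Ent_max, Ent_avg) for one article's title plus summary."""
    entities = extract_entities(text)                      # e.g. ["Obama", "BBC"]
    scores = [entity_scores.get(e, 0.0) for e in entities]
    ent_ct = len(entities)
    ent_max = max(scores) if scores else 0.0
    ent_avg = sum(scores) / len(scores) if scores else 0.0
    return ent_ct, ent_max, ent_avg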
Source Score
The data includes articles from 1,350 unique sources on the web. We assign a score to each source based on its historical success on Twitter. For this purpose, we collected the number of times articles from each source were shared on Twitter in the past. We used two different scores: first, the aggregate number of times articles from a source were shared, and second, the t-density of each source, which as defined in Section 3.2 is computed as the number of tweets per link belonging to that source. The latter proved to be a better score assignment than the aggregate.

Figure 4: Distribution of the log of source t-density scores over the collected data. The log transformation was used to normalize the score further.

To investigate whether it is better to use a smaller portion of more recent history, or a larger portion going back farther in time and possibly collecting outdated information, we start with the two most recent weeks prior to our data collection and increase the number of days, going back in time. Figure 5 shows the trend of correlation between the t-density of sources in historical data and their t-density in our dataset. We observe that the correlation increases with more datapoints from the history until it begins to plateau near 50 days. Using this result, we take 54 days of history prior to the first date in our dataset. We find that the score assigned in this manner has a correlation of 0.7 with the t-density in our dataset. Meanwhile, the correlation between the source score and the number of tweets of any given article is 0.35, suggesting that information about the source of publication alone is not sufficient to predict popularity. Figure 4 shows the distribution of the log of source scores (t-density scores). Taking the log of source scores produces a more normal shape, leading to improvements in the regression algorithms.
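The history-window choice can be scripted as a simple sweep: compute each source's t-density over the most recent N days of history, correlate it with the t-density observed in the study window, and increase N until the correlation levels off. A sketch under assumed data structures (a pandas DataFrame history with hypothetical columns "source", "date", "tweets", "links", and a Series dataset_tdensity indexed by source); it mirrors the procedure described above but is not the original code.

import pandas as pd

def source_tdensity(history: pd.DataFrame, days: int, end_date) -> pd.Series:
    """t-density per source over the `days` of history ending at `end_date`."""
    start = end_date - pd.Timedelta(days=days)
    window = history[(history["date"] >= start) & (history["date"] < end_date)]
    agg = window.groupby("source")[["tweets", "links"]].sum()
    return agg["tweets"] / agg["links"]

def sweep_history_window(history, dataset_tdensity, end_date, candidates=range(14, 61, 2)):
    """Correlation of historical source t-density with observed t-density per window size."""
    return {days: source_tdensity(history, days, end_date).corr(dataset_tdensity)
            for days in candidates}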
Figure 5: Correlation trend of source scores with t-density in the data. Correlation increases with more days of historical data until it plateaus after 50 days.

We plot the timeline of t-density for a few sources and find that the t-density of a source can vary greatly over time. Figure 6 shows the t-density values for the technology blog Mashable and for Blog Maverick, the weblog of prominent entrepreneur Mark Cuban. The t-density scores corresponding to each of these sources are 74 and 178, respectively. However, one can see that Mashable has a more consistent t-density than Blog Maverick.

Figure 6: Timeline of t-density (tweets per link) of two sources.
In order to improve the score to reflect consistency, we devise two methods. The first is to smooth the measurements for each source by passing them through a low-pass filter. The second is to weight the score by the percentage of time a source's t-density is above the mean t-density over all sources, penalizing sources that drop low too often. The mean value of t-density over all sources is 6.4. Figure 7 shows the temporal variations of tweets and links over all sources. Notice that while both tweets and links have a weekly cycle due to natural variations in web activity, the t-density score is robust to periodic weekly variations. The non-periodic nature of t-density indicates that the reason we see fewer tweets during down times is mainly that fewer links are posted.
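Both adjustments can be expressed compactly on a per-source series of daily t-density values. The sketch below uses an exponentially weighted moving average as one possible low-pass filter and weights the smoothed score by the fraction of days the source stays above the global mean t-density; the smoothing span and the use of pandas' ewm are illustrative choices, not the authors' exact filter.

import pandas as pd

GLOBAL_MEAN_TDENSITY = 6.4  # mean t-density over all sources, as reported above

def consistency_weighted_score(daily_tdensity: pd.Series, span: int = 7) -> float:
    """Smooth a source's daily t-density and penalize sources that often drop below the mean."""
    smoothed = daily_tdensity.ewm(span=span).mean()                  # low-pass filter
    frac_above = (daily_tdensity > GLOBAL_MEAN_TDENSITY).mean()      # consistency weight
    return smoothed.mean() * frac_above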
Figure 7: Temporal variations of tweets, links, and t-density over all sources: (a) tweets and links; (b) t-density.

Are top traditional news sources the most propagated?
As we assign scores to sources in our dataset, we are interested to know whether the sources that are successful in this dataset are those that are conventionally considered prominent.

Google News (http://news.google.com/) is one of the major aggregators and providers of news on the web. While inclusion in Google News results is free, Google uses its own criteria to rank the content and place some articles on its homepage, giving them more exposure. Freshness, diversity, and rich textual content are listed as the factors used by Google News to automatically rank each article as it is published. Because Google does not provide overall rankings for news sources, we use NewsKnife (http://www.newsknife.com) to get a rating of sources. NewsKnife is a service that rates top news sites and journalists based on an analysis of article positions on the Google News homepage and sub-pages internationally. We would like to know whether the sources that are featured more often on Google News (and thus deemed more prominent by Google and rated more highly by NewsKnife) are also those that become most popular in our dataset.

              Total Links    Total Tweets    t-density
Correlation   0.57           0.35            -0.05

Table 1: Correlation values between NewsKnife source scores and their performance on the Twitter dataset.

Accordingly, we measure the correlation values for the 90 top NewsKnife sources that are also present in our dataset. The values are shown in Table 1. It can be observed that the ratings correlate positively with the number of links published by a source (and thus the sum of their tweets), but have no correlation (-0.05) with t-density, which reflects the number of tweets that each of their links receives. For our source scoring scheme this correlation was about 0.7.
Table 2 shows a list of top sources according to NewsKnife, as well as the most popular sources in our dataset. While NewsKnife rates traditionally prominent news agencies such as Reuters and the Wall Street Journal higher, in our dataset the top ten sources (those with the highest t-densities) include sites such as Mashable, AllFacebook (the unofficial Facebook blog), the Google Blog, marketing blogs, as well as the weblogs of well-known people such as Seth Godin's blog and Mark Cuban's Blog Maverick. It is also worth noting that there is a bias toward news and opinion on web marketing, indicating that these sites actively use their own techniques to increase their visibility on Twitter.

While traditional sources publish many articles, those more successful on the social web garner more tweets. A comparison shows that a NewsKnife top source such as The Christian Science Monitor received an average of 16 tweets in our dataset, with several of its articles not getting any tweets. On the other hand, Mashable gained an average of nearly 1,000 tweets, with its least popular article still receiving 360 tweets. Highly ranked news blogs such as The Huffington Post perform relatively well on Twitter, possibly due to their active Twitter accounts, which share any article published on the site.

NewsKnife: Reuters, Los Angeles Times, New York Times, Wall Street Journal, USA Today, Washington Post, ABC News, Bloomberg, Christian Science Monitor, BBC News
Twitter dataset: Blog Maverick, Search Engine Land, Duct-tape Marketing, Seth's Blog, Google Blog, Allfacebook, Mashable, Search Engine Watch

Table 2: Highly rated sources on NewsKnife versus those popular on the Twitter dataset.

4 Prediction

In this work, we evaluate the performance of both regression and classification methods on this problem. First, we apply regression to produce exact values of tweet counts, evaluating the results by the R-squared measure. Next, we define popularity classes and predict which class a given article will belong to. The following two sections describe these methods and their results.

4.1 Regression

Once score assignment is complete, each point in the data (i.e. a given news article) corresponds to a point in the feature space defined by its category, subjectivity, named entity, and source scores. As described in the previous section, the category, source, and named entity scores take real values, while the subjectivity score takes a binary value of 0 or 1. Table 3 lists the features used as inputs to the regression algorithms. We apply three different regression algorithms: linear regression, k-nearest neighbors (KNN) regression, and support vector machine (SVM) regression.

Variable    Description
S           Source t-density score
C           Category t-density score
Subj        Subjectivity (0 or 1)
Ent_ct      Number of named entities
Ent_max     Highest score among named entities
Ent_avg     Average score of named entities

Table 3: Feature set (prediction inputs). t-density refers to average tweets per link.

                 Linear Regression    SVM Regression
All Data         0.34                 0.32
Tech Category    0.43                 0.36
Within Twitter   0.33                 0.25

Table 4: Regression results (R^2 values).

Since the number of tweets per article has a long-tail distribution (as discussed previously in Figure 1), we performed a logarithmic transformation on the number of tweets prior to carrying out the regression. We also used the log of the source and category scores to normalize these scores further. Based on this transformation, we reached the following relationship between the final number of tweets and the feature scores:

ln(T) = 1.24 ln(S) + 0.45 ln(C) + 0.1 Ent_max - 3

where T is the number of tweets, S is the source t-density score, C is the category t-density score, and Ent_max is the maximum t-density among all entities found in the article. Equivalently,

T = S^1.24 * C^0.45 * e^(0.1 Ent_max - 3)

with a coefficient of determination R^2 = 0.258. All three predictors in the above regression were found to be significant. Note that R^2 is the coefficient of determination and relates to the mean squared error and the variance:

R^2 = 1 - MSE/VAR

Alternatively, the following model provided improved results:

T^0.45 = (0.2 S - 0.1 Ent_ct - 0.1 Ent_avg + 0.2 Ent_max)^2

with an improved R^2 = 0.34. Using support vector machine (SVM) regression (Chang and Lin 2011), we reached similar values for R^2, as listed in Table 4.
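The log-linear model above can be reproduced with ordinary least squares on the transformed variables. A minimal sketch, assuming numpy arrays of source scores S, category scores C, maximum entity scores ent_max, and tweet counts T for the training split, all strictly positive so that the logarithms are defined; the fitted coefficients will depend on the data and need not match those reported in the paper.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_log_linear(S, C, ent_max, T):
    """Fit ln(T) ~ ln(S) + ln(C) + Ent_max and report R^2 on the training data."""
    X = np.column_stack([np.log(S), np.log(C), ent_max])
    y = np.log(T)  # zero-tweet articles are excluded beforehand
    model = LinearRegression().fit(X, y)
    return model, model.score(X, y)  # score() is the coefficient of determination R^2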
In K-nearest neighbor (KNN) regression, we predict the tweet count of a given article using values from its nearest neighbors. We measure the Euclidean distance between two articles based on their positions in the feature space (Hastie, Tibshirani, and Friedman 2008). The parameter K specifies the number of nearest neighbors to be considered for a given article.
Results with K = 7 and K = 3 for a 10k test set are R-squared = 0.05, with a mean squared error of 5101.695. We observe that KNN performs increasingly poorly as the dataset becomes larger.
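For reference, this baseline corresponds directly to scikit-learn's KNeighborsRegressor with a Euclidean metric; the sketch below mirrors the K = 7 setting, with X_train, y_train, and X_test as hypothetical names for the feature matrices and tweet counts.

from sklearn.neighbors import KNeighborsRegressor

def knn_predict(X_train, y_train, X_test, k=7):
    """Predict tweet counts from the k nearest articles in feature space."""
    knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    return knn.predict(X_test)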
Category-specific prediction
One of the weakest predictors in the regression was the category score. One of the reasons for this is that there seems to be a lot of overlap across categories. For example, one would expect World News and Top News to have some overlap, and the category USA would feature articles that overlap with other categories as well. The categories provided by Feedzilla are therefore not necessarily disjoint, and this is the reason we observe low prediction accuracy.

To evaluate this hypothesis, we repeated the prediction algorithm for particular categories of content. Using only the articles in the Technology category, we reached an R^2 value of 0.43, indicating that when employing regression we can predict the popularity of articles within one category (i.e. Technology) with better results.

4.2 Classification

The feature scores derived from historical data on Twitter are based on articles that have been tweeted, not on the articles that never make it to Twitter (which make up about half of the articles). As discussed in Section 3.1, this is evident in how the zero-tweet articles do not follow the linear trend of the rest of the datapoints in Figure 1. Consequently, we do not include a zero-tweet class in our classification scheme and perform the classification by considering only those articles that were posted on Twitter.

Table 5 shows three popularity classes, A (1 to 20 tweets), B (20 to 100 tweets), and C (more than 100 tweets), and the number of articles in each class in the set of 10,000 articles. Table 6 lists the results of support vector machine (SVM) classification, decision trees, and bagging (Hall et al. 2009) for classifying the articles. All methods were evaluated with 10-fold cross-validation. We can see that classification performs with an overall accuracy of 84% in determining whether an article will belong to the low-tweet, medium-tweet, or high-tweet class.

Class name    Range of tweets    Number of articles
A             1–20               7,600
B             20–100             1,800
C             100–2400           600

Table 5: Article classes.

Method                Accuracy
Bagging               83.96%
J48 Decision Trees    83.75%
SVM                   81.54%
Naive Bayes           77.79%

Table 6: Classification results.

In order to determine which features play a more significant role in prediction, we repeat the SVM classification leaving one of the features out at each step. We found that the publication source plays a more important role than the other predictors, while subjectivity, categories, and named entities do not provide much improvement in the prediction of news popularity on Twitter.
Predicting Zero-tweet Articles
We perform binary classification to predict which articles will be mentioned on Twitter at all (zero-tweet versus nonzero-tweet articles). Using SVM classification, we can predict with 66% accuracy whether an article will be linked to on Twitter or will receive zero tweets. We repeat this operation leaving out one feature at a time to see the change in accuracy. We find that the most significant feature is the source, followed by the category. Named entities and subjectivity did not provide more information for this prediction. So, contrary to what one might expect, we find that readers overall favor neither subjectivity nor objectivity of language in a news article. It is interesting to note that while the category score does not contribute to the prediction of popularity within Twitter, it does help us determine whether an article will be mentioned on this social network at all. This could be due to a large bias toward sharing technology-related articles on Twitter.

5 Discussion and Conclusion

This work falls within the larger vision of studying how attention is allocated on the web. There exists an intense and fast-paced competition for attention among news items published online, and we examined factors within the content of articles that can lead to success in this competition. We predicted the popularity of news items on Twitter using features extracted from the content of news articles. We have taken into account four features that cover the spectrum of the information that can be gleaned from the content: the source of the article, the category, the subjectivity of the language, and the named entities mentioned. Our results show that while these features may not be sufficient to predict the exact number of tweets that an article will garner, they can be effective in providing a range of popularity for the article on Twitter. More precisely, while the regression results were not adequate, we achieved an overall accuracy of 84% using classifiers. It is important to bear in mind that while it is intriguing to pay attention to the most popular articles (those that become viral on the web), a great number of articles spread in medium numbers. These medium levels can target highly interested and informed readers, and thus the mid-ranges of popularity should not be dismissed.

Interestingly, we have found that in terms of the number of retweets, the top news sources on Twitter are not necessarily the conventionally popular news agencies; various technology blogs such as Mashable and the Google Blog are very widely shared in social media. Overall, we discovered that one of the most important predictors of popularity was the source of the article. This is in agreement with the intuition that readers are likely to be influenced by the news source that disseminates the article. On the other hand, the category feature did not perform well. One reason for this is that we are relying on categories provided by Feedzilla, many of which overlap in content.
Thus a future task is to extract categories independently and ensure little overlap.

Combining other layers of complexity described in the introduction opens up the possibility of better prediction. It would be interesting to further incorporate the interaction between offline and online media sources, different modes of information dissemination on the web, and network factors such as the influence of individual propagators.

6 Acknowledgements

We would like to thank Vwani Roychowdhury for his support of the project and the reviewers for their constructive comments.

References

Agarwal, N.; Liu, H.; Tang, L.; and Yu, P. S. 2008. Identifying the influential bloggers in a community. In Proceedings of the International Conference on Web Search and Web Data Mining, WSDM '08, 207–218. New York, NY, USA: ACM.
Alias-i. 2008. LingPipe 4.1.0. http://alias-i.com/lingpipe.
Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cosley, D.; Huttenlocher, D.; Kleinberg, J.; Lan, X.; and Suri, S. 2010. Sequential influence models in social networks. In 4th International Conference on Weblogs and Social Media.
Gruhl, D.; Liben-Nowell, D.; Guha, R. V.; and Tomkins, A. 2004. Information diffusion through blogspace. SIGKDD Explorations 6(2):43–52.
Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; and Witten, I. H. 2009. The WEKA data mining software: an update. SIGKDD Explorations Newsletter 11:10–18.
Hastie, T.; Tibshirani, R.; and Friedman, J. 2008. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer.
Jamali, S., and Rangwala, H. 2009. Digging Digg: Comment mining, popularity prediction, and social network analysis. In International Conference on Web Information Systems and Mining, WISM 2009, 32–38.
Kempe, D.; Kleinberg, J. M.; and Tardos, É. 2003. Maximizing the spread of influence through a social network. In KDD, 137–146. ACM.
Kim, S.-D.; Kim, S.-H.; and Cho, H.-G. 2011. Predicting the virtual temperature of web-blog articles as a measurement tool for online popularity. In IEEE 11th International Conference on Computer and Information Technology (CIT), 449–454.
Lee, J. G.; Moon, S.; and Salamatian, K. 2010. An approach to model and predict the popularity of online contents with explanatory factors. In Web Intelligence, 623–630. IEEE.
Lerman, K., and Ghosh, R. 2010. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. In ICWSM. The AAAI Press.
Lerman, K., and Hogg, T. 2010. Using a model of social dynamics to predict popularity of news. In WWW, 621–630. ACM.
Leskovec, J.; Adamic, L. A.; and Huberman, B. A. 2007. The dynamics of viral marketing. TWEB 1(1).
Leskovec, J.; Backstrom, L.; and Kleinberg, J. M. 2009. Meme-tracking and the dynamics of the news cycle. In KDD, 497–506. ACM.
Leskovec, J.; McGlohon, M.; Faloutsos, C.; Glance, N.; and Hurst, M. 2007. Cascading behavior in large blog graphs. In SDM.
Pang, B., and Lee, L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL, 271–278.
Rogers, E. 1995. Diffusion of Innovations. Free Press.
Romero, D. M.; Meeder, B.; and Kleinberg, J. 2011. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, 695–704. New York, NY, USA: ACM.
Salganik, M. J.; Dodds, P. S.; and Watts, D. J. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311(5762):854–856.
Simkin, M. V., and Roychowdhury, V. P. 2008. A theory of web traffic. EPL (Europhysics Letters) 82(2):28006.
Sun, E.; Rosenn, I.; Marlow, C.; and Lento, T. M. 2009. Gesundheit! Modeling contagion through Facebook news feed. In ICWSM. The AAAI Press.
Szabó, G., and Huberman, B. A. 2010. Predicting the popularity of online content. Commun. ACM 53(8):80–88.
Tatar, A.; Leguay, J.; Antoniadis, P.; Limbourg, A.; de Amorim, M. D.; and Fdida, S. 2011. Predicting the popularity of online articles based on user comments. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS '11, 67:1–67:8. New York, NY, USA: ACM.
Vázquez, A.; Oliveira, J. G.; Dezsö, Z.; Goh, K.-I.; Kondor, I.; and Barabási, A.-L. 2006. Modeling bursts and heavy tails in human dynamics. Phys. Rev. E 73:036127.
Wu, F., and Huberman, B. A. 2007. Novelty and collective attention. Proceedings of the National Academy of Sciences 104(45):17599–17601.
Yang, J., and Leskovec, J. 2011. Patterns of temporal variation in online media. In WSDM, 177–186. ACM.
Yu, B.; Chen, M.; and Kwok, L. 2011. Toward predicting popularity of social marketing messages. In SBP, volume 6589 of Lecture Notes in Computer Science, 317–324. Springer.
Zhou, Z.; Bandari, R.; Kong, J.; Qian, H.; and Roychowdhury, V. 2010. Information resonance on Twitter: watching Iran. In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, 123–131. New York, NY, USA: ACM.
