Who Says What To Whom On Twitter


Who Says What to Whom on Twitter

Shaomei Wu∗
Cornell University, USA
[email protected]

Jake M. Hofman
Yahoo! Research, NY, USA
[email protected]

Winter A. Mason
Yahoo! Research, NY, USA
winteram@yahoo-inc.com

Duncan J. Watts
Yahoo! Research, NY, USA
[email protected]

ABSTRACT
We study several longstanding questions in media communications research, in the context of the microblogging service Twitter, regarding the production, flow, and consumption of information. To do so, we exploit a recently introduced feature of Twitter, known as Twitter lists, to distinguish between elite users, by which we mean specifically celebrities, bloggers, and representatives of media outlets and other formal organizations, and ordinary users. Based on this classification, we find a striking concentration of attention on Twitter (roughly 50% of tweets consumed are generated by just 20K elite users), where the media produces the most information, but celebrities are the most followed. We also find significant homophily within categories: celebrities listen to celebrities, while bloggers listen to bloggers, etc.; however, bloggers in general rebroadcast more information than the other categories. Next we re-examine the classical “two-step flow” theory of communications, finding considerable support for it on Twitter, but also some interesting differences. Third, we find that URLs broadcast by different categories of users or containing different types of content exhibit systematically different lifespans. And finally, we examine the attention paid by the different user categories to different news topics.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems; J.4 [Social and Behavioral Sciences]: Sociology

General Terms
two-step flow, communications, classification

Keywords
Communication networks, Twitter, information flow

∗Part of this research was performed while the author was visiting Yahoo! Research, New York.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WWW ’11 Hyderabad, India
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION
A longstanding objective of media communications research is encapsulated by what is known as Lasswell’s maxim: “who says what to whom in what channel with what effect” [9], so-named for one of the pioneers of the field, Harold Lasswell. Although simple to state, Lasswell’s maxim has proven difficult to satisfy in the more than 60 years since he stated it, in part because it is generally difficult to observe information flows in large populations, and in part because different channels have very different attributes and effects. As a result, theories of communications have tended to focus either on “mass” communication, defined as “one-way message transmissions from one source to a large, relatively undifferentiated and anonymous audience,” or on “interpersonal” communication, meaning a “two-way message exchange between two or more individuals” [13].

Correspondingly, debates among communication theorists have tended to revolve around the relative importance of these two putative modes of communication. For example, whereas early theories such as the so-called “hypodermic model” posited that mass media exerted direct and relatively strong effects on public opinion, mid-century researchers [10, 6, 11, 4] argued that the mass media influenced the public only indirectly, via what they called a two-step flow of communications, where the critical intermediate layer was occupied by a category of media-savvy individuals called opinion leaders. The resulting “limited effects” paradigm was subsequently challenged by a new generation of researchers [5], who claimed that the real importance of the mass media lay in its ability to set the agenda of public discourse. But in recent years rising public skepticism of mass media, along with changes in media and communication technology, has tilted conventional academic wisdom once more in favor of interpersonal communication, which some identify as a “new era” of minimal effects [2].

Recent changes in technology, however, have increasingly undermined the validity of the mass vs. interpersonal dichotomy itself. On the one hand, over the past few decades mass communication has experienced a proliferation of new channels, including cable television, satellite radio, specialist book and magazine publishers, and of course an array of web-based media such as sponsored blogs, online communities, and social news sites. Correspondingly, the traditional mass audience once associated with, say, network television has fragmented into many smaller audiences, each of which increasingly selects the information to which it is exposed, and in some cases generates the information itself. Mean-
while, in the opposite direction, interpersonal communication has become increasingly amplified through personal blogs, email lists, and social networking sites to afford individuals ever-larger audiences. Together, these two trends have greatly obscured the historical distinction between mass and interpersonal communications, leading some scholars to refer instead to “masspersonal” communications [13].

Nowhere is the erosion of traditional categories more apparent than in the micro-blogging platform Twitter. To illustrate, the top ten most-followed users on Twitter are not corporations or media organizations, but individual people, mostly celebrities. Moreover, these individuals communicate directly with their millions of followers, through accounts often managed by themselves or their publicists, thus bypassing the traditional intermediation of the mass media between celebrities and fans. Next, in addition to conventional celebrities, a new class of “semi-public” individuals like bloggers, authors, journalists, and subject matter experts has come to occupy an important niche on Twitter, in some cases becoming more prominent than traditional public figures such as entertainers and elected officials. Third, in spite of these shifts away from centralized media power, media organizations (along with corporations, governments, and NGOs) all remain well represented among highly followed users, and are often extremely active. And finally, Twitter is primarily made up of many millions of users who seem to be ordinary individuals communicating with their friends and acquaintances in a manner largely consistent with traditional notions of interpersonal communication.

Twitter, therefore, represents the full spectrum of communications from personal and private to “masspersonal” to traditional mass media. Consequently it provides an interesting context in which to address Lasswell’s maxim, especially as Twitter, unlike television, radio, and print media, enables one to easily observe information flows among the members of its ecosystem. Unfortunately, however, the kinds of effects that are of most interest to communications theorists, such as changes in behavior, attitudes, etc., remain difficult to measure on Twitter. Therefore in this paper we limit our focus to the “who says what to whom” part of Lasswell’s maxim.

To this end, our paper makes three main contributions:

• We introduce a method for classifying users using Twitter Lists into “elite” and “ordinary” users, further classifying elite users into one of four categories of interest: media, celebrities, organizations, and bloggers.
• We investigate the flow of information among these categories, finding that although audience attention is highly concentrated on a minority of elite users, much of the information they produce reaches the masses indirectly via a large population of intermediaries.
• We find that different categories of users place slightly different emphasis on different types of content, and that different content types exhibit dramatically different characteristic lifespans, ranging from less than a day to months.

The remainder of the paper proceeds as follows. In the next section, we review related work. In section 3 we discuss our data and methods, including section 3.3 in which we describe how we use Twitter Lists to classify users, outline two different sampling methods, and show that they deliver qualitatively similar results. In section 4 we analyze the production of information on Twitter, particularly who pays attention to whom. In section 4.1, we revisit the theory of the two-step flow (arguably the dominant theory of communications for much of the past 50 years), finding considerable support for the theory as well as some interesting differences. In section 5, we consider “who listens to what”, examining first who shares what kinds of media content, and second the lifespan of URLs as a function of their origin and their content. Finally, in section 6 we conclude with a brief discussion of future work.

2. RELATED WORK
Aside from the communications literature surveyed above, a number of recent papers have examined information diffusion on Twitter. Kwak et al. [8] studied the topological features of the Twitter follower graph, concluding from the highly skewed nature of the distribution of followers and the low rate of reciprocated ties that Twitter more closely resembled an information sharing network than a social network, a conclusion that is consistent with our own view. In addition, Kwak et al. compared three different measures of influence (number of followers, PageRank, and number of retweets), finding that the ranking of the most influential users differed depending on the measure. In a similar vein, Cha et al. [3] compared three measures of influence (number of followers, number of retweets, and number of mentions) and also found that the most followed users did not necessarily score highest on the other measures. Weng et al. [15] compared number of followers and PageRank with a modified PageRank measure which accounted for topic, again finding that ranking depended on the influence measure. Finally, Bakshy et al. [1] studied the distribution of retweet cascades on Twitter, finding that although users with large follower counts and past success in triggering cascades were on average more likely to trigger large cascades in the future, these features are in general poor predictors of future cascade size.

Our paper differs from this earlier work by shifting attention from the ranking of individual users in terms of various influence measures to the flow of information among different categories of users. In particular, we are interested in identifying “elite” users, whom we differentiate from “ordinary” users in terms of their visibility, and understanding their role in introducing information into Twitter, as well as how information originating from traditional media sources reaches the masses.

3. DATA AND METHODS

3.1 Twitter Follower Graph
In order to understand how information is transmitted on Twitter, we need to know the channels by which it flows; that is, who is following whom on Twitter. To this end, we used the follower graph studied by Kwak et al. [8], which included 42M users and 1.5B edges. This data represents a crawl of the graph seeded with all users on Twitter as observed by July 31st, 2009, and is publicly available¹. As reported by Kwak et al. [8], the follower graph is a directed network characterized by highly skewed distributions both

¹ The data is free to download from http://an.kaist.ac.kr/traces/WWW2010.html
of in-degree (# followers) and out-degree (# “friends”, Twitter notation for how many others a user follows); however, the out-degree distribution is even more skewed than the in-degree distribution. In both friend and follower distributions, for example, the median is less than 100, but the maximum # friends is several hundred thousand, while a small number of users have millions of followers. In addition, the follower graph is also characterized by extremely low reciprocity (roughly 20%); in particular, the most-followed individuals typically do not follow many others. The Twitter follower graph, in other words, does not conform to the usual characteristics of social networks, which exhibit much higher reciprocity and far less skewed degree distributions [7], but instead more closely resembles the mixture of one-way mass communications and reciprocated interpersonal communications described above.

3.2 Twitter Firehose
In addition to the follower graph, we are interested in the content being shared on Twitter, particularly URLs, and so we examined the corpus of all 5B tweets generated over a 223 day period from July 28, 2009 to March 8, 2010 using data from the Twitter “firehose,” the complete stream of all tweets². Because our objective is to understand the flow of information, it is useful for us to restrict attention to tweets containing URLs, for two reasons. First, URLs add easily identifiable tags to individual tweets, allowing us to observe when a particular piece of content is either retweeted or subsequently reintroduced by another user. And second, because URLs point to online content outside of Twitter, they provide a much richer source of variation than is possible in the typical 140 character tweet. Finally, we note that almost all URLs broadcast on Twitter have been shortened using one of a number of URL shorteners, of which the most popular is http://bit.ly/. From the total of 5B tweets recorded during our observation period, therefore, we focus our attention on the subset of 260M containing bit.ly URLs.

3.3 Twitter Lists
Our method for classifying users exploits a relatively recent feature of Twitter: Twitter Lists. Since its launch on November 2, 2009, the Twitter Lists feature has been welcomed by the community as a way to group people and organize one’s incoming stream of tweets by specific sets of users. To create a Twitter List, a user needs to provide a name (required) and description (optional) for the list, and decide whether the new list is public (anyone can view and subscribe to this list) or private (only the list creator can view or subscribe to this list). Once a list is created, the user can add/edit/delete list members. As the purpose of Twitter Lists is to help users organize users they follow, the name of the list can be considered a meaningful label for the listed users. List creation therefore effectively applies the “wisdom of crowds” [12] to the task of classifying users, both in terms of their importance to the community (number of lists on which they appear), and also how they are perceived (e.g. news organization vs. celebrity, etc.).

Before describing our methods for classifying users in terms of the lists on which they appear, we emphasize that we are motivated by a particular set of substantive questions arising out of communications theory. In particular, we are interested in the relative importance of mass communications, as practiced by media and other formal organizations, masspersonal communications as practiced by celebrities and prominent bloggers, and interpersonal communications, as practiced by ordinary individuals communicating with their friends. In addition, we are also interested in the relationships between these categories of users, motivated by theoretical arguments such as the theory of the two-step flow [6]. Rather than pursuing a strategy of automatic classification, therefore, our approach depends on defining and identifying certain predetermined classes of theoretical interest; both approaches have advantages and disadvantages. In particular, we restrict our attention to four classes of what we call “elite” users: media, celebrities, organizations, and bloggers, as well as the relationships between these elite users and the much larger population of “ordinary” users.

In addition to these theoretically-imposed constraints, our proposed classification method must also satisfy a practical constraint, namely that the rate limits established by Twitter’s API effectively preclude crawling all lists for all Twitter users³. Thus we instead devised two different sampling schemes, a snowball sample and an activity sample, each with some advantages and disadvantages, discussed below.

3.3.1 Snowball sample of Twitter Lists
The first method for identifying elite users employed snowball sampling. For each category, we chose a number $u_0$ of seed users that were highly representative of the desired category and appeared on many category-related lists. For each of the four categories above, the following seeds were chosen:

• Celebrities: Barack Obama, Lady Gaga, Paris Hilton
• Media: CNN, New York Times
• Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods
• Blogs⁴: BoingBoing, FamousBloggers, problogger, mashable, Chrisbrogan, virtuosoblogger, Gizmodo, Ileane, dragonblogger, bbrian017, hishaman, copyblogger, engadget, danielscocco, BlazingMinds, bloggersblog, TycoonBlogger, shoemoney, wchingya, extremejohn, GrowMap, kikolani, smartbloggerz, Element321, brandonacox, remarkablogger, jsinkeywest, seosmarty, NotAProBlog, kbloemendaal, JimiJones, ditesco

After reviewing the lists associated with these seeds, the following keywords were hand-selected based on (a) their representativeness of the desired categories; and (b) their lack of overlap between categories:

² http://dev.twitter.com/doc/get/statuses/firehose
³ The Twitter API allows only 20K calls per hour, where at most 20 lists can be retrieved for each API call. Under the modest assumption of 40M users (roughly the number in the 2009 crawl by Kwak et al. [8]), where each user is included on at most 20 lists, this would require $4 \times 10^{7} / (2 \times 10^{4}) = 2{,}000$ hours, or more than 11 weeks. Clearly this time could be reduced by deploying multiple accounts, but it also likely underestimates the real time quite significantly, as many users appear on many more than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
⁴ The blogger category required many more seeds because bloggers are in general lower profile than the seeds for the other categories.
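The crawl-time estimate in footnote 3 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `crawl_hours` and its parameters are hypothetical, and the figures (40M users, 20K calls per hour, 20 lists per call, at most 20 lists per user) are the assumptions stated in the footnote.

```python
import math


def crawl_hours(num_users, calls_per_hour, lists_per_call=20, lists_per_user=20):
    """Hours needed to fetch every user's list memberships under a rate limit.

    Assumes each user needs ceil(lists_per_user / lists_per_call) API calls.
    """
    calls_per_user = math.ceil(lists_per_user / lists_per_call)
    return num_users * calls_per_user / calls_per_hour


# Footnote assumptions: 40M users, 20K calls/hour, at most 20 lists per user.
hours = crawl_hours(num_users=40_000_000, calls_per_hour=20_000)
weeks = hours / (24 * 7)
print(f"{hours:,.0f} hours, about {weeks:.1f} weeks")  # 2,000 hours, about 11.9 weeks
```

Raising `lists_per_user` shows why the footnote calls the estimate optimistic: a single account that appears on 140,000 lists would by itself need 7,000 calls under these assumptions.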
• Celebrities: star, stars, hollywood, celebs, celebrity, celebrities, celebsverified, celebrity-list, celebrities-on-twitter, celebrity-tweets
• Media: news, media, news-media
• Organizations: company, companies, organization, organisation, organizations, organisations, corporation, brands, products, charity, charities, causes, cause, ngo
• Blogs: blog, blogs, blogger, bloggers

[Figure 1 image not reproduced; schematic with levels labeled u0, l0, u1, l1, u2, l2]
Figure 1: Schematic of the Snowball Sampling Method

Having selected the seeds and the keywords for each category, we then performed a snowball sample of the bipartite graph of users and lists (see Figure 1). For each seed, we crawled all lists on which that seed appeared. The resulting “list of lists” was then pruned to contain only the $l_0$ lists whose names matched at least one of the chosen keywords for that category. For instance, Lady Gaga is on lists called “faves”, “celebs”, and “celebrity”, but only the latter two lists would be kept after pruning. We then crawled all $u_1$ users appearing in the pruned “list of lists” (for instance, finding all users that appeared in the “celebrity” list with Lady Gaga), and then repeated these last two steps to complete the crawl. In total, 524,116 users were obtained, who appeared on 7,000,000 lists; however, many of the more prominent users appeared on lists in more than one category: for example, Oprah Winfrey is frequently included in lists of “celebrity” as well as “media.” To resolve this ambiguity, we computed a user $i$’s membership score in category $c$:

$$w_{ic} = \frac{n_{ic}}{N_c},$$

where $n_{ic}$ is the number of lists in category $c$ that contain user $i$ and $N_c$ is the total number of lists in category $c$.

Table 1: Distribution of users over categories

            Snowball Sample            Activity Sample
category    # of users   % of users    # of users   % of users
celeb           82,770        15.8%        14,778        13.0%
media          216,010        41.2%        40,186        35.3%
org             97,853        18.7%        14,891        13.1%
blog           127,483        24.3%        43,830        38.6%
total          524,116         100%       113,685         100%

however, the bias is likely to be quite different from any introduced by the snowball sample; thus obtaining similar results from the two samples should give us confidence that our findings are not artifacts of the sampling procedure. This method initially yielded 750k users and 5M lists; however, after pruning the lists to those that contained at least one of the keywords above, and assigning users to unique categories (as described above), we obtained a much-reduced sample of 113,685 users, where Table 1 reports the number of users assigned to each category. We note that the number of lists obtained by the activity sampling method is considerably smaller than that obtained by the snowball sample, and that bloggers are more heavily represented among the activity sample at the expense of the other three categories, consistent with our claim that the two methods introduce different biases. Interestingly, however, 97,614 of the activity sample, or 85%, also appear in the snowball sample, suggesting that the two sampling methods identify similar populations of elite users, as indeed we confirm in the next section.

3.3.3 Classifying Elite Users
In order to identify categories of elite users, we not only need to classify users into categories, but also arrive at a definition of “elite” that satisfies a tradeoff between (a) keeping each category relatively small, so as not to include users who are not distinguishable from ordinary users, while (b) maximizing the volume of attention that is accounted for by each category. In addition, it is also desirable to make the four categories the same size, so as to facilitate comparisons. To this end, we first rank all users in each category by how frequently they are listed in that category. Next, we measure the flow of information from the top k users in each of the four categories to a random sample of 100K ordinary (i.e. unclassified) users in two ways: the proportion of people the user follows in each category, and the proportion of tweets the user received from everyone the user follows in each category.
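The membership score $w_{ic} = n_{ic}/N_c$ from Section 3.3.1, and the rule of placing each user in the category where that score is highest, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the `(category, members)` input layout, the alphabetical tie-breaking, and the toy lists are all hypothetical.

```python
from collections import Counter, defaultdict


def membership_scores(lists):
    """Compute w_ic = n_ic / N_c from (category, members) pairs.

    Each pair stands for one keyword-matched Twitter List (assumed layout).
    Returns {user: {category: score}}.
    """
    lists_per_category = Counter()      # N_c: number of lists in category c
    user_counts = defaultdict(Counter)  # n_ic: lists in c that contain user i
    for category, members in lists:
        lists_per_category[category] += 1
        for user in set(members):       # count a user at most once per list
            user_counts[user][category] += 1
    return {
        user: {c: n / lists_per_category[c] for c, n in counts.items()}
        for user, counts in user_counts.items()
    }


def assign_categories(scores):
    """Assign each user to the highest-scoring category.

    Ties go to the alphabetically first category (an assumption; the text
    does not say how ties are handled).
    """
    return {user: max(sorted(w), key=w.get) for user, w in scores.items()}


# Toy data, invented for illustration only.
toy_lists = [
    ("celeb", ["oprah", "ladygaga"]),
    ("celeb", ["ladygaga"]),
    ("celeb", ["oprah", "ladygaga"]),
    ("media", ["oprah", "cnn"]),
    ("media", ["cnn"]),
]
scores = membership_scores(toy_lists)
assignment = assign_categories(scores)
print(assignment)
```

In the toy data the user listed under both categories scores 2/3 for "celeb" and 1/2 for "media", so normalizing by $N_c$ keeps a category with many lists from winning on raw counts alone.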
We then assigned each user to the category in which he or she has the highest membership score. The number of users assigned in this manner to each category is reported in Table 1.

3.3.2 Activity Sample of Twitter Lists
Although the snowball sampling method is convenient and is easily interpretable with respect to our theoretical motivation, it is also potentially biased by our particular choice of seeds. To address this concern, we also generate a sample of users based on their activity. Specifically, we crawl all lists associated with all users who tweet at least once every week for our entire observation period.

This “activity-based” sampling method is also clearly biased towards users who are consistently active. Importantly,

Figure 2(a) shows for the snowball sample the share of following links (square symbols) and tweets received (diamonds) by an average user, while Figure 2(b) shows the same information for the activity sample. Although the numerical values differ slightly, the two sets of results are qualitatively similar. In particular, for both sampling methods, celebrities outrank all other categories, followed by the media, organizations, and bloggers. Also in both cases, the bulk of the attention is accounted for by a relatively small number of users within each category, as evidenced by the relatively flat slope of the attention curves, where we note that the curve for celebrities asymptotes more slowly than for the other three categories. Balancing the requirements described above, therefore, we chose k = 5000 as a cut-off for the elite categories, where all remaining users are henceforth classified as ordinary. In addition, from this point on,
we restrict our analysis of elite categories to the top 5,000 users identified by the snowball sampling method, noting that both methods generate similar results.

[Figure 2 plots not reproduced; for each sample, four panels (celebrities, media, organizations, blogs) with x-axis “top k” (1000 to 10000), y-axis “average %” (0 to 30), and two series: friends, tweets received]
(a) Snowball sample
(b) Activity sample
Figure 2: Average fraction of # following (blue line) and # tweets (red line) for a random user that are accounted for by the top K elite users crawled

Based on this definition of elite users, Table 2 shows that although ordinary users collectively introduce by far the highest number of URLs, members of the elite categories are far more active on a per-capita basis. In particular, users classified as “media” easily outproduce all other categories, followed by bloggers, organizations, and celebrities. Ordinary users originate on average only about 6 URLs each, compared with over 1,000 for media users. In the rest of this paper, therefore, when we talk about “celebrity”, “media”, “organization”, “blog”, we refer to the top 5K users drawn from the snowball sample listed as “celebrity”, “media”, “organization”, “blog”, respectively.

Table 2: # of URLs initiated by category

category      # of URLs    # of URLs per-capita
celeb           139,058                   27.81
media         5,119,739                 1023.94
org             523,698                  104.74
blog          1,360,131                  272.03
ordinary    244,228,364                    6.10

Table 3, which shows the top 5 users in each of the four categories, suggests that the sampling method yields results that are consistent with our objective of identifying users who are prominent exemplars of our target categories. Among the celebrity list, for example, “aplusk” is the handle for actor Ashton Kutcher, one of the first celebrities to embrace Twitter and still one of the most followed users, while the remaining celebrity users (Lady Gaga, Ellen DeGeneres, Oprah Winfrey, and Taylor Swift) are all household names. In the media category, CNN Breaking News and the New York Times are most prominent, followed by Breaking News, Time, and Asahi, a leading Japanese daily newspaper. Among organizations, Google, Starbucks, and Twitter are obviously large and socially prominent corporations, while JoinRed is the charity organization started by Bono of U2, and ollehkt is the Twitter account for KT, formerly Korea Telecom. Finally, among the blogging category, Mashable and ProBlogger are both prominent US blogging sites, while Kibe Loco and Nao Salvo are popular blogs in Brazil, and dooce is the blog of Heather Armstrong, a widely read “mommy blogger” with over 1.5M followers.

Table 3: Top 5 users in each category

Celebrity       Media          Org         Blog
aplusk          cnnbrk         google      mashable
ladygaga        nytimes        Starbucks   problogger
TheEllenShow    asahi          twitter     kibeloco
taylorswift13   BreakingNews   joinred     naosalvo
Oprah           TIME           ollehkt     dooce

4. “WHO LISTENS TO WHOM”
The results of the previous section provide qualified support for the conventional wisdom that audiences have become increasingly fragmented. Clearly, ordinary users on Twitter are receiving their information from many thousands of distinct sources, most of which are not traditional media organizations: even though media outlets are by far the most active users on Twitter, only about 15% of tweets received by ordinary users are received directly from the media. Equally interesting, however, is that in spite of this fragmentation, it remains the case that 20K elite users, comprising less than 0.05% of the user population, attract almost 50% of all attention within Twitter. Even if the media has lost attention relative to other elites, information flows have not become egalitarian by any means.

The prominence of elite users also raises the question of how these different categories listen to each other. To address this issue, we compute the volume of tweets exchanged between elite categories. Specifically, Figure 3 shows the average percentage of tweets that category i receives from category j, exhibiting striking homophily with respect to attention: celebrities overwhelmingly pay attention to other celebrities, media actors pay attention to other media actors, and so on. The one slight exception to this rule is that organizations pay more attention to bloggers than to them-
selves. In general, in fact, attention paid by organizations is more evenly distributed across categories than for any other category.

[Figure 3 diagram not reproduced; it shows the four categories of Twitter users (Celeb, Media, Org, Blog), with a legend indicating that a link between A and B means B receives tweets from A]
Figure 3: Share of tweets received among elite categories

[Figure 4 diagram not reproduced]
Figure 4: RT behavior among elite categories

light on the theory of the two-step flow, arguably the theory that has most successfully captured the dueling importance of mass media and interpersonal influence. The essence of the two-step flow is that information passes from the media to the masses not directly, as supposed by early theories of mass communication, but passes first through an intermediate layer of “opinion leaders” who decide which information to rebroadcast to their followers, and which to ignore. As we have already noted, on Twitter the flow of information to the masses from the media accounts for only a fraction of the total volume of information. Nevertheless, it is still a substantial fraction, so it is still interesting to ask: for the special case of information originating from media sources, what proportion is broadcast directly to the masses, and what proportion is transmitted indirectly via some population of intermediaries? In addition, we may inquire whether these intermediaries, to the extent they exist, are drawn from other elite categories or from ordinary users, as claimed by the two-step flow theory; and if the latter, in what respects they differ from other ordinary users.

Before proceeding with this analysis, we note that there are two ways information can pass through an intermediary in Twitter. The first is via retweeting, which occurs when a user explicitly rebroadcasts a URL that he or she has received, along with an explicit acknowledgement of the source, either using the official retweet function provided by Twitter, or making use of an informal convention such as “RT @user” or “via @user.” The second mechanism is what we label reintroduction, where a user subsequently tweets a URL that has previously been introduced by another user, but without the acknowledgment, in which case we assume the information has been rediscovered independently. For the purposes of studying when a user receives information directly from the media or indirectly through an intermediary, we treat retweets and reintroductions equivalently. If the first occurrence of a URL in Twitter came from a media user, but a user received the URL from another source, then that source can be considered an intermediary, whether they are citing the source within Twitter by retweeting the URL, or reintroducing it, having discovered the URL outside of Twitter.
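The distinction between retweets and reintroductions can be sketched in Python. This is a hedged illustration, not the authors' pipeline: the tuple layout, the function name `classify_url_tweets`, the regular expression, and the sample tweets and bit.ly links are all assumptions. Tweets made with Twitter's official retweet function, which need not carry an “RT @user” marker in their text, are not detected by this sketch.

```python
import re

# Informal acknowledgement conventions named in the text: "RT @user", "via @user".
RETWEET_PATTERN = re.compile(r"\b(?:RT|via)\s+@(\w+)", re.IGNORECASE)


def classify_url_tweets(tweets):
    """Label each URL-bearing tweet as introduction, retweet, or reintroduction.

    tweets: iterable of (timestamp, user, text, url) tuples (assumed layout).
    Returns a list of (user, url, kind, acknowledged_source) in time order.
    """
    seen_urls = set()
    labels = []
    for timestamp, user, text, url in sorted(tweets):
        match = RETWEET_PATTERN.search(text)
        if url not in seen_urls:
            # First appearance of the URL in the data set.
            seen_urls.add(url)
            labels.append((user, url, "introduction", None))
        elif match:
            # Later appearance that explicitly credits another user.
            labels.append((user, url, "retweet", match.group(1)))
        else:
            # Later appearance with no acknowledgement of a source.
            labels.append((user, url, "reintroduction", None))
    return labels


# Hypothetical tweets sharing one made-up shortened URL.
sample_tweets = [
    (1, "cnnbrk", "Breaking story http://bit.ly/abc", "http://bit.ly/abc"),
    (2, "alice", "RT @cnnbrk: Breaking story http://bit.ly/abc", "http://bit.ly/abc"),
    (3, "bob", "Interesting read http://bit.ly/abc", "http://bit.ly/abc"),
    (4, "carol", "Worth a look http://bit.ly/abc via @alice", "http://bit.ly/abc"),
]
for label in classify_url_tweets(sample_tweets):
    print(label)
```

Because the analysis treats retweets and reintroductions equivalently when identifying intermediaries, the two labels matter mainly for attributing a source, not for deciding whether a URL arrived indirectly.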
Figure 3, it should be noted, shows only how many URLs are received by category i from category j, a particularly weak measure of attention for the simple reason that many tweets go unread. A stronger measure of attention, therefore, is to consider instead only those URLs introduced by category i that are subsequently retweeted by category j. Figure 4 shows how much information originating from each category is retweeted by other categories. As with our previous measure of attention, retweeting is strongly homophilous among elite categories; however, bloggers are disproportionately responsible for retweeting URLs originated by all categories. This result reflects the characterization of bloggers as recyclers and filters of information. However, even though on a per-capita basis bloggers disproportionately occupy the role of information recyclers (93 retweets per person, compared to only 1.1 retweets per person for ordinary users), the total number of URLs retweeted by bloggers (465k) is vastly outweighed by the number retweeted by ordinary users (46M); thus their overall impact is relatively minimal.

To quantify the extent to which ordinary users get their information indirectly versus directly from the media, we sampled 1M random ordinary users⁵, and for each user, counted the number $n$ of bit.ly URLs they had received that had originated from one of our 5K media users, where of the 1M total, 600K had received at least one such URL. For each member of this 600K subset we then counted the number $n_2$ of these URLs that they received via non-media friends; that is, via a two-step flow. The average fraction $n_2/n = 0.46$ therefore represents the proportion of media-originated content that reaches the masses via an intermediary rather than directly. As Figure 5 shows, however, this average is somewhat misleading. In reality, the population comprises two types: those who receive essentially all of their media-originating information via two-step flows and those who receive virtually all of it directly from the media. Unsurprisingly, the former type is exposed to less total media than the latter. What is surprising, however, is that even users who received up to 100 media URLs dur-
5
As before, performing this analysis for the entire population
4.1 Two-Step Flow of Information of over 40M ordinary users proved to be computationally
Examining information flow on Twitter can also shed new unfeasible.
a b

10000
# of opinion leaders
100
c d

1
0 4 16 64 256 2048 16384 131072
# of two−step recipients

Figure 5: Percentage of information that is received Figure 6: Frequency of intermediaries binned by #


via an intermediary as a function of total volume of randomly sampled users to whom they transmit me-
media content to which a user is exposed. dia content.

ing our observation period received all of them from opinion communications technology in the interim—given, in fact,
leaders. that a service like Twitter was likely unimaginable at the
Who are these intermediaries, and how many of them are time—it is remarkable how well the theory agrees with our
there? In total, the population of intermediaries is smaller observations.
than that of the users who rely on them, but still surprisingly
large, roughly 500K, the vast majority of which (96%) are 5. WHO LISTENS TO WHAT?
classified ordinary users, not elites. Interestingly, Figure 5c
The results in section 4 demonstrate the “elite” users ac-
also shows that at least some intermediaries also receive the
count for a substantial portion of all of the attention on
bulk of their media content indirectly, just like other ordi-
Twitter, but also show clear differences in how the attention
nary users. Comparing Figure 5a and 5c, however, we note
is allocated to the different elite categories. It is therefore
that intermediaries are not like other ordinary users in that
interesting to consider what kinds of content is being shared
they are exposed to considerably more media than randomly
by these categories. Given the large number of URLs in our
selected users, hence the number of intermediaries who rely
observation period (260M ), and the many different ways one
on two-step flows is much smaller than for random users. In
can classify content (video vs. text, news vs. entertainment,
addition, we find that on average intermediaries have more
political news vs. sports news, etc.), classifying even a small
followers than randomly sampled users (543 followers versus
fraction of URLs according to content is an onerous task.
34) and are also more active (180 tweets on average, versus
Bakshy et al. [1], for example, used Amazon’s Mechanical
7). Finally, Figure 6 shows that although all intermediaries,
Turk to classify a stratified sample of 1,000 URLs along a
by definition, pass along media content to at least one other
variety of dimensions; however, this method does not scale
user, a minority satisfies this function for multiple users,
well to larger sample sizes.
where we note that the most prominent intermediaries are
Instead, we restrict attention to URLs originated by the
disproportionately drawn from the 4% elite users—Ashton
New York Times which, with over 2.5M followers, is the
Kucher (asplusk), for example acts as an intermediary for
second-most-followed news organization on Twitter, after
over 100,000 users.
CNN Breaking News. NY Times, however, is roughly ten
Interestingly, these results are all broadly consistent with
times as active as CNN Breaking News, so it is arguable a
the original conception of the two-step flow, advanced over
better source of data. To classify NY Times content, we
50 years ago, which emphasized that opinion leaders were
exploit a convenient feature of their format—namely that
“distributed in all occupational groups, and on every social
all NY Times URLs are classified in a consistent way by
and economic level,” corresponding to our classification of
the section in which they appear (e.g. U.S., World, Sports,
most intermediaries as ordinary. [6]. The original theory
Science, Arts, etc.) 6 . Of the 6398 New York Times bit.ly
also emphasized that opinion leaders, like their followers,
URLs we observed, 6370 could be successfully unshortened
also received at least some of their information via two-step
and assigned to one of 21 categories. Of these, however, only
flows, but that in general they were more exposed to the
9 categories had more than 100 URLs during the observa-
media than their followers—just as we find here. Finally,
tion period, one of which—“NY region”—was highly specific
the theory predicted that opinion leadership was not a bi-
to the New York metropolitan area; thus we focused our
nary attribute, but rather a continuously varying one, cor-
attention on the remaining 8 topical categories. Figure 7
responding to our finding that intermediaries vary widely in
shows the proportion of URLs from each New York Times
the number of users for whom they act as filters and trans-
section retweeted or reintroduced by each category. World
mitters of media content. Given the length of time that has
elapsed since the theory of the two-step flow was articulated, 6
http://www.nytimes.com/year/month/day/category/
and the transformational changes that have taken place in title.html?ref=category
news is the most popular category, followed by U.S. News, Business, and Sports, where increasingly niche categories like Health, Arts, Science, and Technology are less popular still. In general, the overall pattern is replicated for all categories of users, but there are some minor deviations: in particular, organizations show disproportionately little interest in business and arts-related stories, and disproportionately high interest in science, technology, and possibly world news. Celebrities, by contrast, show greater interest in sports and less interest in health, while the media shows somewhat greater interest in U.S. news stories.

Figure 7: Number of RTs and Reintroductions of New York Times stories by content category. (Panels: 1. World News, 2. U.S. News, 3. Business, 4. Sports, 5. Health, 6. Technology, 7. Science, 8. Arts; x-axis: User Category (blog, celeb, media, org, other); y-axis: % RTs and Re-introductions.)

5.1 Lifespan of Content
In addition to different types of content, URLs introduced by different types of elite users or ordinary users may exhibit different lifespans, by which we mean the time lag between the first and last appearance of a given URL on Twitter. Naively, measuring lifespan seems a trivial matter; however, a finite observation period, which results in censoring of our data, complicates this task. In other words, a URL that is last observed towards the end of the observation period may be retweeted or reintroduced after the period ends, while correspondingly, a URL that is first observed toward the beginning of the observation window may in fact have been introduced before the window began. What we observe as the lifespan of a URL, therefore, is in reality a lower bound on the lifespan. Although this limitation does not create much of a problem for short-lived URLs, which account for the vast majority of our observations, it does create large biases for long-lived URLs. In particular, URLs that appear towards the end of our observation period will be systematically classified as shorter-lived than URLs that appear towards the beginning.

To address the censoring problem, we seek to determine a buffer δ at both the beginning and the end of our 223-day period, and only count URLs as having a lifespan of τ if (a) they do not appear in the first δ days, (b) they first appear in the interval between the buffers, and (c) they do not appear in the last δ days, as illustrated in Figure 8(a). To determine δ we first split the 223-day period into two segments, the first 133-day estimation period and the last 90-day evaluation period (see Figure 8(b)), and then ask: if we (a) observe a URL first appear in the first 133 − δ days and (b) do not see it in the δ days prior to the splitting point, how likely are we to see it in the last 90 days? Clearly this depends on the actual lifespan of the URL, as the longer a URL lives, the more likely it will re-appear in the future. Using this estimation/evaluation split, we find an upper bound on lifespan for which we can determine the actual lifespan with 95% accuracy as a function of δ. Finally, because we require a beginning and ending buffer, and because we can only classify a URL as having lifespan τ if it appears at least τ days before the end of our window, we need to pick τ and δ such that τ + 2δ ≤ 223. We determined that τ = 70 and δ = 70 sufficiently satisfied our constraints; thus for the following analysis, we consider only URLs that have a lifespan τ ≤ 70.

Figure 8: Schematic of lifespan estimation procedure. (Panel (a): buffer δ, lifespan τ between the first and last observation of a URL, buffer δ; panel (b): estimation period = 133 days, evaluation period = 90 days; total observation window = 223 days.)

5.2 Lifespan By Category
Having established a method for estimating URL lifespan, we now explore the lifespan of URLs introduced by different categories of users, as shown in Figure 9(a). URLs initiated by the elite categories exhibit a similar distribution over lifespan to those initiated by ordinary users. As Figure 9(b) shows, however, when looking at the percentage of URLs of different lifespans initiated by each category, we see two additional results: first, media actors originate a large portion of short-lived URLs (especially URLs with lifespan = 0, those that only appeared once); and second, URLs originated by bloggers are overrepresented among the longer-lived content. Both of these results can be explained by the type of content that originates from different sources: whereas news stories tend to be replaced by updates on a daily or more frequent basis, the sorts of URLs that are picked up by bloggers are of more persistent interest, and so are more likely to be retweeted or reintroduced months or even years after their initial introduction.
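As a rough illustration of the buffered lifespan rule described in Section 5.1, the Python sketch below keeps a URL only if it is unseen in the first δ days, first appears between the two buffers, and is unseen in the last δ days. The function name, the representation of sightings as 0-based day offsets, and the sample inputs are assumptions made for illustration; this is not the authors' code, and the estimation/evaluation step used to choose δ is not reproduced.

```python
from typing import Iterable, Optional

WINDOW_DAYS = 223   # total observation window reported in the text
DELTA = 70          # buffer length (delta) reported in the text
TAU_MAX = 70        # longest lifespan (tau) retained in the text


def buffered_lifespan(days_seen: Iterable[int],
                      window: int = WINDOW_DAYS,
                      delta: int = DELTA,
                      tau_max: int = TAU_MAX) -> Optional[int]:
    """Return the lifespan in days, or None if the URL is excluded.

    days_seen holds 0-based day offsets (0 .. window - 1) on which the
    URL appeared. Hypothetical helper for illustration only.
    """
    days = sorted(set(days_seen))
    if not days:
        return None
    first, last = days[0], days[-1]
    if first < delta:             # seen inside the opening buffer
        return None
    if last >= window - delta:    # seen inside the closing buffer
        return None
    lifespan = last - first       # time lag between first and last sighting
    return lifespan if lifespan <= tau_max else None


# The constraint tau + 2 * delta <= 223 from the text holds for these values.
assert TAU_MAX + 2 * DELTA <= WINDOW_DAYS

print(buffered_lifespan([80]))            # single sighting, lifespan 0
print(buffered_lifespan([80, 100, 120]))  # lifespan 40
print(buffered_lifespan([10, 90]))        # excluded: seen in opening buffer
```

A URL seen only once gets lifespan 0, matching the convention in the text that lifespan = 0 means a single appearance.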
Figure 9: (a) Count and (b) percentage of URLs initiated by 5 categories, with different lifespans. (x-axis: lifespan (day); y-axis of (a): log(# of URLs with lifespan = x day); y-axis of (b): % of URLs from elites category; series: other, celeb, media, org, blog.)

To shed more light on the nature of long-lived content on Twitter, we used the bit.ly API service to unshorten 35K of the most long-lived URLs (URLs that lived at least 200 days), and mapped them into 21034 web domains. As Figure 10 shows, the population of long-lived URLs is dominated by videos, music, and books. Twitter, in other words, should be viewed as a subset of a much larger media ecosystem in which content exists and is repeatedly rediscovered by Twitter users. Some of this content, such as daily news stories, has a relatively short period of relevance, after which a given story is unlikely to be reintroduced or rebroadcast. At the other extreme, classic music videos, movie clips, and long-format magazine articles have lifespans that are effectively unbounded, and can seemingly be rediscovered by Twitter users indefinitely without losing relevance.

Figure 10: Top 20 domains for URLs that lived more than 200 days

Two related points are illustrated by Figure 11, which shows the average RT rate (the proportion of tweets containing the URL that are retweets of another tweet) of URLs with different lifespans, grouped by the categories that introduced the URL [footnote 7: Note here that URLs with lifespan = 0 are those URLs that only appeared once in our dataset, thus the RT rate is zero.]. First, for ordinary users, the majority of appearances of URLs after the initial introduction derives not from retweeting, but rather from reintroduction, where this result is especially pronounced for long-lived URLs. For the vast majority of URLs on Twitter, in other words, longevity is determined not by diffusion, but by many different users independently rediscovering the same content, consistent with our interpretation above. Second, however, for URLs introduced by elite users, the result is somewhat the opposite; that is, they are more likely to be retweeted than reintroduced, even for URLs that persist for weeks. Although it is unsurprising that elite users generate more retweets than ordinary users, the size of the difference is nevertheless striking, and suggests that in spite of the dominant result above that content lifespan is determined to a large extent by type, the source of its origin also impacts its persistence, at least on average, a result that is consistent with previous findings [1].

Figure 11: Average RT rate by lifespan for each of the originating categories. (x-axis: lifespan (day); y-axis: RT rate (# of RTs / total # of occurrences); series: other, celeb, media, org, blog.)

6. CONCLUSIONS
In this paper, we investigated a classic problem in media communications research, captured by the first part of Lasswell’s maxim, “who says what to whom,” in the context
of Twitter. By restricting our attention to Twitter, our conclusions are necessarily limited to one narrow cross-section of the media landscape. Moreover, communications on Twitter may be unrepresentative of information flow via more traditional channels, such as TV and radio on the one hand, and interpersonal interactions on the other hand. However, we feel the advantages of using Twitter to answer this question outweighed the limitations. First, because Twitter users explicitly opt in to “follow” each other, and because Twitter maintains a complete record of every tweet broadcast, it provides an unprecedented level of resolution and coverage regarding who is listening to whom. Second, because Twitter users themselves classify other users by including them on lists, Twitter effectively provides a ready-made, crowd-sourced classification scheme of users.

By studying the flow of information among the five categories that we identified (media, celebrities, organizations, bloggers, and ordinary), our analysis sheds new light on some old questions of communications research. First, we find that although audience attention has indeed fragmented among a wider pool of content producers than in classical models of mass media, attention remains highly concentrated, where roughly 0.05% of the population accounts for almost half of all attention. Within the population of elite users, moreover, attention is highly homophilous, with celebrities following celebrities, media following media, and bloggers following bloggers. Second, we find considerable support for the two-step flow of information: almost half the information that originates from the media passes to the masses indirectly via a diffuse intermediate layer of opinion leaders, who although classified as ordinary users, are more connected and more exposed to the media than their followers. Third, we find that although all categories devote a roughly similar fraction of their attention to different categories of news (World, U.S., Business, etc.), there are some differences; organizations, for example, devote a surprisingly small fraction of their attention to business-related news. We also find that different types of content exhibit very different lifespans. In particular, media-originated URLs are disproportionately represented among short-lived URLs while those originated by bloggers tend to be overrepresented among long-lived URLs. Finally, we find that the longest-lived URLs are dominated by content such as videos and music, which are continually being rediscovered by Twitter users and appear to persist indefinitely.

In closing, we note that although our use of Twitter lists to label users was motivated by a specific set of questions regarding mass vs. interpersonal communications, and that for this reason we have focused on a limited set of predetermined user categories, it would also be interesting to explore automatic classification schemes from which additional user categories could emerge. In particular, such an approach would allow one to examine the category of opinion leaders in more detail, possibly identifying opinion leaders for different topics, as has been proposed elsewhere [14]. In addition, another area for future work would be to extract content information in a more systematic manner, shedding more light on the “what” element of Lasswell’s maxim. And finally, a significant challenge for future work is to merge the data regarding information flow on Twitter with other sources of outcome data, relating, for example, to opinions or actions, that would engage more directly with the “effects” component of Lasswell’s maxim.

7. REFERENCES
[1] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Identifying ‘influencers’ on twitter. In Fourth ACM International Conference on Web Search and Data Mining (WSDM), Hong Kong, 2011. ACM.
[2] W. L. Bennett and S. Iyengar. A new era of minimal effects? the changing foundations of political communication. Journal of Communication, 58(4):707–731, 2008.
[3] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence on twitter: The million follower fallacy. In 4th Int’l AAAI Conference on Weblogs and Social Media, Washington, DC, 2010.
[4] J. S. Coleman, E. Katz, and H. Menzel. The diffusion of an innovation among physicians. Sociometry, 20(4):253–270, 1957.
[5] T. Gitlin. Media sociology: The dominant paradigm. Theory and Society, 6(2):205–253, 1978.
[6] E. Katz and P. F. Lazarsfeld. Personal influence; the part played by people in the flow of mass communications. Free Press, Glencoe, Ill., 1955.
[7] G. Kossinets and D. J. Watts. Empirical analysis of an evolving social network. Science, 311(5757):88–90, 2006.
[8] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World Wide Web, pages 591–600. ACM, 2010.
[9] H. D. Lasswell. The structure and function of communication in society. In L. Bryson, editor, The Communication of Ideas, pages 117–130. University of Illinois Press, Urbana, IL, 1948.
[10] P. F. Lazarsfeld, B. Berelson, and H. Gaudet. The people’s choice; how the voter makes up his mind in a presidential campaign. Columbia University Press, New York, 3rd edition, 1968.
[11] R. K. Merton. Patterns of influence: Local and cosmopolitan influentials. In R. K. Merton, editor, Social theory and social structure, pages 441–474. Free Press, New York, 1968.
[12] J. Surowiecki. The Wisdom of Crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies, and nations. Doubleday, New York, 1st edition, 2004.
[13] J. B. Walther, C. T. Carr, S. S. W. Choi, D. C. DeAndrea, J. Kim, S. T. Tong, and B. Van Der Heide. Interaction of interpersonal, peer, and media influence sources online. In Z. Papacharissi, editor, A Networked Self: Identity, Community, and Culture on Social Network Sites, pages 17–38. Routledge, 2010.
[14] G. Weimann. The Influentials: People Who Influence People. State University of New York Press, Albany, NY, 1994.
[15] J. Weng, E. P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, pages 261–270. ACM, 2010.
