Who Says What To Whom On Twitter
ABSTRACT
We study several longstanding questions in media communications research, in the context of the microblogging service Twitter, regarding the production, flow, and consumption of information. To do so, we exploit a recently introduced feature of Twitter—known as Twitter lists—to distinguish between elite users, by which we mean specifically celebrities, bloggers, and representatives of media outlets and other formal organizations, and ordinary users. Based on this classification, we find a striking concentration of attention on Twitter—roughly 50% of tweets consumed are generated by just 20K elite users—where the media produces the most information, but celebrities are the most followed. We also find significant homophily within categories: celebrities listen to celebrities, while bloggers listen to bloggers, etc.; however, bloggers in general rebroadcast more information than the other categories. Next we re-examine the classical “two-step flow” theory of communications, finding considerable support for it on Twitter, but also some interesting differences. Third, we find that URLs broadcast by different categories of users or containing different types of content exhibit systematically different lifespans. And finally, we examine the attention paid by the different user categories to different news topics.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems; J.4 [Social and Behavioral Sciences]: Sociology

General Terms
two-step flow, communications, classification

Keywords
Communication networks, Twitter, information flow

∗ Part of this research was performed while the author was visiting Yahoo! Research, New York.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WWW ’11 Hyderabad, India
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION
A longstanding objective of media communications research is encapsulated by what is known as Lasswell’s maxim: “who says what to whom in what channel with what effect” [9], so-named for one of the pioneers of the field, Harold Lasswell. Although simple to state, Lasswell’s maxim has proven difficult to satisfy in the more than 60 years since he stated it, in part because it is generally difficult to observe information flows in large populations, and in part because different channels have very different attributes and effects. As a result, theories of communications have tended to focus either on “mass” communication, defined as “one-way message transmissions from one source to a large, relatively undifferentiated and anonymous audience,” or on “interpersonal” communication, meaning a “two-way message exchange between two or more individuals” [13].

Correspondingly, debates among communication theorists have tended to revolve around the relative importance of these two putative modes of communication. For example, whereas early theories such as the so-called “hypodermic model” posited that mass media exerted direct and relatively strong effects on public opinion, mid-century researchers [10, 6, 11, 4] argued that the mass media influenced the public only indirectly, via what they called a two-step flow of communications, where the critical intermediate layer was occupied by a category of media-savvy individuals called opinion leaders. The resulting “limited effects” paradigm was then subsequently challenged by a new generation of researchers [5], who claimed that the real importance of the mass media lay in its ability to set the agenda of public discourse. But in recent years rising public skepticism of mass media, along with changes in media and communication technology, have tilted conventional academic wisdom once more in favor of interpersonal communication, which some identify as a “new era” of minimal effects [2].

Recent changes in technology, however, have increasingly undermined the validity of the mass vs. interpersonal dichotomy itself. On the one hand, over the past few decades mass communication has experienced a proliferation of new channels, including cable television, satellite radio, specialist book and magazine publishers, and of course an array of web-based media such as sponsored blogs, online communities, and social news sites. Correspondingly, the traditional mass audience once associated with, say, network television has fragmented into many smaller audiences, each of which increasingly selects the information to which it is exposed, and in some cases generates the information itself. Meanwhile, in the opposite direction interpersonal communication has become increasingly amplified through personal blogs, email lists, and social networking sites to afford individuals ever-larger audiences. Together, these two trends have greatly obscured the historical distinction between mass and interpersonal communications, leading some scholars to refer instead to “masspersonal” communications [13].

Nowhere is the erosion of traditional categories more apparent than in the micro-blogging platform Twitter. To illustrate, the top ten most-followed users on Twitter are not corporations or media organizations, but individual people, mostly celebrities. Moreover, these individuals communicate directly with their millions of followers, often managed by themselves or publicists, thus bypassing the traditional intermediation of the mass media between celebrities and fans. Next, in addition to conventional celebrities, a new class of “semi-public” individuals like bloggers, authors, journalists, and subject matter experts have come to occupy an important niche on Twitter, in some cases becoming more prominent than traditional public figures such as entertainers and elected officials. Third, in spite of these shifts away from centralized media power, media organizations—along with corporations, governments, and NGOs—all remain well represented among highly followed users, and are often extremely active. And finally, Twitter is primarily made up of many millions of users who seem to be ordinary individuals communicating with their friends and acquaintances in a manner largely consistent with traditional notions of interpersonal communication.

Twitter, therefore, represents the full spectrum of communications from personal and private to “masspersonal” to traditional mass media. Consequently it provides an interesting context in which to address Lasswell’s maxim, especially as Twitter—unlike television, radio, and print media—enables one to easily observe information flows among the members of its ecosystem. Unfortunately, however, the kinds of effects that are of most interest to communications theorists, such as changes in behavior, attitudes, etc., remain difficult to measure on Twitter. Therefore in this paper we limit our focus to the “who says what to whom” part of Lasswell’s maxim.

To this end, our paper makes three main contributions:

• We introduce a method for classifying users using Twitter Lists into “elite” and “ordinary” users, further classifying elite users into one of four categories of interest—media, celebrities, organizations, and bloggers.
• We investigate the flow of information among these categories, finding that although audience attention is highly concentrated on a minority of elite users, much of the information they produce reaches the masses indirectly via a large population of intermediaries.
• We find that different categories of users place slightly different emphasis on different types of content, and that different content types exhibit dramatically different characteristic lifespans, ranging from less than a day to months.

The remainder of the paper proceeds as follows. In the next section, we review related work. In section 3 we discuss our data and methods, including section 3.3 in which we describe how we use Twitter Lists to classify users, outline two different sampling methods, and show that they deliver qualitatively similar results. In section 4 we analyze the production of information on Twitter, particularly who pays attention to whom. In section 4.1, we revisit the theory of the two-step flow—arguably the dominant theory of communications for much of the past 50 years—finding considerable support for the theory as well as some interesting differences. In section 5, we consider “who listens to what”, examining first who shares what kinds of media content, and second the lifespan of URLs as a function of their origin and their content. Finally, in section 6 we conclude with a brief discussion of future work.

2. RELATED WORK
Aside from the communications literature surveyed above, a number of recent papers have examined information diffusion on Twitter. Kwak et al. [8] studied the topological features of the Twitter follower graph, concluding from the highly skewed nature of the distribution of followers and the low rate of reciprocated ties that Twitter more closely resembled an information sharing network than a social network—a conclusion that is consistent with our own view. In addition, Kwak et al. compared three different measures of influence—number of followers, page-rank, and number of retweets—finding that the ranking of the most influential users differed depending on the measure. In a similar vein, Cha et al. [3] compared three measures of influence—number of followers, number of retweets, and number of mentions—and also found that the most followed users did not necessarily score highest on the other measures. Weng et al. [15] compared number of followers and page rank with a modified page-rank measure which accounted for topic, again finding that ranking depended on the influence measure. Finally, Bakshy et al. [1] studied the distribution of retweet cascades on Twitter, finding that although users with large follower counts and past success in triggering cascades were on average more likely to trigger large cascades in the future, these features are in general poor predictors of future cascade size.

Our paper differs from this earlier work by shifting attention from the ranking of individual users in terms of various influence measures to the flow of information among different categories of users. In particular, we are interested in identifying “elite” users, who we differentiate from “ordinary” users in terms of their visibility, and understanding their role in introducing information into Twitter, as well as how information originating from traditional media sources reaches the masses.

3. DATA AND METHODS

3.1 Twitter Follower Graph
In order to understand how information is transmitted on Twitter, we need to know the channels by which it flows; that is, who is following whom on Twitter. To this end, we used the follower graph studied by Kwak et al. [8], which included 42M users and 1.5B edges. This data represents a crawl of the graph seeded with all users on Twitter as observed by July 31st, 2009, and is publicly available¹. As reported by Kwak et al. [8], the follower graph is a directed network characterized by highly skewed distributions both of in-degree (# followers) and out-degree (# “friends”, Twitter notation for how many others a user follows); however, the in-degree distribution is even more skewed than the out-degree distribution. In both friend and follower distributions, for example, the median is less than 100, but the maximum # friends is several hundred thousand, while a small number of users have millions of followers. In addition, the follower graph is also characterized by extremely low reciprocity (roughly 20%)—in particular, the most-followed individuals typically do not follow many others. The Twitter follower graph, in other words, does not conform to the usual characteristics of social networks, which exhibit much higher reciprocity and far less skewed degree distributions [7], but instead resembles more the mixture of one-way mass communications and reciprocated interpersonal communications described above.

3.2 Twitter Firehose
In addition to the follower graph, we are interested in the content being shared on Twitter—particularly URLs—and so we examined the corpus of all 5B tweets generated over a 223 day period from July 28, 2009 to March 8, 2010 using data from the Twitter “firehose,” the complete stream of all tweets². Because our objective is to understand the flow of information, it is useful for us to restrict attention to tweets containing URLs, for two reasons. First, URLs add easily identifiable tags to individual tweets, allowing us to observe when a particular piece of content is either retweeted or subsequently reintroduced by another user. And second, because URLs point to online content outside of Twitter, they provide a much richer source of variation than is possible in the typical 140 character tweet. Finally, we note that almost all URLs broadcast on Twitter have been shortened using one of a number of URL shorteners, of which the most popular is http://bit.ly/. From the total of 5B tweets recorded during our observation period, therefore, we focus our attention on the subset of 260M containing bit.ly URLs.

3.3 Twitter Lists
Our method for classifying users exploits a relatively recent feature of Twitter: Twitter Lists. Since its launch on November 2, 2009, Twitter Lists have been welcomed by the community as a way to group people and organize one’s incoming stream of tweets by specific sets of users. To create a Twitter List, a user needs to provide a name (required) and description (optional) for the list, and decide whether the new list is public (anyone can view and subscribe to this list) or private (only the list creator can view or subscribe to this list). Once a list is created, the user can add/edit/delete list members. As the purpose of Twitter Lists is to help users organize users they follow, the name of the list can be considered a meaningful label for the listed users. List creation therefore effectively applies the “wisdom of crowds” [12] to the task of classifying users, both in terms of their importance to the community (number of lists on which they appear), and also how they are perceived (e.g. news organization vs. celebrity, etc.).

Before describing our methods for classifying users in terms of the lists on which they appear, we emphasize that we are motivated by a particular set of substantive questions arising out of communications theory. In particular, we are interested in the relative importance of mass communications, as practiced by media and other formal organizations, masspersonal communications as practiced by celebrities and prominent bloggers, and interpersonal communications, as practiced by ordinary individuals communicating with their friends. In addition, we are also interested in the relationships between these categories of users, motivated by theoretical arguments such as the theory of the two-step flow [6]. Rather than pursuing a strategy of automatic classification, therefore, our approach depends on defining and identifying certain predetermined classes of theoretical interest, where both approaches have advantages and disadvantages. In particular, we restrict our attention to four classes of what we call “elite” users: media, celebrities, organizations, and bloggers, as well as the relationships between these elite users and the much larger population of “ordinary” users.

In addition to these theoretically-imposed constraints, our proposed classification method must also satisfy a practical constraint—namely that the rate limits established by Twitter’s API effectively preclude crawling all lists for all Twitter users³. Thus we instead devised two different sampling schemes—a snowball sample and an activity sample—each with some advantages and disadvantages, discussed below.

3.3.1 Snowball sample of Twitter Lists
The first method for identifying elite users employed snowball sampling. For each category, we chose a number u0 of seed users that were highly representative of the desired category and appeared on many category-related lists. For each of the four categories above, the following seeds were chosen:

• Celebrities: Barack Obama, Lady Gaga, Paris Hilton
• Media: CNN, New York Times
• Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods
• Blogs⁴: BoingBoing, FamousBloggers, problogger, mashable, Chrisbrogan, virtuosoblogger, Gizmodo, Ileane, dragonblogger, bbrian017, hishaman, copyblogger, engadget, danielscocco, BlazingMinds, bloggersblog, TycoonBlogger, shoemoney, wchingya, extremejohn, GrowMap, kikolani, smartbloggerz, Element321, brandonacox, remarkablogger, jsinkeywest, seosmarty, NotAProBlog, kbloemendaal, JimiJones, ditesco

After reviewing the lists associated with these seeds, the following keywords were hand-selected based on (a) their representativeness of the desired categories; and (b) their lack of overlap between categories:

¹ The data is free to download from http://an.kaist.ac.kr/traces/WWW2010.html
² http://dev.twitter.com/doc/get/statuses/firehose
³ The Twitter API allows only 20K calls per hour, where at most 20 lists can be retrieved for each API call. Under the modest assumption of 40M users (roughly the number in the 2009 crawl by [8]), where each user is included on at most 20 lists, this would require 4 × 10^7 / (2 × 10^4) = 2,000 hours, or roughly 12 weeks. Clearly this time could be reduced by deploying multiple accounts, but it also likely underestimates the real time quite significantly, as many users appear on many more than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
⁴ The blogger category required many more seeds because bloggers are in general lower profile than the seeds for the other categories.
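The crawl-time estimate in footnote 3 is simple arithmetic, and a short calculation makes its assumptions explicit. The sketch below is illustrative only: the function names are hypothetical, and the inputs (40M users, 20K calls per hour, 20 lists per call, one call per user) are the figures stated in the footnote.

```python
import math


def crawl_hours(num_users: int, calls_per_hour: int, calls_per_user: int = 1) -> float:
    """Hours needed to fetch list memberships for every user under a rate limit."""
    return num_users * calls_per_user / calls_per_hour


def calls_for_user(num_lists: int, lists_per_call: int = 20) -> int:
    """API calls needed to page through all lists on which one user appears."""
    return max(1, math.ceil(num_lists / lists_per_call))


# Best case from the footnote: every user fits in a single call.
hours = crawl_hours(num_users=40_000_000, calls_per_hour=20_000)
weeks = hours / (24 * 7)
print(f"{hours:,.0f} hours, about {weeks:.1f} weeks")

# A heavily listed account needs many calls on its own.
print(calls_for_user(140_000), "calls for a user on 140,000 lists")
```

Under these assumptions the single-call-per-user case is a lower bound; accounts that appear on many thousands of lists push the real cost well above it, which is the point the footnote makes.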
[Figure: schematic with alternating levels u0, l0, u1, l1, u2, l2 (image not recovered)]

Table 1: Distribution of users over categories

            Snowball Sample           Activity Sample
category    # of users   % of users   # of users   % of users
celeb           82,770        15.8%       14,778        13.0%
media          216,010        41.2%       40,186        35.3%
org             97,853        18.7%       14,891        13.1%
blog           127,483        24.3%       43,830        38.6%
total          524,116         100%      113,685         100%
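The snowball sample of section 3.3.1 starts from seed users and a set of hand-selected keywords, but the expansion steps are not spelled out in the text reproduced here. The sketch below shows one plausible reading, in which the sample alternates between users and the keyword-matching lists they appear on. Everything in it is an assumption made for illustration: the function name, the two lookup callables (stand-ins for whatever API or cache supplies list data), the substring keyword match, and the number of rounds.

```python
from typing import Callable, Iterable, Set


def snowball_sample(
    seeds: Iterable[str],
    lists_of_user: Callable[[str], Iterable[str]],    # hypothetical lookup: user -> list names
    members_of_list: Callable[[str], Iterable[str]],  # hypothetical lookup: list name -> users
    keywords: Iterable[str],
    rounds: int = 2,
) -> Set[str]:
    """Expand a seed set by alternating users -> matching lists -> users."""
    lowered = [k.lower() for k in keywords]
    users: Set[str] = set(seeds)
    frontier: Set[str] = set(users)
    seen_lists: Set[str] = set()
    for _ in range(rounds):
        # Lists containing a frontier user whose name matches at least one keyword.
        new_lists = {
            name
            for user in frontier
            for name in lists_of_user(user)
            if name not in seen_lists and any(k in name.lower() for k in lowered)
        }
        seen_lists |= new_lists
        # Members of those lists who have not been sampled yet.
        new_users = {m for name in new_lists for m in members_of_list(name)} - users
        users |= new_users
        frontier = new_users
        if not frontier:
            break
    return users


# Toy data, invented for the example.
user_lists = {"cnn": ["news-media", "my-friends"], "nytimes": ["news-media"],
              "bbc": ["world-news"], "alice": ["my-friends"]}
list_members = {"news-media": ["cnn", "nytimes", "bbc"], "my-friends": ["cnn", "alice"],
                "world-news": ["bbc", "reuters"]}
sample = snowball_sample(["cnn"], lambda u: user_lists.get(u, []),
                         lambda l: list_members.get(l, []), ["news"])
print(sorted(sample))
```

Filtering lists by keyword before following them is what keeps the sample within one category; in the toy data, "alice" is never reached because the only list she shares with a seed does not match.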
[Table fragment; column headers not recovered]
blog        1,360,131      272.03
ordinary    244,228,364    6.10

[Figure: panels include "organizations" and "blogs"; x-axis: top k (1000 to 10000); y-axis: average %; series: friends, tweets received]

[...]dle for actor Ashton Kutcher, one of the first celebrities to embrace Twitter and still one of the most followed users, while the remaining celebrity users—Lady Gaga, Ellen DeGeneres, Oprah Winfrey, and Taylor Swift—are all household names. In the media category, CNN Breaking News and the [...]

[...] and dooce is the blog of Heather Armstrong, a widely read [...]

Table 3: Top 5 users in each category

[Figure: x-axis: # of two-step recipients; y-axis: # of opinion leaders; logarithmic axes]
[...]ing our observation period received all of them from opinion leaders.

Who are these intermediaries, and how many of them are there? In total, the population of intermediaries is smaller than that of the users who rely on them, but still surprisingly large, roughly 500K, the vast majority of which (96%) are classified as ordinary users, not elites. Interestingly, Figure 5c also shows that at least some intermediaries also receive the bulk of their media content indirectly, just like other ordinary users. Comparing Figure 5a and 5c, however, we note that intermediaries are not like other ordinary users in that they are exposed to considerably more media than randomly selected users, hence the number of intermediaries who rely on two-step flows is much smaller than for random users. In addition, we find that on average intermediaries have more followers than randomly sampled users (543 followers versus 34) and are also more active (180 tweets on average, versus 7). Finally, Figure 6 shows that although all intermediaries, by definition, pass along media content to at least one other user, a minority satisfies this function for multiple users, where we note that the most prominent intermediaries are disproportionately drawn from the 4% elite users—Ashton Kutcher (aplusk), for example, acts as an intermediary for over 100,000 users.

Interestingly, these results are all broadly consistent with the original conception of the two-step flow, advanced over 50 years ago, which emphasized that opinion leaders were “distributed in all occupational groups, and on every social and economic level,” corresponding to our classification of most intermediaries as ordinary [6]. The original theory also emphasized that opinion leaders, like their followers, also received at least some of their information via two-step flows, but that in general they were more exposed to the media than their followers—just as we find here. Finally, the theory predicted that opinion leadership was not a binary attribute, but rather a continuously varying one, corresponding to our finding that intermediaries vary widely in the number of users for whom they act as filters and transmitters of media content. Given the length of time that has elapsed since the theory of the two-step flow was articulated, and the transformational changes that have taken place in communications technology in the interim—given, in fact, that a service like Twitter was likely unimaginable at the time—it is remarkable how well the theory agrees with our observations.

5. WHO LISTENS TO WHAT?
The results in section 4 demonstrate that “elite” users account for a substantial portion of all of the attention on Twitter, but also show clear differences in how the attention is allocated to the different elite categories. It is therefore interesting to consider what kinds of content are being shared by these categories. Given the large number of URLs in our observation period (260M), and the many different ways one can classify content (video vs. text, news vs. entertainment, political news vs. sports news, etc.), classifying even a small fraction of URLs according to content is an onerous task. Bakshy et al. [1], for example, used Amazon’s Mechanical Turk to classify a stratified sample of 1,000 URLs along a variety of dimensions; however, this method does not scale well to larger sample sizes.

Instead, we restrict attention to URLs originated by the New York Times which, with over 2.5M followers, is the second-most-followed news organization on Twitter, after CNN Breaking News. NY Times, however, is roughly ten times as active as CNN Breaking News, so it is arguably a better source of data. To classify NY Times content, we exploit a convenient feature of their format—namely that all NY Times URLs are classified in a consistent way by the section in which they appear (e.g. U.S., World, Sports, Science, Arts, etc.)⁶. Of the 6398 New York Times bit.ly URLs we observed, 6370 could be successfully unshortened and assigned to one of 21 categories. Of these, however, only 9 categories had more than 100 URLs during the observation period, one of which—“NY region”—was highly specific to the New York metropolitan area; thus we focused our attention on the remaining 8 topical categories. Figure 7 shows the proportion of URLs from each New York Times section retweeted or reintroduced by each category. World news is the most popular category, followed by U.S. News, Business, and Sports, where increasingly niche categories like Health, Arts, Science, and Technology are less popular still. In general, the overall pattern is replicated for all categories of users, but there are some minor deviations: in particular, organizations show disproportionately little interest in business and arts-related stories, and disproportionately high interest in science, technology, and possibly world news. Celebrities, by contrast, show greater interest in sports and less interest in health, while the media shows somewhat greater interest in U.S. news stories.

Figure 7: Number of RTs and Reintroductions of New York Times stories by content category
[Panels: 1. World News, 2. U.S. News, 3. Business, 4. Sports, 5. Health, 6. Technology, 7. Science, 8. Arts; x-axis: User Category (blog, celeb, media, org, other); y-axis: % RTs and Re-introductions, 0.00 to 0.35]

5.1 Lifespan of Content
In addition to different types of content, URLs introduced by different types of elite users or ordinary users may exhibit different lifespans, by which we mean the time lag between the first and last appearance of a given URL on Twitter. Naively, measuring lifespan seems a trivial matter; however, a finite observation period—which results in censoring of our data—complicates this task. In other words, a URL that is last observed towards the end of the observation period may be retweeted or reintroduced after the period ends, while correspondingly, a URL that is first observed toward the beginning of the observation window may in fact have been introduced before the window began. What we observe as the lifespan of a URL, therefore, is in reality a lower bound on the lifespan. Although this limitation does not create much of a problem for short-lived URLs—which account for the vast majority of our observations—it does create large biases for long-lived URLs. In particular, URLs that appear towards the end of our observation period will be systematically classified as shorter-lived than URLs that appear towards the beginning.

To address the censoring problem, we seek to determine a buffer δ at both the beginning and the end of our 223-day period, and only count URLs as having a lifespan of τ if (a) they do not appear in the first δ days, (b) they first appear in the interval between the buffers, and (c) they do not appear in the last δ days, as illustrated in Figure 8(a). To determine δ we first split the 223 day period into two segments—the first 133 day estimation period and the last 90 day evaluation period (see Figure 8(b))—and then ask: if we (a) observe a URL first appear in the first 133−δ days and (b) do not see it in the δ days prior to the splitting point, how likely are we to see it in the last 90 days? Clearly this depends on the actual lifespan of the URL, as the longer a URL lives, the more likely it will re-appear in the future. Using this estimation/evaluation split, we find an upper-bound on lifespan for which we can determine the actual lifespan with 95% accuracy as a function of δ. Finally, because we require a beginning and ending buffer, and because we can only classify a URL as having lifespan τ if it appears at least τ days before the end of our window, we need to pick τ and δ such that τ + 2δ ≤ 223. We determined that τ = 70 and δ = 70 sufficiently satisfied our constraints; thus for the following analysis, we consider only URLs that have a lifespan τ ≤ 70.

Figure 8: Schematic of lifespan estimation procedure
[(a) Total observation window = 223 days, with a buffer δ at each end and lifespan τ between the first and last observation of a URL; (b) estimation period = 133 days, evaluation period = 90 days]

5.2 Lifespan By Category
Having established a method for estimating URL lifespan, we now explore the lifespan of URLs introduced by different categories of users, as shown in Figure 9(a). URLs initiated by the elite categories exhibit a similar distribution over lifespan to those initiated by ordinary users. As Figure 9(b) shows, however, when looking at the percentage of URLs of different lifespans initiated by each category, we see two additional results: first, media actors originate a large portion of short-lived URLs (especially URLs with lifespan=0, those that only appeared once); and second, URLs originated by bloggers are overrepresented among the longer-lived content. Both of these results can be explained by the type of content that originates from different sources: whereas news stories tend to be replaced by updates on a daily or more frequent basis, the sorts of URLs that are picked up by bloggers are of more persistent interest, and so are more likely to be retweeted or reintroduced months or even years after their initial introduction.

⁶ http://www.nytimes.com/year/month/day/category/title.html?ref=category
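The section-based classification of NY Times URLs in section 5 relies on the URL layout given in footnote 6 (year/month/day/category/title.html?ref=category). A minimal sketch of how a section label might be read off an already-unshortened URL follows. The function name, the fallback to the `ref` query parameter, and the example URLs are assumptions made for illustration; resolving bit.ly links is outside the sketch.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def nyt_section(url: str) -> Optional[str]:
    """Return the section segment of a NY Times article URL, or None if not recognised."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "nytimes.com" and not host.endswith(".nytimes.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Expected layout: year / month / day / category / ... / title.html
    if len(parts) >= 5 and all(p.isdigit() for p in parts[:3]):
        return parts[3].lower()
    # Assumed fallback: use the ?ref=category query parameter when the path does not match.
    ref = parse_qs(parsed.query).get("ref")
    return ref[0].lower() if ref else None


# Hypothetical URLs, invented for the example.
print(nyt_section("http://www.nytimes.com/2010/01/15/science/example-title.html?ref=science"))
print(nyt_section("http://example.com/2010/01/15/science/example-title.html"))
```

Counting the returned labels over all unshortened URLs (for instance with `collections.Counter`) would give per-section totals of the kind used to select the 8 topical categories.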
[Figure 9, panel (a) Count: x-axis: lifespan (day), 0 to 70; y-axis: log(# of URLs with lifespan = x day); series: other, celeb, media, org, blog]

[Figure 9, second panel: x-axis: lifespan (day), 0 to 70; series labels include org and blog]

Figure 10: Top 20 domains for URLs that lived more than 200 days
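The lifespans plotted above depend on the buffer rule of section 5.1: a URL is counted only if it never appears in the first δ or last δ days of the 223-day window, and only lifespans up to τ = 70 are kept. The sketch below is one reading of that rule, not the authors' code. It assumes each URL's sightings are given as 0-indexed day offsets from the start of the window; the function name and signature are hypothetical, while the constants come from the text.

```python
from typing import Iterable, Optional

WINDOW_DAYS = 223  # total observation window (from the text)
DELTA = 70         # buffer at each end of the window (from the text)
TAU_MAX = 70       # longest lifespan considered (from the text)


def censored_lifespan(
    days: Iterable[int],
    window: int = WINDOW_DAYS,
    delta: int = DELTA,
    tau_max: int = TAU_MAX,
) -> Optional[int]:
    """Lifespan in days, or None if the URL is excluded by the buffer rule."""
    if tau_max + 2 * delta > window:
        raise ValueError("need tau_max + 2 * delta <= window")
    seen = sorted(days)
    if not seen:
        return None
    first, last = seen[0], seen[-1]
    # Exclude URLs seen in the first delta days or in the last delta days.
    if first < delta or last >= window - delta:
        return None
    lifespan = last - first
    # Keep only lifespans short enough to be measured reliably.
    return lifespan if lifespan <= tau_max else None


print(censored_lifespan([80]))            # seen once, inside the buffers
print(censored_lifespan([80, 100, 120]))  # first and last sighting 40 days apart
print(censored_lifespan([10, 90]))        # first sighting falls in the opening buffer
```

With δ = 70 the admissible sightings fall on days 70 through 152, so the longest gap the rule could admit is 82 days; the separate cap at τ = 70 is what restricts the analysis to lifespans of at most 70 days.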