
2017 IEEE International Conference on Smart Cloud

Automatically Identifying Fake News in Popular Twitter Threads

Cody Buntain
Intelligence Community Postdoctoral Fellow
University of Maryland, College Park
Email: [email protected]

Jennifer Golbeck
College of Information Studies
University of Maryland, College Park
Email: [email protected]

Abstract—Information quality in social media is an increasingly important issue, but web-scale data hinders experts' ability to assess and correct much of the inaccurate content, or "fake news," present in these platforms. This paper develops a method for automating fake news detection on Twitter by learning to predict accuracy assessments in two credibility-focused Twitter datasets: CREDBANK, a crowdsourced dataset of accuracy assessments for events in Twitter, and PHEME, a dataset of potential rumors in Twitter and journalistic assessments of their accuracies. We apply this method to Twitter content sourced from BuzzFeed's fake news dataset and show models trained against crowdsourced workers outperform models based on journalists' assessments and models trained on a pooled dataset of both crowdsourced workers and journalists. All three datasets, aligned into a uniform format, are also publicly available. A feature analysis then identifies features that are most predictive for crowdsourced and journalistic accuracy assessments, results of which are consistent with prior work. We close with a discussion contrasting accuracy and credibility and why models of non-experts outperform models of journalists for fake news detection in Twitter.

Index Terms—misinformation, credibility, accuracy, data quality, fake news, twitter

I. INTRODUCTION

Measuring accuracy and credibility in text are well-studied topics in disciplines from psychology to journalism [1], [2], [3]. The proliferation of large-scale social media data and its increasing use as a primary news source [4], however, is forcing a re-examination of these issues. Past approaches that relied on journalistically trained "gatekeepers" to filter out low-quality content are no longer applicable as social media's volume has quickly overwhelmed our ability to control quality manually. Instead, platforms like Twitter and Facebook have allowed questionable and inaccurate "news" content to reach wide audiences without review. Social media users' bias toward believing what their friends share and what they read regardless of accuracy allows these fake stories to propagate widely through and across multiple platforms [5]. Despite research into rumor propagation on Twitter [6], [7], [8], fake image sharing in disaster aftermath [9], and politically motivated "astroturfing" [10], rumor and "fake news" are becoming increasingly problematic. Computational methods have proven useful in similar contexts where data volumes overwhelm human analysis capabilities. Furthermore, regularities in bot behavior [11] and financially motivated sensationalists [12] suggest machine learning-based approaches could help address these quality issues.

In this paper, we present a method for automating "fake news" detection in Twitter, one of the most popular online social media platforms. This method uses a classification model to predict whether a thread of Twitter conversation will be labeled as accurate or inaccurate using features inspired by existing work on credibility of Twitter stories [13], [6]. We demonstrate this approach's ability to identify fake news by evaluating it against the BuzzFeed dataset of 35 highly shared true and false political stories curated by Silverman et al. [14] and extracted from Twitter. This work is complicated by the limited availability of data on what is "fake news" online, however, so to train this system, we leverage two Twitter datasets that study credibility in social media: the PHEME journalist-labeled dataset [15] and the CREDBANK crowdsourced dataset [5]. PHEME is a curated data set of conversation threads about rumors in Twitter replete with journalist annotations for truth, and CREDBANK is a large-scale set of Twitter conversations about events and corresponding crowdsourced accuracy assessments for each event.

Results show our accuracy prediction model correctly classifies two-thirds of the Twitter fake news stories and outperforms prior work in this area. Furthermore, accuracy models generated from crowdsourced workers outperform models trained on journalists in classifying potentially fake Twitter threads. Feature analysis also shows crowdsourced workers' accuracy assessments are more influenced by network effects while journalists' assessments rely more on tweet content and language.

This work makes the following contributions:
• An automated mechanism for classifying popular Twitter threads into true and fake news stories,
• An analysis of the different features used by journalists and crowdsourced workers/non-experts in assessing accuracy in social media stories, and
• An aligned collection of three datasets that capture accuracy judgements across true and false stories.

II. RELEVANT WORK AND DATASETS

Social media's explosion in popularity has enabled research into credibility in the online context, especially on microblogging platforms.
Several previous efforts have proposed methods for evaluating the credibility of a given tweet [8] or user [16], while others have focused more on the temporal dynamics of rumor propagation [6]. Most relevant to our paper, however, is the 2013 Castillo et al. work, which provides a comprehensive examination of credibility features in Twitter [13]. This study was built on an earlier investigation into Twitter usage during the 2010 Chile earthquake, where Twitter played a significant role both in coordination and misinformation [17]. The later study developed a system for identifying newsworthy topics from Twitter and leveraged Amazon's Mechanical Turk (AMT) to generate labels for whether a topic was credible, similar to CREDBANK but at a smaller scale. Castillo et al. developed a set of 68 features that included characteristics of messages, users, and topics as well as the propagation tree to classify topics as credible or not. They found a subset of these features, containing fifteen topic-level features and one propagation tree feature, to be the best performing feature set, with a logistic regression model achieving an accuracy of 64% for credibility classification. Given that general users have difficulty judging correct and accurate information in social media [18], [7], however, crowdsourced credibility assessments like these should be treated with caution. The investigation presented herein builds on this past work by evaluating whether crowdsourced workers (as used in both CREDBANK and Castillo et al.) are valid accuracy assessment sources.

A. The PHEME Rumor Dataset

The PHEME rumor scheme data set was developed by the University of Warwick in conjunction with Swissinfo, part of the Swiss Broadcasting Company [15]. Swissinfo journalists, working with researchers from Warwick, constructed the PHEME data set by following a set of major events on Twitter and identifying threads of conversation that were likely to contain or generate rumors. A "rumor" in this context was defined as an unverified and relevant statement being circulated, and a rumor could later be confirmed as true, false, or left unconfirmed.

For each rumor selected in the PHEME dataset, journalists selected popular (i.e., highly retweeted) tweets extracted from Twitter's search API and labeled these tweets as rumor or non-rumor. This construction resulted in a set of 330 labeled rumorous source tweets across 140 stories. For each tweet in this labeled set, the authors then extracted follow-up tweets that replied to the source tweet and recursively collected descendant tweets that responded to these replies. This collection resulted in a tree of conversation threads of 4,512 additional descendant tweets. Journalists from Swissinfo labeled source tweets for each of these threads as true, false, or unverified. Once this curated set of labeled source tweets and their respective conversation threads was collected, the PHEME data set was then made available to crowdsourced annotators to identify characteristics of these conversation threads. This crowdsourced task asked annotators to identify levels of support (does a tweet support, refute, ask for more information about, or comment on the source tweet), certainty (the tweet author's degree of confidence in his/her support), and evidentiality (what sort of evidence does the tweet provide in supporting or refuting the source tweet) for each tweet in the conversation. Past work found disagreement and refutation in threads to be predictive of accuracy [13], and these annotations of whether a tweet supports or refutes the original tweet help quantify this disagreement, which we leverage later.

Of the 330 conversation trees in PHEME, 159 were labeled as true, 68 false, and 103 unverified.

B. The CREDBANK Dataset

In 2015, Mitra and Gilbert introduced CREDBANK, a large-scale crowdsourced data set containing approximately 37 million unique tweets. The data set covered 96 days starting in October of 2014, broken down into over 1,000 sets of event-related tweets, with each event assessed for accuracy by 30 annotators from AMT [5]. CREDBANK was created by collecting tweets from Twitter's public sample stream, identifying topics within these tweets, and using human annotators to determine which topics were about events and which of these events contained accurate content. Then, the system used Twitter's search API to expand the set of tweets for each event.

CREDBANK's initial set of tweets from the 96-day capture period contained approximately one billion tweets that were then filtered for spam and grouped into one-million-tweet windows. Mitra and Gilbert used online topic modeling from Lau et al. [19] to extract 50 topics (a topic here is a set of three tokens) from each window, creating a set of 46,850 candidate event-topic streams. Each potential event-topic was then passed to 100 annotators on AMT and labeled as an event or non-event, yielding 1,049 event-related topics (the current version of CREDBANK contains 1,377 events). These event-topics were then sent to 30 additional AMT users to determine the event-topic's accuracy.

This accuracy annotation task instructed users to assess "the credibility level of the Event" by reviewing relevant tweets on Twitter's website (see Figure 5 in Mitra and Gilbert [5]). Annotators were then asked to provide an accuracy rating on a 5-point Likert scale of "factuality" (adapted from Sauri et al. [20]) from [−2, +2], where −2 represented "Certainly Inaccurate" and +2 was "Certainly Accurate" [5]. Annotators were required to provide a justification for their choice as well. These tweets, topics, event annotations, and accuracy annotations were published as the CREDBANK dataset (available online at http://compsocial.github.io/CREDBANK-data/). Data provided in CREDBANK includes the three-word topics extracted from Twitter's sample stream, each topic's event annotations, the resulting set of event-topics, a mapping of event-topics' relevant tweets, and a list of the AMT accuracy annotations for each event-topic. One should note that CREDBANK does not contain binary labels of event accuracy but instead has a 30-element vector of accuracy labels.

In CREDBANK, the vast majority (> 95%) of event accuracy annotations had a majority rating of "Certainly Accurate" [5]. Only a single event had a majority label of inaccurate: the rumored death of Chris Callahan, the kicker from Baylor University's football team, during the 2015 Cotton Bowl (this rumorous event was clearly false as Callahan was tweeting about his supposed death after the game). After presenting this tendency towards high ratings, Mitra and Gilbert examined thresholds for majority agreement and found that 76.54% of events had more than 70% agreement, and 2% of events had 100% agreement among annotators. The authors then chose this 70% majority-agreement value as their threshold, and the 23% of events in which less than 70% of annotators agreed were "not perceived to be credible" [5]. This skew is consistent with Castillo et al. [13], where the authors had to remove the "likely to be true" label because crowdsourced workers labeled nearly all topics thusly. We address this bias below.

C. BuzzFeed News Fact-Checking Dataset

In late September 2016, journalists from BuzzFeed News collected over 2,000 posts from nine large, verified Facebook pages (e.g., Politico, CNN, AddictingInfo.org, and Freedom Daily) [14]. Three of these pages were from mainstream media sources, three were from left-leaning organizations, and three were from right-leaning organizations. BuzzFeed journalists fact-checked each post, labeling it as "mostly true," "mostly false," "mixture of true and false," or "no factual content." Each post was then checked for engagement by collecting the number of shares, comments, and likes on the Facebook platform. In total, this data set contained 2,282 posts, 1,145 from mainstream media, 666 from right-wing pages, and 471 from left-wing pages [14].

III. METHODS

This paper's central research question is whether we can automatically classify popular Twitter stories as either accurate or inaccurate (i.e., true or fake news). Given the scarcity of data on true and false stories, however, we solve this classification problem by transferring credibility models trained on CREDBANK and PHEME to this fake news detection task in the BuzzFeed dataset. To develop a model for classifying popular Twitter threads as accurate or inaccurate, we must first formalize four processes: featurizing accuracy prediction, aligning the three datasets, selecting which features to use, and evaluating the resulting models.

A. Features for Predicting Accuracy

Here, we describe the 45 features we use for predicting accuracy, which fall across four types: structural, user, content, and temporal. Of these features, we include fourteen of the most important features found in Castillo et al., omitting the two features on most frequent web links. Structural features capture Twitter-specific properties of the tweet stream, including tweet volume and activity distributions (e.g., proportions of retweets or media shares). User features capture properties of tweet authors, such as interactions, account ages, friend/follower counts, and Twitter verified status. Content features measure textual aspects of tweets, like polarity, subjectivity, and agreement. Lastly, temporal features capture trends in the previous features over time, e.g., the slopes of the number of tweets or average author age over time. As mentioned, many features were inspired by or reused from Castillo et al. [13].

1) Structural Features: Structural features are specific to each Twitter conversation thread and are calculated across the entire thread. These features include the number of tweets, average tweet length, thread lifetime (number of minutes between first and last tweet), and the depth of the conversation tree (inspired by other work that suggests deeper trees are indicators of contentious topics [21]). We also include the frequency and ratio (as in Castillo et al.) of tweets that contain hashtags, media (images or video), mentions, retweets, and web links.
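To make these structural features concrete, the following Python sketch (not the authors' implementation) computes several of them from a list of tweet dictionaries in the classic Twitter REST API JSON format; the function name and thread representation are assumptions made for illustration only.

    from datetime import datetime

    TIME_FMT = "%a %b %d %H:%M:%S %z %Y"  # timestamp format used by the classic Twitter API

    def structural_features(tweets):
        """Compute thread-level structural features from a list of tweet dicts."""
        n = len(tweets)
        times = [datetime.strptime(t["created_at"], TIME_FMT) for t in tweets]
        feats = {
            "tweet_count": n,
            "avg_tweet_length": sum(len(t["text"]) for t in tweets) / n,
            "thread_lifetime_min": (max(times) - min(times)).total_seconds() / 60.0,
            "media_count": sum(bool(t.get("entities", {}).get("media")) for t in tweets),
            "hashtag_count": sum(bool(t.get("entities", {}).get("hashtags")) for t in tweets),
            "mention_count": sum(bool(t.get("entities", {}).get("user_mentions")) for t in tweets),
            "retweet_count": sum("retweeted_status" in t for t in tweets),
            "link_count": sum(bool(t.get("entities", {}).get("urls")) for t in tweets),
        }
        # Ratios accompany the raw frequencies, as in Castillo et al.
        for name in ["media", "hashtag", "mention", "retweet", "link"]:
            feats[name + "_ratio"] = feats[name + "_count"] / n
        return feats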
2) User Features: While the previous set focuses on activities and thread characteristics, the following features are attributes of the users taking part in the conversations, their connectedness, and the density of interaction between these users. User features include account age; average follower-, friend-, and authored status counts; frequency of verified authors; and whether the author of the first tweet in the thread is verified. We also include the difference between when an account was created and the relevant tweet was authored (to capture bots or spam accounts).

The last user-centric feature, network density, is measured by first creating a graph representation of interactions between a conversation's constituent users. Nodes in this graph represent users, and edges correspond to mentions and retweets between these users. The intuition here is that in highly dense networks, users are responding to each other's posts and to endogenous phenomena. Sparser interaction graphs suggest the conversation's topic is stimulated by exogenous influences outside the social network and is therefore more likely to be true.
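A minimal sketch of this density computation using the networkx library is shown below; the paper does not specify its implementation, so the tweet-dictionary format and names here are assumptions.

    import networkx as nx

    def interaction_density(tweets):
        """Build an undirected user-interaction graph for one thread and return
        its density; nodes are users, edges are mentions and retweets."""
        g = nx.Graph()
        for t in tweets:
            author = t["user"]["screen_name"]
            g.add_node(author)
            for mention in t.get("entities", {}).get("user_mentions", []):
                g.add_edge(author, mention["screen_name"])
            if "retweeted_status" in t:
                g.add_edge(author, t["retweeted_status"]["user"]["screen_name"])
        # Density = 2|E| / (|V|(|V|-1)); dense graphs suggest endogenous chatter,
        # sparse graphs suggest exogenously driven conversation.
        return nx.density(g)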
3) Content Features: Content features are based on tweets' textual aspects and include polarity (the average positive or negative feeling expressed in a tweet), subjectivity (a score of whether a tweet is objective or subjective), and disagreement, as measured by the number of tweets expressing disagreement in the conversation. As mentioned in PHEME's description, tweet annotations include whether a tweet supports, refutes, comments on, or asks for information about the story presented in the source tweet. These annotations directly support evaluating the hypothesis put forth in Mendoza, Poblete, and Castillo [17], stating that rumors contain higher proportions of contradicting or refuting messages. We therefore include these disagreement annotations (only a binary value for whether the tweet refutes the source). Also borrowing from Castillo et al., we include the frequency and proportions of tweets that contain question marks, exclamation points, first/second/third-person pronouns, and smiling emoticons.
4) Temporal Features: Recent research has shown temporal dynamics are highly predictive when identifying rumors on social media [6], so in addition to the frequency and ratio features described above, we also include features that describe how these values change over time. These features are developed by accumulating the above features at each minute in the conversation's lifetime and converting the accumulated value to logarithmic space. We then fit a linear regression model to these values in log space and use the slope of this regression as the feature's value, thereby capturing how these features increase or decrease over time. We maintain these temporal features for account age, difference between account age and tweet publication time, author followers/friends/statuses, and the number of tweets per minute.
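One such slope feature could be computed as in the following numpy sketch; the exact transform (here, log1p of the accumulated values, with time left in linear units) and the names are assumptions, since the paper only describes the approach at a high level.

    import numpy as np

    def temporal_slope(per_minute_values):
        """Fit a line to the log-space accumulated values over the thread's
        lifetime and return its slope as the temporal feature."""
        y = np.log1p(np.asarray(per_minute_values, dtype=float))  # accumulated values in log space
        x = np.arange(len(y), dtype=float)                        # minute index
        slope, _intercept = np.polyfit(x, y, deg=1)               # least-squares linear fit
        return slope

    # Example: cumulative tweet counts for the first six minutes of a thread
    print(temporal_slope([1, 4, 9, 15, 18, 20]))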
removed from the dataset. This labeling process results in 203
3) Capturing Twitter's Threaded Structure: Another major difference between PHEME and CREDBANK/BuzzFeed is the form of tweet sets: in PHEME, topics are organized into threads, starting with a popular tweet at the root and replies to this popular tweet as the children. This threaded structure is not present in CREDBANK or our BuzzFeed data, as CREDBANK contains all tweets that match the related event-topic's three-word topic query, and BuzzFeed contains popular tweeted headlines. To capture thread depth, which may be a proxy for controversy [21], we adapt CREDBANK's tweet sets and BuzzFeed's popular tweet headlines into threads using PHEME's thread-capture tool. For our BuzzFeed data, we use the popular headline tweets as the thread roots and capture replies to these roots to construct the thread structure mimicking PHEME's. In CREDBANK, we identify the most retweeted tweet in each event and use this tweet as the thread root. Any CREDBANK thread that has no reactions gets discarded, leaving a final total of 115 positive samples and 95 negative samples.

4) Inferring Disagreement in Tweets: One of the more important features suggested in Castillo et al. is the amount of disagreement or contradiction present in a conversation [13]. PHEME already contained this information in the form of "support" labels for each reply to the thread's root, but CREDBANK and our BuzzFeed data lack these annotations. To address this omission, we developed a classifier for identifying tweets that express disagreement. This classifier used a combination of the support labels in PHEME and the "disputed" labels in the CreateDebate segment of the Internet Argument Corpus (IACv2) [22]. We merged PHEME's support labels and IACv2 into a single ground-truth dataset to train this disagreement classifier. Augmenting PHEME support labels with the IAC was necessary to achieve sufficient performance, as the classifier otherwise achieved an area under the receiver operating characteristic curve of only 72.66%.

This disagreement classifier modeled tweet and forum text as bags of unigrams and bigrams. After experimenting with support vector machines, random forests, and naive Bayes classifiers, we found stochastic gradient descent to be the best predictor of disagreement and disputed labels. 10-fold cross validation of this classifier achieved a mean area under the receiver operating characteristic curve of 86.7%. We then applied this classifier to the CREDBANK and BuzzFeed threads to assign disagreement labels for each tweet.
A human then reviewed a random sample of these labels. While human annotators would be better for this task, an automated classifier was preferable given CREDBANK's size.
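A comparable disagreement classifier can be sketched with scikit-learn as follows; the paper reports unigram/bigram bags and stochastic gradient descent, but the specific pipeline, hyperparameters, and function names below are assumptions.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def build_disagreement_classifier():
        """Bag of unigrams and bigrams feeding a linear model trained by SGD."""
        return make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),
            SGDClassifier(max_iter=1000, random_state=0),
        )

    def mean_cv_auc(texts, labels):
        """10-fold cross-validated ROC-AUC; texts combine PHEME replies and IACv2
        posts, and labels mark disagreement/disputed (1) versus other (0)."""
        clf = build_disagreement_classifier()
        return cross_val_score(clf, texts, labels, cv=10, scoring="roc_auc").mean()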
C. Per-Set Feature Selection

The previous sections present the features we use to capture structure and behavior in potentially false Twitter threads. Our objective is to use these features to train models capable of predicting labels in the PHEME and CREDBANK datasets and evaluate how these models transfer to the BuzzFeed fake news dataset, but machine learning tasks are often sensitive to feature dimensionality. That is, low-quality features can reduce overall model performance. To address this concern, we perform a recursive feature elimination study within PHEME and CREDBANK to identify which features are the most predictive of accuracy in their respective datasets.

For each training dataset (i.e., CREDBANK and PHEME), we evaluate feature performance by measuring the area under the receiver operating characteristic curve (ROC-AUC) for a model trained using combinations of features. The area under this ROC curve characterizes model performance on a scale of 0 to 1 (a random coin toss would achieve a ROC-AUC of 0.5 for a balanced set). For each feature set, we perform thirty instances of 10-fold cross-validation using a 100-tree random forest classifier (an ensemble made of 100 separate decision trees, each trained on a random feature subset) to estimate the ROC-AUC for that feature set.

With the classifier and evaluation metric established, our feature selection process recursively removes the least performant feature in each iteration until only a single feature remains. The least performant feature is determined using a leave-one-out strategy: in an iteration with k features, k models are evaluated such that each model uses all but one held-out feature, and the feature whose exclusion results in the highest ROC-AUC is removed from the feature set. This method identifies which features hinder performance since removing important features will result in losses in ROC-AUC score, and removing unimportant or bad features will either increase ROC-AUC or have little impact. Given k features, the process will execute k − 1 iterations, and each iteration will output the highest scoring model's ROC-AUC. By inspecting these k − 1 maximum scores, we determine the most important feature subset by identifying the iteration at which the maximum model performance begins to decrease.
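A condensed sketch of this elimination loop is given below; for brevity it runs a single 10-fold cross-validation pass per candidate feature set rather than thirty, and the DataFrame-based interface and names are assumptions.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def cv_auc(X, y, features):
        """Mean 10-fold ROC-AUC of a 100-tree random forest on the given columns."""
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(clf, X[features], y, cv=10, scoring="roc_auc").mean()

    def recursive_elimination(X, y):
        """Each round, drop the feature whose removal yields the highest score;
        record the best score and surviving subset (k features -> k-1 rounds)."""
        remaining = list(X.columns)
        history = []
        while len(remaining) > 1:
            trials = {f: cv_auc(X, y, [g for g in remaining if g != f])
                      for f in remaining}
            worst = max(trials, key=trials.get)  # removing it hurts least (or helps most)
            remaining.remove(worst)
            history.append((trials[worst], list(remaining)))
        # The selected subset is the one recorded just before the maximum score
        # in `history` begins to decline.
        return history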
D. Evaluating Model Transfer

Once the datasets are aligned and the most performant feature subsets in CREDBANK and PHEME are identified (these feature subsets are constructed separately and may not overlap), we can then evaluate how well each dataset predicts truth in the BuzzFeed dataset. This evaluation is performed by restricting each source dataset (either CREDBANK or PHEME) to its most performant feature subset and training a 100-tree random forest classifier on each source (we tested other classifiers here as well, and they all performed approximately equally). Each resulting classifier is applied to the BuzzFeed dataset, again restricted to the source dataset's most performant feature set, and the ROC-AUC for that classifier is calculated using the BuzzFeed journalists' truth labels. This training and application process is repeated 20 times, and we calculate the average ROC-AUC across these repetitions. We also build a third classification model by pooling both CREDBANK and PHEME datasets together and using the union of the most performant features in each set. We then plot the ROC curves for both source datasets, the pooled dataset, and a random baseline that predicts BuzzFeed labels through coin tosses, and we select the highest-scoring model.
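The transfer evaluation can be sketched as follows under the same assumptions (feature matrices as pandas DataFrames, illustrative names); the paper averages ROC-AUC over 20 training repetitions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    def transfer_auc(X_src, y_src, X_buzzfeed, y_buzzfeed, selected, n_runs=20):
        """Train on a source dataset (CREDBANK or PHEME) restricted to its selected
        feature subset, score against BuzzFeed truth labels, and average over runs."""
        aucs = []
        for seed in range(n_runs):
            clf = RandomForestClassifier(n_estimators=100, random_state=seed)
            clf.fit(X_src[selected], y_src)
            scores = clf.predict_proba(X_buzzfeed[selected])[:, 1]
            aucs.append(roc_auc_score(y_buzzfeed, scores))
        return float(np.mean(aucs))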
IV. RESULTS

A. Feature Selection

Recursively removing features from our models and evaluating classification results yielded significantly reduced feature sets for both PHEME and CREDBANK, the results of which are shown in Figure 1. The highest performing feature set for PHEME contained only seven of the 45 features: proportions and frequency of tweets sharing media; proportions of tweets sharing hashtags; proportions of tweets containing first- and third-person pronouns; proportions of tweets expressing disagreement; and the slope of the average number of authors' friends over time. The top ten features also included account age, frequency of smile emoticons, and author friends. This PHEME feature set achieved an ROC-AUC score of 0.7407 and correctly identified 66.93% of potentially false threads within PHEME.

CREDBANK's most informative feature set used 12 of the 45 features: frequencies of smiling emoticons, tweets with mentions, and tweets with multiple exclamation or question marks; proportions of tweets with multiple exclamation marks, one or more question marks, tweets with hashtags, and tweets with media content; author account age relative to a tweet's creation date; average tweet length; author followers; and whether the thread started with a verified author. Proportions of tweets with question marks and multiple exclamation/question marks were not in the top ten features, however. This feature set achieved an ROC-AUC score of 0.7184 and correctly identified 70.28% of potentially false threads within CREDBANK.

Of these feature subsets, only three features are shared by both the crowdsourced worker and journalist models (frequency of smile emoticons and proportion of tweets with media or hashtags). These results are also consistent with the difficulty in identifying potentially fallacious threads of conversation in Twitter discussed in Castillo et al. [13]. Furthermore, both PHEME's and CREDBANK's top ten features contain five of the 16 best features found in Castillo et al. [13]. Despite these consistencies, our models outperform the model presented in this prior work (61.81% accuracy in Castillo et al. versus 66.93% and 70.28% in PHEME and CREDBANK). These increases are marginal but at least consistent with past results.

Fig. 1: Feature Elimination Study (ROC-AUC versus number of deleted features for the CREDBANK-Threaded and PHEME models).

B. Predicting BuzzFeed Fact-Checking

Applying the most performant CREDBANK and PHEME models to our BuzzFeed dataset shows both the pooled and CREDBANK-based models outperform the random baseline, but the PHEME-only model performs substantially worse, as shown in Figure 2. From this graph, CREDBANK-based models applied to the BuzzFeed data performed nearly equivalently to performance in their native context, achieving a ROC-AUC of 73.80% and accuracy of 65.29%. The pooled model scores about evenly with the random baseline, with a ROC-AUC of 53.14% and accuracy of 51.00%. The PHEME-based model only achieved a ROC-AUC of 36.52% and accuracy of 34.14%. None of the datasets' results were statistically correlated with the underlying actual labels either, with CREDBANK's χ²(1, N = 35) = 2.803, p = 0.09409, PHEME's χ²(1, N = 35) = 2.044, p = 0.1528, and the pooled model's χ²(1, N = 35) = 0.2883, p = 0.5913.

Fig. 2: Adapting to Fake News Classification (ROC curves of true positive rate versus false positive rate on the BuzzFeed data for the random baseline (AUC=49.90%), pooled (AUC=53.14%), CREDBANK (AUC=73.80%), and PHEME (AUC=36.52%) models).

V. DISCUSSION

Analysis of the above results suggests two significant findings: First, models trained against non-expert, crowdsourced workers outperform models trained against journalists in classifying popular news stories on Twitter as true or fake. Second, the limited predictive feature overlap in PHEME and CREDBANK suggests these populations evaluate accuracy in social media differently.

Regarding crowdsourced performance against the BuzzFeed dataset, since these stories were fact-checked by journalists, one might expect the PHEME model to perform better in this context. We propose an alternate explanation: When a thread in Twitter starts with a story headline and link, the story's truth, as a journalist would define it, influences but does not dictate crowdsourced workers' perceptions of the thread. Rather, it is this perception of accuracy that dictates how the story is shared. Stated another way, the CREDBANK model captures user perceptions that drive engagement and sharing better than the PHEME model. Furthermore, our CREDBANK model is more rooted in the Twitter context than PHEME since CREDBANK assessors were asked to make their judgements based solely on the tweets they saw rather than the additional external information PHEME journalists could leverage. While a CREDBANK assessor may have used external resources like search engines to check results, the majority of assessor justifications for their judgements were based on perception and how they felt rather than external fact checking [5]. From this perspective, CREDBANK models may be more appropriate for a social media-based automated fake news detection task since both rely primarily on signals endogenous to social media (rather than external journalistic verification). Finally, given the commensurate performance CREDBANK and PHEME exhibit in their native contexts, PHEME's poor performance for fake news suggests some fundamental difference between how endogenous rumors propagate in social media and how fake news is perceived and shared, but more work is needed here.

Along similar lines, though CREDBANK assessors are clearly biased towards believing what they read, our results show that the differences between story ratings capture some latent feature of accuracy. That is, while users may be more likely to perceive false news stories as credible, their assessments suggest incorrect stories still receive lower scores. Future research can use this information to correct for non-experts' bias towards believing what they read online, which may yield better models or better inform researchers about how fake news stories can be stopped before they spread.

Regarding contrasting accuracy models, we see diverging feature sets between PHEME and CREDBANK. A review of the important features in each model suggests PHEME assessment is more linked to structural and content features rather than user or temporal features. CREDBANK assessments, on the other hand, focused more on different content markers, like formality of language (e.g., emoticons and many exclamation points), and user features, such as whether the tweet was from a verified author. While both datasets are built on "accuracy" assessments, we theorize this question captures two separate qualities: for PHEME's journalists, "accuracy" is objective or factual truth, whereas CREDBANK's crowdsourced workers equate "accuracy" with credibility, or how believable the story seems.
In PHEME, journalists evaluate the factual accuracy of conversation threads after "a consensus had emerged about the facts relating to the event in question" and after reviewing all the captured tweets relevant to that event [15]. CREDBANK assessors, as mentioned, focus more on perception of accuracy, or believability, in their justifications and are driven to make judgements rapidly by CREDBANK's "real-time responsiveness" [5]. This distinction would also explain assessors' significant bias towards rating threads as accurate, which was present in both CREDBANK and Castillo et al. [13], since readers are pre-disposed to believe online news [23], [7].

Finally, by making an aligned version of this cross-platform dataset available, future research can explore differences between assessment populations. Our results suggest journalists and crowdsourced workers use distinct signals in evaluating accuracy, which could be expanded and used to educate non-experts on which features they should focus on when reading social media content. Similarly, enhancing journalists' understanding of the features non-experts use when assessing accuracy may allow for better-crafted corrections to propagate through social media more rapidly.

A. Limitations

While the results discussed herein suggest crowdsourced workers provide a good source for identifying fake news, several limitations may influence our results. This work's main limitation lies in the structural differences between CREDBANK and PHEME, which could affect model transfer. If the underlying distributions that generated our samples diverge significantly, differences in feature sets or cross-context performance could be attributed to structural issues rather than actual model capabilities. In future work, this limitation could be addressed by constructing a single data set of potential rumors and fake news threads and using both crowdsourced and journalist assessors to evaluate the same data. This new data set would obviate any issues or biases introduced by the alignment procedure we employed herein.

Another potential limitation is this work's focus on popular Twitter threads. We rely on identifying highly retweeted threads of conversation and use the features of these threads to classify stories, limiting this work's applicability only to the set of popular tweets. Since the majority of tweets are rarely retweeted, this method is therefore usable only on a minority of Twitter conversation threads. While a limitation, its severity is mitigated by the fact that fake news that is not being retweeted either is not gaining traction among the user base, or the user base has already identified it as fake. Hence, our applicability to more popular tweets is valuable, as popular but fake stories have more potential to misinform than less popular fake stories.

VI. CONCLUSIONS

This work demonstrates an automated system for detecting fake news in popular Twitter threads. Furthermore, leveraging non-expert, crowdsourced workers rather than journalists provides a useful and less expensive means to classify true and false stories on Twitter rapidly. Such a system could be valuable to social media users by augmenting and supporting their own credibility judgements, which would be a crucial boon given the known weaknesses users exhibit in these judgements. These results may also be of value in studying propaganda on social media to determine whether such stories follow similar patterns.

ACKNOWLEDGEMENTS

This research was supported by an appointment to the Intelligence Community Postdoctoral Research Fellowship Program at the University of Maryland, College Park, administered by Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the Office of the Director of National Intelligence.

DATA AVAILABILITY

Data and analyses are available online at: https://github.com/cbuntain/CREDBANK-data

REFERENCES

[1] A. A. Memon, A. Vrij, and R. Bull, Psychology and Law: Truthfulness, Accuracy and Credibility. John Wiley & Sons, 2003.
[2] S. R. Maier, "Accuracy Matters: A Cross-Market Assessment of Newspaper Error and Credibility," Journalism & Mass Communication Quarterly, vol. 82, no. 3, pp. 533–551, 2005.
[3] B. J. Fogg and H. Tseng, "The Elements of Computer Credibility," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI '99. New York, NY, USA: ACM, 1999, pp. 80–87. [Online]. Available: http://doi.acm.org/10.1145/302979.303001
[4] B. Liu, J. D. Fraustino, and Y. Jin, "Social Media Use during Disasters: A Nationally Representative Field Experiment," College Park, MD, Tech. Rep., 2013.
[5] T. Mitra and E. Gilbert, "CREDBANK: A Large-Scale Social Media Corpus With Associated Credibility Annotations," International AAAI Conference on Web and Social Media (ICWSM), 2015. [Online]. Available: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10582
[6] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang, "Prominent features of rumor propagation in online social media," Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 1103–1108, 2013.
[7] K. Starbird, J. Maddock, M. Orand, P. Achterman, and R. M. Mason, "Rumors, False Flags, and Digital Vigilantes: Misinformation on Twitter after the 2013 Boston Marathon Bombing," iConference 2014 Proceedings, pp. 654–662, 2014.
[8] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei, "Rumor Has It: Identifying Misinformation in Microblogs," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1589–1599. [Online]. Available: http://dl.acm.org/citation.cfm?id=2145432.2145602
[9] A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi, "Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy," Proceedings of the 22nd . . . , pp. 729–736, 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2488033
[10] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and F. Menczer, "Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams," CoRR, vol. abs/1011.3, 2010. [Online]. Available: http://arxiv.org/abs/1011.3768
[11] K. Starbird, "Examining the Alternative Media Ecosystem through the Production of Alternative Narratives of Mass Shooting Events on Twitter," International AAAI Conference on Web and Social Media (ICWSM), 2017. [Online]. Available: http://faculty.washington.edu/kstarbi/Alt Narratives ICWSM17-CameraReady.pdf

[12] L. Sydell, "We Tracked Down A Fake-News Creator In The Suburbs. Here's What We Learned," Nov. 2016. [Online]. Available: http://www.npr.org/sections/alltechconsidered/2016/11/23/503146770/npr-finds-the-head-of-a-covert-fake-news-operation-in-the-suburbs
[13] C. Castillo, M. Mendoza, and B. Poblete, "Predicting information credibility in time-sensitive social media," Internet Research, vol. 23, no. 5, pp. 560–588, 2013.
[14] C. Silverman, L. Strapagiel, H. Shaban, E. Hall, and J. Singer-Vine, "Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate," Oct. 2016. [Online]. Available: https://www.buzzfeed.com/craigsilverman/partisan-fb-pages-analysis
[15] A. Zubiaga, G. W. S. Hoi, M. Liakata, R. Procter, and P. Tolmie, "Analysing How People Orient to and Spread Rumours in Social Media by Looking at Conversational Threads," PLoS ONE, pp. 1–33, 2015. [Online]. Available: http://arxiv.org/abs/1511.07487
[16] B. Kang, J. O'Donovan, and T. Höllerer, "Modeling topic specific credibility on twitter," in Proceedings of the 2012 ACM International Conference on Intelligent User Interfaces, ser. IUI '12. New York, NY, USA: ACM, 2012, pp. 179–188. [Online]. Available: http://doi.acm.org/10.1145/2166966.2166998
[17] M. Mendoza, B. Poblete, and C. Castillo, "Twitter Under Crisis: Can We Trust What We RT?" in Proceedings of the First Workshop on Social Media Analytics, ser. SOMA '10. New York, NY, USA: ACM, 2010, pp. 71–79. [Online]. Available: http://doi.acm.org/10.1145/1964858.1964869
[18] Stanford History Education Group, "Evaluating information: The cornerstone of civic online reasoning," Stanford University, Stanford, CA, Tech. Rep., 2016. [Online]. Available: https://sheg.stanford.edu/upload/V3LessonPlans/ExecutiveSummary11.21.16.pdf
[19] J. Lau, N. Collier, and T. Baldwin, "On-line Trend Analysis with Topic Models: #twitter Trends Detection Topic Model Online," International Conference on Computational Linguistics (COLING), vol. 2, pp. 1519–1534, Dec. 2012. [Online]. Available: https://www.aclweb.org/anthology/C/C12/C12-1093.pdf
[20] R. Sauri and J. Pustejovsky, FactBank: A corpus annotated with event factuality, 2009, vol. 43, no. 3.
[21] C. Tan, V. Niculae, C. Danescu-Niculescu-Mizil, and L. Lee, "Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions," in Proceedings of the 25th International Conference on World Wide Web, ser. WWW '16. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2016, pp. 613–624. [Online]. Available: https://doi.org/10.1145/2872427.2883081
[22] R. Abbott, B. Ecker, P. Anand, and M. A. Walker, "Internet Argument Corpus 2.0: An SQL schema for Dialogic Social Media and the Corpora to go with it," pp. 4445–4452, 2015.
[23] J. B. Mackay and W. Lowrey, "The Credibility Divide: Reader Trust Of Online Newspapers And Blogs," Journal of Media Sociology, vol. 3, no. 1-4, pp. 39–57, 2011.
