Knowledge and The Web 2017/18 (2) Overview of Fake-News Detection
CRISP-DM
• CRoss Industry Standard Process for Data Mining
• A data mining process model that describes the approaches expert data miners commonly use to tackle problems.
http://www.crisp-dm.org/Images/187343_CRISPart.jpg
• 1. “Business understanding”
• What IS fake news?
Fake news – a definition
• Fake news […] is where individuals or organisations intentionally
publish hoaxes, propaganda and other misinformation and
present it as factual.
• This can include blog and social media posts and fake online
media releases.
• It does not include news satire sites such as The Onion or The
Shovel as they are not presenting their content as legitimate
factual news. Their intention is satire rather than misinformation.
• It also does not include articles that are written from the
perspective of a particular opinion or editorial standpoint,
provided the information included is factually correct.
See Tina Fey in Weekend Update
http://insider.foxnews.com/2017/02/24/wash-post-stands-behind-9-source-story-after-trump-calls-it-fake-news
Facts?
Opinions?
FN?
https://www.nytimes.com/2017/10/04/opinion/vegas-gun-control-debate.html
Intention?
• The intentionality of deception is also a
requirement in Rubin et al.’s (2015) definition
• Whose intentionality?
– Creator of the news?
• E.g. governments (WMD example)
• The press
– Purveyor of the news?
• The press
• Social media: you!
• How to capture the intention as a DM/ML feature?
https://fas.org/irp/cia/product/image016.jpg
https://en.wikipedia.org/wiki/Iraq_and_weapons_of_mass_destruction
Weapons of Mass Destruction and the 2003 Iraq war
• If A, knowing that an item X is satire, retweets
X “without the metadata that it is satire”,
• And B reads and believes it
• Then did A create and/or spread FN?
"On Bullshit" (2005), by philosopher
Harry G. Frankfurt, is an essay that presents
a theory of bullshit that defines the concept
and analyzes the applications of bullshit in
the contexts of communication. Frankfurt
determines that bullshit is speech intended
to persuade (a.k.a. rhetoric), without regard
for truth.
• The liar cares about the truth and attempts
to hide it;
• the bullshitter doesn't care if what they
say is true or false, but rather only cares
whether or not their listener is persuaded.
CS approach: Define a “ground truth”
http://www.fakenewschallenge.org
• 2. + 3. Data understanding and preparation
http://www.fakenewschallenge.org/
• More options and steps possible / necessary
• Use off-the-shelf tools for NLP processing and
feature extraction
• Some pointers will be published on Toledo
• 4. Modelling
• 4.1. Approach: what do humans do to debunk
FN?
Slide from (Melbourne Atheneum Library, n.d.)
• 4. Modelling
• 4.2. Approach: How to formalise FN
detection?
• What is the task?
(Human) strategies
Human strategies translate to various
machine tasks
Strategy “Read past the headline”
• The goal of the Fake News Challenge is to explore how artificial intelligence technologies,
particularly machine learning and natural language processing, might be leveraged to
combat the fake news problem. We believe that these AI technologies hold promise for
significantly automating parts of the procedure human fact checkers use today to
determine if a story is real or a hoax.
• Assessing the veracity of a news story is a complex and cumbersome task, even for trained
experts. Fortunately, the process can be broken down into steps or stages. A helpful first
step towards identifying fake news is to understand what other news organizations are
saying about the topic. We believe automating this process, called Stance Detection,
could serve as a useful building block in an AI-assisted fact-checking pipeline. So stage #1
of the Fake News Challenge (FNC-1) focuses on the task of Stance Detection.
• Stance Detection involves estimating the relative perspective (or stance) of two pieces of
text relative to a topic, claim or issue. The version of Stance Detection we have selected
for FNC-1 extends the work of Ferreira & Vlachos. For FNC-1 we have chosen the task of
estimating the stance of a body text from a news article relative to a headline. Specifically,
the body text may agree, disagree, discuss or be unrelated to the headline.
http://www.fakenewschallenge.org/
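To make the task shape concrete, here is a minimal sketch of the four stance labels together with a crude first-stage filter that only separates "unrelated" from the three related stances. The word-overlap heuristic, the function names and the threshold value are illustrative assumptions, not the method of the Fake News Challenge or of any submission; telling agree, disagree and discuss apart would need richer features.

```python
import re
from enum import Enum


class Stance(Enum):
    """The four stance labels named in the FNC-1 task description."""
    AGREE = "agree"
    DISAGREE = "disagree"
    DISCUSS = "discuss"
    UNRELATED = "unrelated"


def tokens(text):
    """Lower-cased word tokens as a set; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(headline, body):
    """Jaccard overlap between headline and body vocabularies (0.0 to 1.0)."""
    h, b = tokens(headline), tokens(body)
    if not h or not b:
        return 0.0
    return len(h & b) / len(h | b)


def is_related(headline, body, threshold=0.05):
    """Hypothetical related-vs-unrelated filter.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return word_overlap(headline, body) >= threshold


# Invented example texts, for illustration only.
headline = "City council approves new bike lanes"
related_body = "The council voted on Tuesday to approve new bike lanes across the city"
unrelated_body = "Scientists discover a distant exoplanet"

print(is_related(headline, related_body))
print(is_related(headline, unrelated_body))
```

A filter like this can only propose Stance.UNRELATED or "one of the other three"; it says nothing about whether the body supports or contradicts the headline.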
Task “stance detection” (à la Fake News Challenge)
http://www.fakenewschallenge.org/
Stance detection - Example
http://www.fakenewschallenge.org/
Strategies “Follow links and check sources” and “Check other news outlets”
Task: Claim validation
http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html
SOLAT in the SWEN (2)
http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html
SOLAT in the SWEN (3)
http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html
Slide from (Hanselowski & Gurevych, 2017)
Example Claim Validation
• (work in progress)
Classification by Conroy et al. (2015)
• Linguistic approaches
– Mainly word-based
– Syntax-based approaches
– Semantic analysis
• Compare “profile” of document with others known to be genuine
– Rhetorical structure and discourse analysis
• Systematic differences between deceptive and truthful messages in terms of their coherence and
structure
– Classifiers
• Classification of sentiment: assumption that deceivers use unintended emotional communication
– In sum, linguistic approaches most suited to domain-specific studies (e.g. product reviews,
business), may have limited generalizability
• Network approaches
– Linked data
– Social network behaviour
• Hybrids
Classification by Conroy et al. (2015)
• Linguistic approaches (more details: 18 October)
– Mainly word-based
– Syntax-based approaches
– Semantic analysis
• Compare “profile” of document with others known to be genuine
– Rhetorical structure and discourse analysis
• Systematic differences between deceptive and truthful messages in terms of their coherence and
structure
– Classifiers
• Classification of sentiment: assumption that deceivers use unintended emotional communication
– In sum, linguistic approaches most suited to domain-specific studies (e.g. product reviews,
business), may have limited generalizability
• Network approaches
– Linked data (next week)
– Social network behaviour (invited lecture on 8 November)
• Hybrids (18 October)
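As a rough illustration of the word-based end of the linguistic approaches, the sketch below builds a bag-of-words "profile" from documents assumed to be genuine and scores a new document by cosine similarity to that profile. This is a simplification for intuition only, not the procedure described by Conroy et al. (2015); the function names and the example texts are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word counts for a text, lower-cased, punctuation ignored."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(documents):
    """Sum the word counts of all reference documents into one profile."""
    profile = Counter()
    for doc in documents:
        profile.update(bag_of_words(doc))
    return profile


def profile_similarity(document, profile):
    """How closely a document's vocabulary matches the reference profile."""
    return cosine(bag_of_words(document), profile)


# Invented example texts, for illustration only.
genuine = [
    "the minister said the budget will rise",
    "officials said the report was published on monday",
]
profile = build_profile(genuine)
doc_a = "the minister said the report was published"
doc_b = "shocking miracle cure doctors hate"

print(profile_similarity(doc_a, profile))
print(profile_similarity(doc_b, profile))
```

A score like this reflects vocabulary and topic as much as deception, which is one reason such linguistic features may generalise poorly outside the domain they were built on.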
• 4. Modelling
• 4.3. Beyond data mining / machine learning
What about the other strategies?
citation analysis + reputation?!
What about the other strategies?
I would add: Don’t assume something is true just because it is entertaining.
Maybe satire news points to some other cause?
• News consumption as entertainment
• Including satire-news?
http://www.fipp.com/news/insightnews/chart-millennials-pay-for-entertainment-not-news
https://www.pinterest.co.uk/meyerlinger/head-down-generation-smartphone-zombies
What about the other strategies?
Awareness tools / nudges?!
Example: The FB audience nudge (better example: the
timer nudge, but I didn’t find a picture of it)
HCI over and above DM/ML!
Using these features, it is not necessary to use a powerful and expressive model to
learn the complex mapping from these features to the stance label.
For this, Gradient-Boosted Decision Trees were chosen because of the model’s
robustness with regard to the different scales of our feature vectors. Specifically, no
normalization is needed and it can be regularized in several different ways to avoid
overfitting. Furthermore, XGBoost is a very efficient, open-source implementation that
was easily applied to the handcrafted features.
http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html
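The claim that no normalization is needed follows from how tree splits work: a split depends only on the ordering of a feature's values, not on their scale. The toy sketch below illustrates this with gradient boosting over depth-1 regression trees (stumps) in plain Python. It is not XGBoost and not the Talos model: it uses squared loss on 0/1 targets for simplicity, and the two features (a word-overlap score and a body length), the data and all function names are invented for illustration.

```python
def fit_stump(X, residuals):
    """Find the (feature, threshold) split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, thr, lmean, rmean)
    if best is None:  # all rows identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, thr, lmean, rmean = stump
    return lmean if row[j] <= thr else rmean


def fit_boosted(X, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals (squared loss)."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared loss.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Invented toy data: [word-overlap score, body length]; 1 = related, 0 = unrelated.
X = [[0.9, 120], [0.8, 300], [0.7, 80], [0.2, 250], [0.1, 90], [0.05, 400]]
y = [1, 1, 1, 0, 0, 0]
model = fit_boosted(X, y)

# Rescaling a feature leaves the fitted predictions unchanged.
X_scaled = [[a * 1000, b] for a, b in X]
model_scaled = fit_boosted(X_scaled, y)
print(predict(model, X[0]), predict(model_scaled, X_scaled[0]))
```

The two features live on very different scales (fractions versus hundreds), yet the stumps pick the same partitions of the training rows before and after rescaling, so the fitted predictions agree.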
Human performance as an upper bound?