Fake News Detection Using Machine Learning Algorithm
Abstract - In our modern era, where the internet is ubiquitous, everyone relies on various online resources for news. With the increase in the use of social media platforms like Facebook and Twitter, news spreads rapidly among millions of users within a very short span of time. The spread of fake news has far-reaching consequences, from the creation of biased opinions to the swaying of election outcomes for the benefit of certain candidates. Moreover, spammers use appealing news headlines to generate advertisement revenue via clickbait. In this paper, we aim to perform binary classification of various news articles available online with the help of concepts pertaining to Artificial Intelligence, Natural Language Processing and Machine Learning. We aim to provide the user with the ability to classify news as fake or real, and also to check the authenticity of the website publishing the news.
Key Words: Internet, Social Media, Fake News,
Classification, Artificial Intelligence, Machine
Learning, Websites, Authenticity.
1. INTRODUCTION

On social media, the standard of stories is lower than in traditional news organizations. However, because it is inexpensive to supply news online, and far faster and easier to disseminate, a great deal of fake news is produced, much of it concerning political matters. Examples of such websites can be found in Ukraine, the United States of America, Germany, China and many other countries [4]. Thus, fake news is a global issue as well as a worldwide challenge. Many scientists believe that the fake news issue may be addressed by means of machine learning and AI [5]. There is a reason for that: recently, AI algorithms have begun to work far better on many classification problems (image recognition, voice detection and so on) because hardware is cheaper and larger datasets are available.

There are several influential articles about automatic deception detection. In [6] the authors provide a general overview of the techniques available for the matter. In [7] the authors describe their method for fake news detection based on feedback for specific news items in microblogs. In [8] the authors develop two systems for deception detection, based on support vector machines and a Naive Bayes classifier (the latter is also employed in the system described in this paper), respectively. They collected their data by asking people to directly provide true or false information on several topics: abortion, execution and friendship. The detection accuracy achieved by their system is around 70%.

This article describes a simple fake news detection method based on artificial intelligence algorithms: the naïve Bayes classifier, Random Forest and Logistic Regression. The goal of the research is to examine how these particular methods work for this particular problem, given a manually labelled news dataset, and to support (or not) the idea of using AI for fake news detection. The difference between this article and articles on similar topics is that here Logistic Regression was specifically used for fake news detection; also, the developed system was tested on a comparatively new dataset, which gave a chance to evaluate its performance on recent data.

A. Characteristics of Fake News:

They often have grammatical mistakes. They are often emotionally coloured. They often try to affect readers' opinion on some topics. Their content is not always true. They often use attention-seeking words, news-like formats and clickbait. They are too good to be true. Their sources are not genuine most of the time [9].

2. Body of Paper

Mykhailo Granik et al. in their paper [3] show a simple approach for fake news detection using a naive Bayes classifier. This approach was implemented as a software system and tested against a dataset of Facebook news posts, collected from three large Facebook pages each from the right and from the left, as well as three large mainstream political news pages (Politico, CNN, ABC News). They achieved a classification accuracy of approximately 74%. Classification accuracy for fake news is slightly worse, which may be caused by the skewness of the dataset: only 4.9% of it is fake news.

Himank Gupta et al. [10] gave a framework based on several machine learning approaches that deals with various problems, including accuracy shortage, time lag (BotMaker) and the high processing time needed to handle thousands of tweets per second. First, they collected 400,000 tweets from the HSpam14 dataset and further characterized 150,000 spam tweets and 250,000 non-spam tweets. They also derived some lightweight features, along with the Top-30 words providing the highest information gain from a Bag-of-Words model. They were able to achieve an accuracy of 91.65%, surpassing the existing solution by approximately 18%.

Marco L. Della Vedova et al. [11] first proposed a novel ML fake news detection method which, by combining news content and social context features, outperforms existing methods in the literature, increasing accuracy up to 78.8%. Second, they implemented their method within a Facebook Messenger chatbot and validated it in a real-world application, obtaining a fake news detection accuracy of 81.7%. Their goal was to classify a news item as reliable or fake; they first described the datasets they used for their tests, then presented the content-based approach they implemented and the method they proposed to combine it with a social-based approach available in the literature. The resulting dataset is composed of 15,500 posts coming from 32 pages (14 conspiracy pages, 18 scientific pages), with more than 2,300,000 likes by 900,000+ users; 8,923 (57.6%) of the posts are hoaxes and 6,577 (42.4%) are non-hoaxes.

Cody Buntain et al. [12] developed a method for automating fake news detection on Twitter by learning to predict accuracy assessments in two credibility-focused Twitter datasets: CREDBANK, a crowdsourced dataset of accuracy assessments for events on Twitter, and PHEME, a dataset of potential rumours on Twitter together with journalistic assessments of their accuracy. They applied this method to Twitter content sourced from BuzzFeed's fake news dataset. A feature analysis identified the features most predictive of crowdsourced and journalistic accuracy assessments, with results consistent with prior work. They rely on identifying highly retweeted threads of conversation and use the features of these threads to classify stories, which limits this work's applicability to the set of popular tweets. Since the majority of tweets are rarely retweeted, the method is only usable on a minority of Twitter conversation threads.

In their paper, Shivam B. Parikh et al. [13] aim to present an insight into the characterization of news stories in the modern diaspora, combined with the differential content types of news stories and their impact on readers. Subsequently, they dive into existing fake news detection approaches, which are heavily based on text-based analysis, and also describe popular fake news datasets. They conclude the paper by identifying four key open research challenges that can guide future research. It is a theoretical approach.
B. System Architecture

i) Static Search: The architecture of the static part of the fake news detection system is quite simple and follows the basic machine learning process flow. The system design is self-explanatory; its main processes are data collection, pre-processing, feature extraction, classification and evaluation.

IV. IMPLEMENTATION

4.1 DATA COLLECTION AND ANALYSIS

We can get online news from different sources, such as social media websites, search engines, the homepages of news agency websites or fact-checking websites. On the Internet, there are a few publicly available datasets for fake news classification, such as BuzzFeed News, LIAR [15] and BS Detector. These datasets have been widely used in different research papers for determining the veracity of news. The sources of the datasets used in this work are discussed briefly below.

Online news can be collected from different sources, such as news agency homepages, search engines and social media websites. However, manually determining the veracity of news is a challenging task, usually requiring annotators with domain expertise who perform careful analysis of claims and of additional evidence, context, and reports from authoritative sources. Generally, news data with annotations can be gathered in the following ways: expert journalists, fact-checking websites, industry detectors and crowdsourced workers. However, there are no agreed-upon benchmark datasets for the fake news detection problem. Data gathered must be pre-processed, that is, cleaned, transformed and integrated, before it can undergo the training process [16].
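Once gathered, such annotated data is typically loaded into a tabular structure before cleaning. A minimal sketch, assuming the pandas library and a two-column (Statement, Label) CSV layout like the ones used in this work; the sample rows are invented purely for illustration:

```python
import io
import pandas as pd

# Invented stand-in for a training CSV, mirroring the two-column
# (Statement, Label) layout described in this section.
sample_csv = io.StringIO(
    "Statement,Label\n"
    "Economy added thousands of jobs last month,REAL\n"
    "Celebrity endorses miracle cure in secret video,FAKE\n"
)

df = pd.read_csv(sample_csv)
# Encode the class label as an integer target for the classifiers.
df["target"] = (df["Label"] == "REAL").astype(int)
print(df[["Statement", "target"]])
```

The real files are read the same way by passing a file path to `pd.read_csv` instead of the in-memory buffer.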
The datasets that we used are explained below:

1. LIAR: This dataset was collected from the fact-checking website PolitiFact through its API [15]. It includes 12,836 human-labelled short statements, which were sampled from various contexts, such as news releases, TV or radio interviews, campaign speeches, etc. The labels for news truthfulness are fine-grained multiple classes: pants-fire, false, barely-true, half-true, mostly-true, and true. (William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.) The dataset comes as three .csv files, named train.csv, test.csv and valid.csv, and the following columns were used in this project:
Column 1: Statement (news headline or text).
Column 2: Label (class label: True, False).
2. REAL_OR_FAKE.CSV: We used this dataset for the Passive Aggressive classifier. It contains three columns, viz. 1) Text/keyword, 2) Statement, 3) Label (Fake/True).

4.2 DEFINITIONS AND DETAILS

A. Pre-processing the data: Social media data is highly unstructured; the majority of it is informal communication with typos, slang, bad grammar, etc. [17]. The quest for increased performance and reliability has made it imperative to develop techniques for the utilization of resources to make informed decisions [18]. To achieve better insights, it is necessary to clean the data before it can be used for predictive modelling. For this purpose, basic pre-processing was done on the news training data. This step comprised:

Data Cleaning: While reading data, we get it in a structured or unstructured format. A structured format has a well-defined pattern, whereas unstructured data has no proper structure. In between the two, we have a semi-structured format, which is comparably better structured than the unstructured format. Cleaning up the text data is necessary to highlight the attributes we want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:

1. Remove punctuation: Punctuation can provide grammatical context to a sentence, which supports our understanding, but for our vectorizer, which counts words rather than context, it adds no value, so we remove all special characters. e.g.: How are you? -> How are you
2. Remove stopwords: Stopwords are common words that will likely appear in any text. They don't tell us much about our data, so we remove them. e.g.: silver or lead is fine for me -> silver, lead, fine
3. Stemming: Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. It removes suffixes such as "ing", "ly" and "s" by a simple rule-based approach. It reduces the corpus of words, but the actual words often get neglected. e.g.: Entitling, Entitled -> Entitle. Note: some search engines treat words with the same stem as synonyms.
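These cleaning steps can be sketched in plain Python; the stopword list and suffix rules below are toy stand-ins for illustration (a real system would use a fuller resource such as NLTK's stopword list and a proper stemmer):

```python
import re
import string

# Toy stopword list and suffix rules, for illustration only.
STOPWORDS = {"or", "is", "for", "me", "the", "a", "an"}
SUFFIXES = ("ing", "ly", "ed", "s")

def clean(text: str) -> list[str]:
    # 1. Remove punctuation / special characters.
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # 2. Lowercase, tokenize, and drop stopwords.
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    # 3. Naive rule-based stemming: strip one known suffix.
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(clean("silver or lead is fine for me"))  # ['silver', 'lead', 'fine']
```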
B. Feature Generation: We can use text data to generate a number of features, such as word count, frequency of long words, frequency of unique words, n-grams, etc. By creating a representation of words that captures their meanings, semantic relationships and the numerous types of context they are used in, we can enable a computer to understand text and perform clustering, classification, etc. [19].

C. Vectorizing Data: Vectorizing is the process of encoding text as integers, i.e. in numeric form, to create feature vectors so that machine learning algorithms can understand our data.

1. Vectorizing Data: Bag-of-Words: Bag of Words (BoW), or CountVectorizer, describes the presence of words within the text data. It gives a result of 1 if a word is present in the sentence and 0 if not. It therefore creates a bag of words with a document-matrix count for each text document.
2. Vectorizing Data: N-Grams: N-grams are simply all combinations of adjacent words or letters of length n that we can find in our source text. N-grams with n=1 are called unigrams; similarly, bigrams (n=2), trigrams (n=3) and so on can also be used. Unigrams usually don't contain much information compared to bigrams and trigrams. The basic principle behind n-grams is that they capture which letter or word is likely to follow a given one. The longer the n-gram (the higher n), the more context you have to work with [20].

3. Vectorizing Data: TF-IDF: It computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents. The TF-IDF weight represents the relative importance of a term in the document and the entire corpus [17].

TF stands for Term Frequency: it calculates how frequently a term appears in a document. Since document sizes vary, a term may appear more often in a long document than in a short one; thus, the term frequency is often divided by the document length. Note: TF-IDF is used for search-engine scoring, text summarization and document clustering.

IDF stands for Inverse Document Frequency: a word is not of much use if it is present in all the documents. Certain terms like "a", "an", "the", "on", "of", etc. appear many times in a document but are of little importance. IDF weighs down the importance of these terms and increases the importance of rare ones; the higher the IDF value, the more unique the word. TF-IDF is applied to the body text, so the relative count of each word in the sentences is stored in the document matrix.

Note: vectorizers output sparse matrices. A sparse matrix is a matrix in which most entries are 0 [21].

B. Algorithms Used for Classification: This section deals with training the classifier. Different classifiers were investigated to predict the class of the text. We explored specifically four different machine learning algorithms: Multinomial Naïve Bayes, the Passive Aggressive Classifier, Logistic Regression and Random Forest. The implementations of these classifiers were done using the Python library Scikit-Learn.

C. Brief Introduction to the Algorithms:

1. Naïve Bayes Classifier: This classification technique is based on Bayes' theorem, which assumes that the presence of a particular feature in a class is independent of the presence of any other feature. It provides a way of calculating the posterior probability, P(c|x) = P(x|c) P(c) / P(x), where P(c|x) is the posterior probability of the class given the predictor, P(c) is the prior probability of the class, P(x|c) is the likelihood (the probability of the predictor given the class), and P(x) is the prior probability of the predictor.

2. Random Forest: Random Forest is a trademark term for an ensemble of decision trees. In a Random Forest, we have a collection of decision trees (hence "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees. Random Forest uses bagging and feature randomness when building each individual tree, to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. The reason the random forest model works so well is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. So how does a random forest ensure that the behaviour of each individual tree is not too correlated with the behaviour of any other tree in the model? It uses the following two methods:
2.1 Bagging (Bootstrap Aggregation): Decision trees are very sensitive to the data they are trained on; small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. This process is known as bagging or bootstrapping.

2.2 Feature Randomness: In a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node and those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification [22].

3. Logistic Regression: It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function; hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). Mathematically, the log odds of the outcome are modelled as a linear combination of the predictor variables: ln(p / (1 - p)) = b0 + b1x1 + ... + bnxn.

4. Passive Aggressive Classifier: The Passive Aggressive algorithm is an online algorithm, ideal for classifying massive streams of data (e.g. Twitter). It is easy to implement and very fast. It works by taking an example, learning from it and then throwing it away [24]. Such an algorithm remains passive on a correct classification outcome and turns aggressive in the event of a miscalculation, updating and adjusting. Unlike most other algorithms, it does not converge; its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector [25].

4.3 IMPLEMENTATION STEPS

A. Static Search Implementation: In the static part, we trained and used three of the four algorithms for classification: Naïve Bayes, Random Forest and Logistic Regression.

Step 1: In the first step, we extracted features from the already pre-processed dataset. These features are Bag-of-Words, TF-IDF features and n-grams.
Step 2: Here, we built all the classifiers for fake news detection. The extracted features are fed into the different classifiers: we used the Naïve Bayes, Logistic Regression and Random Forest classifiers from sklearn, and each of the extracted features was used in all of the classifiers.
Step 3: Once the models were fitted, we compared their F1 scores and checked their confusion matrices.
Step 4: After fitting all the classifiers, the two best-performing models were selected as candidate models for fake news classification.
Step 5: We performed parameter tuning by applying GridSearchCV to these candidate models and chose the best-performing parameters for these classifiers.
Step 6: The finally selected model was used for fake news detection, along with the probability of truth.
Step 7: Our finally selected, best-performing model takes a news article as input from the user; it is then used to produce the final classification output, which is shown to the user along with the probability of truth.

B. The problem can be broken down into three statements: 1) use NLP to check the authenticity of a news article; 2) if the user has a query about the authenticity of a search query, he or she can search directly on our platform, and our custom algorithm outputs a confidence score; 3) check the authenticity of a news source. These have been produced as search fields that take inputs in three different forms in our implementation of the problem statement.

4.4 EVALUATION METRICS

To evaluate the performance of algorithms for the fake news detection problem, various evaluation metrics have been used. In this subsection, we review the metrics most widely used for fake news detection. Most existing approaches consider the fake news problem as a classification problem that predicts whether a news article is fake or not: True Positive (TP): fake news pieces correctly predicted as fake news; True Negative (TN): true news pieces correctly predicted as true news; False Negative (FN): fake news pieces wrongly predicted as true news; False Positive (FP): true news pieces wrongly predicted as fake news.

Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
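The fit-and-compare loop described in the steps above can be sketched with scikit-learn; the tiny labelled corpus is invented for illustration, so the resulting scores mean nothing beyond showing the API flow:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.naive_bayes import MultinomialNB

# Invented mini-corpus: 1 = fake, 0 = real.
texts = ["shocking miracle cure found", "stocks closed higher today",
         "aliens secretly rule the senate", "rain expected this weekend",
         "click here to win millions", "council approves new budget"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test = texts[:4], texts[4:]
y_train, y_test = labels[:4], labels[4:]

# Step 1: extract TF-IDF features from the (pre-processed) text.
vectorizer = TfidfVectorizer()
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

# Steps 2-3: fit each classifier, then compare F1 and confusion matrix.
for clf in (MultinomialNB(), LogisticRegression()):
    clf.fit(Xtr, y_train)
    pred = clf.predict(Xte)
    print(type(clf).__name__, f1_score(y_test, pred, zero_division=0))
    print(confusion_matrix(y_test, pred, labels=[0, 1]))
```

The same fitted models can then be passed to `GridSearchCV` for the parameter-tuning step.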
A. The implementation was done using the above algorithms with vector features: count vectors and TF-IDF vectors at word level and n-gram level. Accuracy was noted for all models. We used the k-fold cross-validation technique to improve the effectiveness of the models.

This cross-validation technique was used to split the dataset randomly into k folds; (k-1) folds were used for building the model, while the k-th fold was used to check its effectiveness. This was repeated until each of the k folds had served as the test set. We used 3-fold cross-validation for this experiment, where 67% of the data is used for training the model and the remaining 33% for testing.
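This 3-fold procedure corresponds to scikit-learn's `cross_val_score`; the synthetic dataset below is invented purely to show the flow:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the vectorized news features.
X, y = make_classification(n_samples=90, n_features=20, random_state=0)

# 3-fold CV: each fold serves once as the held-out test set, so every
# split trains on roughly 67% of the data and tests on the other 33%.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
print(scores)        # one accuracy value per fold
print(scores.mean())
```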
B. Confusion Matrices for the Static System: After applying the various extracted features (Bag-of-Words, TF-IDF, n-grams) to the three different classifiers (Naïve Bayes, Logistic Regression and Random Forest), their confusion matrices, showing the actual and predicted sets, are given below:
V. RESULTS

As evident above, our best model came out to be Logistic Regression, with an accuracy of 65%. We then used grid-search parameter optimization to increase the performance of logistic regression, which gave us an accuracy of 80%. Hence, we can say that if a user feeds a particular news article or its headline into our model, there is an 80% chance that it will be classified according to its true nature.

3. CONCLUSIONS

In the 21st century, the majority of tasks are done online. Newspapers that were earlier preferred as hard copies are now being substituted by applications like Facebook and Twitter and by news articles read online; WhatsApp forwards are also a major source. The growing problem of fake news only makes things more complicated, and it tries to change or hamper people's opinion of and attitude towards the use of digital technology. When a person is deceived by fake news, people start believing that their perceptions about a particular topic are true as assumed. Thus, in order to curb this phenomenon, we have developed our fake news detection system, which takes input from the user and classifies it as true or fake. To implement this, various NLP and machine learning techniques are used. The model is trained using an appropriate dataset, and performance evaluation is done using various performance measures. The best model, i.e. the model with the highest accuracy, is used to classify the news headlines or articles. As evident above, for static search our best model came out to be Logistic Regression, with an accuracy of 65%. We then used grid-search parameter optimization to increase the performance of logistic regression, which gave us an accuracy of 75%. Hence, we can say that if a user feeds a particular news article or its headline into our model, there is a 75% chance that it will be classified according to its true nature. The user can check the news article or keywords online; he can also check the authenticity of the website. The accuracy for the dynamic system is 93%, and it increases with every iteration. We intend to build our own dataset, which will be kept up to date according to the latest news; all the live news and latest data will be kept in a database using a web crawler and an online database.

REFERENCES

[1] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu, "Fake News Detection on Social Media: A Data Mining Perspective", arXiv:1708.01967v3 [cs.SI], 3 Sep 2017.
[2] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu, "Fake News Detection on Social Media: A Data Mining Perspective", arXiv:1708.01967v3 [cs.SI], 3 Sep 2017.
[3] M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier", 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kiev, 2017, pp. 900-903.
[4] Fake news websites. (n.d.) Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Fake_news_website. Accessed Feb. 6, 2017.
[5] Cade Metz. (2016, Dec. 16). The bittersweet sweepstakes to build an AI that destroys fake news.
[6] Conroy, N., Rubin, V. and Chen, Y. (2015). "Automatic deception detection: Methods for finding fake news", Proceedings of the Association for Information Science and Technology, 52(1), pp. 1-4.
[7] Markines, B., Cattuto, C., & Menczer, F. (2009, April). "Social spam detection", Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, pp. 41-48.
[8] Rada Mihalcea and Carlo Strapparava, "The lie detector: explorations in the automatic recognition of deceptive language", Proceedings of the ACL-IJCNLP.
[9] Kushal Agarwalla, Shubham Nandan, Varun Anil Nair, D. Deva Hema, "Fake News Detection using Machine Learning and Natural Language Processing", International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-7, Issue-6, March 2019.
[10] H. Gupta, M. S. Jamal, S. Madisetty and M. S. Desarkar, "A framework for real-time spam detection in Twitter", 2018 10th International Conference on Communication Systems & Networks (COMSNETS), Bengaluru, 2018, pp. 380-383.
[11] M. L. Della Vedova, E. Tacchini, S. Moret, G. Ballarin, M. DiPierro and L. de Alfaro, "Automatic Online Fake News Detection Combining Content and Social Signals", 2018 22nd Conference of Open Innovations Association (FRUCT), Jyvaskyla, 2018, pp. 272-279.
[12] C. Buntain and J. Golbeck, "Automatically Identifying Fake News in Popular Twitter Threads", 2017 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, 2017, pp. 208-215.
[13] S. B. Parikh and P. K. Atrey, "Media-Rich Fake News Detection: A Survey", 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, 2018, pp. 436-441.
[14] Scikit-Learn: Machine Learning in Python.
[15] William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, arXiv preprint arXiv:1705.00648, 2017.
[16] Shankar M. Patil, Dr. Praveen Kumar, "Data mining model for effective data analysis of higher education students using MapReduce", IJERMT, April 2017 (Volume-6, Issue-4).
[17] Aayush Ranjan, "Fake News Detection Using Machine Learning", Department of Computer Science & Engineering, Delhi Technological University, July 2018.