Machine Learning-Based Approach For Fake News Detection
1 Department of Information Technology, Manipal Institute of Technology Bengaluru, Manipal Academy of Higher Education, Manipal, India
2 Department of Information Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, India
3 Department of Artificial Intelligence and Machine Learning, Alva's Institute of Engineering and Technology, Mangalore, India
4 IDSIA USI-SUPSI, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, CH
5 Department of Computer Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, India
E-mail: [email protected], [email protected]
∗ Corresponding Author
Abstract
In the modern era, the ubiquity of the internet and the rapid adoption of social media have led to a spread of information never before seen in human history. On social media platforms, consumers create and share ever more information, much of it misleading and with no relevance to reality. Automatically classifying a text article as misinformation is a challenging task. This work addresses how automated classification of text articles can be done. We use a machine learning approach to classify news articles. Our study explores different textual properties that can be used to distinguish fake content from real content. Using those properties, we train models with different machine learning algorithms and evaluate their performance. The classifier with the best performance is then used to build the classification model that predicts the reliability of the news articles in the dataset.
1 Introduction
Social media and the internet have made access to information far simpler and more convenient. As more of our lives are spent interacting online through social media platforms, we increasingly seek out and consume news from web-based platforms rather than traditional news organizations, because it is easy to share and discuss the information with friends and other readers. Despite the advantages offered by social media, the standard and quality of its stories are lower than those of traditional news organizations. News outlets have benefited from the enormous use of social media platforms by delivering up-to-date information to their subscribers in near real time. News articles with deliberately false facts are produced online to spread news for financial or political gain. The wider spread of fake news can negatively impact society and individuals. False information can break the authenticity equilibrium of the news ecosystem. Fake news is typically manipulative, and it changes the way people interpret and react to real news. Fake news is also used by spammers to generate advertising revenue through click-baits.
Fake news is one of the greatest threats to commerce, journalism, and democracy worldwide, with huge collateral damage. A US $130 billion loss in the stock market was the direct result of a fake news report that US President Barack Obama had been injured in an explosion [1]. Other fake news campaigns that demonstrate the enormous impact fake news can have include the sudden shortage of salt in Chinese supermarkets after a false report that iodized salt would help counteract the effects of radiation from the Fukushima nuclear leak in Japan [2], and an escalation of tensions between India and Pakistan that began with false reporting around the Balakot strike and resulted in the deaths of military personnel and the loss of expensive military equipment.
Social media has a big impact on society, and some people take advantage of this fact. This results in articles that are not real or are possibly fake. Some websites deliberately produce fake news, posting half-truths, hoaxes, and disinformation that claim to be real information. They use social networks to drive traffic to their websites. The essential intention of fake news websites is to influence public opinion on certain topics.
2 Literature Survey
Mykhailo Granik et al., in their paper, proposed a very simple technique for fake news detection using a naive Bayes classifier. It was implemented as a software system and tested against a set of Facebook news posts. These were collected from three large Facebook pages, each carrying news from the right and the left, along with three large mainstream political news pages. They achieved a classification accuracy of approximately 80%; classification precision for fake news was somewhat worse. This may be caused by the skewness of the dataset, in which only 4.7% of the items are fake news.
Himank Gupta et al. proposed a framework based on several machine learning approaches that addresses various issues, including low accuracy, interference, and the time needed to handle many tweets within a few seconds. First, they collected 40,000 tweets from the HSpam14 dataset. They then characterized 150,000 tweets as spam and 250,000 tweets as non-spam. They also derived a collection of lightweight features, along with the Top-30 words providing the highest information gain from a Bag-of-Words model. They were able to achieve an accuracy of around 92%, surpassing the previous solution by approximately 18%.
Marco L. Della Vedova et al. first proposed a machine learning fake news detection method which, by combining news content and social context features, outperforms existing methods in the literature, increasing accuracy up to 79.8%. Furthermore, they implemented their method within a Facebook Messenger chatbot and validated it in a real-world application, obtaining a fake news detection accuracy of around 82%. Their goal was to classify a post as reliable or fake: they first described the dataset they used for their test, then presented the content-based approach they implemented, together with the method they proposed for combining it with a social-based approach available in the literature. The resulting dataset comprises 15,600 posts, coming from 33 pages with more than 220,000 likes by over 800,000 users: 8,922 hoax posts and 6,578 non-hoax posts.
Cody Buntain et al. developed a method for automating fake news detection on platforms such as Twitter by learning to predict accuracy assessments from two credibility-focused sets of journalistic accuracy ratings. The method was applied to Twitter content sourced from BuzzFeed's fake news dataset. Their feature analysis identifies the features that are most predictive for crowdsourced and journalistic accuracy assessments, with results consistent with prior work. The approach relies on identifying highly retweeted conversation threads and uses the features of these threads to classify stories, which limits the applicability of the work to popular sets of tweets. Since the majority of tweets are rarely retweeted, this method can consequently be used only on a minor portion of Twitter conversation threads.
Shivam B. Parikh et al. present an overview of how an article is characterized in the context of recent events, along with the differentiable content styles of articles and their impact on readers. They then review existing fake news detection approaches, which are heavily based on text analysis, and describe popular fake news datasets. They conclude the paper by identifying five key open research challenges that can guide future research. Theirs is a theoretical approach that provides an outline of fake news detection and analyzes contributing factors such as psychological ones.
One of the earliest works on the automated detection of fake news was by Vlachos and Riedel. The authors defined the task of fact-checking, gathered a dataset from two popular fact-checking websites, and considered k-Nearest Neighbors classifiers for handling fact-checking as a classification task. Wang (2017) released the LIAR dataset, which contains 12.8K manually labeled brief statements from PolitiFact.
Table 1 shows the techniques and datasets that have been used by various researchers for fake news detection.
3 Methodology
This section provides fundamental theoretical information about every component and aspect of the project.
Table 1 Techniques and datasets that have been used for fake news detection

Work                  Year  Detection                          Model
Wang                  2016  Fake news, 6 levels                Majority, LR, SVM, bi-LSTM, CNN
Ma et al.             2016  Rumors, 2 levels                   RF, DT, SVM, RNN
Ruchansky             2017                                     LSTM
Popat                 2018  Credibility, 2 or 5 levels         bi-LSTM, LSTM, CNN
Buntain and Golbeck   2017  Credibility                        RF
Yang et al.           2018  Fake news, 2 classes               TI-CNN, LSTM, RNN
Karimi et al.         2019  Fake news, 2 classes               N-grams, LIWC, RST, BiGRNN-CNN, LSTM, HDSF
Ahmed et al.          2018  Fake news & reviews                SVM, SGD, LR, KNN, DT
Zhou et al.           2020  Fake news, clickbaits, disinform.  SVM, RF, XGB, LR, NB
Pamungkas et al.      2016  Stance                             LR
3.2 Random Forest
Random forest is a classification algorithm that consists of many decision trees. Each tree produces a classification, and we say the tree "votes" for that specific class; the random forest then picks the classification with the most votes. Random forest uses feature randomness and bagging when building every single tree, so that an uncorrelated forest of trees is created whose committee prediction is more accurate than that of any single tree. As the name suggests, a random forest comprises a large number of individual decision trees that operate as an ensemble. Every individual tree in the forest spits out a class prediction, and the class with the highest number of votes becomes the model's prediction. The reason the model works so well is that a large number of relatively uncorrelated models operating as a committee can outperform any of the individual constituent models. To ensure that the behavior of each individual tree is not too strongly correlated with the behavior of any other tree in the model, random forest uses the two following methods:
3.2.1.1 Bagging
Decision trees are very sensitive to the data they are trained on, and small modifications to the training set can result in substantially different tree structures. Random forest exploits this by letting every individual tree sample from the dataset randomly with replacement, which leads to different trees. This process is called bootstrapping, or bagging.
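A minimal sketch of bagging inside a random forest, assuming scikit-learn (the paper does not name a library) and synthetic data standing in for the news features:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data in place of TF-IDF features of news articles (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True resamples the training set with replacement for each tree
# (bagging); max_features limits the candidate features per split, which
# decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)
print(score)
```

Each of the 100 trees sees a different bootstrap sample, so the ensemble vote is more stable than any single tree.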
As the aim of our model is simply to classify a news article as true/false, logistic regression is a good choice.
1. We first create a Logistic Regression object.
2. The logistic regression classifier is trained by passing the news article training set into the fit function. After it is trained, predictions are made on the test set using the predict function.
3. Accuracy is calculated to understand the performance of the classifier.
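The three steps above can be sketched with scikit-learn (an assumption; the paper's actual feature matrix is replaced here by synthetic data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized news-article training set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)  # step 1: create the classifier object
clf.fit(X_train, y_train)                # step 2: train via the fit function
pred = clf.predict(X_test)               # predictions via the predict function
acc = accuracy_score(y_test, pred)       # step 3: evaluate the classifier
print(acc)
```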
1: Procedure STOCHASTICGRADIENTDESCENT(D, Labels, Iter)
   Input: dataset D, labels of the dataset, iteration count Iter
   Output: optimal weights of the logistic regression
2:   w ← [1, 1, . . . , 1]
3:   initialize qi with zero for all i
4:   for k = 1 → Iter do
5:     chooseData ← D
6:     for i = 1 → m do
7:       γ ← learning rate
8:       λ ← regularization lambda
9:       u ← u + γλ
10:      select an index idx of chooseData randomly
11:      x ← chooseData[idx]
12:      delete chooseData[idx]
13:      for each feature i in sample x do
14:        wi ← wi − γ ∂loss(w, xi)/∂wi
15:        wh ← wi
16:        if wi > 0 then
17:          wi ← max(0, wi − (u + qi))
18:        else if wi < 0 then
19:          wi ← min(0, wi + (u − qi))
20:        end if
21:        qi ← qi + (wi − wh)
22:      end for
23:    end for
24:  end for
Passive-aggressive algorithms are online learning algorithms: they take an example, learn from it, and afterward put it aside. An algorithm of this kind remains passive for correctly predicted outcomes and turns aggressive for incorrectly predicted outcomes, updating and adjusting the model as required. Unlike other algorithms, it does not converge. These are called passive-aggressive algorithms for the following reasons:
• PASSIVE: if the prediction is correct, keep the model the same and do not make any changes.
• AGGRESSIVE: if the prediction is incorrect, make changes to the model.
INPUT: aggressiveness parameter C > 0
INITIALIZE: w1 = (0, . . . , 0)
For t = 1, 2, . . .
• Receive instance: xt ∈ Rn
• Predict: ŷt = sign(⟨wt, xt⟩)
• Receive correct label: yt ∈ {−1, +1}
• Suffer loss: lt = max{0, 1 − yt⟨wt, xt⟩}
• Update
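The update step is elided above; in the standard PA-I formulation (an assumption here) it is wt+1 = wt + τt·yt·xt with τt = min{C, lt/∥xt∥²}. A self-contained sketch of the loop under that assumption:

```python
import numpy as np

def passive_aggressive_fit(X, y, C=1.0, epochs=5):
    """PA-I online updates: passive when the hinge loss is zero,
    aggressive (w += tau * y_t * x_t) on a margin violation."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            loss = max(0.0, 1.0 - y_t * np.dot(w, x_t))  # suffer hinge loss
            if loss > 0:  # aggressive step only when the prediction is off
                tau = min(C, loss / np.dot(x_t, x_t))
                w += tau * y_t * x_t
    return w

# Tiny linearly separable toy set with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = passive_aggressive_fit(X, y)
print(np.sign(X @ w))  # → [ 1.  1. -1. -1.]
```

In practice a library implementation (e.g. scikit-learn's PassiveAggressiveClassifier) would be used instead of this hand-rolled loop.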
SVM Pseudocode
F[0 . . . N−1]: a feature set of N features, sorted by information gain in decreasing order
accuracy(i): accuracy of the prediction model based on an SVM trained with the feature set F[0 . . . i]
Figure 2 Flowchart.
low = 0
high = N − 1
value = accuracy(N − 1)
IG_RFE_SVM(F[0 . . . N−1], value, low, high) {
  if (high <= low) then
    return F[0 . . . N−1] and value
  end if
  mid = (low + high) / 2
  value_2 = accuracy(mid)
  if (value_2 >= value) then
    return IG_RFE_SVM(F[0 . . . mid], value_2, low, mid)
  else  // value_2 < value
    return IG_RFE_SVM(F[0 . . . high], value, mid, high)
  end if
}
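The recursion above is a binary search over the number of top-ranked features. A sketch of the control flow, with a simulated accuracy table standing in for actually training an SVM at each step; note the lower bound is bumped to mid + 1 in the else branch (a small deviation from the pseudocode) so the recursion is guaranteed to terminate:

```python
def ig_rfe_svm(features, accuracy, low, high, value):
    """Binary search for the smallest top-ranked feature prefix whose
    (simulated) SVM accuracy is at least as good as the current best."""
    if high <= low:
        return features[:high + 1], value
    mid = (low + high) // 2
    value_mid = accuracy(mid)
    if value_mid >= value:
        # the smaller feature set is at least as good: search the lower half
        return ig_rfe_svm(features[:mid + 1], accuracy, low, mid, value_mid)
    # otherwise keep the larger set and search the upper half
    return ig_rfe_svm(features, accuracy, mid + 1, high, value)

# Hypothetical accuracies of an SVM trained on the top-(i+1) features.
acc_table = [0.60, 0.70, 0.80, 0.90, 0.90, 0.90, 0.90, 0.88, 0.87, 0.86]
accuracy = lambda i: acc_table[i]
features = list(range(10))  # feature indices, pre-sorted by information gain
best_features, best_acc = ig_rfe_svm(features, accuracy, 0, 9, accuracy(9))
print(best_features, best_acc)  # → [0, 1, 2, 3] 0.9
```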
3.6 Flowchart
The flowchart given in Figure 2 starts with the collection of the dataset. The dataset is preprocessed and then subjected to feature selection. Four different machine learning algorithms are used to train the model. A confusion matrix is used to calculate performance and accuracy. The results of all classifiers are compared, and the classifier that gives the best accuracy on the given dataset is used to build the classification model that predicts the reliability of the news.
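The pipeline described by the flowchart (vectorize, split, train several classifiers, compare accuracies, keep the best) can be sketched as follows, assuming scikit-learn and a small placeholder corpus in place of the real news dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder corpus and labels (0 = fake, 1 = true); a real run would
# load and preprocess the actual news dataset here.
texts = ["shocking miracle cure found", "senate passes budget bill",
         "aliens endorse candidate", "court upholds ruling today"] * 25
labels = [0, 1, 0, 1] * 25

X = TfidfVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

models = {"logreg": LogisticRegression(max_iter=1000),
          "pa": PassiveAggressiveClassifier(random_state=0),
          "rf": RandomForestClassifier(random_state=0)}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
best = max(scores, key=scores.get)  # the best classifier becomes the final model
print(best, scores[best])
```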
4 Implementation
The implementation part follows the framework and system design part, describing the system at the finest level of detail, down to the code level. This section concerns the realization of the previously developed ideas.
TF addresses the Term Frequency and computes how often a term appears within each document. IDF stands for Inverse Document Frequency and weighs down terms that appear in many documents. TF-IDF is applied to the article body to store the relative count of each word in the document matrix.

TF(t, d) = (number of times t occurs in document d) / (total word count of document d)

IDF(t) = (total number of documents) / (number of documents with term t in it)

TF-IDF(t, d) = TF(t, d) × IDF(t)
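A sketch of these formulas using scikit-learn's TfidfVectorizer (an assumption; note that scikit-learn additionally applies a logarithm and smoothing to the IDF term, so its values differ from the plain ratio above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus standing in for the news-article bodies.
docs = ["the election was rigged claims post",
        "official results confirm the election outcome",
        "celebrity cures disease with miracle fruit"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse document-term matrix of TF-IDF weights
print(X.shape)               # (3 documents, vocabulary size)
```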
Figure 8 Graph of accuracy versus models, obtained for all the algorithms.
The number of predictions is computed with a value count and further broken down by class. A confusion matrix is a summary of the prediction results of a classification model. It describes where the classification model gets confused when it makes predictions. It gives insight not only into the errors made by the classifier but, more importantly, into the types of errors that are being made.
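A minimal illustration with hypothetical labels, assuming scikit-learn's confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted labels (0 = fake news, 1 = true news).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns predicted: cm[0][0] counts fake articles
# correctly flagged, cm[0][1] fake articles passed off as true, and so on.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # → [[3 1]
           #    [1 3]]
```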
Figure 5 shows the confusion matrix of predicted versus true labels obtained with the Random Forest algorithm. The true label takes two values: True and Fake.
Figure 6 shows the confusion matrix of predicted versus true labels obtained with the Logistic Regression algorithm.
6 Conclusion
Fake news detection is a research area with a lot of scope and large available datasets. Our model was run against an existing dataset. From Table 2, we conclude that the passive-aggressive algorithm shows the maximum accuracy, up to 94%. Using this classifier, we therefore built our classification model for the fake news detector. The user can enter text or a keyword on the web page and check the reliability of the news.

In future work, we look forward to building our own dataset, which will be kept up to date with all relevant news. The latest data and live news will be updated in the database, and the subsequent stage is to train the model and analyze how the accuracy changes with the new data in order to improve it further.
References
[1] Iftikhar Ahmad et al. “Fake news detection using machine learning
ensemble methods”. In: Complexity 2020 (2020).
[2] Monther Aldwairi and Ali Alwahedi. “Detecting fake news in
social media networks”. In: Procedia Computer Science 141 (2018),
pp. 215–222.
[3] Cody Buntain and Jennifer Golbeck. “Automatically identifying fake
news in popular twitter threads”. In: 2017 IEEE International Confer-
ence on Smart Cloud (SmartCloud). IEEE. 2017, pp. 208–215.
[4] Nadia K Conroy, Victoria L Rubin, and Yimin Chen. “Automatic decep-
tion detection: Methods for finding fake news”. In: Proceedings of the
association for information science and technology 52.1 (2015), pp. 1–4.
[5] Marco L Della Vedova et al. “Automatic online fake news detection
combining content and social signals”. In: 2018 22nd Conference of
Open Innovations Association (FRUCT). IEEE. 2018, pp. 272–279.
[6] Mykhailo Granik and Volodymyr Mesyura. “Fake news detection using
naive Bayes classifier”. In: 2017 IEEE first Ukraine conference on elec-
trical and computer engineering (UKRCON). IEEE. 2017, pp. 900–903.
[7] A Santhosh Kumar et al. “Fake News Detection on Social Media
Using Machine Learning”. In: Journal of Physics: Conference Series.
Vol. 1916. 1. IOP Publishing. 2021, p. 012235.
[8] Benjamin Markines, Ciro Cattuto, and Filippo Menczer. “Social spam
detection”. In: Proceedings of the 5th international workshop on adver-
sarial information retrieval on the web. 2009, pp. 41–48.
[9] Cade Metz. “The bittersweet sweepstakes to build an AI that destroys
fake news”. In: Wired.com (2016).
[10] Rada Mihalcea and Carlo Strapparava. “The lie detector: Explorations
in the automatic recognition of deceptive language”. In: Proceedings of
the ACL-IJCNLP 2009 conference short papers. 2009, pp. 309–312.
[11] Shivam B Parikh and Pradeep K Atrey. “Media-rich fake news detec-
tion: A survey”. In: 2018 IEEE conference on multimedia information
processing and retrieval (MIPR). IEEE. 2018, pp. 436–441.
[12] Kai Shu et al. “Fake news detection on social media: A data mining
perspective”. In: ACM SIGKDD explorations newsletter 19.1 (2017),
pp. 22–36.
[13] Kelly Stahl. “Fake news detection in social media”. In: California State
University Stanislaus 6 (2018), pp. 4–15.
[14] William Yang Wang. "Liar, liar pants on fire: A new benchmark dataset
for fake news detection". In: arXiv preprint arXiv:1705.00648 (2017).
[15] Jiawei Zhang, Bowen Dong, and S Yu Philip. “Fake detector: Effec-
tive fake news detection with deep diffusive neural network”. In: 2020
IEEE 36th International Conference on Data Engineering (ICDE). IEEE.
2020, pp. 1826–1829.
Biographies