Machine Learning Techniques for the Classification of Fake News
2021 International Conference on Computational Intelligence and Computing Applications (ICCICA) | 978-1-6654-2040-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICCICA52458.2021.9697267

Swatej Patil, Suyog Vairagade, Dipti Theng

Department of Computer Science and Engineering, G H Raisoni College of Engineering, Nagpur, India.
[email protected] [email protected] [email protected]

Abstract—Social networking sites like Twitter, Instagram, and Facebook have become an essential part of our daily lives, but social media comes with its own advantages and disadvantages. These platforms are often used to distribute fake news or incorrect information, and there is a growing demand for classification and categorization of this type of content. As a result, we have explored a novel technique for classifying fake news that incorporates machine learning methods. This paper describes the development of a method that uses the TF-IDF vectorizer to classify which news is legitimate and which is fraudulent. Implementation is performed using datasets from Kaggle. The results indicate that this method performs effectively.

Keywords—Fake News, Machine Learning, Term Frequency, Inverse Document Frequency, Vectorizer.

I. INTRODUCTION

The increasing popularity and use of social media networks have tremendously impacted the way news is created, transmitted, and perceived in our society. The growing usage of online information has created a breeding ground for false news intended to deceive. Its spread is becoming increasingly serious, drawing the attention of the scientific community. Social networking sites like Facebook, Instagram, and Twitter have become an essential part of our daily life, but social media comes with its own advantages and disadvantages: it is a platform where anyone can share their opinions and thoughts, and it has also become a tool for manipulating and misleading people. These sites are often used to spread fake news or false information, and there is an increasing demand for classifying and categorizing this kind of news. The goal of fake news identification is to prevent rumors from spreading across numerous channels.

Fraudulent information, often known as fake news, is misleading information published by a business seeking attention and promotional revenue, or spread as slander in order to have a political influence on society [1]. Fake news has also been linked to misleading people in the recent U.S. presidential elections. The number of people using social media is growing rapidly, so we need a system that can accurately distinguish false information from real information. The goal of this article is to present a classification of false information (text) using machine learning techniques that differentiate between “fake” and “real” news.

The manuscript is organized as follows. The first section discusses the background of fake news detection and the need for it. The second section briefly reviews popular research on the fake news detection problem, its findings, and opportunities for further research. The third section describes the dataset source and its properties. Implementations of popular machine learning techniques are presented in the fourth section. Finally, the conclusion summarizes the work and outlines further research opportunities in fake news detection.

II. LITERATURE SURVEY

This section provides an overview of existing research on false news detection systems and discusses the findings. The authors in [2] developed a model called Article Abstraction using the BiMPM method. They created their own dataset of Korean news. The model produces its output by checking and comparing news articles against a fact database. However, the model has a drawback: as the length of the input sentences increases, performance decreases, and the model finds it difficult to make a good judgment when it encounters an unlearned word.

The authors in [3] compared the performance of different classifiers: Naive Bayes, Logistic Regression, Artificial Neural Network, Decision Tree, and Support Vector Machine. They used a dataset from Kaggle and compared the output of these classifiers. According to the authors, the performance of the model depends greatly on the size of the dataset. SVM with the TF-IDF vectorizer had the highest accuracy of all models used, while the neural network (NN) had the lowest accuracy with any of the vectorizers. Furthermore, according to the authors, Logistic Regression performs well with both the TF-IDF and Count vectorizers and is particularly useful when the dataset is large. The authors of [4] use sentiment analysis to create a fake news detection system. Their technique uses sentiment as a crucial component for improving the model's accuracy. They tested model performance on four different datasets. The authors examine different methods of detecting fake news and conclude that TF-IDF with cosine similarity performs better than TF-IDF without cosine similarity, although the performance increase was not significant.


Authorized licensed use limited to: Khon Kaen University provided by UniNet. Downloaded on September 29,2022 at 04:52:29 UTC from IEEE Xplore. Restrictions apply.
The authors in [5] took a simple approach to this problem: they implemented a Naive Bayes classifier in their fake news detection model. They used a dataset from BuzzFeed News and achieved 74% accuracy. The authors found a resemblance between spam messages and fake news articles, and because the Naive Bayes classifier works well on spam messages, they decided to use it to classify fake news articles. In [6], the authors show the performance of twelve classifiers, considering their false prediction ratio on three different datasets. Based on the performance metrics, the Linear Support Vector (Linear SVC), Logistic Regression (LR), and Passive Aggressive (PA) algorithms perform best when using the TF-IDF, Count Vectorizer (CV), and Hashing Vectorizer (HV) feature extraction approaches, respectively. To select the best model, the authors of [19] investigated conventional machine learning approaches to develop a supervised machine learning model that can categorize fake news as positive or negative, using tools such as Natural Language Processing (NLP) and Python scikit-learn for text analysis.

TABLE I. SUMMARY OF THE REVIEW WORK

| Ref. No | Algorithms Implemented | Dataset Used | Results and Findings | Research gap and future scope |
|---|---|---|---|---|
| [7] | N-gram, TF-IDF with LSVM | Created their dataset by collecting news articles from varying sources such as Reuters, Kaggle | The proposed model has shown an accuracy of 92% on the dataset created. | Authors have only targeted political news. Therefore, there is scope to check the model over other areas where fake news detection is a major challenge. |
| [9] | CONV-HAN, BI-LSTM, CNN, C-LSTM, Naive Bayes | George McIntire created the LIAR Dataset by combining articles from The Washington Post, Reuters, Vox, Gigaword News, NPR, The Guardian, New York Post, National Review, Buzzfeed News, Talking Points Memo, Fox News, The Atlantic, Business Insider, CNN, Breitbart, Activist Report and Natural News from genuine New York Times, DC Gazette, American News, Clickhole, Borowitz Report, and The Onion | 1) The authors used three different datasets. 2) The authors have implemented advanced deep learning models. Achieved maximum accuracy of 95% for the BI-LSTM and C-LSTM model. | The LSTM model's accuracy is related to the length and information provided in the text. |
| [11] | TCNN-URG early fake news detection model | Public Weibo (a Chinese social network) dataset, but Weibo only has short articles, therefore the authors collected news articles from genuine sources such as The Guardian, New York Times, and some fake news sources such as the NaturalNews. | Proposed TCNN-URG incorporates discriminative and generative modeling; while the other model ignores the user responses, the proposed model uses them to accurately predict fake news. 79% accuracy on Weibo Dataset and 77.47% accuracy on the self-collected dataset. | The proposed model is not applicable in real-time because it is completely dependent on the user responses, which are unavailable for the prior detection of fake news. |
| [12] | Gibbs sampling algorithm | LIAR Dataset (Wang 2017) and dataset from BuzzFeed News | 1) The Gibbs sampling approach is capable of assessing news authenticity and the users' credibility simultaneously. 2) The method presented is unsupervised, which does not require much time and labor. Accuracy was 75.9% on LIAR Dataset and 67.9% on the BuzzFeed News dataset. | Need to explore more datasets and compare the performance on them. |
| [13] | adapted TFF | Constructed two datasets which include ground truth labels from trustworthy resources, Buzzfeed and PolitiFact. | Authors used user engagement as a tool to find whether given news is true or fake, and find a correlation between fake news and user profiles. | 1) The dataset contains very low ground truth information. 2) Whether a user believes in fake news or not cannot be accurately judged. |
| [14] | Passive aggressive classifier, Logistic regression, Random forest, Naive Bayes classifier | BuzzFeed News, BS detector, LIAR Dataset. | The author used various algorithms to accurately assess whether the news is true or fake. Implementation results showed an accuracy of 92.73%. | Using a Web Crawler and an online database, the authors hope to create their own dataset that will be kept up to date with all live and important news. |
| [15] | Random Forest, Naïve Bayes, Multinomial, Support Vector Machine | Indonesian news dataset | 0.98 F1-score on random forest | Need to explore more datasets and compare the performance on them. |
| [16] | TF-IDF bi-gram, PCFG | 1) OpenSources.co, 2) Gather the data from Before It's News, Zero Hedge, Raw Story, etc. for fake news articles and BBC, USA Today, Washington Post, etc. for reliable news articles. It contains a total of 11051 news articles. | The authors compare their model's performance using three different features, e.g., TF-IDF using syntactical structure frequency, bigram frequency, and union of features to find which factors are most predictive. Implementation results showed an accuracy of 77.2%. | 1) The authors used an absolute probability threshold while assessing their model. For models with poorly calibrated probability scoring, this method is unreliable. 2) The authors used vectorized approaches that make it difficult to predict which features are more important. |
| [17] | Naive Bayes, SVM, NLP | Articles from Google News, Feedly, and News360 to compare them with the given text. | 1) The proposed model not only determines whether the information is true or not but also suggests relevant and genuine news articles. The proposed model works with 93.6% accuracy. | The proposed model will not be able to predict the news article as fake or real if it is too recent and not available in the database. |
| [18] | Random Forest (RF), Support Vector Classifier, Naïve Bayes, AdaBoost, KNN, Multi-Layer Perceptron & Gradient Boosting | OpenSources dataset, Kaggle dataset, dataset by George McIntire | Experiment results showed that Gradient Boosting has the maximum of 88% mean accuracy and 0.91 F1-Score. | The views of Machine Learning and Natural Language Processing can be compared to a deep learning strategy for detecting false news. |

In [20], the authors proposed a method for predicting fake news that combines the headline and the content of the article. They observe from their experimental results that such combinations can predict fake news more accurately. The authors of [21] studied the deployment of deep learning approaches for false news detection to overcome the limitations of machine learning techniques.

III. OVERVIEW OF THE DATASET

The model was trained using the corpus collected from the Kaggle website. Data acquisition and preprocessing, which removes punctuation and stop words [6], are done before the model is fully implemented. This makes the dataset quicker to scan and the proposed model easier to implement.

TABLE II. DATASET DESCRIPTION

| Corpus | Total articles | Cleaned articles | Fake news | Real news | #Features |
|---|---|---|---|---|---|
| Kaggle | 20800 | 20718 | 10369 | 10349 | 5 |

The dataset described in Table II is a compilation of 20800 articles with 5 features. Of the 20800 articles, 20718 remain after cleaning, of which 10369 are fake news and 10349 are real news.
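The preprocessing step described in this section (removing punctuation and stop words) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `clean_text` and the very small `STOP_WORDS` set are assumptions made for the example, and a real run would likely use a fuller stop-word list.

```python
import string
from typing import List

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "and", "to", "for"}


def clean_text(text: str) -> List[str]:
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    # Translation table that deletes every ASCII punctuation character.
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(strip_punct).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(clean_text("The senator's claim, posted on Twitter, is FALSE!"))
```

Applying such a function to every article before vectorization reduces the vocabulary the vectorizer has to handle.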

IV. IMPLEMENTATION & METHODOLOGY

This section briefly describes the implementation of Term Frequency (TF) and Inverse Document Frequency (IDF) on the corpus taken from Kaggle. Experiments were carried out in Python using the Google Colab environment.
A. Term Frequency (TF): “Term Frequency is an approach that utilizes the counts of words appearing in the documents to figure out the similarity between documents” [7].

B. Term Frequency-Inverse Document Frequency (TF-IDF): “The Term Frequency-Inverted Document Frequency (TF-IDF) is a weighting metric often used in information retrieval and natural language processing. It is a statistical metric for determining the importance of a term in a dataset document” [7].

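$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \tag{1}$$

Where,

$$\mathrm{tf}(t, d) = \log\bigl(1 + \mathrm{freq}(t, d)\bigr) \tag{2}$$

$$\mathrm{idf}(t, D) = \log\left(\frac{N}{\mathrm{count}(d \in D : t \in d)}\right) \tag{3}$$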
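To make the TF-IDF weighting concrete, the sketch below computes it by hand on a toy corpus. It assumes the logarithmic form tf(t, d) = log(1 + freq(t, d)) and idf(t, D) = log(N / df), where N is the number of documents in the corpus D and df is the number of documents containing the term t. The corpus, the function names, and the choice to return 0 for a term that appears in no document are all assumptions made for illustration. Library vectorizers such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter
from typing import List

Document = List[str]


def tf(term: str, doc: Document) -> float:
    """Logarithmic term frequency: log(1 + raw count of the term in the document)."""
    return math.log(1 + Counter(doc)[term])


def idf(term: str, corpus: List[Document]) -> float:
    """Inverse document frequency: log(N / number of documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumption: an unseen term carries no weight
    return math.log(len(corpus) / df)


def tfidf(term: str, doc: Document, corpus: List[Document]) -> float:
    """TF-IDF weight of a term in one document of the corpus."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus of already tokenized documents.
corpus = [
    ["breaking", "news", "aliens", "land"],
    ["news", "report", "economy", "grows"],
    ["aliens", "aliens", "everywhere"],
]

if __name__ == "__main__":
    for term in ("news", "aliens", "economy"):
        weights = [round(tfidf(term, doc, corpus), 4) for doc in corpus]
        print(term, weights)
```

A term that appears in only one document ("economy") receives a larger IDF than one shared by several documents ("news"), which is the behaviour the weighting is designed to produce.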

C. Passive Aggressive Classifier: The classifier is passive when its current prediction for an example is correct: the example does not carry enough information to warrant an update, so the model is left unchanged. It is aggressive when the prediction is incorrect: the example shows that the model is wrong at least this time, so the classifier changes the model to correct the mistake.

D. Natural Language Processing: A field of Artificial Intelligence that deals with the interaction between computers and human language, and with how to build applications that can process and identify meaningful information in a given set of texts [9].

Fig 1. General flowchart for the process of fake news detection

V. RESULTS AND DISCUSSIONS

The corpus is used for fake news detection with the TF-IDF vectorizer and with the TF-IDF vectorizer combined with NLP. Results of both feature extraction techniques are presented comparatively in Tables III, IV, and V using the corpus discussed in the dataset section. The results in Table III show that both algorithms perform equally on the dataset. The accuracy of the model does not vary significantly when NLP is used: the TF-IDF vectorizer alone attains 92.72% accuracy, while the TF-IDF vectorizer with NLP gives 92.66%.
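The passive and aggressive behaviour described in Section IV-C, together with the precision, recall, and F-score metrics reported in this section, can be illustrated with a small self-contained sketch. It is not the authors' implementation. It assumes the PA-I variant of the passive-aggressive update with hinge loss and an aggressiveness cap `C`, uses plain word counts in place of TF-IDF features, and trains on a hypothetical four-sentence toy set with labels +1 (fake) and -1 (real).

```python
from collections import Counter
from typing import Dict, List, Tuple

Features = Dict[str, float]


def featurize(text: str) -> Features:
    """Bag-of-words counts; a stand-in for TF-IDF features (assumption)."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def dot(w: Features, x: Features) -> float:
    return sum(w.get(tok, 0.0) * val for tok, val in x.items())


def predict(w: Features, x: Features) -> int:
    return 1 if dot(w, x) >= 0.0 else -1


def pa_update(w: Features, x: Features, y: int, C: float = 1.0) -> bool:
    """One PA-I step. Returns True if the weights changed (aggressive step)."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    norm_sq = sum(val * val for val in x.values())
    if loss == 0.0 or norm_sq == 0.0:
        return False  # passive: margin already satisfied, leave the model alone
    tau = min(C, loss / norm_sq)  # step size, capped by C
    for tok, val in x.items():
        w[tok] = w.get(tok, 0.0) + tau * y * val
    return True


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (+1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def train(data: List[Tuple[str, int]], epochs: int = 5) -> Features:
    w: Features = {}
    for _ in range(epochs):
        for text, label in data:
            pa_update(w, featurize(text), label)
    return w


# Hypothetical toy data: +1 = fake, -1 = real.
TRAIN = [
    ("shocking miracle cure doctors hate", 1),
    ("aliens secretly control the government", 1),
    ("parliament passes annual budget bill", -1),
    ("central bank holds interest rates steady", -1),
]
TEST = [
    ("miracle cure shocking aliens", 1),
    ("bank passes budget", -1),
]

weights = train(TRAIN)
y_true = [label for _, label in TEST]
y_pred = [predict(weights, featurize(text)) for text, _ in TEST]

if __name__ == "__main__":
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print("accuracy:", accuracy)
    print("precision, recall, F1:", precision_recall_f1(y_true, y_pred))
```

With the F1 definition used here, a precision of 1.00 and a recall of 0.900 give 2(1.00)(0.900)/1.900 ≈ 0.947, which matches the F-score listed next to those values in Tables III and IV.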

Authorized licensed use limited to: Khon Kaen University provided by UniNet. Downloaded on September 29,2022 at 04:52:29 UTC from IEEE Xplore. Restrictions apply.
TABLE III. RESULTS OF COUNT VECTORIZER IMPLEMENTATION ON KAGGLE DATASET

| Count Vectorizer | Accuracy | F-score | Precision | Recall |
|---|---|---|---|---|
| Multinomial | 89.78% | 0.947 | 1.00 | 0.900 |
| Passive Aggressive | 93.62% | 0.947 | 1.00 | 0.900 |

TABLE IV. RESULTS OF TF-IDF VECTORIZER IMPLEMENTATION ON KAGGLE DATASET

| TF-IDF Vectorizer | Accuracy | F-score | Precision | Recall |
|---|---|---|---|---|
| Multinomial | 89.99% | 0.947 | 1.00 | 0.900 |
| Passive Aggressive | 95.11% | 0.947 | 1.00 | 0.900 |

TABLE V. RESULTS OF COUNT VECTORIZER IMPLEMENTATION ON UNIVERSITY OF VICTORIA DATASET

| Count Vectorizer | Accuracy | F-score | Precision | Recall |
|---|---|---|---|---|
| Multinomial | 96.27% | 0.951 | 0.997 | 0.981 |
| Passive Aggressive | 97.2% | 0.953 | 1.0 | 0.994 |

The passive aggressive approach outperforms the multinomial approach, as the implementation results clearly show. Moreover, the TF-IDF vectorizer performs better than the Count Vectorizer.

Fig 2. Comparison of the results obtained for TF-IDF vectorizer with and without NLP

VI. CONCLUSION AND FUTURE SCOPE

The manuscript first presents a review of popular research related to false news detection. As future work, researchers can explore machine learning techniques not only for determining whether an article is fake or real but also in many other applications. This manuscript has focused on exploring fake news detection techniques. In future, broader and larger datasets can be used to increase the accuracy of the techniques studied.

REFERENCES

[1] Rohit Kumar Kaliyar, “Fake News Detection Using A Deep Neural Network”, 2018 4th International Conference on Computing Communication and Automation (ICCCA), 2018, IEEE.
[2] Kyeong-hwan Kim and Chang-sung Jeong, “Fake News Detection System using Article Abstraction”, 2019 16th International Conference on Computer Science and Software Engineering (ICSSE), 2019, IEEE.
[3] Karishnu Poddar, Geraldine Bessie Amali D. and K. S. Umadevi, “Comparison of Various Machine Learning Models for Accurate Detection of Fake News”, 2019 Innovations in Power and Advanced Computing Technologies (i-PACT), 2019, IEEE.
[4] Bhavika Bhutani, Neha Rastogi, Priyanshu Sehgal and Archana Purwar, “Fake News Detection Using Sentiment Analysis”, 2019 Twelfth International Conference on Contemporary Computing (IC3), 2019, IEEE.
[5] Mykhailo Granik, Volodymyr Mesyura, “Fake News Detection Using Naive Bayes Classifier”, 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), 2017, IEEE.
[6] Kaur, Sawinder, Parteek Kumar, and Ponnurangam Kumaraguru. "Automating fake news detection system using multi-level voting model." Soft Computing 24.12 (2020): 9049-9069.
[7] Ahmed, Hadeer, Issa Traore, and Sherif Saad. "Detection of online fake news using n-gram analysis and machine learning techniques." International conference on intelligent, secure, and dependable systems in distributed and cloud environments. Springer, Cham, 2017.
[8] Reis, Julio CS, et al. "Explainable machine learning for fake news detection." Proceedings of the 10th ACM conference on web science. 2019.
[9] Khan, Junaed Younus, et al. "A benchmark study on machine learning methods for fake news detection." arXiv preprint arXiv:1905.04749 (2019).
[10] Gravanis, Georgios, et al. "Behind the cues: A benchmarking study for fake news detection." Expert Systems with Applications 128 (2019): 201-213.
[11] Qian, Feng, et al. "Neural User Response Generator: Fake News Detection with Collective User Intelligence." IJCAI. Vol. 18. 2018.
[12] Yang, Shuo, et al. "Unsupervised fake news detection on social media: A generative approach." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.
[13] Shu, Kai, Suhang Wang, and Huan Liu. "Understanding user profiles on social media for fake news detection." 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2018.
[14] Sharma, Uma, Siddarth Saran, and Shankar M. Patil. "Fake News Detection using Machine Learning Algorithms." International Journal Of Engineering Research & Technology (IJERT) NTASU 9.03 (2020).
[15] Al-Ash, Herley Shaori, et al. "Ensemble learning approach on Indonesian fake news classification." 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS). IEEE, 2019.
[16] Dyson, Lauren, and Alden Golab. "Fake News Detection Exploring the Application of NLP Methods to Machine Identification of Misleading News Sources." CAPP 30255 Adv. Mach. Learn. Public Policy (2017).
[17] Jain, Anjali, et al. "A smart system for fake news detection using machine learning." 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT). Vol. 1. IEEE, 2019.
[18] Bali, Arvinder Pal Singh, et al. "Comparative performance of machine learning algorithms for fake news detection." International conference on advances in computing and data sciences. Springer, Singapore, 2019.
[19] Khanam, Z., et al. "Fake News Detection Using Machine Learning Approaches." IOP Conference Series: Materials Science and Engineering. Vol. 1099. No. 1. IOP Publishing, 2021.
[20] Nagaraja, Arun, et al. "Fake News Detection Using Machine Learning Methods." International Conference on Data Science, E-learning and Information Systems 2021. 2021.
[21] Manzoor, Syed Ishfaq, and Jimmy Singla. "Fake news detection using machine learning approaches A systematic review." 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE, 2019.
