
Fake News Detection by Using Machine Learning

A Project Report
Submitted in partial fulfillment of the requirements for the award of the degree
of

BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY

Submitted by

Shivam Tripathi (1802913156)


Rakesh Gupta (1802913129)
Shivam Tayal (1802913158)

Supervised by

Dr. Ajay Agarwal

DEPARTMENT OF INFORMATION TECHNOLOGY

KIET GROUP OF INSTITUTIONS, GHAZIABAD, UTTAR PRADESH

(Affiliated to Dr. A. P. J. Abdul Kalam Technical University, Lucknow, U.P., India)

Session 2021-22
DECLARATION

We declare that
a. The work contained in this report is original and has been done by us under the guidance
of our supervisor.
b. The work has not been submitted to any other institute for any degree or diploma.
c. We have followed the guidelines provided by the institute to prepare the report.
d. We have conformed to the norms and guidelines given in the ethical code of conduct
of the institute.
e. Wherever we have used materials (data, theoretical analysis, figures and text) from
other sources, we have given due credit to them by citing them in the text of the report
and giving their details in the references.

Signature of the student


Name: Shivam Tripathi
Roll number: 1802913156

Signature of the student


Name: Rakesh Gupta
Roll number: 1802913129

Signature of the student


Name: Shivam Tayal
Roll number: 1802913158

Place: KIET Group of Institutions, Ghaziabad


Date: 15 May 2022

CERTIFICATE

This is to certify that the project report entitled "Fake News Detection by Using
Machine Learning", submitted by Shivam Tripathi, Rakesh Gupta and Shivam Tayal, in
the Department of Information Technology of KIET Group of Institutions, Ghaziabad,
affiliated to Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh,
India, is a record of bona fide project work carried out by them under my supervision
and guidance and is worthy of consideration for the award of the degree of Bachelor of
Technology in Information Technology of the Institute.

Signature of Supervisor:

Supervisor Name: Dr. Ajay Agarwal

Date: 15 May 2022

List of Figures

1.1 Survey results about COVID-19 and fake news in India (2020).

1.2 Perceived manipulativeness of fake news (left) and real news (right), averaged and per individual item; error bars show 95% confidence intervals [12].

1.3 Naïve Bayes classifier.

1.4 Gaussian distribution.

1.5 Support Vector Machine.

1.6 Sample classification with SVM.

1.7 Sigmoid function.

1.8 Confusion matrix for logistic regression.

1.9 Binary logistic regression.

2.1 Proposed system methodology.

2.2 Proposed model.

3.1 Number of words vs Title pattern in real news

3.2 Number of words vs Title pattern in fake news

4.1 Training data.

4.2 Test data.

4.3 Comparison of Algorithms.

List of Tables

1.1 Inductive typology of claims regarding COVID-19 related misinformation.

3.1 Result comparison of existing approaches.

4.1 Algorithm accuracy of the implemented models.

List of Acronyms

SVM SUPPORT VECTOR MACHINE

ML MACHINE LEARNING

LR LOGISTIC REGRESSION

RNN RECURRENT NEURAL NETWORK

NN NEURAL NETWORK

KNN K-NEAREST NEIGHBORS

RF RANDOM FOREST

CONTENTS

Declaration

Certificate

List of Figures

List of Tables

List of Acronyms

Abstract

CHAPTER 1: Introduction

1.1 Methodology
1.2 Approach
1.2.1 Naïve Bayes
1.2.2 Support Vector Machine
1.2.3 Logistic Regression

CHAPTER 2: System Architecture

CHAPTER 3: Literature Review

CHAPTER 4: Results

CHAPTER 5: Conclusion

REFERENCES

APPENDICES:

APPENDIX A: WRITE UP OF RELATED WORK

APPENDIX B: CERTIFICATE OF REVIEW PAPER

APPENDIX C: RESEARCH PAPER WORK

Abstract

Fake information is all around us, whether we can identify it or not. Individuals and
organizations publish fake news all the time to override unfavourable truths. A good example
of fake news is the COVID-19 vaccine: before the vaccine came out, huge amounts of fake
news and altered images were circulating on the internet.

Some sources stated that a fully effective vaccine was already available, some stated that it
was coming very soon, and others stated that it would take a decade for a safe and functional
one to be released. Trusting and following the wrong sources can do more harm than good.
This report examines the application of Support Vector Machine, Logistic Regression and
Naïve Bayes learning techniques to identify fake news accurately.

CHAPTER 1

Introduction

The main motive behind creating fake news is largely to mislead people by
making them fall prey to a range of hoaxes, propaganda and inaccurate
information. There are articles that are either completely false or simply the
random opinion of a single person presented as news.

Nowadays all the key social media platforms, such as Facebook, Twitter,
WhatsApp and Reddit, spread fake news rapidly. In this report we propose a
technique for identifying fake news that employs a few ML methods, namely
Naïve Bayes, Support Vector Machine and Logistic Regression.

The types of fake news are as follows:

1. Content created for the sake of politics.

2. Fictitious images attached to unrelated stories.

3. Content that is completely unfounded.

4. Rumours spread by IT cells.

5. Religious content that is deceptive.

As we all know, there was a lot of misinformation surrounding COVID-19 during
the pandemic. A new wave of COVID-19 poured across India, bringing with it a
new flood of fake news. A scientific study [1] published in the Journal of Medical
Research by medical practitioners from Rochester, New York, and Pune, India,
provides insight into the behaviour of Indian internet users on social media during
the epidemic, a major source of COVID-19 hoaxes in the country.
A good example of fake news is the COVID-19 vaccine. Prior to the launch of the
vaccine, a large amount of fake news and altered images were widespread on the
internet. Some sources said that a fully effective vaccine was already available,
some said it was coming soon, and some said it would take ten years for a safe
and effective vaccine to be released. But trusting and following bad sources can
do more harm than good.

According to the study, about thirty per cent of Indians used WhatsApp as a
source of COVID-19 information, and about half of those users fact-checked
fewer than half of the messages before sharing them. Even more shocking, 13%
of respondents indicated they never fact-checked messages before forwarding
them. Individuals and organizations are constantly posting fake news to override
unfavourable facts.

The study also looked at age groups, finding that those over 65 were the most
likely to receive disinformation, and to believe and act on it, while those under
25 were the least likely. Between twenty-four and twenty-seven per cent of
participants said they had contemplated using herbal, Ayurvedic or homoeopathic
COVID-19 therapies; seven to eight per cent stated they had tried them, while
twelve per cent said they had tried home treatments.

Even though an attached link or reference to a source does not necessarily make
a claim authentic, three-quarters of Indians believe it makes a message more
trustworthy. Only a third of Indians said they believed messages coming from
foreign countries.

With advances in machine learning techniques such as Support Vector Machines,
Logistic Regression and Naïve Bayes, fake news can be identified accurately and
efficiently.
Figure 1.1 shows survey results about COVID-19 and fake news in India (2020).

Figure 1.1

Table 1.1 shows the inductive typology of claims regarding COVID-19 related
misinformation [11].

Table 1.1

Figure 1.2 shows the perceived manipulativeness of fake news (left) and real news
(right), averaged and per individual item. Error bars show 95% confidence
intervals [12].

Figure 1.2
1.1 METHODOLOGY

Recognizing the category of a news item is difficult because of the
multi-dimensional nature of fake news; it is self-evident that a realistic approach
is required. To be effective, a technique must combine a variety of viewpoints to
deal with the problem precisely, which is why the proposed technique is a
combination of Naïve Bayes, Support Vector Machine and Logistic Regression.

Automated computations cannot reproduce a person's judgement, so rather than
relying on any single model to distinguish the real from the counterfeit, it is
important to combine several models and make precise classifications.

The resulting three-part strategy is therefore a hybrid of supervised machine
learning algorithms, combined with a natural language preprocessing step.

1.2 Approach

The proposed approach combines three supervised classifiers, Naïve Bayes,
Support Vector Machine and Logistic Regression, applied to preprocessed news
text. The three techniques are described in the following subsections.

1.2.1 Naïve Bayes:


As shown in Eq. (1), a Bayesian network with a single class node and
conditionally independent attribute nodes gives a naïve Bayes classifier (for the
sake of clarity, attributes are assumed to be discrete). Let c stand for a class label
and xi for a value of the attribute Xi. A naïve Bayes model induces the
distribution

P(c | x1, ..., xn) ∝ P(c) * P(x1 | c) * P(x2 | c) * ... * P(xn | c)        Eq. (1)

We can use maximum likelihood or MAP estimation to estimate these parameters
from (labelled) data. After learning a naïve Bayes classifier from the input, we can
label a new example by determining the class label c* with the highest posterior
probability given the observed values x1, ..., xn:

c* = argmax over c of P(c) * P(x1 | c) * ... * P(xn | c)        Eq. (2)

Equations (1) and (2) are what the classifier uses to make its decisions. Naïve
Bayes models are easy to build and particularly useful for small and medium-sized
data sets (the one used in this work is evidence of that). Along with its simplicity,
Naïve Bayes is known to perform surprisingly well compared with far more
sophisticated classification methods.

For this reason the classifier is used in various critical domains such as disease
diagnosis, sentiment analysis and email spam filtering.

It is called naïve Bayes (or idiot Bayes) because the calculation of the
probabilities for each hypothesis is simplified to make it tractable: rather than
attempting to estimate the full joint probability of the attribute values
P(d1, d2, d3 | h), the attributes are assumed to be conditionally independent given
the target value, so the probability is calculated as P(d1 | h) * P(d2 | h) * P(d3 | h),
and so on.
Given a naïve Bayes model, we can make predictions for new data using Bayes'
theorem:

MAP(h) = max(P(d | h) * P(h))

As a toy example, suppose we have a small data set in which the weather (sunny
or rainy) is used to decide whether to go out or stay home. If we are given a new
instance with the weather being sunny, we can calculate:

go-out = P(weather = sunny | class = go-out) * P(class = go-out)

stay-home = P(weather = sunny | class = stay-home) * P(class = stay-home)

We choose the class that has the largest calculated value. We can turn these
values into probabilities by normalizing them as follows:

P(go-out | weather = sunny) = go-out / (go-out + stay-home)

P(stay-home | weather = sunny) = stay-home / (go-out + stay-home)

If we had more input variables, we could extend the example. Suppose we also
have a "car" attribute with the values "working" and "broken"; we simply multiply
the corresponding probability into the equation. Below is the calculation for the
"go-out" class label with the car input variable set to "working":

go-out = P(weather = sunny | class = go-out) * P(car = working | class = go-out) *
P(class = go-out)
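As a minimal sketch, the calculation above can be written directly in Python. The priors and likelihoods below are made-up numbers for the toy weather/car example, not values taken from real data.

```python
# Hand-computed naive Bayes scores for the toy "go-out vs stay-home" example.
# All probabilities below are illustrative assumptions.
priors = {"go-out": 0.6, "stay-home": 0.4}

likelihoods = {
    "go-out":    {"weather=sunny": 0.8, "car=working": 0.7},
    "stay-home": {"weather=sunny": 0.3, "car=working": 0.5},
}

observation = ["weather=sunny", "car=working"]

# Score each class: P(class) * product of P(feature | class).
scores = {}
for label, prior in priors.items():
    score = prior
    for feature in observation:
        score *= likelihoods[label][feature]
    scores[label] = score

# Normalize the scores into probabilities.
total = sum(scores.values())
for label, score in scores.items():
    print(f"P({label} | sunny, working) = {score / total:.3f}")
```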

Figure 1.3 shows the structure of the naïve Bayes classifier, where X1, X2, ..., Xn
are the observed attribute variables.

Figure 1.3
Types of Naive Bayes Classifier

1. Multinomial Naïve Bayes Classifier:


Feature vectors represent the frequencies with which certain events have been
generated by a multinomial distribution. This is the event model typically used
for document classification.

2. Bernoulli Naïve Bayes Classifier:


In the multivariate Bernoulli event model, features are independent Booleans
(binary variables) describing inputs.

Like the multinomial model, this model is popular for document classification
tasks, where binary term occurrence (i.e. a word occurs in a document or not)
features are used rather than term frequencies (i.e. frequency of a word in the
document).

3. Gaussian Naïve Bayes Classifier:

In Gaussian Naïve Bayes, continuous values associated with each feature are
assumed to be distributed according to a Gaussian distribution (Normal
distribution).

When plotted, it gives a bell-shaped curve which is symmetric about the mean
of the feature values as shown in Figure 1.4.

Figure 1.4

The likelihood of the features is assumed to be Gaussian; hence, the conditional
probability is given by:

P(xi | y) = (1 / sqrt(2 * pi * sigma_y^2)) * exp(-(xi - mu_y)^2 / (2 * sigma_y^2))        Eq. (3)

In Eq. (3), mu_y and sigma_y^2 are the mean and variance of feature xi computed
from the training examples of class y.


Now, what if a feature contains numerical values instead of categories? One
option is to transform the numerical values into categorical counterparts (binning)
before creating the frequency tables. The other option, as shown above, is to use a
probability distribution of the numerical variable to estimate the likelihoods; a
common choice is to assume a normal (Gaussian) distribution for numerical
variables.
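For the text data used in fake news detection, the multinomial variant on word counts is the usual fit. Below is a minimal scikit-learn sketch; the headlines and labels are invented for illustration and are not taken from the project's dataset.

```python
# Minimal sketch: multinomial naive Bayes on bag-of-words features (scikit-learn).
# The headlines and labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

headlines = [
    "Government confirms new vaccine rollout schedule",
    "Miracle herb cures COVID-19 in 24 hours, doctors shocked",
    "Health ministry publishes weekly infection statistics",
    "Secret document proves vaccines contain tracking chips",
]
labels = [1, 0, 1, 0]  # 1 = real, 0 = fake

# CountVectorizer builds word-frequency features; MultinomialNB models them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(headlines, labels)

print(model.predict(["New miracle cure spreads on WhatsApp"]))
```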

1.2.2 Support Vector Machine

The Support Vector Machine (SVM) was created with the goal of determining the
best boundary for classifying positive and negative data points. The original data
points can be mapped into a high-dimensional vector space using a well-defined
kernel function, allowing features to be extracted; thus, SVM is regarded as an
important machine learning technique for both regression and classification [9].

According to the SVM algorithm, we find the points from both classes that lie
closest to the separating line. These points are called support vectors. We then
compute the distance between the line and the support vectors; this distance is
called the margin. Our goal is to maximize the margin, and the hyperplane for
which the margin is maximum is the optimal hyperplane.

Some data is clearly not linearly separable: we cannot draw a straight line that
classifies it. Such data can, however, be converted into linearly separable data in
a higher dimension. Let us add one more dimension, the z-axis, and let the
coordinates on the z-axis be governed by the constraint shown in Eq. (4):

z = x² + y²        Eq. (4)

In the new space the data becomes linearly separable. Let the line separating the
data in the higher dimension be z = k, where k is a constant. Since z = x² + y², we
get x² + y² = k, which is the equation of a circle. So we can project this linear
separator in the higher dimension back into the original dimensions using this
transformation.

For intuition about hyperplanes, let a line be our one-dimensional Euclidean space
(i.e. suppose our data set lies on a line). A point on the line divides it into two
parts; the line has one dimension while the point has zero dimensions, so a point
is a hyperplane of the line.
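A small sketch of the lifting idea with synthetic points: after adding z = x² + y², a single threshold on z (a plane) separates an inner cluster from an outer ring. The data below is randomly generated for illustration only.

```python
# Sketch: lifting 2-D points with z = x^2 + y^2 makes circularly separated
# classes separable by a plane (a threshold on z). Points are synthetic.
import numpy as np

rng = np.random.default_rng(0)

inner = rng.normal(0, 0.5, size=(50, 2))                      # cluster near the origin
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
outer += rng.normal(0, 0.2, size=outer.shape)                  # noisy ring of radius ~3

def lift(points):
    """Append the third coordinate z = x^2 + y^2 to each point."""
    z = (points ** 2).sum(axis=1)
    return np.column_stack([points, z])

z_inner = lift(inner)[:, 2]
z_outer = lift(outer)[:, 2]

# A plane z = k now separates the two classes.
k = (z_inner.max() + z_outer.min()) / 2
print("threshold k =", round(k, 2))
print("inner below k:", bool(np.all(z_inner < k)), "| outer above k:", bool(np.all(z_outer > k)))
```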

Kernel functions

Linear
Linear kernels are commonly recommended for text classification because most
text classification problems are linearly separable. The linear kernel also works
well when there are many features, and text classification problems typically have
a lot of features. Linear kernel functions are faster than most of the others, and
there are fewer parameters to optimize:

f(X) = w^T * X + b        Eq. (5)

In Eq. (5), w is the weight vector that is learned, X is the data point being
classified, and b is the bias term estimated from the training data. This equation
defines the decision boundary that the SVM returns.
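A minimal sketch of a linear-kernel SVM text classifier with scikit-learn; the texts and labels are placeholders, not the project's dataset.

```python
# Minimal sketch: linear-kernel SVM on TF-IDF features (scikit-learn).
# Texts and labels are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "WHO releases updated guidance on mask usage",
    "Drinking hot water every hour kills the coronavirus",
    "State announces vaccination drive for people over 60",
    "5G towers are spreading the virus, forward to everyone",
]
labels = [1, 0, 1, 0]  # 1 = real, 0 = fake

# TF-IDF gives a high-dimensional sparse representation, where a linear
# decision boundary f(X) = w^T X + b is usually sufficient.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["Forward this: boiled garlic water cures the virus"]))
```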

The components of an SVM are represented in Figure 1.5.

Figure 1.5

Figure 1.6

Figure 1.6 shows an example classification of two samples, where Class A and
Class B are the respective classes.

1.2.3 Logistic Regression

Logistic regression is another fundamental method, initially formulated by David
Cox in 1958, that builds a logistic model (also known as the logit model). Its most
significant advantage is that it can be used both for classification and for
class-probability estimation, because it is tied to the logistic distribution. It takes
a linear combination of features and applies a nonlinear sigmoid function to it. In
the basic version of logistic regression the output variable is binary; however, it
can be extended to multiple classes (in which case it is called multinomial logistic
regression).

The binary logistic model classifies specimens into two classes, whereas the
multinomial logistic model extends this to an arbitrary number of classes without
ordering them.
The mathematics of logistic regression rely on the concept of the "odds" of an
event, which is the probability of the event occurring divided by the probability of
it not occurring.
Just as in linear regression, logistic regression has weights associated with the
dimensions of the input data. In contrast to linear regression, the relationship
between the weights and the output of the model (the odds) is exponential, not
linear.
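A tiny numerical illustration of odds and log-odds (the probability chosen is arbitrary):

```python
# Tiny illustration of odds and log-odds; the probability is an arbitrary example.
import math

p = 0.8                      # probability of the event
odds = p / (1 - p)           # 0.8 / 0.2 = 4.0
log_odds = math.log(odds)    # the "logit", which the linear part of the model predicts

print(f"odds = {odds:.2f}, log-odds = {log_odds:.3f}")
```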

Logistic regression is therefore an excellent method for classification problems:
despite the word "regression" in its name, it is used to solve classification hurdles.
The function used to model the output variable in logistic regression is given
below. The main difference between linear and logistic regression is that the
output of the latter is restricted to the range (0, 1). In addition, unlike linear
regression, logistic regression does not require a linear relationship between the
input and output variables. For clarification, refer to Eq. (6):

f(x) = 1 / (1 + e^(-x))        Eq. (6)

In Eq. (6), x is the input to the logistic (sigmoid) function. Plotting the function
for input values ranging from -20 to 20 gives the S-shaped curve shown in
Figure 1.7.

Figure 1.7
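A short sketch that evaluates the logistic function of Eq. (6) over this range:

```python
# Sketch: evaluate the logistic (sigmoid) function, Eq. (6), over -20 to 20.
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-20, 20, 9)
for x, y in zip(xs, sigmoid(xs)):
    print(f"x = {x:6.1f}  ->  sigmoid(x) = {y:.6f}")
```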

The confusion matrix is one of the easiest and most intuitive metrics used for
assessing the accuracy of a classification model whose output falls into two or
more categories. It is the most popular way to evaluate logistic regression; the
confusion matrix for logistic regression is shown below.

A true positive is the case where both the actual value and the predicted value are
positive: the patient has been diagnosed with cancer, and the model also predicted
that the patient has cancer.

In a false negative, the actual value is positive but the predicted value is negative:
the patient has cancer, but the model predicted that the patient does not have
cancer. This is also known as a Type 2 error.

A false positive is the case where the predicted value is positive but the actual
value is negative: the model predicted that the patient has cancer, but in reality the
patient does not have cancer. This is also known as a Type 1 error.

A true negative is the remaining case, where both the actual and the predicted
value are negative.
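These four counts can be computed directly with scikit-learn; the labels below are made-up placeholders.

```python
# Sketch: confusion matrix for a binary classifier (scikit-learn).
# The actual and predicted labels are made-up placeholders.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = positive class)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
print(f"accuracy = {(tp + tn) / len(y_true):.2f}")
```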

The confusion matrix is shown in Figure 1.8.

Figure 1.8

Binary logistic regression

Binary logistic regression applies when an object is classified into one of two
categories, for example as an animal or not an animal: it is an either/or solution.
There are just two possible outcome answers, typically represented as a 0 or a 1 in
code.

Figure 1.9

Multinomial logistic regression

Multinomial logistic regression is a model where there are multiple classes that
an item can be classified as. There is a set of three or more predefined classes set
up prior to running the model.

Ordinal logistic regression

Ordinal logistic regression is also a model where there are multiple classes that
an item can be classified as; however, in this case an ordering of classes is
required. Classes do not need to be proportionate. The distance between each
class can vary.
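Returning to the binary case used for fake news classification, a minimal scikit-learn sketch (the texts and labels are placeholders, not the project's dataset):

```python
# Minimal sketch: binary logistic regression on TF-IDF features (scikit-learn).
# The texts and labels are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Official bulletin lists approved vaccination centres",
    "Eating raw ginger makes you immune to the virus",
    "Hospital releases daily report on ICU occupancy",
    "Leaked memo says lockdown is a cover for alien landing",
]
labels = [1, 0, 1, 0]  # 1 = real, 0 = fake

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# predict_proba returns the sigmoid-based class probabilities.
print(clf.predict_proba(["Raw ginger stops infection, share widely"]))
```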

CHAPTER 2
System Architecture

First, we gather news being circulated in the media from various sources. Then
we tag each item as social, controversial, etc., depending on the nature of the
news. Next we perform data analysis using algorithms such as SVM, Naïve Bayes
and logistic regression.

By considering the output probabilities from these algorithms, we can classify
news items as hoaxes or genuine.

Figure 2.1. Proposed system methodology

Figure 2.1 shows the flow of the system methodology, while Figure 2.2 describes
the proposed model architecture and briefly explains the role of each component.

Figure 2.2 Proposed Model

This is the overall system architecture of our model, which helps us perform news
filtering and classification.
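A minimal end-to-end sketch of this pipeline with scikit-learn: load labelled news text, build TF-IDF features, train the three classifiers, and combine their output probabilities. The file name, separator and column names are assumptions about the training CSV (e.g. the dataset in [10]), not guaranteed by it.

```python
# Sketch of the proposed pipeline: TF-IDF features, three classifiers, and a
# simple combination of their output probabilities (soft voting).
# Assumes a CSV with "text" and "label" columns; adjust names/separator as needed.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

df = pd.read_csv("train.csv")  # columns assumed: text, label (1 = true, 0 = fake)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

models = {
    "SVM": SVC(kernel="linear", probability=True),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train_vec, y_train)
    print(name, "accuracy:", round(model.score(X_test_vec, y_test), 2))

# Combine the models by averaging their predicted probabilities.
probs = np.mean([m.predict_proba(X_test_vec) for m in models.values()], axis=0)
combined = probs.argmax(axis=1)
print("Combined accuracy:", round((combined == y_test.values).mean(), 2))
```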

CHAPTER 3
Literature Review

Many automatic hoax-detection algorithms have been described in the literature.
Hoaxes take numerous forms, ranging from chatbots that promote misinformation
to clickbait used to spread rumours; platforms such as Facebook are full of
clickable links that encourage people to share and like postings, spreading false
information. A lot of effort has therefore been put into detecting fake content.

Feature extraction is at the heart of several existing fake news detection
algorithms. Language-based strategies make use of crucial linguistic
characteristics seen in fake news, such as n-grams, punctuation, readability and
syntax.

In [2], Shivam B. Parikh and Pradeep K. Atrey (2018) developed a Naïve Bayes
classifier based on the idea that fake news pieces frequently reuse the same set of
terms, whereas genuine news uses a more varied vocabulary. The overall accuracy
of the model using the Naïve Bayes classifier was around 70%.

In their study, Mykhailo Granik et al. [3] provide a simple strategy for detecting
fake news using a naïve Bayes classifier. The method was turned into a software
system and tested on a collection of Facebook news posts gathered from three
large Facebook pages on the political right and left, as well as three major
mainstream political news pages (Politico, CNN and ABC News). They were able
to attain a classification accuracy of around 74%; the accuracy on the fake news
class was slightly lower, which could be due to the skewness of the dataset, in
which only 4.9 per cent of the posts are bogus news.

The authors of [4] analysed roughly 14 million tweets sharing about 400,000
articles on Twitter during and after the 2016 U.S. presidential election and found
that social bots played a disproportionate role in amplifying low-credibility
content, which motivates methods for categorizing the posts spread by bots.

Linguistic Cue Approaches with Machine Learning, BOW Approach, Rhetorical
Structure and Discourse Analysis are all described by the authors in [5].

The authors of [6] suggested FakeDetector, an automatic hoax-spotting model
based on textual categorization that uses a deep diffusive network model to
simultaneously learn representations of news articles, authors and subjects.
FakeDetector contains two primary components, representation feature learning
and credibility label inference, which are combined to form the FakeDetector
deep diffusive network model. Good deception-modelling algorithms (bag of
words, rhetorical analysis) provide a robust and solid architecture, and the authors
focus on specialized linguistic content.

The authors of [7] developed a model that uses machine learning techniques.
They used a variety of machine learning algorithms to improve accuracy,
including the Linear Support Vector Machine, Multilayer Perceptron, Random
Forest and KNN; using Random Forest they were able to obtain 98 per cent
accuracy. Their goal is to examine short sentences and the news item in question
and to construct a reliability count for the news by serially applying feature
extraction and credibility scoring.

Figure 3.1 shows a characteristic observed in authentic news: by analysing a bulk
of data related to true news, we observe distinct patterns in the relationship
between the number of words in an article and the number of words in its title.

Figure 3.1

Figure 3.2 shows the corresponding characteristic observed in hoaxes: analysing
a bulk of data related to false news reveals distinct patterns of its own in the same
article-length versus title-length relationship.

Figure 3.2

Table 3.1 summarizes the related works [09][14][15].
CHAPTER 4
Results
For the implementation, the three approaches described above are considered.
The dataset [10] comprises 4,897 rows, labelled 1 for true and 0 for false; the id
column identifies each text and the text column contains the article text. Figure
4.1 and Figure 4.2 show the implementation on the training and test data.

Figure 4.1

Figure 4.2

Table 4.1 shows the accuracy of the implemented models.

IMPLEMENTATION MODEL        ACCURACY

SUPPORT VECTOR MACHINE      0.94

NAÏVE BAYES                 0.81

LOGISTIC REGRESSION         0.85

Figure 4.3 Comparison of Algorithms
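The comparison in Figure 4.3 can be reproduced from the accuracies in Table 4.1 with a short plotting script; a minimal sketch:

```python
# Sketch: bar-chart comparison of the accuracies reported in Table 4.1.
import matplotlib.pyplot as plt

models = ["SVM", "Naïve Bayes", "Logistic Regression"]
accuracies = [0.94, 0.81, 0.85]

plt.bar(models, accuracies, color=["steelblue", "darkorange", "seagreen"])
plt.ylabel("Accuracy")
plt.ylim(0, 1)
plt.title("Comparison of algorithms")
plt.show()
```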

CHAPTER 5

CONCLUSION
Determining the accuracy of news that is available on the internet is critical. The
study discussed the components needed for spotting fake news, starting from the
realization that not everything shared on web-based social networks is genuine.

The recommended solution, which employs the SVM, Naïve Bayes and Logistic
Regression classification techniques, is currently being tested, and the resulting
system could be useful in the future. Hybrid techniques yield superior results
when pursuing the same goal. As previously described, the system detects bogus
news based on the models that were used.

It also supplies some assistance and suggests breaking news on the subject, which
any user will find handy. In the future the efficiency of the system will improve,
and the prototype's precision, as well as its user interface, can be improved to a
certain extent.

There remain numerous open issues in fake news detection. Identifying the key
actors involved in the distribution of news is a vital step in reducing its spread;
graph theory and machine learning approaches can be used to identify the primary
sources engaged in the dissemination of fake news. Real-time fake news detection
in videos could also be a promising future direction.
REFERENCES

[1] J. A. Bapaye and A. Bapaye, "Impact of WhatsApp as a Source of Misinformation for the Novel Coronavirus Pandemic in a Developing Country: Cross-Sectional Questionnaire Study", Jun. 2021.

[2] S. B. Parikh and P. K. Atrey, "Media-Rich Fake News Detection: A Survey," in Proceedings - IEEE 1st Conference on Multimedia Information Processing and Retrieval (MIPR 2018), Jun. 2018.

[3] M. Granik and V. Mesyura, "Fake Statements Detection with Ensemble of Machine Learning Algorithms," Problems of Information Technology, vol. 09, no. 2, pp. 48-52, Jul. 2018.

[4] C. Shao, G. L. Ciampaglia, O. Varol, K. Yang, A. Flammini, and F. Menczer, "The spread of low-credibility content by social bots," Jul. 2017.

[5] N. J. Conroy, V. L. Rubin, and Y. Chen, "Automatic Deception Detection: Methods for Finding Fake News," 2015.

[6] S. I. Manzoor, J. Singla, and Nikita, "Fake news detection using machine learning approaches: A systematic review," in Proceedings of the International Conference on Trends in Electronics and Informatics (ICOEI 2019), Apr. 2019.

[7] I. Ahmad, M. Yousaf, S. Yousaf, and M. O. Ahmad, "Fake News Detection Using Machine Learning Ensemble Methods," Complexity, vol. 2020, 2020.

[8] P. Kaviani and M. S. Dhotre, "Short Survey on Naive Bayes Algorithm," International Journal of Advance Engineering and Research Development, vol. 4, no. 11, 2017.

[9] A. Jain, A. Shakya, H. Khatter, and A. K. Gupta, "A Smart System for Fake News Detection Using Machine Learning," Sep. 2019.

[10] Dataset: https://www.kaggle.com/c/fakenewskdd2020/data?select=train.csv, accessed 2 March 2022.

[11] F. Simon, P. N. Howard, and R. K. Nielsen, "Types, sources, and claims of COVID-19 misinformation," 7 April 2020.

[12] M. Basol, M. Berrich, F. Uenal, and S. van der Linden, "Towards psychological herd immunity: Cross-cultural evidence for two prebunking interventions against COVID-19 misinformation," May 2021.

[13] P. R. Humanante-Ramos, F. J. Garcia-Penalvo, and M. A. Conde-Gonzalez, "PLEs in Mobile Contexts: New Ways to Personalize Learning," Rev. Iberoam. Tecnol. del Aprendiz., vol. 11, no. 4, pp. 220-226, 2016.

[14] S. B. Parikh, V. Patil, and P. K. Atrey, "On the Origin, Proliferation and Tone of Fake News," Proc. 2nd Int. Conf. Multimedia Information Processing and Retrieval (MIPR 2019), pp. 135-140, 2019.

[15] S. Das Bhattacharjee, A. Talukder, and B. V. Balantrapu, "Active learning-based news veracity detection with feature weighting and deep-shallow fusion," Proc. 2017 IEEE Int. Conf. Big Data, pp. 556-565, 2018.
APPENDIX A

Review paper on Fake News Detection by Using Machine Learning

We submitted the survey paper to ICDABM 2022 (International Conference on
Data Analytics in Business and Marketing).

Authors: Shivam Tripathi, Ajay Agarwal, Shivam Tayal and Rakesh Gupta
Title: Survey Paper on Fake News Detection Using Machine Learning
Number: 23
Acceptance date: March 20, 2022

The registration fee was paid and the camera-ready copy (CRC) was submitted.
The paper was successfully presented at the conference and the certificate was
issued on 21 April 2022.

Research paper on Fake News Detection by Using Machine Learning

We submitted the research paper to BCIPECH-2022.

Authors: Shivam Tripathi, Ajay Agarwal, Shivam Tayal and Rakesh Gupta
Title: Research Paper on Fake News Detection Using Machine Learning
Number: 6713
Acceptance date: May 16, 2022

The registration fee was paid and the CRC was successfully submitted.
APPENDIX B

CERTIFICATE ISSUED BY ICDABM2022 FOR REVIEW PAPER

APPENDIX C

Research Paper: Fake News Detection by Using Machine Learning

Shivam Tripathi, KIET Group of Institutions, Ghaziabad ([email protected])
Ajay Agarwal, KIET Group of Institutions, Ghaziabad ([email protected])
Rakesh Gupta, KIET Group of Institutions, Ghaziabad ([email protected])
Shivam Tayal, KIET Group of Institutions, Ghaziabad ([email protected])

