Machine Learning Fake News Blocking

This paper explores the use of Natural Language Processing techniques to identify fake news by building classifiers using a corpus of labeled articles. Four classification models were evaluated, with the Long Short-Term Memory (LSTM) model achieving the highest accuracy of 94.53%. The study emphasizes the importance of source identification in predicting the reliability of news articles.

Machine Learning: Fake News Blocking

Harshvardhan Singh
Department of Engineering and Technology,
SRM Institute of Science and Technology, Kattankulathur,
Kancheepuram Dist., India, 603203
e-mail: [email protected]

Abstract In this paper, Natural Language Processing techniques are applied to identify when a news source may be producing fake news. A corpus of labelled real and fake news articles is used to build a classifier that can make decisions about information based on the content of the corpus. I use a text classification approach with four different classification models and analyse the results. The best performing model is the LSTM implementation. The model focuses on identifying fake news sources, based on multiple articles originating from a source. Once a source is labelled as a producer of fake news, it can be predicted with high confidence that any future articles from that source will also be fake news. Focusing on sources widens our article misclassification tolerance, because we then have multiple data points coming from each source.

Keywords Feature extraction, Pre-processing, Naive Bayes, Support Vector Machine, Feed-forward Neural Network, Long Short-Term Memory, Confusion matrix

Introduction Fake news, defined as a made-up story with an intention to deceive, has been widely cited as a contributing factor to the outcome of the 2016 United States presidential election. While Mark Zuckerberg, Facebook’s CEO, made a public statement denying that Facebook had an effect on the outcome of the election, Facebook and other online media outlets have begun to develop strategies for identifying fake news and mitigating its spread. Zuckerberg admitted that identifying fake news is difficult, writing, “This is an area where I believe we must proceed very carefully though. Identifying the truth is complicated.” Fake news is increasingly becoming a menace to our society. It is typically generated for commercial interests, to attract viewers and collect advertising revenue. However, people and groups with potentially malicious agendas have also been known to initiate fake news in order to influence events and policies around the world.


Data The datasets used for this project were drawn from Kaggle. The training dataset has about 16600 rows of data from various articles on the internet. Quite a bit of pre-processing of the data had to be done, as is evident from our source code, in order to train our models.

A full training dataset has the following attributes:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; incomplete in some cases
5. label: a label that marks the article as potentially unreliable
• 1: unreliable
• 0: reliable

Dataset Description In a dataset, a training set is used to build a model, while a test (or validation) set is used to validate the model built. Data points in the training set are excluded from the test (validation) set. Usually, a dataset is divided into a training set and a validation set (some people use ‘test set’ instead) in each iteration, or into a training set, a validation set and a test set in each iteration. The research dataset includes the following.

• data.csv: A full training dataset with the following attributes:
o id
o title
o author
o text
o label
• test.csv: A testing dataset with all the same attributes as data.csv, without the label.

Fig. 1 First 5 records from the data frame
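The loading and cleaning of such a file can be sketched as follows. This is a minimal illustration, not the paper's source code: the two sample rows, the tiny stop-word list and the helper names `load_articles` and `clean_text` are all assumptions made for the example. It drops rows whose text is missing and applies lowercasing, removal of punctuation and special characters, and stop-word removal.

```python
import csv
import io
import re

# Hypothetical stop-word list for illustration; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "that"}

# Two made-up rows with the columns listed above (id, title, author, text, label).
SAMPLE_CSV = """id,title,author,text,label
0,Example headline,Jane Doe,"The Senate passed the bill, and the vote is final!",0
1,Another headline,,,1
"""

def load_articles(handle):
    """Read rows and drop those whose text field is missing or blank."""
    rows = []
    for row in csv.DictReader(handle):
        if row["text"] and row["text"].strip():
            row["label"] = int(row["label"])
            rows.append(row)
    return rows

def clean_text(text):
    """Lowercase, strip punctuation and special characters, remove stop words."""
    lowered = text.lower()
    alnum_only = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [w for w in alnum_only.split() if w not in STOP_WORDS]

articles = load_articles(io.StringIO(SAMPLE_CSV))
tokens = clean_text(articles[0]["text"])
print(len(articles), tokens)
```

The second sample row has no text and is therefore skipped, which mirrors the "incomplete in some cases" note on the text attribute.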


Feature extraction and Pre-processing The embeddings used for the majority of the modeling are generated using the Doc2Vec model. The goal is to produce a vector representation of each article. Before applying Doc2Vec, we perform some basic pre-processing of the data. This includes removing stop words, deleting special characters and punctuation, and converting all text to lowercase. This produces a comma-separated list of words, which can be input into the Doc2Vec algorithm to produce a 300-length embedding vector for each article.

Doc2Vec is a model developed in 2014 based on the existing Word2Vec model, which generates vector representations for words. Word2Vec represents documents by combining the vectors of the individual words, but in doing so it loses all word order information. Doc2Vec expands on Word2Vec by adding a “document vector” to the output representation, which contains some information about the document as a whole, and allows the model to learn some information about word order. Preservation of word order information makes Doc2Vec useful for our application, as we are aiming to detect subtle differences between text documents.

Models The following learning algorithms are used in conjunction with the proposed methodology to evaluate the performance of fake news detection classifiers.

Naive Bayes In order to get a baseline accuracy rate for our data, I implemented a Naive Bayes classifier. Specifically, I used the scikit-learn implementation of Gaussian Naive Bayes. This is one of the simplest approaches to classification, in which a probabilistic approach is used, with the assumption that all features are conditionally independent given the class label. As with the other models, I used the Doc2Vec embeddings described above. The Naive Bayes rule is based on Bayes’ theorem:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{1}$$

Above,

• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood, which is the probability of predictor given class.
• P(x) is the prior probability of predictor.

Parameter estimation for naive Bayes models uses the method of maximum likelihood. The advantage here is that it requires only a small amount of training data to estimate the parameters.

Let’s understand it using an example. Below I have a training data set of weather and the corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify whether players will play or not based on the weather condition. Let’s follow the steps below.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, e.g. Overcast probability = 0.29 and probability of playing = 0.64.
which a probabilistic approach is used,


Fig 2. Training data
Fig 3. Frequency Table
Fig 4. Likelihood Table

Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.

Naive Bayes uses this method to predict the probability of each class based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.
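The three steps can be traced in a few lines of code. The sketch below is illustrative only: the frequency counts are assumed values, chosen so that they reproduce the two probabilities quoted in Step 2 (0.29 and 0.64), and are not read from the figures. It is also the categorical toy case, not the Gaussian Naive Bayes variant used on the Doc2Vec embeddings.

```python
# Assumed frequency table: weather -> counts of Play = yes / no.
# These counts are illustrative; they were chosen to match the two
# probabilities quoted in the text (0.29 and 0.64), not taken from Fig 3.
FREQ = {
    "Sunny": {"yes": 3, "no": 2},
    "Overcast": {"yes": 4, "no": 0},
    "Rainy": {"yes": 2, "no": 3},
}

def posterior(weather, play):
    """Bayes' theorem for one predictor: P(c|x) = P(x|c) * P(c) / P(x)."""
    total = sum(c["yes"] + c["no"] for c in FREQ.values())
    class_count = sum(c[play] for c in FREQ.values())
    prior = class_count / total                                      # P(c)
    likelihood = FREQ[weather][play] / class_count                   # P(x|c)
    evidence = (FREQ[weather]["yes"] + FREQ[weather]["no"]) / total  # P(x)
    return likelihood * prior / evidence

def predict(weather):
    """Step 3: pick the class with the highest posterior probability."""
    return max(("yes", "no"), key=lambda play: posterior(weather, play))

total_days = sum(c["yes"] + c["no"] for c in FREQ.values())
p_overcast = sum(FREQ["Overcast"].values()) / total_days
p_play = sum(c["yes"] for c in FREQ.values()) / total_days
print(round(p_overcast, 2), round(p_play, 2))
print(round(posterior("Sunny", "yes"), 2), predict("Sunny"))
```

With these assumed counts, a sunny day yields a higher posterior for playing than for not playing, so the predicted class is "yes".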
Fig 5. Naive-bayes.py
Accuracy- 72.94%

Support Vector Machine The original Support Vector Machine (SVM) was proposed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. That model can only perform linear classification, so it does not suit most practical problems. Later, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik introduced the kernel trick, which enables the SVM to perform non-linear classification and makes it much more powerful. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

Fig 6. Possible hyper planes


To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.

Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.

Fig 7. Support Vectors

We use the Radial Basis Function kernel in our project. The reason we use this kernel is that two Doc2Vec feature vectors will be close to each other if their corresponding documents are similar, so the distance computed by the kernel function should still represent the original distance. Since the Radial Basis Function is

(2)

it correctly represents the relationship we desire, and it is a common kernel for SVM.

We use the theory introduced in [citation missing] to implement the SVM. The main idea of the SVM is to separate different classes of data by the widest “street”. This goal can be represented as the optimization problem

(3)

Then we use the Lagrangian function to get rid of the constraints.

(4)
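For reference, the standard textbook forms of the Radial Basis Function kernel, the maximum-margin problem, its Lagrangian and the resulting dual are given below. They are offered as an assumption about what the numbered equations in this section express; the notation (w, b, α_i, σ) and the hard-margin setting are not taken from the paper.

```latex
% RBF kernel with bandwidth \sigma > 0 (assumed parameterization)
K(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)

% Hard-margin problem: the widest "street" between the two classes
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^{\top} x_i + b) - 1 \,\right]

% Dual problem after eliminating w and b (kernelized)
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

The dual is a quadratic program in the multipliers α_i, which is the kind of problem a convex optimization package can solve directly.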


(5)

(6)

(7)

Finally, we solve this optimization problem using the convex optimization tools provided by the Python package CVXOPT.
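A kernel SVM solver of this kind typically needs the pairwise kernel values between all training vectors. The sketch below computes them with a Radial Basis Function kernel in plain Python. It is an illustration under stated assumptions, not the paper's SVM.py: the bandwidth `sigma`, the helper names, and the three-dimensional toy vectors standing in for 300-dimensional Doc2Vec embeddings are all hypothetical.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(vectors, sigma=1.0):
    """Pairwise kernel values between all vectors (symmetric, ones on the diagonal)."""
    return [[rbf_kernel(u, v, sigma) for v in vectors] for u in vectors]

# Toy 3-dimensional stand-ins for the 300-dimensional Doc2Vec vectors.
doc_a = [0.10, 0.20, 0.30]
doc_b = [0.10, 0.25, 0.30]   # close to doc_a
doc_c = [2.00, -1.00, 0.50]  # far from doc_a

K = gram_matrix([doc_a, doc_b, doc_c])
print(K[0][1] > K[0][2])
```

Nearby vectors give a kernel value close to 1 and distant vectors a value close to 0, which is the property the text relies on when it says the kernel should still represent the original distance.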
Fig 8. SVM.py
Accuracy- 88.42%

Fig 9 (i)(ii) neural-net-keras.py
Fig 10(i). neural-net-tf.py
Feed-forward Neural Network Here I implemented two feed-forward neural network models, one using TensorFlow and one using Keras. Neural networks are commonly used in modern NLP applications, in contrast to older approaches which primarily focused on linear models such as SVMs and logistic regression. The neural network implementations use three hidden layers. In the TensorFlow implementation, all layers had 300 neurons each, and in the Keras implementation we used layers of size 256, 256, and 80, interspersed with dropout layers to avoid overfitting. For the activation function, we chose the Rectified Linear Unit (ReLU), which has been found to perform well in NLP applications.
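The forward pass of such a network can be sketched without any framework. The code below is a plain-Python illustration, not the TensorFlow or Keras code from the figures: the layer sizes 256, 256 and 80 and the 300-dimensional input come from the text, while the dropout rate of 0.5, the single output unit, the uniform weight initialisation and all function names are assumptions.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights has one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def dropout(values, rate, rng, training):
    """Inverted dropout: zero units at random in training, identity otherwise."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, rng, training=False, rate=0.5):
    hidden = x
    for weights, biases in layers[:-1]:
        hidden = dropout(relu(dense(hidden, weights, biases)), rate, rng, training)
    weights, biases = layers[-1]
    return dense(hidden, weights, biases)  # raw output score, no activation

rng = random.Random(0)
sizes = [300, 256, 256, 80, 1]  # Doc2Vec input, three hidden layers, one output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [rng.uniform(-1.0, 1.0) for _ in range(300)]
score = forward(x, layers, rng)
print(len(layers), len(score))
```

Dropout is only active when `training=True`; at prediction time every unit is kept, so repeated calls on the same input give the same score.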


Fig 10(ii)(iii)(iv) neural-net-tf.py

Long Short-Term Memory The Long Short-Term Memory (LSTM) unit was proposed by Hochreiter and Schmidhuber. It is good at classifying serialized objects because it selectively memorizes the previous input and uses that, together with the current input, to make a prediction. The news content (text) in our problem is inherently serialized. The order of the words carries important information about the sentence. So, the LSTM model suits our problem.

Since the order of the words is important for the LSTM unit, we cannot use Doc2Vec for pre-processing, because it transforms the entire document into one vector and loses the order information. To prevent that, we use word embedding instead. We first clean the text data by removing all characters which are neither letters nor numbers. Then we count the frequency of each word that appears in our training dataset to find the 5000 most common words and give each one a unique integer ID. For example, the most common word will have ID 0, the second most common one will have ID 1, etc. After that we replace each common word with its assigned ID and delete all uncommon words.

Fig 11. Frequency of top common words
Fig 12. Length of the news

Notice that the 5000 most common words cover most of the text, as shown in Figure 11, so we lose only a little information when we transform the string into a list of integers. Since the LSTM unit requires a fixed input vector length, we truncate lists longer than 500 numbers, because more than half of the news is longer than 500 words, as shown in Figure 12. Lists shorter than 500 numbers are padded with 0’s at the beginning. We also delete the data with only a few words, since they don’t carry enough information for training. By doing this, we transform the original text string into a fixed-length integer vector while preserving the word order information. Finally, we use word embedding to map each word ID to a 32-dimensional vector.

The word embedding trains each word vector based on word similarity. If two words frequently appear together in the text, they are considered more similar and the distance between their corresponding vectors is small.

The pre-processing transforms each news item from raw text into a fixed-size matrix. Then we feed the processed training data into the LSTM unit to train the model. The LSTM is still a neural network, but unlike the fully connected neural network, it has cycles in its neuron connections. So, the previous state (or memory) of the LSTM unit c_t will play a role in the new prediction h_t.
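The text-to-integer pipeline described above can be sketched as follows. It is a minimal illustration, not the paper's LSTM.py. The vocabulary size of 5000 and the sequence length of 500 come from the text; lowercasing, the `MIN_WORDS` threshold, keeping the first 500 IDs when truncating, and all helper names are assumptions. One deliberate departure: this sketch reserves ID 0 for padding and numbers words from 1, so that padding cannot be confused with the most common word.

```python
import re
from collections import Counter

VOCAB_SIZE = 5000   # number of most common words kept (from the text)
MAX_LEN = 500       # fixed sequence length (from the text)
MIN_WORDS = 3       # assumed threshold for "only a few words"
PAD_ID = 0          # assumption: 0 is reserved for padding, word IDs start at 1

def tokenize(text):
    """Keep only runs of letters and digits, lowercased."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts, size=VOCAB_SIZE):
    """Rank words by frequency and assign IDs 1..size to the most common ones."""
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = counts.most_common(size)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(text, vocab, max_len=MAX_LEN):
    """Map words to IDs, drop uncommon words, truncate, then left-pad with PAD_ID."""
    ids = [vocab[w] for w in tokenize(text) if w in vocab][:max_len]
    return [PAD_ID] * (max_len - len(ids)) + ids

def prepare(texts, vocab, max_len=MAX_LEN, min_words=MIN_WORDS):
    """Encode every text that has at least min_words tokens."""
    return [encode(t, vocab, max_len) for t in texts if len(tokenize(t)) >= min_words]

corpus = [
    "The vote is final; the vote stands.",
    "Too short",
    "The senate held the vote today!",
]
vocab = build_vocab(corpus, size=2)
sequences = prepare(corpus, vocab, max_len=6)
print(vocab, sequences)
```

In this toy run the vocabulary is limited to two words and the sequence length to six, so the effect of dropping uncommon words, discarding the very short document and left-padding is easy to see in the printed output.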


Fig 13(i)(ii)(iii)(iv) LSTM.py


Accuracy- 94.53%
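The way the previous memory c_{t-1} and previous output h_{t-1} feed into the new prediction h_t can be written out explicitly. The equations below are the commonly used LSTM formulation with a forget gate, given here as a reference sketch; the weight names (W, U, b) are conventional notation and are not taken from the paper.

```latex
f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right)          % forget gate
i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right)          % input gate
o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right)          % output gate
\tilde{c}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right)   % candidate memory
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t               % new cell state
h_t = o_t \odot \tanh(c_t)                                    % new output
```

Here x_t is the 32-dimensional embedding of the t-th word, σ is the logistic sigmoid and ⊙ denotes element-wise multiplication.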

Confusion matrix A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The main diagonal of the matrix gives all the correctly predicted results from the dataset, and the off-diagonal entries give all the incorrectly predicted results.
For example,

Fig 14. Confusion matrix for a binary classifier

The four values from the confusion matrix are the following:

• true positives (TP): These are cases which were predicted as positive (and were actually positive).
• true negatives (TN): These are cases which were predicted as negative (and were actually negative).
• false positives (FP): These are cases which were predicted as positive (but were actually negative). (Also known as a "Type I error.")
• false negatives (FN): These are cases which were predicted as negative (but were actually positive). (Also known as a "Type II error.")

The following rates are often computed from a confusion matrix:

• Accuracy: Overall, how often is the classifier correct?
o (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
o (FP+FN)/total = (10+5)/165 = 0.09
o equivalent to 1 minus Accuracy
o also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes?
o TP/actual yes = 100/105 = 0.95
o also known as "Sensitivity" or "Recall"
• False Positive Rate: When it's actually no, how often does it predict yes?
o FP/actual no = 10/60 = 0.17
• True Negative Rate: When it's actually no, how often does it predict no?
o TN/actual no = 50/60 = 0.83
o equivalent to 1 minus False Positive Rate
o also known as "Specificity"
• Precision: When it predicts yes, how often is it correct?
o TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
o actual yes/total = 105/165 = 0.64

Confusion matrices for our research models are as follows:

Fig 15. Confusion matrix for Naive Bayes
Accuracy- (1188+1803)/4153 = 72.94%
Fig 16. Confusion matrix for SVM
Accuracy- (1693+1969)/4153 = 88.42%

Fig 17. Confusion matrix for Neural Network using TensorFlow
Accuracy- (1452+1947)/4153 = 81.42%

Fig 18. Confusion matrix for Neural Network using Keras
Accuracy- (1529+1540)/4153 = 92.62%

Fig 19. Confusion matrix for LSTM
Accuracy- (1982+1920)/4153 = 94.53%

Conclusion In this paper, various Natural Language Processing techniques for detecting whether a news article is fake or genuine are compared. The following results can be drawn from the models and conclude the research.

(i). A comparison of the models using their confusion matrices to calculate the Precision, Recall and F1 scores.

Fig 20. Model performance on the test set

(ii). A comparison of the accuracies of the models.

Fig 21. Accuracy table for the models, shows highest for LSTM
Acknowledgement This work benefitted from the invaluable guidance of Ms. C Fancy, who provided valuable feedback during the final drafting of the paper. Her support is gratefully acknowledged.


Copyright protected @ ENGPAPER.COM and AUTHORS
https://www.engpaper.com
