Machine Learning Fake News Blocking
Harshvardhan Singh
Department of Engineering and Technology,
SRM Institute of Science and Technology, Kattankulathur,
Kancheepuram Dist., India, 603203
e-mail: [email protected]
Data
The datasets used for this project were drawn from Kaggle. The training dataset has about 16,600 rows of data from various articles on the internet. Quite a bit of pre-processing of the data had to be done, as is evident from our source code, in order to train our models.

Dataset Description
In a dataset, a training set is used to build a model, while a test (or validation) set is used to validate the model built. Data points in the training set are excluded from the test (validation) set. Usually, a dataset is divided into a training set and a validation set (some people use "test set" instead) in each iteration, or into a training set, a validation set and a test set in each iteration.

The research dataset includes the following. A full training dataset has the following attributes:

1. id: unique id for a news article
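A minimal sketch of loading this training data and holding out a validation set is given below; the file name train.csv and the column names "text" and "label" are assumptions here, not taken from the paper's source code.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle training data (~16,600 articles); file and column names are assumed.
df = pd.read_csv("train.csv")
df = df.dropna(subset=["text", "label"])

# Hold out 20% of the articles as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)
print(len(X_train), "training articles,", len(X_val), "validation articles")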
Feature Extraction and Pre-processing
The embeddings used for the majority of the modeling are generated using the Doc2Vec model. The goal is to produce a vector representation of each article. Before applying Doc2Vec, we perform some basic pre-processing of the data. This includes removing stop words, deleting special characters and punctuation, and converting all text to lowercase. This produces a comma-separated list of words, which can be input into the Doc2Vec algorithm to produce a 300-length embedding vector for each article.

Doc2Vec is a model developed in 2014 based on the existing Word2Vec model, which generates vector representations for words. Word2Vec represents documents by combining the vectors of the individual words, but in doing so it loses all word-order information. Doc2Vec expands on Word2Vec by adding a "document vector" to the output representation, which contains some information about the document as a whole and allows the model to learn some information about word order. This preservation of word-order information makes Doc2Vec useful for our application, as we are aiming to detect subtle differences between text documents.

Models
The following learning algorithms are used in conjunction with the proposed methodology to evaluate the performance of fake news detection classifiers.

Naive Bayes
In order to get a baseline accuracy rate for our data, I implemented a Naive Bayes classifier. Specifically, I used the scikit-learn implementation of Gaussian Naive Bayes. This is one of the simplest approaches to classification, in which a probabilistic approach is used, with the assumption that all features are conditionally independent given the class label. As with the other models, I used the Doc2Vec embeddings described above. The Naive Bayes rule is based on Bayes' theorem:

P(c|x) = P(x|c) P(c) / P(x)        (1)

Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.

Parameter estimation for naive Bayes models uses the method of maximum likelihood. The advantage here is that it requires only a small amount of training data to estimate the parameters.

Let's understand it using an example. Below is a training data set of weather conditions and the corresponding target variable 'Play' (suggesting the possibility of playing). We need to classify whether players will play or not based on the weather condition, following the steps below.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, e.g. the probability of Overcast is 0.29 and the probability of playing is 0.64.
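To make these steps concrete, the short calculation below applies equation (1) to the common 14-row weather table from which the quoted values (Overcast probability 0.29, playing probability 0.64) are drawn; the individual counts are assumptions taken from that toy table, not from the paper.

# Worked application of Bayes' theorem, equation (1), on the toy weather data.
# Counts assumed from the standard 14-row "Play" table, where
# P(Overcast) = 4/14 ≈ 0.29 and P(Play = Yes) = 9/14 ≈ 0.64.
n_total = 14            # total observations
n_yes = 9               # days on which Play = Yes
n_sunny = 5             # days with Outlook = Sunny
n_sunny_and_yes = 3     # Sunny days among the Yes days

p_yes = n_yes / n_total                      # P(c)   ≈ 0.64
p_sunny = n_sunny / n_total                  # P(x)   ≈ 0.36
p_sunny_given_yes = n_sunny_and_yes / n_yes  # P(x|c) ≈ 0.33

# P(c|x) = P(x|c) * P(c) / P(x)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(f"P(Play=Yes | Outlook=Sunny) = {p_yes_given_sunny:.2f}")  # ≈ 0.60

Since the posterior is above 0.5, the classifier would predict that the players play on a sunny day.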
Fig 5. Naive-bayes.py
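The Naive-bayes.py script shown in Fig 5 is not reproduced in this text. The sketch below is only an illustration of that step, assuming gensim's Doc2Vec for the 300-length embeddings and scikit-learn's GaussianNB, and reusing X_train, X_val, y_train and y_val from the loading sketch above; the hyperparameters are illustrative, not the authors'.

import re
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def tokenize(text):
    # Lowercase and strip special characters/punctuation (stop-word removal omitted for brevity).
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

# Train Doc2Vec on the training articles and produce a 300-length vector per article.
tagged = [TaggedDocument(tokenize(t), [i]) for i, t in enumerate(X_train)]
d2v = Doc2Vec(tagged, vector_size=300, window=5, min_count=2, epochs=20)

train_vecs = [d2v.infer_vector(tokenize(t)) for t in X_train]
val_vecs = [d2v.infer_vector(tokenize(t)) for t in X_val]

# Gaussian Naive Bayes baseline on the document vectors.
nb = GaussianNB().fit(train_vecs, y_train)
print("validation accuracy:", accuracy_score(y_val, nb.predict(val_vecs)))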
Accuracy: 72.94%
Fig 6. Possible hyperplanes
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence. Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation.

We use this theory to implement the SVM. The main idea of the SVM is to separate different classes of data by the widest "street". This goal can be represented as the optimization problem

minimize (1/2) ||w||^2   subject to   y_i (w · x_i + b) ≥ 1 for all i        (3)

Then we use the Lagrangian function to get rid of the constraints.
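The paper's SVM implementation is not shown in this text version; the sketch below is a stand-in that assumes scikit-learn's SVC with a linear kernel, applied to the same Doc2Vec vectors (train_vecs, val_vecs) built above, rather than the authors' own solution of the optimization problem (3).

from sklearn.svm import SVC

# Maximum-margin (widest "street") classifier on the 300-length document vectors.
svm = SVC(kernel="linear", C=1.0)
svm.fit(train_vecs, y_train)

print("validation accuracy:", svm.score(val_vecs, y_val))
print("support vectors per class:", svm.n_support_)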
LSTM
For the LSTM model, the raw text is first converted into integer word IDs: the most common word will have ID 0, the second most common one will have ID 1, and so on. After that, we replace each common word with its assigned ID and delete all uncommon words. We also delete the data with only a few words, since they do not carry enough information for training. By doing this, we transfer the original text string into a fixed-length integer vector while preserving the word-order information. Finally, we use word embedding to transfer each word ID into a 32-dimension vector.

The word embedding will train each word vector based on word similarity. If two words frequently appear together in the text, they are thought to be more similar and the distance between their corresponding vectors is small.

Fig 11. Frequency of top common words

The pre-processing transfers each news item in raw text into a fixed-size matrix. Then we feed the processed training data into the LSTM unit to train the model. The LSTM is still a neural network, but unlike a fully connected neural network, it has cycles in its neuron connections.
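A minimal sketch of this pipeline is given below, assuming the Keras Tokenizer, Embedding and LSTM layers and reusing X_train, X_val, y_train and y_val from the loading sketch; the vocabulary size, sequence length and LSTM width are illustrative assumptions, and only the 32-dimension embedding comes from the text. (Note that the Keras Tokenizer assigns ID 1, not 0, to the most common word and reserves 0 for padding.)

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, max_len = 5000, 400  # assumed sizes

# Keep only the most common words, replace them with integer IDs,
# and pad/truncate every article to a fixed length.
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(X_train)
seq_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len)
seq_val = pad_sequences(tokenizer.texts_to_sequences(X_val), maxlen=max_len)

# Embed each word ID as a 32-dimension vector, then classify with an LSTM.
model = Sequential([
    Embedding(vocab_size, 32),
    LSTM(100),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(seq_train, np.asarray(y_train),
          validation_data=(seq_val, np.asarray(y_val)),
          epochs=3, batch_size=64)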
Accuracy: (1693 + 1969) / 4153 = 88.42%
Fig 21.
The accuracy table for the models shows the highest accuracy for the LSTM.

Acknowledgement
This work benefited from the invaluable guidance of Ms. C. Fancy, who provided valuable feedback during the final drafting of the paper; her support is gratefully acknowledged.

References
[1] Datasets, Kaggle, https://www.kaggle.com/c/fake-news/data, February 2018.
[2] Sepp Hochreiter; Jürgen Schmidhuber (21 August 1995), Long Short Term Memory.
[3] Allcott, H., and Gentzkow, M., Social Media and Fake News in the 2016 Election, https://web.stanford.edu/~gentzkow/research/fakenews.pdf, January 2017.
[4] Quoc, L., Mikolov, T., Distributed Representations of Sentences and Documents, https://arxiv.org/abs/1405.4053, May 2014.
[5] Christopher, M. Bishop, Pattern Recognition and Machine Learning, http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf, April 2016.
[6] Goldberg, Y., A Primer on Neural Network Models for Natural Language Processing, https://arxiv.org/pdf/1510.00726.pdf, October 2015.
[7] Hochreiter, S., Jürgen, S., Long short-term memory, http://www.bioinf.jku.at/publications/older/2604.pdf, October 1997.
[8] Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer.
[9] Machine learning and pattern recognition "can be viewed as two facets of the same field."
[10] Sepp Hochreiter; Jürgen Schmidhuber (1997). "Long short-term memory".
[11] Senior, Andrew; Beaufays, Françoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling".
[12] Li, Xiangang; Wu, Xihong (2014-10-15). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition".
[13] Sepp Hochreiter; Jürgen Schmidhuber (1997). "LSTM can Solve Hard Long Time Lag Problems".
[14] Klaus Greff; Rupesh Kumar Srivastava; Jan Koutník; Bas R. Steunebrink; Jürgen Schmidhuber (2015). "LSTM: A Search Space Odyssey". IEEE Transactions on Neural Networks and Learning Systems.
[15] Beaufays, Françoise (August 11, 2015). "The neural networks behind Google Voice transcription". Research Blog. Retrieved 2017-06-27.
[16] Sak, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise; Schalkwyk, Johan (September 24, 2015). "Google voice search: faster and more accurate". Research Blog. Retrieved 2017-06-27.
[17] Cortes, Corinna; Vapnik, Vladimir N. (1995). "Support-vector networks".
[18] Ben-Hur, Asa; Horn, David; Siegelmann, Hava; Vapnik, Vladimir N. (2001). "Support vector clustering".
[19] "1.4. Support Vector Machines — scikit-learn 0.20.2 documentation". Archived from the original on 2017-11-08. Retrieved 2017-11-08.
[20] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
[21] Press, William H.; Teukolsky, Saul A.; Vetterling, William T.;