Identifying Fake News
Akshay Tarate1, Shubham Aglawe2, Aditya Hirve3, Anil Rathod4, Trupti Dange5
(Computer Engineering, RMDSSOE, SPPU, India)
ABSTRACT:
The rapid development of the Internet allows information to spread quickly via social media and websites. Social media now plays a vital role in the public dissemination of information about events. Without concern for credibility, unverified or fake news spreads through social networks and reaches thousands of users. Fake news is typically generated for commercial or political gain, to mislead and attract readers, and its spread poses a serious challenge to society. Automatic credibility analysis of news articles is therefore an active research interest. Deep learning models are widely used for linguistic modelling; typical architectures such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) can detect complex patterns in textual data. Long Short-Term Memory (LSTM) is a recurrent neural network used to analyse variable-length sequential data, and a bidirectional LSTM examines each sequence both front-to-back and back-to-front. This paper presents a fake news detection model based on an LSTM recurrent neural network. Two publicly available unstructured news-article datasets are used to assess the model's performance. The results show that the LSTM model is more accurate than other methods, namely CNN, vanilla RNN, and unidirectional LSTM, for fake news detection.
Keywords - Deep learning; Convolutional Neural Network; Recurrent Neural Network; Long Short-Term Memory (LSTM)
I. INTRODUCTION
Fake news is yellow journalism: deliberate misinformation or smears spread through both conventional print media and modern online social media. There are several barriers to recognising fake news on social media. Firstly, fake news data is hard to collect, and the manual labelling of false news is difficult. Because such stories are published deliberately to confuse readers, they are hard to identify from news content alone. In addition, much sharing happens inside closed platforms such as Facebook, WhatsApp, and Twitter, so users find it difficult to accept disinformation distributed by trusted news agents or by their friends and family as false. Finally, the credibility of fresh, time-bound news is not easy to check, because there is not enough such data for the application to be trained on.
The topic of disinformation on social media can be addressed in a number of ways. Statistical methods are
used to determine the relationship between different aspects of the information, analyse the information's originator,
and examine distribution patterns. Untrustworthy content is classified using machine learning algorithms, and the
accounts that post it are investigated. Various methods concentrate on the development of strategies for knowledge
authentication as well as case studies.
We frame fake news detection as a credibility-assessment problem, in which genuine news items are more credible while unauthentic ones are less credible.
When a user submits a query to the system, Naive Bayes classifies the query and the LSTM produces a probability score.
Doc2Vec is based on the Word2Vec model and is used to preserve word-order information. It extracts Word2Vec features and adds an additional "document vector" that carries information about the entire document.
V. ALGORITHMS
1. Naive Bayes
The Naive Bayes model is simple to construct and works especially well on large datasets. Despite its simplicity, Naive Bayes often performs competitively with far more advanced classification systems. Bayes' theorem computes the posterior probability P(c|x) from the prior P(c), the evidence P(x), and the likelihood P(x|c). Consider the following equation:
P(c|x) = P(x|c) · P(c) / P(x)
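A quick worked instance of this formula, with made-up probabilities for illustration (c = "the article is fake", x = "the headline contains the word 'shocking'"):

```python
# Worked Bayes' theorem example with illustrative (made-up) numbers.
p_c = 0.4          # prior P(c): fraction of articles that are fake
p_x_given_c = 0.5  # likelihood P(x|c): "shocking" appears in half of fake articles
p_x = 0.25         # evidence P(x): "shocking" appears in a quarter of all articles

# Posterior: P(c|x) = P(x|c) * P(c) / P(x)
p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)  # 0.8
```

So observing the word raises the estimated probability that the article is fake from 0.4 to 0.8; a Naive Bayes classifier applies this update independently for every feature.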
2. LSTM
LSTMs are specifically designed to avoid the long-term dependency problem. Remembering information over long periods is practically their default behaviour, not something they struggle to learn.
All recurrent neural networks are made up of a series of repeated neural network modules. This repeating
module in ordinary RNNs will have a relatively simple structure, such as a single tanh layer. LSTMs have a
chain-like structure as well, but the repeating module is different.
Instead of a single neural network layer, there are four, each of which interacts in a unique way.
Each line in the figure transmits a full vector from one node's output to the inputs of others. The pink circles denote pointwise operations, such as vector addition, and the yellow boxes denote learned neural-network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies sent to different locations.
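A single LSTM cell step can be sketched in NumPy to make the four interacting layers concrete. The weights below are random and purely illustrative; the structure (forget, input, candidate, and output layers acting on the concatenated previous hidden state and input) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3  # input and hidden sizes (illustrative)

# One learned layer (weight matrix + bias) per gate: forget, input,
# candidate, output. Each sees [h_prev, x] concatenated.
W = {g: rng.normal(size=(n_hid, n_hid + n_in)) * 0.1 for g in "fico"}
b = {g: np.zeros(n_hid) for g in "fico"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])          # merging lines = concatenation
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i = sigmoid(W["i"] @ z + b["i"])         # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state (tanh layer)
    o = sigmoid(W["o"] @ z + b["o"])         # output gate
    c = f * c_prev + i * c_tilde             # pointwise ops on the cell-state line
    h = o * np.tanh(c)                       # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
print(h.shape, c.shape)  # (3,) (3,)
```

In practice the four weight matrices are learned by backpropagation through time; this sketch only shows how one time step combines them.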
3. SVM
In the SVM algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then accomplished by locating the hyperplane that best distinguishes the two classes. Support vectors are the coordinates of the individual observations that lie closest to this boundary, and the SVM classifier is the frontier (hyperplane/line) that separates the two classes as widely as possible.
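A minimal sketch of this with scikit-learn's `SVC` (assumed available); the two toy clusters below stand in for feature vectors of real and fake articles.

```python
# Toy linearly separable data: two clusters in 2-D feature space.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Points near each cluster fall on the corresponding side of the hyperplane.
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # [0 1]
```

After fitting, `clf.support_vectors_` exposes exactly the boundary-defining observations described above.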
4. Logistic Regression
Logistic regression is one of the most common machine learning algorithms under the supervised learning approach. Despite its name, it is used for classification: with the help of independent variables, it predicts a categorical dependent variable, so the outcome of a logistic regression problem can only be 0 or 1. It is used when the probability of membership in one of two classes must be estimated, for example whether it will rain today or not: 0 or 1, true or false, and so on.
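The 0/1 outcome comes from squashing a weighted sum through the sigmoid function and thresholding it; a small self-contained sketch, with hypothetical weights (e.g. humidity and cloud cover predicting rain):

```python
import math

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias=0.0, threshold=0.5):
    # Weighted sum of the independent variables, then sigmoid, then threshold.
    z = sum(w * f for w, f in zip(weights, features)) + bias
    p = sigmoid(z)
    return (1 if p >= threshold else 0), p

# Hypothetical trained weights: features are humidity and cloud cover.
label, prob = predict([0.9, 0.8], weights=[2.0, 3.0], bias=-2.5)
print(label)  # 1 ("rain")
```

In a trained model the weights and bias would be fit by maximum likelihood rather than chosen by hand.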
5. Random Forest
Random forests, also known as random decision forests, are an ensemble learning method for classification and other tasks. They work by building a large number of decision trees during training and then outputting the class that is the mode of the individual trees' classes (classification) or their mean/average prediction (regression). Random forests generally outperform single decision trees, though their accuracy is lower than that of gradient-boosted trees.
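The voting behaviour can be seen with scikit-learn's `RandomForestClassifier` (assumed available); the data is the same illustrative toy set as above.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# Each of the 10 trees votes; the predicted class is the mode of the votes.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 0.5], [4, 4]]))  # [0 1]
```

Each tree is trained on a bootstrap sample with a random subset of features per split, which is what decorrelates the votes.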
VII. PROPOSED MODEL
System Architecture
● We start by collecting the data to train our model. The first step is data pre-processing, in which we clean the data, i.e. remove blank spaces, noise, etc.
● The processed data is tokenized and tagged; after this we collect the verbs and topics to which the news is related.
● We then vectorize the data using the Doc2Vec model, which makes the next stage easier.
● This data, in vector form, is then processed through our classification models, which classify the news according to topic.
● Four types of classifier are used, i.e. Naive Bayes, SVM, Logistic Regression, and Random Forest; all these models run simultaneously using pipelining.
● Whichever model provides the best and fastest result is selected and passed to the main model, i.e. the LSTM.
● Using this model, the data is checked via web search and scored accordingly.
● The score is divided into three categories:
1. Time Credibility: how much time was required for the data to be found; real news does not take long to surface in a search. This accounts for 40% of the score.
2. Website Credibility: whether the website URL belongs to a trusted site. This accounts for 40% of the score.
3. Data Credibility: the headline and body are compared with the entered data. This accounts for 20% of the score.
● After this process the component scores are summed and averaged.
● If the score is above 60 we treat the item as real news; if it is below 60 we additionally search for the news on the social media app Twitter.
● After getting the score from this social media module, we add it to the previously fetched web-search score and take the average. If the combined score is above 60, the item is termed real news; if it is below 60, it is termed fake news.
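The scoring scheme above can be sketched as follows. This is one illustrative interpretation of the paper's description: the 40/40/20 percentages are read as weights on component scores assumed to lie on a 0-100 scale, and the component values in the example are made up.

```python
# Illustrative sketch of the 40/40/20 weighted credibility score
# and the 60-point decision threshold.
WEIGHTS = {"time": 0.4, "website": 0.4, "data": 0.2}

def credibility_score(time_score, website_score, data_score):
    # Weighted combination of the three credibility components (each 0-100).
    return (WEIGHTS["time"] * time_score
            + WEIGHTS["website"] * website_score
            + WEIGHTS["data"] * data_score)

def classify(web_score, social_score=None, threshold=60):
    # If the web-search score alone clears the threshold, decide immediately;
    # otherwise average it with the Twitter module's score before deciding.
    score = web_score if web_score > threshold else (web_score + social_score) / 2
    return "real" if score > threshold else "fake"

web = credibility_score(time_score=80, website_score=90, data_score=70)  # 82.0
print(classify(web))                                            # real
print(classify(credibility_score(30, 40, 50), social_score=20)) # fake
```

In the first example the web-search score alone (82) exceeds 60, so the Twitter module is never consulted; in the second, the low web score (38) triggers the social-media fallback, and the averaged score (29) still falls below 60.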
VIII. The fake news detection system can be divided into three major parts:
1. The web application is the front end from which users can ask questions related to anything from politics and finance to other general news they received via WhatsApp or any social media site. Here the user is told whether the news is fake or truthful.
2. In the processing part, the gathered information is cleaned of noisy data, then tokenized and tagged. Classifiers classify the news according to topic. Four types of classifier are used, i.e. Naive Bayes, SVM, Logistic Regression, and Random Forest; all these models run simultaneously using pipelining.
3. The LSTM model checks the data using web search, and it is scored accordingly. The score is divided into three categories: time credibility, website credibility, and data credibility. The average of these scores is taken: if it is above 60 the item is real news, and if it is below 60 it is fake news.
IX. CONCLUSION
In the 21st century, much of our work is done digitally. Applications such as Facebook and Twitter, and online news articles, are replacing media that were previously preferred as hard copies. The growing problem of fake news only complicates matters, seeking to alter or obstruct people's views and attitudes toward digital technology. Google and Facebook have therefore taken action to discourage the dissemination of false news and stop the phenomenon. Our system takes a URL or an entry from an existing database and marks the item as genuine or fake; different algorithms and machine learning techniques are used to implement this.
The proposed model works well for both balanced and imbalanced high-dimensional news data. In the future, more in-depth research will be needed to better understand how a deep learning model with attention can aid the automated credibility analysis of news.