Fake News Project
CHAPTER 1
INTRODUCTION
1.1 Overview
The rapid growth of fake news, especially on social media, has become a challenging problem with negative social impacts on a global scale. The ever-growing volume of fake news has turned into a significant global problem, as it is difficult to distinguish between genuine and fake news. Hence, fake news detection has become a very important, yet technically very challenging, task.
In recent years, online content has been playing a significant role in shaping users' decisions and opinions. Opinions such as online reviews are the main source of information for e-commerce customers, helping them gain insight into the products they are planning to buy. Recently it has become apparent that opinion spam does not only exist in product reviews and customer feedback. In fact, fake news and misleading articles are another form of opinion spam, one which has gained traction. Some of the biggest sources of spreading fake news are social media websites such as Google Plus.
Even though the problem of fake news is not a new issue, detecting fake news is believed to be a complex task, given that humans tend to believe misleading information and that the spread of fake content is difficult to control. Fake news has been getting more attention in the last couple of years, especially since the US election in 2016. It is tough for humans to detect fake news. It can be argued that the only way for a person to manually identify fake news is to have vast knowledge of the covered topic. Even with such knowledge, it is considerably hard to successfully determine whether the information in an article is real or fake. The open nature of the web and social media, in addition to recent advances in computer science, simplify the process of creating and spreading fake news.
Fake news can broadly be divided into three groups:
The first group is fake news, which is news that is completely fake and is made up by the writers of the articles.
The second group is fake satire news, which is fake news whose main purpose is to provide humor to the readers.
The third group is poorly written news articles, which have some degree of real news but are not entirely accurate. In short, it is news that uses, for example, quotes from political figures to report a fully fake story. Usually, this kind of news is designed to promote a certain agenda or biased opinion.
1.2 Overview of Algorithms
The most popular method to calculate word frequencies is TF-IDF. These are the components of the resulting score assigned to each word.
Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear frequently across documents.
w(i,j) = tf(i,j) × log(N / df(i))
where,
w(i,j) = weight of the cell in the matrix, which signifies how important word i is for document j
tf(i,j) = number of times term i occurs in document j divided by the total number of terms in j
df(i) = number of documents containing term i
N = total number of documents
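As a small illustration of the formula above (the counts are hypothetical, not taken from the project dataset), consider a term that occurs 3 times in a 100-word document and appears in 10 out of 1,000 documents:

import math

tf = 3 / 100                # term frequency within the document
idf = math.log(1000 / 10)   # inverse document frequency across the corpus
weight = tf * idf           # TF-IDF weight, approximately 0.138
print(round(weight, 3))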
The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternatively, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
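A minimal sketch of the two options described above, using scikit-learn; the toy documents are placeholders:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["the election was rigged", "the election results are official"]  # placeholder documents

# Option 1: TfidfVectorizer tokenizes, learns the vocabulary and IDF weights, and encodes in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Option 2: reuse an already fitted CountVectorizer and learn only the IDF weightings
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)
transformer = TfidfTransformer()
tfidf_from_counts = transformer.fit_transform(counts)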
CHAPTER 2
OBJECTIVES
The main objective is to detect fake news, which is a classic text classification problem with a straightforward proposition. A model needs to be built that can differentiate between "Real" news and "Fake" news.
This project proposes a feasible method, which contains several aspects, to accurately tackle the fake news detection issue. Thus it is a combination of semantic analysis using NLP techniques.
The proposed method is entirely composed of Artificial Intelligence approaches, which is critical to accurately classify between real and fake news.
The three-part method is a combination of Machine Learning algorithms that subdivide into supervised learning techniques, and natural language processing methods.
Although each of the above-mentioned approaches can be used on its own to classify and detect fake news, in order to increase the accuracy and be applicable to the social media domain, they have been combined into an integrated algorithm as a method for fake news detection.
CHAPTER 3
PROBLEM DEFINITION
While it is easier to understand and trace the intention and the impact of fake reviews, the intention and the impact of creating propaganda by spreading fake news cannot be measured or understood easily. For instance, it is clear that fake reviews affect product owners, customers and online stores; on the other hand, it is not easy to identify the entities affected by fake news. This is because identifying these entities requires measuring the news propagation, which has been shown to be complex and resource intensive.
The extensive spread of fake news has the potential for extremely negative impacts on individuals and society. Therefore, fake news detection on social media has recently become an emerging research area that is attracting tremendous attention.
Research on fake news detection is still at an early stage, as this is a relatively recent
phenomenon, at least regarding the interest raised by society. There exists a large body of
research on the topic of machine learning methods for deception detection; most of it has
been focusing on classifying online reviews and publicly available social media posts.
Particularly since the American Presidential election in late 2016, the question of determining 'fake news' has also been the subject of particular attention within the literature.
In [4], Shloka Gilda presented a concept of how NLP is relevant to detecting fake news. They used a Count Vectorizer (CV) of bi-grams and probabilistic context free grammar (PCFG) features for deception detection. They examined their dataset with multiple classification algorithms to find the best model. They found that a CV of bi-grams fed into a Stochastic Gradient Descent model identifies non-credible sources with an accuracy of 71.2%.
The lack of available corpora for predictive modeling is an important limiting factor in
designing effective models to detect fake news.
Disadvantages:
The accuracy dropped to 71.2% when predicting fake news against real news.
CountVectorizer only counts the number of times a word appears in the document, which results in biasing in favour of the most frequent words. This ends up ignoring rare words which could have helped us in processing our data more efficiently.
A machine learning technique is used to detect fake news, which consists of using text analysis based on classification techniques. Experimental evaluation is conducted using a dataset compiled from real and fake news websites, yielding very encouraging results.
The actual goal is to develop a model that includes the text transformation and the choice of which type of text to use (headlines versus full text). The next step is to extract the most optimal features for the TF-IDF Vectorizer. This is done by using the n most frequent words and/or phrases, lower casing or not, mainly removing the stop words (common words such as "the", "when", and "there") and only using those words that appear at least a given number of times in a given text dataset.
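These feature extraction choices map directly onto TfidfVectorizer parameters in scikit-learn; the concrete values below are illustrative assumptions rather than the project's tuned settings:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,     # keep only the n most frequent words/phrases
    lowercase=True,        # lower case the text before tokenizing
    stop_words="english",  # remove common words such as "the", "when", "there"
    min_df=2,              # use only words that appear in at least this many documents
    ngram_range=(1, 2),    # single words and two-word phrases
)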
CHAPTER 4
LITERATURE SURVEY
In December 2016, the first Fake News Challenge dataset (FNC-1) was launched. This dataset contains article bodies and headlines from news articles. The stance detection task extends the work of the Emergent dataset, estimating the stance of a body text relative to a headline. The published FNC-1 dataset contains four possible classes: agree, disagree, discuss or unrelated. Challenge winners reported an accuracy of 82.02%. Recently there have been several works related to fake news.
Golbeck et al. [1] presented a dataset of fake news and satirical stories that are hand-coded,
verified, and in the case of fake news, include rebutting stories. The dataset contains 283 fake
news stories and 203 satirical stories chosen from a diverse set of sources.
Shloka Gilda [4] presented a concept of how NLP is relevant to detecting fake news. They used term frequency-inverse document frequency (TF-IDF) of bigrams and probabilistic context free grammar (PCFG) features for deception detection.
Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. [5] observe that social media for news consumption is a double-edged sword. On the one hand, its low cost, easy access, and rapid dissemination of information lead people to seek out and consume news from social media. On the other hand, it enables the wide spread of "fake news", i.e., low quality news with intentionally false information.
Long et al. [7] propose a hybrid attention-based Long Short-Term Memory (LSTM) model that analyses the profile of the speakers, specifically considering the speaker's credit history (statements declared in the past), and report an accuracy of 41.5%.
W. Ferreira and A. Vlachos [8] present Emergent, a dataset collected and annotated by journalists. The task involves stance detection, i.e. estimating the relative perspective of two pieces of text with respect to a topic, claim or issue. It contains 300 rumoured claims and 2,595 associated news articles, categorized into 3 classes: true, false or unverified.
Conroy, Rubin, Cornwell and Chen [9] provide a conceptual overview of satire and humor,
elaborating and illustrating the unique features of satirical news, which mimics the format
and style of journalistic reporting. Satirical news stories were carefully matched and
examined in contrast with their legitimate news counterparts in 12 contemporary news topics
in 4 domains (civics, science, business, and “soft” news).
CHAPTER 5
SYSTEM DESIGN
HARDWARE REQUIREMENTS:
Processor - Any processor above 500 MHz
RAM - 4GB
Hard Disk - 250GB
SOFTWARE REQUIREMENTS:
Operating system - Windows 10
Programming Language - Python 3.5
Packages - Numpy, Pandas, Scikit-learn, Keras, Scipy, Gensim, Shutil, Pillow,
Tensorflow, Nltk
Platform - PyCharm
CHAPTER 6
METHODOLOGY
The basic idea of our project is to build a model that can predict the credibility of news events. As shown in Fig 6.1, the proposed framework consists of five major steps: Data acquisition, Data pre-processing, Feature extraction, Model construction and Model evaluation. In the first step, key phrases of the news event that the individual needs to authenticate are taken as input. After that, data is collected from a repository. The data pre-processing unit is responsible for preparing the data for further processing. Feature extraction is based on NLP techniques. A classification model is built using a Naïve Bayes classifier, a Support Vector Machine and a Long Short-Term Memory network. By evaluating the results obtained from the classification and analysis models using accuracy and the confusion matrix, it is possible to decide whether a piece of news is fake or real.
[Fig 6.1: Proposed framework: Data acquisition → Data pre-processing → Feature extraction → Model construction → Model evaluation]
There are two parts in the data-acquisition process, "fake news" and "real news". Collecting the fake news was easy, as Kaggle released a fake news dataset consisting of 13,000 articles published during the 2016 election cycle. The latter part, obtaining real news to pair with the fake news dataset, is much more difficult. It required work across many sites, because the only way to collect it was to web-scrape thousands of articles from numerous websites. With the help of web scraping, a real news dataset of 5,279 articles was generated, mostly from media organizations.
This project includes news samples as datasets from a repository such as scikit-learn. The dataset includes the body of the news article, the headline of the news article, and the label for the relatedness of an article and its headline. In this project, we have used various natural language processing techniques and machine learning algorithms to classify fake news articles using scikit-learn libraries in Python.
Text data requires special pre-processing before machine learning algorithms can be applied to it. This process is also called Data Cleaning. There are various techniques widely used to convert text data into a form that is ready for modeling. The data pre-processing steps outlined below are applied to both the headlines and the news articles.
1. Stop Word Removal :
We start with the removal of stop words from the available text data. Stop words (the most common words in a language, which do not provide much context) can be processed and filtered from the text, as they are very common and hold little useful information. Stop words act more like a connecting part of the sentences, for example, conjunctions like "and", "or" and "but", prepositions like "of", "in", "from", "to", etc. and the articles "a", "an", and "the". Such stop words, which are of less importance, may take up valuable processing time, and hence removing them as part of data pre-processing is a key first step in natural language processing. We used the Natural Language Toolkit (NLTK) library to remove stop words. Figure 6.2 illustrates an example of stop word removal.
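A small sketch of stop word removal with NLTK; the sample sentence is a placeholder, and the corpora downloads are shown for completeness:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
sentence = "The results of the election were disputed in the media"
tokens = word_tokenize(sentence.lower())
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['results', 'election', 'disputed', 'media']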
2. Punctuation Removal :
3. Stemming :
Stemming is a technique to remove prefixes and suffixes from a word, ending up with the stem or root word. Using stemming we can reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Figure 6.4 shows an example of the stemming technique.
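A brief sketch of stemming with NLTK's PorterStemmer; the example words are placeholders:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["reporting", "reported", "reports", "misleading"]
print([stemmer.stem(word) for word in words])  # ['report', 'report', 'report', 'mislead']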
News content features describe the meta information related to a piece of news. Representative news content attributes are listed below:
Based on these raw content attributes, different kinds of feature representations can be built to extract discriminative characteristics of fake news. Typically, news content features are either linguistic-based or visual-based.
In this project, the feature extraction and selection methods come from scikit-learn in Python. To perform feature selection, a method called TF-IDF is used. The project also uses word-to-vector representations to extract features, and pipelining has been used to simplify the code.
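A minimal sketch of chaining TF-IDF feature extraction and a classifier with a scikit-learn Pipeline; the choice of classifier and the toy data are assumptions for illustration only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # feature extraction
    ("clf", MultinomialNB()),                          # classification model
])

# toy example: label 1 = fake, 0 = real
texts = ["shocking secret cure they hide from you", "parliament passes the annual budget bill"]
labels = [1, 0]
pipeline.fit(texts, labels)
print(pipeline.predict(["secret cure revealed"]))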
Most of the approaches consider the fake news problem as a classification problem that predicts whether a news article is fake or not. A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks.
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent; that is, the presence of one particular feature does not affect another. Hence it is called naive.
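For reference, Bayes' theorem used here can be written as:
P(A|B) = P(B|A) × P(A) / P(B)
where, in this setting, A is the class of the article ("fake" or "real") and B is the observed evidence, i.e. the words it contains.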
Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis about an article is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
P(word) = (word count + 1) / (total number of words + number of unique words)
By using this formula (add-one or Laplace smoothing), the classifier estimates the probability of each word given a class; these word probabilities are combined to decide whether the news is fake or real.
New instances are classified by combining the predictions learned from previously classified data, weighted by their probabilities. The data is split into two parts, training data and test data, and the training dataset is classified into groups of similar entities. Later the test data is matched and assigned to whichever group it belongs to, and then the Naïve Bayes classifier is applied and the probability of each and every word is calculated individually.
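A minimal sketch of this train/test workflow with scikit-learn's MultinomialNB, whose alpha=1.0 default corresponds to the add-one (Laplace) smoothing formula above; texts and labels are assumed to hold the pre-processed articles and their fake/real labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# texts: list of article strings, labels: 1 for fake and 0 for real (assumed to exist)
x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.20)

vectorizer = CountVectorizer(stop_words="english")
x_train_counts = vectorizer.fit_transform(x_train)
x_test_counts = vectorizer.transform(x_test)

model = MultinomialNB(alpha=1.0)  # alpha=1.0 gives add-one (Laplace) smoothing
model.fit(x_train_counts, y_train)
print(accuracy_score(y_test, model.predict(x_test_counts)))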
SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable. A separator between the
categories is found, and then the data are transformed in such a way that the separator could
be drawn as a hyperplane.
Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier.
An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks.
These blocks can be thought of as a differentiable version of the memory chips in a digital
computer. Each one contains one or more recurrently connected memory cells and three
multiplicative units – the input, output and forget gates – that provide continuous analogues
of write, read and reset operations for the cells.
LSTM holds promise for any sequential processing task in which we suspect that a
hierarchical decomposition may exist, but do not know in advance what this decomposition
is.
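A minimal Keras sketch of an LSTM classifier of the kind described above for binary fake/real prediction; the vocabulary size, sequence length and layer sizes are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

model = Sequential([
    Input(shape=(300,)),                       # assumed length of the padded word-index sequences
    Embedding(input_dim=5000, output_dim=64),  # assumed vocabulary size
    LSTM(64),                                  # memory blocks with input, output and forget gates
    Dense(1, activation="sigmoid"),            # probability that the article is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])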
True Positive (TP): when predicted fake news pieces are actually annotated as fake
news.
True Negative (TN): when predicted true news pieces are actually annotated as true
news.
False Negative (FN): when predicted true news pieces are actually annotated as fake
news.
False Positive (FP): when predicted fake news pieces are actually annotated as true
news.
These metrics are commonly used in the machine learning community and enable us to evaluate the performance of a classifier from different perspectives. Specifically, accuracy measures how closely the predicted labels match the actual annotated labels.
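From the four counts above, accuracy can be computed directly; precision and recall, which are also commonly reported alongside it, follow from the same counts:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)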
CHAPTER 7
IMPLEMENTATION
The implementation plan for a project refers to a detailed description of actions that
demonstrate how to implement an activity within the project in the context of achieving
project objectives, addressing requirements, and meeting expectations. The implementation
phase represents the work done to meet the requirements of the scope of work and fulfill the
charter. During the implementation phase, the project team accomplished the work defined in
the plan.
In this project, different machine learning algorithms are implemented using the Python programming language on the PyCharm platform.
‘Python’ is an easy to learn, powerful programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s
elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development in many areas on most platforms.
Python also offers much more error checking than C, and, being a very-high-level
language, it has high-level data types built in, such as flexible arrays and dictionaries.
Python allows you to split your program into modules that can be reused in other Python programs.
It comes with a large collection of standard modules that you can use as the basis of
your programs. Some of these modules provide things like file I/O, system calls,
sockets, and even interfaces to graphical user interface toolkits like Tk.
Python is an interpreted language, which saves considerable time during program
development because no compilation and linking is necessary. The interpreter can be
used interactively, which makes it easy to experiment with features of the language,
or to test functions during bottom-up program development.
data = pd.read_csv('dataset/train.csv')
text = text.lower().split()
stops = set(stopwords.words("english"))
Step 8: Compare the cleaned text with the data and find the missing rows
missing_rows = []
FOR i in range(len(data)) DO
    IF the cleaned text of row i is empty THEN
        missing_rows.append(i)
    END IF
END FOR
data = data.drop(missing_rows).reset_index().drop(['index', 'id'], axis=1)
Step 9: Stop
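A runnable Python version of the pre-processing steps above; the 'text' column name and the cleaning details are assumptions, since only the fragments shown above survive in the source:

import pandas as pd
from nltk.corpus import stopwords

data = pd.read_csv('dataset/train.csv')
stops = set(stopwords.words("english"))

def clean(text):
    # lower-case the text, split into words and drop English stop words
    words = str(text).lower().split()
    return " ".join(word for word in words if word not in stops)

data["clean_text"] = data["text"].apply(clean)  # 'text' column name is an assumption

# find rows whose cleaned text ends up empty and drop them
missing_rows = [i for i in range(len(data)) if not data.loc[i, "clean_text"].strip()]
data = data.drop(missing_rows).reset_index().drop(['index', 'id'], axis=1)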
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')
sentences = nltk.sent_tokenize(document)
vector = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = vector.fit_transform(sentences)
w(i,j) = tf(i,j) × log(N / df(i))
where,
w(i,j) = weight of the cell in the matrix, which signifies how important word i is for document j
tf(i,j) = number of times term i occurs in document j divided by the total number of terms in j
df(i) = number of documents containing term i
Step 6: Stop
Step 4: Split the dataset into train and test using sklearn before building the SVM algorithm
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
Step 5: Import the support vector classifier function or SVC function from Sklearn SVM
module. Build the Support Vector Machine model with the help of the SVC function
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(x_train,y_train)
Step 8: Stop
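The published steps jump from fitting (Step 5) to Stop (Step 8); a hedged sketch of the prediction and evaluation that would typically fill the gap, using the accuracy and confusion matrix mentioned in the methodology, is:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = svclassifier.predict(x_test)    # predict labels for the held-out test set
print(confusion_matrix(y_test, y_pred))  # TP, TN, FP, FN counts
print(accuracy_score(y_test, y_pred))    # fraction of correctly classified articles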
CHAPTER 8
CONCLUSION
Fake news is nowadays of major concern. With more and more users consuming news from
their social networks, such as Facebook and Twitter, and with an ever-increasing frequency
of content available, the ability to question the content instead of instinctively sharing or
liking it is becoming rare.
The goal has been to comprehensively and extensively review, summarize, compare and evaluate the current research on fake news, which includes:
The qualitative and quantitative analysis of fake news, as well as detection and
intervention strategies for fake news.
The method infers that term frequency is potentially predictive of fake news - an
important first step toward using machine classification for identification.
The highlight of the proposed approach is the decision-making model which consists of
multiple machine learning algorithms that consider their classification probability.
REFERENCES
[3] Edell, A. (2018). I trained fake news detection AI with >85% accuracy, and almost went crazy.
[5] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake news detection on social media: A data mining perspective," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1. New York, NY, USA: ACM, Sep. 2017, pp. 22–36.
[7] Y. Long, Q. Lu, R. Xiang, M. Li, and C.-R. Huang, "Fake news detection through multi-perspective speaker profiles," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, 2017, pp. 252–256.
[8] W. Ferreira and A. Vlachos, "Emergent: a novel data-set for stance classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[9] V. Rubin, N. Conroy, Y. Chen, and S. Cornwell, "Fake news or truth? Using satirical cues to detect potentially misleading news," in Proceedings of the Second Workshop on Computational Approaches to Deception Detection, 2016.