Fake News Project
CHAPTER 1
INTRODUCTION
1.1 Overview
The rapid growth of fake news, especially on social media, has become a challenging problem with negative social impacts on a global scale. The ever-growing volume of fake news has turned into a significant global problem, as it is difficult to distinguish between genuine and fake news. Hence, fake news detection has become a very important, yet technically very challenging, task.
In recent years, online content has been playing a significant role in shaping users' decisions and opinions. Opinions such as online reviews are the main source of information for e-commerce customers, helping them gain insight into the products they are planning to buy. Recently it has become apparent that opinion spam does not only exist in product reviews and customer feedback. In fact, fake news and misleading articles are another form of opinion spam, one which has gained traction. Some of the biggest sources of spreading fake news are social media websites such as Google Plus.
Even though the problem of fake news is not a new issue, detecting fake news is believed to be a complex task, given that humans tend to believe misleading information and that the spread of fake content is difficult to control. Fake news has been getting more attention in the last couple of years, especially since the US election in 2016. It is tough for humans to detect fake news. It can be argued that the only way for a person to manually identify fake news is to have vast knowledge of the covered topic. Even with such knowledge, it is considerably hard to successfully determine whether the information in an article is real or fake. The open nature of the web and social media, in addition to recent advances in computer science, simplify the process of creating and spreading fake news.
Fake news can broadly be divided into three groups:
The first group is fake news, which is news that is completely fake and is made up by the writers of the articles.
The second group is fake satire news, which is fake news whose main purpose is to provide humor to the readers.
The third group is poorly written news articles, which have some degree of real news but are not entirely accurate. In short, it is news that uses, for example, quotes from political figures to report a fully fake story. Usually, this kind of news is designed to promote a certain agenda or biased opinion.
1.2 Overview of Algorithms
The most popular method to calculate word frequencies is TF-IDF. These are the components of the resulting score assigned to each word.
Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear frequently across documents.
w(i,j) = tf(i,j) × log(N / df(i))
where,
w(i,j) = weight of the cell in the matrix, which signifies how important word i is for document j
tf(i,j) = number of times term i occurs in document j divided by the total number of terms in j
df(i) = number of documents containing term i
N = total number of documents
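As a small illustration of the formula above (the counts are hypothetical, not taken from the project dataset), consider a term that occurs 3 times in a 100-word document and appears in 10 out of 1,000 documents:

import math

tf = 3 / 100                # term frequency within the document
idf = math.log(1000 / 10)   # inverse document frequency across the corpus
weight = tf * idf           # TF-IDF weight, approximately 0.138
print(round(weight, 3))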
The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternatively, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
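A minimal sketch of the two options described above, using scikit-learn; the toy documents are placeholders:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["the election was rigged", "the election results are official"]  # placeholder documents

# Option 1: TfidfVectorizer tokenizes, learns the vocabulary and IDF weights, and encodes in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Option 2: reuse an already fitted CountVectorizer and learn only the IDF weightings
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)
transformer = TfidfTransformer()
tfidf_from_counts = transformer.fit_transform(counts)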
CHAPTER 2
OBJECTIVES
The main objective is to detect fake news, which is a classic text classification problem with a straightforward proposition. A model needs to be built that can differentiate between "Real" news and "Fake" news.
This project proposes a feasible method, which contains several aspects, to accurately tackle the fake news detection issue. Thus it is a combination of semantic analysis using NLP techniques.
The proposed method is entirely composed of Artificial Intelligence approaches, which is critical to accurately classify between real and fake news.
The three-part method is a combination of Machine Learning algorithms that subdivide into supervised learning techniques, and natural language processing methods.
Although each of the above-mentioned approaches can be used on its own to classify and detect fake news, in order to increase the accuracy and be applicable to the social media domain, they have been combined into an integrated algorithm as a method for fake news detection.
CHAPTER 3
PROBLEM DEFINITION
While it is easier to understand and trace the intention and the impact of fake reviews, the intention and the impact of creating propaganda by spreading fake news cannot be measured or understood easily. For instance, it is clear that fake reviews affect product owners, customers and online stores; on the other hand, it is not easy to identify the entities affected by fake news. This is because identifying these entities requires measuring the news propagation, which has been shown to be complex and resource intensive.
The extensive spread of fake news has the potential for extremely negative impacts on individuals and society. Therefore, fake news detection on social media has recently become an emerging research area that is attracting tremendous attention.
Research on fake news detection is still at an early stage, as this is a relatively recent
phenomenon, at least regarding the interest raised by society. There exists a large body of
research on the topic of machine learning methods for deception detection; most of it has
been focusing on classifying online reviews and publicly available social media posts.
Particularly since the American Presidential election in late 2016, the question of determining 'fake news' has also been the subject of particular attention within the literature.
In [4], Shloka Gilda presented a concept of how NLP is relevant to detecting fake news. They used a Count Vectorizer (CV) of bi-grams and probabilistic context free grammar (PCFG) features for deception detection. They examined their dataset with multiple classification algorithms to find the best model. They found that a CV of bi-grams fed into a Stochastic Gradient Descent model identifies non-credible sources with an accuracy of 71.2%.
The lack of available corpora for predictive modeling is an important limiting factor in
designing effective models to detect fake news.
Disadvantages:
The accuracy dropped to 71.2% when predicting fake news against real news.
CountVectorizer only counts the number of times a word appears in the document, which results in biasing in favour of the most frequent words. This ends up ignoring rare words which could have helped us in processing our data more efficiently.
A machine learning technique is used to detect fake news, which consists of using text analysis based on classification techniques. Experimental evaluation is conducted using a dataset compiled from real and fake news websites, yielding very encouraging results.
The actual goal is to develop a model that includes the text transformation and the choice of which type of text to use (headlines versus full text). The next step is to extract the most optimal features for the TF-IDF Vectorizer. This is done by using the n most frequent words and/or phrases, lower casing or not, mainly removing the stop words (common words such as "the", "when", and "there") and only using those words that appear at least a given number of times in a given text dataset.
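These feature extraction choices map directly onto TfidfVectorizer parameters in scikit-learn; the concrete values below are illustrative assumptions rather than the project's tuned settings:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,     # keep only the n most frequent words/phrases
    lowercase=True,        # lower case the text before tokenizing
    stop_words="english",  # remove common words such as "the", "when", "there"
    min_df=2,              # use only words that appear in at least this many documents
    ngram_range=(1, 2),    # single words and two-word phrases
)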
CHAPTER 4
LITERATURE SURVEY
In December 2016, the first Fake News Challenge dataset (FNC-1) was launched. This dataset contains article bodies and headlines from news articles. The stance detection task extends the work of the Emergent dataset, estimating the stance of a body text relative to a headline. The published FNC-1 dataset contains four possible classes: agree, disagree, discuss or unrelated. Challenge winners reported an accuracy of 82.02%. Recently there have been several works related to fake news.
Golbeck et al. [1] presented a dataset of fake news and satirical stories that are hand-coded,
verified, and in the case of fake news, include rebutting stories. The dataset contains 283 fake
news stories and 203 satirical stories chosen from a diverse set of sources.
Shloka Gilda [4] presented a concept of how NLP is relevant to detecting fake news. They used term frequency-inverse document frequency (TF-IDF) of bigrams and probabilistic context free grammar (PCFG) features for deception detection.
Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. [5] observe that social media for news consumption is a double-edged sword. On the one hand, its low cost, easy access, and rapid dissemination of information lead people to seek out and consume news from social media. On the other hand, it enables the wide spread of "fake news", i.e., low quality news with intentionally false information.
Long et al. [7] propose a hybrid attention-based Long Short-Term Memory (LSTM) model that analyses the profile of the speakers, specifically considering the speaker's credit history (statements declared in the past), and report an accuracy of 41.5%.
W. Ferreira and A. Vlachos [8] present Emergent, a dataset collected and annotated by journalists. The task involves stance detection, i.e. estimating the relative perspective of two pieces of text with respect to a topic, claim or issue. It contains 300 rumoured claims and 2,595 associated news articles, categorized into 3 classes: true, false or unverified.
Conroy, Rubin, Cornwell and Chen [9] provide a conceptual overview of satire and humor,
elaborating and illustrating the unique features of satirical news, which mimics the format
and style of journalistic reporting. Satirical news stories were carefully matched and
examined in contrast with their legitimate news counterparts in 12 contemporary news topics
in 4 domains (civics, science, business, and “soft” news).
CHAPTER 5
SYSTEM DESIGN
HARDWARE REQUIREMENTS:
Processor - Any processor above 500 MHz
RAM - 4GB
Hard Disk - 250GB
SOFTWARE REQUIREMENTS:
Operating system - Windows 10
Programming Language - Python 3.5
Packages - Numpy, Pandas, Scikit-learn, Keras, Scipy, Gensim, Shutil, Pillow,
Tensorflow, Nltk
Platform - PyCharm
CHAPTER 6
METHODOLOGY
The basic idea of our project is to build a model that can predict the credibility of news events. As shown in Fig 6.1, the proposed framework consists of five major steps: Data acquisition, Data pre-processing, Feature extraction, Model construction and Model evaluation. In the first step, key phrases of the news event that the individual needs to authenticate are taken as input. After that, data is collected from a repository. The data pre-processing unit is responsible for preparing the data for further processing. Feature extraction is based on NLP techniques. A classification model is built using a Naïve Bayes classifier, a Support Vector Machine and a Long Short-Term Memory network. By evaluating the results obtained from the classification and analysis models using accuracy and the confusion matrix, it is possible to decide whether a piece of news is fake or real.
[Fig 6.1: Proposed framework: Data acquisition → Data pre-processing → Feature extraction → Model construction → Model evaluation]
There are two parts in the data-acquisition process, "fake news" and "real news". Collecting the fake news was easy, as Kaggle released a fake news dataset consisting of 13,000 articles published during the 2016 election cycle. The latter part, obtaining real news to pair with the fake news dataset, is much more difficult. It required work across many sites, because the only way to collect it was to web-scrape thousands of articles from numerous websites. With the help of web scraping, a real news dataset of 5,279 articles was generated, mostly from media organizations.
This project includes news samples as datasets from a repository such as scikit-learn. The dataset includes the body of the news article, the headline of the news article, and the label for the relatedness of an article and its headline. In this project, we have used various natural language processing techniques and machine learning algorithms to classify fake news articles using scikit-learn libraries in Python.
Text data requires special pre-processing before machine learning algorithms can be applied to it. This process is also called Data Cleaning. There are various techniques widely used to convert text data into a form that is ready for modeling. The data pre-processing steps outlined below are applied to both the headlines and the news articles.
1. Stop Word Removal :
We start with the removal of stop words from the available text data. Stop words (the most common words in a language, which do not provide much context) can be processed and filtered from the text, as they are very common and hold little useful information. Stop words act more like a connecting part of the sentences, for example, conjunctions like "and", "or" and "but", prepositions like "of", "in", "from", "to", etc. and the articles "a", "an", and "the". Such stop words, which are of less importance, may take up valuable processing time, and hence removing them as part of data pre-processing is a key first step in natural language processing. We used the Natural Language Toolkit (NLTK) library to remove stop words. Figure 6.2 illustrates an example of stop word removal.
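A small sketch of stop word removal with NLTK; the sample sentence is a placeholder, and the corpora downloads are shown for completeness:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
sentence = "The results of the election were disputed in the media"
tokens = word_tokenize(sentence.lower())
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['results', 'election', 'disputed', 'media']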
2. Punctuation Removal :
3. Stemming :
Stemming is a technique to remove prefixes and suffixes from a word, ending up with the stem or root word. Using stemming we can reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Figure 6.4 shows an example of the stemming technique.
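A brief sketch of stemming with NLTK's PorterStemmer; the example words are placeholders:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["reporting", "reported", "reports", "misleading"]
print([stemmer.stem(word) for word in words])  # ['report', 'report', 'report', 'mislead']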
News content features describe the meta information related to a piece of news. Representative news content attributes are listed below:
Based on these raw content attributes, different kinds of feature representations can be built to extract discriminative characteristics of fake news. Typically, news content features are either linguistic-based or visual-based.
In this project, the feature extraction and selection methods come from scikit-learn in Python. To perform feature selection, a method called TF-IDF is used. The project also uses word-to-vector representations to extract features, and pipelining has been used to simplify the code.
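A minimal sketch of chaining TF-IDF feature extraction and a classifier with a scikit-learn Pipeline; the choice of classifier and the toy data are assumptions for illustration only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # feature extraction
    ("clf", MultinomialNB()),                          # classification model
])

# toy example: label 1 = fake, 0 = real
texts = ["shocking secret cure they hide from you", "parliament passes the annual budget bill"]
labels = [1, 0]
pipeline.fit(texts, labels)
print(pipeline.predict(["secret cure revealed"]))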
Most of the approaches consider the fake news problem as a classification problem that predicts whether a news article is fake or not. A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks.
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent; that is, the presence of one particular feature does not affect another. Hence it is called naive.
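For reference, Bayes' theorem used here can be written as:
P(A|B) = P(B|A) × P(A) / P(B)
where, in this setting, A is the class of the article ("fake" or "real") and B is the observed evidence, i.e. the words it contains.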
Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis about an article is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
P(word) = (word count + 1) / (total number of words + number of unique words)
By using this formula (add-one or Laplace smoothing), the classifier estimates the probability of each word given a class; these word probabilities are combined to decide whether the news is fake or real.
New instances are classified by combining the predictions learned from previously classified data, weighted by their probabilities. The data is split into two parts, training data and test data, and the training dataset is classified into groups of similar entities. Later the test data is matched and assigned to whichever group it belongs to, and then the Naïve Bayes classifier is applied and the probability of each and every word is calculated individually.
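A minimal sketch of this train/test workflow with scikit-learn's MultinomialNB, whose alpha=1.0 default corresponds to the add-one (Laplace) smoothing formula above; texts and labels are assumed to hold the pre-processed articles and their fake/real labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# texts: list of article strings, labels: 1 for fake and 0 for real (assumed to exist)
x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.20)

vectorizer = CountVectorizer(stop_words="english")
x_train_counts = vectorizer.fit_transform(x_train)
x_test_counts = vectorizer.transform(x_test)

model = MultinomialNB(alpha=1.0)  # alpha=1.0 gives add-one (Laplace) smoothing
model.fit(x_train_counts, y_train)
print(accuracy_score(y_test, model.predict(x_test_counts)))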
SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable. A separator between the
categories is found, and then the data are transformed in such a way that the separator could
be drawn as a hyperplane.
Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier.
An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks.
These blocks can be thought of as a differentiable version of the memory chips in a digital
computer. Each one contains one or more recurrently connected memory cells and three
multiplicative units – the input, output and forget gates – that provide continuous analogues
of write, read and reset operations for the cells.
LSTM holds promise for any sequential processing task in which we suspect that a
hierarchical decomposition may exist, but do not know in advance what this decomposition
is.
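A minimal Keras sketch of an LSTM classifier of the kind described above for binary fake/real prediction; the vocabulary size, sequence length and layer sizes are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

model = Sequential([
    Input(shape=(300,)),                       # assumed length of the padded word-index sequences
    Embedding(input_dim=5000, output_dim=64),  # assumed vocabulary size
    LSTM(64),                                  # memory blocks with input, output and forget gates
    Dense(1, activation="sigmoid"),            # probability that the article is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])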
True Positive (TP): when predicted fake news pieces are actually annotated as fake
news.
True Negative (TN): when predicted true news pieces are actually annotated as true
news.
False Negative (FN): when predicted true news pieces are actually annotated as fake
news.
False Positive (FP): when predicted fake news pieces are actually annotated as true
news.
These metrics are commonly used in the machine learning community and enable us to evaluate the performance of a classifier from different perspectives. Specifically, accuracy measures how closely the predicted labels match the actual annotated labels.
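From the four counts above, accuracy can be computed directly; precision and recall, which are also commonly reported alongside it, follow from the same counts:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)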
CHAPTER 7
IMPLEMENTATION
The implementation plan for a project refers to a detailed description of actions that
demonstrate how to implement an activity within the project in the context of achieving
project objectives, addressing requirements, and meeting expectations. The implementation
phase represents the work done to meet the requirements of the scope of work and fulfill the
charter. During the implementation phase, the project team accomplished the work defined in
the plan.
In this project, different machine learning algorithms are implemented using the Python programming language on the PyCharm platform.
‘Python’ is an easy to learn, powerful programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s
elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development in many areas on most platforms.
Python also offers much more error checking than C, and, being a very-high-level
language, it has high-level data types built in, such as flexible arrays and dictionaries.
Python allows you to split your program into modules that can be reused in other Python programs.
It comes with a large collection of standard modules that you can use as the basis of
your programs. Some of these modules provide things like file I/O, system calls,
sockets, and even interfaces to graphical user interface toolkits like Tk.
Python is an interpreted language, which saves considerable time during program
development because no compilation and linking is necessary. The interpreter can be
used interactively, which makes it easy to experiment with features of the language,
or to test functions during bottom-up program development.
data = pd.read_csv('dataset/train.csv')
text = text.lower().split()
stops = set(stopwords.words("english"))
Step 8: Compare the cleaned text with the data and find the missing rows
missing_rows = []
FOR i in range(len(data)) DO
    IF the cleaned text of row i is empty THEN
        missing_rows.append(i)
    END IF
END FOR
data = data.drop(missing_rows).reset_index().drop(['index', 'id'], axis=1)
Step 9: Stop
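A runnable Python version of the pre-processing steps above; the 'text' column name and the cleaning details are assumptions, since only the fragments shown above survive in the source:

import pandas as pd
from nltk.corpus import stopwords

data = pd.read_csv('dataset/train.csv')
stops = set(stopwords.words("english"))

def clean(text):
    # lower-case the text, split into words and drop English stop words
    words = str(text).lower().split()
    return " ".join(word for word in words if word not in stops)

data["clean_text"] = data["text"].apply(clean)  # 'text' column name is an assumption

# find rows whose cleaned text ends up empty and drop them
missing_rows = [i for i in range(len(data)) if not data.loc[i, "clean_text"].strip()]
data = data.drop(missing_rows).reset_index().drop(['index', 'id'], axis=1)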
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')
sentences = nltk.sent_tokenize(document)
vector = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = vector.fit_transform(sentences)
w(i,j) = tf(i,j) × log(N / df(i))
where,
w(i,j) = weight of the cell in the matrix, which signifies how important word i is for document j
tf(i,j) = number of times term i occurs in document j divided by the total number of terms in j
df(i) = number of documents containing term i
Step 6: Stop
Step 4: Split the dataset into train and test using sklearn before building the SVM algorithm
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
Step 5: Import the support vector classifier function or SVC function from Sklearn SVM
module. Build the Support Vector Machine model with the help of the SVC function
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(x_train,y_train)
Step 8: Stop
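The published steps jump from fitting (Step 5) to Stop (Step 8); a hedged sketch of the prediction and evaluation that would typically fill the gap, using the accuracy and confusion matrix mentioned in the methodology, is:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = svclassifier.predict(x_test)    # predict labels for the held-out test set
print(confusion_matrix(y_test, y_pred))  # TP, TN, FP, FN counts
print(accuracy_score(y_test, y_pred))    # fraction of correctly classified articles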
CHAPTER 8
CONCLUSION
Fake news is nowadays of major concern. With more and more users consuming news from
their social networks, such as Facebook and Twitter, and with an ever-increasing frequency
of content available, the ability to question the content instead of instinctively sharing or
liking it is becoming rare.
The goal has been to comprehensively and extensively review, summarize, compare and evaluate the current research on fake news, which includes:
The qualitative and quantitative analysis of fake news, as well as detection and
intervention strategies for fake news.
The method infers that term frequency is potentially predictive of fake news - an
important first step toward using machine classification for identification.
The highlight of the proposed approach is the decision-making model which consists of
multiple machine learning algorithms that consider their classification probability.
REFERENCES
[3] Edell, A. (2018). I trained fake news detection AI with >85% accuracy, and almost went crazy.
[5] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake news detection on social media: A data mining perspective," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1. New York, NY, USA: ACM, Sep. 2017, pp. 22–36.
[7] Y. Long, Q. Lu, R. Xiang, M. Li, and C.-R. Huang, "Fake news detection through multi-perspective speaker profiles," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, 2017, pp. 252–256.
[8] W. Ferreira and A. Vlachos, "Emergent: a novel data-set for stance classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[9] V. Rubin, N. Conroy, Y. Chen, and S. Cornwell, "Fake news or truth? Using satirical cues to detect potentially misleading news," in Proceedings of the Second Workshop on Computational Approaches to Deception Detection, 2016.