Fake News Detection using Machine Learning
CANDIDATE DECLARATION
_________________________________________________
I hereby declare that the work presented in this report entitled “Fake News
Detection Using Machine Learning” in partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology in Computer Science and
Engineering/Information Technology submitted in the department of Computer
Science & Engineering and Information Technology, Jaypee University of
Information Technology Waknaghat, is an authentic record of my own work carried
out over a period from July 2022 to May 2023 under the supervision of Dr. Emjee
Puthooran (Associate Professor, Department of Electronics and Communication)
and co-supervisor Mr. Praveen Modi (Assistant Professor (Grade I), Department of
CSE & IT).
I also authenticate that I have carried out the above-mentioned project work under
the proficiency stream Data Science.
The matter embodied in the report has not been submitted for the award of any other
degree or diploma.
This is to certify that the above statement made by the candidate is true to the best
of my knowledge.
PLAGIARISM CERTIFICATE
ACKNOWLEDGEMENT
________________________________________________________
First, we would like to express our gratitude to our supervisor, Dr. Emjee
Puthooran, Associate Professor, Department of Electronics & Communication
Engineering, and our co-supervisor, Mr. Praveen Modi, Assistant Professor (Grade I),
Department of Computer Science & Engineering/Information Technology at
Jaypee University of Information Technology (JUIT), for their invaluable support
and direction throughout the project's implementation.
We also thank them for the stimulating discussions, for their help in analysing
the problems associated with our project work, and for guiding us throughout the
project. The project meetings were highly informative. We express our warm and
sincere thanks for the encouragement, untiring guidance, and confidence they have
shown in us. We are immensely indebted to them for their valuable guidance
throughout our project.
TABLE OF CONTENTS
________________________________________________________
ABSTRACT
INTRODUCTION
    Introduction
        Natural Language Processing
        Fake News Detection
    Problem Statement
    Objectives
    Methodology
        Dataset
        Flowchart
        Algorithm
LITERATURE SURVEY
CONCLUSIONS
    Conclusions
    Future Scope
REFERENCES
APPENDICES
LIST OF ABBREVIATIONS
________________________________________________________
DT = Decision Tree
LR = Logistic Regression
CV = Count Vectorizer
FIG = Figure
LIST OF FIGURES
____________________________________________
Fig. 1: Deep Learning vs Machine Learning vs Artificial Intelligence
Fig. 2: Comparison of Fake and Real news
Fig. 3: Flowchart
Fig. 4: Fake.csv and True.csv
Fig. 5: Design of the Project
Fig. 6: Importing Libraries
Fig. 7: Mounting Google Drive
Fig. 8: Fake.csv
Fig. 9: True.csv
Fig. 10: Comparing Fake and True Dataset
Fig. 11: Describing Fake and True Dataset
Fig. 12: Inserting a column “Outcome”
Fig. 13: Removing last 10 rows from both dataset for manual testing
Fig. 14: Merging the manual data frame
Fig. 15: Manual testing dataset
Fig. 16: Merging the main fake and true data frame
Fig. 17: Whitespace Tokenizer
Fig. 18: Checking the columns
Fig. 19: Removing “title”, “subject” and “date” columns
Fig. 20: Randomly Shuffling the data frame
Fig. 21: Count Vectorizer
Fig. 22: Pre-processing task of words
Fig. 23: Train-Test Split
Fig. 24: Importing for Confusion Matrix
Fig. 25: Logistic Regression
Fig. 26: Support Vector Machine
Fig. 27: Decision Tree Classifier
Fig. 28: Gradient Boosting Classifier
Fig. 29: Random Forest Classifier
Fig. 30: Testing
Fig. 32: Support Vector Machine (SVM)
Fig. 33: Confusion Matrix from Support Vector Machine
Fig. 34: Logistic Regression
Fig. 35: Confusion matrix from Logistic Regression
Fig. 36: Decision Tree
Fig. 37: Confusion Matrix from Decision Tree Classification
Fig. 38: Confusion Matrix from Gradient Boosting Classifier
Fig. 39: Confusion Matrix from Random Forest Classifier
Fig. 40: Web Browser Output
LIST OF GRAPHS
____________________________________________
LIST OF TABLES
____________________________________________
ABSTRACT
____________________________________________
Fake news has become one of the major problems in today's society. It has a high
potential to change opinions and distort facts, and it can be a dangerous weapon
for influencing society.
The proposed project uses NLP techniques to detect 'fake news', that is,
misleading news stories that come from non-reputable sources. Fake news can be
detected by building a model based on supervised machine learning classifiers.
The data science community has responded by taking action against the problem.
It is difficult to determine accurately by hand whether a news item is real or fake.
The proposed project therefore encodes the datasets with the count vectorizer
method, trains machine learning algorithms on them, and tests their accuracy in
detecting fake news.
In this research, we concentrate on how to spot fake news in internet news sources.
Our contribution is twofold. First, we use multiple datasets of real and fake news
to measure the proportion of news items that are classified correctly, and we
provide a thorough description of the selection, justification, and validation
process as well as a few exploratory analyses of the observable linguistic
differences between fake and legitimate news material. Second, we run a set of
learning experiments to build accurate fake news detectors, and we closely compare
the automatic and manual identification of fake news. Python can be used to spot
fake news posted on social media.
CHAPTER-1
INTRODUCTION
___________________________________________
1.1) Introduction
Machine learning (ML) is the study of the statistical models and algorithms that
computers use to perform specific tasks without explicit instructions, relying
instead on patterns and inference. It is viewed as a subset of artificial
intelligence. Machine learning algorithms construct a mathematical model from
sample data, known as "training data," in order to make predictions or decisions
without being explicitly programmed to do so. Machine learning is closely related
to computational statistics, which focuses on making predictions using computers.
The study of mathematical optimisation contributes methods, theory, and
application domains to machine learning.
The word "deep" in "deep learning" refers to the number of transformations that
the data goes through. More precisely, deep learning systems have a substantial
credit assignment path (CAP) depth. The CAP is the chain of transformations from
input to output, and CAPs describe the potentially causal connections between
input and output. For a feed-forward neural network, the depth of the CAPs is
equal to the number of hidden layers plus one, since the output layer is also
parameterized. In recurrent neural networks, where a signal may pass through a
layer more than once, the CAP depth is potentially unlimited.
Fake news contains verifiably false information. Many major companies, and even
government agencies, are working to address the problems caused by fake news.
However, given that millions of articles are published or removed every minute,
these efforts are neither accountable nor humanly feasible, because they rely on
manual detection. A machine learning algorithm that produces a trustworthy
automated index score or rating for the authenticity of various publications, and
that can assess whether a news item is true or misleading, may provide a solution
to this problem.
Natural language processing (NLP) is a branch of linguistics, computer science,
information engineering, and artificial intelligence that studies how computers
interact with human (natural) languages. Its major goal is to program computers to
process and analyse large amounts of natural language data efficiently.
1.1.2) Fake News Detection
With the rising use of social media platforms, fake news has become a severe
problem in recent years. Detecting fake news is a difficult problem that requires
several computational techniques, such as data mining, machine learning, and
natural language processing. This section discusses the current state of fake news
detection, along with its challenges and potential solutions. It also considers
how emerging technologies such as blockchain and artificial intelligence may be
used in the future to improve the efficiency and precision of fake news detection.
There is a greater need than ever for accurate and reliable techniques to identify
fake news. The field of fake news detection has evolved rapidly as researchers and
engineers have developed a number of techniques and strategies to identify and
combat misleading information. These methods range from manual fact-checking by
trained professionals to automated systems that use machine learning to examine
and classify news content.
Research into fake news detection is important, but it is also a challenging and
complex problem. Recognising fake news requires an understanding of linguistic
nuance, social and cultural context, and the complex network dynamics of online
communication. Despite these challenges, progress has been made in establishing
effective methods for spotting fake news, and the area continues to develop as new
tools and technologies are created.
1.2) Problem Statement
Consuming news has both benefits and drawbacks. On the one hand, news is actively
sought out and consumed because it is easily available, inexpensive, and quickly
spread. On the other hand, this same ease makes it possible for "fake news," that
is, low-quality news with deliberately inaccurate content, to be widely
disseminated.
As a result, research into the detection of fake news has recently made
significant strides. Even so, identifying fake news on the basis of content alone
is challenging and nontrivial, since it is purposefully written to lead people to
accept false information.
1.3) Objectives
The primary goal of our project is to determine the veracity of a news item, that
is, whether it is real or fake. To this end, we develop a machine learning model
that allows us to recognise fake news.
Identifying fake news from its content alone is difficult, since it is
intentionally written to persuade readers to believe false information.
1.4) Methodology
1.4.1) Dataset
Two datasets are used, Fake.csv and True.csv, along with a combination of the two.
The combined data contains 44898 news stories in total, which is a sizable
quantity: the true dataset comprises 21417 stories and the fake dataset 23481.
This data collection is accessible at:
The dataset is quite balanced: it contains 21417 true news items and 23481 fake
news items. This is a beneficial feature of the dataset, as it helps the models
make unbiased judgments.
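As a quick check of the balance claim, the class shares can be computed directly
from the counts reported above. This is only illustrative arithmetic; the variable
names are our own and are not taken from the project code.

```python
# Counts as reported in the text above.
n_true = 21417   # stories in the true dataset
n_fake = 23481   # stories in the fake dataset

total = n_true + n_fake
share_true = n_true / total
share_fake = n_fake / total

print(f"total stories: {total}")          # 44898
print(f"true share: {share_true:.1%}")    # 47.7%
print(f"fake share: {share_fake:.1%}")    # 52.3%
```

Neither class exceeds about 52% of the data, which is why a simple accuracy score
remains a meaningful first measure for this dataset.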
The dataset requires some pre-processing because it still contains stop words.
Before stop words are removed, the most common words in the dataset are "the,"
"to," "of," "and," etc.
The top 20 terms in the dataset before stop words were removed were as follows:
Fake.csv
After stop words are removed, the terms "said," "mr," "trump," "new," "people,"
and "year" become the most frequent, and they can give the models important
information.
We also examined the bigrams in the dataset to gain a better understanding of the
news story subjects. Before stop words are removed, the topics of the news stories
are not at all clear; removing stop words makes it simpler to understand the
themes of the news reports.
The graph below displays the top 20 bigrams from the dataset before stop words
are removed. As one can see, frequently used phrases like "of the," "in the," and
"to the" do not help one understand the content of a story.
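The word and bigram counts described above can be sketched with the standard
library alone. The sample sentences, the short stop-word list, and the function
names below are hypothetical and chosen only for illustration; the project's own
analysis may have used different tools.

```python
from collections import Counter

# Hypothetical three-sentence sample; the real input is the text of the news stories.
sample_texts = [
    "the president said the bill is new",
    "people said the vote is the first of the year",
    "the senate said people want the bill",
]

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "of", "to", "and", "in", "a", "an"}


def tokenize(text, remove_stop_words=False):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def top_ngrams(texts, n, k, remove_stop_words=False):
    """Return the k most frequent n-grams (space-joined) with their counts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text, remove_stop_words)
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        for gram in zip(*(tokens[i:] for i in range(n))):
            counts[" ".join(gram)] += 1
    return counts.most_common(k)


print(top_ngrams(sample_texts, n=1, k=3))
print(top_ngrams(sample_texts, n=1, k=3, remove_stop_words=True))
print(top_ngrams(sample_texts, n=2, k=3))
print(top_ngrams(sample_texts, n=2, k=3, remove_stop_words=True))
```

In this toy sample, "the" dominates the raw counts, while "said" rises to the top
once stop words are dropped, which mirrors the effect reported for the full dataset.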
To visualise the data, we plotted the frequency of each news subject:
1.4.2) Flowchart:
Fig. 3: Flowchart
1.4.3) Algorithm for The Proposed System
Step 1: Pre-processing
▪ Load the dataset of news items with their labels, whether they are true or
false;
▪ Clean the text by eliminating punctuation and stopwords;
▪ Divide the dataset into training and testing sets.
▪ Determine each model's accuracy score using the actual and predicted
labels.
Step 6: Accuracy
▪ Determine each model's accuracy by comparing its predicted labels to its
actual labels.
▪ The accuracy measures the proportion of news stories that were accurately
identified as being true or false.
▪ Evaluate the accuracy of various models to find which one is most effective
at spotting fake news.
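The accuracy step can be sketched as follows. The labels, the two model names, and
the encoding (1 = fake, 0 = real) are assumptions made for this example and do not
come from the project's results.

```python
def accuracy(actual, predicted):
    """Proportion of items whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for eight test stories (1 = fake, 0 = real; assumed encoding).
actual = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from two models, named only for illustration.
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],  # one mistake
    "model_b": [1, 1, 0, 1, 0, 0, 1, 1],  # three mistakes
}

scores = {name: accuracy(actual, pred) for name, pred in predictions.items()}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("most accurate:", best_model)
```

Here model_a scores 7/8 = 0.875 and model_b scores 5/8 = 0.625, so model_a would be
selected as the more effective of the two.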
CHAPTER-2
LITERATURE SURVEY
___________________________________________
J. H. Kim, S. H. Lee, and H. J. Kim, "Fake news detection using ensemble learning
with context and attention mechanism" [3]: For their experiments, the authors
employ two datasets, the Celebrity dataset and the LIAR dataset. To capture both
local and global aspects of news items, the proposed model combines convolutional
neural networks (CNNs) with recurrent neural networks (RNNs). The experimental
findings show that the proposed model outperforms numerous baseline models and
reaches an accuracy of up to 73.7%, achieving state-of-the-art performance on both
the LIAR and Celebrity datasets.
CHAPTER-3
SYSTEM DEVELOPMENT
___________________________________________
This project can be run on standard hardware. We used an Intel i5 CPU with 8 GB
of RAM, a 2 GB Nvidia graphics processor, and 2 cores with frequencies of
1.7 GHz and 2.1 GHz, respectively, to complete the project. The test phase, which
follows the training phase and lasts around 10-15 minutes, allows predictions
to be made and accuracy to be determined quickly.
Missing values in datasets can be a problem for some machine learning techniques.
Therefore, any missing values in each column of the input data must be found and
replaced before we model the prediction problem. This is known as missing data
imputation. For each attribute, the null value is replaced with a space (' '). We
use this approach instead of removing tuples that contain null values.
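A minimal sketch of this imputation step is shown below. The rows are hypothetical,
and the "text" column name is an assumption; "title" and "subject" are the column
names mentioned in the figure list. With pandas, a call such as
DataFrame.fillna(" ") would have the same effect.

```python
# Hypothetical rows; only the column names follow the dataset's fields.
rows = [
    {"title": "Budget passed", "text": "The senate passed the budget.", "subject": "politics"},
    {"title": None, "text": "Shocking cure found!", "subject": None},
    {"title": "Weather update", "text": None, "subject": "news"},
]


def impute_missing(records, fill_value=" "):
    """Return copies of the records with every None value replaced by fill_value."""
    return [
        {key: (fill_value if value is None else value) for key, value in record.items()}
        for record in records
    ]


cleaned = impute_missing(rows)

# No row is dropped, and no None values remain.
print(len(cleaned) == len(rows))
print(cleaned[1])
```

Filling with a space keeps every story in the dataset, so rows with a missing title
still contribute their text to training.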
Stop words such as "if," "the," "is," "a," and "an" should not be given much
weight by a machine learning model, because they are common English words that add
nothing to the distinctiveness or credibility of a story. Because they occur so
frequently, leaving them in the dataset may distort the model's predictions.
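Stop-word removal can be sketched as below. The stop-word set is a short
illustrative list of our own (libraries such as NLTK ship fuller lists), and the
function name is hypothetical. The sketch also strips non-letter characters first,
so that punctuation does not stay attached to the words.

```python
import re

# Illustrative stop-word list (an assumption), not the list used in the project.
STOP_WORDS = {"if", "the", "is", "a", "an", "of", "to", "and", "in"}


def clean_text(text):
    """Lowercase the text, strip non-letter characters, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("The senator is in a meeting, if the report is TRUE!"))
# senator meeting report true
```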
Removal of Special Characters
Lemmatization
The word "play" is the root of other words, including "playing" and "plays."
Replacing words in other tenses and participle forms with their root word allows
a more meaningful analysis of term frequency. We therefore replace every word
that derives from a single root word with that root.
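The idea can be illustrated with a toy lookup table. This is not how a real
lemmatizer works internally: tools such as NLTK's WordNetLemmatizer rely on a full
dictionary, whereas the table and function below are hypothetical and cover only a
handful of words.

```python
# Tiny hand-written lookup table, for illustration only.
LEMMA_TABLE = {
    "playing": "play",
    "plays": "play",
    "played": "play",
    "said": "say",
    "says": "say",
}


def lemmatize(tokens):
    """Map each token to its root word, leaving unknown tokens unchanged."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]


tokens = "she plays while he played and they are playing".split()
print(lemmatize(tokens))
```

After this step the three inflected forms are all counted as "play", so the term's
frequency reflects every occurrence of the underlying word.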
Count Vectorization
For machine learning algorithms to accept the pre-processed text as input, it must
next be encoded as integers or floating-point values. This process is called
feature extraction (or vectorization).
Each document is represented by a vector with as many dimensions as there are
words in our vocabulary. If a vocabulary word is present in the text, we add one
to the corresponding dimension of the vector, and we add one more for each
additional occurrence of that word. Dimensions for words that never occur are left
at zero.
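The counting scheme just described can be written out in a few lines. This is a
dependency-free sketch with hypothetical documents and function names; in practice
scikit-learn's CountVectorizer performs the same encoding with more options.

```python
def build_vocabulary(documents):
    """Assign a column index to every distinct word, in alphabetical order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def count_vectorize(document, vocabulary):
    """Return the count vector of one document over the given vocabulary."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:          # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector


# Hypothetical, already pre-processed documents.
documents = ["trump said trump", "people said new year"]

vocabulary = build_vocabulary(documents)
matrix = [count_vectorize(doc, vocabulary) for doc in documents]

print(vocabulary)   # {'new': 0, 'people': 1, 'said': 2, 'trump': 3, 'year': 4}
print(matrix)       # [[0, 0, 1, 2, 0], [1, 1, 1, 0, 1]]
```

The first row has a 2 in the "trump" column because the word occurs twice, and
zeros wherever a vocabulary word does not occur in that document.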
TF-IDF Transformation
To create a matrix with TF-IDF values for each feature, we apply a transformation
to the count-vectorized matrix.
TF-IDF combines Term Frequency (TF), which is identical to the counts we
previously saw in the Count Vectorizer, with Inverse Document Frequency (IDF),
which down-weights terms that appear in many documents.
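As a sketch, the transformation can be written with the textbook definition
idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number
of documents containing term t. This is an assumption for illustration: by default
scikit-learn's TfidfTransformer uses a smoothed IDF and also L2-normalises each
row, so its numbers will differ from the ones below. The count matrix is
hypothetical.

```python
import math


def tf_idf(count_matrix):
    """Weight each raw count by idf = ln(N / df), the textbook definition."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in count_matrix]


# Hypothetical count matrix: 3 documents (rows) by 3 vocabulary terms (columns).
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]

for row in tf_idf(counts):
    print([round(value, 3) for value in row])
```

The first term occurs in every document, so its weight drops to zero, while the
two terms that each occur in a single document keep a weight of ln 3, about 1.099.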
3.3) Design of Project
Dataset: The first step is to collect or obtain a dataset of news articles, labeled as
"fake" or "real". This dataset will be used to train and evaluate the performance of
different fake news detection models.
Train-Test Split: Once we have the bag-of-words (BOW) matrix, we can split the
data into training and testing sets. The training set is used to train the fake
news detection model, while the testing set is used to evaluate the model's
performance on new, unseen data.
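A shuffle-and-split step might look like the sketch below. The 25% test fraction,
the seed, the sample rows, and the function name are assumptions made for this
example; the report does not state the split ratio. scikit-learn offers an
equivalent helper, sklearn.model_selection.train_test_split.

```python
import random


def split_train_test(features, labels, test_fraction=0.25, seed=42):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)     # seeded so the split is repeatable
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical BOW rows and labels (1 = fake, 0 = real; an assumed encoding).
X = [[1, 0], [0, 1], [2, 0], [0, 2], [1, 1], [3, 0], [0, 3], [2, 2]]
y = [1, 0, 1, 0, 1, 1, 0, 0]

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))   # 6 2
```

Shuffling before the split matters here because the merged data frame would
otherwise hold all fake stories before all true ones.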
Models: After obtaining the numerical features from the text data, several machine
learning methods such as logistic regression, decision trees, or neural networks can
be employed to train a fake news detection model. The objective of the model is to
learn a function that can accurately classify news stories as either "real" or "fake"
based on the derived attributes from the text.
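To make "learning a function" concrete, the sketch below trains a small logistic
regression by batch gradient descent on count vectors. It is a from-scratch
illustration under stated assumptions, not the project's implementation: the
vocabulary, the four training rows, the learning rate, the number of epochs, and
the label encoding (1 = fake, 0 = real) are all hypothetical. A library classifier
such as scikit-learn's LogisticRegression would normally be used instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(row, weights, bias):
    """Return 1 (fake) if the estimated probability is at least 0.5, else 0 (real)."""
    return int(sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) >= 0.5)


# Hypothetical count vectors over the vocabulary
# ["budget", "cure", "hoax", "said", "senate", "shocking"].
X_train = [
    [0, 1, 0, 0, 0, 1],   # "shocking cure"  -> fake
    [0, 0, 1, 0, 0, 1],   # "shocking hoax"  -> fake
    [1, 0, 0, 0, 1, 0],   # "senate budget"  -> real
    [1, 0, 0, 1, 0, 0],   # "said budget"    -> real
]
y_train = [1, 1, 0, 0]    # assumed encoding: 1 = fake, 0 = real

weights, bias = train_logistic_regression(X_train, y_train)
print(predict([0, 1, 0, 0, 0, 1], weights, bias))   # "shocking cure"
print(predict([1, 0, 0, 1, 1, 0], weights, bias))   # "senate said budget"
```

The learned weights are positive for words seen only in fake examples and negative
for words seen only in real ones, which is the kind of function the classifiers in
this project learn at a much larger scale.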
Accuracy and Confusion Matrix: After training the fake news detection model, it
is crucial to assess its performance on the testing set. We do this by measuring
its accuracy, precision, recall, and F1 score. To see how many true positives,
true negatives, false positives, and false negatives the model produces, we can
also build a confusion matrix.
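These metrics all follow from the four confusion-matrix counts, as the sketch
below shows. The labels are hypothetical, and treating "fake" (1) as the positive
class is an assumption made for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(actual, predicted, positive=1):
    """Return accuracy, precision, recall and F1 computed from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels (1 = fake is treated as the positive class; an assumption).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_counts(actual, predicted))        # (3, 4, 2, 1)
print(classification_metrics(actual, predicted))
```

In this example accuracy is 0.70, precision is 0.60, recall is 0.75, and F1 is
about 0.667, which shows how the four measures can diverge on the same predictions.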
Testing: After assessing the model's performance, we can use it to classify new,
previously unseen news articles as "real" or "fake". This entails applying the
same pre-processing and feature extraction operations to the new data that we
applied during training. We can then apply the trained model to the cleaned-up
data to produce a classification label.
Result: The Streamlit library for Python is used to present the result in a web
browser, where the user inputs a news item and the algorithm reports whether the
news is “Real” or “Fake”.
3.4) Sample Code
Fig. 8: Fake.csv
Fig. 9: True.csv
Fig. 10: Comparing Fake and True Dataset
Pre-processing of Dataset
Fig. 13: Removing last 10 rows from both dataset for manual testing
Fig. 14: Merging the manual data frame
Fig. 16: Merging the main fake and true dataframe
Graph 5: Frequency of subject of the news
Graph 6: Fake and Real News
Fig. 17: WhitespaceTokenizer
Fig. 18: Checking the columns
Fig. 20: Randomly Shuffling the data frame
Fig. 22: Pre-processing task of words
Train-Test Split
Fig. 23: Train-Test Split
Models
Fig. 26: Support Vector Machine
Fig. 27: Decision Tree Classifier
Fig. 28: Gradient Boosting Classifier
Fig. 29: Random Forest Classifier
Graph 8: Comparison of the accuracies of different models
Testing
Sample Input
CHAPTER-4
RESULTS AND EXPERIMENTAL ANALYSIS
___________________________________________
Below are the results from applying the Support Vector Machine model:
Confusion Matrix:
Logistic Regression
Below are the results from applying the Logistic Regression model:
Confusion Matrix:
Decision Tree Classification
Below are the results from applying the Decision Tree Classification model:
Confusion Matrix:
Gradient Boosting Classifier
Below are the results from applying the Gradient Boosting Classifier model:
Confusion Matrix:
Below are the results from applying the Random Forest Classifier model:
Confusion Matrix:
Sample Input:
CHAPTER-5
CONCLUSIONS
___________________________________________
5.1) Conclusions
Considering the accuracy scores we obtained for the various models, all of the
models appear to do a good job of identifying fake news items. The SVM, Decision
Tree, and Gradient Boosting classifiers achieved very high accuracies of between
99.3% and 99.5%, while the Random Forest Classifier performed only slightly
lower, at 98.71%.
All things considered, these results suggest that a range of classifiers can be
used with comparable success and that machine learning techniques can be highly
effective in spotting fake news. It is important to keep in mind that accuracy is
only one measure and that the models should be evaluated using multiple metrics,
including precision, recall, and F1-score, in addition to factors like
interpretability, scalability, and processing requirements. Investigating
different feature extraction and selection methods, classifier types, and ensemble
approaches may also be useful to see whether even better results can be obtained.
We used the real and fake datasets, which had 21417 and 23481 entries,
respectively. We converted the text into a numerical representation using the
TF-IDF Vectorizer and used the following models:
Support Vector Machine: accuracy of 99.31%
Decision Tree: accuracy of 99.5%
Gradient Boosting Classifier: accuracy of 99.5%
Random Forest Classifier: accuracy of 98.71%
5.2) Future Scope
There is ample room for future research and advancement in the field of fake news
detection. Future efforts to identify fake news may go in the following
directions:
Including more varied and subtle features: For the most part, current methods for
detecting fake news rely on simple text-based features like TF-IDF vectors or
bag-of-words. Future research could concentrate on more complex and diverse
features, such as sentiment analysis, network analysis, or multimedia analysis
(for instance, identifying fake images or videos).
Creating more interpretable models: Existing methods for spotting fake news often
rely on complex machine learning algorithms that can be difficult to interpret.
In the future, it would be beneficial to develop more interpretable models that
provide more information on how they make decisions.
REFERENCES
___________________________________________
[2] S. Asghar, S. Mahmood, and H. Kamran, "Fake news detection using machine
learning: A survey," IEEE Access, vol. 9, pp. 57613-57639, 2021. doi:
10.1109/ACCESS.2021.3075392
[3] J. H. Kim, S. H. Lee, and H. J. Kim, "Fake news detection using ensemble
learning with context and attention mechanism," IEEE Access, vol. 9, pp. 27569-
27579, 2021. doi: 10.1109/ACCESS.2021.3057736
[6]https://www.google.co.in/imgres?imgurl=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Fdata-
flair.training%2Fblogs%2Fwpcontent%2Fuploads%2Fsites%2F2%2F2019%2F07
%2FintroductiontoSVM.png&tbnid=p7ua2IdzmLsjqM&vet=12ahUKEwjf26Kfru
DAhW6JrcAHdMIAagQMygCegUIARDlAQ..i&imgrefurl=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Fd
ata-flair.training%2Fblogs%2Fsvm-support-vector-machine-
tutorial%2F&docid=7oy5_irTaN4UfM&w=801&h=420&q=svm&ved=2ahUKE
wjf26KfruD-AhW6JrcAHdMIAagQMygCegUIARDlAQ
[7]https://www.google.co.in/imgres?imgurl=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Fstatic.javatpoint.c
om%2Ftutorial%2Fmachine-learning%2Fimages%2Flogistic-regression-
inmachinelearning.png&tbnid=LuaHnfur76i8eM&vet=12ahUKEwjFoPGSruDAh
VNnNgFHUjLCl8QMygCegUIARDjAQ..i&imgrefurl=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Fwww.j
avatpoint.com%2Flogisticregressioinmachinelearning&docid=makIlDmuc8naW
M&w=500&h=300&itg=1&q=logistic%20regression&ved=2ahUKEwjFoPGSru
D-AhVNnNgFHUjLCl8QMygCegUIARDjAQ
[8]https://www.google.co.in/url?sa=i&url=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Fwww.geeksforgeek
s.org%2Fdecision-tree%2F&psig=AOvVaw0sYuRq-TZe0WWhW-
9YQUnl&ust=1683450911500000&source=images&cd=vfe&ved=0CBEQjRxqF
woTCLDwi7qt4P4CFQAAAAAdAAAAABAE
APPENDICES
___________________________________________
▪ Sources
▪ Evidence Experts
▪ Research Statistics Facts
▪ Data Quotes
▪ Corroborate Verification
▪ Objective
▪ Impartial
▪ Reliable
▪ Credible
▪ Transparency
▪ Context Timeliness
▪ Accuracy
▪ impartial reporting
▪ several perspectives
Commonly used words and phrases that may indicate the presence of fake news:
▪ Allegedly
▪ Supposedly
▪ Claims
▪ "Fake news" or "hoax"
▪ Conspiracy
▪ Unverified
▪ Sensational
▪ Emotional
▪ Outrageous
▪ Shocking
▪ Clickbait
▪ Exaggerated
▪ Biased
▪ Partisan
▪ Misleading
▪ Inaccurate
▪ Unsubstantiated
▪ Rumors
▪ Speculation
▪ Opinions presented as facts.