AI - Phase 4
Step-by-step guide to fake news detection using machine learning and natural language processing in Python
In this post, we will discuss fake news detection using machine learning. We will start by understanding what fake news is and where it comes from.
Fake news is a form of news that contains falsified information or hoaxes designed to deceive users, often as clickbait. Clickbait is the technique of grabbing users' attention with flashy headlines that make them click a link, with the purpose of generating revenue by showing advertisements. Today, with the increasing usage of social media, the spread of fake news over the internet has increased manifold, and its major source is online news portals, which makes it really difficult to distinguish between real and fake news. In this case study, we will discuss how we can detect fake news from news headlines using natural language processing (NLP) and machine learning techniques. The full code used in this post is available in my GitHub repo.
About Data
The dataset used in this case study is the ISOT Fake News Dataset. The dataset contains two types of
articles: fake and real news. This dataset was collected from real-world sources; the truthful articles were
obtained by crawling articles from Reuters.com (News website). As for the fake news articles, they were
collected from different sources. The fake news articles were collected from unreliable websites that
were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset contains
different types of articles on different topics, however, the majority of articles focus on political and
world news topics.
The dataset consists of two CSV files. The first file contains more than 12,600 true articles from Reuters.com. The second file contains more than 12,600 articles from different fake news outlets. Each article contains the following information:
Text
Label (REAL or FAKE)
In this case study, we have extracted interesting patterns from the news headline text using NLP and performed exploratory data analysis to provide useful insights into real and fake news headlines.
Snapshot of dataset
Figure 1 shows the top 5 entries of the actual dataset used in the case study.
Firstly, we check the distribution of fake and true news in the dataset by plotting the bar graph as shown
in Figure 2.
As we can see from the above figure, the dataset is roughly balanced, with 21,417 true news articles and 23,481 fake news articles.
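The class-distribution bar plot can be reproduced with a short pandas/matplotlib sketch. The tiny frame below is only a stand-in for the loaded dataset; in the case study, df holds the full ~45k rows:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Toy stand-in for the loaded ISOT data frame; the real df has ~45k rows
df = pd.DataFrame({"label": ["REAL", "FAKE", "FAKE", "REAL", "FAKE"]})

counts = df["label"].value_counts()  # number of articles per class
counts.plot(kind="bar", color=["indianred", "steelblue"])
plt.title("Distribution of fake and true news")
plt.xlabel("label")
plt.ylabel("count")
plt.savefig("label_distribution.png")
```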
Distribution of the number of characters
Next, we checked the distribution of the number of characters in the fake and true news titles. From the below figure, it is evident that the average number of characters is higher for fake news than for true news.
Figure 3. Distribution of the number of characters in fake news (left) and true news (right)
Further, we checked the distribution of unique words used in both fake and true news titles. From the figure, it is observed that fake news titles contain more unique words than true news titles, as their main objective is to deceive users with attention-grabbing words in the headlines.
Figure 4. Distribution of unique words used in fake news (left) and true news (right)
Figure 5. Distribution of special characters in fake news (left) and true news (right)
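The per-title statistics behind Figures 3, 4, and 5 can be computed directly from the title column. A minimal sketch with a toy frame (the real df comes from the dataset above):

```python
import re
import pandas as pd

# Toy titles standing in for the dataset's title column
df = pd.DataFrame({"title": [
    "SHOCKING!!! Obama video leaked",
    "Senate passes annual budget bill",
]})

# Number of characters per title (Figure 3)
df["n_chars"] = df["title"].str.len()
# Number of unique words per title (Figure 4)
df["n_unique_words"] = df["title"].str.lower().str.split().apply(lambda ws: len(set(ws)))
# Number of special (non-alphanumeric, non-space) characters per title (Figure 5)
df["n_special"] = df["title"].apply(lambda t: len(re.findall(r"[^\w\s]", t)))
```

Plotting these three columns as histograms, split by label, yields the distributions shown in the figures.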
Next, we plot the most frequent words in fake and real news using word clouds. A word cloud is a technique for visualizing the most frequent words in a text corpus, where the size of a word represents its frequency. For plotting the word clouds, we used the wordcloud Python library.
As we can see from the above figure, the most frequent words in fake news are Video, Obama, Hillary, Trump, and Republican, whereas real news comprises Trump, White House, North Korea, China, etc.
Text pre-processing
After analyzing the data, we move towards text pre-processing before building machine learning models.
The text pre-processing consists of the following steps:
Step 1: Converting text to lower case
As we can see from the above data frame, all the text in the title column is now converted to lower case. Here, df is the pandas data frame into which we have loaded the dataset.
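The lower-casing can be done with pandas string methods; a minimal sketch with a toy df standing in for the loaded dataset:

```python
import pandas as pd

# Toy stand-in for the loaded data frame df from the text
df = pd.DataFrame({"title": ["SHOCKING Obama Video", "Senate Passes Bill"]})

# Convert every title to lower case
df["title"] = df["title"].str.lower()
```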
Step 2: Stop word removal
Stop words are the most common words in regular conversation and do not add significant value to the meaning of a sentence. Examples of stop words are a, an, the, he, she, it, etc. In text classification tasks, we remove such stop words by importing the stop word list defined in the NLTK corpus. The Python code for removing stop words is shown below:
Step 3: Special character removal
In this step, we remove all special characters from the news titles. Special characters are not significant for text classification and only increase the dimensionality of the vector space, so we filter them out before building the machine learning model. The code for filtering out special characters is shown below.
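A minimal sketch using a pandas regex replace; since the titles are already lower-cased by the earlier step, keeping only lower-case letters, digits, and spaces is enough:

```python
import pandas as pd

# Toy stand-in for the lower-cased title column
df = pd.DataFrame({"title": ["breaking: obama's 'shocking' video!!"]})

# Drop everything except lower-case letters, digits and spaces
df["title"] = df["title"].str.replace(r"[^a-z0-9\s]", "", regex=True)
```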
It is important to note that stemming and lemmatization are usually considered important pre-processing steps, but for this case study we have not performed either of them; you are free to try them and see how the final results change.
Step 4: Train-test split
In this step, we split the data into training and test sets in a 75:25 ratio, i.e., 75% of the data is used for training the model and the remaining 25% for testing it. The code for splitting the data is shown below.
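A minimal sketch with scikit-learn's train_test_split; the toy lists stand in for the preprocessed titles and their labels:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the preprocessed titles and their labels
titles = ["senate passes bill", "shocking obama video",
          "trump visits china", "hillary email leak"]
labels = ["REAL", "FAKE", "REAL", "FAKE"]

# 75:25 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, random_state=42
)
```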
Model building
First, we built our baseline model with a count vectorizer, which converts a text document into a vector of token counts. The working of the count vectorizer is shown in the below figure.
As we can see from the figure, the sentence "The quick brown fox jumps over the lazy dog" is converted into a frequency table in which the columns represent the tokens in the sentence and the row holds the corresponding frequency of each token. The code for applying the count vectorizer to the text documents is shown below. CountVectorizer is defined in the scikit-learn Python library.
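A minimal sketch of CountVectorizer on the example sentence; "the" occurs twice and every other token once:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumps over the lazy dog"]

cv = CountVectorizer()       # lower-cases and tokenizes by default
X = cv.fit_transform(docs)   # sparse matrix of token counts
row = X.toarray()[0]         # counts for the single document
```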
Next, with the count vectorizer, we used three machine-learning algorithms: Multinomial Naïve Bayes, which generally performs well for text classification, the Passive Aggressive classifier, and Logistic Regression. The results of these algorithms in terms of confusion matrices are shown below.
As we can see from Figure 7, the highest numbers of true positives and true negatives are obtained by the Logistic Regression algorithm, while the second best is the Passive Aggressive classifier.
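The modeling step described above can be sketched as follows. The four-sentence corpus is a stand-in for the real training split, and in practice you would predict on the held-out test set rather than the training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the real training split
X_train_text = ["senate passes budget bill", "shocking obama video leaked",
                "trump visits white house", "hillary email video shocking"]
y_train = ["REAL", "FAKE", "REAL", "FAKE"]

cv = CountVectorizer()
X_train = cv.fit_transform(X_train_text)

results = {}
for name, clf in [("MultinomialNB", MultinomialNB()),
                  ("PassiveAggressive", PassiveAggressiveClassifier(random_state=0)),
                  ("LogisticRegression", LogisticRegression())]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_train)  # use the test split in practice
    results[name] = confusion_matrix(y_train, pred, labels=["REAL", "FAKE"])
```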
To improve the performance of our machine learning algorithms, we used the TFIDF vectorizer, which converts the text documents into a matrix of TFIDF features. Now let's see what TFIDF is and why it performs better than the count vectorizer.
Now, let’s break down TFIDF into TF i.e., Term Frequency and IDF i.e., Inverse Document Frequency.
Term Frequency represents the number of times a word appears in a document divided by the total
number of words in the document. The formulae of Term frequency is mathematically shown as below.
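In symbols, with $f_{w,d}$ the raw count of word $w$ in document $d$:

```latex
\mathrm{tf}(w, d) = \frac{f_{w,d}}{\sum_{w'} f_{w',d}}
```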
Inverse document frequency is the log of the total number of documents divided by the number of documents containing the word w; it is used to give more weight to words that are rare across all documents in the corpus.
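In symbols, with $N$ the total number of documents in the corpus, and the combined TFIDF score being the product of the two:

```latex
\mathrm{idf}(w) = \log\frac{N}{\lvert \{ d : w \in d \} \rvert},
\qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```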
The reason TFIDF performs better than the count vectorizer is that TFIDF gives higher weight to words that are rare across the documents, whereas the count vectorizer gives importance to common words, which are of very little value for text classification.
TFIDF is implemented by the TfidfVectorizer class defined in scikit-learn.
TFIDF Hyperparameters
Now let’s talk about its hyper-parameter first one is stopwords which is defined for making aware that
stopwords used in the text are of English language, max_df is used for removing terms that appear too
frequently, In our case, we have taken its value as 0.8 which means remove those words which appear in
more than 80% of the documents and the last hyper-parameter is ngram_range which is set as (1,2) i.e.,
it will allow both unigrams and bigrams.
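The three hyper-parameters above map directly onto TfidfVectorizer's constructor; a minimal sketch on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english",  # drop English stop words
                        max_df=0.8,            # drop terms in >80% of documents
                        ngram_range=(1, 2))    # allow unigrams and bigrams

# Toy corpus standing in for the training titles
docs = ["shocking obama video", "obama visits white house",
        "white house budget bill"]
X = tfidf.fit_transform(docs)
```

Because ngram_range is (1, 2), the learned vocabulary contains bigrams such as "white house" alongside the unigrams.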
Next, with the TFIDF vectorizer, we used the same machine-learning algorithms, i.e., Multinomial Naïve Bayes, the Passive Aggressive classifier, and Logistic Regression. The results of these algorithms in terms of confusion matrices are shown below.
Figure 8. Figure showing the confusion matrix of Multinomial Naïve Bayes (Left), Passive-aggressive
classifier (Middle) and Logistic regression (Right) using TFIDF vectorizer
Result Analysis
Now let’s see detailed results of both count vectorizer and TFIDF vectorizer and compare which one
performs better. Now, let’s compare both the techniques and find which one performs better.
Findings and discussion
As per the above table, the Passive Aggressive classifier with TFIDF outperforms the others in terms of accuracy, precision, and F1-score, whereas the highest recall is achieved by Multinomial Naïve Bayes, which is really surprising. It is important to notice that, after applying TFIDF, a significant improvement is observed in the performance of all three algorithms in comparison to the count vectorizer. In the case of the count vectorizer, Logistic Regression outperforms the others in all performance measures except recall.
Feature Importance
Now, let’s see what are the most informative features in predicting Fake and True news with the TFIDF
vectorizer. The top 10 key features in predicting fake news are shown below.
The top 10 key features in predicting true news are shown below.
As we can see from the above figures, the first column is the label, the second column contains the coefficients, and the last one lists the top contributing features. From the figures, it is evident that the top features for fake news have the most negative coefficient values, whereas those for true news have the most positive values.
Among the top fake and true news features, mostly unigrams are present, although in the case of true news one bigram feature is also present, i.e., Islamic state.