Project Report
News is important because it keeps the public aware of developments and happenings in their
local world and beyond. According to reports, the majority of adults prefer to get their news from
digital sources such as social media and search engines rather than traditional media.
Fake news can be defined as news articles that are intentionally and verifiably false and could
mislead readers.
The NLTK toolkit, which contains a set of libraries and many programs oriented to NLP, is utilized.
Scikit-learn, which provides machine learning algorithms for clustering, regression and
classification, has also been imported. These libraries form the core of the program and are used
in combination with other libraries such as SciPy and NumPy.
The dataset has been taken from the Kaggle website. After obtaining the dataset, the methodology
is built in three phases. The first phase is data pre-processing, which loads the dataset from a
.csv file into a Pandas DataFrame so that the data can be handled more efficiently.
In the second phase the data is divided into two data frames, one labeled as false and the other
as true, based on the labels known beforehand. In the final phase, tokenization and cleaning are
performed on these data frames, and the clean data is split into training and test sets and fed to
supervised algorithms from the Scikit-learn package, producing predictions that allow us to
analyze the accuracy of the classifiers.
Now, finally starting off with our code, you can write it in a Jupyter Notebook, Google Colab, or
any other platform you like. I would prefer Google Colab, as it makes it easy to write, edit, and
comment on the code.
Workflow Chart
The workflow of fake news prediction using logistic regression is
Importing libraries/dependencies
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Covering the importance of each library/module/function that we imported:
1. NumPy : It is a general-purpose array and matrix processing package.
2. Pandas : It allows us to perform various operations on datasets.
3. re : It is a built-in RegEx package, which can be used to work with Regular Expressions.
4. NLTK : It is a suite of libraries and programs for symbolic and statistical natural language
processing (NLP).
5. nltk.corpus : This package defines a collection of corpus reader classes, which can be used to
access the contents of a diverse set of corpora.
6. stopwords : The words which are generally filtered out before processing natural language
are called stop words. These are the most common words in any language (such as articles,
prepositions, pronouns and conjunctions) and do not add much information to the text
(for example: and, of, are). The stop word list must be downloaded once before use (see the
note after this list).
7. PorterStemmer : A package to help us with stemming of words. (More about stemming in
the Data Preprocessing section)
8. Scikit-learn (sklearn) : It provides a selection of efficient tools for machine learning and
statistical modeling including classification, regression, clustering and dimensionality
reduction via a consistent interface in Python.
9. feature_extraction.text : It is used to extract features in a format supported by machine
learning algorithms from datasets consisting of text.
10. TfidfVectorizer : It transforms text to feature vectors that can be used as input to an estimator.
(More about TfidfVectorizer in the Data Preprocessing section)
11. train_test_split : It is a function in Sklearn model selection for splitting data arrays into two
subsets - for training data and for testing data.
12. LogisticRegression : A pretty self-explanatory part of the code, used to import the Logistic
Regression classifier.
13. metrics and accuracy_score : To import Accuracy classification score from the metrics
module.
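Note: before the stop word list can be used, the NLTK stopwords corpus has to be downloaded once.
A minimal one-time setup (assuming an environment with internet access):
import nltk
nltk.download('stopwords')   # downloads the stop word lists used by nltk.corpus.stopwords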
Dataset
The dataset used to implement our model has been taken from the Kaggle public repository
https://www.kaggle.com/c/fake-news/data?select=train.csv and consists of 20800 English-language
news articles. The dataset includes the columns id, title, author, text and label, where the label
is either 1 or 0 for fake or real news.
For loading the data set we used
data = pd.read_csv('fakenews.csv')
data.head()
Here the label indicates whether a news article is fake or not: 0 denotes that it is real and 1
denotes that it is fake.
Data preprocessing
After importing our libraries and the dataset, it is important to preprocess the data before we train
our ML model since there might be some anomalies and missing data points which might make
our predictions a bit skewed from the actual values.
Now, we can check the size of the data frame, as it determines whether we can drop the rows with
null values without significantly shrinking our dataset. This check gives us (20800, 5), which
means that we have 20800 entries and 5 columns (features).
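A minimal sketch of this check (assuming the DataFrame is named data, as in the loading step above):
data.shape   # returns (20800, 5): 20800 rows and 5 columns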
Missing Values
Checking the total number of missing values in each of the columns, we find that title has 558
missing values, author has 1957 and text has 39. From this we can see that we would have to delete
at least 1957 rows to remove all the null values, so it is better to fill these null values with an
empty string. For that we can use fillna.
After this step we no longer have any missing data points; you can check that using isnull().sum(),
as sketched below.
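A short sketch of these two steps (the column names follow the dataset described above):
data.isnull().sum()      # counts missing values per column (title: 558, author: 1957, text: 39)
data = data.fillna('')   # replaces every missing value with an empty string
data.isnull().sum()      # now every count should be 0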
Reducing Columns
Now, we'll reduce those 5 columns to only 2 columns, since that makes it easier to train the model.
For that we'll combine the title and the author columns into one, naming it content. We can drop
the other columns, as they don't have much effect on determining whether the article is fake or
not. This step leaves us with 2 columns, content and label; a possible way to do this is sketched below.
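One way to build the content column (the exact separator and the order of the columns are a choice;
here author and title are joined with a space, and the resulting frame is called df1, the name used
in the later code):
df1 = data.copy()
df1['content'] = df1['author'] + ' ' + df1['title']   # combine author and title into one column
df1 = df1[['content', 'label']]                        # keep only the two columns we need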
Stemming
This is the next step in the normalization of text: converting the tokens to their basic/root words.
This process is referred to as stemming, and it is used to reduce the number of word forms in the
data. Stemming does this by stripping the affixes of words. In this model the Porter stemmer
(NLTK's PorterStemmer) is used. It converts words like 'extremely' to 'extreme', and 'minister'
changes to 'minist' in the dataset. In the dataset the word 'secretory' was among the most commonly
used words, and hence this algorithm was applied mostly to this word.
stemmer = PorterStemmer()
This creates a Porter stemmer object named stemmer, so that we can call its stem() method without
constructing PorterStemmer() every time.
def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)                 #1
    stemmed_content = stemmed_content.lower()                           #2
    stemmed_content = stemmed_content.split()                           #3
    stemmed_content = [stemmer.stem(word) for word in stemmed_content
                       if not word in stopwords.words('english')]       #4
    stemmed_content = ' '.join(stemmed_content)                         #5
    return stemmed_content                                              #6
Okay, so let's go in depth and see what this function actually does. I have numbered each line
from 1 to 6 so that you can easily distinguish between the different lines of code and understand
each line's use.
1. First we use the re package and remove everything that is not a letter (lowercase or uppercase),
replacing it with a space.
2. Then we convert the whole text to lowercase.
3. Then we split the text into a list of individual words.
4. Then we use the stemmer and stem each word in the list, and remove every word that is an
English stop word.
5. We then join all these words, which were present in the form of a list, back into a sentence.
6. Finally we return the stemmed content, which has now been preprocessed.
We apply this function to the content column and separate the features and labels:
df1['content'] = df1['content'].apply(stemming)
df1['content'].head()
X = df1['content'].values
y = df1['label'].values
Our last preprocessing step is to transform the textual X into numerical form so that our ML
model can understand it and work with it. This is where TfidfVectorizer comes into play:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
print(X)
The output of print(X) is a sparse matrix listing the TF-IDF score of each term in each document.
TfidfVectorizer
TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is a technique to quantify a
word in documents: we compute a weight for each word which signifies the importance of that word in
the document and in the corpus. This method is widely used in Information Retrieval and Text
Mining. TF is specific to each document and word, hence we can formulate TF as follows:
tf(t, d) = (count of t in d) / (number of words in d)
DF measures the importance of a term across the whole corpus and is very similar to TF. The only
difference is that TF is a frequency counter for a term t in a single document d, whereas DF is the
count of documents in the set N in which the term t occurs:
df(t) = occurrence of t in documents (i.e., the number of documents containing t)
IDF is the inverse of the document frequency, which measures the informativeness of a term t. When
we calculate IDF, it will be very low for the most frequently occurring words such as stop words
(because a stop word such as "is" is present in almost all documents, and N/df gives a very low
value for that word). This finally gives us what we want, a relative weighting:
idf(t) = N / df(t)
The TF-IDF weight of a term in a document is then the product tf(t, d) * idf(t).
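As a small illustrative example using the simplified formulas above (real implementations such as
scikit-learn's TfidfVectorizer additionally apply a logarithm and normalization to the IDF term):
if the term 'minist' appears 3 times in a 100-word article, tf = 3/100 = 0.03; if it appears in 50
of 20800 articles, idf = 20800/50 = 416, giving a TF-IDF weight of 0.03 * 416 = 12.48, whereas a
stop-word-like term appearing in every article would get idf = 20800/20800 = 1.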
Now that we have X in our desired form, we can move on to the next step.
Splitting the data set
We divide our dataset into 80% as the training set and 20% as the test set. stratify=y ensures that
the train and test sets have roughly the same distribution of the two classes (0 and 1, i.e. real
and fake). random_state=2 guarantees that the split is always the same. A sketch of this call is
shown below.
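A minimal sketch of the split described above (an 80/20 split implies test_size=0.2; the variable
names match those used later in the report):
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2)   # 80% train, 20% test, class-balanced, reproducible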
The logistic regression (LR) algorithm uses a logistic function for binary classification: it
applies a linear model to the features and maps the result to one of two categories (1 or 0, True
or False, etc.). If it is used for multi-class classification, it reduces the problem to a set of
binary classifications, that is, it takes one category at a time and decides whether the data
belongs to this category or to the rest of the categories. In this project, the algorithm is used
for binary classification, meaning that the news is either real or fake.
Logistic regression is a statistical analysis method for predicting a binary outcome, such as yes
or no (binary classification), based on prior observations of a dataset. It is a supervised
statistical technique for finding the probability of the dependent variable. It relies on the
sigmoid (logistic) function, whose S-shaped curve converts the model's output into a probability
between 0 and 1, which can then be turned into a binary value for prediction. If the obtained
probability is less than 0.5 the sample is considered to belong to class 0, and if the value is
more than 0.5 it is part of class 1. A small sketch of this decision rule follows.
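A minimal illustration of the sigmoid and the 0.5 threshold (this is only a sketch of the decision
rule; scikit-learn's LogisticRegression handles this internally):
import numpy as np   # already imported above as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))                        # maps any real value to (0, 1)

probability = sigmoid(1.2)                             # example linear score z = 1.2
predicted_class = 1 if probability > 0.5 else 0        # threshold at 0.5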
Model Training
We have used Logistic Regression from scikit-learn. After the data is properly prepared, the
machine learning model is ready to be trained. In this model training stage, the overall approach
was considered and the learning task was decided, which is a prediction task. The available
features in the training dataset were then studied, and an appropriate algorithm was selected to
train the model; in this case, Logistic Regression was chosen. The dataset is fit into the
algorithm for training and testing purposes.
model = LogisticRegression()
model.fit(X_train, y_train)
Testing
Once the model has been trained, testing needs to be done. As described above, 80% of the data was
used for training and the remaining 20% is used for testing the logistic regression model.
Evaluation
X_train_prediction = model.predict(X_train)
training_accuracy = accuracy_score(X_train_prediction, y_train)
print(training_accuracy)
So I got about 98.66%, which is pretty good. Similarly for the test dataset.
X_test_prediction = model.predict(X_test)
testing_accuracy = accuracy_score(X_test_prediction, y_test)
print(testing_accuracy)
The test accuracy is also pretty good, around 97% on the test data. So we have successfully trained
our machine learning model.
Finally, to make this model useful, we need to build a small predictive system. Taking a sample out
of the test set (I took the first sample):
X_sample = X_test[0]
prediction = model.predict(X_sample)
if prediction[0] == 0:
    print('The NEWS is Real!')
else:
    print('The NEWS is Fake!')
With this we have built a simple predictive system as well. If you want to take it a step further,
try inputting a raw text sample and predicting with that (a sketch of this idea follows). You can
now give yourself a pat on the back, as you now know how to detect a fake news article using only
Logistic Regression.
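A possible sketch of classifying a raw headline, assuming the fitted vectorizer from the
preprocessing step and the stemming function defined earlier are still available (the sample text
here is made up purely for illustration):
sample_text = 'Author Name Some surprising headline about politics'
sample_features = vectorizer.transform([stemming(sample_text)])   # same preprocessing as training
print('Fake' if model.predict(sample_features)[0] == 1 else 'Real')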
The results of the system show good accuracy in detecting and separating fake and true news. They
show that the logistic regression algorithm is excellent at detecting fake news: the accuracy
achieved by the logistic regression model on the training dataset is about 98% and the accuracy on
the test dataset is about 97%, which is pretty good. If we compare our accuracy score with
different models, we get different accuracies for different models and different datasets, so let
us see the difference.
Table 1: Accuracy achieved by each algorithm on the four considered datasets.
It is evident that the maximum accuracy achieved on DS1 (the ISOT Fake News Dataset) is 99%,
achieved by the random forest algorithm and by Perez-LSVM. Linear SVM, multilayer perceptron,
bagging classifiers, and
boosting classifiers achieved an accuracy of 98%. Average accuracy attained by ensemble
learners is 97.67% on DS1, whereas the corresponding average for individual learners is 95.25%.
The absolute difference between individual learners and ensemble learners is 2.42%, which is not
significant. The benchmark algorithms Wang-CNN and Wang-Bi-LSTM performed worse than all other
algorithms. On DS2, the bagging classifier (decision trees) and the boosting classifier (XGBoost)
are the best performing algorithms, achieving an accuracy of 94%. Interestingly, linear SVM,
random forest, and Perez-LSVM performed poorly on DS2. Individual learners reported an
accuracy of 47.75%, whereas ensemble learners’ accuracy is 81.5%. A similar trend is observed
for DS3, where individual learners’ accuracy is 80% whereas ensemble learners’ accuracy is
93.5%. However, unlike DS2, the best performing algorithm on DS3 is Perez-LSVM which
achieved an accuracy of 96%. On DS4 (DS1, DS2, and DS3 combined), the best performing
algorithm is random forest (91% accuracy). On average, individual learners achieved an
accuracy of 85%, whereas ensemble learners achieved an accuracy of 88.16%. The worst performing
algorithm is Wang-Bi-LSTM, which achieved an accuracy of 62%.
Figure 1: Accuracy on the different datasets.