
Fake News Classification.ipynb - Colaboratory

This document walks through classifying fake news with a machine learning model. It performs the following steps: 1. imports libraries and loads a dataset of news articles labeled as real or fake; 2. preprocesses the article titles by tokenizing, lowercasing, removing stopwords, and lemmatizing, then vectorizes them with TF-IDF; 3. splits the data into training and test sets and trains a random forest classifier; 4. evaluates the model on the test set, reporting 93.7% accuracy.


Required Libraries


import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
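
The stopword list and WordNet lemmatizer used further down depend on NLTK data files that are not bundled with the library. A minimal sketch of the one-time downloads (the same calls appear inline later in the notebook):

import nltk

# One-time downloads into the runtime's nltk_data directory;
# 'stopwords' backs stopwords.words('english') and 'wordnet' backs WordNetLemmatizer.
nltk.download('stopwords')
nltk.download('wordnet')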

1. Data Gathering


df = pd.read_csv("/content/drive/MyDrive/Fake news detection/News_dataset.csv")
df.head()

   id                                              title              author                                               text  label
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...       Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...      1
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...     Daniel J. Flynn  Ever get the feeling your life circles the rou...      0
2   2                  Why the Truth Might Get You Fired  Consortiumnews.com  Why the Truth Might Get You Fired October 29, ...      1

2. Data Analysis


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20800 non-null int64
1 title 20242 non-null object
2 author 18843 non-null object
3 text 20761 non-null object
4 label 20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB

df['label'].value_counts()

1 10413
0 10387
Name: label, dtype: int64

df.shape

(20800, 5)

df.isna().sum()

id 0
title 558
author 1957
text 39
label 0
dtype: int64

df = df.dropna()  # handle missing values by dropping those rows

df.isna().sum()

id 0
title 0
author 0
text 0
label 0
dtype: int64

df.shape

(18285, 5)

df.reset_index(inplace=True)
df.head()

   index  id                                              title              author                                               text  label
0      0   0  House Dem Aide: We Didn’t Even See Comey’s Let...       Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...      1
1      1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...     Daniel J. Flynn  Ever get the feeling your life circles the rou...      0
2      2   2                  Why the Truth Might Get You Fired  Consortiumnews.com  Why the Truth Might Get You Fired October 29, ...      1

df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

df = df.drop(['id', 'text', 'author'], axis=1)  # keep only the title and label columns
df.head()

   index                                              title  label
0      0  House Dem Aide: We Didn’t Even See Comey’s Let...      1
1      1  FLYNN: Hillary Clinton, Big Woman on Campus - ...      0
2      2                  Why the Truth Might Get You Fired      1
3      3  15 Civilians Killed In Single US Airstrike Hav...      1
4      4  Iranian woman jailed for fictional unpublished...      1

3. Data Preprocessing


1. Tokenization
sample_data = 'The quick brown fox jumps over the lazy dog'
sample_data = sample_data.split()
sample_data

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

2. Make Lowercase


sample_data = [data.lower() for data in sample_data]
sample_data

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

3. Remove Stopwords


nltk.download('stopwords')
stopwords = stopwords.words('english')  # note: rebinds the imported module name to a plain list of words
print(stopwords[0:10])
print(len(stopwords))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
179
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.

sample_data = [data for data in sample_data if data not in stopwords]
print(sample_data)
len(sample_data)

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
6

4. Stemming
ps = PorterStemmer()
sample_data_stemming = [ps.stem(data) for data in sample_data]
print(sample_data_stemming)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']

5. Lemmatization
nltk.download('wordnet')
lm = WordNetLemmatizer()
sample_data_lemma = [lm.lemmatize(data) for data in sample_data]
print(sample_data_lemma)

[nltk_data] Downloading package wordnet to /root/nltk_data...


['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']

lm = WordNetLemmatizer()
corpus = []
for i in range(len(df)):
    # Note: the pattern '^a-zA-Z0-9' matches the literal text "a-zA-Z0-9" at the
    # start of the string, so it almost never fires; the negated character class
    # '[^a-zA-Z0-9]' was presumably intended to strip punctuation. The outputs
    # below reflect the pattern as written (punctuation is retained).
    review = re.sub('^a-zA-Z0-9', ' ', df['title'][i])
    review = review.lower()
    review = review.split()
    review = [lm.lemmatize(x) for x in review if x not in stopwords]
    review = " ".join(review)
    corpus.append(review)

len(corpus)

18285

df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

corpus[0]

'house dem aide: didn’t even see comey’s letter jason chaffetz tweeted'

4. Vectorization (Converting Text into Vectors)


tf = TfidfVectorizer()
x = tf.fit_transform(corpus).toarray()
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
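
One caveat: .toarray() densifies the TF-IDF matrix, allocating one float per (document, vocabulary term) pair, which grows quickly with 18,285 titles. A hedged alternative sketch keeps the sparse matrix that fit_transform returns; scikit-learn's RandomForestClassifier accepts sparse input directly, and get_feature_names_out() (available in recent scikit-learn versions) exposes the learned vocabulary:

# Keep the TF-IDF features sparse instead of calling .toarray().
x_sparse = tf.fit_transform(corpus)      # scipy.sparse matrix, mostly zeros
print(x_sparse.shape)                    # (n_documents, vocabulary_size)
print(tf.get_feature_names_out()[:10])   # first few terms in the learned vocabulary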

y = df['label']
y.head()

0 1
1 0
2 1
3 1
4 1
Name: label, dtype: int64

Data Splitting into Train and Test Sets


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y)

len(x_train),len(y_train)

(12799, 12799)

len(x_test), len(y_test)

(5486, 5486)

5. Model Building


rf = RandomForestClassifier()
rf.fit(x_train, y_train)

RandomForestClassifier()
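
The forest is trained with library defaults (100 trees, unbounded depth), so repeated runs can differ slightly. A sketch with an explicit seed, reusing the random_state=10 already chosen for the split (an assumption on my part, not something the notebook does):

# Fixed seed for reproducible tree construction; other settings are the defaults.
rf = RandomForestClassifier(n_estimators=100, random_state=10)
rf.fit(x_train, y_train)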

6. Model Evaluation


y_pred = rf.predict(x_test)
accuracy_score_ = accuracy_score(y_test, y_pred)
accuracy_score_

0.9374772147283995

class Evaluation:

    def __init__(self, model, x_train, x_test, y_train, y_test):
        self.model = model
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test

    def train_evaluation(self):
        y_pred_train = self.model.predict(self.x_train)

        acc_scr_train = accuracy_score(self.y_train, y_pred_train)
        print("Accuracy Score On Training Data Set :", acc_scr_train)
        print()

        con_mat_train = confusion_matrix(self.y_train, y_pred_train)
        print("Confusion Matrix On Training Data Set :\n", con_mat_train)
        print()

        class_rep_train = classification_report(self.y_train, y_pred_train)
        print("Classification Report On Training Data Set :\n", class_rep_train)

    def test_evaluation(self):
        y_pred_test = self.model.predict(self.x_test)

        acc_scr_test = accuracy_score(self.y_test, y_pred_test)
        print("Accuracy Score On Testing Data Set :", acc_scr_test)
        print()

        con_mat_test = confusion_matrix(self.y_test, y_pred_test)
        print("Confusion Matrix On Testing Data Set :\n", con_mat_test)
        print()

        class_rep_test = classification_report(self.y_test, y_pred_test)
        print("Classification Report On Testing Data Set :\n", class_rep_test)

# Checking the accuracy on the training dataset

Evaluation(rf,x_train, x_test, y_train, y_test).train_evaluation()

Accuracy Score On Training Data Set : 1.0

Confusion Matrix On Training Data Set :
 [[7252    0]
 [   0 5547]]

Classification Report On Training Data Set :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7252
           1       1.00      1.00      1.00      5547

    accuracy                           1.00     12799
   macro avg       1.00      1.00      1.00     12799
weighted avg       1.00      1.00      1.00     12799

A perfect training score is expected here: an unconstrained random forest can effectively memorize the training set, so the test-set metrics below are the meaningful estimate of performance.

# Checking the accuracy on the testing dataset


Evaluation(rf,x_train, x_test, y_train, y_test).test_evaluation()

Accuracy Score On Testing Data Set : 0.9374772147283995

Confusion Matrix On Testing Data Set :
 [[2825  284]
 [  59 2318]]

Classification Report On Testing Data Set :
               precision    recall  f1-score   support

           0       0.98      0.91      0.94      3109
           1       0.89      0.98      0.93      2377

    accuracy                           0.94      5486
   macro avg       0.94      0.94      0.94      5486
weighted avg       0.94      0.94      0.94      5486

Prediction Pipeline


class Preprocessing:

    def __init__(self, data):
        self.data = data

    def text_preprocessing_user(self):
        lm = WordNetLemmatizer()
        pred_data = [self.data]
        preprocess_data = []
        for data in pred_data:
            # Same pattern caveat as in the training loop above:
            # '[^a-zA-Z0-9]' was presumably intended.
            review = re.sub('^a-zA-Z0-9', ' ', data)
            review = review.lower()
            review = review.split()
            review = [lm.lemmatize(x) for x in review if x not in stopwords]
            review = " ".join(review)
            preprocess_data.append(review)
        return preprocess_data

df['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

data = 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'


Preprocessing(data).text_preprocessing_user()

['flynn: hillary clinton, big woman campus - breitbart']

class Prediction:

    def __init__(self, pred_data, model):
        self.pred_data = pred_data
        self.model = model

    def prediction_model(self):
        preprocess_data = Preprocessing(self.pred_data).text_preprocessing_user()
        data = tf.transform(preprocess_data)  # reuse the TF-IDF vectorizer fitted above
        prediction = self.model.predict(data)

        # The notebook maps label 0 to "Fake" and label 1 to "Real".
        if prediction[0] == 0:
            return "The News Is Fake"
        else:
            return "The News Is Real"

data = 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'


Prediction(data,rf).prediction_model()

'The News Is Fake'

df['title'][3]

'15 Civilians Killed In Single US Airstrike Have Been Identified'

user_data = '15 Civilians Killed In Single US Airstrike Have Been Identified'


Prediction(user_data,rf).prediction_model()

'The News Is Real'
