Fake News Detection EDA Case Study
Develop a machine learning program to identify when an article might be fake news. Run by the UTK Machine
Learning Club.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import re
import os
import pandas as pd
from tqdm import tqdm
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import seaborn as sns
from string import punctuation
import matplotlib.pyplot as plt
import time
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
0 0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn Ever get the feeling your life circles the rou... 0
2 2 Why the Truth Might Get You Fired Consortiumnews.com Why the Truth Might Get You Fired October 29, ... 1
3 3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss Videos 15 Civilians Killed In Single US Airstr... 1
4 4 Iranian woman jailed for fictional unpublished... Howard Portnoy Print \nAn Iranian woman has been sentenced to... 1
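# error_bad_lines=False was removed in pandas 2.0; on_bad_lines="skip" is the replacement (pandas >= 1.3)
df_test = pd.read_csv("test.csv", on_bad_lines="skip")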
df_test.head()
0 20800 Specter of Trump Loosens Tongues, if Not Purse... David Streitfeld PALO ALTO, Calif. — After years of scorning...
1 20801 Russian warships ready to strike terrorists ne... NaN Russian warships ready to strike terrorists ne...
2 20802 #NoDAPL: Native American Leaders Vow to Stay A... Common Dreams Videos #NoDAPL: Native American Leaders Vow to...
3 20803 Tim Tebow Will Attempt Another Comeback, This ... Daniel Victor If at first you don’t succeed, try a different...
4 20804 Keiser Report: Meme Wars (E995) Truth Broadcast Network 42 mins ago 1 Views 0 Comments 0 Likes 'For th...
Observation:
The train dataset contains 20800 rows and 5 columns; its reported size is 812.6+ KB.
df_train["label"].value_counts()
1 10413
0 10387
Name: label, dtype: int64
#plotting the pie plot for class label column distribution on train dataset
plt.figure(figsize=(10,5))
plt.pie(df_train["label"].value_counts(),labels=["unreliable/Fake","reliable/Not-Fake"],autopct=lambda p:f'{p:.
plt.show()
Observation:-
In the train dataset, 50.06% of the datapoints belong to fake news articles and 49.94% belong to not-fake articles.
id 0
title 558
author 1957
text 39
label 0
dtype: int64
Observation
1. The output above shows that the id column has zero NULL/NaN values.
2. The title column has 558 rows with NULL/NaN values.
3. The author column has 1957 rows with NULL/NaN values.
4. The text column has 39 rows with NULL/NaN values.
5. The label column has zero NULL/NaN values.
# After droping all the NAN rows from the data again cheking if we have removed all the NULL/NAN rows from trai
df_train.isnull().sum()
id 0
title 0
author 0
text 0
label 0
dtype: int64
(18285, 5)
Observation:-
1. Removing the NaN rows from the train dataset cost us 2515 rows.
2. The train dataset now has no duplicate rows and no rows with NaN values.
3. The train dataset now has 18285 rows.
#After dropping the NAN rows from data seet plotting the pie plot for class label column to see the both class
plt.figure(figsize=(10,5))
plt.pie(df_train["label"].value_counts(),labels=["unreliable/Fake","reliable/Not-Fake"],autopct=lambda p:f'{p:.
plt.show()
plt.figure(figsize=(10,6))
sns.countplot(x ='label', hue = "label", data = df_train)
plt.title('both Fake and not fake news class count on Train dataset', fontsize=15)
plt.show()
Observation:
The pie plot above shows that, after removing the NaN rows from the dataset, 56.66% of the rows are Fake article datapoints and 43.34% are Not-Fake articles, so the data is still reasonably balanced.
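One thing worth checking in the pie plots above: `value_counts()` sorts classes by frequency, so a fixed `labels=[...]` list can end up attached to the wrong slice once the class counts change (for example after dropping NaN rows). A minimal sketch of deriving the labels from the class values instead, assuming label 1 means fake and label 0 means not fake; the helper name `pie_labels` is hypothetical and not part of the notebook:

```python
# Display names keyed by class value (assumption: 1 = fake, 0 = not fake).
LABEL_NAMES = {1: "unreliable/Fake", 0: "reliable/Not-Fake"}


def pie_labels(class_values, names=LABEL_NAMES):
    """Return display names in the same order as the given class values."""
    return [names[value] for value in class_values]


# With pandas, the class values would come from the value_counts() index:
#   counts = df_train["label"].value_counts()
#   plt.pie(counts, labels=pie_labels(counts.index), autopct="%.2f%%")
print(pie_labels([0, 1]))
print(pie_labels([1, 0]))
```

Whichever class is most frequent, each slice then carries the name that belongs to its own label value.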
Observation:
The test dataset contains 5200 rows and 4 columns (it has no label column); its reported size is 162.6+ KB.
id 0
title 122
author 503
text 7
dtype: int64
id 0
title 0
author 0
text 0
dtype: int64
# Checking for duplicate rows in test dataset
df_test.duplicated().sum()
(5200, 4)
Observation:-
1. We have removed the rows with NULL/NaN values from the test dataset.
2. The test dataset has no duplicate rows.
3. Removing the rows with NaN values cost us 625 rows, leaving 4575 rows in the test dataset.
%%time
df_train['num_characters_title'] = df_train['title'].apply(len)
df_train['num_characters_text'] = df_train['text'].apply(len)
Wall time: 16 ms
%%time
df_test['num_characters_title'] = df_test['title'].apply(len)
df_test['num_characters_text'] = df_test['text'].apply(len)
Wall time: 8 ms
%%time
df_train['num_word_title'] = df_train['title'].apply(lambda x:len(nltk.word_tokenize(x)))
df_train['num_word_text'] = df_train['text'].apply(lambda x:len(nltk.word_tokenize(x)))
Wall time: 1min 32s
%%time
df_test['num_word_title'] = df_test['title'].apply(lambda x:len(nltk.word_tokenize(x)))
df_test['num_word_text'] = df_test['text'].apply(lambda x:len(nltk.word_tokenize(x)))
Wall time: 30.1 s
%%time
df_train['num_sentences_title'] = df_train['title'].apply(lambda x:len(nltk.sent_tokenize(x)))
df_train['num_sentences_text'] = df_train['text'].apply(lambda x:len(nltk.sent_tokenize(x)))
Wall time: 17.9 s
%%time
df_test['num_sentences_title'] = df_test['title'].apply(lambda x:len(nltk.sent_tokenize(x)))
df_test['num_sentences_text'] = df_test['text'].apply(lambda x:len(nltk.sent_tokenize(x)))
Wall time: 4.97 s
   id  title  author  text  label  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title
0  0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...  1  81  4930  19  943  1
1  1  FLYNN: Hillary Clinton, Big Woman on Campus - ...  Daniel J. Flynn  Ever get the feeling your life circles the rou...  0  55  4160  11  822  1
#test dataset
df_test.head(2)
   id  title  author  text  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title
0  20800  Specter of Trump Loosens Tongues, if Not Purse...  David Streitfeld  PALO ALTO, Calif. — After years of scorning...  94  8015  19  1588  1
1  20801  Russian warships ready to strike terrorists ne...  David Streitfeld  Russian warships ready to strike terrorists ne...  55  1559  8  277  1
Univariate Analysis
PDF plot
#PDF plot for num_characters_title feature
# sns.distplot is deprecated; sns.kdeplot draws the same per-class density curve
sns.FacetGrid(df_train, hue="label", height=8).map(sns.kdeplot, "num_characters_title").add_legend()
plt.title("PDF plot for num_characters_title feature")
plt.show()
Observations:-
1. If the title has more than ~200 characters or fewer than ~10 characters, the article tends to be fake news.
2. If the title has between ~10 and ~180 characters, the article tends to be not fake news.
3. This feature can be useful for separating the two classes.
Observations:-
1. If the title has between ~10 and 20 words, the article tends to be not fake news.
2. If the title has fewer than ~10 words, the article may be fake news.
3. This feature separates the two classes to some extent, so it can be useful.
Observations:-
1. The PDFs above almost overlap, so no observation can be drawn from them.
Observations:-
1. The PDFs above almost overlap, so no observation can be drawn from them.
plt.xlabel("num_characters_title")
plt.title("CDF plot for num_characters_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 85% of fake news titles contain at most 100 characters.
2. About 60% of not-fake news titles contain fewer than 100 characters.
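Each CDF observation in this notebook has the form "x% of articles have a value at or below some threshold". A minimal sketch of that computation using only the standard library; the function name `fraction_at_most` and the sample lengths are made up for illustration and are not taken from the dataset:

```python
from bisect import bisect_right


def fraction_at_most(values, threshold):
    """Empirical CDF: share of values that are <= threshold."""
    ordered = sorted(values)
    # bisect_right returns how many sorted values are <= threshold
    return bisect_right(ordered, threshold) / len(ordered)


# Made-up title lengths, for illustration only.
title_lengths = [42, 81, 55, 130, 97, 100, 64, 210]
print(fraction_at_most(title_lengths, 100))
```

Applied per class (for example to `num_characters_title` of the rows with `label == 1`), this gives the height of that class's CDF curve at the chosen threshold.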
plt.xlabel("num_characters_text")
plt.title("CDF plot for num_characters_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 98% of fake news texts contain at most 20000 characters.
2. About 90% of not-fake news texts contain fewer than 20000 characters.
plt.xlabel("num_word_title")
plt.title("CDF plot for num_word_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 80% of fake news titles contain fewer than 20 words.
2. About 85% of not-fake news titles contain at most 20 words.
plt.xlabel("num_word_text")
plt.title("CDF plot for num_word_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 90% of fake news texts contain fewer than 5000 words.
2. About 80% of not-fake news texts contain fewer than 5000 words.
plt.xlabel("num_sentences_title")
plt.title("CDF plot for num_sentences_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 99% of fake news titles contain fewer than 3 sentences.
2. About 98% of not-fake news titles contain at most 3 sentences.
plt.xlabel("num_sentences_text")
plt.title("CDF plot for num_sentences_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 98% of fake news texts contain fewer than 200 sentences.
2. About 80% of not-fake news texts contain fewer than 200 sentences.
def count_unique_words(text):
    '''Count the number of unique word tokens in a title or text string.'''
    word_tokens = nltk.word_tokenize(text)
    # A set keeps one copy of each token, so its size is the unique-token count
    return len(set(word_tokens))
%%time
df_test['Count_unique_words_title'] = df_test['title'].apply(lambda x:count_unique_words(x))
df_test['Count_unique_words_text'] = df_test['text'].apply(lambda x:count_unique_words(x))
Wall time: 33.9 s
def count_stopwords(text):
    '''Count the number of English stop words in a title or text string.'''
    stop_words = set(stopwords.words('english'))
    word_tokens = nltk.word_tokenize(text)
    # Tokens are compared as-is, so capitalized stop words ("The") are not counted
    c_stopwords = [w for w in word_tokens if w in stop_words]
    return len(c_stopwords)
%%time
df_test['Count_Stop_words_title'] = df_test['title'].apply(lambda x:count_stopwords(x))
df_test['Count_Stop_words_text'] = df_test['text'].apply(lambda x:count_stopwords(x))
Wall time: 27.1 s
%%time
#This can be calculated by dividing the counts of characters by counts of words.
df_test['Avg_word_length_title'] = df_test['num_characters_title']/df_test["num_word_title"]
df_test['Avg_word_length_text'] = df_test['num_characters_text']/df_test["num_word_text"]
Wall time: 0 ns
%%time
#This can be calculated by dividing the counts of words by the counts of sentences.
df_test['Avg_sentence_length_title'] = df_test['num_word_title']/df_test["num_sentences_title"]
df_test['Avg_sentence_length_text'] = df_test['num_word_text']/df_test["num_sentences_text"]
Wall time: 7.99 ms
%%time
#This feature is also the ratio of counts of stopwords to the total number of words.
df_test['Stopword_count_ratio_title'] = df_test['Count_Stop_words_title']/df_test["num_word_title"]
df_test['Stopword_count_ratio_text'] = df_test['Count_Stop_words_text']/df_test["num_word_text"]
Wall time: 0 ns
%%time
#This feature is basically the ratio of unique words to a total number of words.
df_test['Unique_words_count_ratio_title'] = df_test['Count_unique_words_title']/df_test["num_word_title"]
df_test['Unique_words_count_ratio_text'] = df_test['Count_unique_words_text']/df_test["num_word_text"]
Wall time: 0 ns
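To make the ratio definitions above concrete, here is a small worked sketch on a single string. It assumes plain whitespace splitting in place of `nltk.word_tokenize` and a one-word stop word set, so the numbers will differ from the notebook's; the helper name `ratio_features` is hypothetical:

```python
def ratio_features(text, stop_words):
    """Compute three of the ratio features for one string."""
    words = text.split()  # assumption: whitespace split instead of nltk.word_tokenize
    n_words = len(words)
    if n_words == 0:
        # Guard against division by zero for empty strings
        return {"avg_word_length": 0.0, "stopword_ratio": 0.0, "unique_ratio": 0.0}
    return {
        # characters (spaces included, as with len(text)) divided by words
        "avg_word_length": len(text) / n_words,
        # stop words divided by all words
        "stopword_ratio": sum(w in stop_words for w in words) / n_words,
        # unique words divided by all words; this can never exceed 1.0
        "unique_ratio": len(set(words)) / n_words,
    }


feats = ratio_features("the cat saw the dog", {"the"})
print(feats)
```

Here the string has 19 characters and 5 words, 2 of which are stop words and 4 of which are unique, giving ratios of 3.8, 0.4 and 0.8.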
Observations:-
1. Both PDFs have a zigzag shape.
2. If the title contains more than 5 stop words, the article tends to be fake news.
3. If the title contains fewer than 5 stop words, the article tends to be not fake news.
4. This feature can be useful for separating the two classes.
Observations:-
1. Both PDFs overlap and lie very close to each other, so no observation can be drawn from them.
Observations:-
1. Here too, both PDFs overlap.
2. If the average sentence length of the title is between ~5 and 20, the article tends to be not fake news.
3. If the average sentence length of the title is less than ~5, the article tends to be fake news.
4. This feature can be useful for separating the two classes.
Observations:-
1. Here too, both PDFs overlap.
2. If the average sentence length of the text is more than 100, the article tends to be fake news.
Observations:-
1. Both plots overlap.
2. If the unique words count ratio of the article text is close to 1.0, the article tends to be fake news (the ratio itself cannot exceed 1.0; any density drawn beyond 1.0 is a smoothing artifact of the PDF plot).
3. If the unique words count ratio of the article text is between 0.2 and 0.8, the article tends to be not fake news.
4. This feature can be useful for separating the two classes.
plt.xlabel("Count_unique_words_title")
plt.title("CDF plot for Count_unique_words_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 80% of fake news article titles have fewer than 20 unique words.
2. About 60% of not-fake news article titles have fewer than 20 unique words.
plt.xlabel("Count_unique_words_text")
plt.title("CDF plot for Count_unique_words_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observations:-
1. About 90% of fake news article texts have at most 1000 unique words.
2. About 80% of not-fake news article texts have fewer than 1000 unique words.
plt.xlabel("Count_Stop_words_title")
plt.title("CDF plot for Count_Stop_words_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 80% of fake article titles have fewer than 5 stop words.
2. About 60% of not-fake article titles have fewer than 2 stop words.
plt.xlabel("Count_Stop_words_text")
plt.title("CDF plot for Count_Stop_words_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 98% of fake article texts have fewer than 2000 stop words.
2. About 98% of not-fake article texts have fewer than 1000 stop words.
plt.xlabel("Avg_word_length_title")
plt.title("CDF plot for Avg_word_length_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 80% of fake news article titles have an average word length below 7.5.
2. About 45% of not-fake news article titles have an average word length of at most 5.
plt.xlabel("Avg_word_length_text")
plt.title("CDF plot for Avg_word_length_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 80% of fake news article texts have an average word length below 10.
2. About 60% of not-fake news article texts have an average word length below 10.
plt.xlabel("Avg_sentence_length_title")
plt.title("CDF plot for Avg_sentence_length_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 80% of fake news article titles have an average sentence length below 20.
2. About 60% of not-fake news article titles have an average sentence length below 20.
plt.xlabel("Avg_sentence_length_text")
plt.title("CDF plot for Avg_sentence_length_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 80% of fake news article texts have an average sentence length below 50.
2. About 80% of not-fake news article texts have an average sentence length below 50.
plt.xlabel("Stopword_count_ratio_title")
plt.title("CDF plot for Stopword_count_ratio_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 90% of fake news article titles have a stop word count ratio below 0.3.
2. About 80% of not-fake news article titles have a stop word count ratio below 0.2.
plt.xlabel("Stopword_count_ratio_text")
plt.title("CDF plot for Stopword_count_ratio_text feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='lower right');
plt.show()
Observation:-
1. About 80% of fake news article texts have a stop word count ratio below 0.4.
2. About 60% of not-fake news article texts have a stop word count ratio below 0.4.
plt.xlabel("Unique_words_count_ratio_title")
plt.title("CDF plot for Unique_words_count_ratio_title feature of Fake and Not_Fake news")
plt.grid()
plt.legend(['Fake news CDF', 'Not fake News CDF'],loc='upper center');
plt.show()
Observation:-
1. About 60% of fake news article titles have a unique word count ratio below 1.
2. About 60% of not-fake news article titles have a unique word count ratio below 1.
Observation:-
1. About 80% of fake news article texts have a unique word count ratio below 0.8.
2. About 80% of not-fake news article texts have a unique word count ratio below 0.6.
final_text=text.lower()
final_text=re.sub(r"[A-Za-z\d\-\.]+@[A-Za-z\.-]+\b", " ", final_text) #remove the email address from text
final_text = re.sub(r'http\S+', '', final_text) # remove http links from text
final_text=re.sub(r"\d+", " ", final_text) # remove any digit from text
final_text=re.sub(r"[^a-zA-Z]+", " ", final_text) # remove anything except Alphabets from text
# Dictionary mapping English contractions (apostrophe/short forms) to their expanded forms
CONTRACTION_MAP = {
"ain’t": "is not",
"aren’t": "are not",
"can’t": "cannot",
"can’t’ve": "cannot have",
"’cause": "because",
"could’ve": "could have",
"couldn’t": "could not",
"couldn’t’ve": "could not have",
"didn’t": "did not",
"doesn’t": "does not",
"don’t": "do not",
"hadn’t": "had not",
"hadn’t’ve": "had not have",
"hasn’t": "has not",
"haven’t": "have not",
"he’d": "he would",
"he’d’ve": "he would have",
"he’ll": "he will",
"he’ll’ve": "he will have",
"he’s": "he is",
"how’d": "how did",
"how’d’y": "how do you",
"how’ll": "how will",
"how’s": "how is",
"I’d": "I would",
"I’d’ve": "I would have",
"I’ll": "I will",
"I’ll’ve": "I will have",
"I’m": "I am",
"I’ve": "I have",
"i’d": "i would",
"i’d’ve": "i would have",
"i’ll": "i will",
"i’ll’ve": "i will have",
"i’m": "i am",
"i’ve": "i have",
"isn’t": "is not",
"it’d": "it would",
"it’d’ve": "it would have",
"it’ll": "it will",
"it’ll’ve": "it will have",
"it’s": "it is",
"let’s": "let us",
"ma’am": "madam",
"mayn’t": "may not",
"might’ve": "might have",
"mightn’t": "might not",
"mightn’t’ve": "might not have",
"must’ve": "must have",
"mustn’t": "must not",
"mustn’t’ve": "must not have",
"needn’t": "need not",
"needn’t’ve": "need not have",
"o’clock": "of the clock",
"oughtn’t": "ought not",
"oughtn’t’ve": "ought not have",
"shan’t": "shall not",
"sha’n’t": "shall not",
"shan’t’ve": "shall not have",
"she’d": "she would",
"she’d’ve": "she would have",
"she’ll": "she will",
"she’ll’ve": "she will have",
"she’s": "she is",
"should’ve": "should have",
"shouldn’t": "should not",
"shouldn’t’ve": "should not have",
"so’ve": "so have",
"so’s": "so as",
"that’d": "that would",
"that’d’ve": "that would have",
"that’s": "that is",
"there’d": "there would",
"there’d’ve": "there would have",
"there’s": "there is",
"they’d": "they would",
"they’d’ve": "they would have",
"they’ll": "they will",
"they’ll’ve": "they will have",
"they’re": "they are",
"they’ve": "they have",
"to’ve": "to have",
"wasn’t": "was not",
"we’d": "we would",
"we’d’ve": "we would have",
"we’ll": "we will",
"we’ll’ve": "we will have",
"we’re": "we are",
"we’ve": "we have",
"weren’t": "were not",
"what’ll": "what will",
"what’ll’ve": "what will have",
"what’re": "what are",
"what’s": "what is",
"what’ve": "what have",
"when’s": "when is",
"when’ve": "when have",
"where’d": "where did",
"where’s": "where is",
"where’ve": "where have",
"who’ll": "who will",
"who’ll’ve": "who will have",
"who’s": "who is",
"who’ve": "who have",
"why’s": "why is",
"why’ve": "why have",
"will’ve": "will have",
"won’t": "will not",
"won’t’ve": "will not have",
"would’ve": "would have",
"wouldn’t": "would not",
"wouldn’t’ve": "would not have",
"y’all": "you all",
"y’all’d": "you all would",
"y’all’d’ve": "you all would have",
"y’all’re": "you all are",
"y’all’ve": "you all have",
"you’d": "you would",
"you’d’ve": "you would have",
"you’ll": "you will",
"you’ll’ve": "you will have",
"you’re": "you are",
"you’ve": "you have"
}
def decontracted(text):
    '''Replace contractions (e.g. "don’t") in the text with their expanded forms.'''
    for word in text.split():
        # Only whitespace-delimited tokens that exactly match a key are expanded
        if word.lower() in CONTRACTION_MAP:
            text = text.replace(word, CONTRACTION_MAP[word.lower()])
    return text
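Because `decontracted` matches whole whitespace-delimited tokens, a contraction with punctuation attached (such as "don’t,") is left unexpanded. One possible alternative, sketched here with a regular expression and a three-entry subset of the map, is to match the keys directly inside the string. The names `CONTRACTIONS` and `expand_contractions` are hypothetical; the sketch assumes curly apostrophes as in `CONTRACTION_MAP` and lowercases every expansion:

```python
import re

# Small illustrative subset; the full CONTRACTION_MAP would be used in practice.
CONTRACTIONS = {"don’t": "do not", "it’s": "it is", "can’t": "cannot"}

# Longest keys first so that longer contractions win over their prefixes.
_KEYS = sorted(CONTRACTIONS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in _KEYS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Expand every known contraction, ignoring case and adjacent punctuation."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)


print(expand_contractions("Don’t panic, it’s fine."))
```

Keys that begin with an apostrophe (such as "’cause") would need separate handling, since `\b` does not match in front of them.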
df_train.head(2)
   id  title  author  text  label  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title  ...
0  0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...  1  81  4930  19  943  1  ...
2 rows × 25 columns
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(x = 'count',
y = 'Common_words',
data = Nuteraltemp)
plt.show()
return filtered_sentence
%%time
df_train["Without_Stopwords_text"]=df_train.apply(lambda x:remove_stopwords(x["cleaned_text"]),axis=1)
Wall time: 41.1 s
%%time
df_test["Without_Stopwords_text"]=df_test.apply(lambda x:remove_stopwords(x["cleaned_text"]),axis=1)
Wall time: 11.6 s
%%time
df_train["Without_Stopwords_title"]=df_train.apply(lambda x:remove_stopwords(x["cleaned_title"]),axis=1)
Wall time: 7.51 s
%%time
df_test["Without_Stopwords_title"]=df_test.apply(lambda x:remove_stopwords(x["cleaned_title"]),axis=1)
Wall time: 2.31 s
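The body of `remove_stopwords` is not shown above; only its `return filtered_sentence` line survives. A minimal sketch of what such a function could look like, assuming it takes a cleaned string and returns a string. The tiny stop word set below is a stand-in for nltk's English list, and whitespace splitting is assumed in place of nltk tokenization:

```python
# Small illustrative stop word set; the notebook uses nltk's English list instead.
STOP_WORDS = {"the", "a", "is", "of", "to", "in", "and"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Return the text with stop words removed, joined back into one string."""
    filtered_sentence = " ".join(
        word for word in text.split() if word not in stop_words
    )
    return filtered_sentence


print(remove_stopwords("the fbi is reopening the case"))
```

Returning a string rather than a list keeps the result compatible with the later `" ".join(...)` and `.str.cat(sep=" ")` calls on the `Without_Stopwords_*` columns.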
df_train.head(1)
   id  title  author  text  label  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title  ...
0  0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...  1  81  4930  19  943  1  ...
1 rows × 27 columns
df_test.head(1)
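   id  title  author  text  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title
0  20800  Specter of Trump Loosens Tongues, if Not Purse...  David Streitfeld  PALO ALTO, Calif. — After years of scorning...  94  8015  19  1588  1
1 rows × 26 columns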
Observation:-
1. The count plot above shows that the 30 most common words in not-fake news include Trump, people, etc.
Plotting the 30 most common words from the fake articles' title column
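from collections import Counter
# Count words across all fake-article titles (label == 1) and keep the 30 most frequent
count_n=Counter(" ".join(df_train[df_train["label"]==1]["Without_Stopwords_title"]).split()).most_common(30)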
Nuteraltemp = pd.DataFrame(count_n)
Nuteraltemp.columns = ['Common_words','count']
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(x = 'count',
y = 'Common_words',
data = Nuteraltemp)
plt.show()
<matplotlib.image.AxesImage at 0x2408cca7910>
Observation:-
1. The word cloud above shows authors who almost always publish fake news articles.
2. Authors drawn in larger text have more datapoints where they published fake articles, e.g. Eddy Lavine, noreply blogger.
<matplotlib.image.AxesImage at 0x240373f97c0>
Observation:-
1. The word cloud above shows authors who never publish fake news articles, e.g. John Hayward and Maggie Haberman.
Word Cloud for text column for Fake news articles on train
dataset
Fake_text = wc.generate(df_train_fake["Without_Stopwords_text"].str.cat(sep=" "))
plt.figure(figsize=(18,10))
plt.axis("off")
plt.imshow(Fake_text)
<matplotlib.image.AxesImage at 0x2410620ff10>
Observation:-
1. The word cloud above shows the most frequent words in the text column of fake articles, such as one and people.
<matplotlib.image.AxesImage at 0x240947d6280>
Observation:-
1. The word cloud above shows the most frequent words in the text column of not-fake articles, such as people, may, and united state.
<matplotlib.image.AxesImage at 0x2407d3d9940>
Observation:-
1. The word cloud above shows the most frequent words in the title column of fake articles, such as fbi and breaking.
<matplotlib.image.AxesImage at 0x24053f9ef70>
Observation:-
1. The word cloud above shows that the test dataset shares authors with the train dataset who always publish fake articles; for example, noreply blogger and admin are present in both the train and test datasets.
2. A few authors in the test dataset also appear in the train dataset and always publish reliable (not fake) news, e.g. Pam Key and Warner Todd are present in both the train and test datasets.
<matplotlib.image.AxesImage at 0x240343e3760>
Word cloud on text column
test_text = wc.generate(df_test["cleaned_text"].str.cat(sep=" "))
plt.figure(figsize=(18,10))
plt.axis("off")
plt.imshow(test_text)
<matplotlib.image.AxesImage at 0x240952ffa00>
Bivariate analysis :-
Observations:-
1. The 30 most common bigrams in fake news article text include hillary clinton, donald trump, and white house.
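The code that produced the bigram counts is not shown above. A minimal sketch of one way to count the most common bigrams with the standard library, assuming whitespace tokenization of already cleaned text; the function name `top_bigrams` and the three toy documents are made up for illustration:

```python
from collections import Counter


def top_bigrams(texts, n=30):
    """Return the n most common adjacent word pairs across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip pairs each token with its successor, giving the bigrams of one text
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Toy documents, for illustration only.
docs = [
    "hillary clinton white house",
    "donald trump white house",
    "hillary clinton email",
]
print(top_bigrams(docs, n=2))
```

Counting per document keeps bigrams from spanning the boundary between two articles, which joining all texts into one string would not.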
X_train_PCA.head(2)
   id  title  author  text  label  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title  ...
0  0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...  1  81  4930  19  943  1  ...
2 rows × 27 columns
X_train_PCA.shape
(18285, 27)
X_train_PCA.shape
(18285, 18)
X_train_PCA.head(2)
0 81 4930 19 943 1 37
1 55 4160 11 822 1 29
df_train.head(2)
   id  title  author  text  label  num_characters_title  num_characters_text  num_word_title  num_word_text  num_sentences_title  ...
0  0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...  1  81  4930  19  943  1  ...
2 rows × 27 columns
X_train_PCA.shape
(18285, 18)
2D scatter plot on the top 2 principal component features
[Figure: scatter plot of the first two principal components, points colored by label (1 or 0); vertical axis labeled y.]
Observations:-
1. In the plot above the two classes are partly separated, which shows that our manually engineered features are useful for separating the two classes to some extent.
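The PCA code itself is not shown above. As a rough sketch of the projection step, the snippet below standardizes a feature matrix and projects it onto its top two principal components using NumPy's SVD. This is an assumption about the procedure, not the notebook's actual code; the helper name `pca_2d` is hypothetical and the five rows reuse a few feature values printed earlier purely as sample input:

```python
import numpy as np


def pca_2d(features):
    """Project rows onto the top two principal components (standardized PCA via SVD)."""
    X = np.asarray(features, dtype=float)
    # Standardize each column; assumes no column is constant (std > 0)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T


# Columns: num_characters_title, num_characters_text, num_word_title (sample rows only)
rows = [
    [81, 4930, 19],
    [55, 4160, 11],
    [94, 8015, 19],
    [55, 1559, 8],
    [69, 287, 10],
]
Z = pca_2d(rows)
print(Z.shape)
```

The two columns of `Z` are what a 2D scatter plot would show on its axes, with each point colored by its label.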
X.shape,Y.shape
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(13713, 20) (13713,)
(4572, 20) (4572,)
X_train.head()
8682 69 287 10 42 1 2
Featurizing our title column data using CountVectorizer with max_features set to 3000
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer =CountVectorizer(ngram_range=(1,2),max_features=3000)
vectorizer.fit(X_train["Without_Stopwords_title"].values)
Featurizing our text column data using CountVectorizer with max_features set to 3000
%%time
vectorizer_text =CountVectorizer(ngram_range=(1,2),max_features=3000)
vectorizer_text.fit(X_train["Without_Stopwords_text"].values)
X_train.drop(["Without_Stopwords_text","Without_Stopwords_title"],axis=1,inplace=True)
X_train.head()
8682 69 287 10 42 1 2
X_test.drop(["Without_Stopwords_text","Without_Stopwords_title"],axis=1,inplace=True)
X_test.shape,X_train.shape
# Stack the title, text, and standardized numeric feature matrices column-wise
X_train_final = np.hstack((X_train_title, X_train_text,X_train_Num_Std))
X_test_final = np.hstack((X_test_title , X_test_text,X_test_Num_Std))
Final Observation:-
1. In total I created 18 features manually from the title and text columns of this dataset. Manually engineered feature list:-
A. Count_unique_words_title
B. Count_unique_words_text
C. Count_Stop_words_title
D. Count_Stop_words_text
E. Avg_word_length_title
F. Avg_word_length_text
G. Avg_sentence_length_title
H. Avg_sentence_length_text
I. Stopword_count_ratio_title
J. Stopword_count_ratio_text
K. Unique_words_count_ratio_title
L. Unique_words_count_ratio_text
M. Number of characters in title
N. Number of characters in text
O. Number of words in title
P. Number of words in text
Q. Number of Sentences in title
R. Number of Sentences in text
2. I applied PCA (Principal Component Analysis) to the 18 manually engineered features above, created two new features from the top two principal components, and drew a 2D scatter plot. The plot suggests that the manually engineered features separate the two classes to some extent.
3. Along with the 18 features, I applied count vectorization to the text feature (without stop words) and the title feature (without stop words), giving 6018 features in total (3000 from text vectorization, 3000 from title vectorization, and the 18 features listed above).
4. I built a baseline Gaussian Naive Bayes model on these 6018 features. After fitting it on the train split, it reached 86% accuracy on the test split, which is decent for a baseline model.
5. The author column will not be used as a model feature: the author name is sometimes hard to find for an article, and the name can itself be fake. We therefore drop the author column before model building.
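The steps in points 3 and 4 can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the notebook's exact code: the titles, numeric values and labels below are made up, only the title column is vectorized, and the sparse `CountVectorizer` output is converted with `.toarray()` because `GaussianNB` expects dense input.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Made-up toy data, for illustration only (1 = fake, 0 = not fake).
train_titles = [
    "breaking fbi email scandal",
    "senate passes budget bill",
    "breaking shocking secret exposed",
    "court hears budget case",
]
train_num = np.array([[26.0, 0.10], [25.0, 0.30], [32.0, 0.05], [23.0, 0.25]])
y_train = np.array([1, 0, 1, 0])
test_titles = ["breaking secret email", "budget bill passes"]
test_num = np.array([[21.0, 0.08], [18.0, 0.28]])

# Fit the vocabulary on the train split only, then reuse it for the test split
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=3000)
X_train_title = vectorizer.fit_transform(train_titles).toarray()
X_test_title = vectorizer.transform(test_titles).toarray()

# Standardize the numeric features with statistics from the train split
scaler = StandardScaler()
X_train_num = scaler.fit_transform(train_num)
X_test_num = scaler.transform(test_num)

# Concatenate bag-of-words counts and numeric features column-wise
X_train_final = np.hstack((X_train_title, X_train_num))
X_test_final = np.hstack((X_test_title, X_test_num))

model = GaussianNB()
model.fit(X_train_final, y_train)
pred = model.predict(X_test_final)
print(X_train_final.shape, X_test_final.shape, pred)
```

Fitting the vectorizer and the scaler on the train split alone keeps information from the test split out of the features, so the reported accuracy stays an honest estimate.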