
4.10. Text Data Pre-Processing - Use Case - Ipynb - Colaboratory

This notebook preprocesses text data for a fake news detection model. It loads a fake news dataset into a pandas DataFrame and prints the first five rows to show the data structure. It downloads the English stopwords from the NLTK corpus and prints them, fills missing values with empty strings, and merges the author name and title into a single content column. Stemming is then applied to reduce words to their root form, and the stemmed text (X) is converted to TF-IDF feature vectors, with the labels kept in Y.

Uploaded by

lokesh k

Importing the Dependencies

import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True

# printing the stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourse...

Data Pre-Processing

# load the data to a pandas dataframe
news_data = pd.read_csv('/content/drive/MyDrive/Datasets/Fake News Dataset/train.csv')

# first 5 rows of the dataset
news_data.head()

   id  title                                              author              text                                               label
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus       House Dem Aide: We Didn’t Even See Comey’s Let...  1
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...  Daniel J. Flynn     Ever get the feeling your life circles the rou...  0
2   2  Why the Truth Might Get You Fired                  Consortiumnews.com  Why the Truth Might Get You Fired October 29, ...  1
3   3  15 Civilians Killed In Single US Airstrike Hav...  Jessica Purkiss     Videos 15 Civilians Killed In Single US Airstr...  1
4   4  Iranian woman jailed for fictional unpublished...  Howard Portnoy      Print \nAn Iranian woman has been sentenced to...  1

0 --> Real News
1 --> Fake News

news_data.shape

(20800, 5)

# checking for missing values
news_data.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

# replacing the missing values with an empty string
news_data = news_data.fillna('')

# merging the author name and news title
news_data['content'] = news_data['author']+' '+news_data['title']

# first 5 rows of the dataset
news_data.head()

   id  title                                              author              text                                               label  content
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...  Darrell Lucus       House Dem Aide: We Didn’t Even See Comey’s Let...  1      Darrell Lu...
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...  Daniel J. Flynn     Ever get the feeling your life circles the rou...  0      Danie...
2   2  Why the Truth Might Get You Fired                  Consortiumnews.com  Why the Truth Might Get You Fired October 29, ...  1      Consortium...
3   3  15 Civilians Killed In Single US Airstrike Hav...  Jessica Purkiss     Videos 15 Civilians Killed In Single US Airstr...  1      Jes...
4   4  Iranian woman jailed for fictional unpublished...  Howard Portnoy      Print \nAn Iranian woman has been sentenced to...  1      Howa...

# separating feature and target
X = news_data.drop(columns='label')
Y = news_data['label']

print(X)

          id  ...  content
0          0  ...  Darrell Lucus House Dem Aide: We Didn’t Even S...
1          1  ...  Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2          2  ...  Consortiumnews.com Why the Truth Might Get You...
3          3  ...  Jessica Purkiss 15 Civilians Killed In Single ...
4          4  ...  Howard Portnoy Iranian woman jailed for fictio...
...      ...  ...  ...
20795  20795  ...  Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796  20796  ...  Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797  20797  ...  Michael J. de la Merced and Rachel Abrams Macy...
20798  20798  ...  Alex Ansary NATO, Russia To Hold Parallel Exer...
20799  20799  ...  David Swanson What Keeps the F-35 Alive

[20800 rows x 5 columns]

print(Y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64

Stemming:

Stemming is the process of reducing a word to its root word.

port_stem = PorterStemmer()

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

news_data['content'] = news_data['content'].apply(stemming)

print(news_data['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2        consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
         ...
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799    david swanson keep f aliv
Name: content, Length: 20800, dtype: object

X = news_data['content'].values
Y = news_data['label'].values

print(X)
['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
'daniel j flynn flynn hillari clinton big woman campu breitbart'
'consortiumnew com truth might get fire' ...
'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
'alex ansari nato russia hold parallel exercis balkan'
'david swanson keep f aliv']

print(Y)

[1 0 1 ... 0 1 1]

Y.shape

(20800,)
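
The stemming function in this notebook tests each word against stopwords.words('english'), which returns a list, so every membership test is a linear scan. A minimal sketch of the same cleanup with the stopwords held in a set follows. It uses only the standard library so that it runs on its own: SAMPLE_STOP_WORDS is a hypothetical, deliberately tiny stand-in for the NLTK list, clean_text is an illustrative name that does not appear in the notebook, and the stem argument does nothing by default.

```python
import re

# Hypothetical, deliberately tiny stand-in for set(stopwords.words('english')).
SAMPLE_STOP_WORDS = frozenset({"the", "a", "an", "in", "on", "of", "to", "we"})

# Compiled once; matches every character that is not an ASCII letter.
NON_LETTERS = re.compile(r"[^a-zA-Z]")


def clean_text(content, stop_words=SAMPLE_STOP_WORDS, stem=lambda word: word):
    """Keep letters only, lowercase, drop stopwords, then stem each word."""
    letters_only = NON_LETTERS.sub(" ", content)
    words = letters_only.lower().split()
    # Set membership is a constant-time lookup on average.
    kept = [stem(word) for word in words if word not in stop_words]
    return " ".join(kept)


print(clean_text("15 Civilians Killed In Single US Airstrike"))
# civilians killed single us airstrike
```

In the notebook itself, one would presumably pass stop_words=set(stopwords.words('english')) and stem=port_stem.stem to get stemmed output like the rows printed earlier.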

# converting the textual data to feature vectors


vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
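
To make the weights produced by TfidfVectorizer less opaque, here is a rough standard-library sketch of the calculation scikit-learn documents for its default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalisation of each row. The function name tfidf_rows and the three sample strings are assumptions made for illustration. The sketch splits on whitespace, whereas the vectorizer's default tokenizer also drops single-character tokens, so results on real data can differ.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf, as documented for smooth_idf=True.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        # An empty document has no terms and therefore an empty row.
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


# Hypothetical sample documents built from stems seen in this notebook.
sample = ["flynn hillari clinton", "hillari clinton campu", "clinton truth"]
for row in tfidf_rows(sample):
    print(row)
```

In the first sample document, 'flynn' appears in one document, 'hillari' in two and 'clinton' in all three, so the weights decrease in that order: terms that are rarer across the corpus count for more.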

print(X)

(0, 15686) 0.28485063562728646
(0, 13473) 0.2565896679337957
(0, 8909) 0.3635963806326075
(0, 8630) 0.29212514087043684
(0, 7692) 0.24785219520671603
(0, 7005) 0.21874169089359144
(0, 4973) 0.233316966909351
(0, 3792) 0.2705332480845492
(0, 3600) 0.3598939188262559
(0, 2959) 0.2468450128533713
(0, 2483) 0.3676519686797209
(0, 267) 0.27010124977708766
(1, 16799) 0.30071745655510157
(1, 6816) 0.1904660198296849
(1, 5503) 0.7143299355715573
(1, 3568) 0.26373768806048464
(1, 2813) 0.19094574062359204
(1, 2223) 0.3827320386859759
(1, 1894) 0.15521974226349364
(1, 1497) 0.2939891562094648
(2, 15611) 0.41544962664721613
(2, 9620) 0.49351492943649944
(2, 5968) 0.3474613386728292
(2, 5389) 0.3866530551182615
(2, 3103) 0.46097489583229645
: :
(20797, 13122) 0.2482526352197606
(20797, 12344) 0.27263457663336677
(20797, 12138) 0.24778257724396507
(20797, 10306) 0.08038079000566466
(20797, 9588) 0.174553480255222
(20797, 9518) 0.2954204003420313
(20797, 8988) 0.36160868928090795
(20797, 8364) 0.22322585870464118
(20797, 7042) 0.21799048897828688
(20797, 3643) 0.21155500613623743
(20797, 1287) 0.33538056804139865
(20797, 699) 0.30685846079762347
(20797, 43) 0.29710241860700626
(20798, 13046) 0.22363267488270608
(20798, 11052) 0.4460515589182236
(20798, 10177) 0.3192496370187028
(20798, 6889) 0.32496285694299426
(20798, 5032) 0.4083701450239529
(20798, 1125) 0.4460515589182236
(20798, 588) 0.3112141524638974
(20798, 350) 0.28446937819072576
(20799, 14852) 0.5677577267055112
(20799, 8036) 0.45983893273780013
(20799, 3623) 0.37927626273066584
(20799, 377) 0.5677577267055112
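
Each printed line has the form (row, column) weight: the row is the article's index, the column is the index of a term in the fitted vocabulary, and only non-zero entries are listed. As a hedged illustration, the sketch below parses the four entries printed for row 20799 and checks that their squares sum to roughly 1, which is what one would expect if rows are L2-normalised (the TfidfVectorizer default). parse_sparse_print is a hypothetical helper written for this example and is not part of scikit-learn.

```python
import re
from collections import defaultdict

# Matches lines such as "(20799, 377) 0.5677577267055112".
ENTRY = re.compile(r"\((\d+),\s*(\d+)\)\s+([0-9.]+(?:[eE][+-]?\d+)?)")


def parse_sparse_print(text):
    """Group printed (row, column) weight entries into {row: {column: weight}}."""
    rows = defaultdict(dict)
    for match in ENTRY.finditer(text):
        row, col, value = match.groups()
        rows[int(row)][int(col)] = float(value)
    return dict(rows)


# The last four lines of the output printed above.
printed = """
(20799, 14852) 0.5677577267055112
(20799, 8036) 0.45983893273780013
(20799, 3623) 0.37927626273066584
(20799, 377) 0.5677577267055112
"""

rows = parse_sparse_print(printed)
squared_norm = sum(weight * weight for weight in rows[20799].values())
print(len(rows[20799]), round(squared_norm, 6))
```

Row 20799 has four entries although its text, 'david swanson keep f aliv', has five words. This is consistent with the vectorizer's default token pattern, which ignores single-character tokens such as 'f'.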
