SMS Spam Prediction
In [13]: print(len(df))
5572
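The data-loading cell is not shown above; a hedged sketch of that step (the file name `spam.csv`, the latin-1 encoding, and the column names are assumptions based on the common Kaggle export of the UCI SMS Spam Collection):

```python
import pandas as pd

def load_sms(path="spam.csv"):
    """Load the SMS spam CSV and keep only the label and message columns."""
    df = pd.read_csv(path, encoding="latin-1")
    df = df.iloc[:, :2]              # drop any unnamed trailing junk columns
    df.columns = ["label", "sms"]    # column names assumed from later cells
    return df
```

In the notebook, `df = load_sms()` would then reproduce the `len(df)` output above.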
In [15]: df.label.value_counts()
Out[15]: label
ham     4825
spam     747
Name: count, dtype: int64
In [17]: df.duplicated().sum()
Out[17]: 403
In [19]: df=df.drop_duplicates(keep='first')
In [21]: df.duplicated().sum()
Out[21]: 0
In [23]: df.describe()
          label   sms
count      5169  5169
unique        2  5169
freq       4516     1
(5169, 3)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
In [43]: df
In [47]: df[['num_characters','num_words','num_sentences']].describe()
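The cells deriving `num_characters`, `num_words`, and `num_sentences` are not shown; the notebook presumably used nltk's `word_tokenize`/`sent_tokenize`, for which plain `str` methods are a rough dependency-free stand-in:

```python
import pandas as pd

df = pd.DataFrame({"sms": ["Hello there. How are you?", "WIN cash now!"]})

# Approximate character, word, and sentence counts per message.
df["num_characters"] = df["sms"].str.len()
df["num_words"] = df["sms"].str.split().str.len()
df["num_sentences"] = df["sms"].str.count(r"[.!?]+").clip(lower=1)
print(df[["num_characters", "num_words", "num_sentences"]])
```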
In [57]: plt.figure(figsize=(12,6))
sns.histplot(df[df['label'] == 0]['num_characters'])
sns.histplot(df[df['label'] == 1]['num_characters'],color='red')
In [61]: plt.figure(figsize=(12,6))
sns.histplot(df[df['label'] == 0]['num_words'])
sns.histplot(df[df['label'] == 1]['num_words'],color='red')
In [63]: sns.pairplot(df,hue='label')
Data Preprocessing
Lower case
Tokenization
Removing non-alphanumeric tokens
Removing stop words and punctuation
Stemming
In [69]: df['sms'][10]
Out[69]: "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today."
Out[71]: 'love'
Out[73]: True
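The two orphaned outputs above (`'love'` and `True`) presumably sanity-check the preprocessing helpers: Porter stemming and the `isalnum` filter. A minimal sketch, with the exact inputs assumed:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
print(ps.stem("loving"))      # Porter stemming collapses inflected forms
print("dancing".isalnum())    # True only for purely alphanumeric tokens
```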
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

def transform_text(text):
    text = word_tokenize(text.lower())  # lower-case + tokenize
    y = []
    for i in text:                      # keep alphanumeric tokens only
        if i.isalnum():
            y.append(i)
    text = y[:]
    y.clear()
    for i in text:                      # drop stop words and punctuation
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)
    text = y[:]
    y.clear()
    for i in text:                      # stem what remains
        y.append(ps.stem(i))
    return " ".join(y)
In [79]: df

[DataFrame preview, rows 0–4 and 5567–5570: columns label, sms, num_characters, num_words, num_sentences, transformed_text. E.g. row 0: ham, "Go until jurong point, crazy.. Available only ...", 111, 24, 2, "go jurong point crazi avail ..."]
In [90]: plt.figure(figsize=(15,6))
plt.imshow(spam_wc)
In [92]: df.head()

[DataFrame preview, rows 0–4: columns label, sms, num_characters, num_words, num_sentences, transformed_text. E.g. row 0: ham, "Go until jurong point, crazy.. Available only ...", 111, 24, 2, "go jurong point crazi avail bugi n great world ..."]
In [94]: spam_corpus = []
for msg in df[df['label'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)
In [96]: len(spam_corpus)
Out[96]: 9939
In [102]: ham_corpus = []
for msg in df[df['label'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)
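With both corpora built, the usual next step is frequency counting (e.g. to feed the word cloud or a top-words bar chart). A sketch with a toy stand-in for `transformed_text`:

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the notebook's DataFrame (label: 1 = spam, 0 = ham).
df = pd.DataFrame({
    "label": [1, 1, 0],
    "transformed_text": ["free win prize", "free call", "ok lar joke"],
})

spam_corpus = []
for msg in df[df["label"] == 1]["transformed_text"].tolist():
    for word in msg.split():
        spam_corpus.append(word)

print(len(spam_corpus))
print(Counter(spam_corpus).most_common(2))
```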
Model Building
In [111]: from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)
In [113]: X = tfidf.fit_transform(df['transformed_text']).toarray()
In [115]: X.shape
In [117]: from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB

y = df['label'].values
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Initialize GaussianNB
gnb = GaussianNB()
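The three score blocks that follow are consistent with fitting several naive Bayes variants on a train/test split. A hedged sketch of that evaluation loop, where the synthetic `X` stands in for the real 3000-feature TF-IDF matrix and the `random_state` is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Synthetic stand-in for X = tfidf.fit_transform(...).toarray() and y.
rng = np.random.default_rng(2)
X = rng.random((200, 20))
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(type(clf).__name__)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, zero_division=0))
```

On the real data, precision matters more than accuracy here because a spam filter should almost never flag a legitimate (ham) message.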
Accuracy: 0.8694390715667312
Confusion Matrix:
[[788 108]
[ 27 111]]
Precision: 0.5068493150684932
Accuracy: 0.9709864603481625
Confusion Matrix:
[[896 0]
[ 30 108]]
Precision: 1.0
Accuracy: 0.9835589941972921
Confusion Matrix:
[[895 1]
[ 16 122]]
Precision: 0.991869918699187
Accuracy 0.9816247582205029
Precision 0.9917355371900827
Completed