
spam-sms-detection-2

March 2, 2024

This notebook implements a bag-of-words approach to SMS spam detection: it loads an SMS dataset, preprocesses the text by removing punctuation and tokenizing it, applies CountVectorizer to turn messages into word-count feature vectors, splits the data into training and test sets, and trains and evaluates a Multinomial Naive Bayes classifier.

[1]: import numpy as np
import pandas as pd

Loading the dataset

[4]: df_sms = pd.read_csv('spam.csv', encoding='latin-1')
df_sms.head()

[4]: v1 v2 Unnamed: 2 \
0 ham Go until jurong point, crazy.. Available only … NaN
1 ham Ok lar… Joking wif u oni… NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina… NaN
3 ham U dun say so early hor… U c already then say… NaN
4 ham Nah I don't think he goes to usf, he lives aro… NaN

Unnamed: 3 Unnamed: 4
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

Dropping the unwanted columns Unnamed: 2, Unnamed: 3, and Unnamed: 4, and renaming v1/v2 to label/sms


[5]: df = df_sms.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1": "label", "v2": "sms"})

[6]: df.head()

[6]: label sms
0 ham Go until jurong point, crazy.. Available only …
1 ham Ok lar… Joking wif u oni…
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…

[8]: # Number of SMS messages in the dataset
print(len(df))

5572

[9]: df_sms.tail()

[9]: v1 v2 Unnamed: 2 \
5567 spam This is the 2nd time we have tried 2 contact u… NaN
5568 ham Will Ì_ b going to esplanade fr home? NaN
5569 ham Pity, * was in mood for that. So…any other s… NaN
5570 ham The guy did some bitching but I acted like i'd… NaN
5571 ham Rofl. Its true to its name NaN

Unnamed: 3 Unnamed: 4
5567 NaN NaN
5568 NaN NaN
5569 NaN NaN
5570 NaN NaN
5571 NaN NaN

[10]: # Number of observations in each label (ham vs. spam)
df.label.value_counts()

[10]: ham 4825
spam 747
Name: label, dtype: int64
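Note the class imbalance: ham outnumbers spam by roughly 6.5 to 1, so a classifier that always predicts ham would already reach about 86.6% accuracy (4825/5572). This is why precision, recall, and F1 are reported alongside accuracy at the end of the notebook. A quick check of that majority-class baseline:

df.label.value_counts(normalize=True)  # ham ≈ 0.866, spam ≈ 0.134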

[12]: df_sms.describe()

[12]: v1 v2 \
count 5572 5572
unique 2 5169
top ham Sorry, I'll call later
freq 4825 30

Unnamed: 2 \
count 50
unique 43
top bt not his girlfrnd… G o o d n i g h t . . .@"
freq 3

Unnamed: 3 Unnamed: 4
count 12 6
unique 10 5
top MK17 92H. 450Ppw 16" GNT:-)"
freq 2 2

[15]: df['length'] = df['sms'].apply(len)
df.head()

[15]: label sms length
0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

[16]: import matplotlib.pyplot as plt

df['length'].plot(bins=50, kind='hist')

[16]: <AxesSubplot: ylabel='Frequency'>

(Figure: histogram of SMS message lengths for the full dataset.)
[17]: df.hist(column='length', by='label', bins=50, figsize=(10,4))

[17]: array([<AxesSubplot: title={'center': 'ham'}>,
       <AxesSubplot: title={'center': 'spam'}>], dtype=object)

(Figure: side-by-side length histograms for ham and spam messages.)
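In this dataset spam messages tend to run longer than ham (many sit near the 160-character SMS limit), a pattern the two histograms make easy to see. A quick numeric check of that tendency (a small sketch; run it before the labels are recoded below):

df.groupby('label')['length'].mean()  # mean message length per class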
[18]: df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df.shape)
df.head()

(5572, 3)

[18]: label sms length
0 0 Go until jurong point, crazy.. Available only … 111
1 0 Ok lar… Joking wif u oni… 29
2 1 Free entry in 2 a wkly comp to win FA Cup fina… 155
3 0 U dun say so early hor… U c already then say… 49
4 0 Nah I don't think he goes to usf, he lives aro… 61

Bag of Words Approach

The bag-of-words model represents each document as a vector of word counts, ignoring word order and grammar. The implementation below walks through the approach by hand in four steps before using scikit-learn.

Step 1: Convert all strings to lower case.
[19]: documents = ['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']

lower_case_documents = [d.lower() for d in documents]
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello,
call hello you tomorrow?']

Step 2: Remove all punctuation.

[20]: import string

sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(
        i.translate(str.maketrans("", "", string.punctuation)))

sans_punctuation_documents

[20]: ['hello how are you',
'win money win from home',
'call me now',
'hello call hello you tomorrow']

Step 3: Tokenization

[21]: preprocessed_documents = [d.split() for d in sans_punctuation_documents]
preprocessed_documents

[21]: [['hello', 'how', 'are', 'you'],
['win', 'money', 'win', 'from', 'home'],
['call', 'me', 'now'],
['hello', 'call', 'hello', 'you', 'tomorrow']]

Step 4: Count frequencies

[22]: import pprint
from collections import Counter

frequency_list = [Counter(d) for d in preprocessed_documents]
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
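Each Counter above is one row of a bag-of-words matrix. To make the connection to CountVectorizer explicit, the counters can be assembled into a dense document-term matrix over a sorted vocabulary (a minimal sketch; vocabulary and doc_term_matrix are names introduced here for illustration):

vocabulary = sorted({w for d in preprocessed_documents for w in d})
doc_term_matrix = [[counts[w] for w in vocabulary] for counts in frequency_list]
print(vocabulary)
print(doc_term_matrix)

Because a Counter returns 0 for missing words, this reproduces, row for row, the array that CountVectorizer builds below.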
Implementing Bag of Words in scikit-learn

documents = ['Hello, how are you!', 'Win money, win from home.', 'Call me now.', 'Hello, Call hello you tomorrow?']

[23]: from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
count_vector.fit(documents)
feature_names = count_vector.get_feature_names_out()
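A quick way to see how the matrix columns map to words is to print the learned vocabulary, which CountVectorizer (lowercasing and tokenizing by default) stores in sorted order:

print(feature_names)

For this toy corpus that is ['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money', 'now', 'tomorrow', 'win', 'you'], which fixes the column order of the count arrays below.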

[24]: doc_array = count_vector.transform(documents).toarray()
doc_array

[24]: array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

[25]: # Build a DataFrame of document-term counts, with the learned
# vocabulary words as the column names
frequency_matrix = pd.DataFrame(doc_array,
                                columns=count_vector.get_feature_names_out())

# 'frequency_matrix' can now be inspected or used for further processing

[26]: from sklearn.model_selection import train_test_split

# Split the messages and labels into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df['sms'],
                                                    df['label'],
                                                    test_size=0.20,
                                                    random_state=1)
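With test_size=0.20 on 5,572 messages, this leaves roughly 4,457 messages for training and 1,115 for testing. A quick sanity check of the split sizes:

print(X_train.shape[0], X_test.shape[0])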

[28]: # Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the vectorizer on the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the test data using the vocabulary learned from the training data
testing_data = count_vector.transform(X_test)

Multinomial Naive Bayes algorithm
[29]: from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

[29]: MultinomialNB()

[30]: predictions = naive_bayes.predict(testing_data)

[31]: from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))

Accuracy score: 0.9847533632286996
Precision score: 0.9420289855072463
Recall score: 0.935251798561151
F1 score: 0.9386281588447652
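To score new messages with the trained model, transform them with the same fitted vectorizer before predicting (a minimal sketch; the two example texts are invented for illustration):

new_sms = ['WINNER!! Claim your free prize now', 'Are we still meeting for lunch?']
new_counts = count_vector.transform(new_sms)
print(naive_bayes.predict(new_counts))  # array of 0/1 labels, where 1 = spam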
