
spam-sms-detection-2

March 2, 2024

This notebook implements a bag-of-words approach to SMS spam detection: it loads an SMS dataset, preprocesses the text by removing punctuation and tokenizing it, applies CountVectorizer to turn messages into word-count feature vectors, splits the data into training and test sets, and trains and evaluates a Multinomial Naive Bayes classifier.

[1]: import numpy as np
import pandas as pd

Loading the dataset

[4]: df_sms = pd.read_csv('spam.csv', encoding='latin-1')
df_sms.head()

[4]: v1 v2 Unnamed: 2 \
0 ham Go until jurong point, crazy.. Available only … NaN
1 ham Ok lar… Joking wif u oni… NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina… NaN
3 ham U dun say so early hor… U c already then say… NaN
4 ham Nah I don't think he goes to usf, he lives aro… NaN

Unnamed: 3 Unnamed: 4
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

Dropping the unwanted columns Unnamed: 2, Unnamed: 3, and Unnamed: 4, and renaming v1/v2 to label/sms


[5]: df = df_sms.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1": "label", "v2": "sms"})

[6]: df.head()

[6]: label sms
0 ham Go until jurong point, crazy.. Available only …
1 ham Ok lar… Joking wif u oni…
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…

[8]: # Number of SMS messages in the dataset
print(len(df))

5572

[9]: df_sms.tail()

[9]: v1 v2 Unnamed: 2 \
5567 spam This is the 2nd time we have tried 2 contact u… NaN
5568 ham Will Ì_ b going to esplanade fr home? NaN
5569 ham Pity, * was in mood for that. So…any other s… NaN
5570 ham The guy did some bitching but I acted like i'd… NaN
5571 ham Rofl. Its true to its name NaN

Unnamed: 3 Unnamed: 4
5567 NaN NaN
5568 NaN NaN
5569 NaN NaN
5570 NaN NaN
5571 NaN NaN

[10]: # Number of observations in each label (ham vs. spam)
df.label.value_counts()

[10]: ham 4825
spam 747
Name: label, dtype: int64
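Note the class imbalance: ham outnumbers spam by roughly 6.5 to 1, so a classifier that always predicts ham would already reach about 86.6% accuracy (4825/5572). This is why precision, recall, and F1 are reported alongside accuracy at the end of the notebook. A quick check of that majority-class baseline:

df.label.value_counts(normalize=True)  # ham ≈ 0.866, spam ≈ 0.134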

[12]: df_sms.describe()

[12]: v1 v2 \
count 5572 5572
unique 2 5169
top ham Sorry, I'll call later
freq 4825 30

Unnamed: 2 \
count 50
unique 43
top bt not his girlfrnd… G o o d n i g h t . . .@"
freq 3

Unnamed: 3 Unnamed: 4
count 12 6
unique 10 5
top MK17 92H. 450Ppw 16" GNT:-)"
freq 2 2

[15]: df['length'] = df['sms'].apply(len)
df.head()

[15]: label sms length
0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

[16]: import matplotlib.pyplot as plt

df['length'].plot(bins=50, kind='hist')

[16]: <AxesSubplot: ylabel='Frequency'>

(Figure: histogram of SMS message lengths for the full dataset.)
[17]: df.hist(column='length', by='label', bins=50, figsize=(10,4))

[17]: array([<AxesSubplot: title={'center': 'ham'}>,
       <AxesSubplot: title={'center': 'spam'}>], dtype=object)

(Figure: side-by-side length histograms for ham and spam messages.)
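In this dataset spam messages tend to run longer than ham (many sit near the 160-character SMS limit), a pattern the two histograms make easy to see. A quick numeric check of that tendency (a small sketch; run it before the labels are recoded below):

df.groupby('label')['length'].mean()  # mean message length per class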
[18]: df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df.shape)
df.head()

(5572, 3)

[18]: label sms length
0 0 Go until jurong point, crazy.. Available only … 111
1 0 Ok lar… Joking wif u oni… 29
2 1 Free entry in 2 a wkly comp to win FA Cup fina… 155
3 0 U dun say so early hor… U c already then say… 49
4 0 Nah I don't think he goes to usf, he lives aro… 61

Bag of Words Approach

The bag-of-words model represents each document as a vector of word counts, ignoring word order and grammar. The implementation below walks through the approach by hand in four steps before using scikit-learn.

Step 1: Convert all strings to lower case.
[19]: documents = ['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']

lower_case_documents = [d.lower() for d in documents]
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello,
call hello you tomorrow?']

Step 2: Remove all punctuation.

[20]: import string

sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(
        i.translate(str.maketrans("", "", string.punctuation)))

sans_punctuation_documents

[20]: ['hello how are you',
'win money win from home',
'call me now',
'hello call hello you tomorrow']

Step 3: Tokenization

[21]: preprocessed_documents = [d.split() for d in sans_punctuation_documents]
preprocessed_documents

[21]: [['hello', 'how', 'are', 'you'],
['win', 'money', 'win', 'from', 'home'],
['call', 'me', 'now'],
['hello', 'call', 'hello', 'you', 'tomorrow']]

Step 4: Count frequencies

[22]: import pprint
from collections import Counter

frequency_list = [Counter(d) for d in preprocessed_documents]
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
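Each Counter above is one row of a bag-of-words matrix. To make the connection to CountVectorizer explicit, the counters can be assembled into a dense document-term matrix over a sorted vocabulary (a minimal sketch; vocabulary and doc_term_matrix are names introduced here for illustration):

vocabulary = sorted({w for d in preprocessed_documents for w in d})
doc_term_matrix = [[counts[w] for w in vocabulary] for counts in frequency_list]
print(vocabulary)
print(doc_term_matrix)

Because a Counter returns 0 for missing words, this reproduces, row for row, the array that CountVectorizer builds below.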
Implementing Bag of Words in scikit-learn

documents = ['Hello, how are you!', 'Win money, win from home.', 'Call me now.', 'Hello, Call hello you tomorrow?']

[23]: from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
count_vector.fit(documents)
feature_names = count_vector.get_feature_names_out()
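A quick way to see how the matrix columns map to words is to print the learned vocabulary, which CountVectorizer (lowercasing and tokenizing by default) stores in sorted order:

print(feature_names)

For this toy corpus that is ['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money', 'now', 'tomorrow', 'win', 'you'], which fixes the column order of the count arrays below.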

[24]: doc_array = count_vector.transform(documents).toarray()
doc_array

[24]: array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

[25]: # Build a DataFrame of document-term counts, with the learned
# vocabulary words as the column names
frequency_matrix = pd.DataFrame(doc_array,
                                columns=count_vector.get_feature_names_out())

# 'frequency_matrix' can now be inspected or used for further processing

[26]: from sklearn.model_selection import train_test_split

# Split the messages and labels into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df['sms'],
                                                    df['label'],
                                                    test_size=0.20,
                                                    random_state=1)
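With test_size=0.20 on 5,572 messages, this leaves roughly 4,457 messages for training and 1,115 for testing. A quick sanity check of the split sizes:

print(X_train.shape[0], X_test.shape[0])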

[28]: # Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the vectorizer on the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the test data using the vocabulary learned from the training data
testing_data = count_vector.transform(X_test)

Multinomial Naive Bayes algorithm
[29]: from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

[29]: MultinomialNB()

[30]: predictions = naive_bayes.predict(testing_data)

[31]: from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))

Accuracy score: 0.9847533632286996
Precision score: 0.9420289855072463
Recall score: 0.935251798561151
F1 score: 0.9386281588447652
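To score new messages with the trained model, transform them with the same fitted vectorizer before predicting (a minimal sketch; the two example texts are invented for illustration):

new_sms = ['WINNER!! Claim your free prize now', 'Are we still meeting for lunch?']
new_counts = count_vector.transform(new_sms)
print(naive_bayes.predict(new_counts))  # array of 0/1 labels, where 1 = spam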
