Building an AI-Powered SMS Spam Classifier

Introduction

The upsurge in the volume of unwanted emails, called spam, has created an intense need for more dependable and robust anti-spam filters. Any promotional message or advertisement that ends up in our inbox can be categorised as spam: it provides no value and often irritates us.

Overview of the Dataset used


We will make use of the SMS spam classification dataset.

The SMS Spam Collection is a set of SMS messages that have been collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

The data was obtained from UCI's Machine Learning Repository; alternatively, I have also uploaded the dataset and the completed Jupyter notebook to my GitHub repo.

In this article, we'll discuss:

Data processing

Import the required packages
Load the dataset
Remove the unwanted data columns
Preprocess and explore the dataset
Build word clouds to see which words appear in spam and which in ham
Remove the stop words and punctuation
Convert the text data into vectors

Building an SMS spam classification model

Split the data into train and test sets
Use sklearn's built-in classifiers to build the models
Train the models on the data
Make predictions on new data

Import the required packages

%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold

Please note! You might find that I re-import some of these packages later in the article; that is just for ease of use if I ever reuse those code blocks in future projects. You may omit the repeats.
Loading the Dataset

data = pd.read_csv('dataset/spam.csv', encoding='latin-1')
data.head()

     v1                                                 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only...        NaN        NaN        NaN
1   ham                      Ok lar... Joking wif u oni...        NaN        NaN        NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...       NaN        NaN        NaN
3   ham  U dun say so early hor... U c already then say...       NaN        NaN        NaN
4   ham  Nah I don't think he goes to usf, he lives aro...       NaN        NaN        NaN

Removing unwanted columns


From the above output, we can see that there are some unnamed columns, and the label and text column names (v1 and v2) are not intuitive, so let's fix those in this step.

data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1": "label", "v2": "text"})
data[1990:2000]

     label                                              text
1990   ham   HI DARLIN IVE JUST GOT BACK AND HAD A REALLY...
1991   ham  No other Valentines huh? The proof is on your...
1992  spam  Free tones Hope you enjoyed your new content...
1993   ham                Eh den sat u book e kb liao huh...
1994   ham            Have you been practising your curtsey?
1995   ham                         Shall i come to get pickle
1996   ham                  Lol boo I was hoping for a laugh
1997   ham                    YEH I AM DEF UP4 SOMETHING SAT
1998   ham   Well, I have to leave for my class babe.. Yo...
1999   ham  LMAO where's your fish memory when I need it?...

Now that the data is looking pretty, let's move on.

data['label'].value_counts()

# OUTPUT
ham     4825
spam     747
Name: label, dtype: int64
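Notice the imbalance: only 747 of the 5,572 messages (about 13%) are spam. As an optional aside (not part of the original walkthrough), a quick bar plot via the pandas plotting API makes this obvious, and it is worth keeping in mind when we look at accuracy scores later.

# Optional: visualize the ham/spam imbalance (4825 ham vs 747 spam)
data['label'].value_counts().plot(kind='bar', rot=0)
plt.title("Class distribution")
plt.ylabel("Number of messages")
plt.show()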

Preprocessing and Exploring the Dataset


If you are completely new to NLTK and Natural Language Processing (NLP), I would recommend checking out this short article before continuing: Introduction to Word Frequencies in NLP.

# Import nltk packages and Punkt Tokenizer Models


import nltk
nltk.download("punkt")
import warnings
warnings.filterwarnings('ignore')

Build word cloud to see which message is spam and which is not

Ham messages are simply the opposite of spam in this dataset. Yeah, I also don't have any clue why it is called that.

ham_words = ''
spam_words = ''

# Creating a corpus of spam messages
for val in data[data['label'] == 'spam'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        spam_words = spam_words + words + ' '

# Creating a corpus of ham messages
for val in data[data['label'] == 'ham'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        ham_words = ham_words + words + ' '

Let's use the corpora built above to create the spam word cloud and the ham word cloud.

spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)

#Spam Word cloud
plt.figure(figsize=(10, 8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

[Spam word cloud: prominent words include "free", "call", "txt", "claim", "prize", "urgent", "winner", "tone", "ringtone", "contact", "customer service", "landline", "nokia", "150ppm" and "every week".]

#Creating Ham wordcloud
plt.figure(figsize=(10, 8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

From the spam word cloud, we can see that "free" is the word most often used in spam.

Now, we can convert ham and spam into 0 and 1 respectively so that the machine can understand.

data = data.replace(['ham', 'spam'], [0, 1])
data.head(10)

   label                                               text
0      0  Go until jurong point, crazy.. Available only...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
5      1  FreeMsg Hey there darling it's been 3 week's n...
6      0     Even my brother is not like to speak with me...
7      0  As per your request 'Melle Melle (Oru Minnamin...
8      1  WINNER!! As a valued network customer you have...
9      1  Had your mobile 11 months or more? U R entitle...

Removing punctuation and stopwords from the messages


Punctuation and stop words do not contribute anything to our model, so we have to remove them. Using the NLTK library, we can do this easily.

import nltk
nltk.download('stopwords')

#remove the punctuations and stopwords
import string

def text_process(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = [word for word in text.split() if word.lower() not in stopwords.words('english')]
    return " ".join(text)

data['text'] = data['text'].apply(text_process)
data.head()

   label                                               text
0      0  Go jurong point crazy Available bugis n great...
1      0                            Ok lar Joking wif u oni
2      1  Free entry 2 wkly comp win FA Cup final tkts 2...
3      0                 U dun say early hor Uc already say
4      0       Nah dont think goes usf lives around though

Now, create a data frame from the processed data before moving to the next step.

text = pd.DataFrame(data['text'])
label = pd.DataFrame(data['label'])

Converting words to vectors using Count Vectorizer

## Counting how many times a word appears in the dataset

from collections import Counter

total_counts = Counter()
for i in range(len(text)):
    for word in text.values[i][0].split(" "):
        total_counts[word] += 1

print("Total words in data set: ", len(total_counts))

# OUTPUT
Total words in data set:  11305

# Sorting in decreasing order (word with highest frequency appears first)
vocab = sorted(total_counts, key=total_counts.get, reverse=True)
print(vocab[:60])

# OUTPUT
['u', '2', 'call', 'U', 'get', 'Im', 'ur', '4', 'ltgt', 'know', 'go', 'like', ...

# Mapping from words to index
vocab_size = len(vocab)
word2idx = {}
for i, word in enumerate(vocab):
    word2idx[word] = i

# Text to Vector
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        else:
            word_vector[word2idx.get(word)] += 1
    return np.array(word_vector)

# Convert all messages to vectors
word_vectors = np.zeros((len(text), len(vocab)), dtype=np.int_)
for i, (_, text_) in enumerate(text.iterrows()):
    word_vectors[i] = text_to_vector(text_[0])

word_vectors.shape

# OUTPUT
(5572, 11305)
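The loop above builds the bag-of-words matrix by hand. As a cross-check (and to match this section's title), sklearn's CountVectorizer, already imported at the top, produces the same kind of token-count matrix in two lines. This is a sketch, not the article's original code; note that CountVectorizer's default tokenizer drops one-character tokens such as 'u' and '2', so the vocabulary size will not exactly match the manual count of 11,305.

# Equivalent bag-of-words matrix built with sklearn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform(data['text'])
print(count_vectors.shape)  # (5572, <vocabulary size>)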

Converting words to vectors using TF-IDF Vectorizer

#convert the text data into vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape

# OUTPUT
(5572, 9376)
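To sanity-check what TfidfVectorizer produced, you can peek at the highest-weighted terms of a single message. This is an optional aside, not in the original article; get_feature_names_out is the method name in sklearn 1.0+ (older versions call it get_feature_names).

# Show the five highest-TF-IDF terms of the first message
feature_names = vectorizer.get_feature_names_out()
row = vectors[0].toarray().ravel()
for idx in np.argsort(row)[::-1][:5]:
    if row[idx] > 0:
        print(feature_names[idx], round(row[idx], 3))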
Splitting into training and test set

#split the dataset into train and test set
features = vectors  # the TF-IDF vectors from the previous step
X_train, X_test, y_train, y_test = train_test_split(
    features, data['label'],
    test_size=0.15,   # split ratio assumed; the original value was truncated
    random_state=111)
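Since the imports at the top include StratifiedKFold, stratification may have been intended here. As an optional variant (an assumption, not part of the original code), passing stratify to train_test_split keeps the ham/spam ratio identical in the train and test sets, which matters for an imbalanced dataset like this one:

# Optional: a stratified split that preserves the ~13% spam ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    features, data['label'], test_size=0.15,
    random_state=111, stratify=data['label'])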

Classifying using sklearn's pre-built classifiers


In this step, we will use some of the most popular classifiers out there and compare their results.

Classifiers used:

1. Spam classifier using Logistic Regression
2. Spam classifier using Support Vector Machine (SVM)
3. Spam classifier using Naive Bayes
4. Spam classifier using Decision Tree
5. Spam classifier using K-Nearest Neighbors (KNN)
6. Spam classifier using Random Forest

We will make use of the sklearn library. This amazing library has all of the above algorithms; we just have to import them, and it is as easy as that. No need to worry about all the maths and statistics behind them.

#import sklearn packages for building classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

#initialize multiple classification models
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)

#create a dictionary of variables and models
clfs = {'SVC': svc, 'KN': knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}

#fit the data onto the models
def train(clf, features, targets):
    clf.fit(features, targets)

def predict(clf, features):
    return clf.predict(features)

pred_scores_word_vectors = []
for k, v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test, pred)]))
Predictions using the TF-IDF Vectorizer

pred_scores_word_vectors

# OUTPUT
[('SVC', [0.9784688995215312]),
 ('KN', [0.9330143540669856]),
 ('NB', [0.9880382775119617]),
 ('DT', [0.9605263157894737]),
 ('LR', [0.9533492822966507]),
 ('RF', [0.9796650717703349])]
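As a small convenience step (not in the original article), flattening these (name, [score]) tuples into a DataFrame makes the comparison easier to read and sort; Naive Bayes comes out on top here.

# Tabulate and rank the accuracy scores collected above
score_df = pd.DataFrame(
    [(name, scores[0]) for name, scores in pred_scores_word_vectors],
    columns=['classifier', 'accuracy']).sort_values('accuracy', ascending=False)
print(score_df)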

Model predictions

#write a function to detect if a message is spam or not
def find(x):
    if x == 1:
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

newtext = ["Free entry"]
integers = vectorizer.transform(newtext)
x = mnb.predict(integers)
find(x)

# OUTPUT
Message is SPAM
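pickle is imported at the top of the article but never used; presumably it was meant for persisting the trained model. Here is a minimal sketch of that step, with an illustrative file name, saving the fitted vectorizer together with the model since predictions need both:

# Persist the fitted vectorizer and Naive Bayes model together
import pickle

with open('spam_classifier.pkl', 'wb') as f:  # file name is illustrative
    pickle.dump((vectorizer, mnb), f)

# Later: reload the pair and classify a new message
with open('spam_classifier.pkl', 'rb') as f:
    loaded_vectorizer, loaded_model = pickle.load(f)
print(loaded_model.predict(loaded_vectorizer.transform(["Free entry"])))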

Checking Classification Results with Confusion Matrix

If you are confused about the confusion matrix, read this small article before proceeding - The
ultimate guide to confusion matrix in machine learning

from sklearn.metrics import confusion_matrix
import seaborn as sns

# Naive Bayes
y_pred_nb = mnb.predict(X_test)
y_true_nb = y_test
cm = confusion_matrix(y_true_nb, y_pred_nb)
f, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(cm, annot=True, linewidths=0.5, linecolor="red", fmt=".0f", ax=ax)
plt.xlabel("y_pred_nb")
plt.ylabel("y_true_nb")
plt.show()
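Because the dataset is about 87% ham, accuracy alone can flatter a classifier, so it helps to unpack the matrix into precision and recall. The sketch below assumes the standard 2x2 layout confusion_matrix returns for labels 0 (ham) and 1 (spam): rows are true labels, columns are predictions.

# Derive precision and recall for the spam class from the confusion matrix
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)  # of messages flagged as spam, how many really were
recall = tp / (tp + fn)     # of actual spam messages, how many we caught
print(f"precision: {precision:.3f}, recall: {recall:.3f}")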
