Introduction
The upsurge in the volume of unwanted emails, called spam, has created an
intense need for more dependable and robust anti-spam filters. Any
promotional message or advertisement that ends up in our inbox can be
categorised as spam, since it provides no value and often irritates us.
The SMS Spam Collection is a set of SMS messages tagged for SMS spam
research. It contains 5,574 SMS messages in English, each tagged as ham
(legitimate) or spam.
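To follow along, first load the dataset into a pandas DataFrame. A minimal
loading sketch, assuming the common spam.csv distribution of the corpus (the
file name, encoding, and column layout are assumptions; adjust them to your
copy):
import pandas as pd
# Load the SMS Spam Collection; the commonly distributed CSV ships with
# extra empty columns and a latin-1 encoding (assumptions about your copy)
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data.iloc[:, :2]            # keep only the label and message columns
data.columns = ['label', 'text']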
data['label'].value_counts()
# OUTPUT
ham 4825
spam 747
Name: label, dtype: int64
Preprocessing and Exploring the Dataset
If you are completely new to NLTK and Natural Language Processing (NLP),
I would recommend checking out this short article before
continuing: Introduction to Word Frequencies in NLP
Let's use the above functions to create the spam word cloud and the ham
word cloud (a minimal sketch follows below).
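The word-cloud helpers themselves aren't shown in this excerpt; here is a
minimal sketch using the wordcloud package (the package choice and plotting
details are assumptions, not necessarily the author's exact code):
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(messages, title):
    # Join all messages of one class into a single string and render it
    wc = WordCloud(width=600, height=400, background_color='white')
    wc.generate(' '.join(messages))
    plt.figure(figsize=(6, 4))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

show_wordcloud(data[data['label'] == 'spam']['text'], 'Spam word cloud')
show_wordcloud(data[data['label'] == 'ham']['text'], 'Ham word cloud')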
From the spam word cloud, we can see that "free" is among the words most
often used in spam.
Now, we can convert the labels ham and spam into 0 and 1 respectively so
that the machine can understand them.
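A minimal sketch of the label encoding, together with a typical
text_process helper (the helper applied in the next cell isn't defined in
this excerpt; stripping punctuation and English stopwords is the usual
approach, so treat this implementation as an assumption):
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # also downloaded below; cached after the first run

# Encode the labels: ham -> 0, spam -> 1
# (spam = 1 matches the prediction helper used later in the article)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

stop_words = set(stopwords.words('english'))

def text_process(text):
    # Assumed implementation: strip punctuation, then drop English stopwords
    nopunc = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(word for word in nopunc.split()
                    if word.lower() not in stop_words)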
import nltk
nltk.download('stopwords')
data['text'] = data['text'].apply(text_process)
data.head()
Now, create a data frame from the processed data before moving to the
next step.
text = pd.DataFrame(data['text'])
label = pd.DataFrame(data['label'])
TF-IDF is better than a plain CountVectorizer because it not only captures
the frequency of words in the corpus but also weights words by their
importance. We can then remove the words that matter less for the
analysis, which makes model building less complex by reducing the input
dimensions.
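Concretely, the classic TF-IDF weight of a term t in a document d is the
term frequency scaled by the log of the inverse document frequency:
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}
where N is the total number of messages and df(t) is the number of messages
containing t. Words that appear in almost every message get a weight near
zero, while distinctive words are boosted. (scikit-learn's TfidfVectorizer
applies a smoothed variant of the idf term by default.)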
from collections import Counter

# Count how often each word appears across all processed messages
total_counts = Counter()
for i in range(len(text)):
    for word in text.values[i][0].split(" "):
        total_counts[word] += 1
print("Total words in data set:", len(total_counts))
# OUTPUT
Total words in data set: 11305
# Sorting in decreasing order (word with the highest frequency appears first)
vocab = sorted(total_counts, key=total_counts.get, reverse=True)
print(vocab[:60])
# OUTPUT
['u', '2', 'call', 'U', 'get', 'Im', 'ur', '4', 'ltgt', 'know', 'go',
'like', 'dont', 'come', 'got', 'time', 'day', 'want', 'Ill', 'lor',
'Call', 'home', 'send', 'going', 'one', 'need', 'Ok', 'good', 'love',
'back', 'n', 'still', 'text', 'im', 'later', 'see', 'da', 'ok',
'think', 'Ì', 'free', 'FREE', 'r', 'today', 'Sorry', 'week', 'phone',
'mobile', 'cant', 'tell', 'take', 'much', 'night', 'way', 'Hey',
'reply', 'work', 'make', 'give', 'new']
# Mapping from words to index
vocab_size = len(vocab)
word2idx = {}
for i, word in enumerate(vocab):
    word2idx[word] = i
import numpy as np

# Text to Vector
def text_to_vector(text):
    # Bag-of-words count vector over the vocabulary; unknown words are skipped
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        idx = word2idx.get(word)
        if idx is not None:
            word_vector[idx] += 1
    return word_vector
# Convert all messages to count vectors
word_vectors = np.zeros((len(text), len(vocab)), dtype=np.int_)
for i, (_, text_) in enumerate(text.iterrows()):
    word_vectors[i] = text_to_vector(text_.iloc[0])
word_vectors.shape
# OUTPUT
(5572, 11305)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape
# OUTPUT
(5572, 9376)
# Use the TF-IDF vectors (rather than the raw count vectors) as features
#features = word_vectors
features = vectors
Splitting into training and test set
# Split the dataset into train and test sets (15% held out for testing)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, data['label'], test_size=0.15, random_state=111)
Classifiers used:
1. Spam classifier using Logistic Regression
2. Spam classifier using Support Vector Machine (SVM)
3. Spam classifier using Naive Bayes
4. Spam classifier using Decision Tree
5. Spam classifier using K-Nearest Neighbors (KNN)
6. Spam classifier using Random Forest
We will make use of the scikit-learn library. It implements all of the
above algorithms; we just have to import them, and it is as easy as that.
There is no need to worry about the maths and statistics behind them.
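The training loop that produces the scores below isn't shown in this
excerpt; here is a minimal sketch that matches the output format. The
abbreviations follow the output, while the hyperparameters are illustrative
assumptions:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# One classifier per abbreviation in the output below; hyperparameters
# here are assumptions, not necessarily the author's exact settings
clfs = {
    'SVC': SVC(kernel='sigmoid', gamma=1.0),
    'KN': KNeighborsClassifier(),
    'NB': MultinomialNB(),
    'DT': DecisionTreeClassifier(),
    'LR': LogisticRegression(solver='liblinear'),
    'RF': RandomForestClassifier(n_estimators=31),
}

pred_scores = []
for name, clf in clfs.items():
    clf.fit(X_train, y_train)
    pred_scores.append((name, [accuracy_score(y_test, clf.predict(X_test))]))

mnb = clfs['NB']  # the Naive Bayes model is reused for the predictions below
pred_scores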
# OUTPUT
[('SVC', [0.9784688995215312]),
('KN', [0.9330143540669856]),
('NB', [0.9880382775119617]),
('DT', [0.9605263157894737]),
('LR', [0.9533492822966507]),
('RF', [0.9796650717703349])]
Model predictions
# Helper to report whether a predicted label is spam (1) or ham (0)
def find(x):
    if x == 1:
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

newtext = ["Free entry"]
integers = vectorizer.transform(newtext)  # vectorize with the fitted TF-IDF
x = mnb.predict(integers)
find(x[0])  # predict returns an array; take the single prediction
# OUTPUT
Message is SPAM
Checking Classification Results with a Confusion Matrix
If you are confused by the confusion matrix, read this small article
before proceeding.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Naive Bayes
y_pred_nb = mnb.predict(X_test)
y_true_nb = y_test
cm = confusion_matrix(y_true_nb, y_pred_nb)
f, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(cm, annot=True, linewidths=0.5, linecolor="red", fmt=".0f", ax=ax)
plt.xlabel("y_pred_nb")
plt.ylabel("y_true_nb")
plt.show()
[Output: confusion-matrix heatmap for the Naive Bayes classifier]