
Report Minor Project PDF

Uploaded by

Supratibh Saikia
Copyright
© © All Rights Reserved

Classification of Spam Email

Using Support Vector Machine


A Minor Project Report Submitted in Partial Fulfillment of
Requirements for the Degree of

Bachelor of Technology in Information Technology

by
Deep Upadhyaya(BT/IT/1612)
Kangkan Jyoti Baishya(BT/IT/1626)
Biswadeep Saikia(BT/IT/1606)
Kaushik Hajong(BT/IT/1627)

Under the Supervision of


Dr. Debdatta Kandar (Associate Professor)
IT Department

Department of Information Technology


School of Technology
North-Eastern Hill University, Shillong
December 2019
Abstract

Spam emails are increasing day by day and cause many problems, such as the
risk of missing important mail, wasted storage space, fraud and malware
attacks. We have therefore designed a model that predicts whether a received
mail is spam or non-spam, so that appropriate action can be taken. For the
classification we have used the Support Vector Machine (SVM), an algorithm
well suited to text classification. In this project we consider the body of
the email, from which we extract the words it contains, excluding stop words,
and train our model on them. When the body of an email is fed to the model as
a string, the model predicts whether it is spam or non-spam. The work went
through the phases of pre-processing, feature extraction, and training and
testing on the emails. In testing, our model misclassified 31 emails out of
the total test set. We used Python 3 as the programming language for
implementing the model.

Acknowledgements

This project would not have been possible without the kind support and
help of many individuals and organizations, and we extend our sincere
thanks to all of them. We are highly indebted to our project guide,
Dr. Debdatta Kandar (Associate Professor), for his guidance and constant
supervision, for providing the necessary information regarding the project,
and for his support in completing it. We also extend our gratitude to all
the teachers of the Department of Information Technology, NORTH EASTERN
HILL UNIVERSITY, for their kind co-operation and encouragement, which helped
us complete this project. We express our special thanks to our project
co-ordinator, Smt. Sangita Neog (Associate Professor). Above all we thank our
family and friends, who extended their help in the completion of our project
work. Our thanks and appreciation also go to all our team members in
developing the project and to the people who willingly helped us with
their abilities.

Declaration

This is to certify that we have properly cited any material taken from other
sources and have obtained permission for any copyrighted material included
in this report. We take full responsibility for any code submitted as part of
this project and the contents of this report.

Deep Upadhyaya(BT/IT/1612)

Kangkan Jyoti Baishya(BT/IT/1626)

Biswadeep Saikia(BT/IT/1606)

Kaushik Hajong(BT/IT/1627)

Certificate

This is to certify that Deep Upadhyaya (BT/IT/1612), Kangkan Jyoti Baishya
(BT/IT/1626), Biswadeep Saikia (BT/IT/1606) and Kaushik Hajong (BT/IT/1627)
worked on the project Classification of Spam Email Using Support Vector
Machine from August to November 2019 and have successfully completed
the minor/major project, in partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology in Information
Technology, under my supervision and guidance.

Dr. Debdatta Kandar


(Associate Professor)
Department of Information Technology
North-Eastern Hill University
Shillong-793022, Meghalaya, India

Certificate

This is to certify that Deep Upadhyaya (BT/IT/1612), Kangkan Jyoti Baishya
(BT/IT/1626), Biswadeep Saikia (BT/IT/1606) and Kaushik Hajong (BT/IT/1627)
worked on the project Classification of Spam Email Using Support
Vector Machine from August to November 2019 and have successfully
completed the minor/major project, in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in
Information Technology.

External Examiner Head

Department of Information Technology


North-Eastern Hill University
Shillong-793022, Meghalaya, India

Contents
Abstract
Declaration
Certificate from the guide
Certificate from the head

1 Introduction
1.1 What is Support Vector Machine?
1.2 Where SVM is useful?
1.3 Project Objective

2 Spam Emails
2.1 What are spam emails and problems caused by it
2.2 Why spam emails need classification

3 Background Study
3.1 Need for classification
3.2 Different approaches for filtering
3.3 Support Vector Machine
3.4 Algorithm
3.5 Detailed study of SVM
3.6 Advantages and disadvantages of SVM
3.6.1 Advantages
3.6.2 Disadvantages
3.7 SVM v/s other classifiers

4 Process of implementation
4.1 Collection of Dataset
4.2 Exploring Dataset
4.3 Pre-processing of Data
4.4 Text Processing
4.5 Splitting of training and test data
4.6 Training and testing of data
4.7 Predicting new emails

5 Implementation in python
5.1 Step-by-step implementation with code
5.1.1 Step 1
5.1.2 Step 2
5.1.3 Step 3
5.1.4 Step 4
5.1.5 Step 5
5.1.6 Step 6
5.1.7 Step 7
5.1.8 Step 8
5.1.9 Step 9
5.1.10 Step 10

6 Results
6.1 Calculated scores of the model

7 Conclusion
7.1 Conclusion
7.2 Future Work

8 References

List of figures

3.1 Support Vector Machine
3.2 Relative rank with other classifiers
3.3 Difference in mean accuracy with other classifiers
5.1 Data set
5.2 Bar-graph of email count
5.3 Pie chart of email count
5.4 Frequency of occurrence of words
5.5 Features in the form of sparse matrix
5.6 Training and testing data scores
5.7 Best Index
5.8 Misclassification count
5.9 Predicting new emails


CHAPTER 1

INTRODUCTION

1.1 What is Support Vector Machine?

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyperplane. It is a supervised learning model with associated learning
algorithms that analyze data for classification and regression analysis. The goal
of the SVM is to train a model that assigns new, unseen objects to a particular
category. It achieves this by creating a linear partition of the feature space into two
categories. Based on the features of a new unseen object (e.g. a document or email), it
places the object "above" or "below" the separating plane, leading to a categorization
(e.g. spam or non-spam). This makes it an example of a non-probabilistic linear
classifier. It is non-probabilistic because the features of a new object fully
determine its location in feature space and there is no stochastic element involved.
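
As an illustration of placing an object "above" or "below" the separating plane, the following minimal sketch evaluates the sign of w·x + b for a word-count vector. The vocabulary, the weight vector w and the bias b are hypothetical values chosen for the example; a trained SVM would learn them from data.

```python
import numpy as np

# Hypothetical 3-word vocabulary and parameters (not taken from the project).
vocabulary = ["free", "meeting", "winner"]
w = np.array([1.5, -2.0, 1.0])   # assumed normal vector of the hyperplane
b = -0.5                         # assumed bias term


def classify(word_counts):
    """Return 'spam' if the point lies on the positive side of the hyperplane."""
    score = float(np.dot(w, word_counts) + b)
    return "spam" if score > 0 else "non-spam"


label_a = classify(np.array([2, 0, 1]))  # "free" twice, "winner" once
label_b = classify(np.array([0, 1, 0]))  # only "meeting"
print(label_a, label_b)
```

The output depends only on the feature vector, which is what makes the classifier non-probabilistic.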

1.2 Where SVM is useful?

SVM is a supervised learning algorithm. The aim of using SVM is to
correctly classify unseen data. SVMs have a number of applications in several fields.
Some common applications of SVM are:
• Face detection – SVMs classify parts of the image as face and non-face and
create a square boundary around the face.
• Classification of images – SVMs provide better search accuracy for
image classification than traditional query-based searching techniques.
• Spam email classification – Emails are classified as spam or non-spam
according to the words seen in the training data.
• Bioinformatics – This includes protein classification and cancer classification.
SVMs are used for classifying genes, classifying patients on the basis of their
genes, and other biological problems.
• Protein fold and remote homology detection – SVM algorithms are applied to
protein remote homology detection.
• Handwriting recognition – SVMs are widely used to recognize handwritten
characters.

1.3 Project Objective

The objective of our project is to design a model which can predict, by reading the
body of a new incoming mail, whether it is spam or not. The prediction is based on
training and testing with collected emails that are labeled as spam or non-spam.
If a mail is found to be spam, we can simply exclude it from the inbox.

CHAPTER 2

SPAM EMAILS

2.1 What are spam emails and problems caused by it?

Email spam, also known as junk email, consists of unsolicited bulk messages sent
through email. Email spam comes in various forms, the most common being the
promotion of outright scams or marginally legitimate business schemes. Spam is
typically used to promote access to inexpensive pharmaceutical drugs, weight loss
programs, online degrees, job opportunities and online gambling.

Here are some of the problems caused by spam emails:

• Communications overload.
• Waste of time.
• Irritation and discontent.
• Criminalization of spam.
• Loss of important and urgent emails.

2.2 Why spam emails need classification?

Spam has grown steadily over the last decade and is a problem faced by most email
users. The addresses of users who receive spam are usually harvested by spam bots
(automated software that crawls the internet for email addresses).

Email spam is still a problem today, and spammers still approach it the
same way. Spam accounts for billions of emails sent every day and makes up a large
share of all email traffic. Spam costs businesses billions of dollars every year.

Even though antivirus software has come a long way, PCs infected with Trojans
and bots are still the major sources of spam. There are billions of public IP
addresses available for use; each one could have thousands of PCs behind it,
including machines potentially infected with Trojans and bots. Spammers also use
spam to commit email fraud. Fraudulent spam mostly comes in the form of phishing
emails that imitate formal communication from banks or other online payment
processors. Phishing emails are crafted to direct victims to a malicious website
that imitates a genuine organization, where the user ends up sharing personal
information such as login credentials and financial details with the spammer who
controls the website.

CHAPTER 3

BACKGROUND STUDY

3.1 Need for classification


The upsurge in the volume of unwanted emails, called spam, has created an intense
need for more dependable and robust anti-spam filters. Machine learning methods
have recently been used successfully to detect and filter spam emails.

In recent times, unwanted commercial bulk email has become a huge problem on the
internet. The person sending the spam messages is referred to as the spammer.
Such a person gathers email addresses from different websites, chat rooms and
viruses. Spam prevents the user from making full and good use of time, storage
capacity and network bandwidth. The huge volume of spam flowing through computer
networks has destructive effects on the memory space of email servers,
communication bandwidth, CPU power and user time.

To handle the threat posed by email spam, leading email providers such as Gmail,
Yahoo Mail and Outlook combine different machine learning (ML) techniques, such as
neural networks, in their spam filters. These techniques learn to identify spam and
phishing messages by analyzing large numbers of such messages across a vast
collection of computers. Since machine learning can adapt to varying conditions,
the Gmail and Yahoo Mail spam filters do more than check junk emails against
pre-existing rules: they generate new rules themselves, based on what they have
learnt, as they continue filtering. The machine learning model used by Google has
reportedly advanced to the point that it detects and filters out spam and phishing
emails with about 99.9 percent accuracy, which implies that about one message in a
thousand evades the filter. Statistics from Google revealed that between 50 and 70
percent of the emails Gmail receives are unsolicited. Google's detection models
also incorporate Google Safe Browsing, a tool for identifying websites with
malicious URLs. Google has further improved its phishing detection by introducing a
system that delays the delivery of some Gmail messages for a short while so that
suspicious messages can be scrutinized more thoroughly, since phishing messages are
easier to detect when they are analyzed collectively. The delay allows a deeper
examination as more messages arrive and the algorithms are updated in real time.
Only about 0.05 percent of emails are affected by this deliberate delay.

3.2 Different approaches for filtering
Though several email spam filtering methods exist, only the state-of-the-art
approaches are discussed in this report. We explain below the different
categories of spam filtering techniques that have been widely applied to overcome
the problem of email spam.

Content Based Filtering Technique: Content based filtering is usually used
to create automatic filtering rules and to classify emails using machine learning
approaches such as Naïve Bayesian classification, Support Vector Machine, K
Nearest Neighbor and Neural Networks. This method normally analyses the words, and
the occurrence and distribution of words and phrases, in the content of emails and
then uses the generated rules to filter incoming spam.

Case Based Spam Filtering Method: Case based or sample based filtering is one
of the popular spam filtering methods. First, all emails, both non-spam and spam,
are extracted from each user's mailbox using a collection model. Subsequently,
pre-processing steps are carried out to transform the emails: a client interface,
feature extraction and selection, grouping of the email data, and evaluation of the
process. The data is then classified into two vector sets. Lastly, a machine
learning algorithm is trained and tested on these datasets to decide whether
incoming mails are spam or non-spam.

Heuristic or Rule Based Spam Filtering Technique: This approach uses
already created rules or heuristics to assess a huge number of patterns, which are
usually regular expressions, against a chosen message. Each matching pattern
increases the score of the message, while patterns that do not match deduct from
the score. Any message whose score surpasses a specific threshold is
filtered as spam; otherwise it is counted as valid. While some ranking rules do not
change over time, other rules require constant updating to cope effectively with
the menace of spammers, who continuously introduce new spam messages that can
easily escape email filters without being noticed. A good example of a rule based
spam filter is SpamAssassin.
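
The scoring idea can be sketched in a few lines of Python. The rules, their scores and the threshold below are hypothetical values chosen for illustration and are not taken from any real filter; as a simplification, evidence of legitimate mail is modelled as a rule with a negative score rather than as a deduction for every non-matching pattern.

```python
import re

# Hypothetical rules: (regular expression, score). Real filters use far larger rule sets.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 2.5),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
    (re.compile(r"\bmeeting\b", re.IGNORECASE), -1.0),  # hints at a legitimate mail
]
THRESHOLD = 3.0  # assumed cut-off score


def rule_score(message):
    """Sum the scores of all rules whose pattern occurs in the message."""
    return sum(score for pattern, score in RULES if pattern.search(message))


def is_spam(message):
    """A message is filtered as spam when its score surpasses the threshold."""
    return rule_score(message) > THRESHOLD


spam_flag = is_spam("You are a WINNER! Click here for your free prize")
ham_flag = is_spam("Agenda for the project meeting tomorrow")
print(spam_flag, ham_flag)
```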

Previous Likeness Based Spam Filtering Technique: This approach uses
memory-based, or instance-based, machine learning methods to classify incoming
emails according to their resemblance to stored examples (e.g. training emails).
The attributes of the email are used to create a multi-dimensional vector space, in
which new instances are plotted as points. Each new instance is then assigned to
the most common class among its k closest training instances. This approach uses
the k-nearest neighbor (KNN) algorithm for filtering spam emails.
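
A minimal sketch of this idea with scikit-learn follows. The six training messages, the labels (1 for spam, 0 for non-spam) and the choice k = 3 are assumptions made for the example; they are not part of the project's data set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy training emails (hypothetical); 1 = spam, 0 = non-spam.
train_texts = [
    "win a free prize now",
    "free cash winner claim now",
    "claim your free prize",
    "meeting at noon tomorrow",
    "lunch tomorrow with the team",
    "project meeting notes attached",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Each email becomes a point in a multi-dimensional word-count space.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# A new email gets the most common label among its 3 closest training emails.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, train_labels)

new_texts = ["free prize winner", "team meeting tomorrow"]
predictions = knn.predict(vectorizer.transform(new_texts))
print(list(predictions))
```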

Adaptive Spam Filtering Technique: This method detects and filters spam by
grouping emails into different classes. It divides an email corpus into various
groups, each of which has a representative text. Each incoming email is compared
with each group, and a percentage of similarity is produced to decide the
probable group the email belongs to.

3.3 Support Vector Machine


Support Vector Machines (SVMs) are supervised learning algorithms that have been
shown to perform better than some other learning algorithms. SVM is a
group of algorithms proposed for solving classification and regression problems.
Training an SVM amounts to solving a quadratic programming problem with linear
equality and inequality constraints, and the result separates the groups by means
of a hyperplane that maximizes the margin between them. Though the SVM
might not be as fast as other classification methods, the algorithm draws its
strength from its high accuracy, owing to its capacity to model multidimensional
decision boundaries that are not linear. SVM is also fairly resistant to
overfitting, the situation where a model is disproportionately complex, for example
having many parameters relative to the number of observations. These qualities make
SVM well suited to areas such as digital handwriting recognition, text
categorization and speaker recognition. We briefly describe the binary C-SVM
classifier. Here C denotes the cost parameter, which controls overfitting (the
modeling error that arises when a function is fit too closely to a limited set of
data points) by penalizing the slack error ξ. During training there is, in
principle, one combination of the parameters (C, γ) that produces the best SVM
classifier for a given training set. A grid search over C and γ is the technique
usually applied in SVM training to find this combination. k-fold cross-validation
(also called rotation estimation) is used within the grid search to choose the SVM
classifier with the best cross-validated accuracy.

3.4 Algorithm
The SVM training and classification procedure for spam emails is presented in the
algorithm below:

Input: sample email message x to classify; a training set S; a kernel function;
       candidate values {C1, C2, …, Cnum} and {γ1, γ2, …, γnum};
       number of cross-validation folds k.

for i = 1 to num
    set C = Ci
    for j = 1 to num
        set γ = γj
        produce a trained SVM classifier f(x) using the current parameter pair (C, γ)
        if f(x) is the first produced discriminant function then
            keep f(x) as the best SVM classifier f*(x)
        else
            compare f(x) with the current best classifier f*(x) using k-fold
            cross-validation
            keep the classifier with the better accuracy
        end if
    end for
end for
return the classification of x by f*(x) (spam / non-spam email)
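
The loop over (C, γ) pairs can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the data comes from `make_classification` as a stand-in for email features, and the candidate values for C and γ, the RBF kernel and k = 5 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the email feature matrix and labels (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

c_values = [0.1, 1, 10, 100]       # assumed candidates for C
gamma_values = [0.001, 0.01, 0.1]  # assumed candidates for gamma
k = 5                              # number of cross-validation folds

best_score, best_params = -1.0, None
for C in c_values:
    for gamma in gamma_values:
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        # Mean accuracy over the k folds for this parameter pair.
        score = cross_val_score(clf, X, y, cv=k).mean()
        if score > best_score:
            best_score, best_params = score, (C, gamma)

# Refit the best pair on all training data; this plays the role of f*(x).
best_clf = SVC(kernel="rbf", C=best_params[0], gamma=best_params[1]).fit(X, y)
print(best_params, round(best_score, 3))
```

scikit-learn's `GridSearchCV` wraps the same search in a single object; the explicit loop is used here only to mirror the pseudocode.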

3.5 Detailed study of SVM


SVM often offers high accuracy compared to other classifiers such as logistic
regression and decision trees. It is known for its kernel trick, which handles
nonlinear input spaces. It is used in a variety of applications such as face
detection, intrusion detection, classification of emails, news articles and web
pages, classification of genes, and handwriting recognition.

SVM is an exciting algorithm and its concepts are relatively simple. The
classifier separates data points using a hyperplane with the largest possible
margin. That is why an SVM classifier is also known as a discriminative classifier.
SVM finds an optimal hyperplane, which helps in classifying new data points.
Support Vector Machines are generally considered a classification approach, but
they can be employed in both classification and regression problems. They can
easily handle multiple continuous and categorical variables. SVM constructs a
hyperplane in multidimensional space to separate the different classes. It
generates the optimal hyperplane iteratively, minimizing an error at each step.
The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best
divides the dataset into classes.

Fig 3.1 Support Vector Machine.

Support Vectors

Support vectors are the data points that are closest to the hyperplane. These
points define the separating line, because the margin is calculated from them.
They are therefore the points most relevant to the construction of the classifier.

Hyper-plane

A hyperplane is a decision plane which separates a set of objects having
different class memberships.

Margin

A margin is the gap between the two lines through the closest points of each class.
It is calculated as the perpendicular distance from the separating line to the
support vectors or closest points. A larger margin between the classes is
considered a good margin; a smaller margin is a bad margin.
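
For a linear SVM whose support vectors satisfy w·x + b = ±1, the width of the margin is 2/‖w‖. The sketch below checks this on four hand-picked 2-D points (an assumption made for the example); the very large C is used only to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first axis: class 0 at x1 = 0, class 1 at x1 = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                     # normal vector of the separating hyperplane
margin = 2.0 / np.linalg.norm(w)     # distance between the two margin lines
support_vectors = clf.support_vectors_

print("margin width:", margin)
print("support vectors:", support_vectors)
```

The two classes lie two units apart, so the computed margin should be close to 2.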

3.6 Advantages and Disadvantages of Support Vector Machine (SVM)

3.6.1 Advantages
1. Regularization capabilities: SVM has an L2 regularization feature, so it has good
generalization capabilities which prevent it from over-fitting.
2. Handles non-linear data efficiently: SVM can efficiently handle non-linear data
using the kernel trick.
3. Solves both classification and regression problems: The SVM classifier is used
for classification problems, while SVR (Support Vector Regression) is used for
regression problems.
4. Stability: A small change to the data does not greatly affect the hyperplane and
hence the SVM. So the SVM model is stable.
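
The effect of the kernel trick mentioned in point 2 can be seen on data that no straight line separates. The following sketch is an illustration only: it uses scikit-learn's synthetic `make_circles` data (one class inside a ring formed by the other) and compares a linear kernel with an RBF kernel, both with default parameters.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```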

3.6.2 Disadvantages
1. Choosing an appropriate kernel function is difficult: Choosing an appropriate
kernel function (to handle non-linear data) is not an easy task. It can be tricky
and complex. With a high-dimensional kernel, the model might generate too many
support vectors, which reduces the training speed drastically.
2. Extensive memory requirement: The algorithmic complexity and memory requirements
of SVM are very high. A lot of memory is needed because all the support vectors
have to be stored in memory, and their number grows rapidly with the size of the
training dataset.
3. Requires feature scaling: Features should be scaled before applying SVM.
4. Long training time: SVM takes a long time to train on large datasets.
5. Difficult to interpret: Unlike decision trees, an SVM model is difficult for
human beings to understand and interpret.

3.7 SVM v/s other classifiers

Fig 3.2 Relative rank with other classifiers

Fig 3.3 Difference in mean accuracy with other classifiers

CHAPTER 4

PROCESS OF IMPLEMENTATION

4.1 Collection of dataset


We collected our data set from GitHub. The platform supports a variety of dataset
publication formats, but publishers are strongly encouraged to share their data
in an accessible, non-proprietary format where possible. Open, accessible data
formats are not only better supported on the platform, they are also easier to work
with for more people regardless of their tools. The simplest and best-supported
file type available is comma-separated values (CSV) for tabular data.
Data Set source link:
https://drive.google.com/file/d/1okHygWlJNeGxSeisZ9RxtXtartBwcNN/view?usp=sharing

4.2 Exploring Dataset


The whole point of exploring the data set is to take a step back and look at it
before doing anything with it. This is as important as any other part of a data
project, because real datasets are messy and many things can go wrong. If we
do not know our data well enough, how will we know where to look for
sources of error or confusion? Below are some of the things we kept in mind while
exploring the data set so that we could use it properly for our experiment.

1. Was the data imported correctly?


We inspected the data set for anything that might cause an error while importing
the data to our machine. We also checked that the label column, which states
whether an email body is "ham" (non-spam) or "spam", is filled in correctly for
every email body in the data set. Each row of our data set consists of
two columns with the headings v1 and v2: v1 identifies whether the email is
"ham" or "spam", and v2 holds the body of the email.

2. Are there missing and unusual values?


Missing values are a nightmare when running the model, because they can
cause many kinds of unexpected errors that lead to faulty output.
So we checked the column containing the identifiers "ham" and "spam" to make sure
that each and every row has its identifier. We also checked whether any email body
is missing from the email body column of our data set.

3. Visualizations.
Visualizing data in various ways can help us see things we may have missed in
the early stages of exploration. Here we calculated the word frequencies, selected
the twenty most common words in our data set and plotted a bar graph of them. We
counted the occurrences of the most common words in both spam and non-spam emails.
While pre-processing the data we excluded the stop words.
Stop Words: In computer search engines, a stop word is a commonly used word
(such as "the") that a search engine has been programmed to ignore, both when
indexing entries for searching and when retrieving them as the result of a search
query.
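
The exploration checks described in this section can be sketched as follows. The four-row DataFrame and the tiny stop-word list are hypothetical stand-ins; only the column names v1 and v2 and the labels "ham" and "spam" come from the data set description above.

```python
from collections import Counter

import pandas as pd

# Hypothetical miniature data set with the same two columns as ours.
data = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": [
        "see you at the meeting",
        "win a free prize",
        "the meeting is at noon",
        "free prize for the winner",
    ],
})

# Checks 1 and 2: missing values and unexpected labels.
missing_per_column = data[["v1", "v2"]].isnull().sum()
unexpected_labels = set(data["v1"]) - {"ham", "spam"}

STOP_WORDS = {"a", "at", "for", "is", "the", "you"}  # tiny illustrative list


def top_words(texts, n=20):
    """Return the n most common words in the texts, ignoring stop words."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOP_WORDS
    )
    return counts.most_common(n)


# Check 3: most frequent words per class, ready to be plotted as a bar graph.
spam_top = top_words(data.loc[data["v1"] == "spam", "v2"])
ham_top = top_words(data.loc[data["v1"] == "ham", "v2"])
print(spam_top)
print(ham_top)
```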

4.3 Pre-Processing of data

A. Data Pre-Processing Steps: In spam filtering, the pre-processing of the textual
information is critical. The main objective of text pre-processing is
to remove data which does not give useful information about the class of the
document. Furthermore, we also want to remove data that is redundant. The most
widely used data cleaning steps in text retrieval tasks are removing stop words and
performing stemming to reduce the vocabulary.

B. Representation of Data: The next main task is the representation of data. This
step is needed because it is very hard to do computations directly on textual
data. The representation should convert the actual statistics of the text into
numbers. Furthermore, it should facilitate the classification task and should be
simple enough to implement. Many term weighting methods exist which calculate the
weight of a term differently, such as Boolean weighting, term frequency, and term
frequency-inverse document frequency (TF-IDF).

C. Classification: In simple terms, classification is the task of learning the
patterns that are present in previously known instances and associating those
patterns with the classes. Later on, when given an unknown instance, the classifier
searches for these patterns and predicts the class based on their absence or
presence.

4.4 Text Processing

Step 1 : Data Preprocessing

• Tokenization: convert sentences to words.
• Removing unnecessary punctuation and tags.
• Removing stop words: frequent words such as "the", "is", etc. that carry
little specific meaning.
• Stemming: words are reduced to a root by removing inflection through
dropping unnecessary characters, usually a suffix.
• Lemmatization: another approach to removing inflection, by determining the
part of speech and utilizing a detailed database of the language.
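
The first four steps above can be sketched in plain Python as follows. The stop-word set and the suffix list are small illustrative assumptions; in particular, `strip_suffix` is a crude stand-in for a real stemmer, and lemmatization is left out because it needs a language database.

```python
import re

STOP_WORDS = {"the", "is", "a", "to", "and", "are", "for"}  # illustrative subset
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemming algorithm


def tokenize(text):
    """Lower-case the text and keep alphabetic runs only (drops punctuation and digits)."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then reduce each remaining word to a rough root."""
    return [strip_suffix(w) for w in tokenize(text) if w not in STOP_WORDS]


tokens = preprocess("The winners are claiming prizes!")
print(tokens)
```

The roots produced this way need not be dictionary words; they only have to map related word forms to the same token.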

Step 2: Feature Extraction


In text processing, words of the text represent discrete, categorical features. So we
encode such data in a way which is ready to be used by the algorithms. The mapping
from textual data to real valued vectors is called feature extraction. One of the
simplest techniques to numerically represent text is Bag of Words.

Bag of Words (BOW): Here we make a list of the unique words in the text corpus,
called the vocabulary. In the first approach, each sentence or document is
represented as a vector in which each word of the vocabulary is marked 1 if present
and 0 if absent. In the second approach, the vector holds the number of times
each word appears in the document. The third approach uses the Term
Frequency-Inverse Document Frequency (TF-IDF) technique. In our model we have used
the second approach, counting the number of times each word appears in a document,
and we use these counts to further process the data and train our model.
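
The three representations can be compared side by side with scikit-learn's vectorizers. This is a sketch on two made-up documents, with default vectorizer settings assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free prize free entry", "meeting at noon"]  # toy documents (assumption)

# Approach 1: 1 if the word is present, 0 if absent.
binary_bow = CountVectorizer(binary=True).fit_transform(docs)

# Approach 2: number of times each word appears (the approach used in our model).
count_vectorizer = CountVectorizer()
count_bow = count_vectorizer.fit_transform(docs)

# Approach 3: TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(docs)

free_col = count_vectorizer.vocabulary_["free"]  # column index of the word "free"
print("binary:", binary_bow[0, free_col])
print("count: ", count_bow[0, free_col])
print("tf-idf:", tfidf[0, free_col])
```

All three results are sparse matrices with one row per document and one column per vocabulary word.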

4.5 Splitting of training and test data


Before we can feed the Support Vector Machine (SVM) classifier with the loaded
data, we must split the full dataset into a training set and a test set.
Training Dataset: The sample of data used to fit the model. The model sees and
learns from this data.
Test Dataset: The sample of data used to test whether the model performs well
and to measure its precision.

While splitting, make sure that the following two conditions are met:
• The portion of the data set sliced off for training should be large enough to
yield statistically meaningful results.
• The test data set should be representative of the data set as a whole. In
other words, we should not pick a test set with different characteristics from the
training set.

We have taken a test size of 0.33, which means that 33% of the total data set is
used for testing.
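
A split with a test size of 0.33 can be sketched as below. The 100 placeholder messages are made up, and `random_state` and `stratify` are assumptions added for the example: the first makes the split reproducible, and the second keeps the share of spam the same in both parts, which helps the test set stay representative.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 messages, every fifth one labelled spam (20 spam, 80 ham).
texts = [f"message {i}" for i in range(100)]
labels = [1 if i % 5 == 0 else 0 for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.33,     # 33% of the data is held out for testing
    random_state=42,    # assumed seed, for reproducibility
    stratify=labels,    # assumed option, keeps class proportions equal
)

print(len(X_train), len(X_test), sum(y_test))
```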

4.6 Training and Testing of data

Training is done by the SVM. In our model we imported SVM from the scikit-learn
library, which makes the task easier; the training function for the training set
takes only 15-20 lines of code. We import the SVM function from the scikit-learn
library and then use the training function to train on the processed input. After
training is done, we proceed to the testing phase. Testing follows the same
procedure as training: a testing function automatically evaluates the model on the
processed test data. The functions used to train and test give as output the score
of the training or testing process, which is stored in a variable and can then be
displayed.
Regarding the C parameter, we trained and tested the data set for values of C from
one to one thousand at different intervals such as ten, fifty and one hundred. This
gives a list of all the C values with their training and testing accuracy, from
which we can select the best for further predictions.
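
The sweep over C can be sketched as follows. This is not the project's code: the twelve toy messages, the step of one hundred, the default RBF kernel of `SVC` and the split parameters are all assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy corpus (hypothetical); 1 = spam, 0 = non-spam.
spam_texts = [
    "win a free prize now", "free cash winner claim now",
    "claim your free prize today", "urgent free offer click now",
    "winner free ticket call now", "cash prize claim free entry",
]
ham_texts = [
    "meeting at noon tomorrow", "lunch tomorrow with the team",
    "project meeting notes attached", "see you at the office",
    "call me after the meeting", "team lunch at the office today",
]
texts = spam_texts + ham_texts
labels = [1] * len(spam_texts) + [0] * len(ham_texts)

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels
)

# Fit the vocabulary on the training emails only, then reuse it for the test emails.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

c_values = list(range(1, 1001, 100))  # C from 1 to 1000 in steps of 100 (assumed)
results = []
for C in c_values:
    model = SVC(C=C).fit(X_train, y_train)
    results.append((C, model.score(X_train, y_train), model.score(X_test, y_test)))

# Index of the C value with the highest test accuracy.
best_index = int(np.argmax([test_acc for _, _, test_acc in results]))
best_c = results[best_index][0]
print(results[best_index])
```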

4.7 Predicting new emails

Once our model is ready, we feed the body of a new spam or non-spam email into it
as a string. The model then performs the data pre-processing and feature extraction
itself and gives a numerical output of 0 or 1. If the output is 0 the email is
non-spam; if it is 1 the email is spam. We successfully predicted the emails that
we gave as input one by one.
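
A minimal sketch of this prediction step is given below. The six training messages, the linear kernel and the helper name `predict_email` are hypothetical and chosen for the example; the point is that the new email must pass through the same fitted vectorizer before it reaches the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy training data (hypothetical); 1 = spam, 0 = non-spam.
train_texts = [
    "win a free prize now",
    "free cash winner claim now",
    "claim your free prize",
    "meeting at noon tomorrow",
    "lunch tomorrow with the team",
    "project meeting notes attached",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
model = SVC(kernel="linear").fit(vectorizer.fit_transform(train_texts), train_labels)


def predict_email(body):
    """Return 1 if the email body is predicted to be spam, otherwise 0."""
    features = vectorizer.transform([body])  # same vocabulary as in training
    return int(model.predict(features)[0])


spam_pred = predict_email("claim your free prize now")
ham_pred = predict_email("notes from the team meeting tomorrow")
print(spam_pred, ham_pred)
```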

CHAPTER 5

IMPLEMENTATION IN PYTHON

5.1 Step-by-step implementation with code

5.1.1
Step 1:-First of all we have imported all the library functions we need for the quality
performance of our spam email classification model.

These are the libraries and functions we have used here are:

1.NumPy: Is the fundamental package for scientific computing with Python. It


contains among other things: like a powerful N-dimensional array object,
sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities.

2. Pandas: One of the most popular Python libraries for data analysis. It provides
highly optimized performance, with back-end source code written in C or Python.

3. Matplotlib.pyplot: A state-based interface to matplotlib. It provides a MATLAB-like
way of plotting. pyplot is mainly intended for interactive plots and simple cases of
programmatic plot generation.

4. Counter from collections: A Counter is a container that stores elements as dictionary
keys and their counts as dictionary values.

5. Scikit-learn: A Python module for machine learning. It offers simple and efficient
tools for data mining and data analysis that are accessible to everybody and reusable in
various contexts. It is built on NumPy, SciPy, and matplotlib, and is open source and
commercially usable under the BSD license.

6. Warnings: Warning messages are typically issued in situations where it is useful to
alert the user of some condition in a program, where that condition (normally) doesn’t
warrant raising an exception and terminating the program. For example, one might
want to issue a warning when a program uses an obsolete module.

Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, metrics, svm
from IPython.display import Image
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

5.1.2
Step 2: Then we imported the data set on which we will train and test our model.

Code:
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head(n=10)

Fig 5.1 Data Set.

5.1.3
Step 3: Using pandas and the function value_counts(), we counted the number of emails of
each class in the data set and then plotted a bar graph showing spam and non-spam
emails. Using the same counts we also plotted a pie chart. Both charts were plotted with
matplotlib.pyplot.

Code:
# Plot a bar graph of the number of emails per class
count_Class = data["v1"].value_counts(sort=True)
count_Class.plot(kind='bar', color=["red", "green"])
plt.title('Bar chart')
plt.show()

# Plot a pie chart of the same counts
count_Class.plot(kind='pie', autopct='%1.0f%%')
plt.title('Pie chart')
plt.ylabel('')
plt.show()

Fig 5.2 Bar-graph of email count.

Fig 5.3 Pie chart of email count.
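
The percentages shown in the pie chart can also be obtained directly as numbers. The sketch below is a minimal illustration that uses a small hypothetical label series in place of the "v1" column; the variable names are assumptions and not part of the project code.

```python
import pandas as pd

# Hypothetical labels standing in for the "v1" column of the data set.
labels = pd.Series(["ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham"])

counts = labels.value_counts()                # absolute number of emails per class
shares = labels.value_counts(normalize=True)  # fraction of emails per class

print(counts)
print(shares.round(3))
```

Looking at the fractions makes any imbalance between the two classes explicit, which is useful when interpreting accuracy later on.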

5.1.4
Step 4: We then used Counter and pandas to count the frequencies of the words that occur
in the bodies of the emails. We selected the twenty most common words for both spam and
non-spam emails and plotted them as bar graphs using matplotlib.pyplot.

Code:
# Count the 20 most frequent words in non-spam ('ham') and in spam emails
count1 = Counter(" ".join(data[data['v1'] == 'ham']["v2"]).split()).most_common(20)
df1 = pd.DataFrame.from_dict(count1)
df1 = df1.rename(columns={0: "words in non-spam", 1: "count"})
count2 = Counter(" ".join(data[data['v1'] == 'spam']["v2"]).split()).most_common(20)
df2 = pd.DataFrame.from_dict(count2)
df2 = df2.rename(columns={0: "words in spam", 1: "count_"})

# Plot the bar graph for non-spam emails
df1.plot.bar(legend=False)
y_pos = np.arange(len(df1["words in non-spam"]))
plt.xticks(y_pos, df1["words in non-spam"])
plt.title('More frequent words in non-spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

# Plot the bar graph for spam emails
df2.plot.bar(legend=False, color='orange')
y_pos = np.arange(len(df2["words in spam"]))
plt.xticks(y_pos, df2["words in spam"])
plt.title('More frequent words in spam messages')
plt.xlabel('words')
plt.ylabel('number')
plt.show()

Fig 5.4 Frequency of occurrence of words.
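
The counts in Step 4 are taken over all words, so common stop words are included. As a hedged variation, the sketch below counts words after lower-casing them and dropping stop words. The three messages and the helper name top_words are hypothetical; ENGLISH_STOP_WORDS is the stop-word list shipped with scikit-learn.

```python
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical messages; not taken from the project data set.
messages = [
    "Free entry to win a prize",
    "Win a free prize today",
    "Are you free to talk",
]


def top_words(texts, n=3):
    """Return the n most common lower-cased words that are not stop words."""
    words = " ".join(texts).lower().split()
    kept = [w for w in words if w not in ENGLISH_STOP_WORDS]
    return Counter(kept).most_common(n)


top = top_words(messages)
print(top)
```

Filtering in this way tends to surface the words that actually distinguish the two classes, at the cost of hiding how often function words occur.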

5.1.5
Step 5: We then extracted the features from the data using
feature_extraction.text.CountVectorizer(), excluding the stop words (defined earlier).
Calling fit_transform() learns the vocabulary from the email bodies and converts them
into a sparse matrix of word counts.

Code:
f = feature_extraction.text.CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])

Fig 5.5 Features in the form of sparse matrix.
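
To make the sparse matrix of Fig 5.5 more concrete, the following minimal sketch applies CountVectorizer to two hypothetical sentences and prints the learned vocabulary together with the dense form of the count matrix. The sentences and variable names are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical email bodies.
docs = ["free prize free entry", "see you at the meeting"]

vectorizer = CountVectorizer(stop_words="english")
X_demo = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each kept word to its column index in the matrix.
print(sorted(vectorizer.vocabulary_))
print(X_demo.toarray())

free_column = vectorizer.vocabulary_["free"]
print(X_demo[0, free_column])  # how often "free" occurs in the first document
```

Each column corresponds to one word of the vocabulary and each entry is the number of times that word occurs in the document, with stop words such as "the" left out.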

5.1.6
Step 6: We then split the processed data into training and test sets using the function
model_selection.train_test_split().

Code:
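# Encode the labels numerically: spam = 1, non-spam (ham) = 0
data["v1"] = data["v1"].map({'spam': 1, 'ham': 0})
# Hold out 33% of the emails for testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, data['v1'], test_size=0.33, random_state=42)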
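
The effect of test_size and random_state can be seen on a small synthetic example. The sketch below also passes stratify, which is an optional addition that is not used in the project code; it keeps the share of spam roughly equal in both parts. All data and names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 13 of them labelled spam (1).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 87 + [1] * 13)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.33,    # fraction of samples held out for testing
    random_state=42,   # fixed seed, so the split is reproducible
    stratify=y_demo,   # assumption: keep the class ratio similar in both parts
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of spam in each part
```

Fixing random_state means that repeated runs produce the same split, so scores obtained for different values of C remain comparable.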

5.1.7
Step 7: We then trained and tested the model using SVM, storing the resulting scores in
their respective variables. For each value of C, the classifier (svm.SVC, stored as svc)
is fitted on the training data with fit() and then evaluated on the training and test
data with score(); test recall and test precision are computed as well. The values of C
range from 50 to 950 in steps of 50, since np.arange(50, 1000, 50) excludes the stop
value.

Code:
list_C = np.arange(50, 1000, 50)  # 50, 100, ..., 950
score_train = np.zeros(len(list_C))
score_test = np.zeros(len(list_C))
recall_test = np.zeros(len(list_C))
precision_test = np.zeros(len(list_C))
count = 0
# Train and evaluate one SVM for each value of C
for C in list_C:
    svc = svm.SVC(C=C)
    svc.fit(X_train, y_train)
    score_train[count] = svc.score(X_train, y_train)
    score_test[count] = svc.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, svc.predict(X_test))
    precision_test[count] = metrics.precision_score(y_test, svc.predict(X_test))
    count = count + 1
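
The loop above compares values of C by their scores on the test set. A possible alternative, shown here only as a sketch and not as the method used in this project, is to choose C by cross-validation on the training data and to keep the test set for a final check. The synthetic data from make_classification and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the project uses the CountVectorizer features instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

candidate_C = [1, 10, 100, 1000]

# 5-fold cross-validation on the training part only.
search = GridSearchCV(SVC(), {"C": candidate_C}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_C = search.best_params_["C"]
test_accuracy = search.score(X_te, y_te)  # evaluated once, with the best C
print(best_C, round(test_accuracy, 3))
```

Selecting C this way means the reported test accuracy is measured on emails that played no part in choosing the parameter.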

5.1.8
Step 8: We then displayed the training accuracy, test accuracy, test recall and test
precision for every value of C. We also identified the value of C for which the model
performs best: among the models with a test precision of 1, the one with the highest
test accuracy.

Code:
# Display the scores for every value of C
matrix = np.matrix(np.c_[list_C, score_train, score_test, recall_test, precision_test])
models = pd.DataFrame(data=matrix, columns=
    ['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=20)
# Among the models with test precision 1, pick the one with the highest test accuracy
best_index = models[models['Test Precision'] == 1]['Test Accuracy'].idxmax()
# Refit the classifier with the selected value of C
svc = svm.SVC(C=list_C[best_index])
svc.fit(X_train, y_train)
models.iloc[best_index, :]

Fig 5.6 Training and Test scores.

Fig 5.7 Best index.

5.1.9
Step 9: Using the function metrics.confusion_matrix() we counted how many emails were
misclassified during this process. The model misclassified only 31 emails of the test
set, which indicates that it is efficient.

Code:
# Rows are the actual classes, columns are the predicted classes
m_confusion_test = metrics.confusion_matrix(y_test, svc.predict(X_test))
pd.DataFrame(data=m_confusion_test,
             columns=['Predicted Non-Spam', 'Predicted Spam'],
             index=['Actual Non-Spam', 'Actual Spam'])

Fig 5.8 Misclassification count.
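
The four cells of a confusion matrix such as the one in Fig 5.8 can be unpacked into named counts, which makes the number of misclassified emails explicit. The labels below are hypothetical and serve only to illustrate the layout returned by confusion_matrix().

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = spam, 0 = non-spam).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0]

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

misclassified = fp + fn  # non-spam flagged as spam + spam that slipped through
print(tn, fp, fn, tp, misclassified)
```

For a spam filter the two kinds of error are not equally costly: a false positive hides a legitimate email, whereas a false negative only lets one spam email through.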

5.1.10
Step 10: Predicting a new email. Here we supply the body of an email as a string
(mytest) and use our model to predict whether it is spam or not.

Code:
mytest = "you have won a lottery of $2000. to claim it reply to this email "
Y = [mytest]  # mytest is a new email in string format
f = feature_extraction.text.CountVectorizer(stop_words='english')
f.fit(data["v2"])  # learn the same vocabulary that was used for training
X = f.transform(Y)  # map the new email onto that vocabulary
res = svc.predict(X)
if res[0] == 0:
    print("This is a non spam email")
else:
    print("This is a spam email")

Fig 5.9 Predicting new emails.

CHAPTER 6

RESULTS
6.1 Calculated scores of the model
After calculating the training and testing scores for various values of C, we found that
our model works best for C = 650. The results for this case are as follows:
C 650.000000
Train Accuracy 0.996518
Test Accuracy 0.983143
Test Recall 0.876984
Test Precision 1.000000

This is the best case for our model. We also found that the test accuracy reaches 1 as
the value of C is increased further. We did not consider those values, however, as they
may still lead to misclassification.

Using C = 650, the model correctly classified 1587 non-spam emails as non-spam and 221
spam emails as spam, while 31 spam emails were wrongly classified as non-spam. We can
therefore conclude that the model is efficient, as it misclassified very few emails.
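
The reported scores can be checked against these counts with a short calculation. In the sketch below, the number of non-spam emails classified as spam is taken to be 0; this is an inference from the reported test precision of 1.0 rather than a figure stated in the text.

```python
# Counts reported for C = 650 on the test set.
tn = 1587  # non-spam emails correctly classified as non-spam
tp = 221   # spam emails correctly classified as spam
fn = 31    # spam emails wrongly classified as non-spam
fp = 0     # assumption: implied by the reported test precision of 1.0

total = tn + tp + fn + fp
accuracy = (tp + tn) / total  # share of all test emails classified correctly
recall = tp / (tp + fn)       # share of spam emails that were caught
precision = tp / (tp + fp)    # share of predicted spam that really was spam

print(total, round(accuracy, 6), round(recall, 6), precision)
```

The calculation reproduces the reported test accuracy of 0.983143 and test recall of 0.876984, so the counts and the scores are consistent with each other.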

CHAPTER 7

CONCLUSION
7.1 Conclusion

With the advance of new technologies and investment possibilities, the statistical or
machine learning methods, once reserved exclusively to the professional financial
institutions, can be also beneficial to the amateur investors.

The method of support vector machines as an alternative to the conservative logistic
regression models was studied and its performance compared on the real credit data
sets. Especially in combination with the non-linear kernel, SVM proved itself as a
competitive approach and provided a slight edge on top of the logistic regression
model.

In our work we have used SVM as the main classification algorithm for the
implementation of spam email classification. We chose SVM over other algorithms because
it performs well in text classification, and our input is the text part of the email.

7.2 Future Work


As stated above, we have worked on the classification of emails as spam or non-spam.
This classification could be made more accurate if other algorithms were also used for
the same task. Using these other algorithms, the definition of spam could also be
personalized for groups of users.

REFERENCES

1. https://www.sciencedirect.com/science/article/pii/S2405844018353404

2. https://github.com/topics/spam-classification

3. https://github.com/nishi1612/Email-Spam-Classification-using-SVM

4. https://github.com/ishmav16/Email-Classification-Spam-or-Ham

5. https://towardsdatascience.com/spam-classifier-in-python-from-scratch-27a98ddd8e73

6. https://www.kdnuggets.com/2017/03/email-spam-filtering-an-implementation-with-python-and-scikit-learn.html

7. https://hackernoon.com/a-simple-spam-classifier-193a23666570

8. http://www.computerscijournal.org/vol10no3/a-theoretical-comparative-analysis-of-classification-techniques-in-spam-mail-filtering/

9. http://svm.michalhaltuf.cz

10. https://link.springer.com/article/10.1007/s10462-010-9166-x
