Report Minor Project PDF
by
Deep Upadhyaya(BT/IT/1612)
Kangkan Jyoti Baishya(BT/IT/1626)
Biswadeep Saikia(BT/IT/1606)
Kaushik Hajong(BT/IT/1627)
Spam emails are increasing day by day and they cause many problems, such as
the chance of missing important mails, wastage of storage space, fraud and
malware attacks. We have therefore tried to design a model which predicts in
advance whether a received mail is spam or non-spam, so that we can take
action accordingly. For the classification of the mails we have used the
Support Vector Machine, a useful algorithm for classification of text. In
our project we consider the body portion of the email, from which we extract
the words it contains, excluding the stop words, and train our model with
them. As soon as the body of an email is fed to our model in string format,
it predicts whether the email is spam or non-spam. For classification we
have gone through the phases of pre-processing, feature extraction, and then
training and testing on the emails. We found from training and testing that
our model misclassified 31 emails of the total test set of emails. We have
used Python 3 as our programming language for the implementation of our
model.
Acknowledgements
This project would not have been possible without the kind support and
help of many individuals and organizations. We would like to extend our
sincere thanks to all of them. We are highly indebted to our Project
Guide, Dr. Debdatta Kandar (Associate Professor), for his guidance and
constant supervision, for providing necessary information regarding the
project, and for his support in completing it. We would also like to extend
our gratitude to all the teachers of the Department of Information
Technology, NORTH EASTERN HILL UNIVERSITY, for their kind co-operation and
encouragement, which helped us in the completion of this project. We would
like to express our special thanks to our project co-ordinator, Smt. Sangita
Neog (Associate Professor). Above all, we thank our family and friends, who
have extended their help in the completion of our project work. Our thanks
and appreciation also go to all our team members in developing the project
and to the people who have willingly helped us with their abilities.
Declaration
This is to certify that we have properly cited any material taken from other
sources and have obtained permission for any copyrighted material included
in this report. We take full responsibility for any code submitted as part of
this project and the contents of this report.
Deep Upadhyaya(BT/IT/1612)
Biswadeep Saikia(BT/IT/1606)
Kaushik Hajong(BT/IT/1627)
Certificate
Contents
Abstract
Declaration
1 Introduction
INTRODUCTION
• Protein fold and remote homology detection – SVM algorithms are applied
to protein remote homology detection.
• Handwriting recognition – SVMs are widely used to recognize handwritten
characters.
The objective of our project is to design a model which can predict, by reading the
body of a new incoming mail, whether it is a spam email or not. This is done on the
basis of training and testing on collected emails which are labeled as spam or
non-spam. If a mail is found to be spam, we can simply exclude it from the inbox.
CHAPTER 2
SPAM EMAILS
Email spam, also known as junk email, is unsolicited bulk messages sent through
email. Email spam comes in various forms, the most popular being to promote
outright scams or marginally legitimate business schemes. Spam typically is used to
promote access to inexpensive pharmaceutical drugs, weight loss programs, online
degrees, job opportunities and online gambling.
• Communications overload.
• Waste of time.
• Irritation and discontent.
• Criminalization of spam.
• Loss of important and urgent emails.
Spam emails have been growing in volume over the last decade and are a problem
faced by most email users. The email IDs of users who receive spam are usually
obtained by spam bots (automated software that crawls the internet for email
addresses).
Email spam is still a problem today, and spammers still approach it the
same way. Spam accounts for billions of emails sent every day which makes up 98%
of all emails. Spam costs businesses billions of dollars every year.
Even though antivirus software has come a long way, PCs infected with Trojans
and bots are still the major sources of spam. There are billions of public IPs available
for use; each one could have thousands of PCs behind it, including potentially infected
ones. Spammers use spam mails to commit email fraud. Fraudulent spam mostly comes
in the form of phishing emails that look like formal communication from
banks or other online payment processors. Phishing emails are crafted to direct
victims to a malicious website that imitates a genuine organization, where the user
ends up sharing personal information such as login credentials and financial details
with the spammer who controls the malicious website.
CHAPTER 3
BACKGROUND STUDY
In recent times, unwanted commercial bulk emails, called spam, have become a
huge problem on the internet. The person sending the spam messages is referred to as
the spammer. Such a person gathers email addresses from different websites, chat
rooms, and viruses. Spam prevents the user from making full and good use of time,
storage capacity and network bandwidth. The huge volume of spam mails flowing
through computer networks has destructive effects on the memory space of email
servers, communication bandwidth, CPU power and user time.
3.2 Different approaches for filtering
Though there are several email spam filtering methods in existence, only the
state-of-the-art approaches are discussed in this report. We explain below the
different categories of spam filtering techniques that have been widely applied to
overcome the problem of email spam.
Case Base Spam Filtering Method: Case base or sample base filtering is one
of the popular spam filtering methods. Firstly, all emails both non-spam and spam
emails are extracted from each user's email using collection model. Subsequently, pre-
processing steps are carried out to transform the email using client interface, feature
extraction, and selection, grouping of email data, and evaluating the process. The data
is then classified into two vector sets. Lastly, the machine learning algorithm is used
to train datasets and test them to decide whether the incoming mails are spam or non-
spam.
Adaptive Spam Filtering Technique: The method detects and filters spam by
grouping them into different classes. It divides an email corpus into various groups,
each group has an emblematic text. A comparison is made between each incoming
email and each group, and a percentage of similarity is produced to decide the
probable group the email belongs to.
3.4 Algorithm
The SVM training and classification algorithm for spam emails is presented in the
algorithm below:
2: A training set S, a kernel function, {C1, C2, …, Cnum} and {γ1, γ2, …, γnum}.
4: for i = 1 to num
5: set C = Ci;
6: for j = 1 to q
7: set γ = γj;
8: produce a trained SVM classifier f(x) using the current parameter pair (C, γ);
11: else
12: compare classifier f(x) and the current best SVM classifier f∗(x) using k-fold
cross-validation
14: end if
18: end
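The search described by this algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the exact procedure from the source: the toy data from make_classification, the candidate values in C_values and gamma_values, the RBF kernel and k = 5 are all chosen only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the training set S (assumption: not the report's data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

C_values = [0.1, 1, 10, 100]        # illustrative candidates for C
gamma_values = [0.001, 0.01, 0.1]   # illustrative candidates for gamma
k = 5                               # number of cross-validation folds

best_score, best_params = -np.inf, None
for C in C_values:
    for gamma in gamma_values:
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        # Mean k-fold cross-validation accuracy for this (C, gamma) pair.
        score = cross_val_score(clf, X, y, cv=k).mean()
        # Keep the pair only if it beats the current best classifier.
        if score > best_score:
            best_score, best_params = score, (C, gamma)

# Retrain the best classifier on the whole training set.
best_clf = SVC(kernel="rbf", C=best_params[0], gamma=best_params[1]).fit(X, y)
print("best (C, gamma):", best_params, "CV accuracy:", round(best_score, 3))
```

The two nested loops correspond to the loops over C and γ above, and cross_val_score plays the role of the k-fold comparison between the current classifier and the best one found so far.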
SVM is a powerful algorithm and its concepts are relatively simple. The
classifier separates data points using a hyper-plane with the largest possible margin,
which is why an SVM classifier is also known as a discriminative classifier. SVM finds
an optimal hyper-plane which helps in classifying new data points. Support Vector
Machines are generally considered a classification approach, but they can be employed
in both classification and regression problems. They can easily handle multiple
continuous and categorical variables. SVM constructs a hyper-plane in
multidimensional space to separate different classes. SVM generates the optimal
hyper-plane in an iterative manner so as to minimize the error. The core idea of
SVM is to find a maximum marginal hyper-plane (MMH) that best divides the dataset
into classes.
Fig 3.1 Support Vector Machine.
Support Vectors
Support vectors are the data points closest to the hyper-plane. These points
define the separating line through the calculation of margins, so they are the most
relevant points for the construction of the classifier.
Hyper-plane
A hyper-plane is a decision plane which separates a set of objects having
different class memberships.
Margin
A margin is the gap between the two lines through the closest points of each class. It
is calculated as the perpendicular distance from the line to the support vectors or
closest points. A larger margin between the classes is considered a good margin; a
smaller margin is a bad margin.
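To make these three terms concrete, the following minimal sketch fits a linear SVM on six made-up two-dimensional points and reads off the support vectors, the hyper-plane normal and the margin width. The points and the large value of C (used to approximate a hard margin) are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up points for illustration).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]                   # normal vector of the hyper-plane w.x + b = 0
margin = 2.0 / np.linalg.norm(w)   # width of the gap between the two classes
# Perpendicular distance of every point from the hyper-plane.
distances = np.abs(clf.decision_function(X)) / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", round(margin, 3))
print("closest distance to hyper-plane:", round(distances.min(), 3))
```

The closest points lie at half the margin width from the hyper-plane, which is the perpendicular distance described above.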
3.6 Advantages and Disadvantages of Support Vector Machine
(SVM)
3.6.1 Advantages
1. Regularization capabilities: SVM has an L2 regularization feature, so it has good
generalization capabilities which prevent it from over-fitting.
2. Handles non-linear data efficiently: SVM can efficiently handle non-linear data
using Kernel trick.
3. Solves both Classification and Regression problems: SVM can be used to solve
both classification and regression problems. SVM is used for classification problems
while SVR (Support Vector Regression) is used for regression problems.
4. Stability: A small change to the data does not greatly affect the hyper plane and
hence the SVM. So the SVM model is stable.
3.6.2 Disadvantages
1. Choosing an appropriate Kernel function is difficult: Choosing an appropriate
Kernel function (to handle the non-linear data) is not an easy task. It could be tricky
and complex. In case of using a high dimension Kernel, you might generate too many
support vectors which reduce the training speed drastically.
2. Extensive memory requirement: Algorithmic complexity and memory requirements
of SVM are very high. You need a lot of memory since you have to store all the
support vectors in the memory and this number grows abruptly with the training
dataset size.
3. Requires Feature Scaling: One must do feature scaling of variables before applying
SVM.
4. Long training time: SVM takes a long training time on large datasets.
5. Difficult to interpret: SVM model is difficult to understand and interpret by human
beings unlike Decision Trees.
3.7 SVM v/s other classifiers
CHAPTER 4
PROCESS OF IMPLEMENTATION
cause many kinds of unknown errors which may lead to faulty output from the
model. So we have checked the values of the column which contains the identifiers
"ham" and "spam", to confirm that every row carries one of these identifiers. We
have also checked whether any email body is missing in the email body column
of our data set.
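A check of this kind could look like the minimal sketch below. The column names v1 (label) and v2 (email body) follow the spam.csv file used later in this report; the four rows are invented so that the example runs on its own.

```python
import pandas as pd

# Tiny stand-in data set; the rows are invented for illustration.
data = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["See you at lunch", "Win a free prize now", "Call me later", None],
})

valid_labels = {"ham", "spam"}
bad_labels = data[~data["v1"].isin(valid_labels)]   # rows with an unexpected label
missing_body = data[data["v2"].isna()]              # rows with no email body

print("rows with invalid label:", len(bad_labels))
print("rows with missing body:", len(missing_body))

# Keep only the rows that pass both checks.
clean = data[data["v1"].isin(valid_labels) & data["v2"].notna()]
```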
3. Visualizations.
Visualizing data in various ways can help us see things we may have missed in
the early stages of exploration. Here we have calculated the word frequencies and,
based on them, selected the twenty most common words occurring in our data set and
plotted them in a bar graph. We have counted the occurrences of the most
common words in both spam and non-spam emails. While pre-processing the data
we have excluded the stop words.
Stop Words: In computer search engines, a stop word is a commonly used word
(such as "the") that a search engine has been programmed to ignore, both when
indexing entries for searching and when retrieving them as the result of a search
query.
B. Representation of Data: The next main task was the representation of data. The
data representation step is needed because it is very hard to do computations with
textual data. The representation should convert the textual data into numbers that
reveal its actual statistics. Furthermore, it
should facilitate the classification tasks and should be simple enough to implement.
There exist many term weighting methods which calculate the weight of a term
differently, such as Boolean weighting, term frequency, and term frequency-inverse
document frequency (TF-IDF).
Bag of Words (BOW): Here we make a list of the unique words in the text corpus,
called the vocabulary. The first approach is to represent each sentence or document
as a vector in which each word of the vocabulary is marked 1 if present and 0 if
absent. The second approach is to count the number of times
each word appears in a document. The third approach uses the Term Frequency-Inverse
Document Frequency (TF-IDF) technique. In our model we have used the second
approach, counting the number of times each word appears in a document, and
we use this representation to further process and train our model.
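The three approaches can be compared side by side with scikit-learn, as in the minimal sketch below. The three short documents are invented for illustration; CountVectorizer(binary=True), CountVectorizer() and TfidfVectorizer() are used here as stand-ins for the first, second and third approach respectively.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented example documents.
docs = ["free prize free entry", "call me for lunch", "free call now"]

binary_vec = CountVectorizer(binary=True)   # approach 1: 1 if present, else 0
count_vec = CountVectorizer()               # approach 2: number of occurrences
tfidf_vec = TfidfVectorizer()               # approach 3: TF-IDF weights

X_binary = binary_vec.fit_transform(docs).toarray()
X_count = count_vec.fit_transform(docs).toarray()
X_tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column index of the word "free" in the learned vocabulary.
free_col = count_vec.vocabulary_["free"]
print("vocabulary:", sorted(count_vec.vocabulary_))
print("'free' in document 0 -> binary:", X_binary[0, free_col],
      "count:", X_count[0, free_col],
      "tf-idf:", round(X_tfidf[0, free_col], 3))
```

The word "free" occurs twice in the first document, so the binary representation records only its presence while the count representation keeps the frequency.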
While splitting the data, make sure that the following two conditions are met:
• The test set should be large enough to yield
statistically meaningful results.
• The test set should be representative of the data set as a whole. In
other words, we should not pick a test set with different characteristics than the
training set.
We have taken a test size of 0.33, which means that 33% of the whole data set
is used for testing.
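One way to keep the test set representative is a stratified split, which preserves the spam/non-spam proportion in both parts. The sketch below is only an illustration: the stratify argument is an addition that is not used in this project's own code, and the labels are invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 87 non-spam (0) and 13 spam (1) messages.
y = np.array([0] * 87 + [1] * 13)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

print("test size:", len(y_test))
print("spam share overall:", round(y.mean(), 3))
print("spam share in test:", round(y_test.mean(), 3))
```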
Training is done with the SVM. In our model we have imported SVM from
the scikit-learn library, which makes the task easier; the training function (for the
training set of data) needs only about 15-20 lines of code. We import the SVM
function from the scikit-learn library and then use the training function to train on
the processed input. After training is complete,
we proceed to the testing phase of the process. For testing we follow the
same procedure as for training: we use a testing function which
automatically tests the processed test data given as input. The functions used to
train and test the data return the score of the
training or testing process, which is stored in a variable and can then be
displayed.
Talking about the C parameter: we trained and tested the data set for values of C
between one and one thousand, at different intervals such as ten, fifty and one
hundred. This gives a list of all the C parameters with their training and testing
accuracy, from which we can select the best value for further predictions.
After our analysis model is ready, we feed the body of a new spam or non-spam
email into our model as a string. The model itself then performs the data
pre-processing, feature extraction and all the other steps, and gives us a numerical
output of 0 or 1. If the output is 0 it is a non-spam email; if it is 1 it is a spam
email. We were able to successfully predict the emails which we gave as input one by one.
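The idea of one model that takes a raw string and returns 0 or 1 can be sketched with a scikit-learn pipeline. This is a hypothetical arrangement, not the code used in this project: the helper name classify, the linear kernel and the six training messages are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented training messages; 1 = spam, 0 = non-spam.
texts = [
    "win a free lottery prize now",
    "claim your free cash reward today",
    "urgent offer win money now",
    "are we meeting for lunch tomorrow",
    "please send me the project report",
    "see you at the library this evening",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies feature extraction and then the SVM in one call.
model = make_pipeline(CountVectorizer(stop_words="english"), SVC(kernel="linear"))
model.fit(texts, labels)

def classify(body: str) -> int:
    """Return 1 if the email body is predicted to be spam, else 0."""
    return int(model.predict([body])[0])

result = classify("you have won a free prize, claim it now")
print("spam" if result == 1 else "non-spam")
```

Bundling the vectorizer and the classifier keeps the vocabulary used for prediction identical to the one learned during training.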
CHAPTER 5
IMPLEMENTATION IN PYTHON
5.1.1
Step 1: First of all we imported all the libraries and functions we need for our
spam email classification model.
The libraries and functions we have used are:
2. Pandas: The most popular Python library used for data analysis. It provides
highly optimized performance, with back-end source code written purely in C or
Python.
6.Warnings: Warning messages are typically issued in situations where it is useful to
alert the user of some condition in a program, where that condition (normally) doesn’t
warrant raising an exception and terminating the program. For example, one might
want to issue a warning when a program uses an obsolete module.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, metrics, svm
from IPython.display import Image
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
5.1.2
Step 2: Then we imported the data set upon which we will train and test our model.
Code:
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head(n=10)
Fig 5.1 Data Set.
5.1.3
Step 3: Using pandas and the function value_counts we counted the number of items
(email bodies) we have in the data set and then plotted a bar graph showing spam and
non-spam emails. Using the same counts we also plotted a pie chart. For
plotting the bar graph and pie chart we have used the matplotlib.pyplot module.
Code:
# for plotting the bar graph
count_Class = data["v1"].value_counts(sort=True)
count_Class.plot(kind= 'bar', color= ["red", "green"])
plt.title('Bar chart')
plt.show()
Fig 5.3 Pie chart of email count.
5.1.4
Step 4: Then we used Counter (from the collections module) and pandas to count the
frequencies of the words that occur in the body of the emails. We selected the twenty
most common words for both spam and non-spam emails and plotted them in bar graphs
using the matplotlib.pyplot module.
Code:
#Counting the frequency of occurrence
count1 = Counter(" ".join(data[data['v1']=='ham']["v2"]).split()).most_common(20)
df1 = pd.DataFrame.from_dict(count1)
df1 = df1.rename(columns={0: "words in non-spam", 1 : "count"})
count2 = Counter(" ".join(data[data['v1']=='spam']["v2"]).split()).most_common(20)
df2 = pd.DataFrame.from_dict(count2)
df2 = df2.rename(columns={0: "words in spam", 1 : "count_"})
plt.xlabel('words')
plt.ylabel('number')
plt.show()
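The counting code above splits the text on whitespace and keeps every word. A variant that also drops the stop words before counting might look like the sketch below; it is an illustration only, using scikit-learn's built-in ENGLISH_STOP_WORDS list as an assumed stop-word list and three invented messages.

```python
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Invented messages standing in for the email bodies.
messages = [
    "Win a free prize now",
    "Free entry to win the prize",
    "Claim the free prize",
]

# Lower-case, split on whitespace and drop the stop words.
words = [w for m in messages for w in m.lower().split()
         if w not in ENGLISH_STOP_WORDS]

counts = Counter(words)
top_words = counts.most_common(2)   # list of (word, count) pairs
print(top_words)
```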
5.1.5
Step 5: Then we have extracted the features from the data using the function
feature_extraction.text.CountVectorizer() excluding the stop words (defined earlier).
After extraction of feature we have done fitting of the features into our model using
the function fit_transform().
Code:
f = feature_extraction.text.CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])
Fig 5.5 Features in the form of sparse matrix.
5.1.6
Step 6: Then we performed a split operation on the processed data to separate the
training and the test data using the function model_selection.train_test_split().
Code:
data["v1"] = data["v1"].map({'spam': 1, 'ham': 0})
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, data['v1'], test_size=0.33, random_state=42)
5.1.7
Step 7: Then we trained and tested the model using SVM, storing the outputs
(scores) in their respective variables. Each classifier is first fitted to the
processed training data using fit(), and its accuracy on the training and test data
is then computed using the function score(). The classifier is created with svm.SVC
and stored in the variable svc, and it is trained and tested for various values of the
C parameter, from 50 to 950 in steps of 50 (np.arange(50, 1000, 50) excludes the end
value 1000).
Code:
list_C = np.arange(50, 1000, 50)  # C = 50, 100, ..., 950
score_train = np.zeros(len(list_C))
score_test = np.zeros(len(list_C))
recall_test = np.zeros(len(list_C))
precision_test = np.zeros(len(list_C))
# Train and evaluate one SVM for each value of C
for count, C in enumerate(list_C):
    svc = svm.SVC(C=C)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)  # predictions reused for recall and precision
    score_train[count] = svc.score(X_train, y_train)
    score_test[count] = svc.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, y_pred)
    precision_test[count] = metrics.precision_score(y_test, y_pred)
5.1.8
Step 8: Then we displayed the scores for training accuracy,
testing accuracy, test recall and test precision. We also found the value of the C
parameter for which the model performs best.
Code:
#for displaying the score
matrix = np.matrix(np.c_[list_C, score_train, score_test, recall_test, precision_test])
models = pd.DataFrame(data = matrix, columns =
['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=20)
#for finding the best case
best_index = models[models['Test Precision']==1]['Test Accuracy'].idxmax()
svc = svm.SVC(C=list_C[best_index])
svc.fit(X_train, y_train)
models.iloc[best_index, :]
Fig 5.6 Training and Test scores.
5.1.9
Step 9: Using the function metrics.confusion_matrix() we counted how
many emails were misclassified during this process. We found that only 31 emails
were misclassified, which indicates that the model is efficient.
Code:
m_confusion_test = metrics.confusion_matrix(y_test, svc.predict(X_test))
pd.DataFrame(data = m_confusion_test, columns = ['Predicted Non-Spam',
'Predicted Spam'],
index = ['Actual Non-Spam', 'Actual Spam'])
5.1.10
Step 10: Predicting a new email. Here we have given the input of the body of email as
a string (mytest) and then using our model to test it and predict if it is spam or not.
Code:
mytest = "you have won a lottery of $2000. to claim it reply to this email "
Y = [mytest]  # mytest is a new email in string format
f = feature_extraction.text.CountVectorizer(stop_words = 'english')
f.fit(data["v2"])  # fitting: rebuilds the vocabulary used for training
X_new = f.transform(Y)  # mapping the new email onto that vocabulary
res = svc.predict(X_new)
if res[0] == 0:
    print("This is a non spam email")
else:
    print("This is a spam email")
Fig 5.9 Predicting new emails.
CHAPTER 6
RESULTS
6.1 Calculated scores of the model
After calculating the scores of training and testing with various values of C we have
found that our model works best for the case where value of C = 650. Here we get the
results as follows:
C 650.000000
Train Accuracy 0.996518
Test Accuracy 0.983143
Test Recall 0.876984
Test Precision 1.000000
This is the best case of effective output by our model. We also found that as we
go on increasing the value of C we get a test accuracy of 1, but we have not
considered those values as there may be misclassification using those values.
Using C = 650 we correctly classified 1587 non-spam emails as non-spam and
221 spam emails as spam, while 31 spam emails were wrongly classified as
non-spam. So we can conclude that the model is efficient, as it has misclassified
very few emails.
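The reported scores can be checked directly from these counts. In the short calculation below, the number of non-spam emails classified as spam is taken to be 0; this value is not stated explicitly above and is inferred from the reported test precision of 1.0.

```python
# Counts reported for C = 650 on the test set.
tn = 1587   # non-spam correctly classified as non-spam
tp = 221    # spam correctly classified as spam
fn = 31     # spam wrongly classified as non-spam
fp = 0      # assumption: inferred from the reported precision of 1.0

total = tn + tp + fn + fp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"test emails: {total}")
print(f"accuracy:  {accuracy:.6f}")
print(f"recall:    {recall:.6f}")
print(f"precision: {precision:.6f}")
```

These values agree with the Test Accuracy, Test Recall and Test Precision listed above.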
CHAPTER 7
CONCLUSION
7.1 Conclusion
With the advance of new technologies and investment possibilities, the statistical or
machine learning methods, once reserved exclusively to the professional financial
institutions, can be also beneficial to the amateur investors.
In our work we have used SVM as the main classification algorithm for the
implementation of spam email classification. We have chosen SVM over other
algorithms because SVM performs well in text classification, and in our work the
input is the text part of the email.