
Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020)
IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2

Email Spam Detection Using Machine Learning Algorithms

Nikhil Kumar
Computer Science and Engineering Department
Delhi Technological University
New Delhi, India
[email protected]

Nishant
Computer Science and Engineering Department
Delhi Technological University
New Delhi, India

Sanket Sonowal
Computer Science and Engineering Department
Delhi Technological University
New Delhi, India
[email protected]

Abstract— Email spam has become a major problem: with the rapid growth of internet users, email spam is also increasing. People use it for illegal and unethical conduct, phishing and fraud, sending malicious links through spam emails that can harm our systems and sneak into them. Creating a fake profile and email account is easy for spammers; they pose as genuine persons in their spam emails and target people who are not aware of these frauds. It is therefore necessary to identify fraudulent spam mails. This project identifies spam using machine learning techniques: this paper discusses the machine learning algorithms, applies them all to our data sets, and selects the algorithm with the best precision and accuracy for email spam detection.

Keywords: Machine learning, Naïve Bayes, support vector machine, K-nearest neighbor, random forest, bagging, boosting, neural networks.

I. INTRODUCTION

Email or electronic mail spam refers to "using email to send unsolicited emails or advertising emails to a group of recipients. Unsolicited emails mean the recipient has not granted permission for receiving those emails." The popularity of spam emails has been increasing over the last decade. Spam has become a big misfortune on the internet: it wastes storage, time and message speed. Automatic email filtering may be the most effective method of detecting spam, but nowadays spammers can easily bypass these spam filtering applications. Several years ago, most spam coming from certain email addresses could be blocked manually. In this work, a machine learning approach is used for spam detection. Major approaches adopted for junk mail filtering include "text analysis, white and blacklists of domain names, and community-based techniques". Text analysis of the contents of mails is a widely used approach to spam, and many solutions deployable on both server and client sides are available. Naive Bayes is one of the most well-known algorithms applied in these procedures. However, rejecting mails solely on the basis of content analysis can be a difficult issue in the event of false positives: usually, clients and organizations do not want any legitimate messages to be lost. The blacklist approach was probably the earliest technique followed for the filtering of spam: accept all mails other than those from explicitly blacklisted domains/email ids. With newer domains constantly entering the category of spamming domain names, this technique tends to no longer work well. The whitelist approach accepts mails from domain names/addresses that are explicitly whitelisted and places the others in a lower-importance queue, delivered only after the sender responds to an affirmation request sent by the "junk mail filtering system".

Spam and Ham: According to Wikipedia, "the use of electronic messaging systems to send unsolicited bulk messages, especially mass advertisement, malicious links etc." is called spam. "Unsolicited" means messages you did not ask for, from sources you do not know; so, if you do not know the sender, the mail may be spam. People generally do not realize they signed up for those mailers when they downloaded free services or software, or while updating software. The term "ham" was coined by SpamBayes around 2001 and is defined as "emails that are generally desired and are not considered spam".

Fig.1. Classification into Spam and non-spam

Machine learning approaches are more efficient: a set of training data is used, these samples being a set of emails that are pre-classified. Machine learning offers many algorithms that can be used for email filtering, including "Naïve Bayes, support vector machines, Neural Networks, K-nearest neighbor, Random Forests etc."

II. LITERATURE REVIEW

There is related work that applies machine learning methods to email spam detection. A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab [2] describe a focused literature survey of Artificial Intelligence (AI) and machine learning methods for email spam detection. K. Agarwal and T. Kumar [3], Harisinghaney et al. (2014) [4] and Mohamad & Selamat (2015) [5] have used image and textual datasets for e-mail spam detection with various methods. Harisinghaney et al. (2014) [4] used the KNN algorithm, Naïve Bayes, and the Reverse DBSCAN algorithm, with experimentation on their dataset; an OCR library is employed for the text recognition, but the OCR does not perform well. Mohamad & Selamat (2015) [5] use a hybrid feature-selection approach of TF-IDF (Term Frequency Inverse Document Frequency) and rough set theory.

A. Data Set

This model has used email data sets from different online sources such as Kaggle and sklearn, and some data sets were created on our own. A spam email data set from Kaggle is used to train our model, and other email data sets are used for obtaining results. The "spam.csv" data set contains 5573 lines and 2 columns; the other data sets contain 574, 1001 and 956 lines of email data in text format.

III. METHODOLOGY

A. Data preprocessing:

When data is considered, very large data sets with a large number of rows and columns are usually expected. But that is not always the case: data can come in many forms, such as images, audio and video files, structured tables, etc. A machine does not understand images, video or text data as it is; a machine only understands 1s and 0s.

Steps in Data Preprocessing:
Data cleaning: In this step, work such as filling in "missing values", "smoothing of noisy data", "identifying or removing outliers", and "resolving of inconsistencies" is done.
Data integration: In this step, several databases, information files or information sets are combined.
Data transformation: Aggregation and normalization are performed to scale the data to a specific range.
Data reduction: This step obtains a summary of the dataset that is much smaller in size but still produces the same analytical result.

1. Stop words:
"Stop words are the English words that do not add much meaning to a sentence." They can be safely ignored without forgoing the sense of the sentence. For example, for a query like "How to make a veg cheese sandwich", the search engine will look for web pages containing the terms "how", "to", "make", "a", "veg", "cheese", "sandwich". It will find far more pages containing "how", "to" and "a" than pages containing the recipe for a veg cheese sandwich, because those three words are so common in English. If these three words are removed (stopped), the search can instead focus on retrieving pages that contain the keywords "veg", "cheese", "sandwich", which gives the results of interest.
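As an illustration (not the paper's own code), a minimal sketch of stop-word removal using NLTK's English stop-word list; the choice of NLTK here is an assumption, since the paper does not name its stop-word source:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the English stop-word list

def remove_stop_words(text):
    # Keep only the words that are not in the common-English stop list.
    stop_set = set(stopwords.words("english"))
    return [w for w in text.split() if w.lower() not in stop_set]

print(remove_stop_words("How to make a veg cheese sandwich"))
# -> ['make', 'veg', 'cheese', 'sandwich']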

2. Tokenization:
"Tokenization is the process of splitting a stream of text into phrases, symbols, words, or other meaningful elements called tokens." The list of tokens is then used as input for further processing, for example content mining and parsing. Tokenization is valuable both in linguistics (as a form of text segmentation) and as lexical analysis in computer science and engineering. It is occasionally hard to define what is meant by the term "word", as tokenization happens at the word level. Frequently a tokenizer relies on simple heuristics, for instance: tokens are separated by whitespace characters, such as a "line break" or "space", or by "punctuation characters"; every contiguous string of alphabetic characters is part of one token, and likewise with numbers. Whitespace and punctuation may or may not be included in the resulting list of tokens.
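The two heuristics above can be sketched in a few lines of Python (an illustrative example, not taken from the paper):

import re

text = "Win a FREE prize now!!! Reply YES."

# Heuristic 1: tokens are separated by whitespace; punctuation stays attached.
print(text.split())
# -> ['Win', 'a', 'FREE', 'prize', 'now!!!', 'Reply', 'YES.']

# Heuristic 2: contiguous runs of alphabetic characters (or digits) each form
# one token; whitespace and punctuation are excluded from the token list.
print(re.findall(r"[A-Za-z]+|[0-9]+", text))
# -> ['Win', 'a', 'FREE', 'prize', 'now', 'Reply', 'YES']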
3. Bag of words
"Bag of Words (BOW) is a method of extracting features from text documents. These features can then be used for training machine learning algorithms. Bag of Words creates a vocabulary of all the unique words present in all the documents in the training dataset."
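One plausible realization of this idea (a sketch assuming scikit-learn, which the implementation section says is used) is CountVectorizer, which builds the vocabulary and the per-document word counts:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize claim now",          # toy training documents,
        "meeting agenda for monday",     # invented for the example
        "claim your free offer now"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # learn the vocabulary, then count words

print(vectorizer.get_feature_names_out()) # the vocabulary of unique words
print(counts.toarray())                   # one row of word counts per document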
B. CLASSIC CLASSIFIERS

Classification is a form of data analysis that extracts models describing important data classes. A classifier or model is constructed for the prediction of class labels, for example: "a loan application as risky or safe."

Data classification is a two-step process:
- a learning step (construction of the classification model), and
- a classification step.

1. NAÏVE BAYES:

The Naïve Bayes classifier was used in 1998 for spam recognition. The Naïve Bayes classifier algorithm is an algorithm used for supervised learning. The Bayesian classifier works on dependent events and on the probability of an event occurring in the future, which can be estimated from the same event having occurred previously. Naïve Bayes is built on Bayes' theorem and assumes that the features are independent of each other. The Naïve Bayes classifier technique can be used for classifying spam emails because word probability plays the main role here: if a word occurs often in spam but not in ham, then that email is likely spam. The Naïve Bayes classifier algorithm has become a popular technique for email filtering; for this, the model has to be trained well with the Naïve Bayes filter to work effectively. Naive Bayes calculates the probability of each class, and the class having the maximum probability is chosen as the output. Naïve Bayes generally provides accurate results and is used in many fields, such as spam filtering. For a class c and feature vector x, Bayes' theorem gives

P(c|x) = P(x|c) · P(c) / P(x)   (1)

and, with the independence assumption over the features x1, …, xn,

P(c|x) ∝ P(x1|c) × P(x2|c) × … × P(xn|c) × P(c)   (2)
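A minimal sketch of Eqs. (1)-(2) in practice, using scikit-learn's MultinomialNB on bag-of-words counts (the toy emails and labels are invented for the example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "free offer click now",
          "project meeting at noon", "please review the report"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)        # word-count features

model = MultinomialNB().fit(X, labels)      # estimates P(c) and P(word|c)
test = vectorizer.transform(["free prize inside"])
print(model.predict(test))                  # -> [1]: classified as spam
print(model.predict_proba(test))            # posterior P(c|x) for each class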
2. SUPPORT VECTOR MACHINE

"The Support Vector Machine (SVM) is a popular supervised learning algorithm; the support vector model is used for classification problems in machine learning techniques." Support Vector Machines are founded entirely on the idea of decision boundaries. The main purpose of the Support Vector Machine algorithm is to create the line or decision boundary: the algorithm outputs a hyperplane which classifies new samples. In 2-dimensional space the "hyperplane is a line dividing a plane into 2 parts where each class is present on one side."

Fig.2. Support Vector Machine
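A small sketch of a linear SVM learning a separating hyperplane on a toy 2-dimensional set (illustrative only; the points are invented):

from sklearn.svm import SVC

X = [[0.0, 0.5], [0.3, 0.2], [2.1, 1.9], [1.8, 2.4]]  # 2-D sample points
y = [0, 0, 1, 1]                                       # two classes

clf = SVC(kernel="linear")   # in 2-D the hyperplane is a separating line
clf.fit(X, y)

print(clf.predict([[0.2, 0.1], [2.0, 2.0]]))  # -> [0 1]
print(clf.support_vectors_)                    # the points defining the margin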

3. DECISION TREE

"Decision tree induction is the learning of decision trees from class-labeled training tuples." A decision tree is a flow-chart-like structure, where:
Internal node (non-leaf node) = a test on an attribute
Branch = an outcome of the test
Leaf node = holds a class label
The top node is called the root node.

Fig.3. Decision Tree Structure

Decision tree induction:
Building a "decision tree classifier" does not require "any domain knowledge or parameter setting", which makes it suitable for exploratory knowledge discovery. It handles multidimensional information, and the learning and classification phases of decision tree induction are simple and fast. Attribute selection measures are used to choose the attribute that best partitions the tuples into distinct classes. When a decision tree is built, a significant number of its branches may reflect noise and outliers in the training data; tree pruning attempts to identify and remove such branches, with the objective of improving classifier accuracy on unseen data.

Entropy using the frequency table of one attribute:

E(S) = − Σ p(i) · log₂ p(i), summed over the classes i   (3)

Entropy using the frequency table of two attributes:

E(T, X) = Σ P(c) · E(c), summed over the values c of attribute X   (4)
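Equations (3)-(4) can be checked numerically; a small sketch (the 9/5 class split is a made-up example):

from math import log2

def entropy(counts):
    # Eq. (3): E(S) = sum over classes of -p(i) * log2(p(i)).
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))  # node with 9 ham and 5 spam tuples -> 0.94
print(entropy([7, 0]))            # a pure node has entropy 0: no further split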
4. K-NEAREST NEIGHBOUR

"K-nearest neighbors is a supervised classification algorithm. This algorithm has some data points and data vectors that are separated into several classes to predict the classification of a new sample point."

K-nearest neighbor is a LAZY algorithm: a lazy algorithm only memorizes the training data rather than learning a model by itself; it does not make its own generalizations. The K-nearest neighbor algorithm classifies a new point based on a similarity measure, which can be the Euclidean distance. The Euclidean distance measure identifies which points are a sample's neighbors:

dist((x, y), (a, b)) = √((x − a)² + (y − b)²)   (5)
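A minimal sketch of KNN with the Euclidean distance of Eq. (5) as the similarity measure (the points are invented for the example):

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]  # toy 2-D points
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)                 # "lazy": fitting just memorizes the points

print(knn.predict([[2, 2], [6, 7]]))  # -> [0 1], by majority of 3 neighbours
print(knn.kneighbors([[2, 2]]))       # distances to and indices of neighbours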
C. ENSEMBLE LEARNING METHODS

"Ensemble methods in machine learning take several base models to produce one predictive model" in order to decrease variance (by using bagging), decrease bias (by using boosting), or improve predictions (by using stacking). There are two types: sequential, where the base classifiers are created sequentially, and parallel, where the base classifiers are created in parallel.
1. RANDOM FOREST CLASSIFIER

The random forest classifier is an ensemble tree classifier consisting of different decision trees of different shapes and sizes. The training data is randomly sampled when building each tree, and a random subset of input features is considered when splitting at each node in a tree. This randomization makes the decision trees less correlated (the trees should not all look the same), so the generalization error of the ensemble can be improved.
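A sketch of a random forest with the two sources of randomness described above (synthetic data stands in for the email features):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree sees a bootstrap sample of the rows; max_features="sqrt" limits
# the random subset of features tried at every split, decorrelating the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # mean accuracy on the held-out 30%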
2. BAGGING

"The bagging classifier is an ensemble classifier that fits base classifiers, each on a random subset of the original data set, and then combines their individual predictions (by voting or by averaging) to form a final prediction." Bagging is a mixture of bootstrapping and aggregating: Bagging = Bootstrap AGGregatING. Bootstrapping helps to lessen the variance of the classifier and also reduces overfitting, by resampling data from the training set with the same cardinality as the original data set (high variance is not good for a model). Bagging is a very effective method for limited data: just by using samples you are able to get an estimate by aggregating the scores.
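A sketch of bagging with decision trees as the base classifier; the base estimator and counts are illustrative choices, not the paper's settings, and the `estimator` parameter name assumes scikit-learn ≥ 1.2:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base classifier being bagged
                                         # (base_estimator in sklearn < 1.2)
    n_estimators=50,                     # 50 bootstrap models to aggregate
    bootstrap=True,                      # resample with replacement, same
    random_state=0,                      # cardinality as the original set
)
bagger.fit(X, y)
print(bagger.predict(X[:3]))   # final prediction = vote of the 50 trees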
3. BOOSTING AND ADABOOST CLASSIFIER

"Boosting is an ensemble method that is used to create a strong classifier from a number of weak classifiers. Boosting is accomplished by building a model from the training data set, then creating another model that corrects the faults of the first model." [8] In boosting, models are added until the training set is predicted properly.

AdaBoost = Adaptive Boosting. AdaBoost was the first successful boosting algorithm developed for binary classification, and boosting is most easily understood through AdaBoost.
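A sketch of AdaBoost (decision stumps are scikit-learn's default weak learner; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Weak learners are added sequentially; each new one is fitted with higher
# weights on the samples the current ensemble still misclassifies.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_tr, y_tr)
print(boost.score(X_te, y_te))   # weighted vote of the 50 weak learners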
IV. ALGORITHMS

1.1. Insert the dataset or file for training or testing.
1.2. Check the dataset for a supported encoding.
  1.2.1. If the encoding is supported, go to step 1.4.
  1.2.2. If the encoding is not supported, go to step 1.3.
1.3. Change the encoding of the inserted file into one of the supported encodings, then try reading it again (a sketch of this encoding fallback appears after the list).
1.4. Select whether to "Train", "Test" or "Compare" the models using the dataset.
  1.4.1. If "Train" is selected, go to step 1.5.
  1.4.2. If "Test" is selected, go to step 1.6.
  1.4.3. If "Compare" is selected, go to step 1.7.
1.5. "Train" selected:
  1.5.1. Select which classifier to train using the inserted dataset.
  1.5.2. Check for duplicates and NaN values.
  1.5.3. Find the values from hyperparameter tuning.
  1.5.4. Process the text for the feature transform.
  1.5.5. Train the model.
  1.5.6. Save the model and features. Show the results.
1.6. "Test" selected:
  1.6.1. Select which classifier to test using the inserted dataset.
  1.6.2. Check for duplicates and NaN values.
  1.6.3. Load the model and features saved in the training phase of the model.
  1.6.4. Use the loaded values for testing the dataset.
  1.6.5. Show the results.
1.7. "Compare" selected:
  1.7.1. Compare all the classifiers using the inserted dataset.
  1.7.2. Show the results of the classifiers.
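Steps 1.1-1.3 amount to reading the file with an encoding fallback; a minimal sketch (the list of supported encodings is an assumption, as the paper does not name them):

import pandas as pd

def load_dataset(path, encodings=("utf-8", "latin-1")):
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)  # step 1.2: try a supported encoding
        except UnicodeDecodeError:
            continue                                # step 1.3: fall through and retry
    raise ValueError("no supported encoding could decode " + path)

# df = load_dataset("spam.csv")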

A. Implementation

The Visual Studio Code platform is used to implement the model and, in this module, a dataset from the "Kaggle" website is used as the training dataset. The inserted dataset is first checked for duplicates and null values for better performance of the machine. Then, the dataset is split into 2 sub-datasets, say "train dataset" and "test dataset", in the proportion 70:30. The "train" and "test" datasets are then passed as parameters for text-processing. In text-processing, punctuation symbols and words that are in the stop-words list are removed, and clean words are returned. These clean words are then passed to the "feature transform". In the feature transform, the clean words returned from text-processing are used for 'fit' and 'transform' to create a vocabulary for the machine. The dataset is also passed through "hyperparameter tuning" to find optimal values for the classifier to use according to the dataset. After acquiring the values from the "hyperparameter tuning", the machine is fitted using those values with a random state. The state of the trained model and the features are saved for future use in testing unseen data. Using classifiers from the sklearn module in Python, the machines are trained using the values obtained above.
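A condensed sketch of this flow, under stated assumptions: the column names "text" and "label" are hypothetical placeholders (the actual Kaggle file uses different headers), and the hyperparameter-tuning and model-saving steps are omitted for brevity:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("spam.csv", encoding="latin-1")
df = df.drop_duplicates().dropna()            # remove duplicates and null values

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.30, random_state=42)  # 70:30 split

# Text-processing + feature transform: the vectorizer drops punctuation and
# stop words, then 'fit'/'transform' build the vocabulary of clean words.
vectorizer = CountVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB().fit(X_train_vec, y_train)
print("accuracy:", model.score(X_test_vec, y_test))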
B. Flowchart of the model

Fig.4. Flow Chart of Model


V. RESULT

Our model has been trained using multiple classifiers to check and compare the results for greater accuracy. Each classifier gives its evaluated results to the user; after all the classifiers return their results, the user can compare them to see whether the data is "spam" or "ham". Each classifier's result is shown in graphs and tables for better understanding. The dataset used for training is obtained from the "Kaggle" website; the name of the dataset used is "spam.csv". To test the trained machine, a different CSV file named "emails.csv" is built with unseen data, i.e. data which was not used for training the machine.
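A sketch of how such a comparison could be produced: every classifier from the preceding sections is fitted on the same training split and scored on the same unseen split. This reuses the X_train_vec/X_test_vec variables from the implementation sketch above, and the printed scores are illustrative, not a reproduction of Table I:

from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Support Vector Classifier": SVC(),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Naïve Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Bagging Classifier": BaggingClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train_vec, y_train)                          # same training data
    print(f"{name}: {clf.score(X_test_vec, y_test):.2f}")  # accuracy on unseen data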

TABLE I. COMPARISON TABLE

  Classifier                    Score 1  Score 2  Score 3  Score 4
1 Support Vector Classifier     0.81     0.92     0.95     0.92
2 K-Nearest Neighbour           0.92     0.88     0.87     0.88
3 Naïve Bayes                   0.87     0.98     0.98     0.98
4 Decision Tree                 0.94     0.95     0.93     0.95
5 Random Forest                 0.90     0.92     0.92     0.92
6 AdaBoost Classifier           0.95     0.94     0.95     0.94
7 Bagging Classifier            0.94     0.94     0.95     0.94

a. Score 1: using default parameters
b. Score 2: using hyperparameter tuning
c. Score 3: using stemmer and hyperparameter tuning
d. Score 4: using length, stemmer and hyperparameter tuning

Fig.5. Comparison of all algorithms
VI. CONCLUSION

From these results it can be concluded that Multinomial Naïve Bayes gives the best outcome, but it is limited by its class-conditional independence assumption, which makes the machine misclassify some tuples. Ensemble methods, on the other hand, proved useful because they use multiple classifiers for class prediction. Nowadays vast numbers of emails are sent and received, and evaluation is difficult because our project is only able to test emails against a limited corpus. Our spam detection thus filters mails according to the content of the email, not according to domain names or any other criteria; at present it considers only the body of the email. There is wide scope for improvement in our project. The following improvements can be made:

"Filtering of spams can be done on the basis of trusted and verified domain names."

"The spam email classification is very significant in categorizing e-mails and separating e-mails that are spam from those that are not."

"This method can be used by large organizations to distinguish the decent mails that are the only emails they wish to receive."


Fig.6. Comparison Graph

REFERENCES

[1] S. Suryawanshi, A. Goswami and P. Patil, "Email Spam Detection: An Empirical Comparative Study of Different ML and Ensemble Classifiers," 2019, pp. 69-74, doi: 10.1109/IACC48062.2019.8971582.
[2] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab, "A Comprehensive Survey for Intelligent Spam Email Detection," IEEE Access, vol. 7, pp. 168261-168295, 2019, doi: 10.1109/ACCESS.2019.2954791.
[3] K. Agarwal and T. Kumar, "Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization," 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2018, pp. 685-690.
[4] A. Harisinghaney, A. Dixit, S. Gupta and A. Arora, "Text and image-based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm," 2014 International Conference on Optimization, Reliability, and Information Technology (ICROIT), 2014, pp. 153-155.
[5] M. Mohamad and A. Selamat, "An evaluation on the efficiency of hybrid feature selection in spam email classification," 2015 International Conference on Computer, Communications, and Control Technology (I4CT), 2015, pp. 227-231.
[6] Shradhanjali and T. Verma, "E-Mail Spam Detection and Classification Using SVM and Feature Extraction," International Journal of Advance Research, Ideas and Innovations in Technology, 2017, ISSN: 2454-132X.
[7] W. A. Awad and S. M. ELseuofi, "Machine Learning Methods for Spam E-Mail Classification," International Journal of Computer Science & Information Technology, vol. 3, 2011, doi: 10.5121/ijcsit.2011.3112.
[8] A. K. Ameen and B. Kaya, "Spam detection in online social networks by deep learning," 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 2018, pp. 1-4.
[9] D. D. Diren, S. Boran, I. H. Selvi and T. Hatipoglu, "Root Cause Detection with an Ensemble Machine Learning Approach in the Multivariate Manufacturing Process," 2019.
[10] T. Kabir, A. S. Shemonti and A. H. Rahman, "Notice of Violation of IEEE Publication Principles: Species Identification Using Partial DNA Sequence: A Machine Learning Approach," 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), 2018.
