Email Spam Detection (Research Paper)
Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020)
IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2
Abstract— Email spam has become a major problem nowadays; with the rapid growth of internet users, email spam is increasing too. People use spam for illegal and unethical conduct, phishing, and fraud, sending malicious links through spam emails that can harm our systems and sneak into them. Creating a fake profile and email account is easy for spammers: they pose as genuine persons in their spam emails and target people who are not aware of such frauds. It is therefore necessary to identify fraudulent spam mails. This project identifies spam using machine learning techniques: the paper discusses several machine learning algorithms, applies them to our data sets, and selects the algorithm with the best precision and accuracy for email spam detection.

Keywords: machine learning, Naïve Bayes, support vector machine, K-nearest neighbor, random forest, bagging, boosting, neural networks.

I. INTRODUCTION

Email or electronic mail spam refers to “the use of email to send unsolicited emails or advertising emails to a group of recipients. Unsolicited emails mean the recipient has not granted permission for receiving those emails.” The popularity of spam emails has been growing over the last decade. Spam has become a big misfortune on the internet: it wastes storage, time, and message bandwidth. Automatic email filtering may be the most effective method of detecting spam, but nowadays spammers can easily bypass such spam-filtering applications. Several years ago, most spam could be blocked manually by blocking mail from certain email addresses. Here, a machine learning approach is used for spam detection. Major approaches adopted for junk mail filtering include “text analysis, white and black lists of domain names, and community-based techniques”.

Text analysis of mail contents is a widely used method against spam, and many solutions deployable on both the server and the client side are available. Naive Bayes is one of the most well-known algorithms applied in these procedures. However, rejecting mails based solely on content analysis can be problematic in the event of false positives: usually, clients and organizations do not want any legitimate messages to be lost.

The blacklist approach has been one of the earliest techniques pursued for filtering spam. The technique is to accept all mails other than those from explicitly blacklisted domains/email addresses. As newer domains keep entering the category of spamming domain names, this technique tends to no longer work well.

The whitelist approach is the approach of accepting mails from explicitly whitelisted domain names/addresses and placing the others in a lower-priority queue, from which they are delivered only after the sender responds to a confirmation request sent by the “junk mail filtering system”.

Spam and Ham: According to Wikipedia, “the use of electronic messaging systems to send unsolicited bulk messages, especially mass advertisement, malicious links etc.” is called spam. “Unsolicited” means that you did not ask for messages from that source; so, if you do not know the sender, the mail can be spam. People generally do not realize that they signed up for those mailers when they downloaded free services or software, or while updating software. The term “ham” was coined by SpamBayes around 2001 and is defined as “email that is generally desired and is not considered spam”.

Fig.1. Classification into Spam and non-spam
Machine learning approaches are more efficient: a set of training data is used, these samples being a set of emails that have been pre-classified. Machine learning offers many algorithms that can be used for email filtering, including “Naïve Bayes, support vector machines, neural networks, K-nearest neighbor, random forests, etc.”

II. LITERATURE REVIEW

There is related work that applies machine learning methods to email spam detection. A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab [2] describe a focused literature survey of Artificial Intelligence (AI) and machine learning methods for email spam detection. K. Agarwal and T. Kumar [3], Harisinghaney et al. (2014) [4], and Mohamad & Selamat (2015) [5] have used image and textual datasets for e-mail spam detection with various methods. Harisinghaney et al. (2014) [4] used the KNN algorithm, Naïve Bayes, and the Reverse DBSCAN algorithm, with experimentation on a dataset; for text recognition an OCR library [3] is employed, but this OCR does not perform well. Mohamad & Selamat (2015) [5] use a hybrid feature-selection approach combining TF-IDF (Term Frequency Inverse Document Frequency) and rough set theory.
A. Data Set

This model uses email data sets from different online sources such as Kaggle and sklearn, along with some data sets created by ourselves. A spam email data set from Kaggle is used to train our model, and other email data sets are used for evaluating the results. The “spam.csv” data set contains 5573 rows and 2 columns, and the other data sets contain 574, 1001, and 956 lines of email data in text format.
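A minimal loading sketch for this data set follows; the latin-1 encoding and the v1/v2 column names are assumptions based on the commonly distributed Kaggle spam.csv, not details stated in the paper.

```python
import pandas as pd

# Encoding and column names are assumptions: the widely distributed Kaggle
# "spam.csv" ships as latin-1 with "v1" (label) and "v2" (message text).
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]

print(df.shape)                    # roughly (5572, 2)
print(df["label"].value_counts())  # distribution of "ham" vs. "spam"
```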
B. Pre-processing

1. Stop Words Removal:
“Stop words are English words that do not add much meaning to a sentence.” They can be safely ignored without forgoing the sense of the sentence. For example, for a search query like “How to make a veg cheese sandwich”, the search engine will try to find web pages that contain the terms “how”, “to”, “make”, “a”, “veg”, “cheese”, “sandwich”. It will find far more pages that contain “how”, “to”, and “a” than pages containing the recipe for a veg cheese sandwich, because those three words are so commonly used in English. If these three words are removed, or stopped, the search can actually focus on retrieving pages that contain the keywords “veg”, “cheese”, “sandwich”, which gives the results of interest.

2. Tokenization:
“Tokenization is the process of splitting a stream of text into phrases, symbols, words, or other meaningful elements called tokens.” The list of tokens is then used as input for further processing, such as text mining and parsing. Tokenization is valuable both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. It is occasionally hard to define what is meant by the term “word”, as tokenization happens at the word level. Frequently, a tokenizer relies on simple heuristics, for instance: tokens are separated by whitespace characters, such as a “line break” or “space”, or by “punctuation characters”; every contiguous string of alphabetic characters is part of one token, and similarly with numbers. Whitespace and punctuation may or may not be included in the resulting list of tokens.
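The two pre-processing steps above can be combined into a short sketch. Here sklearn's built-in English stop-word list stands in for whichever list the model actually uses (an assumption), and whitespace splitting after punctuation removal serves as the tokenizer.

```python
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_and_tokenize(message):
    """Strip punctuation, tokenize on whitespace, and drop English stop words."""
    # Remove punctuation characters.
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization: contiguous character runs become tokens.
    tokens = no_punct.lower().split()
    # Drop stop words such as "how", "to", "a".
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]

print(clean_and_tokenize("How to make a veg cheese sandwich"))
# e.g. ['make', 'veg', 'cheese', 'sandwich']
```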
1. NAÏVE BAYES CLASSIFIER

Bayes' theorem deals with dependent events: it works on the probability that an event will occur in the future, as estimated from how often the same event occurred previously. Naïve Bayes is built on Bayes' theorem, with the added assumption that the features are independent of each other. The Naïve Bayes classifier technique can be used for classifying spam emails because word probabilities play the main role here: if some word occurs often in spam but not in ham, then an email containing that word is likely to be spam. The Naive Bayes classifier algorithm has become a popular technique for email filtering; for this, the model must be trained well for the Naïve Bayes filter to work effectively. Naive Bayes calculates the probability of each class, and the class having the maximum probability is chosen as the output. Naïve Bayes generally provides accurate results and is used in many fields, including spam filtering.
Bayes' theorem states:

P(A|B) = P(B|A) P(A) / P(B)    (1)
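A minimal sketch of this classifier with scikit-learn (which the implementation section says is used); the file name, columns, and split ratio are assumptions carried over from the data set description rather than the paper's exact code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Same assumed layout as before: "label" (ham/spam) and "text" columns.
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.30, random_state=42)

# Bag-of-words counts feed the per-class word probabilities P(word | class).
vectorizer = CountVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(X_train)

# MultinomialNB chooses the class with the maximum posterior probability.
model = MultinomialNB()
model.fit(X_train_counts, y_train)
print(model.score(vectorizer.transform(X_test), y_test))
```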
2. DECISION TREE

A decision tree classifies a sample by examining attribute tests from the top of the tree downward: a branch shows the outcome of a test, a leaf node holds a class label, and the top node is called the root node.

Fig.3. Decision Tree Structure
The entropy of a split can be calculated using the frequency table of two attributes:

E(T, X) = Σc∈X P(c) E(c)    (4)

where T is the target attribute, X is the attribute being split on, P(c) is the proportion of examples taking value c of X, and E(c) is the entropy of that subset.
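To illustrate equation (4), a small sketch computing the weighted entropy of a split; the toy attribute and labels are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split_entropy(rows):
    """E(T, X) = sum over values c of X of P(c) * E(c), as in equation (4)."""
    total = len(rows)
    by_value = {}
    for x_value, label in rows:
        by_value.setdefault(x_value, []).append(label)
    return sum(len(ls) / total * entropy(ls) for ls in by_value.values())

# Toy data: (attribute value, class label) pairs, e.g. (has_link, spam?).
rows = [("yes", "spam"), ("yes", "spam"), ("yes", "ham"),
        ("no", "ham"), ("no", "ham"), ("no", "ham")]
print(split_entropy(rows))  # weighted entropy after splitting on the attribute
```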
4. K-NEAREST NEIGHBOUR

“K-nearest neighbours is a supervised classification algorithm. The algorithm uses labelled data points and data vectors, separated into several classes, to predict the classification of a new sample point.”

K-nearest neighbour is a lazy algorithm: it only memorizes the training data and does not learn a model by itself, so it makes no decisions of its own until a new point has to be classified. It classifies a new point based on a similarity measure, typically the Euclidean distance, which identifies which training points are its neighbours:

dist((x, y), (a, b)) = √((x − a)² + (y − b)²)    (5)
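A minimal sketch of the idea with scikit-learn's KNeighborsClassifier; the tiny corpus, the bag-of-words features, and k = 3 are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Tiny illustrative corpus; real runs would use the spam.csv features.
texts = ["win a free prize now", "meeting at noon today",
         "free lottery win", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# k = 3 neighbours, ranked by Euclidean distance as in equation (5).
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, labels)
print(knn.predict(vectorizer.transform(["claim your free prize"])))
```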
C. ENSEMBLE LEARNING METHODS

“Ensemble methods in machine learning take several base models to produce one predictive model” in order to decrease variance (using bagging) or bias (using boosting), or to improve predictions (using stacking). There are two types: sequential, where the base classifiers are created sequentially, and parallel, where the base classifiers are created in parallel.

1. RANDOM FOREST CLASSIFIER

The random forest classifier is an ensemble tree classifier consisting of decision trees of different shapes and sizes. The training data is randomly sampled when building each tree, and a random subgroup of the input features is considered when splitting at each node of a tree. This randomness makes the decision trees less correlated (the trees should not all look the same), so the generalization error of the ensemble is improved.
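A short sketch of training a random forest on the bag-of-words features, reusing X_train_counts, vectorizer, and the test split from the Naive Bayes sketch above; the parameter values are illustrative, not the paper's tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is grown on a bootstrap sample of the training data, and
# max_features="sqrt" randomizes the features tried at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
forest.fit(X_train_counts, y_train)
print(forest.score(vectorizer.transform(X_test), y_test))
```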
2. BAGGING

“The bagging classifier is an ensemble classifier that fits base classifiers, each on a random subset of the original data set, and then combines their individual predictions (by voting or by averaging) to form a final prediction.” Bagging is a mixture of bootstrapping and aggregating:

Bagging = Bootstrap AGGregatING

Bootstrapping helps to lessen the variance of the classifier and also reduces overfitting, by resampling data from the training set with the same cardinality as the original data set. High variance is not good for a model. Bagging is a very effective method when data is limited: by using bootstrap samples, you are able to get a good estimate by aggregating the individual scores.

3. BOOSTING AND ADABOOST CLASSIFIER

“Boosting is an ensemble method used to create a strong classifier from a number of weak classifiers. Boosting is accomplished by building a model from the training data, then creating a second model that corrects the errors of the first.” [8] In boosting, models are added until the training set is predicted properly.

AdaBoost = Adaptive Boosting

AdaBoost was the first successful boosting algorithm developed for binary classification, and boosting is best understood through AdaBoost.
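Both ensemble styles can be sketched with scikit-learn, again reusing the vectorized data from the Naive Bayes sketch above; the estimator counts are illustrative assumptions.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Bagging: the default base learner (a decision tree) is fit on bootstrap
# resamples with the same cardinality as the training set; predictions
# are aggregated by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train_counts, y_train)

# AdaBoost: weak learners (decision stumps by default) are added one at a
# time, each focusing on the examples the previous ones misclassified.
adaboost = AdaBoostClassifier(n_estimators=50, random_state=42)
adaboost.fit(X_train_counts, y_train)

for name, model in [("bagging", bagging), ("adaboost", adaboost)]:
    print(name, model.score(vectorizer.transform(X_test), y_test))
```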
IV. ALGORITHMS

1.1. Insert the dataset or file for training or testing.
1.2. Check the dataset for a supported encoding.
    1.2.1. If the encoding is supported, go to step 1.4.
    1.2.2. If the encoding is not supported, go to step 1.3.
1.3. Change the encoding of the inserted file into one of the supported encodings, then try reading it again.
1.4. Select whether to “Train”, “Test” or “Compare” the models using the dataset.
    1.4.1. If “Train” is selected, go to step 1.5.
    1.4.2. If “Test” is selected, go to step 1.6.
    1.4.3. If “Compare” is selected, go to step 1.7.
1.5. “Train” selected:
    1.5.1. Select which classifier to train using the inserted dataset.
    1.5.2. Check for duplicates and NaN values.
    1.5.3. Find the values from hyperparameter tuning.
    1.5.4. Process the text for the feature transform.
    1.5.5. Train the model.
    1.5.6. Save the model and features; show the results.
1.6. “Test” selected:
    1.6.1. Select which classifier to test using the inserted dataset.
    1.6.2. Check for duplicates and NaN values.
    1.6.3. Load the model and features saved in the training phase of the model.
    1.6.4. Use the loaded values for testing the dataset.
    1.6.5. Show the results.
1.7. “Compare” selected:
    1.7.1. Compare all the classifiers using the inserted dataset.
    1.7.2. Show the results of the classifiers.

A. Implementation

The Visual Studio Code platform is used to implement the model and, in this module, a dataset from the “Kaggle” website is used
as the training dataset. The inserted dataset is first checked for duplicates and null values, for the better performance of the machine. Then the dataset is split into two sub-datasets, a “train dataset” and a “test dataset”, in the proportion 70:30. The “train” and “test” datasets are then passed as parameters for text processing. In text processing, punctuation symbols and words that are in the stop-words list are removed, and clean words are returned. These clean words are then passed to the “feature transform”. In the feature transform, the clean words returned from text processing are used in ‘fit’ and ‘transform’ to create a vocabulary for the machine. The dataset is also passed for “hyperparameter tuning” to find optimal values for the classifier to use according to the dataset. After acquiring the values from the hyperparameter tuning, the machine is fitted using those values with a fixed random state. The state of the trained model and the features are saved for future use in testing unseen data. Using classifiers from the sklearn module in Python, the machines are trained using the values obtained above; the whole pipeline is sketched below.
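A condensed sketch of the pipeline just described: 70:30 split, punctuation and stop-word cleaning, fit/transform vectorization, grid-search hyperparameter tuning, and saving the model state. The parameter grid, the Naive Bayes estimator, and the output file names are assumptions, not the paper's exact settings.

```python
import string

import joblib
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

def text_process(message):
    """Remove punctuation and stop words, returning clean words."""
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.lower().split() if w not in ENGLISH_STOP_WORDS]

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]
df = df.drop_duplicates().dropna()  # remove duplicates and null values

# 70:30 split with a fixed random state, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.30, random_state=42)

# Feature transform: fit builds the vocabulary, transform vectorizes.
vectorizer = CountVectorizer(analyzer=text_process)
X_train_feats = vectorizer.fit_transform(X_train)

# Hyperparameter tuning (the grid is an illustrative assumption).
search = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0]}, cv=5)
search.fit(X_train_feats, y_train)

# Save the trained model and the fitted features for testing unseen data.
joblib.dump(search.best_estimator_, "model.joblib")
joblib.dump(vectorizer, "features.joblib")
print(search.best_params_, search.best_estimator_.score(
    vectorizer.transform(X_test), y_test))
```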
B. Flowchart of the Model
V. RESULTS

Our model has been trained using multiple classifiers so that their results can be checked and compared for the best accuracy. Each classifier returns its evaluated results to the user, and the user can then compare them with the other results and see whether given data is “spam” or “ham”. Each classifier's results are shown in graphs and tables for better understanding. The dataset used for training, named “spam.csv”, is obtained from the “Kaggle” website. To test the trained machine, a different CSV file named “emails.csv” is built from unseen data, i.e. data that was not used for training the machine.
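A sketch of how such a test-and-compare step could look, loading the saved model state and scoring the unseen file; the file and column names are assumptions consistent with the description above.

```python
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Load the model and features saved during the training phase.
model = joblib.load("model.joblib")
vectorizer = joblib.load("features.joblib")

# Unseen data, not used for training; "text" and "label" columns assumed.
unseen = pd.read_csv("emails.csv")
predictions = model.predict(vectorizer.transform(unseen["text"]))

print(accuracy_score(unseen["label"], predictions))
print(classification_report(unseen["label"], predictions))
```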